currently in a mess [hashing method]

Discuss all kind of algorithms and data structures from their mathematical and programming sides.

Moderators: Darobat, RecursiveS, Dante Shamest, Bugdude, Wizard

currently in a mess [hashing method]

Postby redrabbit » Mon Feb 07, 2005 1:18 pm

I hope this is the right area to post this problem i am having. Its a complicated program and im just got a brain freeze and in a situation where i dont how iam going to achieve this methodology. This explanation is going to be pretty long!

Im using a hashing method with a hashtable which looks for repeats from a given text file. These repeats are DNA sequences! So these are, A,T,G,C.



Basically,the reason to use hashing in repeats is because once you have found the repeats in the substring, you can extend those repeats to look for maximal repeats.

Ive worked my arse off trying to get the damn hashing method to work to how and almost there but i dont know how to use it to extend those repeats.I have a feeling the way i have done it,may not make it possible. Im at the last hurdle and having trouble jumping over it.


At the moment,i have added user input and allowed the person to specify:
1.the length of the repeat he wants e.g 4 in this example
2.the top n repeats he wants to show (eg.top 20) repeats.

Code: Select all
The results show:

Number of repeats in order: AAAA  with 48 repeats
72 140 407 408 409 410 411 412 413 414 415 421 422 423 424 433 490 501 560 561 562 709 710 717 718 719 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893
Number of repeats in order: TTTT  with 25 repeats
49 129 194 219 220 221 222 228 229 230 231 356 357 460 461 537 538 548 549 555 556 1018 1079 1243 1244
Number of repeats in order: ATTT  with 17 repeats
20 128 161 282 321 355 436 442 459 927 981 985 1041 1078 1128 1219 1250
Number of repeats in order: CAAA  with 15 repeats
88 99 121 175 270 406 420 432 500 716 803 871 922 1147 1201
Number of repeats in order: TTTG  with 15 repeats
50 195 223 232 239 322 516 626 917 986 1042 1220 1227 1245 1270
Number of repeats in order: TATT  with 15 repeats
2 19 127 281 316 363 441 458 543 980 984 1032 1040 1077 1127
Number of repeats in order: TTTA  with 14 repeats
0 212 265 437 462 550 557 928 982 1064 1080 1089 1129 1159
Number of repeats in order: CTTT  with 13 repeats
211 264 515 536 554 625 1017 1063 1088 1158 1226 1242 1269
Number of repeats in order: AAAT  with 13 repeats
73 100 303 434 563 711 894 1011 1138 1148 1171 1202 1279
Number of repeats in order: TTTC  with 12 repeats
21 130 162 283 358 403 443 495 539 809 1019 1251


SUBLEN = AAAA, k =4;
the integers show the position of the SUBLEN on the string with how many repeats have been found.

My problem i think i m going to have is that K has always been specified by the user,so there is going to be always K length. So the results show what the user has entered as k =4. The way i have calculated this,is my hashtable size is going to be always: 4^k....

The reason its 4,is because im working with A,T,G,C..


so if k = 2, the hashtable size is: 16
k = 3, the hashtable size is: 64
k = 4,the hashtable size is: 256

and so on... Hopefully this has made it a little bit clearer of the foundations of this program...


So i have two ways of thinking to implement this:
1) Quite tricky thing to do,but Instead of having k = 4 or any number, is there a way to go through the whole txt without specifying what k is? I think im going to have a problem with the SIZE part.

e.g
Code: Select all

#define SUBLEN 7
#define SIZE 16384


SIZE is the size of the hashtable. I dont have faith in this option at all because what happens with you get k = 20? the size isnt worth thinking about!


2) The other way which i thought might work could be,if k =4 or any integer,
maybe i could add a cycle to extend the repeats until the matches i found were greater or equal to 2.


This is a pretty complicated program and i need some help on how i can do this.

I can provide code if that helps to explain things.

kind regards

joHnnY
redrabbit
 
Posts: 10
Joined: Sun Nov 21, 2004 10:05 am
Location: London

Postby Darryl » Mon Feb 07, 2005 2:07 pm

I am curious as what you consider a repeat for example

Your example seems to indicate that

AAAAAA would be consider 3 repeats -> (AAAA)AA, A(AAAA)A, and AA(AAAA).

based on this
Code: Select all
Number of repeats in order: AAAA  with 48 repeats
72 140 407 408 409 410 411 412 413 414 415 421 422 423 424 433 490 501 560 561 562 709 710 717 718

see like 407, 408 etc..

Is this correct?
User avatar
Darryl
 
Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands

Postby redrabbit » Mon Feb 07, 2005 2:15 pm

yes that is correct sir.

my text file is a mixture of A,T,G,Cs..The file has lots and lots of lines of these sequences...

let me know if you want me to provide you with any information that could help :|
redrabbit
 
Posts: 10
Joined: Sun Nov 21, 2004 10:05 am
Location: London

Postby Darryl » Mon Feb 07, 2005 2:25 pm

How long can the sequences be? I think the problem you are having is that you are trying to allow room for every combination, where as maybe a certain combination might not show up. I think you should grow your hash tables. If I were implementing it I would do it as:

map<string, vector<int>>

that way it will grow as needed and you won't need to worry about size up front.

Darryl
User avatar
Darryl
 
Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands

Postby Darryl » Mon Feb 07, 2005 5:36 pm

Ok i was bored...

Code: Select all
                                                                                                                                                                                                                                                               

#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
using namespace std;

typedef vector<int> vect_int;
map<string,vect_int*> *ptMap;  // need global ptr to use in std::sort compare predicate.

bool sort_map_greater( string item1, string item2)
{
   return (*ptMap)[item1]->size()>(*ptMap)[item2]->size();
}


int main( int argc, char *argv[])  //
{
   // TODO:check that filename was given as command line argument here

   char charbuf;
   string sequence;
   string inputfile(argv[1]);
   map<string, vect_int*> repeats;
   vector<string> index; // to be used for sorting

   ptMap = &repeats;
   
   ifstream filein(inputfile.c_str());

   // TODO: check valid file opened successfully here;

   while(filein.get(charbuf))
   {
      sequence.push_back(charbuf);
      //assuming only char in file are A,T,C,G (ie no /n or /r)
   }

   filein.close();


   // TODO: add functionality to input get k
   // get topmost
   int k;
   cout << "Enter k: ";
   cin >> k;
   //TODO: check valid input
   cout << "Enter top n repeats to show: ";
   int topmost;
   cin >> topmost;
   //TODO: check for valid input

   for (int i = 0;i<sequence.size()-k;++i)
   {
      string sub = sequence.substr(i,k);//get substring based on k size;
      if (repeats.find(sub)==repeats.end())// have we already mapped it?
      {
         //no, then add key with new vector
         repeats[sub]=new vector<int>;
         index.push_back(sub);//also add string to our index while we are at it.
      }
      repeats[sub]->push_back(i);// add position to hash
   }

   // ok now that we got them in
   // need to sort them to get topmost

   sort(index.begin(), index.end(),sort_map_greater);//std::sort makes things easy

   
   //let's output the result
   for (int i = 0; (i < topmost) && (i < index.size()); i++)
   {
      cout << "Number of repeats in order: " << index[i] << " with " << repeats[index[i]]->size() << endl;
      for (vector<int>::iterator it = repeats[index[i]]->begin();it != repeats[index[i]]->end();it++)
      {
         cout << *it << " ";
      }
      cout << endl;
   }


   //finally we need to delete all those new vectors
   for (map<string, vect_int*>::iterator mIt = repeats.begin(); mIt!= repeats.end();mIt++)
   {
      delete mIt->second;
   }
}
User avatar
Darryl
 
Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands

Postby redrabbit » Tue Feb 08, 2005 8:58 am

hi darryl, thank you for your time on trying to help me.Sorry i took so long to reply.
How long can the sequences be? I think the problem you are having is that you are trying to allow room for every combination, where as maybe a certain combination might not show up. I think you should grow your hash tables.


The sequences can be of any length..so the hashtable will definately need to grow.

I will have a look at the program you wrote. That was very kind and nice of you to do something like that.Although,i dont know how i can use that in my code.

Below is what my code looks like,maybe it will make it things clearer.
Code: Select all
// k occurences using Hashing Methods
// C Header files
#include <ctype.h>
#include <math.h>

// C++ Header files
#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <map>
#include <algorithm>

// Namespaces
using namespace std;
typedef std::vector<int> HashVector;
typedef std::vector<HashVector> HashTable;
// Function Prototypes

string readFile();
int    getATCGInt(char ATCG);
void displayTopHashTable(HashTable& hashtable,int top,int SUBLEN, int SIZE, const std::string& sdata);
int computeHash (const std::string& str,int SUBLEN);
char getATCGfromhashtable(int i);
std::string HashfromInt(int hash,int length);


int main(int argc, char * argv[]) {

  std::string sdata;
  int i = 0;
  int index,top; //integer value which shows 'n' length value
  int SIZE;
  int SUBLEN;  //K-Tuples

  sdata = (readFile());
 
  cout << endl;
  cout << "Hello...This Program uses Hashing" << endl;
 
  cout << "1. Please enter length of repeat you want to find: ";
  cin >> SUBLEN;

  SIZE = pow(4.0f,(double)SUBLEN); 
 
  cout << SIZE << endl;
 
  cout << "Enter the top number of repeats u want to see:  ";
  cin >> top;
  cout << endl;

  HashTable hashtable(SIZE);

  cout << sdata.length() << " Nucelotides read in" << endl;





  /******* build a hash table *********/
  for (i = 0; i < (sdata.length() - SUBLEN); i++) {
    index = computeHash(sdata.substr(i,SUBLEN),SUBLEN);     /*** IMPORTANT BIT oF HASHING ******/
    hashtable[index].push_back(i);
  }
  /************************************/



  //showhashtable(hashtable,SUBLEN,SIZE,sdata);                  /*** Display the whole hash table ***/
  displayTopHashTable(hashtable,top,SUBLEN,SIZE,sdata);          /*** Display 'n' number of repeats to show ***/


 

 
  return 0;

}

/**************************************************************************************************/

void displayTopHashTable(HashTable& hashtable,int top,int SUBLEN,int SIZE, const std::string& sdata){

  int mostoccuring = 0;
  int highest = 0;
  int i, total = 0;
  multimap<int,int> rank;


  //ofstream output("testdata.txt"); // output stream

  for (i = 0; i < SIZE; i++)
  {    //need to put in order, is i more occoring then mostoccoring?
     // removed all your comparison stuff - multimap will automatically sort it
     rank.insert(make_pair(hashtable[i].size(),i));
  }

  cout << "K Length is " << SUBLEN << endl;


  multimap<int,int>::reverse_iterator rIter = rank.rbegin();
  //use reverse_iter because multimap will store them lowest to highest, so we want the last 20;
 
  for (int count = 0;count < top; count++)  //count < SIZE can be modified to an integer
  {
     if(rIter == rank.rend()) break; //in case there are less then 20 elements
     
     //cout << "K Length is " << SUBLEN << endl;
     cout << "Number of repeats in order: " << HashfromInt(rIter->second,SUBLEN)
         << " with " << hashtable[rIter->second].size() << " repeats " << endl;
     
     for (i = 0; i < hashtable[rIter->second].size(); i++) {
       cout << hashtable[rIter->second][i] << " ";
     }
     cout << endl;
     rIter++;
      total += hashtable[i].size();
  }
 



    return ;
}


/*********************************************************************************************************/
string readFile() {

  string filename; 
  int j;
  string information;
  char tmp = 0;       
  string   data("");   



 
  cout << "Enter filename: ";
  cin >> filename;

  ifstream infile(filename.c_str());

for(j=0; j < 3; j++)
{
   cout << "." << endl;
}

if( !filename.c_str()    )
   {
     cout << filename.c_str() << " has not been loaded " << endl;
     cin.get();
   }
else
   {
     cout << filename.c_str() << " has been loaded" <<  endl;
   }
/*  if ( !infile.is_open() ) {
    cout << "Could not open file.\n";
    cin.get();
    return -1;
  }
*/

while(!infile.eof()){
   infile.get(tmp);
   tmp= toupper(tmp);
    if ((tmp == 'A') || (tmp == 'T') || (tmp == 'C') || (tmp == 'G'))
      {
      data += tmp;
    }
}

return data;

}

/*

string readFile(char * filename) {
  ifstream file(filename);
  string   data("");
  char     tmp = 0;

  // grab all the characters
  while (!file.eof()) {
    file.get(tmp);
    // fix case
    tmp = toupper(tmp);
    // only add ATC or G to the string
    if ((tmp == 'A') || (tmp == 'T') || (tmp == 'C') || (tmp == 'G')) {
      data += tmp;
    }
  }

  return data;
}
*/

/**************************************************************************************************/

// Computer Hashing Index
int computeHash (const std::string& str,int SUBLEN) { //const std::string& str
  int hash = 0, i = 0;
  int len = SUBLEN-1;

  for (i = 0; i<str.length(); i++) {
    // compute 4^i * char
    hash += ((int)pow((double)4, (double)(len-i)) * getATCGInt(str[i]));
  }

  return hash;
}

/*********************************************************************************************/

int getATCGInt(char ATCG) {
  switch (ATCG) {
    case 'A':
      return 0;
    case 'T':
      return 1;
    case 'C':
      return 2;
    case 'G':
      return 3;
  }
}

char getATCGfromhashtable(int i)
{
    static char str[] = "ATCG";
    if ( i >= 0 && i < strlen(str))
       return str[i];
    return ' '; // or whatever default
}

/************************************************************************************************/
std::string HashfromInt(int hash,int length)
{
    int patern = hash;
    int temp;
    std::string str = " ";
    std::string strTemp;
    for(int i=0;i<length;i++)
    {
        temp = patern % 4;
        strTemp = str;   
        str = getATCGfromhashtable(temp);
        str += strTemp;
        patern = (patern-temp)/4;       
    } 
    return str;
}




kind regards

JoHnnY
redrabbit
 
Posts: 10
Joined: Sun Nov 21, 2004 10:05 am
Location: London

Postby Darryl » Tue Feb 08, 2005 10:44 am

The biggest difference between our methods is that you try to size yours at the beginning which like you say, if you do k=20 will be ridiculously big, where mine only resizes when a new key (hash) is found.

Regarding size, imagine if you had a sequence of say 1000000 and you use k=20 at the very most you can only have n-k different sequences which in this case is 999,980 which is a lot less than 4^20=1099511627776 which is what you would size your hash to at the start.

I will make some notes on your code, when I get a chance, a bit busy right now. But basically what you need to do is make an empty size 0 hash table. Everytime you get a new hash, add a hashvector and push the position onto it. For existing hashvectors you only need to push the new position onto it.


D
User avatar
Darryl
 
Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands


Return to Algorithms & Data Structures

Who is online

Users browsing this forum: No registered users and 1 guest