## currently in a mess [hashing method]

Discuss all kind of algorithms and data structures from their mathematical and programming sides.

Moderators: Darobat, RecursiveS, Dante Shamest, Bugdude, Wizard

### currently in a mess [hashing method]

I hope this is the right area to post this problem i am having. Its a complicated program and im just got a brain freeze and in a situation where i dont how iam going to achieve this methodology. This explanation is going to be pretty long!

Im using a hashing method with a hashtable which looks for repeats from a given text file. These repeats are DNA sequences! So these are, A,T,G,C.

Basically,the reason to use hashing in repeats is because once you have found the repeats in the substring, you can extend those repeats to look for maximal repeats.

Ive worked my arse off trying to get the damn hashing method to work to how and almost there but i dont know how to use it to extend those repeats.I have a feeling the way i have done it,may not make it possible. Im at the last hurdle and having trouble jumping over it.

At the moment,i have added user input and allowed the person to specify:
1.the length of the repeat he wants e.g 4 in this example
2.the top n repeats he wants to show (eg.top 20) repeats.

Code: Select all
`The results show:Number of repeats in order: AAAA  with 48 repeats 72 140 407 408 409 410 411 412 413 414 415 421 422 423 424 433 490 501 560 561 562 709 710 717 718 719 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 Number of repeats in order: TTTT  with 25 repeats 49 129 194 219 220 221 222 228 229 230 231 356 357 460 461 537 538 548 549 555 556 1018 1079 1243 1244 Number of repeats in order: ATTT  with 17 repeats 20 128 161 282 321 355 436 442 459 927 981 985 1041 1078 1128 1219 1250 Number of repeats in order: CAAA  with 15 repeats 88 99 121 175 270 406 420 432 500 716 803 871 922 1147 1201 Number of repeats in order: TTTG  with 15 repeats 50 195 223 232 239 322 516 626 917 986 1042 1220 1227 1245 1270 Number of repeats in order: TATT  with 15 repeats 2 19 127 281 316 363 441 458 543 980 984 1032 1040 1077 1127 Number of repeats in order: TTTA  with 14 repeats 0 212 265 437 462 550 557 928 982 1064 1080 1089 1129 1159 Number of repeats in order: CTTT  with 13 repeats 211 264 515 536 554 625 1017 1063 1088 1158 1226 1242 1269 Number of repeats in order: AAAT  with 13 repeats 73 100 303 434 563 711 894 1011 1138 1148 1171 1202 1279 Number of repeats in order: TTTC  with 12 repeats 21 130 162 283 358 403 443 495 539 809 1019 1251 `

SUBLEN = AAAA, k =4;
the integers show the position of the SUBLEN on the string with how many repeats have been found.

My problem i think i m going to have is that K has always been specified by the user,so there is going to be always K length. So the results show what the user has entered as k =4. The way i have calculated this,is my hashtable size is going to be always: 4^k....

The reason its 4,is because im working with A,T,G,C..

so if k = 2, the hashtable size is: 16
k = 3, the hashtable size is: 64
k = 4,the hashtable size is: 256

and so on... Hopefully this has made it a little bit clearer of the foundations of this program...

So i have two ways of thinking to implement this:
1) Quite tricky thing to do,but Instead of having k = 4 or any number, is there a way to go through the whole txt without specifying what k is? I think im going to have a problem with the SIZE part.

e.g
Code: Select all
`#define SUBLEN 7#define SIZE 16384`

SIZE is the size of the hashtable. I dont have faith in this option at all because what happens with you get k = 20? the size isnt worth thinking about!

2) The other way which i thought might work could be,if k =4 or any integer,
maybe i could add a cycle to extend the repeats until the matches i found were greater or equal to 2.

This is a pretty complicated program and i need some help on how i can do this.

I can provide code if that helps to explain things.

kind regards

joHnnY
redrabbit

Posts: 10
Joined: Sun Nov 21, 2004 10:05 am
Location: London

I am curious as what you consider a repeat for example

Your example seems to indicate that

AAAAAA would be consider 3 repeats -> (AAAA)AA, A(AAAA)A, and AA(AAAA).

based on this
Code: Select all
`Number of repeats in order: AAAA  with 48 repeats72 140 407 408 409 410 411 412 413 414 415 421 422 423 424 433 490 501 560 561 562 709 710 717 718`

see like 407, 408 etc..

Is this correct?

Darryl

Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands

yes that is correct sir.

my text file is a mixture of A,T,G,Cs..The file has lots and lots of lines of these sequences...

let me know if you want me to provide you with any information that could help
redrabbit

Posts: 10
Joined: Sun Nov 21, 2004 10:05 am
Location: London

How long can the sequences be? I think the problem you are having is that you are trying to allow room for every combination, where as maybe a certain combination might not show up. I think you should grow your hash tables. If I were implementing it I would do it as:

map<string, vector<int>>

that way it will grow as needed and you won't need to worry about size up front.

Darryl

Darryl

Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands

Ok i was bored...

Code: Select all
`                                                                                                                                                                                                                                                               #include <iostream>#include <fstream>#include <string>#include <map>#include <algorithm>#include <vector>using namespace std;typedef vector<int> vect_int;map<string,vect_int*> *ptMap;  // need global ptr to use in std::sort compare predicate.bool sort_map_greater( string item1, string item2){   return (*ptMap)[item1]->size()>(*ptMap)[item2]->size();}int main( int argc, char *argv[])  // {   // TODO:check that filename was given as command line argument here   char charbuf;   string sequence;   string inputfile(argv[1]);   map<string, vect_int*> repeats;   vector<string> index; // to be used for sorting   ptMap = &repeats;      ifstream filein(inputfile.c_str());   // TODO: check valid file opened successfully here;   while(filein.get(charbuf))   {      sequence.push_back(charbuf);      //assuming only char in file are A,T,C,G (ie no /n or /r)   }   filein.close();   // TODO: add functionality to input get k    // get topmost   int k;   cout << "Enter k: ";   cin >> k;   //TODO: check valid input   cout << "Enter top n repeats to show: ";   int topmost;   cin >> topmost;   //TODO: check for valid input   for (int i = 0;i<sequence.size()-k;++i)   {      string sub = sequence.substr(i,k);//get substring based on k size;      if (repeats.find(sub)==repeats.end())// have we already mapped it?      {         //no, then add key with new vector         repeats[sub]=new vector<int>;         index.push_back(sub);//also add string to our index while we are at it.      }      repeats[sub]->push_back(i);// add position to hash   }   // ok now that we got them in   // need to sort them to get topmost   sort(index.begin(), index.end(),sort_map_greater);//std::sort makes things easy      //let's output the result   for (int i = 0; (i < topmost) && (i < index.size()); i++)   {      cout << "Number of repeats in order: " << index[i] << " with " << repeats[index[i]]->size() << endl;      for (vector<int>::iterator it = repeats[index[i]]->begin();it != repeats[index[i]]->end();it++)      {         cout << *it << " ";      }      cout << endl;   }   //finally we need to delete all those new vectors   for (map<string, vect_int*>::iterator mIt = repeats.begin(); mIt!= repeats.end();mIt++)   {      delete mIt->second;   }}`

Darryl

Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands

hi darryl, thank you for your time on trying to help me.Sorry i took so long to reply.
How long can the sequences be? I think the problem you are having is that you are trying to allow room for every combination, where as maybe a certain combination might not show up. I think you should grow your hash tables.

The sequences can be of any length..so the hashtable will definately need to grow.

I will have a look at the program you wrote. That was very kind and nice of you to do something like that.Although,i dont know how i can use that in my code.

Below is what my code looks like,maybe it will make it things clearer.
Code: Select all
`// k occurences using Hashing Methods// C Header files#include <ctype.h>#include <math.h>// C++ Header files#include <vector>#include <string>#include <iostream>#include <fstream>#include <map>#include <algorithm>// Namespacesusing namespace std;typedef std::vector<int> HashVector;typedef std::vector<HashVector> HashTable;// Function Prototypesstring readFile();int    getATCGInt(char ATCG);void displayTopHashTable(HashTable& hashtable,int top,int SUBLEN, int SIZE, const std::string& sdata);int computeHash (const std::string& str,int SUBLEN);char getATCGfromhashtable(int i);std::string HashfromInt(int hash,int length); int main(int argc, char * argv[]) {  std::string sdata;  int i = 0;  int index,top; //integer value which shows 'n' length value  int SIZE;  int SUBLEN;  //K-Tuples  sdata = (readFile());    cout << endl;  cout << "Hello...This Program uses Hashing" << endl;    cout << "1. Please enter length of repeat you want to find: ";  cin >> SUBLEN;  SIZE = pow(4.0f,(double)SUBLEN);      cout << SIZE << endl;    cout << "Enter the top number of repeats u want to see:  ";  cin >> top;  cout << endl;  HashTable hashtable(SIZE);  cout << sdata.length() << " Nucelotides read in" << endl;   /******* build a hash table *********/  for (i = 0; i < (sdata.length() - SUBLEN); i++) {    index = computeHash(sdata.substr(i,SUBLEN),SUBLEN);     /*** IMPORTANT BIT oF HASHING ******/    hashtable[index].push_back(i);  }  /************************************/  //showhashtable(hashtable,SUBLEN,SIZE,sdata);                  /*** Display the whole hash table ***/  displayTopHashTable(hashtable,top,SUBLEN,SIZE,sdata);          /*** Display 'n' number of repeats to show ***/      return 0;}/**************************************************************************************************/void displayTopHashTable(HashTable& hashtable,int top,int SUBLEN,int SIZE, const std::string& sdata){  int mostoccuring = 0;   int highest = 0;   int i, total = 0;  multimap<int,int> rank;   //ofstream output("testdata.txt"); // output stream  for (i = 0; i < SIZE; i++)   {    //need to put in order, is i more occoring then mostoccoring?      // removed all your comparison stuff - multimap will automatically sort it      rank.insert(make_pair(hashtable[i].size(),i));   }   cout << "K Length is " << SUBLEN << endl;  multimap<int,int>::reverse_iterator rIter = rank.rbegin();   //use reverse_iter because multimap will store them lowest to highest, so we want the last 20;     for (int count = 0;count < top; count++)  //count < SIZE can be modified to an integer  {      if(rIter == rank.rend()) break; //in case there are less then 20 elements           //cout << "K Length is " << SUBLEN << endl;     cout << "Number of repeats in order: " << HashfromInt(rIter->second,SUBLEN)          << " with " << hashtable[rIter->second].size() << " repeats " << endl;           for (i = 0; i < hashtable[rIter->second].size(); i++) {       cout << hashtable[rIter->second][i] << " ";     }     cout << endl;     rIter++;       total += hashtable[i].size();  }       return ;}/*********************************************************************************************************/string readFile() {  string filename;    int j;  string information;  char tmp = 0;          string   data("");       cout << "Enter filename: ";  cin >> filename;  ifstream infile(filename.c_str()); for(j=0; j < 3; j++) {   cout << "." << endl; } if( !filename.c_str()    )   {     cout << filename.c_str() << " has not been loaded " << endl;     cin.get();   } else   {     cout << filename.c_str() << " has been loaded" <<  endl;   } /*  if ( !infile.is_open() ) {     cout << "Could not open file.\n";     cin.get();     return -1;   } */ while(!infile.eof()){   infile.get(tmp);   tmp= toupper(tmp);    if ((tmp == 'A') || (tmp == 'T') || (tmp == 'C') || (tmp == 'G'))       {      data += tmp;    } } return data; }/*string readFile(char * filename) {  ifstream file(filename);  string   data("");  char     tmp = 0;  // grab all the characters  while (!file.eof()) {    file.get(tmp);    // fix case    tmp = toupper(tmp);    // only add ATC or G to the string    if ((tmp == 'A') || (tmp == 'T') || (tmp == 'C') || (tmp == 'G')) {      data += tmp;    }  }  return data;}*//**************************************************************************************************/// Computer Hashing Indexint computeHash (const std::string& str,int SUBLEN) { //const std::string& str  int hash = 0, i = 0;  int len = SUBLEN-1;  for (i = 0; i<str.length(); i++) {    // compute 4^i * char    hash += ((int)pow((double)4, (double)(len-i)) * getATCGInt(str[i]));  }  return hash;}/*********************************************************************************************/int getATCGInt(char ATCG) {  switch (ATCG) {    case 'A':      return 0;    case 'T':      return 1;    case 'C':      return 2;    case 'G':      return 3;  }}char getATCGfromhashtable(int i) {    static char str[] = "ATCG";    if ( i >= 0 && i < strlen(str))       return str[i];    return ' '; // or whatever default}/************************************************************************************************/std::string HashfromInt(int hash,int length) {    int patern = hash;    int temp;    std::string str = " ";    std::string strTemp;     for(int i=0;i<length;i++)     {        temp = patern % 4;        strTemp = str;            str = getATCGfromhashtable(temp);        str += strTemp;        patern = (patern-temp)/4;           }      return str;}`

kind regards

JoHnnY
redrabbit

Posts: 10
Joined: Sun Nov 21, 2004 10:05 am
Location: London

The biggest difference between our methods is that you try to size yours at the beginning which like you say, if you do k=20 will be ridiculously big, where mine only resizes when a new key (hash) is found.

Regarding size, imagine if you had a sequence of say 1000000 and you use k=20 at the very most you can only have n-k different sequences which in this case is 999,980 which is a lot less than 4^20=1099511627776 which is what you would size your hash to at the start.

I will make some notes on your code, when I get a chance, a bit busy right now. But basically what you need to do is make an empty size 0 hash table. Everytime you get a new hash, add a hashvector and push the position onto it. For existing hashvectors you only need to push the new position onto it.

D

Darryl

Posts: 1342
Joined: Wed Sep 01, 2004 10:50 am
Location: Cayman Islands