Im using a hashing method with a hashtable which looks for repeats from a given text file. These repeats are DNA sequences! So these are, A,T,G,C.
Basically,the reason to use hashing in repeats is because once you have found the repeats in the substring, you can extend those repeats to look for maximal repeats.
Ive worked my arse off trying to get the damn hashing method to work to how and almost there but i dont know how to use it to extend those repeats.I have a feeling the way i have done it,may not make it possible. Im at the last hurdle and having trouble jumping over it.
At the moment,i have added user input and allowed the person to specify:
1.the length of the repeat he wants e.g 4 in this example
2.the top n repeats he wants to show (eg.top 20) repeats.
- Code: Select all
The results show:
Number of repeats in order: AAAA with 48 repeats
72 140 407 408 409 410 411 412 413 414 415 421 422 423 424 433 490 501 560 561 562 709 710 717 718 719 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893
Number of repeats in order: TTTT with 25 repeats
49 129 194 219 220 221 222 228 229 230 231 356 357 460 461 537 538 548 549 555 556 1018 1079 1243 1244
Number of repeats in order: ATTT with 17 repeats
20 128 161 282 321 355 436 442 459 927 981 985 1041 1078 1128 1219 1250
Number of repeats in order: CAAA with 15 repeats
88 99 121 175 270 406 420 432 500 716 803 871 922 1147 1201
Number of repeats in order: TTTG with 15 repeats
50 195 223 232 239 322 516 626 917 986 1042 1220 1227 1245 1270
Number of repeats in order: TATT with 15 repeats
2 19 127 281 316 363 441 458 543 980 984 1032 1040 1077 1127
Number of repeats in order: TTTA with 14 repeats
0 212 265 437 462 550 557 928 982 1064 1080 1089 1129 1159
Number of repeats in order: CTTT with 13 repeats
211 264 515 536 554 625 1017 1063 1088 1158 1226 1242 1269
Number of repeats in order: AAAT with 13 repeats
73 100 303 434 563 711 894 1011 1138 1148 1171 1202 1279
Number of repeats in order: TTTC with 12 repeats
21 130 162 283 358 403 443 495 539 809 1019 1251
SUBLEN = AAAA, k =4;
the integers show the position of the SUBLEN on the string with how many repeats have been found.
My problem i think i m going to have is that K has always been specified by the user,so there is going to be always K length. So the results show what the user has entered as k =4. The way i have calculated this,is my hashtable size is going to be always: 4^k....
The reason its 4,is because im working with A,T,G,C..
so if k = 2, the hashtable size is: 16
k = 3, the hashtable size is: 64
k = 4,the hashtable size is: 256
and so on... Hopefully this has made it a little bit clearer of the foundations of this program...
So i have two ways of thinking to implement this:
1) Quite tricky thing to do,but Instead of having k = 4 or any number, is there a way to go through the whole txt without specifying what k is? I think im going to have a problem with the SIZE part.
e.g
- Code: Select all
#define SUBLEN 7
#define SIZE 16384
SIZE is the size of the hashtable. I dont have faith in this option at all because what happens with you get k = 20? the size isnt worth thinking about!
2) The other way which i thought might work could be,if k =4 or any integer,
maybe i could add a cycle to extend the repeats until the matches i found were greater or equal to 2.
This is a pretty complicated program and i need some help on how i can do this.
I can provide code if that helps to explain things.
kind regards
joHnnY
