I have a script that generates hashes from the contents of a text file and saves them to a dictionary. It then reads a second text file, generates hashes from it, and compares them against that dictionary. I'm trying to implement some sort of incomplete matching with a built-in tolerance: for example, I'd like every third element in the hash to be unimportant to the matching protocol, so that a mismatch at that position does not stop the comparison. Is it possible to do this?
Furthermore, and this is a separate case, would it be possible to define a conditional mismatch? That is, at certain positions several different elements would still qualify as "matching"; for example, I might want a vowel at a certain position without caring which vowel shows up.
In summary, I'm trying to make it so that my script either goes
check,check,disregard,check,check,disregard,etc.
OR
check,check,conditional mismatch?,check,check,conditional mismatch?,etc.
along the hashes. Is this doable?
EDIT: I suppose it's not really hashchecking, but more of string comparison. Here's the relevant code I'm trying to tweak:
from collections import defaultdict
# hash table for finding hits
lookup = defaultdict(list)
# store sequence hashes in hash table
for i in xrange(len(file1) - hashlen + 1):
    key = file1[i:i+hashlen]
    lookup[key].append(i)
# look up hashes in hash table
hits = []
for i in xrange(len(file2) - hashlen + 1):
    key = file2[i:i+hashlen]
    # store hits to hits list
    for hit in lookup.get(key, []):
        hits.append((i, hit))
where hashlen is the length of the hash I want to generate (and thus the size of the buffer, so I don't run off the end of the file).
As noted in the comments, hashes (dictionaries) do not have a guaranteed order. You could consider using an OrderedDict.
But maybe this code helps you.
skip_rate = 3
for index, (key, value) in enumerate(your_hash.items()):
    if (index + 1) % skip_rate != 0:  # skips every third item (indexes 2, 5, 8, ...)
        do_something(key, value)
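Since the edit to the question makes clear that this is really substring comparison, another way to get both behaviours is to normalise each key before it goes into the lookup table. The following is only a minimal sketch, assuming Python 3; mask_key and find_hits are hypothetical helpers, not part of the original script, and the vowel marker is assumed never to occur in the input text.

```python
from collections import defaultdict

VOWELS = set("aeiouAEIOU")
VOWEL_MARK = "\x00"   # assumption: this character never appears in the files

def mask_key(key, period=3, vowel_class=False):
    """Normalise a key so that every `period`-th character is tolerant.

    vowel_class=False: the position is ignored entirely (replaced by '*').
    vowel_class=True:  any vowel at that position maps to the same marker;
                       other characters must still match exactly.
    """
    out = []
    for pos, ch in enumerate(key):
        if (pos + 1) % period == 0:      # positions 3, 6, 9, ... (1-based)
            if not vowel_class:
                ch = "*"
            elif ch in VOWELS:
                ch = VOWEL_MARK
        out.append(ch)
    return "".join(out)

def find_hits(file1, file2, hashlen, **mask_options):
    """Return (index_in_file2, index_in_file1) pairs whose masked keys agree."""
    lookup = defaultdict(list)
    for i in range(len(file1) - hashlen + 1):
        lookup[mask_key(file1[i:i + hashlen], **mask_options)].append(i)
    hits = []
    for i in range(len(file2) - hashlen + 1):
        key = mask_key(file2[i:i + hashlen], **mask_options)
        for hit in lookup.get(key, []):
            hits.append((i, hit))
    return hits
```

Because both files are masked the same way, the dictionary lookup stays a plain equality test; the tolerance lives entirely in mask_key. Note that positions are counted from the start of each key, not from the start of the file.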
I want to write a function that takes the path to a .txt file as input and returns a dictionary built according to specific rules. If the .txt file is empty, the function should return nothing. I'd ask that the function use no imports and no list comprehensions, only for/while loops and if statements.
This is because of the material I am learning right now; I would like to be able to read and understand the function step by step.
An example of a .txt file is below. The number of lines can vary, but every line has the same format:
a word followed by three numbers, all separated by commas.
terra,4,5,6
cloud,5,6,7
squall,6,0,8
terra,4,5,8
cloud,6,5,7
First, I would like to break down the steps of the function.
Each comma-separated component of a line serves a specific purpose:
The second-to-last number is subtracted from the last number to form a value in the dictionary.
For example, for terra,4,5,6 the value is 6 - 5, giving [1] in the dictionary.
The alphabetical word, together with the first number, forms the key of the dictionary. If the same word appears on several lines of the .txt file, a single key is formed and it holds the values from all of those lines.
For example, terra,4,5,6 , terra,4,4,6 , and terra,4,4,7 will output ('terra', 4):[1,2,3] as the key and value respectively.
However, for lines to be merged under one key, their first numbers must also be the same. If they are not, they become separate keys.
For example, terra,4,5,6 and terra,5,4,6 will appear separately from each other in the dictionary as ('terra', 4):[1] and ('terra', 5):[2] respectively.
Example input
If we use the example .txt file shown above, the call should look like create_dict("***files/example.txt") and should output the dictionary
{('terra', 4):[1,3],('cloud', 5):[1],('squall', 6):[8],('cloud', 6):[2]}. I will add a link to the .txt file for the sake of recreating this example. (Note that *** is a placeholder for the rest of the path.)
What I'm Trying:
testfiles = (open("**files/example.txt").read()).split('\n')
int_list = []
alpha_list = []
for values in testfiles:
    ao = values.split(',') #returns only a portion of the list. why?
for values in ao:
    if values.isnumeric():
        int_list.append(values) #retrieves list of ints from list
for values in ao:
    if values.isalpha():
        alpha_list.append(values) #retrieves a list of words
{((alpha_list[0]), int(int_list[0])):(int(int_list[2])-(int(int_list[1])))} #each line will always have 3 number values so I used index
This returns {('squall', 6): 1}, which is mainly a proof of concept and not a solution to the function. I wanted to see whether it was possible to use the numbers and words I collected in int_list and alpha_list, via their indexes, to generate entries in the dictionary. If so, the same could be applied to the rest of the lines in the .txt file.
Your input is in CSV format.
You really should be using one of these
https://docs.python.org/3/library/csv.html#csv.reader
https://docs.python.org/3/library/csv.html#csv.DictReader
since "odd" characters within a comma-separated field
are non-trivial to handle.
Better to let the library worry about such details.
Using defaultdict(list) is the most natural way,
the most readable way, to implement your dup key requirement.
https://docs.python.org/3/library/collections.html#collections.defaultdict
I know, I know, "no import";
now on to a variant solution.
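def create_dict(path):
    d = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            word, nums = line.split(',', maxsplit=1)
            a, b, c = map(int, nums.split(','))
            delta = c - b  # last number minus second-to-last
            key = (word, a)
            if key not in d:
                d[key] = []
            d[key].append(delta)
    return d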
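For comparison, here is roughly what the csv.reader plus defaultdict(list) approach recommended above could look like once imports are allowed. This is a sketch only: create_dict_csv is a hypothetical name, and it assumes every non-blank line has exactly one word followed by three integers.

```python
import csv
from collections import defaultdict

def create_dict_csv(path):
    """Group (last - second-to-last) deltas under (word, first number) keys."""
    d = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:                 # csv.reader yields [] for blank lines
                continue
            word, a, b, c = row[0], int(row[1]), int(row[2]), int(row[3])
            d[(word, a)].append(c - b)  # missing keys start as an empty list
    return dict(d)                      # plain dict, so it prints like the expected output
```

For an empty file this returns an empty dictionary rather than None; add an explicit check if "return nothing" has to mean None.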
Hopefully someone familiar with Biopython can help me out. I have a function that takes FASTA files (DNA sequence files) and creates a dictionary with the sequence ID as the key and the molecular weight of the sequence as the value. Since sequences can be ambiguous, I also have a function that spits out all possible real sequences from the ambiguous one and uses that as input for the dictionary-making function that I just described; I integrated it into the dictionary-creating function so that for ambiguous sequences, the function spits out a minimum and maximum molecular weight values for the possible real sequences represented by the ambiguous one.
from Bio import SeqIO, SeqUtils
def seq_ID_and_weight(file_name):
    with open (file_name) as file:
        ID_weight = {} #create an empty dictionary
        for sequence in SeqIO.parse(file,'fasta'):
            weight_min = 10000
            weight_max = 0
            all_poss_sequences = ambiguous_to_unambiguous(sequence.seq) # only call the function once and store it in variable to improve performance
            if len(all_poss_sequences) != 1: #if the length would be 1, its unambiguous
                for possib in all_poss_sequences:
                    if SeqUtils.molecular_weight(possib) < weight_min:
                        weight_min = SeqUtils.molecular_weight(possib)
                    if SeqUtils.molecular_weight(possib) > weight_max: # separate if: one weight can be both the min and the max so far
                        weight_max = SeqUtils.molecular_weight(possib)
                ID_weight[sequence.id] = [weight_min, weight_max]
            else:
                ID_weight[sequence.id] = [SeqUtils.molecular_weight(sequence.seq)]
    return ID_weight
The function spits out something like this, where the values are either the definitive molecular weight of the sequence (if the seq is unambiguous) or the min and max of the possible molecular weights of the sequence (if seq is ambiguous):
{'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999]}
However, now I need to use this function to make a new one that does something slightly different. The new function needs to take a FASTA file name and min and max molecular weights as inputs and return a list of sequence IDs for sequences that have a molecular weight within that interval. Basically, the function should return the ID of an ambiguous sequence for which the weight interval overlaps the weight interval that you specify.
My approach to this would be as follows:
1. Initialize a dictionary containing the output of the previous function, like the example I gave above.
2. Iterate over the dictionary, checking if the key has only one value or multiple values (a min and a max).
a. If only one value, then check if the value is in the given range, and if so, print that sequence ID. If not, break (do nothing).
b. If multiple values, then check if either the first or second is in the given range (because if so, there is some overlap). If so, print that sequence ID. If not, break.
How would I actually implement this? This is all I have so far - I've really only created the dictionary:
def find_sequence(file_name, min_weight, max_weight):
    with open (file_name) as file:
        dictionary = {}
        dictionary.update(seq_ID_and_weight(file_name))
        for key in dictionary:
Now I need to check how many values the keys have, but I don't know how to do that. Any ideas?
You just have to traverse the dictionary & check the first 2 values.
This is the approach.
def find_sequence(file_name, min_weight, max_weight):
    li = []  # list to store ids
    dictionary = seq_ID_and_weight(file_name)  # that function opens the file itself
    for k, v in dictionary.items():  # traverse the dictionary
        for i in range(min(2, len(v))):  # checks v[0], and v[1] when a max is stored as well
            if min_weight < v[i] < max_weight:  # value is within range: keep the sequence id
                li.append(k)
                break
    return li
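One caveat with checking only the two endpoints: a sequence whose weight interval fully contains the requested range has neither endpoint inside it, so it would be missed even though the intervals overlap. A minimal sketch of a general overlap test follows; ids_in_weight_range is a hypothetical helper that works on the dictionary returned by seq_ID_and_weight, and treating the boundaries as inclusive is an assumption.

```python
def ids_in_weight_range(id_weights, min_weight, max_weight):
    """Return IDs whose weight (or [min, max] weight interval) overlaps the query range."""
    matches = []
    for seq_id, weights in id_weights.items():
        # With a single stored weight, low and high are the same value.
        low, high = min(weights), max(weights)
        # Two closed intervals overlap when each one starts before the other ends.
        if low <= max_weight and high >= min_weight:
            matches.append(seq_id)
    return matches
```

With the example dictionary from the question, a query of 6300 to 6310 returns ['seq_7009'], a case the endpoint check does not catch.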
I'm creating a very basic search engine in Python and am working on a method for handling phrase queries: if the positions of two words are within 1 of each other, they are next to each other in the document, and the method should output all document numbers where this happens.
I currently have a dictionary which looks like this
{'8':[['1170', '1264', '1307', '1559', '1638'], ['197', '1169']],
'6':[['345', '772'], ['346']]}
This is just a layout example.
w=word, p=position ||
{doc1:[w1p1, w1p2, w1p3],[w2p1, w2p2]}
The key is the document id, followed by the positions in that document that the 1st word contains, then the positions of the 2nd word. There will be as many words (grouping of the positions) as that of in the query.
My question is: is there a way I can compare the 1st, 2nd, 3rd, etc. groups of values for the same document ID? I want to compare them to see whether one word's position is exactly +1 from the other word's position.
So you can see for doc 6 word 2 follows word 1, this would result in the key being sent back.
There are a couple of ways to achieve what you're trying to do here. I'm assuming, based on the example you gave, that there are always only two words and that the lists are always sorted.
No matter what the method, you'll want to iterate over the documents (the dictionary). Iterating over dictionaries is simple in Python; you can see an example here. After that, the steps differ.
First option - less efficient, slightly simpler:
Iterate over each item (location) in list 1 (the locations of the first word).
Iterate over each item (location) in list 2 (the locations of the second word).
Compare the two locations, and if they're within 1 of each other, return the document id.
Example:
for documentNumber in docdictionary:
    for word1location in docdictionary[documentNumber][0]:
        for word2location in docdictionary[documentNumber][1]:
            # positions are strings in the example data, so convert before subtracting
            if abs(int(word1location) - int(word2location)) == 1:
                return documentNumber
Second Option - more efficient, slightly more complicated:
Start at the beginning of each list of word locations, keeping track of where you are
Check the two values at the locations you're at.
If the two values are 1 word apart, return the document number
If the two values are not, check which list item (page position), has a lower value and move to the next item in that list, repeat
If one of the lists (ex. list 1) runs out of numbers, and the other list (list 2) is at a value that is greater than the last value of the first (list 1), return None.
Example:
for documentNumber in docdictionary:
    list1pos = 0
    list2pos = 0
    while True:
        # positions are strings in the example data, so convert before subtracting
        difference = int(docdictionary[documentNumber][0][list1pos]) - int(docdictionary[documentNumber][1][list2pos])
        if abs(difference) == 1:
            return documentNumber
        if difference < 0: #Page location 2 is greater
            list1pos += 1
            if list1pos == len(docdictionary[documentNumber][0]): #We were at the end of list 1, there will be no more matches
                break
        else: #Page location 1 is greater
            list2pos += 1
            if list2pos == len(docdictionary[documentNumber][1]): #We were at the end of list 2, there will be no more matches
                break
return None
As a reminder, option 2 only works if the lists are always sorted. Also, you don't always need to return the document id right away. You could just add the document id to a list if you want all the documents that the pair happens in instead of the first one it finds. You could even use a dictionary to easily keep track of how many times the word pair appears in each document.
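If you want every matching document, and possibly more than two query words, one alternative is to use set lookups instead of the two-pointer walk. The sketch below is a different technique from the two options above, not a rewrite of them; phrase_documents is a hypothetical name. It only counts word 2 directly after word 1 (position + 1), as the question describes, whereas the abs() test above also accepts the reverse order.

```python
def phrase_documents(docdictionary):
    """Return the IDs of documents in which the query words appear consecutively."""
    matches = []
    for doc_id, word_positions in docdictionary.items():
        # The example stores positions as strings, so convert them to ints.
        position_sets = [set(int(p) for p in positions) for positions in word_positions]
        if not position_sets:
            continue
        first, rest = position_sets[0], position_sets[1:]
        # A phrase starts at `start` if word k sits at start + k for every later word.
        for start in first:
            if all(start + offset in later for offset, later in enumerate(rest, 1)):
                matches.append(doc_id)
                break
    return matches
```

This does not require the position lists to be sorted, and each membership test is a constant-time set lookup on average.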
Hope this helped! Please let me know if anything isn't clear.
This is a question that has two parts:
First, I have a python UDF that creates a list of strings of unknown length. The input to the UDF is a map (dict in python) and the number of keys is essentially unknown (it is what I'm trying to obtain).
What I don't know is how to output that in a schema that lets me return it as a list (or some other iterable data structure). This is what I have so far:
@outputSchema("?????") #WHAT SHOULD THE SCHEMA BE!?!?
def test_func(input):
    output = []
    for k, v in input.items():
        output.append(str(k))
    return output
Now, the second part of the question. Once in Pig I want to apply a SHA hash to each element in the "list" for all my users. Some Pig pseudo code:
USERS = LOAD 'something' as (my_map:map[chararray])
UDF_OUT = FOREACH USERS GENERATE my_udfs.test_func(segment_map)
SHA_OUT = FOREACH UDF_OUT GENERATE SHA(UDF_OUT)
The last line is likely wrong as I want to apply the SHA to each element in the list, NOT to the whole list.
To answer your question: since you are returning a Python list whose contents are strings, you will want your decorator to be
@outputSchema('name_of_bag:{(keys:chararray)}')
It can be confusing when specifying this structure because you only need to define what one element in the bag would look like.
That being said, there is a much simpler way to do what you require. There is a function KEYSET() (you can reference this question I answered) that will extract the keys from a Pig map. Here it is applied to the data set from that example, with a few more keys added to the first map since you said your map contents are variable in length:
maps
----
[a#1,b#2,c#3,d#4,e#5]
[green#sam,eggs#I,ham#am]
Query:
REGISTER /path/to/jar/datafu-1.2.0.jar;
DEFINE SHA datafu.pig.hash.SHA();
A = LOAD 'data' AS (M:[]);
B = FOREACH A GENERATE FLATTEN(KEYSET(M));
hashed = FOREACH B GENERATE $0, SHA($0);
DUMP hashed;
Output:
(d,18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4)
(e,3f79bb7b435b05321651daefd374cdc681dc06faa65e374e38337b88ca046dea)
(b,3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d)
(c,2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6)
(a,ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb)
(ham,eccfe263668d171bd19b7d491c3ef5c43559e6d3acf697ef37596181c6fdf4c)
(eggs,46da674b5b0987431bdb496e4982fadcd400abac99e7a977b43f216a98127721)
(green,ba4788b226aa8dc2e6dc74248bb9f618cfa8c959e0c26c147be48f6839a0b088)
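To sanity-check a few of these digests outside of Pig, a small Python snippet can help. This is a sketch that assumes the SHA UDF is producing SHA-256 over the UTF-8 bytes of each key, which the 64-hex-digit digests above suggest; sha256_hex is a hypothetical helper name.

```python
import hashlib

def sha256_hex(text):
    """Return the SHA-256 digest of a string as lowercase hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Print digests for a few of the map keys so they can be compared by eye.
for key in ["a", "b", "c"]:
    print(key, sha256_hex(key))
```

If the digests printed here differ from the Pig output, the UDF is probably configured with a different algorithm or encoding.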
I'm new here and I need some help with some code I've been working on because I've gotten myself lost and am now just confused.
First, I created a dictionary based on some help from this website. A sample of my dictionary looks like this:
length = {'A': [(0,21), (30,41), (70,80)], 'B': [(0,42), (70,80)], ...etc}
I have a file that I would like to use to iterate over my dictionary that contains this information:
A 32
B 15
etc
What I want to do is take the first feature in my file and match it to a key of my dictionary. Once I have it matched, I want to see which range the number in my file falls into. For example, the first feature in my file would match A and the second range. That means I would want my output to show the name (A) and display 2, because it matched the second range.
I've tried my code below:
import csv
with open('Exome_agg_cons_snps_pct_RefSeq_HGMD_reinitialized.txt') as f:
    reader = csv.DictReader(f,delimiter="\t")
    for row in reader:
        snppos = row['snp_rein']
        name = row['isoform']
        snpos = int(snppos)
        if name in exons:
            y = exons[name]
            if y[0] <= snpos <=y[1]:
                print name,snppos
This, however, doesn't give me any output. I'm not sure what is wrong with my code. I am new though. I think I might be missing something. Also, I realize that my code won't do what I want it to do (tell me what range it matched to). I was thinking of using the .index() function but I'm not sure I can use it in the case I have. Any suggestions?
You just need to loop over the spans in a dict value. It's nicer to let the for-loop unpack each span for you:
for row in reader:
    snppos = int(row['snp_rein'])  # convert, otherwise a string is compared with ints
    name = row['isoform']
    if name in exons:
        for low, high in exons[name]:
            if low <= snppos <= high:
                print name, low, high
                break # Since exons can't overlap, go to next row immediately
If you need the index in the exon list rather than the span (e.g., index 2 rather than (70, 80)), then add enumerate. Note that the index is zero-based; use enumerate(exons[name], 1) if you want to count from 1:
...
        for i, (low, high) in enumerate(exons[name]):
            if low <= snppos <= high:
                print name, i
                break
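Put together as a self-contained Python 3 function, the lookup might look like the sketch below. find_span_index is a hypothetical name, and the dictionary is the sample from the question; the index is 1-based so that "A 32" reports the second range, as the question asks.

```python
def find_span_index(exons, name, position):
    """Return the 1-based index of the span containing `position`, or None."""
    for index, (low, high) in enumerate(exons.get(name, []), start=1):
        if low <= position <= high:
            return index
    return None   # unknown name, or no span contains the position

exons = {'A': [(0, 21), (30, 41), (70, 80)], 'B': [(0, 42), (70, 80)]}
print('A', find_span_index(exons, 'A', 32))   # A 2
print('B', find_span_index(exons, 'B', 15))   # B 1
```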
If the list in your dictionary is ordered, this method will work.
enumerate yields two values on each iteration: the index and the item.
Since each item here is a (low, high) pair, we iterate over the pairs one at a time.
We use min and max to get the lower and upper bound of each pair
and compare them with the value from the text file.
If the value falls within the bounds, we print the index + 1 and the name.
code:
for sd, i in enumerate(exons[name]):
    if snpos<=max(i) and snpos>=min(i):
        print sd+1,name
        break