pset 6 DNA, checking database for matching profiles - python

I am currently on pset 6 DNA in CS50. I have completed the majority of the problem, but I can't seem to wrap my head around the final step: checking the database for matching profiles.
All of my code is located below to provide context for the variables. I am unsure about my if statement and what I should be comparing. I think I may be overcomplicating it, so any help with understanding or solving this problem would be appreciated.
# TODO: Read database file into a variable
database = []
filename = sys.argv[1]
with open(filename) as f:
    reader = csv.DictReader(f)
    for row in reader:
        database.append(row)

# TODO: Read DNA sequence file into a variable
sequence = []
filename = sys.argv[2]
with open(filename) as f:
    r = f.read()
    for column in r:
        sequence.append(column)

# TODO: Find longest match of each STR in DNA sequence
subsequences = list(database[0].keys())[1:]
longest_sequence = {}
for subsequence in subsequences:
    longest_sequence[subsequence] = longest_match(sequence, subsequence)

# TODO: Check database for matching profiles
databaselen = len(database)
sequencelen = len(sequence)
str_counts = [longest_sequence[subsequence]]
for i in range(databaselen):
    for j in range(sequencelen):
        if str_counts[j] == database[i][1:][j]:
            print(database["name"])
            return

Before checking the database for matching profiles, you need to check your previous steps. When you do, you will find several problems:
First, sequence is not what you think it is. (You probably think it is a string. Instead, it is a list of single-character strings.) This happens because you create sequence as a list and then append each character of the file to it, one at a time.
Because of that error, longest_match() doesn't return the correct counts for the subsequences. As a result, you have no chance of finding matches in the database.
The lesson: sometimes errors appear downstream from the real error. You need to check every line as you code.
Fix those errors, then work on the database match procedure. When you do, you will find additional errors.
You create the variable str_counts, which is a one-element list holding the count of just one subsequence. That is not what you should be checking. You need to check the count of EVERY subsequence against each person in the database. (So, for sequence 1: {'AGATC': 4, 'AATG': 1, 'TATC': 5}.)
Next, you are accessing elements of database incorrectly. database is a list of dictionaries (which use keys). So, use list syntax to get each dictionary and dictionary syntax to get the key/value pairs.
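For example, using the column names from the small database (a minimal illustration, not part of your code):

first_person = database[0]         # list syntax: one person's dictionary
print(first_person["name"])        # dictionary syntax: one value
print(int(first_person["AGATC"]))  # csv.DictReader gives strings, so convert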
Finally, you need to loop over each person and check their subsequence counts against the counts you computed. (Also, notice that the STR values in database and longest_sequence are different types: csv.DictReader reads strings, while longest_match() returns ints.) The procedure should look something like this; you need to add the details.
Code:
# database is a LIST of people DICTIONARIES
for person in database:  # to loop on people in the list
    # longest_sequence is a dictionary of STR:count values
    for STR in longest_sequence:
        # Check ALL longest_sequence[STR] values against all person[STR] values
        # If ALL match, person is a match
        # Otherwise, person is NOT a match
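If it helps to see the shape of that comparison, here is a minimal sketch of the skeleton filled in with Python's built-in all() (one reasonable way to do it, not the only one; it reuses the variable names above and assumes, like your code, that this runs inside main()):

for person in database:
    # person[STR] is a string from csv.DictReader; longest_sequence[STR] is an int
    if all(int(person[STR]) == longest_sequence[STR] for STR in longest_sequence):
        print(person["name"])
        return

print("No match")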
Good luck.

Related

Is there a way to "transform" a CSV table into a simple nested if... else block in python?

I'm fairly new to Python and I'm looking to achieve the following:
I have a table with several conditions as in the image below (maximum 5 conditions) along with various attributes. Each condition comes from a specific set of values; for example, Condition 1 has 2 possible values, Condition 2 has 4 possible values, Condition 3 has 2 possible values, etc.
What I would like to do: from the example table above, I would like to generate simple Python code so that when I execute my function and import a CSV file containing the table above, I get the following output saved as a *.py file:
def myFunction(Attribute, Condition):
    if Attribute1 & Condition1:
        myValue = val_11
    if Attribute1 & Condition2:
        myValue = val_12
    ...
    ...
    if Attribute5 & Condition4:
        myValue = val_54
NOTE: Each CSV file will contain only one sheet and the titles for the columns do not change.
UPDATE, NOTE#2: Both "Attribute" and "Condition" are string values, so simple string comparisons would suffice.
Is there a simple way to do this? I dove into NLP and realized that it is not possible (at least from what I found in the literature). I'm open to all forms of suggestions/answers.
You can't really use ifs and elses, since, if I understand your question correctly, you want to be able to read the conditions, attributes and values from a CSV file. With hard-coded ifs and elses you would only be able to check a fixed range of conditions and attributes defined in your code. What I would do is write a parser: a piece of code which reads the contents of your CSV file and saves it in another, more usable form.
In this case, the parser is the parseCSVFile() function. Instead of ifs and elses comparing attributes and conditions, you now use the attributes and conditions to access a specific element of a dictionary (similar to an array or list, except that you can use, for example, string keys instead of numerical indexes). I used a dictionary containing a dictionary at each position to split the CSV contents into rows and columns. Since I used dictionaries, you can now use the Attribute and Condition strings to access your values directly instead of doing lots of comparisons.
# Output dictionary
ParsedDict = dict()
# This is either ';' or ',' depending on your regional settings; you can open
# the CSV file with Notepad, for example, to check which character is used
CSVSeparator = ';'

def parseCSVFile(filePath):
    global ParsedDict
    f = open(filePath)
    fileLines = f.readlines()
    f.close()
    # Extract the conditions (strip the newline from the header first)
    ConditionsArray = (fileLines[0].strip().split(CSVSeparator))[1:]
    for x in range(len(fileLines) - 1):
        # Remove unwanted characters such as newline characters
        line = fileLines[1 + x].strip()
        # Split by the CSV separation character
        LineContents = line.split(CSVSeparator)
        ConditionsDict = dict()
        for y in range(len(ConditionsArray)):
            ConditionsDict.update({ConditionsArray[y]: LineContents[1 + y]})
        ParsedDict.update({LineContents[0]: ConditionsDict})

def myFunction(Attribute, Condition):
    myValue = ParsedDict[Attribute][Condition]
    return myValue
The "[1:]" is to ignore the contents in the first column (empty field at the top left and the "Attribute x" fields) when reading either the conditions or the values
Use the parseCSVFile() function to extract the information from the csv file
and the myFunction() to get the value you want
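A minimal usage sketch (table.csv is a hypothetical path; the file layout is as described above, with condition names in the first row and attribute names in the first column):

parseCSVFile("table.csv")
print(myFunction("Attribute1", "Condition2"))  # prints the stored value, e.g. val_12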

How to extract all occurrences of a JSON object that share a duplicate key:value pair?

I am writing a Python script that reads a large JSON file containing data from an API and iterates through all the objects. I want to extract all objects that share a specific duplicated "key: value" pair and save them to a separate JSON file.
Currently, I have it almost doing this; however, the one flaw I cannot fix is that it skips the first occurrence of each duplicated object and does not add it to my dupObjects list. I have an OrderedDict keeping track of unique objects, and a regular list for duplicate objects. I know this means that when I add the second occurrence I must also add the first (unique) object, but how would I write a conditional statement that only does this once per unique object?
This is my code at the moment:
from collections import OrderedDict
import json

with open('input.json') as data:
    data = json.load(data)

uniqueObjects = OrderedDict()
dupObjects = list()

for d in data:
    value = d["key"]
    if value in uniqueObjects:
        # dupObjects.append(uniqueObjects[hostname])
        dupObjects.append(d)
    if value not in uniqueObjects:
        uniqueObjects[value] = d

with open('duplicates.json', 'w') as g:
    json.dump(dupObjects, g, indent=4)
Where you see that one commented line is where I tried to just add the object from the OrderedDict to my list, but that causes it to be added as many times as there are duplicates. I only want it added one time.
Edit:
There are several unique objects that have duplicates. I'm looking for some conditional statement that can add the first occurrence of an object that has duplicates, once per unique object.
You could group by key.
Using itertools:
def by_key(element):
    return element["key"]

grouped_by_key = itertools.groupby(sorted(data, key=by_key), key=by_key)
Then it is just a matter of finding groups that have more than one element. (Note that itertools.groupby only groups consecutive elements, which is why the data is sorted by the same key first.)
For details check: https://docs.python.org/3/howto/functional.html#grouping-elements
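Put together, a minimal self-contained sketch of that approach (the key name "key" and the file names are taken from the question's code):

import itertools
import json

with open('input.json') as f:
    data = json.load(f)

def by_key(element):
    return element["key"]

# groupby only groups consecutive items, so sort by the same key first
dupObjects = []
for _, group in itertools.groupby(sorted(data, key=by_key), key=by_key):
    group = list(group)
    if len(group) > 1:            # a key that occurs more than once
        dupObjects.extend(group)  # includes the first occurrence too

with open('duplicates.json', 'w') as g:
    json.dump(dupObjects, g, indent=4)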
In this line you forgot .keys(), so you skip some values:
if value in uniqueObjects.keys():
And this line
if value not in uniqueObjects.keys():
Edit #1
My mistake :)
You need to add the first duplicate object from uniqueObjects in the first if:
if value in uniqueObjects:
    if uniqueObjects[value] != -1:
        dupObjects.append(uniqueObjects[value])
        uniqueObjects[value] = -1
    dupObjects.append(d)
Edit #2
Try this option; it will write only the first occurrence to duplicates:
if value in uniqueObjects:
    if uniqueObjects[value] != -1:
        dupObjects.append(uniqueObjects[value])
        uniqueObjects[value] = -1

Trouble converting "for key in dict" to == for exact matching

Good morning,
I am having trouble pulling the correct value from my dictionary because there are similar keys. I believe I need to use == instead of in; however, when I change if key in c_item_number_one: to if key == c_item_number_one:, it just falls through to my if not_found: print("Specify Size One"), even though I know 12" is in the dictionary.
c_item_number_one = ('12", Pipe,, SA-106 GR. B,, SCH 40, WALL smls'.upper())
print(c_item_number_one)
My function is as follows:
def item_one_size_one():
    not_found = True
    for key in size_one_dict:
        if key in c_item_number_one:
            item_number_one_size = size_one_dict[key]
            print(item_number_one_size)
            not_found = False
            break
    if not_found:
        print("Specify Size One")

item_one_size_one()
The current result is:
12", PIPE,, SA-106 GR. B,, SCH 40, WALL SMLS
Specify Size One
To split the user input into fields, use re.split (note the pattern uses +, not *, so it can never match an empty string):
>>> userin
'12", PIPE,, SA-106 GR. B,, SCH 40, WALL SMLS'
>>> import re
>>> fields = re.split('[ ,]+', userin)
>>> fields
['12"', 'PIPE', 'SA-106', 'GR.', 'B', 'SCH', '40', 'WALL', 'SMLS']
Then compare the key to the first field, or to all fields:
if key == fields[0]:
There are two usages of the word in here: the first is in the context of a for loop, and the second, entirely distinct one is in the context of a membership test.
In the construction of a for loop, the in keyword connects the variable that will hold the values extracted from the loop to the object containing the values to be looped over.
e.g.
for x in my_list:
Meanwhile, the entirely distinct usage of the in keyword tells Python to perform a membership test, where the left-hand item is checked for existence inside the right-hand object's collection.
e.g.
if key in c_item_number_one:
So the meaning of the in keyword is somewhat contextual.
If your code is giving unexpected results, you should be able to replace the if statement with an == test while keeping everything else the same.
e.g.
if key == c_item_number_one:
However, since the content of c_item_number_one is a tuple, you might only want to test equality against the first item in that tuple, the number 12 for example. You can do this by indexing the element of the tuple you want to compare:
if key == c_item_number_one[0]:
Here the [0] tells Python to extract only the first element from the tuple to perform the == test.
[edit] Sorry, your c_item_number_one isn't a tuple, it's one long string. What you need is a way of clearly identifying each item to be looked up: a unique code or value the user can enter that uniquely identifies each thing. Doing a string match like this is always going to throw up problems.
There's a potential bit of added nuance: the first key in your example is the string '12'. If the key in your == test were the numeric value 12 (i.e. an integer), then 12 == '12' would return False and you wouldn't extract the value you're after. That your existing in test currently succeeds suggests this isn't a problem here, but it is something to be aware of later.
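Putting the two ideas together, a small sketch of the field-based exact match (the contents of size_one_dict are assumed here purely for illustration):

import re

size_one_dict = {'12"': 'size twelve'}  # assumed contents, for illustration only
c_item_number_one = '12", Pipe,, SA-106 GR. B,, SCH 40, WALL smls'.upper()

fields = re.split('[ ,]+', c_item_number_one)
key = fields[0]  # '12"', the size token
print(size_one_dict.get(key, "Specify Size One"))  # exact lookup instead of substring test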

Pulling data from files after matching against a regex

I have two files that contain hashes, one of them looks something like this:
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
The other one, looks something like this:
This is a random assortment 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 of characters in order 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 to see if I can find a file 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 full of hashes
I'm trying to pull the hashes from these files using a regular expression to match the hash:
def hash_file_generator(self):
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return re.compile(''.join(regex_string))

    matched_hashes = set()
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as wordlist:
        for item in wordlist.readlines():
            for s in item.split("\n"):
                for k in keys:
                    k = __fix_re_pattern(k.pattern)
                    print k.pattern
                    if k.findall(s):
                        matched_hashes.add(s)
    return matched_hashes
The regular expression that matches these hashes, looks like this: [a-fA-F0-9]{40}.
However, when this is run, it pulls entire lines from the first file and saves them into the set, while on the second file it works successfully:
First file:
set(['1<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107','2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126','3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887'])
Second file:
set(['3a5fb7652e4c4319455769d5462eb2c4ac4cbe79'])
How can I pull just the matched data from the first file using the regex as seen here, and why is it pulling everything instead of just the matched data?
Edit for comments
def hash_file_generator(self):
    """
    Parse a given file for anything that matches the hashes in the
    hash type regex dict. Possible that this will pull random bytes
    of data from the files.
    """
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return ''.join(regex_string)

    matched_hashes = []
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as hashes:
        for k in keys:
            k = re.compile(__fix_re_pattern(k.pattern))
            matched_hashes = [
                i for line in hashes
                for i in k.findall(line)
            ]
    return matched_hashes
Output:
[]
If you just want to pull the hashes, this should work:
import re

hash_pattern = re.compile("[a-fA-F0-9]{40}")

with open("hashes.txt", "r") as hashes:
    matched_hashes = [i for line in hashes
                      for i in hash_pattern.findall(line)]

print(matched_hashes)
Note that this doesn't match some of what look like hashes, because they contain, for example, an 'r' (which is not a hex digit), but it uses your specified regex.
The way this works is by using re.findall, which returns a list of strings, one per match, and using a list comprehension to apply it to each line of the file.
When hashes.txt is
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
this has the output
['b404bac52c91ef1f291ba9c2719aa7d916dc55e5', '3a5fb7652e4c4319455769d5462eb2c4ac4cbe79']
Having looked at your code as it stands, I can tell you one thing: __fix_re_pattern probably isn't doing what you want it to. It currently removes the first and last character of any regex you pass it, which will ironically and horribly mangle the regex.
def __fix_re_pattern(regex_string, to_add=r""):
    regex_string = list(regex_string)
    regex_string[0] = to_add
    regex_string[-1] = to_add
    return ''.join(regex_string)

print(__fix_re_pattern("[a-fA-F0-9]{40}"))
will output
a-fA-F0-9]{40
I'm still missing a lot of context in your code, and it's not quite modular enough to do without it. I can't meaningfully reconstruct your code to reproduce any problems, which leaves me troubleshooting by eye. Presumably this is an instance method of an object whose words attribute, for some reason, contains a file name. I can't really tell what keys is, for example, so I'm still finding it difficult to provide an entire 'fix'. I also don't know the intention behind __fix_re_pattern, but I think your code would work fine if you just took it out entirely.
Another problem is that for each k in whatever keys is, you overwrite the variable matched_hashes, so you return only the matched hashes for the last key. Worse, the file handle is exhausted after the first pass over it, so every later pass sees no lines at all, which is why you get back an empty list.
Also, the whole keys thing is kind of intriguing me: is it a call to some kind of globally defined function/module/class that knows about hash regexes?
Now, you probably know best what your code wants, but it nevertheless seems a little complicated. I'd advise you to keep in the back of your mind that my first answer, as it stands, also entirely meets the specification of your question.
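For what it's worth, here is a sketch of the method with both problems fixed: the file is read once, each line is tested against every pattern, and __fix_re_pattern is dropped entirely, as suggested. It assumes HASH_TYPE_REGEX is a dict whose keys are pre-compiled regex objects, as your use of k.pattern implies:

def hash_file_generator(self):
    # Collect every substring, on any line, that matches any known hash pattern.
    matched_hashes = set()
    patterns = list(bin.verify_hashes.verify.HASH_TYPE_REGEX)  # assumed: compiled regexes
    with open(self.words) as wordlist:
        for line in wordlist:  # single pass over the file
            for pattern in patterns:
                matched_hashes.update(pattern.findall(line))
    return matched_hashes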

How to dynamically append to array in dict?

This has taken me over a day of trial and error. I am trying to keep a dictionary of queries and their respective matches in a search. My problem is that there can be one or more matches. My current solution is below.
match5[query_site] will already hold the first match, but if the search finds another match it appends it using the code below.
temp5 = []  # temporary variable to create array
if isinstance(match5[query_site], list):  # check if already a list
    temp5.extend(match5[query_site])
    temp5.append(match_site)
else:
    temp5.append(match5[query_site])
    temp5.append(match_site)  # keep the new match when converting from a single string
match5[query_site] = temp5  # add new location
That if statement is literally there to prevent extend() from converting my string element into a list of letters. If I initialize the first match as a single-element list, I get None when I try to append to it directly. I feel like there should be a more Pythonic method to achieve this without a temporary variable and a conditional statement.
Update: Here is an example of my output when it works
5'flank: ['8_73793824', '6_133347883', '4_167491131', '18_535703', '14_48370386']
3'flank: X_11731384
There's 5 matches for my "5'flank" and only 1 match for my "3'flank".
So what about this:
if query_site not in match5:  # here for the first time
    match5[query_site] = [match_site]
elif isinstance(match5[query_site], str):  # was already here, a single occurrence
    match5[query_site] = [match5[query_site], match_site]  # make it a list of strings
else:  # already a list, so just append
    match5[query_site].append(match_site)
I like using setdefault() for cases like this.
temp5 = match5.setdefault(query_site, [])
temp5.append(match_site)
It's sort of like get() in that it returns an existing value if the key exists but you can provide a default value. The difference is that if the key doesn't exist already setdefault inserts the default value into the dict.
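A quick illustration of that behavior (the keys are borrowed from the question's output, just to show the insert-on-miss):

match5 = {}
match5.setdefault("5'flank", []).append("8_73793824")   # key missing: [] is inserted first
match5.setdefault("5'flank", []).append("6_133347883")  # key present: existing list returned
print(match5)  # {"5'flank": ['8_73793824', '6_133347883']}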
This is all you need to do
if query_site not in match5:
    match5[query_site] = []
temp5 = match5[query_site]
temp5.append(match_site)
You could also do
temp5 = match5.setdefault(query_site, [])
temp5.append(match_site)
Assuming match5 is a dictionary, what about this:
if query_site not in match5:  # first match ever
    match5[query_site] = [match_site]
else:  # entry already there, just append
    match5[query_site].append(match_site)
Make the entries of the dictionary always be lists, and just append to them.
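If you control how the dictionary is created, collections.defaultdict from the standard library gives you that always-a-list behavior with no membership test at all (an alternative to the answers above, with example data borrowed from the question's output):

from collections import defaultdict

match5 = defaultdict(list)  # missing keys automatically start as []
for query_site, match_site in [("5'flank", "8_73793824"), ("3'flank", "X_11731384")]:
    match5[query_site].append(match_site)  # no isinstance or membership check needed
print(dict(match5))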
