Pulling data from files after matching with a regex - Python

I have two files that contain hashes; one of them looks something like this:
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
The other one looks something like this:
This is a random assortment 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 of characters in order 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 to see if I can find a file 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 full of hashes
I'm trying to pull the hashes from these files using a regular expression to match the hash:
def hash_file_generator(self):
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return re.compile(''.join(regex_string))

    matched_hashes = set()
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as wordlist:
        for item in wordlist.readlines():
            for s in item.split("\n"):
                for k in keys:
                    k = __fix_re_pattern(k.pattern)
                    print k.pattern
                    if k.findall(s):
                        matched_hashes.add(s)
    return matched_hashes
The regular expression that matches these hashes, looks like this: [a-fA-F0-9]{40}.
However, when this is run, it pulls entire lines from the first file and saves them into the set, while on the second file it works successfully:
First file:
set(['1<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107','2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126','3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887'])
Second file:
set(['3a5fb7652e4c4319455769d5462eb2c4ac4cbe79'])
How can I pull just the matched data from the first file using the regex as seen here, and why is it pulling everything instead of just the matched data?
Edit for comments
def hash_file_generator(self):
    """
    Parse a given file for anything that matches the hashes in the
    hash type regex dict. Possible that this will pull random bytes
    of data from the files.
    """
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return ''.join(regex_string)

    matched_hashes = []
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as hashes:
        for k in keys:
            k = re.compile(__fix_re_pattern(k.pattern))
            matched_hashes = [
                i for line in hashes
                for i in k.findall(line)
            ]
    return matched_hashes
Output:
[]

If you just want to pull the hashes, this should work:
import re

hash_pattern = re.compile("[a-fA-F0-9]{40}")
with open("hashes.txt", "r") as hashes:
    matched_hashes = [i for line in hashes
                      for i in hash_pattern.findall(line)]
print(matched_hashes)
Note that this doesn't match some of what look like hashes because they contain, for example, an 'r', but it uses your specified regex.
The way this works is by using re.findall, which returns a list of strings, one per match, and using a list comprehension to apply it to each line of the file.
When hashes.txt is
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
this has the output
['b404bac52c91ef1f291ba9c2719aa7d916dc55e5', '3a5fb7652e4c4319455769d5462eb2c4ac4cbe79']
Having looked at your code as it stands, I can tell you one thing: __fix_re_pattern probably isn't doing what you want it to. It currently removes the first and last character of any regex you pass it, which will ironically and horribly mangle the regex.
def __fix_re_pattern(regex_string, to_add=r""):
    regex_string = list(regex_string)
    regex_string[0] = to_add
    regex_string[-1] = to_add
    return ''.join(regex_string)

print(__fix_re_pattern("[a-fA-F0-9]{40}"))
will output
a-fA-F0-9]{40
I'm still missing a lot of context in your code, and it's not quite modular enough to do without that context. I can't meaningfully reconstruct your code to reproduce any problems, which leaves me troubleshooting by eye. Presumably this is an instance method of an object which has the attribute words, which for some reason contains a file name. I can't really tell what keys is, for example, so I'm still finding it difficult to provide you with an entire 'fix'. I also don't know what the intention behind __fix_re_pattern is, but I think your code would work fine if you just took it out entirely.
Another problem is that for each k in whatever keys is, you overwrite the variable matched_hashes, so you return only the matched hashes for the last key. Worse, the file object hashes is an iterator that is consumed on the first pass, so every later key scans an already-exhausted file and matches nothing, which is why you end up with [].
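A minimal sketch of how both issues could be fixed (this assumes keys holds compiled patterns; collect_matches is a hypothetical name, not from your code): read the lines once, then accumulate matches per key instead of reassigning:

import re

def collect_matches(path, patterns):
    # Read the lines once up front; a file object is an iterator and
    # would otherwise be exhausted after the first pattern.
    with open(path) as f:
        lines = f.readlines()

    matched_hashes = []
    for pattern in patterns:
        for line in lines:
            # extend() accumulates matches across keys instead of overwriting
            matched_hashes.extend(pattern.findall(line))
    return matched_hashes

# Hypothetical usage with the 40-hex-digit pattern from the question:
print(collect_matches("hashes.txt", [re.compile("[a-fA-F0-9]{40}")]))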
Also, the whole keys thing is kind of intriguing me. Is it a call to some kind of globally defined function/module/class which knows about hash regexes?
Now you probably know best what your code needs, but it nevertheless seems a little complicated. I'd advise you to keep in the back of your mind that my first answer, as it stands, also entirely meets the specification of your question.

Related

pset 6 DNA, checking database for matching profiles

I am currently on pset 6 (DNA) in CS50. I have completed the majority of the problem, but I can't seem to wrap my head around the final step: checking the database for matching profiles.
All of my code is located below to provide context for the variables. I am unsure about my if statement and what I should be comparing; I think I may be overcomplicating it, so any help with understanding or solving this problem would be appreciated.
# TODO: Read database file into a variable
database = []
filename = sys.argv[1]
with open(filename) as f:
    reader = csv.DictReader(f)
    for row in reader:
        database.append(row)

# TODO: Read DNA sequence file into a variable
sequence = []
filename = sys.argv[2]
with open(filename) as f:
    r = f.read()
    for column in r:
        sequence.append(column)

# TODO: Find longest match of each STR in DNA sequence
subsequences = list(database[0].keys())[1:]
longest_sequence = {}
for subsequence in subsequences:
    longest_sequence[subsequence] = longest_match(sequence, subsequence)

# TODO: Check database for matching profiles
databaselen = len(database)
sequencelen = len(sequence)
str_counts = [longest_sequence[subsequence]]
for i in range(databaselen):
    for j in range(sequencelen):
        if str_counts[j] == database[i][1:][j]:
            print(database["name"])
            return
Before checking the database for matching profiles, you need to check your previous steps. When you do, you will find several problems:
First, sequence is not what you think it is. (You probably think it is a string. Instead, it is a list of single-character strings.) This happens because you read the whole file into a string and then append its characters to the sequence list one at a time.
Because of that error, longest_match() doesn't return the correct counts for the subsequences. As a result, you have no chance of finding matches in the database.
The lesson: sometimes errors appear downstream from the real error. You need to check every line as you code.
Fix those errors, then work on the database match procedure. When you do, you will find additional errors.
You create the variable str_counts, which holds only a single subsequence count. That is not what you should be checking: you need to check the count of EVERY subsequence for each person against the database (so, for sequence 1: {'AGATC': 4, 'AATG': 1, 'TATC': 5}).
Next, you are accessing elements of database incorrectly. database is a list of dictionaries, so use list syntax to get each dictionary and dictionary syntax to get the key/value pairs.
Finally, you need to loop over each person and check their subsequence counts against the database. (Also, notice that the STR values in database and longest_sequence are different types.) The procedure should look something like this; you need to add the details.
Code:
# database is a LIST of people DICTIONARIES
for person in database:  # to loop on people in the list
    # longest_sequence is a dictionary of STR:count values
    for STR in longest_sequence:
        # Check ALL longest_sequence[STR] values against all person[STR] values
        # If ALL match, person is a match
        # Otherwise, person is NOT a match
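As a hedged illustration only, assuming database and longest_sequence as built above, that skeleton might be filled in like this (note the int conversion, since csv.DictReader stores the counts as strings):

for person in database:
    # Compare every STR count; convert the string values from the CSV
    # before comparing with the integer counts in longest_sequence.
    if all(int(person[STR]) == longest_sequence[STR] for STR in longest_sequence):
        print(person["name"])
        break
else:
    print("No match")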
Good luck.

How to match and get the value from a dict in Python?

dict = {"Rahul": "male",
        "sahana": "female"
        "pavan": "male"}
In a text file we have:

rahul|sharma
sahana|jacob
Pavan|bhat

In a Python program, we have to open the text file, read all the lines, match each name against the dict we have, and write a new text file with the gender appended.
The output should look like:
rahul|sharma|male
sahana|jacob|female
Pavan|bhat|male
It would seem to me that this is roughly what you want. Note that your formatting for input and output was slightly off, but I'm pretty sure I've got it.
genders = {"rahul": "male",
           "sahana": "female",
           "pavan": "male"}

with open("input.txt") as in_file:
    for line in in_file:
        a, b = line.strip().split("|")
        gen = genders[a]
        print("{}|{}|{}".format(a, b, gen))
where input.txt contains
rahul|sharma
sahana|jacob
pavan|bhat
will correctly (I think) produce the output
rahul|sharma|male
sahana|jacob|female
pavan|bhat|male
I have changed all of your data to be lowercase, since with your casing it would have been ambiguous how to look up keys in the dictionary and how to produce the output (only one key was capitalized, so I couldn't use any reasonable string function to accommodate the keys as they were). I've also had to add a comma to your dictionary.
I've also renamed your dictionary - it's no longer dict, because dict is a Python builtin. It seems a bit strange to me that you will have available in your code a dictionary that can anticipate your input file, but this is what I got from the question.
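If you actually need the result in a new text file rather than printed, a minimal variation of the above (assuming a hypothetical output file name of output.txt) could be:

genders = {"rahul": "male", "sahana": "female", "pavan": "male"}

# Read each name|surname line and write name|surname|gender to the output file.
with open("input.txt") as in_file, open("output.txt", "w") as out_file:
    for line in in_file:
        name, surname = line.strip().split("|")
        out_file.write("{}|{}|{}\n".format(name, surname, genders[name]))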
To get the value for the key in a dict, the syntax is simply:
b = "Rahul"
dict = {"Rahul":"male", "Mahima":"female"}
dict[b]

How to split by a recurring identifier word in a file via Python

For the sake of minimalism, I have simplified my case.
I have a file like this:
Name=[string] --------
Date=[string]
Target=[string]
Size=[string]
Name=[string] --------
Date=[string]
Size=[string]
Value=[string]
Name=[string] --------
Target=[string]
Date=[string]
Size=[string]
Value=[string]
I would like to store each record (the couple of lines that starts with Name=[some string] and continues until the next occurrence of Name=[another string]) in a tuple/dictionary structure and enumerate them.
So, the desired output might look like this:
Enum,Name=[string],Date=[string],Target=[string],Size=[string], ---None---
Enum,Name=[string],Date=[string], ---None--- ,Size=[string], Value=[string]
Enum,Name=[string],Date=[string],Target=[string],Size=[string], Value=[string]
I started with a line-by-line approach, yet it became computationally expensive.
Is there any workaround or functionality that can catch such recurring patterns, and would it be useful and feasible for this kind of formatting?
import re

dict = {}
with open("sockets_collection") as file:
    i = 0
    for line in file:
        match = re.findall(r'([^=]+)=([^=]+)(?:,|$)', line)
        dict[i] = (match[0])
        i = i + 1
print(dict)
This is a snippet that stores them as enumerated key-value pairs. What I want to achieve is to store them as key-value pairs, but grouped by the Name key instead of enumerated.
P.S. If there are any ambiguous parts, please let me know.
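For what it's worth, here is a minimal sketch of one way to do the grouping, assuming every record starts with a Name= line as in the sample (the variable names are illustrative):

records = []
with open("sockets_collection") as f:
    current = None
    for line in f:
        key, _, value = line.strip().partition("=")
        if key == "Name":        # a Name= line opens a new record
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value

# Enumerate the grouped records:
for i, record in enumerate(records):
    print(i, record)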

How to find the index of multiple substrings in a string with Python?

I'd like to find the location in a string for certain characters such as "FC" or "FL". For single case like FC, I used the find() function to return the index of the characters in a string, as below.
for line in myfile:
    location = line.find('FC')
But when it comes to adding FL, how do I add it without using an if statement in the for loop? I don't want to add redundant lines so I hope there is an elegant solution.
Each line will include either "FC" or "FL" but not both.
"Each line will include either "FC" or "FL" but not both."

That makes the following somewhat hacky trick possible:
location = max(line.find(sub) for sub in ("FC", "FL"))
The idea is that of the two values, one will be -1 and the other will be positive (the index where it was found) so the greater value is the location.
Note that if "FC" is found, the method will still search for "FL" and not find it, which will reduce performance if the string being searched is long; solutions using a conditional avoid this redundant calculation. However, if the string is short, then using the least amount of Python code and letting C do all the thinking is probably fastest (though you should test your case if it really matters).
You can also avoid using a for comprehension for this simple function call:
location = max(map(line.find, ("FC", "FL")))
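For example, with a line that contains only "FL" (so line.find("FC") is -1 and line.find("FL") is 2), both variants return 2:

>>> line = "abFLcabc"
>>> max(line.find(sub) for sub in ("FC", "FL"))
2
>>> max(map(line.find, ("FC", "FL")))
2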
You can do this:
for line in myfile:
    location = [line.find(substring)
                for substring in ('FC', 'FL')
                if line.find(substring) != -1][0]
It's similar to the solution suggested by #NathanielFord; the only differences are that I added if line.find(substring) != -1 to the comprehension to solve the problem I pointed out, and moved the zero-index lookup onto the same line to make it shorter. (#NathanielFord, I'm sorry you removed your answer before I could suggest this in the comments.)
It's not a very elegant solution, though, because it calls .find() twice, but it is shorter than using for loops.
If you want the most elegant solution, then a conditional is probably your solution. It won't be a "redundant" line, but it will make your code look nice and readable:
for line in myfile:
    location = line.find('FC')
    if location == -1:
        location = line.find('FL')
It is a little unclear what your desired output is, and there are more elegant ways to handle it depending on that, but essentially you're looking for:
def multifind(line):
    for substring in ['FC', 'FL']:
        location = line.find(substring)
        if location != -1:
            return location
    return None

locations = [multifind(line) for line in myfile]
Sample run:
>>> myfile = ["abcFCabc", "abFLcabc", "abc", ""]
>>> def multifind(line):
...     for substring in ['FC', 'FL']:
...         location = line.find(substring)
...         if location != -1:
...             return location
...     return None
...
>>> locations = [multifind(line) for line in myfile]
>>> locations
[3, 2, None, None]
Note that this is not quite as elegant as the solution with the if inside the for loop.

Is there a better way to create dynamic functions on the fly, without using string formatting and exec?

I have written a little program that parses log files of anywhere between a few thousand and a few hundred thousand lines. For this, I have a function in my code which parses every line, looks for keywords, and returns the keywords with their associated values.
These log files consist of little sections. Each section has some values I'm interested in and want to store as a dictionary.
I have simplified the sample below, but the idea is the same.
My original function looked like this; it gets called between 100 and 10,000 times per run, so you can understand why I want to optimize it:
def parse_txt(f):
    d = {}
    for line in f:
        if not line:
            pass
        elif 'apples' in line:
            d['apples'] = True
        elif 'bananas' in line:
            d['bananas'] = True
        elif line.startswith('End of section'):
            return d

f = open('fruit.txt', 'r')
d = parse_txt(f)
print d
The problem I run into is that I have a lot of conditionals in my program, because it checks for a lot of different things and stores the values for them. When checking every line for anywhere between 0 and 30 keywords, this gets slow fast. I don't want that, because I'm not interested in everything on every run; I'm only ever interested in 5-6 keywords, yet I'm parsing every line for 30 or so.
In order to optimize it, I wrote the following by using exec on a string:
def make_func(args):
    func_str = """
def parse_txt(f):
    d = {}
    for line in f:
        if not line:
            pass
"""
    if 'apples' in args:
        func_str += """
        elif 'apples' in line:
            d['apples'] = True
"""
    if 'bananas' in args:
        func_str += """
        elif 'bananas' in line:
            d['bananas'] = True
"""
    func_str += """
        elif line.startswith('End of section'):
            return d"""
    print func_str
    exec(func_str)
    return parse_txt

args = ['apples', 'bananas']
fun = make_func(args)
f = open('fruit.txt', 'r')
d = fun(f)
print d
This solution works great, because it speeds up the program by an order of magnitude and it is relatively simple. Depending on the arguments I put in, it will give me the first function, but without checking for all the stuff I don't need.
For example, if I give it args=['bananas'], it will not check for 'apples', which is exactly what I want to do.
This makes it much more efficient.
However, I do not like this solution very much, because it is not very readable, it is difficult to change, and it is very error-prone whenever I modify something. Besides that, it feels a little bit dirty.
I am looking for alternative or better ways to do this. I have tried using a set of functions to call on every line, and while this worked, it did not offer me the speed increase that my current solution gives me, because it adds a few function calls for every line. My current solution doesn't have this problem, because it only has to be called once at the start of the program. I have read about the security issues with exec and eval, but I do not really care about that, because I'm the only one using it.
EDIT:
I should add that, for the sake of clarity, I have greatly simplified my function. From the answers I understand that I didn't make this clear enough.
I do not check for keywords in a consistent way. Sometimes I need to check for 2 or 3 keywords in a single line, sometimes just for 1. I also do not treat the result in the same way. For example, sometimes I extract a single value from the line I'm on, sometimes I need to parse the next 5 lines.
I would try defining a list of keywords you want to look for ("keywords") and doing this:
for word in keywords:
    if word in line:
        d[word] = True
Or, using a list comprehension:
dict([(word,True) for word in keywords if word in line])
Unless I'm mistaken this shouldn't be much slower than your version.
No need to use eval here, in my opinion. You're right in that an eval based solution should raise a red flag most of the time.
Edit: as you have to perform a different action depending on the keyword, I would just define handler functions and then use a dictionary, like this:
def keyword_handler_word1(line):
    (...)

(...)

def keyword_handler_wordN(line):
    (...)

keyword_handlers = {'word1': keyword_handler_word1, (...), 'wordN': keyword_handler_wordN}
Then, in the actual processing code:
for word in keywords:
    # keyword_handlers[word] is a function
    keyword_handlers[word](line)
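To make that concrete, here is a minimal runnable sketch of the dispatch-table idea, with hypothetical handlers for the fruit example (the handler names and signatures are illustrative, not from the original post):

def handle_apples(line, d):
    # Hypothetical handler: record that the keyword was seen.
    d['apples'] = True

def handle_bananas(line, d):
    d['bananas'] = True

keyword_handlers = {'apples': handle_apples, 'bananas': handle_bananas}

def parse_txt(f, keywords):
    d = {}
    for line in f:
        if line.startswith('End of section'):
            return d
        for word in keywords:
            if word in line:
                # Dispatch to the handler registered for this keyword.
                keyword_handlers[word](line, d)
    return d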
Use regular expressions. Something like the following:
>>> import re
>>> lookup = {'a': 'apple', 'b': 'banane'}  # keyword: characters to look for
>>> pattern = '|'.join('(?P<%s>%s)' % (key, val) for key, val in lookup.items())
>>> re.search(pattern, 'apple aaa').groupdict()
{'a': 'apple', 'b': None}
def create_parser(fruits):
    def parse_txt(f):
        d = {}
        for line in f:
            if not line:
                pass
            elif line.startswith('End of section'):
                return d
            else:
                for testfruit in fruits:
                    if testfruit in line:
                        d[testfruit] = True
    return parse_txt  # hand the generated parser back to the caller
This is what you want: create a test function dynamically.
Depending on what you really want to do, it is, of course, possible to remove one level of complexity and define

def parse_txt(f, fruits):
    [...]

or

def parse_txt(fruits, f):
    [...]

and work with functools.partial.
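A short sketch of how that functools.partial usage might look, assuming the parse_txt(fruits, f) argument order from above:

import functools

# Freeze the keyword list once, leaving only the file argument to supply later.
parse_fruit = functools.partial(parse_txt, ['apples', 'bananas'])

with open('fruit.txt') as f:
    d = parse_fruit(f)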
You can use the set structure, like this:

fruit = set(['cocos', 'apple', 'lime'])
need = set(['cocos', 'pineapple'])
need.intersection(fruit)

which returns set(['cocos']).
