I have two files that contain hashes, one of them looks something like this:
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
The other one looks something like this:
This is a random assortment 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 of characters in order 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 to see if I can find a file 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 full of hashes
I'm trying to pull the hashes from these files using a regular expression to match the hash:
def hash_file_generator(self):
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return re.compile(''.join(regex_string))

    matched_hashes = set()
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as wordlist:
        for item in wordlist.readlines():
            for s in item.split("\n"):
                for k in keys:
                    k = __fix_re_pattern(k.pattern)
                    print k.pattern
                    if k.findall(s):
                        matched_hashes.add(s)
    return matched_hashes
The regular expression that matches these hashes looks like this: [a-fA-F0-9]{40}.
However, when this is run it pulls every full line from the first file and saves it into the set, while on the second file it works correctly:
First file:
set(['1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107','2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126','3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887'])
Second file:
set(['3a5fb7652e4c4319455769d5462eb2c4ac4cbe79'])
How can I pull just the matched data from the first file using the regex as seen here, and why is it pulling everything instead of just the matched data?
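The difference in miniature (a hedged sketch with one hypothetical line of data, not the full class): re.findall returns only the matching substrings, but the code above adds the whole line s to the set whenever findall finds anything.

```python
import re

hash_re = re.compile(r"[a-fA-F0-9]{40}")
line = "3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com"

matched = set()
if hash_re.findall(line):
    matched.add(line)          # adds the entire line -- what the code above does

extracted = set(hash_re.findall(line))  # adds only the 40-character matches

print(matched)    # the whole CSV line
print(extracted)  # just the hash
```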
Edit for comments
def hash_file_generator(self):
    """
    Parse a given file for anything that matches the hashes in the
    hash type regex dict. Possible that this will pull random bytes
    of data from the files.
    """
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return ''.join(regex_string)

    matched_hashes = []
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as hashes:
        for k in keys:
            k = re.compile(__fix_re_pattern(k.pattern))
            matched_hashes = [
                i for line in hashes
                for i in k.findall(line)
            ]
    return matched_hashes
Output:
[]
If you just want to pull the hashes, this should work:
import re

hash_pattern = re.compile("[a-fA-F0-9]{40}")
with open("hashes.txt", "r") as hashes:
    matched_hashes = [i for line in hashes
                      for i in hash_pattern.findall(line)]
print(matched_hashes)
Note that this doesn't match some of what look like hashes because they contain, for example, an 'r', but it uses your specified regex.
The way this works is by using re.findall, which returns a list of strings, one per match, and a list comprehension to apply it to each line of the file.
When hashes.txt is
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
this has the output
['b404bac52c91ef1f291ba9c2719aa7d916dc55e5', '3a5fb7652e4c4319455769d5462eb2c4ac4cbe79']
Having looked at your code as it stands, I can tell you one thing: __fix_re_pattern probably isn't doing what you want it to. It currently removes the first and last character of any regex you pass it, which will ironically and horribly mangle the regex.
def __fix_re_pattern(regex_string, to_add=r""):
    regex_string = list(regex_string)
    regex_string[0] = to_add
    regex_string[-1] = to_add
    return ''.join(regex_string)

print(__fix_re_pattern("[a-fA-F0-9]{40}"))
will output
a-fA-F0-9]{40
I'm still missing a lot of context in your code, and it isn't quite modular enough to do without. I can't meaningfully reconstruct it to reproduce the problem, which leaves me troubleshooting by eye. Presumably this is an instance method of an object whose words attribute, for some reason, contains a file name. I can't really tell what keys is, for example, so I'm finding it difficult to provide a complete fix. I also don't know the intention behind __fix_re_pattern, but I think your code would work fine if you took it out entirely.
Another problem is that for each k in whatever keys is, you overwrite the variable matched_hashes, so you return only the matched hashes for the last key. Worse, the file object is exhausted after the first key's pass, so every subsequent key iterates over nothing, which is why you end up with an empty list.
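A hedged sketch of one way around both issues, with made-up stand-ins for the unknown HASH_TYPE_REGEX patterns and an in-memory file: accumulate into one list with extend instead of rebinding, and rewind the file before each key's pass.

```python
import io
import re

# stand-ins for the unknown HASH_TYPE_REGEX patterns (SHA-1- and MD5-length hex)
keys = [re.compile(r"[a-fA-F0-9]{40}"), re.compile(r"[a-fA-F0-9]{32}")]

data = io.StringIO(
    "1,a1fba56e72b37d0ba83c2cce47172ec8eb1fda6d,human\n"
    "2,5d41402abc4b2a76b9719d911017c592,md5ish\n"
)

matched_hashes = []
for k in keys:
    data.seek(0)                                # rewind: a file is exhausted after one pass
    for line in data:
        matched_hashes.extend(k.findall(line))  # extend, don't rebind

# note: the 32-char pattern also bites a prefix of the 40-char hash,
# illustrating the docstring's warning about pulling random bytes
print(matched_hashes)
```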
Also, the whole keys thing intrigues me: is it a call to some kind of globally defined function/module/class that knows about hash regexes?
Now you probably know best what your code needs, but it nevertheless seems a little complicated. I'd advise you to keep in the back of your mind that my first answer, as it stands, also entirely meets the specification of your question.
For the sake of minimalism, I'll simplify my case.
I have a file like this:
Name=[string] --------
Date=[string]
Target=[string]
Size=[string]
Name=[string] --------
Date=[string]
Size=[string]
Value=[string]
Name=[string] --------
Target=[string]
Date=[string]
Size=[string]
Value=[string]
I would like to store each record (the couple of lines that start with Name=[some string] and continue until the next occurrence of Name=[another string]) in a tuple/dictionary structure and enumerate them.
So, the desired output might look like this:
Enum,Name=[string],Date=[string],Target=[string],Size=[string], ---None---
Enum,Name=[string],Date=[string], ---None--- ,Size=[string], Value=[string]
Enum,Name=[string],Date=[string],Target=[string],Size=[string], Value=[string]
I started with a line-by-line approach, but it became computationally expensive.
Is there any workaround or functionality that can catch such recurring patterns, and would it be useful and feasible for this kind of formatting?
import re

dict = {}  # note: this shadows the builtin dict
with open("sockets_collection") as file:
    i = 0
    for line in file:
        match = re.findall(r'([^=]+)=([^=]+)(?:,|$)', line)
        dict[i] = match[0]
        i = i + 1
print(dict)
This is a snippet for storing them as key-value pairs. What I want to achieve is storing them as key-value pairs, but instead of having each line enumerated, I want them grouped via the key name.
P.S. If there are any ambiguous parts, please let me know.
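One way to sketch that grouping (using a simplified in-memory sample without the trailing dashes; the real code would read from the file): start a fresh dict whenever a Name= line appears, and collect the following Key=Value lines into the current record.

```python
import io

sample = io.StringIO(
    "Name=a\nDate=d1\nTarget=t1\nSize=s1\n"
    "Name=b\nDate=d2\nSize=s2\nValue=v2\n"
)

records = []
for line in sample:
    key, _, value = line.rstrip("\n").partition("=")
    if key == "Name":          # a Name= line starts a new record
        records.append({})
    records[-1][key] = value

# enumerate() supplies the Enum column from the desired output
for i, rec in enumerate(records):
    print(i, rec)
```

Missing fields (e.g. a record without Target) simply never appear in that record's dict, which matches the ---None--- slots in the desired output.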
I've found how to split a delimited string into key:value pairs in a dictionary elsewhere, but I have an incoming string that also includes two parameters that amount to dictionaries themselves: parameters with one or three key:value pairs inside:
clientid=b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0&keyid=987654321&userdata=ip:192.168.10.10,deviceid:1234,optdata:75BCD15&md=AMT-Cam:avatar&playbackmode=st&ver=6&sessionid=&mk=PC&junketid=1342177342&version=6.7.8.9012
Obviously these are dummy parameters to obfuscate proprietary code, here. I'd like to dump all this into a dictionary with the userdata and md keys' values being dictionaries themselves:
requestdict {'clientid' : 'b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0', 'keyid' : '987654321', 'userdata' : {'ip' : '192.168.10.10', 'deviceid' : '1234', 'optdata' : '75BCD15'}, 'md' : {'Cam' : 'avatar'}, 'playbackmode' : 'st', 'ver' : '6', 'sessionid' : '', 'mk' : 'PC', 'junketid' : '1342177342', 'version' : '6.7.8.9012'}
Can I take the slick two-level delimitation parsing command that I've found:
requestDict = dict(line.split('=') for line in clientRequest.split('&'))
and add a third level to it to handle & preserve the 2nd-level dictionaries? What would the syntax be? If not, I suppose I'll have to split by & and then check & handle splits that contain : but even then I can't figure out the syntax. Can someone help? Thanks!
I basically took Kyle's answer and made it more future-friendly:
def dictelem(input):
    parts = input.split('&')
    listing = [part.split('=') for part in parts]
    result = {}
    for entry in listing:
        head, tail = entry[0], ''.join(entry[1:])
        if ':' in tail:
            entries = tail.split(',')
            result.update({head: dict(e.split(':') for e in entries)})
        else:
            result.update({head: tail})
    return result
Here's a two-liner that does what I think you want:
dictelem = lambda x: x if ':' not in x[1] else [x[0], dict(y.split(':') for y in x[1].split(','))]
a = dict(dictelem(x.split('=')) for x in input.split('&'))
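For instance, applied to a shortened version of the query string (a sketch; query here stands in for the input variable above, which shadows the builtin):

```python
dictelem = lambda x: x if ':' not in x[1] else [x[0], dict(y.split(':') for y in x[1].split(','))]

query = "keyid=987654321&userdata=ip:192.168.10.10,deviceid:1234&mk=PC"
a = dict(dictelem(x.split('=')) for x in query.split('&'))
print(a)
```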
Can I take the slick two-level delimitation parsing command that I've found:
requestDict = dict(line.split('=') for line in clientRequest.split('&'))
and add a third level to it to handle & preserve the 2nd-level dictionaries?
Of course you can, but (a) you probably don't want to, because nested comprehensions beyond two levels tend to get unreadable, and (b) this super-simple syntax won't work for cases like yours, where only some of the data can be turned into a dict.
For example, what should happen with 'PC'? Do you want to make that into {'PC': None}? Or maybe the set {'PC'}? Or the list ['PC']? Or just leave it alone? You have to decide, and write the logic for that, and trying to write it as an expression will make your decision very hard to read.
So, let's put that logic in a separate function:
def parseCommasAndColons(s):
    bits = [bit.split(':') for bit in s.split(',')]
    try:
        return dict(bits)
    except ValueError:
        return bits
This will return a dict like {'ip': '192.168.10.10', 'deviceid': '1234', 'optdata': '75BCD15'} or {'AMT-Cam': 'avatar'} for cases where each comma-separated component has a colon inside it, but a list of lists like [['1342177342']] for cases where any of them don't.
Even this may be a little too clever; I might make the "is this in dictionary format" check more explicit instead of just trying to convert the list of lists and see what happens.
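A quick check of both cases, repeating the function so the snippet runs on its own:

```python
def parseCommasAndColons(s):
    bits = [bit.split(':') for bit in s.split(',')]
    try:
        return dict(bits)          # works when every bit split into a key/value pair
    except ValueError:
        return bits                # fall back to the raw list of lists

print(parseCommasAndColons("ip:192.168.10.10,deviceid:1234,optdata:75BCD15"))
print(parseCommasAndColons("1342177342"))
```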
Either way, how would you put that back into your original comprehension?
Well, you want to call it on the value in the line.split('='). So let's add a function for that:
def parseCommasAndColonsForValue(keyvalue):
    if len(keyvalue) == 2:
        return keyvalue[0], parseCommasAndColons(keyvalue[1])
    else:
        return keyvalue

requestDict = dict(parseCommasAndColonsForValue(line.split('='))
                   for line in clientRequest.split('&'))
One last thing: Unless you need to run on older versions of Python, you shouldn't often be calling dict on a generator expression. If it can be rewritten as a dictionary comprehension, it will almost certainly be clearer that way, and if it can't be rewritten as a dictionary comprehension, it probably shouldn't be a 1-liner expression in the first place.
Of course breaking expressions up into separate expressions, turning some of them into statements or even functions, and naming them does make your code longer—but that doesn't necessarily mean worse. About half of the Zen of Python (import this) is devoted to explaining why. Or one quote from Guido: "Python is a bad language for code golf, on purpose."
If you really want to know what it would look like, let's break it into two steps:
>>> {k: [bit2.split(':') for bit2 in v.split(',')] for k, v in (bit.split('=') for bit in s.split('&'))}
{'clientid': [['b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0']],
'junketid': [['1342177342']],
'keyid': [['987654321']],
'md': [['AMT-Cam', 'avatar']],
'mk': [['PC']],
'playbackmode': [['st']],
'sessionid': [['']],
'userdata': [['ip', '192.168.10.10'],
['deviceid', '1234'],
['optdata', '75BCD15']],
'ver': [['6']],
'version': [['6.7.8.9012']]}
That illustrates why you can't just add a dict call for the inner level—because most of those things aren't actually dictionaries, because they had no colons. If you changed that, then it would just be this:
{k: dict(bit2.split(':') for bit2 in v.split(',')) for k, v in (bit.split('=') for bit in s.split('&'))}
I don't think that's very readable, and I doubt most Python programmers would find it so. Reading it 6 months from now and trying to figure out what I meant would take a lot more effort than writing it did.
And trying to debug it will not be fun. What happens if you run that on your input, with missing colons? ValueError: dictionary update sequence element #0 has length 1; 2 is required. Which sequence? No idea. You have to break it down step by step to see what doesn't work. That's no fun.
So, hopefully that illustrates why you don't want to do this.
I have a dict like this:
{100002: 'APPLE', 100004: 'BANANA', 100005: 'CARROT'}
I am trying to make my dict have ints for the keys (as it does now) but have sets for the values (rather than strings as it is now.) My goal is to be able to read from a .csv file with one column for the key (an int which is the item id number) and then columns for things like size, shape, and color. I want to add this information into my dict so that only the information for keys already in dict are added.
My goal dict might look like this:
{100002: set(['APPLE','MEDIUM','ROUND','RED']), 100004: set(['BANANA','MEDIUM','LONG','YELLOW']), 100005: set(['CARROT','MEDIUM','LONG','ORANGE'])}
Starting with my dict of just key + string for item name, I tried code like this to read the extra information in from a .csv file:
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in MyDict.keys():
        MyDict[int(spl_line[0])].update(spl_line[1:])
Unfortunately this errors out, saying AttributeError: 'str' object has no attribute 'update'. My attempts to change my dictionary's values into sets so that I can then .update them have yielded things like this: {100002: set(['A','P','L','E']), 100004: set(['B','A','N']), 100005: set(['C','A','R','O','T'])}
I want to convert the values to a set so that the string that is currently the value will be the first string in the set rather than breaking up the string into letters and making a set of those letters.
I also tried making the values a set when I create the dict by zipping two lists together but it didn't seem to make any difference. Something like this
MyDict = dict(zip(listofkeys, set(listofnames)))
still makes the whole listofnames list into a set but it doesn't achieve my goal of making each value in MyDict into a set with the corresponding string from listofnames as the first string in the set.
How can I make the values in MyDict into a set so that I can add additional strings to that set without turning the string that is currently the value in the dict into a set of individual letters?
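The pitfall in miniature: set() iterates whatever it is given, so a bare string becomes a set of its characters, while wrapping the string in a list (or using a set literal) keeps it whole. Likewise, .update() takes an iterable, so pass it a list of strings.

```python
name = 'APPLE'

print(set(name))      # iterates the string: a set of its letters
print(set([name]))    # iterates the list: a one-element set {'APPLE'}

d = {100002: set([name])}
d[100002].update(['MEDIUM', 'ROUND', 'RED'])   # a list of strings, not a bare string
print(d)
```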
EDIT:
I currently make MyDict by using one function to generate a list of item ids (which are the keys) and another function which looks up those item ids to generate a list of corresponding item names (using a two column .csv file as the data source) and then I zip them together.
ANSWER:
Using the suggestions here I came up with this solution. I found that the section with set()).update can easily be changed to list()).append to yield a list rather than a set (so that order is preserved). I also found it easier to update my .csv data input files by adding the column containing names to FileWithTheData.csv, so that I didn't have to mess with making the dict, converting the values to sets, and then adding in more data. My code for this section now looks like this:
MyDict = {}
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in itemidlist:  # the list I was formerly zipping together with a corresponding list of names to make my dict
        MyDict.setdefault(int(spl_line[0]), list()).append(spl_line[1:])
print MyDict
Your error occurs because your MyDict variable originally maps an integer to a string. When you try to update it, you are treating the value like a set when it is still a string.
You can use a defaultdict for this:
from collections import defaultdict

combined_dict = defaultdict(set)

# first add all the values from MyDict
for key, value in MyDict.iteritems():
    combined_dict[int(key)].add(value)

# then add the values from the file
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    combined_dict[int(spl_line[0])].update(spl_line[1:])
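A runnable miniature of the same idea, with inline rows standing in for the .csv file and Python 3's items in place of iteritems:

```python
from collections import defaultdict

MyDict = {100002: 'APPLE', 100004: 'BANANA'}
csv_rows = [
    [100002, 'MEDIUM', 'ROUND', 'RED'],
    [100004, 'MEDIUM', 'LONG', 'YELLOW'],
]

combined_dict = defaultdict(set)
for key, value in MyDict.items():
    combined_dict[key].add(value)            # seed each set with the item name
for row in csv_rows:
    combined_dict[row[0]].update(row[1:])    # merge in the extra columns

print(dict(combined_dict))
```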
Your issue is with how you are initializing MyDict, try changing it to the following:
MyDict = dict(zip(listofkeys, [set([name]) for name in listofnames]))
Here is a quick example of the difference:
>>> listofkeys = [100002, 100004, 100005]
>>> listofnames = ['APPLE', 'BANANA', 'CARROT']
>>> dict(zip(listofkeys, set(listofnames)))
{100002: 'CARROT', 100004: 'APPLE', 100005: 'BANANA'}
>>> dict(zip(listofkeys, [set([name]) for name in listofnames]))
{100002: set(['APPLE']), 100004: set(['BANANA']), 100005: set(['CARROT'])}
set(listofnames) is just going to turn your list into a set, and the only effect that might have is to reorder the values as seen above. You actually want to take each string value in your list, and convert it to a one-element set, which is what the list comprehension does.
After you make this change, your current code should work fine, although you can just do the contains check directly on the dictionary instead of explicitly checking the keys (key in MyDict is the same as key in MyDict.keys()).
I am writing a script that looks through my inventory, compares it with a master list of all possible inventory items, and tells me what items I am missing. My goal is a .csv file where the first column contains a unique key integer and then the remaining several columns would have data related to that key. For example, a three row snippet of my end-goal .csv file might look like this:
100001,apple,fruit,medium,12,red
100002,carrot,vegetable,medium,10,orange
100005,radish,vegetable,small,10,red
The data for this is drawn from a couple of sources. First, a query to an API server gives me a list of keys for items that are in inventory. Second, I read a .csv file into a dict that matches keys with item names for all possible keys. A snippet of the first 5 rows of this .csv file might look like this:
100001,apple
100002,carrot
100003,pear
100004,banana
100005,radish
Note that any key in my inventory list will be found in this two-column .csv file, which gives all keys and their corresponding item names; this full list minus my inventory on hand yields what I'm looking for (the inventory I need to get).
So far I can get a .csv file that contains just the keys and item names for the items that I don't have in inventory. Given a list of inventory on hand like this:
100003,100004
A snippet of my resulting .csv file looks like this:
100001,apple
100002,carrot
100005,radish
This means that I have pear and banana in inventory (so they are not in this .csv file.)
To get this I have a function to get an item name when given an item id that looks like this:
def getNames(id_to_name, ids):
    return [id_to_name[id] for id in ids]
Then I have a function that returns a list of integer keys from my inventory server API call, which I run like this:
invlist = ServerApiCallFunction(AppropriateInfo)
A third function takes this invlist as its input and returns a dict of keys (the item id) and names for the items I don't have. It also writes the information of this dict to a .csv file. I am using the set1 - set2 method to do this. It looks like this:
def InventoryNumbers(inventory):
    with open(csvfile, 'w') as c:
        c.write('InvName' + ',InvID' + '\n')
    missinginvnames = []
    with open("KeyAndItemNameTwoColumns.csv", "rb") as fp:
        reader = csv.reader(fp, skipinitialspace=True)
        fp.readline()  # skip header
        invidsandnames = {int(id): str.upper(name) for id, name in reader}
    invids = set(invidsandnames.keys())
    invnames = set(invidsandnames.values())
    invonhandset = set(inventory)
    missinginvidsset = invids - invonhandset
    missinginvids = list(missinginvidsset)
    missinginvnames = getNames(invidsandnames, missinginvids)
    missinginvnameswithids = dict(zip(missinginvnames, missinginvids))
    print missinginvnameswithids
    with open(csvfile, 'a') as c:
        for invname, invid in missinginvnameswithids.iteritems():
            c.write(invname + ',' + str(invid) + '\n')
    return missinginvnameswithids
Which I then call like this:
InventoryNumbers(invlist)
With that explanation, now on to my question here. I want to expand the data in this output .csv file by adding in additional columns. The data for this would be drawn from another .csv file, a snippet of which would look like this:
100001,fruit,medium,12,red
100002,vegetable,medium,10,orange
100003,fruit,medium,14,green
100004,fruit,medium,12,yellow
100005,vegetable,small,10,red
Note how this does not contain the item name (so I have to pull that from a different .csv file that just has the two columns of key and item name) but it does use the same keys. I am looking for a way to bring in this extra information so that my final .csv file will not just tell me the keys (which are item ids) and item names for the items I don't have in stock but it will also have columns for type, size, number, and color.
One option I've looked at is the defaultdict piece from collections, but I'm not sure if this is the best way to go about what I want to do. If I did use this method I'm not sure exactly how I'd call it to achieve my desired result. If some other method would be easier I'm certainly willing to try that, too.
How can I take my dict of keys and corresponding item names for items that I don't have in inventory and add to it this extra information in such a way that I could output it all to a .csv file?
EDIT: As I typed this up it occurred to me that I might make things easier on myself by creating a new single .csv file that would have data in the form key,item name,type,size,number,color (basically just copying the item name column into the .csv that already has the other information for each key). This way I would only need to draw from one .csv file rather than two. Even if I did this, though, how would I go about making my desired .csv file based on only those keys for items not in inventory?
ANSWER: I posted another question here about how to implement the solution I accepted (because it was giving me a ValueError, since my dict values were strings rather than sets to start with), and I ended up deciding that I wanted a list rather than a set (to preserve the order). I also ended up adding the column with item names to my .csv file that had all the other data, so that I only had to draw from one .csv file. That said, here is what this section of code now looks like:
MyDict = {}
infile = open('FileWithAllTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in missinginvids:  # the list I was using as the keys for my dict, which I was zipping together with a corresponding list of item names to make my dict before
        MyDict.setdefault(int(spl_line[0]), list()).append(spl_line[1:])
print MyDict
It sounds like what you need is a dict mapping ints to sets, i.e.,
MyDict = {100001: set(['apple']), 100002: set(['carrot'])}
You can add with update:
MyDict[100001].update(['fruit'])
which would give you: {100001: set(['apple', 'fruit']), 100002: set(['carrot'])}
Also, if you had a list of attributes of carrot, e.g. ['vegetable', 'orange'],
you could say MyDict[100002].update(['vegetable', 'orange'])
and get: {100001: set(['apple', 'fruit']), 100002: set(['carrot', 'vegetable', 'orange'])}
Does this answer your question?
EDIT:
to read into CSV...
infile = open('MyFile.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in MyDict.keys():
        MyDict[int(spl_line[0])].update(spl_line[1:])
This isn't an answer to the question, but here is a possible way of simplifying your current code.
This:
invids = set(invidsandnames.keys())
invnames = set(invidsandnames.values())
invonhandset = set(inventory)
missinginvidsset = invids - invonhandset
missinginvids = list(missinginvidsset)
missinginvnames = getNames(invidsandnames, missinginvids)
missinginvnameswithids = dict(zip(missinginvnames, missinginvids))
Can be replaced with:
invonhandset = set(inventory)
missinginvnameswithids = {k: v for k, v in invidsandnames.iteritems() if k not in invonhandset}
Or:
invonhandset = set(inventory)
for key in invidsandnames.keys():
    if key not in invonhandset:
        del invidsandnames[key]
missinginvnameswithids = invidsandnames
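The comprehension version, sketched with toy data (and Python 3's items in place of iteritems):

```python
invidsandnames = {100001: 'APPLE', 100003: 'PEAR', 100005: 'RADISH'}
inventory = [100003]              # ids on hand, e.g. from the API call

invonhandset = set(inventory)
missinginvnameswithids = {k: v for k, v in invidsandnames.items()
                          if k not in invonhandset}
print(missinginvnameswithids)
```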
Have you considered making a temporary RDB (Python has sqlite support baked in)? For reasonable numbers of items I don't think you would have any performance issues.
I would turn each CSV file and the result from the web API into a table (one table per data source). You can then do everything you want with some SQL queries and joins. Once you have the data you want, you can dump it back to CSV.
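A minimal sketch of that approach with the stdlib sqlite3 module (table and column names are made up, and the rows are inlined where real code would load them from the CSV files and the API call):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a temporary, throwaway database
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE attrs (id INTEGER PRIMARY KEY, kind TEXT, size TEXT, color TEXT)")
conn.execute("CREATE TABLE onhand (id INTEGER PRIMARY KEY)")

conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(100001, "apple"), (100003, "pear"), (100005, "radish")])
conn.executemany("INSERT INTO attrs VALUES (?, ?, ?, ?)",
                 [(100001, "fruit", "medium", "red"),
                  (100003, "fruit", "medium", "green"),
                  (100005, "vegetable", "small", "red")])
conn.execute("INSERT INTO onhand VALUES (100003)")

# missing items = names joined to their attributes, minus what's on hand
rows = conn.execute("""
    SELECT n.id, n.name, a.kind, a.size, a.color
    FROM names n JOIN attrs a ON n.id = a.id
    WHERE n.id NOT IN (SELECT id FROM onhand)
    ORDER BY n.id
""").fetchall()
print(rows)
```

Each row is then ready to be written out with csv.writer.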