I have two files that contain hashes; one of them looks something like this:
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
The other one looks something like this:
This is a random assortment 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 of characters in order 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 to see if I can find a file 3a5fb7652e4c4319455769d5462eb2c4ac4cbe79 full of hashes
I'm trying to pull the hashes from these files using a regular expression to match the hash:
def hash_file_generator(self):
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return re.compile(''.join(regex_string))

    matched_hashes = set()
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as wordlist:
        for item in wordlist.readlines():
            for s in item.split("\n"):
                for k in keys:
                    k = __fix_re_pattern(k.pattern)
                    print k.pattern
                    if k.findall(s):
                        matched_hashes.add(s)
    return matched_hashes
The regular expression that matches these hashes looks like this: [a-fA-F0-9]{40}.
However, when this is run, it pulls entire lines from the first file and saves them into the set, while on the second file it works correctly:
First file:
set(['1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107', '2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126', '3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887'])
Second file:
set(['3a5fb7652e4c4319455769d5462eb2c4ac4cbe79'])
How can I pull just the matched data from the first file using the regex as seen here, and why is it pulling everything instead of just the matched data?
Edit for comments
def hash_file_generator(self):
    """
    Parse a given file for anything that matches the hashes in the
    hash type regex dict. Possible that this will pull random bytes
    of data from the files.
    """
    def __fix_re_pattern(regex_string, to_add=r""):
        regex_string = list(regex_string)
        regex_string[0] = to_add
        regex_string[-1] = to_add
        return ''.join(regex_string)

    matched_hashes = []
    keys = [k for k in bin.verify_hashes.verify.HASH_TYPE_REGEX.iterkeys()]
    with open(self.words) as hashes:
        for k in keys:
            k = re.compile(__fix_re_pattern(k.pattern))
            matched_hashes = [
                i for line in hashes
                for i in k.findall(line)
            ]
    return matched_hashes
Output:
[]
If you just want to pull the hashes, this should work:
import re

hash_pattern = re.compile("[a-fA-F0-9]{40}")
with open("hashes.txt", "r") as hashes:
    matched_hashes = [i for line in hashes
                      for i in hash_pattern.findall(line)]
print(matched_hashes)
Note that this doesn't match some strings that look like hashes, because they contain, for example, an 'r' (which is not a hex digit), but it uses your specified regex.
This works by using re.findall, which returns a list of strings, one per match, and a list comprehension to apply it to each line of the file.
When hashes.txt is
user_id,user_bio,user_pass,user_name,user_email,user_banned,user_regdate,user_numposts,user_timezone,user_bio_status,user_lastsession,user_newpassword,user_email_public,user_allowviewonline,user_lasttimereadpost
1,<blank>,a1fba56e72b37d0ba83c2ccer7172ec8eb1fda6d,human,human#place.com,0,1115584099,1,2.0,1,1115647107,<blank>,0,1,1115647107
2,<blank>,b404bac52c91ef1f291ba9c2719aa7d916dc55e5,josh,josh#place.com,0,1115584767,1,2.0,5,1115585298,<blank>,0,1,1115585126
3,<blank>,3a5fb7652e4c4319455769d5462eb2c4ac4cbe79,rich,rich#place.com,0,1167079798,1,2.0,5,1167079798,<blank>,0,1,1167079887
this has the output
['b404bac52c91ef1f291ba9c2719aa7d916dc55e5', '3a5fb7652e4c4319455769d5462eb2c4ac4cbe79']
Having looked at your code as it stands, I can tell you one thing: __fix_re_pattern probably isn't doing what you want it to. It currently removes the first and last character of any regex you pass it, which will ironically and horribly mangle the regex.
def __fix_re_pattern(regex_string, to_add=r""):
    regex_string = list(regex_string)
    regex_string[0] = to_add
    regex_string[-1] = to_add
    return ''.join(regex_string)

print(__fix_re_pattern("[a-fA-F0-9]{40}"))
will output
a-fA-F0-9]{40
I'm still missing a lot of context in your code, and it's not quite self-contained enough to do without it: I can't meaningfully reconstruct your code to reproduce the problem, which leaves me troubleshooting by eye. Presumably this is an instance method of an object whose words attribute, for some reason, contains a file name. I can't really tell what keys is, for example, so I'm still finding it difficult to provide you with an entire 'fix'. I also don't know the intention behind __fix_re_pattern, but I think your code would work fine if you just took it out entirely.
Another problem is that for each k in whatever keys is, you overwrite the variable matched_hashes, so you only return the matched hashes for the last key. Worse, the file iterator hashes is exhausted after the first key's pass, so every later pass (including the last) iterates over nothing, which is why you end up with [].
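To illustrate both points, here's a minimal sketch (not your exact code) that reads the file once and accumulates matches across every pattern; patterns here stands in for whatever your HASH_TYPE_REGEX values are:

import re

def find_all_hashes(path, patterns):
    matched_hashes = []
    with open(path) as hashes:
        for line in hashes:            # single pass over the file...
            for pattern in patterns:   # ...trying every pattern per line
                matched_hashes.extend(pattern.findall(line))
    return matched_hashes

# e.g. a SHA-1-shaped pattern; add more compiled patterns as needed
patterns = [re.compile("[a-fA-F0-9]{40}")]
print(find_all_hashes("hashes.txt", patterns))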
Also, the whole keys thing kind of intrigues me... is it a call to some kind of globally defined function/module/class which knows about hash regexes?
Now you probably know best what your code wants, but it nevertheless seems a little complicated. I'd advise you to keep in the back of your mind that my first answer, as it stands, also entirely meets the specification of your question.
I have the following code that filters and prints a list. The final output is JSON with names in the form name.example.com. I want to substitute that with name.sub.example.com, but I'm having a hard time actually doing that. filterIP is a working bit of code that removes elements entirely, and I have been trying to re-use that pattern to also modify elements; it doesn't have to be handled this way.
def filterIP(fullList):
    regexIP = re.compile(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$')
    return filter(lambda i: not regexIP.search(i), fullList)

def filterSub(fullList2):
    regexSub = re.compile(r'example\.com, sub.example.com')
    return filter(lambda i: regexSub.search(i), fullList2)

groups = {key: filterSub(filterIP(list(set(items)))) for (key, items) in groups.iteritems()}
print(self.json_format_dict(groups, pretty=True))
This is what I get without filterSub
"type_1": [
"server1.example.com",
"server2.example.com"
],
This is what I get with filterSub
"type_1": [],
This is what I'm trying to get
"type_1": [
"server1.sub.example.com",
"server2.sub.example.com"
],
The statement:
regexSub = re.compile(r'example\.com, sub.example.com')
doesn't do what you think it does. It creates a compiled regular expression that matches the string "example.com" followed by a comma, a space, the string "sub", an arbitrary character, the string "example", an arbitrary character, and the string "com". It does not create any sort of substitution.
Instead, you want to write something like this, using the re.sub function to perform the substitution and using map to apply it:
def filterSub(fullList2):
    regexSub = re.compile(r'example\.com')
    return map(lambda i: re.sub(regexSub, "sub.example.com", i),
               filter(lambda i: re.search(regexSub, i), fullList2))
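Assuming the corrected filterSub above, a quick check on a made-up input list:

print(filterSub(["server1.example.com", "server2.example.com"]))
# ['server1.sub.example.com', 'server2.sub.example.com']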
If the examples are all truly as simple as those you listed, a regex is probably overkill. A simple solution would be to use string .split and .join. This would likely give better performance.
First split the url at the first period:
url = 'server1.example.com'
split_url = url.split('.', 1)
# ['server1', 'example.com']
Then you can rejoin the url with '.sub.' as the separator:
subbed_url = '.sub.'.join(split_url)
# 'server1.sub.example.com'
Of course, you can do the split and the join at the same time:
'.sub.'.join(url.split('.', 1))
Or create a simple function:
def sub_url(url):
    return '.sub.'.join(url.split('.', 1))
To apply this to the list you can take several approaches.
A list comprehension:
subbed_list = [sub_url(url) for url in url_list]
Map it:
subbed_list = map(sub_url, url_list)
Or my favorite, a generator:
gen_subbed = (sub_url(url) for url in url_list)
The last looks like a list comprehension but gives the added benefit that you don't rebuild the entire list. It processes the elements one item at a time as the generator is iterated through. If you decide you do need the list later you can simply convert it to a list as follows:
subbed_list = list(gen_subbed)
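Putting the pieces together on a made-up sample list:

url_list = ["server1.example.com", "server2.example.com"]  # sample data

def sub_url(url):
    return '.sub.'.join(url.split('.', 1))

gen_subbed = (sub_url(url) for url in url_list)
print(list(gen_subbed))
# ['server1.sub.example.com', 'server2.sub.example.com']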
I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),
                ('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),
                ('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),
                ('11','Scheduled Task',';'),('12','Profile','<'),
                ('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),
                ('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),
                ('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),
                ('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),
                ('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),
                ('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),
                ('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),
                ('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),
                ('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),
                ('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),
                ('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),
                ('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),
                ('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),
                ('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),
                ('53','Integration Configuration','e'),('54','Integration Command','f'),
                ('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),
                ('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each resource type whose ID appears in unwanted_resource_types, e.g. the final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place for a list comprehension, but that's still in my grey basket of understanding fully (the Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers, change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ...) you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve membership-test speed (if speed is an issue; otherwise it's unimportant).
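Putting both suggestions together on a truncated sample of the data:

# truncated sample of the two lists above
resource_types = [('0', 'Group', '0'), ('1', 'User', '1'),
                  ('2', 'Filter', '2'), ('3', 'Agent', '3')]
unwanted_resource_types = set([0, 1, 3])

result = [name for rid, name, prefix in resource_types
          if int(rid) in unwanted_resource_types]
# result == ['Group', 'User', 'Agent']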
The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.
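For reference, a sketch of the same idea with the lookup dictionary built just once, via a dict comprehension over the question's two lists, is much easier to read:

names_by_id = {int(rid): name for rid, name, prefix in resource_types}
result = [names_by_id[rid] for rid in unwanted_resource_types]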
I have about a thousand files that are named in a semi-sensible way like the following:
aaa.ba.ca.01
aaa.ba.ca.02
aaa.ba.ca.03
aaa.ba.da.01
aaa.ba.da.02
aaa.ba.da.03
and so on. Let's say each file contains two columns of numbers which I need to read into the dictionaries wavelength and flux. The reading-in part is easy for me; the hard part is that I need to load these dictionaries so that they store the information like:
wavelength['aaa.ba.ca.01'] (which is the wavelengths of that one file)
wavelength['aaa.ba.ca'] (which is the wavelengths of all subfiles, i.e. ...ca.01, ...ca.02, and ...ca.03 -- in order)
wavelength['aaa.ba'] (which also includes all wavelengths of all "subfiles" as well -- again in order).
and so on. The filenames are well-behaved (the sections are separated by periods, the grouping hierarchy always runs in the same direction, etc.), but they can be anywhere from 4 to 8 sections long.
My question: is there some sensible way to have python glob the names of the files and by parsing strings or some other magic get the data into these dictionaries? I've hit a brick wall.
A simple, but not efficient, way to do so is to subclass Python's dictionary, so that when given an incomplete key, it concatenates the contents of all matching keys, in alphabetical order.
There could be more efficient designs; this one's major drawback is that it sorts and scans all existing dictionary keys on every partial-key request. Otherwise, it is so simple to implement that it is worth a try:
class MultiDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        result = []
        for complete_key in sorted(self.keys()):
            if complete_key.startswith(key):
                result.extend(self[complete_key])
        return result

# example
a = MultiDict()
a["a0"] = [1]
a["a1"] = [2]
print a["a"]
# [1, 2]
As for getting the data into the dictionary, just iterate over all files with glob or os.listdir, and read the desired contents, as a list, into a MultiDict item using the filename as a key, for example:
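A minimal sketch, assuming each file holds two whitespace-separated columns (wavelength, then flux) and reusing the MultiDict above; the 'aaa.*' glob pattern is only a placeholder:

import glob

wavelength = MultiDict()
flux = MultiDict()
for filename in sorted(glob.glob('aaa.*')):
    wlist, flist = [], []
    with open(filename) as f:
        for line in f:
            w, fl = map(float, line.split())
            wlist.append(w)
            flist.append(fl)
    wavelength[filename] = wlist
    flux[filename] = flist

# full key -> one file's column; partial key -> concatenation in sorted order
wavelength['aaa.ba.ca.01']
wavelength['aaa.ba.ca']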
What you want does not sound like a dictionary at all. In many ways, I'd say that this is a data structure comparable to a tree. So instead of using a dictionary, you're going to want to build a data structure with a root node:
Root
'ba' 'ca' 'cd' 'fg'
/ | \ / \ / \ |
/ | \ / \ / \ |
'aa' 'di' '30' '34' '45' 'ac' 'ty' '01'
and then perform a depth-first search where the queried name determines which subtree you collect: 'ba.aa' would only return things from the 'ba'->'aa' leaf, while 'ba' would return 'ba'->'aa', 'ba'->'di', and 'ba'->'30'.
If you want, I'd make each "level" of nesting into its own dictionary. That way you could map quickly to the wavelengths and such.
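Here's a minimal sketch of that idea with plain nested dicts; the insert/query/collect names and the '_data' sentinel key are just illustrative:

def insert(tree, filename, data):
    node = tree
    for part in filename.split('.'):
        node = node.setdefault(part, {})  # one dict per name section
    node['_data'] = data

def collect(node):
    # Depth-first gather of all leaf data under a node, in sorted order.
    result = []
    if '_data' in node:
        result.extend(node['_data'])
    for key in sorted(k for k in node if k != '_data'):
        result.extend(collect(node[key]))
    return result

def query(tree, prefix):
    node = tree
    for part in prefix.split('.'):
        node = node[part]
    return collect(node)

tree = {}
insert(tree, 'aaa.ba.ca.01', [1.0, 2.0])
insert(tree, 'aaa.ba.ca.02', [3.0])
print(query(tree, 'aaa.ba.ca'))   # [1.0, 2.0, 3.0]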
If you only have 1000 files, a linear search to look them up is probably fine. On my machine it took 250 us to do one lookup. Then you can use itertools.chain to combine data from multiple files.
import itertools

class DataGlob(object):
    def __init__(self):
        self.files = []
        self.wavedata = dict()
        self.fluxdata = dict()

    def add(self, filename):
        wlist = []
        flist = []
        for l in open(filename):
            (wlen, flux) = map(float, l.split())
            wlist.append(wlen)
            flist.append(flux)
        self.files.append(filename)  # remember the key for prefix lookups
        self.wavedata[filename] = wlist
        self.fluxdata[filename] = flist

    def find_keys(self, prefix):
        return [f for f in self.files if f.startswith(prefix)]

    def wavelength(self, fileprefix):
        validkeys = self.find_keys(fileprefix)
        return itertools.chain.from_iterable(self.wavedata[k] for k in validkeys)

    def flux(self, fileprefix):
        validkeys = self.find_keys(fileprefix)
        return itertools.chain.from_iterable(self.fluxdata[k] for k in validkeys)
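Hypothetical usage, globbing the files (the pattern is a placeholder) and then querying by prefix:

import glob

g = DataGlob()
for fname in sorted(glob.glob('aaa.*')):  # sorted so subfiles chain in order
    g.add(fname)

# lazily chains the data from aaa.ba.ca.01, aaa.ba.ca.02, ...
wavelengths = list(g.wavelength('aaa.ba.ca'))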
I'm processing some files in a directory and need the files to be sorted numerically. I found some examples on sorting—specifically using the lambda pattern—at wiki.python.org, and I put this together:
import re
file_names = """ayurveda_1.tif
ayurveda_11.tif
ayurveda_13.tif
ayurveda_2.tif
ayurveda_20.tif
ayurveda_22.tif""".split('\n')
num_re = re.compile(r'_(\d{1,2})\.')
file_names.sort(
    key=lambda fname: int(num_re.search(fname).group(1))
)
Is there a better way to do this?
This is called "natural sorting" or "human sorting" (as opposed to lexicographical sorting, which is the default). Ned Batchelder wrote up a quick version of one.
import re

def tryint(s):
    try:
        return int(s)
    except:
        return s

def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [tryint(c) for c in re.split('([0-9]+)', s)]

def sort_nicely(l):
    """ Sort the given list in the way that humans expect.
    """
    l.sort(key=alphanum_key)
It's similar to what you're doing, but perhaps a bit more generalized.
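For example, on one of your file names the sort key comes out as:

>>> alphanum_key("ayurveda_11.tif")
['ayurveda_', 11, '.tif']

so the numeric chunk compares as an int rather than as text.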
Just use:

tiffFiles.sort(key=lambda var: [int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])

which is faster than using try/except.
If you are using key= in your sort method, you shouldn't use cmp, which has been removed from the latest versions of Python. key should be set to a function which takes a record as input and returns any object that will compare in the order you want your list sorted. It doesn't need to be a lambda, and might be clearer as a stand-alone function. Also, regular expressions can be slow to evaluate.
You could try something like the following to isolate and return the integer part of the file name:
def getint(name):
    basename = name.partition('.')
    alpha, num = basename.split('_')
    return int(num)

tiffiles.sort(key=getint)
@April provided a good solution in "How is Python's glob.glob ordered?" that you could try:

# First, get the files:
import glob
import re

files = glob.glob1(img_folder, '*' + output_image_format)

# Sort files according to the digits included in the filename
files = sorted(files, key=lambda x: float(re.findall(r"(\d+)", x)[0]))
Note that partition returns a tuple, so getint needs to unpack it:

def getint(name):
    (basename, part, ext) = name.partition('.')
    (alpha, num) = basename.split('_')
    return int(num)
This is a modified version of @Don O'Donnell's answer, because I couldn't get it working as-is, but I think it's the best answer here as it's well-explained.
def getint(name):
    _, num = name.split('_')
    num, _ = num.split('.')
    return int(num)

print(sorted(tiffFiles, key=getint))
Changes:
1) The alpha string doesn't get stored, as it's not needed (hence _, num)
2) Use num.split('.') to separate the number from .tiff
3) Use sorted instead of list.sort, per https://docs.python.org/2/howto/sorting.html