Correct python Regular expression to create double dict - python

I have a list of files with names name_x01_y01_000.h5 or name_y01_x01_000.h5
What is the correct regular expression (or other method) to create a list of:
file, x_ind, y_ind
So far I have this code:
name = 'S3_FullBrain_Mosaic_'
type = '.h5'
wildc = name + '*' + type
files = glob.glob(wildc)
files = np.asarray(files)
wildre = 'r\"' +name+'x(?P<x_ind>\d+)_y(?P<y_ind>\d+).+\"'
m = re.match(wildre,files)

Since the glob already ensures the correct filename and extension, the regex need only match the indices. re.search allows a partial match. .groupdict creates a dictionary with named groups as keys. The file key can be handled manually.
>>> file = 'S3_FullBrain_Mosaic_x02_y05_abcd.h5'
>>> result = re.search(r'x(?P<x_ind>\d+)_y(?P<y_ind>\d+)', file).groupdict()
>>> result
{'y_ind': '05', 'x_ind': '02'}
>>> result['file'] = file
>>> result
{'y_ind': '05', 'file': 'S3_FullBrain_Mosaic_x02_y05_abcd.h5', 'x_ind': '02'}
You can iterate over the files to produce the list of dicts. For this there's no need to create a numpy array, since I doubt you're going to do any heavy numerical calculations on the files list.
To handle both possible formats you will need to call re.search with two regexes. One will return None, the other a match on which you can use groupdict.
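A minimal sketch of that two-regex approach could look like the following. The helper name `parse_indices` and the sample file names are illustrative assumptions, not part of the question.

```python
import re

# One pattern per naming scheme; both expose the same named groups.
PATTERNS = [
    re.compile(r'x(?P<x_ind>\d+)_y(?P<y_ind>\d+)'),
    re.compile(r'y(?P<y_ind>\d+)_x(?P<x_ind>\d+)'),
]


def parse_indices(filename):
    """Return a dict with file, x_ind and y_ind, or None if no pattern matches."""
    for pattern in PATTERNS:
        m = pattern.search(filename)
        if m is not None:
            result = m.groupdict()
            result['file'] = filename
            return result
    return None


# Hypothetical file names covering both orderings.
files = ['name_x01_y02_000.h5', 'name_y03_x04_000.h5']
records = [parse_indices(f) for f in files]
```

The indices stay strings here, as in the answer above; wrap them in int() if you need numbers.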

You could use re.findall
import re
names = ['name_x01_y01_000.h5', 'name_y01_x01_000.h5']
for name in names:
    matches = re.findall(r'_([xy])(\d+)(?=_)', name)
    d = {k: int(v) for k, v in matches}
    d['name'] = name

Related

Reverse a list in python with a condition

I want to sort this list in this way:
the item with the .log suffix should come first
and the .gz files should follow in descending order
my_list = [
'/abc/a.log.1.gz',
'/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.2.gz',
'/abc/a.log.5.gz',
'/abc/a.log.3.gz',
'/abc/a.log.6.gz',
'/abc/a.log.4.gz',
]
expected_result:
my_list = ['/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.6.gz',
'/abc/a.log.5.gz',
'/abc/a.log.4.gz',
'/abc/a.log.3.gz',
'/abc/a.log.2.gz',
'/abc/a.log.1.gz']
reversed(my_list) also does not give me the desired result.
I would use this sort helper:
import re
def sort_helper(i: str):
    m = re.search(r'^.*?(\d+)(\.gz)', i)
    try:
        return int(m.group(1))
    except AttributeError:
        # No number (the plain .log file): with reverse=True, infinity sorts first.
        return float('inf')
print(sorted(my_list, key=sort_helper, reverse=True))
Perhaps using a re is overkill, but it is a flexible tool.
To get around the lack of a leading zero in your filenames, return an int.
Also, note the use of the lazy quantifier at the start of the regular expression:
^.*?
not just
^.*
This matches as little as possible, letting the group for the number in the filename capture as many digits as possible.
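To make the difference concrete, here is a small sketch comparing the two prefixes on one of the paths from the question. The variable names are illustrative.

```python
import re

path = '/abc/a.log.30.gz'

# Lazy prefix: stops as early as possible, so (\d+) captures the full number.
lazy = re.search(r'^.*?(\d+)\.gz', path).group(1)

# Greedy prefix: swallows as much as it can, leaving (\d+) only the last digit.
greedy = re.search(r'^.*(\d+)\.gz', path).group(1)

print(lazy, greedy)
```

With the greedy prefix, a.log.30.gz would be keyed on its last digit only, which is why the lazy form matters for sorting.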
It seems like you are trying to sort file names. I would recommend using os.path to manipulate these strings.
First, use os.path.splitext to split off the extension so you can distinguish .log from .gz. Then call it again on the remainder to split off the file number, and convert that number to an integer.
For example:
import os
def get_sort_keys(filepath):
    split_file_path = os.path.splitext(filepath)
    sort_key = (split_file_path[1], *os.path.splitext(split_file_path[0]))
    return (sort_key[0], sort_key[1], int(sort_key[2].strip(".")) if sort_key[2] else 0)
print(sorted(my_list, key=get_sort_keys, reverse=True))
I am relying on the fact that the log extension will sort after gz lexicographically.
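As a rough illustration of what the two splitext calls produce for one path from the question, and of the string comparison the ordering relies on (variable names are illustrative):

```python
import os

# First split removes the final extension.
root, ext = os.path.splitext('/abc/a.log.30.gz')

# Second split peels the number off the remainder.
base, number = os.path.splitext(root)

print(root, ext, base, number)

# Strings compare character by character, and 'l' comes after 'g'.
log_sorts_after_gz = '.log' > '.gz'
```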

Regex to match strings in a list without .csv extension

How can I write a regular expression that matches only the names, without the .csv extension? This should be the required output:
Required Output:
['ap_2010', 'class_size', 'demographics', 'graduation','hs_directory', 'sat_results']
Input:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
I tried this, but it returns an empty list:
for i in data_files:
    regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
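A short sketch of the "leave the string alone" behaviour, using a few made-up names that are not in the question's list:

```python
import re

# Hypothetical names: only the first one ends in .csv.
names = ['ap_2010.csv', 'notes.txt', 'archive.csv.bak']

# The $ anchor means only a trailing ".csv" is removed.
stripped = [re.sub(r'\.csv$', '', n) for n in names]
print(stripped)
```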
A better approach in general is using the os module to handle anything to do with filenames:
[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
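Note that os.path.splitext removes whatever the last extension is, not just .csv. A small sketch with illustrative names that are not from the question:

```python
import os

# Hypothetical names: one .csv, one double extension, one with no extension.
names = ['ap_2010.csv', 'archive.tar.gz', 'README']

stems = [os.path.splitext(n)[0] for n in names]
print(stems)
```

Only the final extension is dropped, and a name without any extension comes back unchanged.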
If you want regex, the solution is r'(.*)\.csv':
for i in data_files:
    regex = re.findall(r'(.*)\.csv', i)
    print(regex)
Split the string at '.' and then take the last element of the split (using index [-1]). If this is 'csv' then it is a csv file.
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        pass  # It is a CSV file
    else:
        pass  # Not a CSV
# Input
data_files = [ 'ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv' ]
import re
pattern = r'(?P<filename>[a-z0-9A-Z_]+)\.csv'
prog = re.compile(pattern)
# `map` returns:
# - a list in Python 2.x
# - a lazy iterator (a map object) in Python 3.x; wrap it in list() to get a list
result = map(lambda data_file: prog.search(data_file).group('filename'), data_files)
l = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
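# Split off the extension; rstrip() strips a set of characters, not a suffix,
# so rstrip('.csv') would also eat the trailing 'cs' of 'demographics'.
print([i.rsplit('.', 1)[0] for i in l])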

How to filter out file names with certain prefix and postfix (extension)?

I have a list of files like this:
file_list = ['file1.zip', 'file1.txt']
file_prefix = 'file1'
I'd like to use filter and re to only get file1.txt above. I'm trying this:
regex = re.compile(file_prefix + '.*(!zip).*')
result = list(filter(regex.search, file_list))
# in the above, result should be populated with just ['file1.txt']
But the regex pattern is not working. Could someone help me out with this? Thanks very much in advance!
You can use negative lookahead like this:
regex = re.compile(file_prefix + r'(?!\.zip)')
Code:
>>> file_list = ['file1.zip', 'file1.txt']
>>> file_prefix = 'file1'
>>> regex = re.compile(file_prefix + r'(?!\.zip)')
>>> print(list(filter(regex.search, file_list)))
['file1.txt']
(?!\.zip) is a negative lookahead: it succeeds only when .zip does not appear immediately at the next position.
Read more about look-arounds
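One caveat worth sketching: the lookahead only inspects the characters directly after the prefix. The extra name file10.zip, the use of re.escape, and the stricter variant below are illustrative assumptions, not part of the answer above.

```python
import re

file_list = ['file1.zip', 'file1.txt', 'file10.zip']  # file10.zip is a made-up extra
file_prefix = 'file1'

# re.escape guards against regex metacharacters in the prefix.
regex = re.compile(re.escape(file_prefix) + r'(?!\.zip)')
kept = [name for name in file_list if regex.search(name)]

# Stricter variant: reject any name that ends in .zip after the prefix.
strict = re.compile(re.escape(file_prefix) + r'(?!.*\.zip$)')
strict_kept = [name for name in file_list if strict.search(name)]

print(kept, strict_kept)
```

file10.zip passes the first pattern because the character right after file1 is 0, not a dot.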
No need for regex for this solution - you don't need to bring a cannon to a thumb-fight. Use Python's native string search/check:
file_list = ["file1.zip", "file1.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.startswith(file_prefix) and not e.endswith(file_exclude)]
# ['file1.txt']
Should be considerably faster, too.
If you don't want to check only the edges, and instead want to drop only the entries that have zip after the file_prefix, no matter where the prefix sits in the string (so you want to match some_file1.txt, or even a_zip_file1.txt, but not file1_zip.txt), you can slightly modify it:
file_list = ["file1.zip", "file1.txt", "some_file1.txt", "a_zip_file1.txt", "file1_zip.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.find(file_exclude) < e.find(file_prefix)]
# ['file1.txt', 'some_file1.txt', 'a_zip_file1.txt']

Python: Group a list of file names according to common name identifier

In a directory I have some files:
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu05_les_spec_r0.0350.0
temperature_Resu05_les_spec_r0.0400.0
temperature_Resu05_les_spec_r0.0450.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0350.0
temperature_Resu06_les_spec_r0.0400.0
temperature_Resu06_les_spec_r0.0450.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0350.0
temperature_Resu07_les_spec_r0.0400.0
temperature_Resu07_les_spec_r0.0450.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0350.0
temperature_Resu08_les_spec_r0.0400.0
temperature_Resu08_les_spec_r0.0450.0
temperature_Resu09_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0350.0
temperature_Resu09_les_spec_r0.0400.0
temperature_Resu09_les_spec_r0.0450.0
I need a list of all the files that have the same identifier XXXX as in _rXXXX. For example one such list would be composed of
temperature_Resu05_les_spec_r0.0300.0
temperature_Resu06_les_spec_r0.0300.0
temperature_Resu07_les_spec_r0.0300.0
temperature_Resu08_les_spec_r0.0300.0
temperature_Resu09_les_spec_r0.0300.0
I don't know a priori what the XXXX values are going to be, so I can't iterate through them and match like that. I'm thinking this might best be handled with a regular expression. Any ideas?
Yes, regular expressions are a fun way to do it! It could look something like this:
import re
results = {}
for fname in fnames:
    file_id = re.search(r'.*_r(.*)', fname).group(1)  # grabs whatever is after the final "_r" as an identifier
    if file_id in results:
        results[file_id].append(fname)  # append; += would add the name one character at a time
    else:
        results[file_id] = [fname]
The results will be stored in a dictionary, results, indexed by the id.
I should add that this will work as long as all file names reliably have the _rXXXX structure. If there's any chance that a file name will not match that pattern, you will have to check for it and act accordingly.
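One way to handle names that do not follow the pattern is to test the match object before using it. This is a sketch only: the shortened file list, the notes.txt entry, and the `unmatched` list are assumptions made for illustration.

```python
import re
from collections import defaultdict

fnames = [
    'temperature_Resu05_les_spec_r0.0300.0',
    'temperature_Resu06_les_spec_r0.0300.0',
    'temperature_Resu05_les_spec_r0.0350.0',
    'notes.txt',  # hypothetical file that lacks the _rXXXX part
]

results = defaultdict(list)
unmatched = []
for fname in fnames:
    m = re.search(r'.*_r(.*)', fname)
    if m is None:
        # re.search returns None when the pattern is absent.
        unmatched.append(fname)
    else:
        results[m.group(1)].append(fname)
```

Collecting the non-matching names separately makes it easy to log or inspect them later instead of failing with an AttributeError.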
No, a regex is not the best way. Your pattern is very straightforward: just str.rsplit on the _r and use the right-hand element of the split as the key to group the data with. A defaultdict will do the grouping efficiently:
from collections import defaultdict
from pprint import pprint as pp
with open("yourfile") as f:
    groups = defaultdict(list)
    for line in f:
        line = line.rstrip()  # strip the newline first so it cannot end up in the key
        groups[line.rsplit("_r", 1)[1]].append(line)
pp(list(groups.values()))
Which for your sample will give you (groups appear in order of first occurrence on Python 3.7+):
[['temperature_Resu05_les_spec_r0.0300.0',
  'temperature_Resu06_les_spec_r0.0300.0',
  'temperature_Resu07_les_spec_r0.0300.0',
  'temperature_Resu08_les_spec_r0.0300.0',
  'temperature_Resu09_les_spec_r0.0300.0'],
 ['temperature_Resu05_les_spec_r0.0350.0',
  'temperature_Resu06_les_spec_r0.0350.0',
  'temperature_Resu07_les_spec_r0.0350.0',
  'temperature_Resu08_les_spec_r0.0350.0',
  'temperature_Resu09_les_spec_r0.0350.0'],
 ['temperature_Resu05_les_spec_r0.0400.0',
  'temperature_Resu06_les_spec_r0.0400.0',
  'temperature_Resu07_les_spec_r0.0400.0',
  'temperature_Resu08_les_spec_r0.0400.0',
  'temperature_Resu09_les_spec_r0.0400.0'],
 ['temperature_Resu05_les_spec_r0.0450.0',
  'temperature_Resu06_les_spec_r0.0450.0',
  'temperature_Resu07_les_spec_r0.0450.0',
  'temperature_Resu08_les_spec_r0.0450.0',
  'temperature_Resu09_les_spec_r0.0450.0']]

Extract int between two different strings in python

I have a list files of strings of the following format:
files = ['/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5', ...]
I want to extract the int between iter_ and .caffemodel and return a list of those ints.
After some research I came up with this solution that does the trick, but I was wondering if there is a more elegant/pythonic way to do it, possibly using a list comprehension?
li = []
for f in files:
    tmp = re.search('iter_[\d]+.caffemodel', f).group()
    li.append(int(re.search(r'\d+', tmp).group()))
Just to add another possible solution: join the file names together into one big string (it looks like they all end with h5, so there is no danger of creating unwanted matches) and use re.findall on that:
import re
li = [int(d) for d in re.findall(r'iter_(\d+)\.caffemodel', ''.join(files))]
Use just:
li = []
for f in files:
    tmp = int(re.search(r'iter_(\d+)\.caffemodel', f).group(1))
    li.append(tmp)
If you put part of an expression in parentheses, it creates a capture group whose matched text you can retrieve separately, here with group(1).
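A brief sketch of how the groups are numbered, using the base name of one of the files from the question (the variable names are illustrative):

```python
import re

f = '2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5'
m = re.search(r'iter_(\d+)\.caffemodel', f)

whole = m.group(0)         # the entire matched text
number = int(m.group(1))   # only the part inside the first pair of parentheses

print(whole, number)
```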
You can also use a lookbehind assertion:
regex = re.compile(r"(?<=iter_)\d+")
for f in files:
    number = regex.search(f).group(0)  # a string; wrap in int() if you need a number
Solution with list comprehension, as you wished:
import re
re_model_id = re.compile(r'iter_(?P<model_id>\d+)\.caffemodel')
li = [int(re_model_id.search(f).group('model_id')) for f in files]
Without a regex:
files = [
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5']
print([f.rsplit("_", 1)[1].split(".", 1)[0] for f in files])
['418000', '502000']
Or if you want to be more specific:
print([f.rsplit("iter_", 1)[1].split(".caffemodel", 1)[0] for f in files])
But your pattern seems to repeat so the first solution is probably sufficient.
You can also slice using find and rfind:
print( [f[f.find("iter_")+5: f.rfind("caffe")-1] for f in files])
['418000', '502000']
