I want to sort this list in this way:
the file with the .log suffix should be the first item,
and the .gz files should be in descending numeric order
my_list = [
'/abc/a.log.1.gz',
'/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.2.gz',
'/abc/a.log.5.gz',
'/abc/a.log.3.gz',
'/abc/a.log.6.gz',
'/abc/a.log.4.gz',
]
expected_result:
my_list = ['/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.6.gz',
'/abc/a.log.5.gz',
'/abc/a.log.4.gz',
'/abc/a.log.3.gz',
'/abc/a.log.2.gz',
'/abc/a.log.1.gz']
reversed(my_list) also doesn't give me the desired result.
I would use this sort helper:
import re
def sort_helper(i: str):
    m = re.search(r'^.*?(\d+)(\.gz)', i)
    try:
        return int(m.group(1))
    except AttributeError:
        # no digits matched: this is the bare .log file; return infinity
        # so it sorts first when reverse=True
        return float('inf')
print(sorted(my_list, key=sort_helper, reverse=True))
Perhaps using a regex is overkill, but it is a flexible tool.
To get around the lack of leading zeros in your filenames, return an int (as strings, '30' would sort before '5').
Also, note the use of the lazy quantifier at the start of the regular expression:
^.*?
not just
^.*
This matches as little as possible, letting the match for the number in the filename be as greedy as possible.
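Here is a quick illustration of the difference, using one of the paths above:
greedy = re.search(r'^.*(\d+)(\.gz)', '/abc/a.log.30.gz')
lazy = re.search(r'^.*?(\d+)(\.gz)', '/abc/a.log.30.gz')
print(greedy.group(1))  # '0'  -- the greedy .* swallows the leading '3'
print(lazy.group(1))    # '30' -- the lazy .*? leaves the whole number intact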
It seems like you are trying to sort file names, so I would recommend using os.path to manipulate these strings.
First, you can use os.path.splitext to split out the extension and compare .log against .gz. Then split off the extension again to get the file number, and convert it to an integer.
For example:
import os
def get_sort_keys(filepath):
    split_file_path = os.path.splitext(filepath)
    sort_key = (split_file_path[1], *os.path.splitext(split_file_path[0]))
    return (sort_key[0], sort_key[1], int(sort_key[2].strip(".")) if sort_key[2] else 0)

print(sorted(my_list, key=get_sort_keys, reverse=True))
I am relying on the fact that the log extension will sort after gz lexicographically.
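As a quick check of why the tuple key works (using the get_sort_keys above):
print(get_sort_keys('/abc/a.log'))        # ('.log', '/abc/a', 0)
print(get_sort_keys('/abc/a.log.30.gz'))  # ('.gz', '/abc/a.log', 30)
With reverse=True, the '.log' tuple sorts first, and the '.gz' entries fall into descending numeric order.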
Related
How do I extract a file name that has numbers on both ends?
I extracted the file name from "56flybox007" using:
''.join(filter(lambda x: x.isalpha(), "56flybox007"))
This results in 'flybox', but I only want to remove the numbers at the end, so the result should be: 56flybox
Try using this code:
import string

Sample = "56flybox007"
cleaned = Sample.rstrip(string.digits)
print(cleaned)
Output:
56flybox
I would use a regex here. Even though there are plenty of other ways to achieve this, regexes are powerful and can easily be modified if your needs change:
import re
rx = re.search(r'(\d*\D+)\d*', '123abc456')
print(rx.group(1)) # >>> '123abc'
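A couple of quick checks with other input shapes (hypothetical values) show the flexibility:
print(re.search(r'(\d*\D+)\d*', 'flybox007').group(1))  # 'flybox'
print(re.search(r'(\d*\D+)\d*', '56flybox').group(1))   # '56flybox'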
Try this. It collects the alphabetic part once, then finds where it starts:
file = "56flybox007"
letters = ''.join(filter(lambda x: x.isalpha(), file))
print(file[:file.find(letters)] + letters)  # '56flybox'
You can use the rstrip method of strings to remove characters from the right-hand side. In this case you pass all the digits to rstrip, and it removes them only from the right-hand end of the string.
files = ["56flybox007", "45NotherFile456", "78LasstFile45"]
out_files = [file.rstrip("0123456789") for file in files]
print(files)
print(out_files)
OUTPUT
['56flybox007', '45NotherFile456', '78LasstFile45']
['56flybox', '45NotherFile', '78LasstFile']
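One caveat worth knowing: rstrip() treats its argument as a set of characters, not a suffix, so it only removes the trailing run of digits and leaves interior ones alone:
print("file2v1".rstrip("0123456789"))  # 'file2v' -- only the trailing '1' is removed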
How can I write a regular expression to match only the string names without the .csv extension?
Required Output:
['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']
Input:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
I tried the following, but it returns an empty list:
for i in data_files:
    regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
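Because the pattern is anchored with $, only a trailing '.csv' is removed; other names pass through untouched (hypothetical example):
print(re.sub(r'\.csv$', '', 'notes.txt'))  # 'notes.txt' -- left alone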
A better approach in general is using the os module to handle anything to do with filenames:
[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
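If you prefer the object-oriented API, an equivalent sketch with pathlib:
from pathlib import Path

print([Path(i).stem for i in data_files])  # same list of bare names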
If you want regex, the solution is r'(.*)\.csv':
for i in data_files:
    regex = re.findall(r'(.*)\.csv', i)
    print(regex)
Split the string at '.' and then take the last element of the split (using index [-1]). If this is 'csv' then it is a csv file.
names = []
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        names.append(i.rsplit('.', 1)[0])  # it is a CSV file; keep the stem
    # else: not a CSV file, skip it
print(names)
# Input
data_files = ['ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv']

import re

pattern = r'(?P<filename>[a-z0-9A-Z_]+)\.csv'
prog = re.compile(pattern)

# `map` yields a list in Python 2.x and a lazy iterator in Python 3.x
result = map(lambda data_file: prog.search(data_file).group('filename'), data_files)
print(list(result))
l = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
# rstrip() strips a set of characters, not a suffix (it would also eat the
# trailing 's' of 'sat_results'), so split off the extension instead:
print([i.rsplit('.', 1)[0] for i in l])
Let's say I have three files in a folder: file9.txt, file10.txt and file11.txt, and I want to read them in this particular order. Can anyone help me with this?
Right now I am using the code
import glob, os
for infile in glob.glob(os.path.join('*.txt')):
    print("Current File Being Processed is: " + infile)
and it reads file10.txt first, then file11.txt, and then file9.txt.
Can someone help me get the right order?
The filesystem does not return files in any particular order. You can sort the resulting filenames yourself using the sorted() function:
for infile in sorted(glob.glob('*.txt')):
    print("Current File Being Processed is: " + infile)
Note that the os.path.join call in your code is a no-op; with only one argument it doesn't do anything but return that argument unaltered.
Note that your files will sort in alphabetical ordering, which puts 10 before 9. You can use a custom key function to improve the sorting:
import re
numbers = re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts

for infile in sorted(glob.glob('*.txt'), key=numericalSort):
    print("Current File Being Processed is: " + infile)
The numericalSort function splits out any digits in a filename, turns them into actual numbers, and returns the result for sorting:
>>> files = ['file9.txt', 'file10.txt', 'file11.txt', '32foo9.txt', '32foo10.txt']
>>> sorted(files)
['32foo10.txt', '32foo9.txt', 'file10.txt', 'file11.txt', 'file9.txt']
>>> sorted(files, key=numericalSort)
['32foo9.txt', '32foo10.txt', 'file9.txt', 'file10.txt', 'file11.txt']
You can wrap your glob.glob( ... ) expression inside a sorted( ... ) statement and sort the resulting list of files. Example:
for infile in sorted(glob.glob('*.txt')):
You can give sorted a comparison function (Python 2 only) or, better, use the key=... argument to supply a custom key that is used for sorting.
Example:
There are the following files:
x/blub01.txt
x/blub02.txt
x/blub10.txt
x/blub03.txt
y/blub05.txt
The following code will produce the following output:
for filename in sorted(glob.glob('[xy]/*.txt')):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# x/blub10.txt
# y/blub05.txt
Now with a key function:
def key_func(x):
    return os.path.split(x)[-1]

for filename in sorted(glob.glob('[xy]/*.txt'), key=key_func):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# y/blub05.txt
# x/blub10.txt
EDIT:
Possibly this key function can sort your files:
pat = re.compile(r"(\d+)\D*$")

...

def key_func(x):
    mat = pat.search(os.path.split(x)[-1])  # match the last group of digits
    if mat is None:
        return x
    return "{:>10}".format(mat.group(1))  # right-align in a 10-character field
It can surely be improved, but I think you get the point. Paths without numbers are left alone; paths with numbers are converted to a string 10 characters wide that contains the number.
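A quick sanity check of the padding, assuming the key_func above (hypothetical paths):
print(repr(key_func('x/blub07.txt')))  # '        07' -- right-aligned in 10 characters
print(repr(key_func('x/readme.txt')))  # 'x/readme.txt' -- no digits, left alone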
You need to change the sort from 'ASCIIBetical' to numeric by isolating the number in the filename. You can do that like so:
import re

nondigits = re.compile(r"\D")

def keyFunc(afilename):
    return int(nondigits.sub("", afilename))

filenames = ["file10.txt", "file11.txt", "file9.txt"]

for x in sorted(filenames, key=keyFunc):
    print(x)
You can set filenames to the result of glob.glob("*.txt").
Additionally, the keyFunc function assumes the filename will have a number in it, and that the digits appear only in the part you want to sort on. You can make the function as complex as you need to isolate the right number.
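For instance, a hedged variant that tolerates names without any digits (assuming such names should simply sort first):
def keyFunc(afilename):
    digits = nondigits.sub("", afilename)
    return int(digits) if digits else -1  # -1 sorts before any real file number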
glob.glob(os.path.join('*.txt'))
returns a list of strings, so you can easily sort the list using Python's sorted() function:
sorted(glob.glob(os.path.join('*.txt')))
for fname in ['file9.txt', 'file10.txt', 'file11.txt']:
    with open(fname) as f:  # the default open mode is reading
        for line in f:
            pass  # do something with each line here
I have a list of files with names like name_x01_y01_000.h5 or name_y01_x01_000.h5.
What is the correct regular expression (or other method) to create a list of:
file, x_ind, y_ind
So far I have this code:
name = 'S3_FullBrain_Mosaic_'
type = '.h5'
wildc = name + '*' + type
files = glob.glob(wildc)
files = np.asarray(files)
wildre = 'r\"' +name+'x(?P<x_ind>\d+)_y(?P<y_ind>\d+).+\"'
m = re.match(wildre,files)
Since the glob already ensures the correct filename and extension, the regex only needs to match the indices. re.search allows a partial match, and .groupdict creates a dictionary with the named groups as keys. The file key can be added manually.
>>> file = 'S3_FullBrain_Mosaic_x02_y05_abcd.h5'
>>> result = re.search(r'x(?P<x_ind>\d+)_y(?P<y_ind>\d+)', file).groupdict()
>>> result
{'y_ind': '05', 'x_ind': '02'}
>>> result['file'] = file
>>> result
{'y_ind': '05', 'file': 'S3_FullBrain_Mosaic_x02_y05_abcd.h5', 'x_ind': '02'}
You can iterate over the files to produce the list of dicts. For this there's no need to create a numpy array, since I doubt you're going to do any heavy numerical calculations on the files list.
To handle both possible formats you will need to call re.search with two regexes. One will return None, the other a match on which you can use groupdict.
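A minimal sketch of that two-pattern approach, assuming both orders occur in your data:
import re

patterns = [re.compile(r'x(?P<x_ind>\d+)_y(?P<y_ind>\d+)'),
            re.compile(r'y(?P<y_ind>\d+)_x(?P<x_ind>\d+)')]

def parse_indices(file):
    # try each pattern; exactly one should match a well-formed name
    for pat in patterns:
        m = pat.search(file)
        if m is not None:
            result = m.groupdict()
            result['file'] = file
            return result
    return None  # neither pattern matched

print(parse_indices('name_x01_y01_000.h5'))  # {'x_ind': '01', 'y_ind': '01', 'file': ...}
print(parse_indices('name_y02_x03_000.h5'))  # {'y_ind': '02', 'x_ind': '03', 'file': ...}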
You could use re.findall:
import re

names = ['name_x01_y01_000.h5', 'name_y01_x01_000.h5']

for name in names:
    matches = re.findall(r'_([xy])(\d+)(?=_)', name)
    d = {k: int(v) for k, v in matches}
    d['name'] = name
    print(d)  # e.g. {'x': 1, 'y': 1, 'name': 'name_x01_y01_000.h5'}
I have a dictionary in the form
{'from.x': 'from.changed.x',...}
possibly quite big, and I have to make substitutions in text files according to that dictionary across a fairly big directory structure.
I didn't find anything that looked like a nice solution, so I ended up:
using os.walk
iterating through the dictionary
writing everything out
With something like:
from os import path, walk
from difflib import unified_diff

def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports."""
    repl = {}
    for n in not_ui_keys:
        # interleave a model in between
        dotted = extract_dotted(n)
        if dotted:
            repl[dotted] = add_model(dotted)
    for root, dirs, files in walk(top_dir):
        py_files = [path.join(root, x) for x in files if x.endswith('.py')]
        for py in py_files:
            with open(py) as f:
                res = replace_text(f.read(), repl)

def replace_text(orig_text, replace_map):
    res = orig_text
    # now try to grep all the keys, using a translate maybe
    # with a dictionary of the replacements
    for to_replace in replace_map:
        # str.replace returns a new string; reassign or the result is lost
        res = res.replace(to_replace, replace_map[to_replace])
    # now print the differences
    for un in unified_diff(res.splitlines(), orig_text.splitlines()):
        print(un)
    return res
Is there any better/nicer/faster way to do it?
EDIT:
Clarifying the problem a bit: the substitutions are generated from a function, and they are all of the form:
{'x.y.z': 'x.y.added.z', 'x.b.a': 'x.b.added.a'}
And yes, sure, I should probably use regexps; I just thought I didn't need them this time.
I don't think they can help much, however, because I can't really formalize the whole range of substitutions with only one (or multiple) regexps.
I would write the first function using generators:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports."""
    gen = filter(None, map(extract_dotted, not_ui_keys))
    repl = {dotted: add_model(dotted) for dotted in gen}
    py_files = (path.join(root, x)
                for root, dirs, files in walk(top_dir)
                for x in files if x[-3:] == '.py')
    for py in py_files:
        with open(py) as opf:
            res = replace_text(opf.read(), repl)
x[-3:] == '.py' can be marginally faster than x.endswith('.py'), though endswith is clearer.
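If you want to check that claim yourself, a rough micro-benchmark sketch (numbers vary by machine, and the difference is tiny either way):
import timeit

print(timeit.timeit("x[-3:] == '.py'", setup="x = 'module.py'"))
print(timeit.timeit("x.endswith('.py')", setup="x = 'module.py'"))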
Thank you everyone. As for the problem of substituting from a mapping in many files, I think I have a working solution:
import re
import logging

logger = logging.getLogger(__name__)

def replace_map_to_text(repl_map, text_lines):
    """Take a dictionary of the replacements needed and a list of
    lines, and return a list with the substituted lines.
    """
    res = []
    concat_st = "(%s)" % "|".join(repl_map.keys())
    # an unescaped '.' in a regexp matches any character, so the dots
    # must be quoted to match literally
    concat_st = concat_st.replace('.', '\\.')
    combined_regexp = re.compile(concat_st)
    for line in text_lines:
        found = combined_regexp.search(line)
        if found:
            expr = found.group(1)
            new_line = re.sub(expr, repl_map[expr], line)
            logger.info("from line %s to line %s" % (line, new_line))
            res.append(new_line)
        else:
            res.append(line)
    return res
def test_replace_string():
    lines = ["from psi.io.api import x",
             "from psi.z import f"]
    expected = ["from psi.io.model.api import x",
                "from psi.model.z import f"]
    mapping = {'psi.io.api': 'psi.io.model.api',
               'psi.z': 'psi.model.z'}
    assert replace_map_to_text(mapping, lines) == expected
In short, I compose a big regexp of the form
(first|second|third)
Then I search for it in every line and substitute with re.sub if something is found.
It's still a bit rough, but the simple test above passes.
EDIT: fixed a nasty bug in the concatenation: in a non-raw pattern, an unescaped '.' matches any single character, not a literal '.'.
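A slightly tidier sketch of the same idea, using re.escape for the quoting and a replacement callback so every occurrence on a line is rewritten in one pass (hypothetical helper name):
import re

def replace_map_to_text_escaped(repl_map, text_lines):
    # escape each key so its dots match literally, then rewrite via a callback
    pattern = re.compile("|".join(re.escape(k) for k in repl_map))
    return [pattern.sub(lambda m: repl_map[m.group(0)], line) for line in text_lines]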