How can I write a regular expression to match only the file names without the .csv extension? The required output is shown below.
Required output:
['ap_2010', 'class_size', 'demographics', 'graduation','hs_directory', 'sat_results']
Input:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
I tried the following, but it returns an empty list:
for i in data_files:
    regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
A better approach in general is using the os module to handle anything to do with filenames:
import os

[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
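As an aside (not part of the original answer), on Python 3.4+ pathlib gives the same result via Path.stem:
from pathlib import Path

# Path.stem drops the final suffix, e.g. 'ap_2010.csv' -> 'ap_2010'
[Path(i).stem for i in data_files]
# ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']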
If you want a regex, the pattern is r'(.*)\.csv':
for i in data_files:
    regex = re.findall(r'(.*)\.csv', i)
    print(regex)
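To collect everything into the single flat list from the question, rather than printing one sublist per file, the same pattern can be used in a comprehension (a small sketch; it assumes every entry really ends in .csv):
names = [re.findall(r'(.*)\.csv', i)[0] for i in data_files]
# ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']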
Split the string at '.' and then take the last element of the split (using index [-1]). If this is 'csv', then it is a CSV file.
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        pass  # it is a CSV file
    else:
        pass  # not a CSV file
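Putting it together, a minimal sketch of this split-based approach that also builds the required list of names (the names list is my addition, not part of the snippet above):
names = []
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        # it is a CSV file: keep everything before the last '.'
        names.append(i.rsplit('.', 1)[0])

print(names)
# ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']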
# Input
data_files = [ 'ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv' ]
import re
pattern = r'(?P<filename>[a-zA-Z0-9_]+)\.csv'
prog = re.compile(pattern)
# `map` returns:
# - a `list` in Python 2.x
# - a lazy iterator in Python 3.x
result = map(lambda data_file: prog.search(data_file).group('filename'), data_files)
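So on Python 3 you would typically wrap it in list() to materialise the result, for example:
print(list(result))
# ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']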
l = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
print([i.rsplit('.', 1)[0] for i in l])  # rsplit, not rstrip: rstrip('.csv') strips a character set and would also eat the trailing 's' of 'sat_results'
Related
I want to sort this list in this way:
the .log file should be the first item,
and the .gz files should be in descending numeric order.
my_list = [
'/abc/a.log.1.gz',
'/abc/a.log',
'/abc/a.log.30.gz',
'/abc/a.log.2.gz',
'/abc/a.log.5.gz',
'/abc/a.log.3.gz',
'/abc/a.log.6.gz',
'/abc/a.log.4.gz',
]
Expected result:
my_list = ['/abc/a.log',
           '/abc/a.log.30.gz',
           '/abc/a.log.6.gz',
           '/abc/a.log.5.gz',
           '/abc/a.log.4.gz',
           '/abc/a.log.3.gz',
           '/abc/a.log.2.gz',
           '/abc/a.log.1.gz']
reversed(my_list) also does not give me the desired result.
I would use this sort helper:
import re

def sort_helper(i: str):
    m = re.search(r'^.*?(\d+)(\.gz)', i)
    try:
        return int(m.group(1))
    except AttributeError:
        # no number matched (the plain .log file): give it the largest key so it sorts first
        return float('inf')

print(sorted(my_list, key=sort_helper, reverse=True))
Perhaps using a regex is overkill here, but it is a flexible tool.
To get around the lack of leading zeros in your filenames, return an int.
Also, note the use of the lazy quantifier at the start of the regular expression:
^.*?
not just
^.*
This matches as little as possible, letting the match for the number in the filename be as greedy as possible.
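A quick demonstration of the difference (illustrative only, not part of the original answer):
import re

s = '/abc/a.log.30.gz'
print(re.search(r'^.*(\d+)(\.gz)', s).group(1))   # '0'  -- greedy .* eats into the number
print(re.search(r'^.*?(\d+)(\.gz)', s).group(1))  # '30' -- lazy .*? leaves the number intact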
It seems like you are trying to sort file names. I would recommend using os.path to manipulate these strings.
First you can use os.path.splitext to split out the extension, to compare between .log and .gz. Then split off the extension again to get the file number, and convert it to an integer.
For example:
import os

def get_sort_keys(filepath):
    split_file_path = os.path.splitext(filepath)
    sort_key = (split_file_path[1], *os.path.splitext(split_file_path[0]))
    return (sort_key[0], sort_key[1], int(sort_key[2].strip(".")) if sort_key[2] else 0)

print(sorted(my_list, key=get_sort_keys, reverse=True))
I am relying on the fact that the '.log' extension sorts after '.gz' lexicographically.
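For reference, the keys this helper produces for two of the entries (illustrative only):
print(get_sort_keys('/abc/a.log'))        # ('.log', '/abc/a', 0)
print(get_sort_keys('/abc/a.log.30.gz'))  # ('.gz', '/abc/a.log', 30)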
I have a list of files like this:
file_list = ['file1.zip', 'file1.txt']
file_prefix = 'file1'
I'd like to use filter and re to only get file1.txt above. I'm trying this:
regex = re.compile(file_prefix + '.*(!zip).*')
result = list(filter(regex.search, file_list))
# in the above, result should be populated with just ['file1.txt']
But the regex pattern is not working. Could someone help me out on this? Thanks very much in advance!
You can use negative lookahead like this:
regex = re.compile(file_prefix + r'(?!\.zip)')
Code:
>>> file_list = ['file1.zip', 'file1.txt']
>>> file_prefix = 'file1'
>>> regex = re.compile(file_prefix + r'(?!\.zip)')
>>> print(list(filter(regex.search, file_list)))
['file1.txt']
(?!\.zip) is a negative lookahead that asserts that .zip is not present at the next position.
Read more about look-arounds.
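A quick check of the behaviour (demo only):
>>> bool(regex.search('file1.zip'))   # '.zip' follows the prefix, so the lookahead fails
False
>>> bool(regex.search('file1.txt'))
True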
No need for a regex here - you don't need to bring a cannon to a thumb-fight. Use Python's native string checks:
file_list = ["file1.zip", "file1.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.startswith(file_prefix) and not e.endswith(file_exclude)]
# ['file1.txt']
Should be considerably faster, too.
If you don't want to check only the ends of the string, and instead want to keep only entries where zip does not appear after file_prefix, no matter where the prefix sits in the string (so you want to match some_file1.txt, or even a_zip_file1.txt, but not file1_zip.txt), you can slightly modify it:
file_list = ["file1.zip", "file1.txt", "some_file1.txt", "a_zip_file1.txt", "file1_zip.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.find(file_exclude) < e.find(file_prefix)]
# ['file1.txt', 'some_file1.txt', 'a_zip_file1.txt']
I have a script that searches through config files and finds all matches of strings from another list as follows:
dstn_dir = "C:/xxxxxx/foobar"
dst_list = []
files = [fn for fn in os.listdir(dstn_dir) if fn.endswith('txt')]
dst_list = []
for file in files:
    parse = CiscoConfParse(dstn_dir + '/' + file)
    for sfarm in search_str:
        int_objs = parse.find_all_children(sfarm)
        if len(int_objs) > 0:
            dst_list.append(["\n", "#" * 40, file + " " + sfarm, "#" * 40])
            dst_list.append(int_objs)
I need to change this part of the code:
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm)
search_str is a list containing strings similar to ['xrout:55','old:23'] and many others.
I need it to find only entries that end with the string from the list I am iterating through in sfarm. My understanding is that this would require me to use re and match on something like sfarm$, but I'm not sure how to do this as part of the loop.
Am I correct in saying that sfarm is an iterable? If so, I need to know how to use a regex on an iterable object in this context.
Strings in Python are iterable, so sfarm is an iterable, but that has little meaning in this case. From reading what CiscoConfParse.find_all_children() does, it is apparent that your sfarm is the linespec, which is a regular expression string. You do not need to explicitly use the re module here; just pass sfarm concatenated with '$':
search_str = ['xrout:55', 'old:23']
...
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm + '$')  # one of many ways to concat
...
Please check this code. It uses the glob module to get all "*.txt" files in the folder.
Please check here for more info on the glob module.
import glob
import re

dst_list = []
search_str = ['xrout:55', 'old:23']

for file_name in glob.glob(r'C:/Users/dinesh_pundkar\Desktop/*.txt'):
    with open(file_name, 'r') as f:
        text = f.read()
    for sfarm in search_str:
        regex = re.compile('%s$' % sfarm, re.MULTILINE)  # re.MULTILINE so $ matches at the end of each line
        int_objs = regex.findall(text)
        if len(int_objs) > 0:
            dst_list.append(["\n", "#" * 40, file_name + " " + sfarm, "#" * 40])
            dst_list.append(int_objs)

print(dst_list)
Output:
C:\Users\dinesh_pundkar\Desktop>python a.py
[['\n', '########################################', 'C:/Users/dinesh_pundkar\\Desktop\\out.txt old:23', '########################################'], ['old:23']]
C:\Users\dinesh_pundkar\Desktop>
I am using a statement such as:
input_stuff = '1,2,3'
glob(folder+'['+ input_stuff + ']'+'*')
to list files that begin with 1, 2 or 3, and this lists files such as 1-my-file, 2-my-file, 3-my-file.
This doesn't work if exact file names are given:
input_stuff = '1-my-file, 2-my-file, 3-my-file'
glob(folder+'['+ input_stuff + ']'+'*')
The error is: sre_constants.error: bad character range
It is worse for:
input_stuff = '1-my-'
glob(folder+'['+ input_stuff + ']'+'*')
It prints everything in the folder, such as 3-my-file, etc.
Is there a glob statement that will print files for both
input_stuff = '1,2,3'
or
input_stuff = '1-my-file, 2-my-file, 3-my-file'
?
A glob expression in brackets is a set of characters, not a list of strings.
Your first expression, input_stuff = '1,2,3', is equivalent to the character set '123,' and will also match a name starting with a comma.
Your second expression contains '-', which is used to denote character ranges like '0-9A-F', hence the error you get.
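You can see the character-set behaviour with fnmatch, which glob uses for matching (illustrative only):
import fnmatch

# '[1,2,3]*' means: one character out of the set {'1', ',', '2', '3'}, then anything
print(fnmatch.fnmatch('1-my-file', '[1,2,3]*'))     # True
print(fnmatch.fnmatch(',oddly-named', '[1,2,3]*'))  # True -- the comma is in the set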
It is better to drop glob altogether, split input_stuff and use listdir.
import re, os

input_stuff = '1-my-file, 2-my-file, 3-my-file'
folder = '.'
prefixes = re.split(r'\s*,\s*', input_stuff)  # split on commas with optional spaces
prefixes = tuple(prefixes)  # startswith doesn't accept a list, only a tuple (or str)
file_names = os.listdir(folder)
filtered_names = [os.path.join(folder, fname) for fname in file_names
                  if fname.startswith(prefixes)]
You can use the following:
input_stuff = '1,2,3'
glob(folder+'['+input_stuff+']-my-file*')
EDIT: Since you said in your comment that you can't hardcode "-my-file", you can do something like:
input_stuff = '1,2,3'
name = "-my-file"
print(glob.glob(folder + '[' + input_stuff + ']' + name + '*'))
and then just change the "name" variable when you need to.
I have a list of file name strings of the following format:
files = ['/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5', ...]
I want to extract the int between iter_ and .caffemodel and return a list of those ints.
After some research I came up with this solution that does the trick, but I was wondering if there is a more elegant/pythonic way to do it, possibly using a list comprehension?
li = []
for f in files:
    tmp = re.search(r'iter_[\d]+.caffemodel', f).group()
    li.append(int(re.search(r'\d+', tmp).group()))
Just to add another possible solution: join the file names together into one big string (it looks like they all end with h5, so there is no danger of creating unwanted matches) and use re.findall on that:
import re
li = [int(d) for d in re.findall(r'iter_(\d+)\.caffemodel', ''.join(files))]
Just use:
li = []
for f in files:
    tmp = int(re.search(r'iter_(\d+)\.caffemodel', f).group(1))
    li.append(tmp)
If you put an expression in parentheses, it creates a capturing group whose match you can retrieve separately.
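For example (a quick demo of group numbering):
>>> m = re.search(r'iter_(\d+)\.caffemodel', files[0])
>>> m.group(0)   # the whole match
'iter_418000.caffemodel'
>>> m.group(1)   # just the parenthesised part
'418000'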
You can also use a lookbehind assertion:
regex = re.compile(r"(?<=iter_)\d+")
for f in files:
    number = regex.search(f).group(0)
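If you want the whole list of ints in one go, the same compiled pattern works in a comprehension (my addition):
li = [int(regex.search(f).group(0)) for f in files]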
Solution with list comprehension, as you wished:
import re
re_model_id = re.compile(r'iter_(?P<model_id>\d+)\.caffemodel')
li = [int(re_model_id.search(f).group('model_id')) for f in files]
Without a regex:
files = [
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_418000.caffemodel.h5',
'/misc/lmbraid17/bensch/u-net-3d/2dcellnet/2dcellnet_v6w4l1/2dcellnet_v6w4l1_snapshot_iter_502000.caffemodel.h5']
print([f.rsplit("_", 1)[1].split(".", 1)[0] for f in files])
['418000', '502000']
Or if you want to be more specific:
print([f.rsplit("iter_", 1)[1].split(".caffemodel", 1)[0] for f in files])
But your pattern seems to repeat, so the first solution is probably sufficient.
You can also slice using find and rfind:
print( [f[f.find("iter_")+5: f.rfind("caffe")-1] for f in files])
['418000', '502000']