Regex on list element in for loop - python

I have a script that searches through config files and finds all matches of strings from another list as follows:
dstn_dir = "C:/xxxxxx/foobar"
dst_list =[]
files = [fn for fn in os.listdir(dstn_dir)if fn.endswith('txt')]
dst_list = []
for file in files:
parse = CiscoConfParse(dstn_dir+'/'+file)
for sfarm in search_str:
int_objs = parse.find_all_children(sfarm)
if len(int_objs) > 0:
dst_list.append(["\n","#" *40,file + " " + sfarm,"#" *40])
dst_list.append(int_objs)
I need to change this part of the code:
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm)
search_str is a list containing strings similar to ['xrout:55','old:23'] and many others.
So it will only find entries that end with the string from the list I am iterating through in sfarm. My understanding is that this would require me to use re and match on something like sfarm$, but I'm not sure how to do this as part of the loop.
Am I correct in saying that sfarm is an iterable? If so I need to know how to regex on an iterable object in this context.

Strings in Python are iterable, so sfarm is an iterable, but that has little meaning in this case. From reading what CiscoConfParse.find_all_children() does, it is apparent that your sfarm is the linespec, which is a regular expression string. You do not need to explicitly use the re module here; just pass sfarm concatenated with '$':
search_str = ['xrout:55', 'old:23']
...
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm + '$')  # one of many ways to concat
    ...
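If the strings in search_str might ever contain regex metacharacters, a safer variant (an extra precaution, not something the answer above requires) is to escape them first:
import re

for sfarm in search_str:
    # re.escape neutralises any regex metacharacters in the literal string
    int_objs = parse.find_all_children(re.escape(sfarm) + '$')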

Please check this code. It uses the glob module to get all "*.txt" files in the folder; see the glob module documentation for more info.
import glob
import re

dst_list = []
search_str = ['xrout:55', 'old:23']
for file_name in glob.glob(r'C:/Users/dinesh_pundkar\Desktop/*.txt'):
    with open(file_name, 'r') as f:
        text = f.read()
    for sfarm in search_str:
        # re.M (multiline) makes '$' match at the end of every line,
        # not only at the end of the whole file
        regex = re.compile('%s$' % sfarm, re.M)
        int_objs = regex.findall(text)
        if len(int_objs) > 0:
            dst_list.append(["\n", "#" * 40, file_name + " " + sfarm, "#" * 40])
            dst_list.append(int_objs)
print dst_list
Output:
C:\Users\dinesh_pundkar\Desktop>python a.py
[['\n', '########################################', 'C:/Users/dinesh_pundkar\\Desktop\\out.txt old:23', '########################################'], ['old:23']]
C:\Users\dinesh_pundkar\Desktop>

Related

Regex to match strings in a list without .csv extension

How can I write a regular expression to match only the string names without the .csv extension?
Required Output:
['ap_2010', 'class_size', 'demographics', 'graduation','hs_directory', 'sat_results']
Input:
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"]
I tried this, but it returns an empty list:
for i in data_files:
    regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
A better approach in general is using the os module to handle anything to do with filenames:
[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
If you want regex, the solution is r'(.*)\.csv':
for i in data_files:
    regex = re.findall(r'(.*)\.csv', i)
    print(regex)
Split the string at '.' and then take the last element of the split (using index [-1]). If this is 'csv' then it is a CSV file.
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        pass  # It is a CSV file
    else:
        pass  # Not a CSV file
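For example, to collect just the base names with this split approach (a small sketch reusing the data_files list from the question):
csv_names = [i.rsplit('.', 1)[0] for i in data_files
             if i.split('.')[-1].lower() == 'csv']
print(csv_names)
# ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']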
# Input
data_files = ['ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv']
import re
pattern = r'(?P<filename>[a-z0-9A-Z_]+)\.csv'
prog = re.compile(pattern)
# `map` yields:
# - a `list` in Python 2.x
# - a lazy iterator in Python 3.x
result = map(lambda data_file: prog.search(data_file).group('filename'), data_files)
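To materialise the result (needed on Python 3, where map is lazy; on Python 2 result is already a list):
print(list(result))
# ['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results']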
l = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"]
# str.rstrip strips a set of characters, not a suffix, so
# i.rstrip('.' + i.split('.')[-1]) would also eat the trailing 's'
# of 'demographics'; rsplit removes the extension safely:
print([i.rsplit('.', 1)[0] for i in l])

How to filter out file names with certain prefix and postfix (extension)?

I have a list of files like this:
file_list = ['file1.zip', 'file1.txt']
file_prefix = 'file1'
I'd like to use filter and re to only get file1.txt above. I'm trying this:
regex = re.compile(file_prefix + '.*(!zip).*')
result = list(filter(regex.search, file_list))
# in the above, result should be populated with just ['file1.txt']
But the regex pattern is not working. Could someone help me out on this? Thanks very much in advance!
You can use negative lookahead like this:
regex = re.compile(file_prefix + r'(?!\.zip)')
Code:
>>> file_list = ['file1.zip', 'file1.txt']
>>> file_prefix = 'file1'
>>> regex = re.compile(file_prefix + r'(?!\.zip)')
>>> print list(filter(regex.search, file_list))
['file1.txt']
(?!\.zip) makes it a negative lookahead that asserts true when .zip is not present at the next position.
Read more about look-arounds
No need for regex for this solution - you don't need to bring a cannon to a thumb-fight. Use Python's native string search/check:
file_list = ["file1.zip", "file1.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.startswith(file_prefix) and not e.endswith(file_exclude)]
# ['file1.txt']
Should be considerably faster, too.
If you don't want to check only the edges of the string, and instead want to keep only entries where the zip part does not appear after the file_prefix, no matter where it is in the string (so you want to match some_file1.txt, or even a_zip_file1.txt, but not file1_zip.txt), you can slightly modify it:
file_list = ["file1.zip", "file1.txt", "some_file1.txt", "a_zip_file1.txt", "file1_zip.txt"]
file_prefix = "file1"
file_exclude = "zip"
result = [e for e in file_list if e.find(file_exclude) < e.find(file_prefix)]
# ['file1.txt', 'some_file1.txt', 'a_zip_file1.txt']

How to form a glob that works for a wild char or exact match?

I am using a statement such as:
input_stuff = '1,2,3'
glob(folder + '[' + input_stuff + ']' + '*')
to list files that begin with 1, 2 or 3, and this lists files such as 1-my-file, 2-my-file, 3-my-file.
This doesn't work if exact file names are given:
input_stuff = '1-my-file, 2-my-file, 3-my-file'
glob(folder + '[' + input_stuff + ']' + '*')
The error is: sre_constants.error: bad character range
Worse, for:
input_stuff = '1-my-'
glob(folder + '[' + input_stuff + ']' + '*')
it prints everything in the folder, such as 3-my-file etc.
Is there a glob statement that will print files for both
input_stuff = '1,2,3'
or
input_stuff = '1-my-file, 2-my-file, 3-my-file'
?
A glob expression in brackets is a set of characters, not a list of strings.
Your first expression, input_stuff = '1,2,3', is equivalent to the set '123,' and will also match a name starting with a comma.
Your second expression contains '-', which is used to denote character ranges like '0-9A-F', hence the error you get.
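You can see the set semantics with fnmatch, which implements the same matching rules glob uses (this demo is an illustration, not part of the original answer):
import fnmatch
# '[1,2,3]' is a character set matching exactly one of '1', ',', '2' or '3'
print(fnmatch.fnmatch('1-my-file', '[1,2,3]*'))   # True
print(fnmatch.fnmatch(',odd-file', '[1,2,3]*'))   # True - the comma is in the set
print(fnmatch.fnmatch('my-file', '[1,2,3]*'))     # False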
It is better to drop glob altogether, split input_stuff and use listdir.
import re, os

input_stuff = '1-my-file, 2-my-file, 3-my-file'
folder = '.'
prefixes = re.split(r'\s*,\s*', input_stuff)  # split on commas with optional spaces
prefixes = tuple(prefixes)  # startswith doesn't accept a list, only a tuple
file_names = os.listdir(folder)
filtered_names = [os.path.join(folder, fname) for fname in file_names
                  if fname.startswith(prefixes)]
You can use the following:
input_stuff = '1,2,3'
glob(folder+'['+input_stuff+']-my-file*')
EDIT: Since you said in your comment that you can't hardcode "-my-file", you can do something like:
input_stuff = '1,2,3'
name = "-my-file"
print glob.glob(folder+'['+input_stuff+']'+name+'*')
and then just change the "name" variable when you need to.

Find all text files not containing some text string

I'm on Python 2.7.1 and I'm trying to identify all text files that don't contain some text string.
The program seemed to be working at first but whenever I add the text string to a file, it keeps coming up as if it doesn't contain it (false positive). When I check the contents of the text file, the string is clearly present.
The code I tried to write is
def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    fList = []
    for fol, fols, fils in os.walk(rdir):
        fList.extend([os.path.join(rdir, fol, fil) for fil in fils if fil.endswith(extens) and fil.startswith(start)])
    if fList:
        for fil in fList:
            rFil = open(fil)
            for line in rFil:
                if not cSens:
                    line, sstring = line.lower(), sstring.lower()
                if sstring in line:
                    fList.remove(fil)
                    break
            rFil.close()
    if fList:
        plur = 'files do' if len(fList) > 1 else 'file does'
        print '\nThe following %d %s not contain "%s":\n' % (len(fList), plur, sstring)
        for fil in fList:
            print fil
    else:
        print 'No files were found that don\'t contain %(sstring)s.' % locals()

scanFiles2(rdir=r'C:\temp', sstring='!!syn', extens='.html', start='#', cSens=False)
I guess there's a flaw in the code but I really don't see it.
UPDATE
The code still comes up with many false positives: files that do contain the search string but are identified as not containing it.
Could text encoding be an issue here? I prefixed the search string with U to account for Unicode encoding but it didn't make any difference.
Does Python in some way cache file contents? I don't think so but that could somewhat account for files to still pop up after having been corrected.
Could some kind of malware cause symptoms like these? Seems highly unlikely to me but I'm kinda desperate to get this fixed.
Modifying a list while iterating over it causes unexpected results:
For example:
>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 4, 3, 0, 5]
Workaround: iterate over a copy
>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst[:]:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 3, 5]
Alternatively, you can append each valid file path instead of removing entries from the whole file list.
Modified version (appending files that do not contain sstring instead of removing):
def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    if not cSens:
        # This only needs to be done once.
        sstring = sstring.lower()
    fList = []
    for fol, fols, fils in os.walk(rdir):
        for fil in fils:
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            fil = os.path.join(fol, fil)
            with open(fil) as rFil:
                for line in rFil:
                    if not cSens:
                        line = line.lower()
                    if sstring in line:
                        break
                else:
                    fList.append(fil)
    ...
list.remove takes O(n) time, while list.append takes O(1). See Time Complexity.
Use the with statement if possible.
Falsetru already showed you why you should not remove items from a list while looping over it: list iterators do not and cannot update their counter when a list is shortened, so when the item at index 3 is removed, the item that was at index 4 shifts to index 3 and the next iteration skips it.
List comprehension version using fnmatch.filter() and any() and a filter lambda for case-insensitive matching:
import fnmatch
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # containment test, matching the original `sstring in line` check
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    return [os.path.join(r, fname)
            for r, _, f in os.walk(rdir)
            for fname in fnmatch.filter(f, ffilter)
            if not any(lfilter(l) for l in open(os.path.join(r, fname)))]
but perhaps you'd be better off sticking to a more readable loop:
def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result
Another alternative that opens the searching up for using regular expressions (although just using grep with appropriate options would still be better):
import mmap
import os
import re
import fnmatch

def scan_files(rootdir, search_string, extension, start='', case_sensitive=False):
    rx = re.compile(re.escape(search_string), flags=re.I if not case_sensitive else 0)
    name_filter = start + '*' + extension
    for root, dirs, files in os.walk(rootdir):
        for fname in fnmatch.filter(files, name_filter):
            with open(os.path.join(root, fname)) as fin:
                try:
                    mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
                except ValueError:
                    continue  # empty files etc.... include this or not?
                if not next(rx.finditer(mm), None):
                    yield fin.name
Then use list on that if you want the names materialised or treat it as you would any other generator...
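For example (a hypothetical call reusing the directory and search string from the question):
matches = list(scan_files(r'C:\temp', '!!syn', '.html', start='#'))
for name in matches:
    print(name)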
Please do not write a Python program for that. This program already exists. Use grep with -L, which lists the files that do not contain a match:
grep * -ILre 'main' 2> /dev/null
99client/.git/COMMIT_EDITMSG
99client/taxis-android/build/incremental/mergeResources/production/merger.xml
99client/taxis-android/build/incremental/mergeResources/production/inputs.data
99client/taxis-android/build/incremental/mergeResources/production/outputs.data
99client/taxis-android/build/incremental/mergeResources/release/merger.xml
99client/taxis-android/build/incremental/mergeResources/release/inputs.data
99client/taxis-android/build/incremental/mergeResources/release/outputs.data
99client/taxis-android/build/incremental/mergeResources/debug/merger.xml
99client/taxis-android/build/incremental/mergeResources/debug/inputs.data
(...)
http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#Introduction
If you need the list in Python, simply execute grep from it and collect the result.
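A minimal sketch of that, assuming GNU grep is on the PATH (grep exits non-zero when it selects nothing, hence the try/except):
import subprocess

try:
    # -I skips binaries, -L lists files WITHOUT a match, -r recurses
    output = subprocess.check_output(['grep', '-ILre', 'main', '.'])
except subprocess.CalledProcessError:
    output = b''
files_without_match = output.splitlines()
print(files_without_match)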

Replace recursively from a replacement map

I have a dictionary in the form
{'from.x': 'from.changed.x',...}
possibly quite big, and I have to substitute in text files accordingly to that dictionary in a quite big directory structure.
I didn't find anything like a nice solution, so I ended up:
using os.walk
iterating through the dictionary
writing everything out
With something like:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports
    """
    repl = {}
    for n in not_ui_keys:
        # interleave a model in between
        dotted = extract_dotted(n)
        if dotted:
            repl[dotted] = add_model(dotted)
    for root, dirs, files in walk(top_dir):
        py_files = [path.join(root, x) for x in files if x.endswith('.py')]
        for py in py_files:
            res = replace_text(open(py).read(), repl)

def replace_text(orig_text, replace_map):
    res = orig_text
    # now try to grep all the keys, using a translate maybe
    # with a dictionary of the replacements
    for to_replace in replace_map:
        res = res.replace(to_replace, replace_map[to_replace])
    # now print the differences
    for un in unified_diff(res.splitlines(), orig_text.splitlines()):
        print(un)
    return res
Is there any better/nicer/faster way to do it?
EDIT:
Clarifying a bit the problem, the substitution are generated from a function, and they are all in the form:
{'x.y.z': 'x.y.added.z', 'x.b.a': 'x.b.added.a'}
And yes, sure, I should probably use regexps; I just thought I didn't need them this time.
I don't think it can help much, however, because I can't really formalize the whole range of substitutions with only one (or multiple) regexps.
I would write the first function using generators:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports """
    from itertools import imap, ifilter
    gen = ifilter(None, imap(extract_dotted, not_ui_keys))
    repl = dict((dotted, add_model(dotted)) for dotted in gen)
    py_files = (path.join(root, x)
                for root, dirs, files in walk(top_dir)
                for x in files if x[-3:] == '.py')
    for py in py_files:
        with open(py) as opf:
            res = replace_text(opf.read(), repl)
x[-3:]=='.py' is faster than x.endswith('.py')
Thank you everyone, and about the problem of substituting from a mapping in many files, I think I have a working solution:
def replace_map_to_text(repl_map, text_lines):
    """Take a dictionary with the replacements needed and a list of
    files and return a list with the substituted lines
    """
    res = []
    concat_st = "(%s)" % "|".join(repl_map.keys())
    # '.' in a non-raw regexp means any single character, so it must be
    # quoted, or we need a way to make the string a raw string
    concat_st = concat_st.replace('.', '\\.')
    combined_regexp = re.compile(concat_st)
    for line in text_lines:
        found = combined_regexp.search(line)
        if found:
            expr = found.group(1)
            new_line = re.sub(expr, repl_map[expr], line)
            logger.info("from line %s to line %s" % (line, new_line))
            res.append(new_line)
        else:
            res.append(line)
    return res

def test_replace_string():
    lines = ["from psi.io.api import x",
             "from psi.z import f"]
    expected = ["from psi.io.model.api import x",
                "from psi.model.z import f"]
    mapping = {'psi.io.api': 'psi.io.model.api',
               'psi.z': 'psi.model.z'}
    assert replace_map_to_text(mapping, lines) == expected
In short I compose a big regexp in the form
(first|second|third)
Then I search for it in every line and substitute with re.sub if something was found.
Still a bit rough, but the simple test afterwards works fine.
EDIT: fixed a nasty bug in the concatenation: in a regexp, an unescaped '.' matches any single character, not a literal '.', so the dots in the keys must be escaped.
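A tidier way to get the same escaping (a suggestion on top of the answer, reusing the mapping from the test above) is to let re.escape quote every metacharacter instead of handling '.' by hand:
import re

repl_map = {'psi.io.api': 'psi.io.model.api',
            'psi.z': 'psi.model.z'}
# re.escape quotes every regex metacharacter in each key
concat_st = "(%s)" % "|".join(re.escape(k) for k in repl_map)
combined_regexp = re.compile(concat_st)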
