Matching a pattern of file name after wildcard - python

I have a couple of files, XR3_DEV_YEAR20_Z.7_ROP_Current*.csv and
XR3_DEV_YEAR20_Z.7_ROP_Previous*.csv
I'm trying to take the pattern of these file names and get the bit after the wildcard so it matches this:
XR3_DEV_YEAR20_Z.7_ROP_*_xml.txt
I'm trying to do this with the re or glob library, but I'm not really sure how to do it.

Here's a way to do that:
import re
# Escape the literal dots so they don't match arbitrary characters
exp = re.compile(r"^XR3_DEV_YEAR20_Z\.7_ROP_(.*)\.csv$")
example = "XR3_DEV_YEAR20_Z.7_ROP_Current1234.csv"
res = exp.match(example)
part = res.group(1)
new_name = "XR3_DEV_YEAR20_Z.7_ROP_" + part + "_xml.txt"
In this case, new_name would be:
XR3_DEV_YEAR20_Z.7_ROP_Current1234_xml.txt
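To apply the same idea to several file names rather than one hard-coded example, the regex can be wrapped in a small helper. This is only a minimal sketch: the function name xml_name_for, the PREFIX constant, and the choice to return None for names that don't fit the pattern are assumptions, not part of the answer above.

```python
import re

# The fixed part of the file name, taken from the question.
PREFIX = "XR3_DEV_YEAR20_Z.7_ROP_"

# re.escape makes the dot in PREFIX match a literal dot only.
CSV_RE = re.compile(r"^" + re.escape(PREFIX) + r"(.*)\.csv$")

def xml_name_for(csv_name):
    """Return the matching *_xml.txt name, or None if csv_name does not fit."""
    m = CSV_RE.match(csv_name)
    if m is None:
        return None
    return PREFIX + m.group(1) + "_xml.txt"

# Illustrative names; in practice these could come from os.listdir().
for name in ["XR3_DEV_YEAR20_Z.7_ROP_Current1234.csv",
             "XR3_DEV_YEAR20_Z.7_ROP_Previous1234.csv",
             "unrelated.csv"]:
    print(name, "->", xml_name_for(name))
```

Returning None instead of raising keeps the loop simple when the directory also contains unrelated files.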

Related

Python glob/regex file search for a single result from multiple matches

In Python, I am trying to find a specific file in a directory, let's say, 'file3.txt'. The other files in the directory are 'flie1.txt', 'File2.txt', 'file_12.txt', and 'File13.txt'. The number is unique, so I need to search by a user supplied number.
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + f'{file_num}' + '.txt')
Problem is, that returns both 'file3.txt' and 'File13.txt'. If I try lookbehind, I get no files:
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + r'(?<![1-9]*)' + f'{file_num}' + '.txt')
How do I only get 'file3.txt'?
glob accepts Unix wildcards, not regexes. Those are less powerful but what you're asking can still be achieved. This:
glob.glob("/path/to/file/*[!0-9]3.txt")
matches only names that end in 3.txt with a non-digit character immediately before the 3, so 'File13.txt' is excluded.
For other cases, you can use a list comprehension and regex:
[x for x in glob.glob("/path/to/file/*") if re.match(some_regex,os.path.basename(x))]
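The wildcard pattern can be tried out without touching the filesystem by using the fnmatch module, which implements the same shell-style wildcard syntax that glob uses. A minimal sketch, with the file names copied from the question into an illustrative list:

```python
from fnmatch import fnmatchcase

# File names from the question; no real directory is needed for this check.
names = ['flie1.txt', 'File2.txt', 'file3.txt', 'file_12.txt', 'File13.txt']
file_num = 3

# [!0-9] matches exactly one character that is not a digit.
pattern = f'*[!0-9]{file_num}.txt'

matches = [n for n in names if fnmatchcase(n, pattern)]
print(matches)
```

Because [!0-9] must consume one character, a file named just 3.txt would not match this pattern.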
The problem with glob is that it doesn't support regular expressions, only shell-style wildcards. For instance, you can't express "[a-z_]+" with glob.
So it's better to list the directory and filter the names with your own regex, like this:
import re
import os
file_num = 3
file_re = r"[a-z_]+{file_num}\.txt".format(file_num=file_num)
match_file = re.compile(file_re, flags=re.IGNORECASE).match
work_dir = "C:/Path_to_dir/"
names = list(filter(match_file, os.listdir(work_dir)))

Extract and modify substring from file path

I have a file path saved as filepath in the form of /home/user/filename. Some examples of what the filename could be:
'1990MAlogfile'
'Tantrologfile'
'2003RF_2004logfile'
I need to write something that turns the filepath into just part of the filename (but I do not have just the filename saved as anything yet). For example:
/home/user/1990MAlogfile becomes '1990 MA', /home/user/Tantrologfile becomes 'Tantro', or /home/user/2003RF_2004logfile becomes '2003 RF'.
So I need everything after the last forward slash and before an underscore if it's present (or before the 'logfile' if it's not), and then I need to insert a space between the last number and first letter if there are numbers present. Then I'd like to save the outcome as objkey. Any idea on how I could do this? I was thinking I could use regex, but don't know how I would handle inserting a space in those certain cases.
Code
import os
import re
def get_filename(filepath):
    # Drop the directory, the trailing 'logfile' (7 characters) and anything after '_'
    temp = os.path.basename(filepath)[:-7].split('_')[0]
    # Leading digits, if any (an empty string when there are none)
    a = re.findall('^[0-9]*', temp)[0]
    b = temp[len(a):]
    # strip() removes the leading space left over when there are no digits
    return ' '.join([a, b]).strip()
example = '/home/user/2003RF_2004logfile'
objkey = get_filename(example)
Explanation
Import the os and regular expression packages:
    import os
    import re
Example filepath:
    example = '/home/user/2003RF_2004logfile'
Get the filename, drop the trailing 'logfile' and remove everything after the _:
    temp = os.path.basename(example)[:-7].split('_')[0]
    # 2003RF
Get the beginning portion (the leading digits, if there are any):
    a = re.findall('^[0-9]*', temp)[0]
    # 2003
Get the end portion:
    b = temp[len(a):]
    # RF
Combine the beginning and end portions (strip() drops the leading space when there are no digits):
    return ' '.join([a, b]).strip()
    # 2003 RF
import os, re
mystr = 'home/user/2003RF_2004logfile'
def format_str(path):
    end = os.path.split(path)[-1]
    m1 = re.match('(.+)logfile', end)
    try:
        this = m1.group(1)
        this = this.split('_')[0]
    except AttributeError:
        # No 'logfile' suffix: nothing to extract
        return None
    m2 = re.match('(.+[0-9])(.+)', this)
    try:
        return " ".join([m2.group(1), m2.group(2)])
    except AttributeError:
        # No digits to split off: return the name unchanged
        return this
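The three steps described in the question (take the last path component, cut at the underscore or at 'logfile', insert a space after leading digits) can also be sketched with a single re.sub for the space insertion. The function name make_objkey is hypothetical, and the sketch assumes that any digits appear only at the start of the name, as in the examples given.

```python
import os
import re

def make_objkey(filepath):
    """Turn '/home/user/2003RF_2004logfile' into '2003 RF'."""
    name = os.path.basename(filepath)
    # Remove the trailing 'logfile' marker if it is present.
    if name.endswith("logfile"):
        name = name[:-len("logfile")]
    # Keep only the part before the first underscore.
    stem = name.split("_", 1)[0]
    # Insert a space after leading digits when a non-digit follows them.
    return re.sub(r"^(\d+)(?=\D)", r"\1 ", stem)

for p in ["/home/user/1990MAlogfile",
          "/home/user/Tantrologfile",
          "/home/user/2003RF_2004logfile"]:
    print(p, "->", make_objkey(p))
```

The lookahead (?=\D) means a name made only of digits is left untouched, so no trailing space is added.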

Read or open a file using "fuzzy match" filename - Python

Given a list of files in a directory:
import os
os.listdir('system-outputs/newstest2016/ru-en')
[out]:
['newstest2016.AFRL-MITLL-contrast.4524.ru-en',
'newstest2016.AFRL-MITLL-Phrase.4383.ru-en',
'newstest2016.AMU-UEDIN.4458.ru-en',
'newstest2016.NRC.4456.ru-en',
'newstest2016.online-A.0.ru-en',
'newstest2016.online-B.0.ru-en',
'newstest2016.online-F.0.ru-en',
'newstest2016.online-G.0.ru-en',
'newstest2016.PROMT-Rule-based.4277.ru-en',
'newstest2016.uedin-nmt.4309.ru-en']
And then I have the input:
filename, suffix = 'newstest2016.AFRL-MITLL-contrast', 'ru-en'
Using the filename, if I want to do a regex match such that I can read the file newstest2016.AFRL-MITLL-contrast.4524.ru-en, I could do:
import re
fin = open(next(_fn for _fn in os.listdir('system-outputs/newstest2016/ru-en') if re.match(filename + '.*.' + suffix, _fn)))
But is there a way to read/open a "fuzzy match" filename? There must be a better way than the crude re.match way above.
It's okay to assume that there should always be one clear match from the os.listdir.
I believe glob might be a better way.
You can use glob as suggested, but it can return several matches. I'd rely instead on the structure the names seem to follow (a name, an id and a suffix, separated by dots):
filenames = [
    'newstest2016.AFRL-MITLL-contrast.4524.ru-en',
    # ...
    'newstest2016.PROMT-Rule-based.4277.ru-en',
    'newstest2016.uedin-nmt.4309.ru-en'
]
my_filename, suffix = 'newstest2016.AFRL-MITLL-contrast', 'ru-en'
for filename in filenames:
    # Last dotted part is the suffix; the part before it is the id
    *fn, suff = filename.split('.')
    if ('.'.join(fn[:-1]), suff) == (my_filename, suffix):
        break
else:
    # The loop finished without a break: nothing matched
    filename = None
# `filename` is now set to the real file name, or None if nothing matched
I use python3.x for nicer syntax but this is easy to port to python2.x.
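Since the question states that there should always be exactly one clear match, a wildcard-based lookup can enforce that assumption explicitly. The following is a rough sketch, not a definitive implementation: find_unique is a hypothetical helper, and it uses fnmatch on a list of names (for example the result of os.listdir) so the same pattern glob would use can be checked without a real directory.

```python
from fnmatch import fnmatchcase

def find_unique(names, filename, suffix):
    """Return the single name of the form '<filename>.<anything>.<suffix>'."""
    pattern = f"{filename}.*.{suffix}"
    matches = [n for n in names if fnmatchcase(n, pattern)]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match for {pattern!r}, got {matches}")
    return matches[0]

names = ['newstest2016.AFRL-MITLL-contrast.4524.ru-en',
         'newstest2016.AFRL-MITLL-Phrase.4383.ru-en',
         'newstest2016.online-A.0.ru-en']
print(find_unique(names, 'newstest2016.AFRL-MITLL-contrast', 'ru-en'))
```

If filename or suffix could contain wildcard characters such as [ or *, they would need escaping first; the names in the question do not.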

How to extract numbers from filename in Python?

I need to extract just the numbers from file names such as:
GapPoints1.shp
GapPoints23.shp
GapPoints109.shp
How can I extract just the numbers from these files using Python? I'll need to incorporate this into a for loop.
You can use regular expressions (this needs import re):
regex = re.compile(r'\d+')
Then to get the strings that match:
regex.findall(filename)
This will return a list of strings which contain the numbers. If you actually want integers, you could use int:
[int(x) for x in regex.findall(filename)]
If there's only one number in each filename, you could use regex.search(filename).group(0) (if you're certain that it will produce a match). If no match is found, that line will raise an AttributeError saying that the NoneType object has no attribute 'group'.
So, you haven't left any description of where these files are and how you're getting them, but I assume you'd get the filenames using the os module.
As for getting the numbers out of the names, you'd be best off using regular expressions with re, something like this:
import os
import re
def get_numbers_from_filename(filename):
    return re.search(r'\d+', filename).group(0)
Then, to include that in a for loop, you'd run that function on each filename:
for filename in os.listdir(myfiledirectory):
    print(get_numbers_from_filename(filename))
or something along those lines.
If there is just one number:
# join is needed on Python 3, where filter returns an iterator rather than a string
''.join(filter(lambda x: x.isdigit(), filename))
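The answers above either raise an AttributeError or return strings when a name has no digits. A small variant that returns an integer, or None when there is nothing to extract, might look like the sketch below; the name number_from_filename and the None convention are assumptions for illustration.

```python
import re

_NUMBER_RE = re.compile(r'\d+')

def number_from_filename(filename):
    """Return the first run of digits in filename as an int, or None."""
    match = _NUMBER_RE.search(filename)
    return int(match.group(0)) if match else None

# File names from the question, plus one without any digits.
for name in ['GapPoints1.shp', 'GapPoints23.shp', 'GapPoints109.shp', 'README.txt']:
    print(name, number_from_filename(name))
```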
Here is the code I used to move the publication year of a paper to the front of the filename after the file is downloaded from Google Scholar.
The files are usually named Author+PublishedYear.pdf; after running this code, the filename becomes PublishedYear+Author.pdf.
# Rename PDFs based on number extraction:
# the digits of the publication year are moved to the front of the filename.
# Uses regular expressions.
# import libraries
import re
import os
# Work in the current directory
address = os.getcwd()
os.chdir(address)
# A class with two helper functions
class file_name:
    def __init__(self, filename):
        self.filename = filename
    # There are two filename patterns, so there are two functions.
    # First function, for names like: schrodinger1990.pdf
    @staticmethod
    def number_extrction_pattern_non_digits_first(filename):
        pattern = r'(\D+)(\d+)(\.pdf)'
        match = re.search(pattern, filename, re.IGNORECASE)
        digits_pattern_non_digits_first = match.group(2)
        non_digits_pattern_non_digits_first = match.group(1)
        return digits_pattern_non_digits_first, non_digits_pattern_non_digits_first
    # Second function, for names like: 1993schrodinger.pdf
    @staticmethod
    def number_extrction_pattern_digits_first(filename):
        pattern = r'(\d+)(\D+)(\.pdf)'
        match = re.search(pattern, filename, re.IGNORECASE)
        digits_pattern_digits_first = match.group(1)
        non_digits_pattern_digits_first = match.group(2)
        return digits_pattern_digits_first, non_digits_pattern_digits_first
if __name__ == '__main__':
    # Pattern used to check which of the two forms a filename has
    pattern_check1 = r'(\D+)(\d+)(\.pdf)'
    # Go through each file in the directory.
    for filename in os.listdir(address):
        if filename.endswith('.pdf'):
            if re.search(pattern_check1, filename, re.IGNORECASE):
                digits, non_digits = file_name.number_extrction_pattern_non_digits_first(filename)
                os.rename(filename, digits + non_digits + '.pdf')
            # Otherwise the other pattern is assumed (year already first).
            else:
                digits, non_digits = file_name.number_extrction_pattern_digits_first(filename)
                os.rename(filename, digits + non_digits + '.pdf')
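The renaming rule itself can be separated from the filesystem work, which makes it easy to check before any file is touched. This is a compact sketch of the same idea, assuming names of the form Author+Year.pdf; the function year_first is hypothetical and leaves any name that does not fit the pattern unchanged.

```python
import re

# Non-digits, then digits, then the .pdf extension (case-insensitive).
YEAR_LAST_RE = re.compile(r'^(\D+)(\d+)(\.pdf)$', re.IGNORECASE)

def year_first(filename):
    """Return the filename with the trailing year moved to the front."""
    match = YEAR_LAST_RE.match(filename)
    if match is None:
        # Year already first, or not a name this rule applies to.
        return filename
    author, year, ext = match.groups()
    return year + author + ext

for name in ['schrodinger1990.pdf', '1993schrodinger.pdf', 'notes.txt']:
    print(name, '->', year_first(name))
```

The result could then be passed to os.rename(old, year_first(old)) for each entry in a directory listing, skipping names that come back unchanged.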

Replace recursively from a replacement map

I have a dictionary in the form
{'from.x': 'from.changed.x',...}
possibly quite big, and I have to substitute in text files according to that dictionary across a fairly large directory structure.
I didn't find any nice solution, so I ended up:
using os.walk
iterating through the dictionary
writing everything out
With something like:
from difflib import unified_diff
from os import path, walk
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports
    """
    repl = {}
    for n in not_ui_keys:
        # interleave a model in between
        dotted = extract_dotted(n)
        if dotted:
            repl[dotted] = add_model(dotted)
    for root, dirs, files in walk(top_dir):
        py_files = [path.join(root, x) for x in files if x.endswith('.py')]
        for py in py_files:
            res = replace_text(open(py).read(), repl)
def replace_text(orig_text, replace_map):
    res = orig_text
    # now try to grep all the keys, using a translate maybe
    # with a dictionary of the replacements
    for to_replace in replace_map:
        # str.replace returns a new string, so the result must be reassigned
        res = res.replace(to_replace, replace_map[to_replace])
    # now print the differences (original first, then the substituted text)
    for un in unified_diff(orig_text.splitlines(), res.splitlines()):
        print(un)
    return res
Is there any better/nicer/faster way to do it?
EDIT:
To clarify the problem a bit: the substitutions are generated by a function, and they all have the form:
{'x.y.z': 'x.y.added.z', 'x.b.a': 'x.b.added.a'}
And yes, I probably should use regexps; I just thought I didn't need them this time.
I don't think they would help much, however, because I can't really capture the whole range of substitutions with a single regexp (or even several).
I would write the first function using generators:
def fix_imports(top_dir, not_ui_keys):
    """Walk through the directory and substitute the wrong imports """
    # On Python 2, use itertools.imap and itertools.ifilter to stay lazy;
    # on Python 3 the built-in map and filter are already lazy.
    gen = filter(None, map(extract_dotted, not_ui_keys))
    repl = dict((dotted, add_model(dotted)) for dotted in gen)
    py_files = (path.join(root, x)
                for root, dirs, files in walk(top_dir)
                for x in files if x[-3:] == '.py')
    for py in py_files:
        with open(py) as opf:
            res = replace_text(opf.read(), repl)
x[-3:]=='.py' can be slightly faster than x.endswith('.py'), although the difference is rarely significant.
Thank you everyone. As for the problem of substituting from a mapping in many files, I think I have a working solution:
import logging
import re
logger = logging.getLogger(__name__)
def replace_map_to_text(repl_map, text_lines):
    """Take a dictionary with the replacements needed and a list of
    lines and return a list with the substituted lines
    """
    res = []
    # '.' in a regexp matches any single character, so every key is
    # escaped with re.escape to make it match literally
    concat_st = "(%s)" % "|".join(re.escape(key) for key in repl_map)
    combined_regexp = re.compile(concat_st)
    for line in text_lines:
        found = combined_regexp.search(line)
        if found:
            expr = found.group(1)
            new_line = re.sub(re.escape(expr), repl_map[expr], line)
            logger.info("from line %s to line %s" % (line, new_line))
            res.append(new_line)
        else:
            res.append(line)
    return res
def test_replace_string():
    lines = ["from psi.io.api import x",
             "from psi.z import f"]
    expected = ["from psi.io.model.api import x",
                "from psi.model.z import f"]
    mapping = {'psi.io.api': 'psi.io.model.api',
               'psi.z': 'psi.model.z'}
    assert replace_map_to_text(mapping, lines) == expected
In short, I compose a big regexp of the form
(first|second|third)
Then I search for it in every line and substitute with re.sub if something was found.
Still a bit rough, but the simple test that follows the function works fine.
EDIT: fixed a nasty bug in the concatenation: an unescaped '.' in a regexp matches any single character, not just a literal '.'
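The combined-regexp idea can be pushed one step further by letting re.sub do the lookup itself through a callback, so that every occurrence on every line is replaced in a single pass. The sketch below is one possible variant rather than the poster's method; replace_from_map is a hypothetical name, and keys are assumed to be plain strings that should match literally.

```python
import re

def replace_from_map(text, mapping):
    """Replace every occurrence of each key in mapping with its value."""
    if not mapping:
        return text
    # Longest keys first, so a key is not pre-empted by a shorter key
    # that happens to be its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # The callback looks up the replacement for whichever key matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

mapping = {'psi.io.api': 'psi.io.model.api',
           'psi.z': 'psi.model.z'}
text = "from psi.io.api import x\nfrom psi.z import f"
print(replace_from_map(text, mapping))
```

Because the text is scanned only once, a replacement value is never itself rewritten by another key, which differs from chaining str.replace calls.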
