How to automatically create a Python re pattern from a file name?

As an input, I have a filename (e.g. "bla150420.txt") containing a date in a specific format. I need to look into a given folder (containing many files) and find the latest version of my file. (And I have to do it automatically - many times for different files in different folders.)
Example directory (dirname):
...
bla150420.txt
bla150425.txt
bla150510.txt
Example output:
bla150510.txt
How can I do so? My original approach was to parse the date out of the file name, substitute the date with an RE pattern and then search for this pattern in the list of all filenames. This doesn't seem to work. Any ideas? Or a different approach?
import datetime
import os
import re
import sys

def get_date(file_name):
    DATE_RE = re.compile('([0-9]{6})')  # EDITED - TYPO
    try:
        match = DATE_RE.search(file_name).group()
    except AttributeError:
        sys.stderr.write('ERROR! No date matches string!\n\t' + file_name)
    else:
        date = datetime.datetime.strptime(match, '%Y%m%d')
        return match, date

date_string, current_date = get_date(fname)
# fname is a given file name (e.g. bla150420.txt)
pattern1 = re.compile(re.sub(date_string, '(.*)', fname))
# pattern1 returns value 'kds_docs-(.*).zip'
pattern2 = re.compile('kds_docs-(.*).zip')

if os.path.isdir(dirname):
    matching_files = [x for x in os.listdir(dirname)
                      if pattern1.search(x)]
What puzzles me is that my program works with pattern2, but not with pattern1. If I print both (using .pattern), they look identical; if I compare them with '==' it returns False. I have no idea whether this is caused by encoding, whitespace or something else, nor how to find the difference. Could you please help?

I think you're merely having issues with producing a working regex in an automated fashion.
Serge pointed out that your code as presented should get tripped up by the fact that your dates have 6 digits, while the strptime format '%Y%m%d' expects 8 (a stamp like 150420 would need '%y%m%d') - correct or explain that if it's more than a typo.
I suppose you're looking to verify that a string of digits actually is a date, but this seems unnecessary, and not especially safe either, since a filename could contain a string of digits which parses as a date but isn't the date you're looking for. Let me know if it MUST be a date.
I'm unfamiliar with the intricacies of Python, but I would suggest simplifying your regex production by not even using your function:
pattern1 = re.compile(re.sub('([0-9]{6})', '(.*)', fname))
Just make the replacement directly. I would say it would probably be safer to go a little further like so:
pattern1 = re.compile(re.sub('([0-9]{6})', r'(\\d{6})', fname))  # doubled backslash so the replacement emits a literal \d
...and if there are any other possible restrictions, you could limit 6 digit matches further. For instance, the 6 digit string might always be at the end of the file name, just before the extension:
pattern1 = re.compile(re.sub(r'([0-9]{6})(?=\..*$)', r'(\\d{6})', fname))
# should turn 'kds_docs-120501-151023.zip' into 'kds_docs-120501-(\d{6}).zip'
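To tie this back to the original goal of finding the newest version of the file in a folder, here is a rough sketch of how the generated pattern could be used - assuming the 6-digit stamp is year-month-day in '%y%m%d' form (so 150420 means 2015-04-20), and reusing the fname and dirname names from the question; find_latest is just an illustrative helper name:
import datetime
import os
import re

def find_latest(fname, dirname):
    # Swap the 6-digit stamp for a capturing group, escaping the rest of the
    # name so the dot is matched literally: 'bla150420.txt' -> r'bla(\d{6})\.txt'
    pattern = re.compile(re.sub(r'[0-9]{6}', r'(\\d{6})', re.escape(fname)))
    candidates = []
    for name in os.listdir(dirname):
        m = pattern.fullmatch(name)
        if m:
            stamp = datetime.datetime.strptime(m.group(1), '%y%m%d')
            candidates.append((stamp, name))
    return max(candidates)[1] if candidates else None

# e.g. find_latest('bla150420.txt', dirname) should give 'bla150510.txt'
# for the example directory shown in the question.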

Related

Text between delimiters starting from the end of the string

I'm really new to Python and programming in general, and to practice I'm doing projects where I try to tackle problems from my day-to-day work, so please excuse me if this is a silly question. I'm trying to combine a group of files located in a remote folder into a single monthly file based on the date. I've already combined files based on date, so I think I can handle that part, but I'm having trouble with the regex to pick the date out of the file name. The string with the file path is as follows:
\\machinename123\main folder\subfolder\2021-01-24.csv
The file name will always have the same format, since it's an automated process; only the date in the name changes. I was trying to pick the date out of this string with a regex that selects the text between the last \ of the string and the . of the extension, so that I get 2021-01-24 as a result. But at the level I'm at, regexes are like witchcraft and I don't really know what I'm doing. I've been trying for a few hours with no success; the closest I've got by trial and error is (?:[0-9\-]), but this selects all the numbers in the string, including the ones in the machine name. Besides that, I don't know why it works the way it works (for example, I know the ?: does something because I tested it, but I don't understand the theory behind it, so I couldn't replicate it in the future).
How can I make it ignore the other numbers in the string, or more specifically pick only the text between the last \ and the . of the csv, xlsx or whatever the format is?
I'd prefer the former option, since it would let me learn how to make it do what I need rather than getting the result by coincidence.
Thanks for any help
Use re.search() to find a pattern: <4 digits>-<2 digits>-<2 digits>.
import re

s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
m = re.search(r'\d{4}-\d{2}-\d{2}', s).group(0)
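If there is any chance a path will not contain a date, it is safer to check the match object before calling .group(); a small sketch reusing the path from above:
import re

s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
m = re.search(r'\d{4}-\d{2}-\d{2}', s)
if m:
    date_str = m.group(0)  # '2021-01-24'
else:
    date_str = None        # no date in this file name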
You can use the following regex, which captures the structure of your full path:
import re

filename_regex = re.compile(r'^\\\\(?:[\w ]+\\)+(\d{4}-\d{2}-\d{2})\.csv$')
m = filename_regex.match(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
if m is not None:
    print(f'File found with date: {m.groups()[0]}')
else:
    print('Filename not of interest')
The output will be:
File found with date: 2021-01-24
filename_regex accepts a string starting with \\, followed by one or more path components made of word characters and spaces, each ending in a \, with the final part being 4 digits, a hyphen, 2 digits, another hyphen, 2 digits and the string .csv. The regular expression used here to match the date is very simple, but you can use a more complex one if you prefer.
Another, simpler approach would be to use the ntpath module, extracting the name of the file from the full path and applying the regular expression only to the name of the file:
import ntpath
import re
filename_regex = re.compile(r'^(\d{4}-\d{2}-\d{2})\.csv$')
filename = ntpath.basename(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
m = filename_regex.match(filename)
m here will have the same value as before.
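If you are on Python 3, pathlib can do the same basename extraction (and even drop the extension via .stem), which may remove the need for the regex entirely; a small sketch:
from pathlib import PureWindowsPath

p = PureWindowsPath(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
print(p.name)  # 2021-01-24.csv
print(p.stem)  # 2021-01-24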

Trying to find a way to filter out parts of a string python

I'm trying to filter out strings in file names that appear in a for loop
if search == "List":
onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
for i in onlyfiles:
print(i)
Now it outputs all the filenames, as expected and wanted, but I want to strip the .json at the end as well as a few other elements of the name, so that I can just see the base file name.
For example: filename-IDENTIFIER.json
I want to filter "-IDENTIFIER.json" out of the for loop's output
Thanks for any help
There are a few approaches here, based on how much your data can vary:
So let's try to build a get_filename(f) method
Quick and dirty
If you know that f always ends in exactly the same way, then you can directly remove those characters. Here that means removing the last 16 characters ("-IDENTIFIER.json"). It's useful to know that in Python a string can be treated as an (immutable) sequence of characters, so you can use indexing and slicing as well.
def get_filename(f: str) -> str:
    return f[:-16]
This will however fail if the Identifier or suffix changes in length.
Varying lengths
If the suffix changes based on the length, then you should split the string on a fixed delimiter and return the relevant part. In this case you want to split on -.
def get_filename(f: str) -> str:
    return f.split("-")[0]
Note however that this will fail if the filename also contains a -.
You can fix that by dropping the last part and rejoining all the earlier pieces, in the following way.
def get_filename(f: str) -> str:
    return "-".join(f.split("-")[:-1])
Using regexes to match the format
The most general approach would be to use python regexes to select the relevant part. These allow you to very specifically target a specific pattern. The exact regex that you'll need will depend on the complexity of your strings.
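For example, here is a sketch of such a regex for names shaped like filename-IDENTIFIER.json; the exact pattern is an assumption about what your identifiers look like, so adjust it to your real names:
import re

def get_filename(f: str) -> str:
    # Keep everything before a trailing "-<identifier>.json"; fall back to
    # the unchanged name if the pattern doesn't match.
    m = re.match(r'^(.*)-[^-]+\.json$', f)
    return m.group(1) if m else f

print(get_filename("filename-IDENTIFIER.json"))  # filename
print(get_filename("my-file-IDENTIFIER.json"))   # my-file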
Split the string on "-" and get the first element:
filename = f.split("-")[0]
This will get messed up in case the filename itself contains a "-", though.
This should work:
i.split('-')[0].split('.')[0]
Case 1: filename-IDENTIFIER.json
It takes the substring before the dash, so output will become filename
Case 2: filename.json
There is no dash in the string, so the first split does nothing (full string will be in the 0th element), then it takes the substring before the point. Output will be filename
Case 3: filename
Nothing to split, output will be filename
If it's always .json and -IDENTIFIER, then it's safer to use this:
i.split('-IDENTIFIER')[0].split('.json')[0]
Case 4: filename-blabla.json
If the filename has an extra dash in it, it won't be a problem, output will be filename-blabla
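Plugged back into the loop from the question (keeping the listdir/isfile calls and the "path" placeholder used there), that would look roughly like this:
from os import listdir
from os.path import isfile, join

onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
for i in onlyfiles:
    print(i.split('-IDENTIFIER')[0].split('.json')[0])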

Batch file rename: zero padding time with regex?

I have a whole set of files (10.000+) that include the date and time in the filename. The problem is that the date and time are not zero padded, causing problems with sorting.
The filenames are in the format: output 5-11-2018 9h0m.xml
What I would like is it to be in the format: output 05-11-2018 09h00m.xml
I've searched for different solutions, but most seem to split the string and then recombine it. That seems pretty cumbersome, since in my case day, month, hour and minute would then need to be separated, padded and recombined.
I thought regex might give me some better solution, but I can't quite figure it out.
I've edited my original code based on the suggestion of Wiktor Stribiżew that you can't use regex in the replacement and to use groups instead:
import os
import glob
import re

old_format = 'output [1-9]-11-2018 [1-2]?[1-9]h[0-9]m.xml'
dir = r'D:\Gebruikers\<user>\Documents\datatest'
old_pattern = re.compile(r'([1-9])-11-2018 ([1-2][1-9])h([0-9])m')
filelist = glob.glob(os.path.join(dir, old_format))

for file in filelist:
    print file
    newfile = re.sub(old_pattern, r'0\1-11-2018 \2h0\3m', file)
    os.rename(file, newfile)
But this still doesn't function completely as I would like, since it wouldn't change hours under 10. What else could I try?
You can pad the numbers in your file names with .zfill(2) using a lambda expression passed as the replacement argument to the re.sub method.
Also, fix the regex pattern to allow 1 or 2 digits: (3[01]|[12][0-9]|0?[1-9]) for a date, (2[0-3]|[10]?\d) for an hour (24h), and ([0-5]?[0-9]) for minutes:
old_pattern = re.compile(r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m')
See the regex demo.
Then use:
for file in filelist:
    newfile = re.sub(old_pattern, lambda x: '{}-11-2018 {}h{}m'.format(x.group(1).zfill(2), x.group(2).zfill(2), x.group(3).zfill(2)), file)
    os.rename(file, newfile)
See Python re.sub docs:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
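As a quick check outside the rename loop, the substitution applied to the sample name from the question looks like this:
import re

old_pattern = re.compile(r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m')
name = 'output 5-11-2018 9h0m.xml'
fixed = re.sub(old_pattern,
               lambda x: '{}-11-2018 {}h{}m'.format(x.group(1).zfill(2),
                                                    x.group(2).zfill(2),
                                                    x.group(3).zfill(2)),
               name)
print(fixed)  # output 05-11-2018 09h00m.xml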
I suggest going more generic with old_pattern for simplicity, assuming your filenames are only misbehaving with digits:
Listing every combination of single-digit and double-digit fields explicitly would need a long regex, so I suggest this much simpler one to match the files to rename. It assumes the directory only contains this matching type of file, which opens the match up more widely but makes it simpler to write and read at a glance: find any single-digit field in the filename (one or more of them), i.e. non-digit, digit, non-digit:
old_format = r'output\.*\D\d\D.*\.xml'
The fixing re.sub statement could then use lookarounds, so that the surrounding non-digits are not consumed and adjacent fields such as 9h0m are both padded:
newfile = re.sub(r'(?<=\D)(\d)(?=[hm-])', lambda x: x.group(1).zfill(2), file)
This would also catch unicode non-ascii digits unless the appropriate re module flags are set.
If the year (2018 in the example) might be given as just '18', it would need special handling - it could be a separate case, and would also mean adding a space to the character set in the re.sub pattern (i.e. [-hm ]).
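A quick check of that generic substitution on the sample name from the question, using the lookaround pattern above:
import re

name = 'output 5-11-2018 9h0m.xml'
print(re.sub(r'(?<=\D)(\d)(?=[hm-])', lambda x: x.group(1).zfill(2), name))
# output 05-11-2018 09h00m.xml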

Extract part of string according to pattern using regular expression Python

I have a files that follow a specific format which look something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somehow similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1)) # print time_date >> 0800_20180101
print(match.group(2)) # print time >> 0800
print(match.group(3)) # print date >> 20180101
Note that for such tasks the grouping operator () inside the regexp is really helpful: it lets you access specific substrings of a bigger pattern without having to match each one individually (which can sometimes be much more ambiguous than matching a larger one).
The groups are numbered from 1 up to the number of groups you specified, with group 0 being the whole matched pattern. Group numbers are assigned from left to right, in the order they are defined in your pattern.
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
The key here is that you want everything between the first and the third underscore on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]
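If you would rather split only once, a tiny variation (filename as in the examples above):
filename = "test_0800_20180102_filepath.csv"
parts = filename.split("_")   # ['test', '0800', '20180102', 'filepath.csv']
time, date = parts[1], parts[2]
print(time + "_" + date)      # 0800_20180102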

Python regex to extract substring at start and end of string

I am looking for a regex that will extract everything up to the first . (period) in a string, and everything including and after the last . (period)
For example:
my_file.10.4.5.6.csv
myfile2.56.3.9.txt
Ideally the regex when run against these strings would return:
my_file.csv
myfile2.txt
The numeric stamp in the file will be different each time the script is run, so I am looking essentially to exclude it.
The following prints out the string up to the first . (period)
print re.search("^[^.]*", data_file).group(0)
I am having trouble, though, getting it to also return the last period and the string after it.
Sorry just to update this based upon feedback and comments below:
This does need to be a regex. The regex will be passed into the program from a configuration file. The user will not have access to the source code as it will be packaged.
The user may need to change the regex based upon some arbitrary criteria, so they will need to update the config file, rather than edit the application and re-build the package.
Thanks
You don’t need a regular expression!
parts = data_file.split(".")
print parts[0] + "." + parts[-1]
Instead of regular expressions, I would suggest using str.split. For example:
>>> data_file = 'my_file.10.4.5.6.csv'
>>> parts = data_file.split('.')
>>> print parts[0] + '.' + parts[-1]
my_file.csv
However if you insist on regular expressions, here is one approach:
>>> print re.sub(r'\..*\.', '.', data_file)
my_file.csv
You don't need a regex.
tokens = expanded_name.split('.')
compressed_name = '.'.join((tokens[0], tokens[-1]))
If you are concerned about performance, you could use a maxsplit limit and rsplit() to only chop up the string as much as you need.
compressed_name = expanded_name.split('.', 1)[0] + '.' + expanded_name.rsplit('.', 1)[1]
Do you need a regex here?
>>> address = "my_file.10.4.5.6.csv"
>>> split_by_periods = address.split(".")
>>> "{}.{}".format(address[0], address[-1])
>>> "my_file.csv"
