I'm really new to Python and programming in general, and to practice I'm doing projects that tackle problems from my day-to-day work, so please excuse me if this is a silly question. I'm trying to combine a group of files located in a remote folder into a single monthly file based on the date. I've already combined files based on date, so I think I can do that part, but I'm having trouble with the regex to pick the date out of the file name. The string with the file path is as follows:
\\machinename123\main folder\subfolder\2021-01-24.csv
The file name always has the same format since it comes from an automated process; only the date in the name changes. I was trying to pick the date from this string with a regex that selects the text between the last \ of the string and the . before the extension, so that I get 2021-01-24 as a result. But at the level I'm at, regexes are like witchcraft and I don't really know what I'm doing. I've been trying for a few hours with no success. So far the closest I can get by trial and error is (?:[0-9\-]), but this selects all the numbers in the string, including the ones in the machine name. Besides, I don't know why it works the way it works (for example, I know from testing that the ?: does something, but I don't understand the theory behind it, so I couldn't replicate it in the future).
How can I make it ignore the other numbers on the string, or more specifically pick only the text between the last \ and the . from the csv, xlsx or whatever the format is?
I'd like the former option better, since it would allow me to learn how to make it do what I need it to do and not get the result by coincidence.
Thanks for any help
Use re.search() to find a pattern: <4 digits>-<2 digits>-<2 digits>.
import re
s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
m = re.search(r'\d{4}-\d{2}-\d{2}', s).group(0)  # '2021-01-24'
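To address the "why" in the question, here is a small sketch of what the trial-and-error pattern does compared with the pattern above. The conversion to a date and the `month_key` name are my own assumptions about how the monthly grouping might be done, not something from the question.

```python
import re
from datetime import date

path = r'\\machinename123\main folder\subfolder\2021-01-24.csv'

# [0-9\-] matches ONE character (a digit or a hyphen); (?:...) only groups
# without capturing, so this pattern hits every digit and hyphen separately.
print(re.findall(r'(?:[0-9\-])', path)[:4])   # ['1', '2', '3', '2']

# Spelling out the whole shape skips "123": it is not four digits plus a hyphen.
m = re.search(r'(\d{4})-(\d{2})-(\d{2})', path)
file_date = date(*map(int, m.groups()))
month_key = file_date.strftime('%Y-%m')        # hypothetical grouping key
print(file_date, month_key)                    # 2021-01-24 2021-01
```

Files that share the same `month_key` would then belong in the same monthly file.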
You can use the following regex, that summarizes the structure of your full path:
import re
# (?:[\w ]+\\)+ is one or more "folder\" parts; \. is a literal dot
filename_regex = re.compile(r'^\\\\(?:[\w ]+\\)+(\d{4}-\d{2}-\d{2})\.csv$')
m = filename_regex.match(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
if m is not None:
    print(f'File found with date: {m.groups()[0]}')
else:
    print('Filename not of interest')
The output will be:
File found with date: 2021-01-24
filename_regex accepts a string that starts with \\, followed by one or more folder names (alphanumeric characters, underscores and spaces), each ending in \, and whose final part is 4 digits, a minus, 2 digits, a minus, 2 digits and the string .csv. The regular expression used here to match the date is very simple (it would also accept 2021-99-99), but you can use a stricter one if you prefer.
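Since the question mentions that regexes feel like witchcraft, here is a sketch of the same idea written with re.VERBOSE, so each piece can carry its own comment. The name `FILENAME_RE` and the exact character set allowed in folder names are assumptions for illustration.

```python
import re

FILENAME_RE = re.compile(r'''
    ^\\\\                  # the two leading backslashes of a UNC path
    (?:[\w ]+\\)+          # one or more folder names, each ending in a backslash
    (\d{4}-\d{2}-\d{2})    # the date, captured as group 1
    \.csv$                 # a literal ".csv" at the very end
''', re.VERBOSE)

path = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
m = FILENAME_RE.match(path)
print(m.group(1))          # 2021-01-24
```

In verbose mode whitespace outside character classes is ignored, which is why the space inside `[\w ]` still counts.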
Another simpler approach would be using ntpath library, extracting the name of the file from the full path and applying the regular expression only to the name of the file:
import ntpath
import re
# \. matches a literal dot; a bare . would match any character
filename_regex = re.compile(r'^(\d{4}-\d{2}-\d{2})\.csv$')
filename = ntpath.basename(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
m = filename_regex.match(filename)
The date captured in m (m.groups()[0]) will be the same as before.
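If the goal is literally "the text between the last \ and the .", a regex may not be needed at all. A minimal sketch using pathlib from the standard library, assuming the file stem is always a date in YYYY-MM-DD form:

```python
from datetime import datetime
from pathlib import PureWindowsPath

# PureWindowsPath parses Windows-style paths on any operating system
path = PureWindowsPath(r'\\machinename123\main folder\subfolder\2021-01-24.csv')

stem = path.stem                                        # name without extension
file_date = datetime.strptime(stem, '%Y-%m-%d').date()  # raises ValueError if malformed
print(stem, file_date.month)                            # 2021-01-24 1
```

Using strptime also validates the date, which the simple digit pattern does not.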
The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens, next. Since the maximum path length is 260 characters, I only need to count 260 characters beyond the start. But since spaces (and other characters) are allowed in file names I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name, itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution?
Here's the expression I got, based on yours, that allow me to get the path on windows : [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here : https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ one after another, so the group only holds the last one it matched.
Mine wraps the repetition (?:[a-zA-Z0-9() ]*\\)* in a single capturing group. This captures the concatenation XXX\YYYY\ZZZ\ as a whole, so I get the full path.
The second change I made is related to the filename : I'll just match any group of character that does not contain \ (the capture group being greedy). This allows me to take care of strange file names.
Another regex that would work would be : [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example : https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way : thus, .*?\\ will match the bare minimum of text followed by a back-slash.
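A quick sketch of the greedy versus non-greedy difference, using a made-up path (the sample string is an assumption, not taken from the question):

```python
import re

s = r'C:\Users\me\notes.txt'

# Lazy: stop at the FIRST backslash that lets the match succeed
lazy = re.match(r'(.*?)\\', s).group(1)      # 'C:'
# Greedy: run to the end, then back off to the LAST backslash
greedy = re.match(r'(.*)\\', s).group(1)     # 'C:\Users\me'

# The second expression from the answer: lazy pieces repeated inside one group
folders = re.match(r'[a-zA-Z]:\\((?:.*?\\)*).*', s).group(1)
print(lazy, greedy, folders, sep=' | ')      # C: | C:\Users\me | Users\me\
```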
Do not hesitate if you have any question regarding the expressions.
I'd also encourage you to try to see how well your expression works using : https://regex101.com . This also has a list of the different tokens you can use in your regex.
Edit: As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want. And I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How this works: I split the original string into a list of substrings on the backslash. I then remove the first and last parts ([1:-1] keeps everything from the 2nd element to the one before the last) and join the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
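The split-and-join command can be sketched as a small function. The name `inner_path` and the sample paths are assumptions, and the sketch presumes the string holds only the path, not surrounding text:

```python
def inner_path(target):
    """Drop the drive and the last component, keep the folders in between."""
    return "\\".join(target.split("\\")[1:-1])

# Full address of a file: the file name is the last piece and is dropped
print(inner_path(r'C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe'))
# Program Files (x86)\Adobe\Acrobat Distiller

# Directory path ending in a backslash: the last piece is an empty string
print(inner_path('C:\\Program Files (x86)\\Adobe\\'))
# Program Files (x86)\Adobe
```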
I have a whole set of files (10.000+) that include the date and time in the filename. The problem is that the date and time are not zero padded, causing problems with sorting.
The filenames are in the format: output 5-11-2018 9h0m.xml
What I would like is it to be in the format: output 05-11-2018 09h00m.xml
I've searched for different solutions, but most seem to use splitting strings and then recombining them. That seems pretty cumbersome, since in my case day, month, hour and minute would each need to be separated, padded and then recombined.
I thought regex might give me some better solution, but I can't quite figure it out.
I've edited my original code based on the suggestion of Wiktor Stribiżew that you can't use regex in the replacement and to use groups instead:
import os
import glob
import re
old_format = 'output [1-9]-11-2018 [1-2]?[1-9]h[0-9]m.xml'
# A raw string cannot end in a backslash; os.path.join adds the separator
dir = r'D:\Gebruikers\<user>\Documents\datatest'
old_pattern = re.compile(r'([1-9])-11-2018 ([1-2][1-9])h([0-9])m')
filelist = glob.glob(os.path.join(dir, old_format))
for file in filelist:
    print(file)
    newfile = re.sub(old_pattern, r'0\1-11-2018 \2h0\3m', file)
    os.rename(file, newfile)
But this still doesn't function completely as I would like, since it wouldn't change hours under 10. What else could I try?
You can pad the numbers in your file names with .zfill(2) using a lambda expression passed as the replacement argument to the re.sub method.
Also, fix the regex pattern to allow 1 or 2 digits: (3[01]|[12][0-9]|0?[1-9]) for a date, (2[0-3]|[10]?\d) for an hour (24h), and ([0-5]?[0-9]) for minutes:
old_pattern = re.compile(r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m')
See the regex demo.
Then use:
for file in filelist:
    newfile = re.sub(old_pattern, lambda x: '{}-11-2018 {}h{}m'.format(x.group(1).zfill(2), x.group(2).zfill(2), x.group(3).zfill(2)), file)
    os.rename(file, newfile)
See Python re.sub docs:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
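As a sketch of the function-as-replacement idea, here is a version that pads every field rather than hard-coding 11-2018. The names `STAMP`, `pad_stamp` and `padded_name` are hypothetical, and the pattern assumes the day-month-year hour/minute layout from the question:

```python
import re

# Assumed layout: "<day>-<month>-<year> <hour>h<minute>m"
STAMP = re.compile(r'\b(\d{1,2})-(\d{1,2})-(\d{4}) (\d{1,2})h(\d{1,2})m')

def pad_stamp(match):
    """Called by re.sub once per match; returns the replacement text."""
    day, month, year, hour, minute = match.groups()
    return '{}-{}-{} {}h{}m'.format(
        day.zfill(2), month.zfill(2), year, hour.zfill(2), minute.zfill(2))

def padded_name(name):
    return STAMP.sub(pad_stamp, name)

print(padded_name('output 5-11-2018 9h0m.xml'))   # output 05-11-2018 09h00m.xml
```

Applying this to os.path.basename(file) rather than the full path would avoid accidentally rewriting digits in a directory name.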
I suggest going more generic with old_pattern for simplicity, assuming your filenames are only misbehaving with digits:
Listing every combination explicitly (a single-digit field in one position, double digits in the others) would need a long regex. This much simpler one matches any filename that contains at least one single-digit field, i.e. non-digit, digit, non-digit. It is deliberately broad, so it assumes the directory only contains files of this type. Note that it is a regular expression, not a glob pattern, so it has to be applied with the re module rather than passed to glob:
old_format = r'output.*\D\d\D.*\.xml'
The fixing re.sub statement could then be:
# Lookarounds keep the neighbours out of the match, so adjacent fields such as 9h0m are both padded
newfile = re.sub(r'(?<!\d)(\d)(?=[hm-])', lambda x: x.group(1).zfill(2), file)
This would also catch unicode non-ascii digits unless the appropriate re module flags are set.
If the year (2018 in example) might be given as just '18' then it would need special handling for that - could be separate case, and also adding a space into the re.sub regex pattern set (ie [-hm ]).
I have files that follow a specific format and look something like this:
test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv
The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somehow similar problems, but I am new to Python and I am having trouble adapting them.
This is what I have implemented thus far:
datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")
But the result I get is only the date part:
20180102
But what I actually need is:
0800_20180102
That's quite simple:
import re
your_string = "test_0800_20180102_filepath.csv"
match = re.search(r"_((\d+)_(\d+))_", your_string)
print(match.group(1)) # print time_date >> 0800_20180102
print(match.group(2)) # print time >> 0800
print(match.group(3)) # print date >> 20180102
Note that for such tasks the group operator () inside the regexp is really helpful, it allows you to access certain substrings of a bigger pattern without having to match each one individually (which can sometimes be much more ambiguous than matching a larger one).
Groups are numbered from 1 up to the number of groups in the pattern, counted by their opening parenthesis from left to right; group 0 is the whole match.
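A sketch of the group numbering, using named groups so the numbers and names can be compared side by side. The group names and the fixed digit counts (4 for the time, 8 for the date) are assumptions based on the sample file names:

```python
import re
from datetime import datetime

name = 'test_0800_20180102_filepath.csv'
m = re.search(r'_(?P<stamp>(?P<time>\d{4})_(?P<date>\d{8}))_', name)

# Numbering follows the opening parentheses: 1 = stamp, 2 = time, 3 = date
print(m.group(1), m.group(2), m.group(3))   # 0800_20180102 0800 20180102

stamp = m.group('stamp')
when = datetime.strptime(stamp, '%H%M_%Y%m%d')
print(when)                                  # 2018-01-02 08:00:00
```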
On a side note, if you have control over it, use unix timestamps so you only have one number defining both date and time universally.
The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.
with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)
The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:
re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)
gives:
'0800_20180102'
This is very easy to do with .split():
time = filename.split("_")[1]
date = filename.split("_")[2]
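A small variation, as a sketch: one split with a maxsplit of 3 yields all the pieces at once and leaves any further underscores in the last part untouched. The variable names `prefix` and `rest` are just for illustration:

```python
filename = 'test_0800_20180102_filepath.csv'

# Split on the first three underscores only
prefix, time, date, rest = filename.split('_', 3)
print(time, date)              # 0800 20180102
print('_'.join([time, date]))  # 0800_20180102
```

This raises ValueError if a name has fewer than three underscores, which may be a useful early warning for unexpected files.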
I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
The path/to/file part changes depending on the situation, and subject_ID is always there. I'm trying to write a regex that extracts only the file part of the string. The ?_subject_ID part is always present, but I don't know how to safely get the file.
My current regex looks like (.*[\/]).*\?_subject_ID
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile('(.*[\/]).*\?_subject_ID')
file_re.search(url)
this will find the right string, but I still can't extract the file name
printing _.group(1) will get me /path/to/. What's the next step that gets me the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
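The two-group version can be sketched like this, using the sample string from the question (the variable names `directory` and `file_name` are mine):

```python
import re

url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'

# Group 1: everything up to and including the last "/"; group 2: the file part
m = re.search(r'(.*/)(.*)\?_subject_ID', url)
directory, file_name = m.group(1), m.group(2)
print(directory, file_name)   # /path/to/ file
```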
You may use the non-regex approach here, here is a snippet showing how to leverage urlparse and os.path to parse the URL like yours:
from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module
import os.path
path = urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
See the IDEONE demo
It's pretty simple, really. Just match a / before and ?_subject_ID after:
([^/?]*)\?_subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class is a literal question mark and needs no escaping there.
If you want to get both the path and the file, you can do much the same thing, but also grab the part before the /:
([^?]*/)([^/?]*)\?_subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.
As an input, I have a filename (e.g. "bla150420.txt") containing a date in a specific format. I need to look into a given folder (containing many files) and find the latest version of my file. (And I have to do it automatically - many times for different files in different folders.)
Example directory (dirname):
...
bla150420.txt
bla150425.txt
bla150510.txt
Example output:
bla150510.txt
How can I do so? My original approach was to parse a date out of the file name, substitute the date with RE pattern and to search this pattern in the list of all filenames. This doesn't seem to work. Any idea? Or different approach?
def get_date(file_name):
    DATE_RE = re.compile('([0-9]{6})') #EDITED - TYPO
    try:
        match = DATE_RE.search(fname).group()
    except AttributeError:
        sys.stderr.write('ERROR! No date matches string!\n\t' + match)
    else:
        date = datetime.datetime.strptime(match, '%Y%m%d')
        return match, date
date_string, current_date = get_date(fname)
# fname is a given file name (e.g. bla150420.txt)
pattern1 = re.compile(re.sub(date_string, '(.*)', fname))
# pattern1 returns value 'kds_docs-(.*).zip'
pattern2 = re.compile('kds_docs-(.*).zip')
if os.path.isdir(dirname):
    matching_files = [x for x in os.listdir(dirname)
                      if pattern1.search(x)]
It is a wonder to me, my program works with pattern2, but not with pattern1. If I print those two (using .pattern), it looks like the same result, if I compare it with '==' it returns False. I have no idea whether it is because of encoding/whitespace/something else nor how to find the difference. Could you please help?
I think you're merely having issues with producing a working regex in an automated fashion.
Serge pointed out that your code as presented should get tripped up by the fact that your dates seem to have 6 digits instead of 8, but that first regex requires 8 digits - correct or explain that if it's more than a typo.
I suppose you're looking to verify that any string of digits actually is a date, but this seems unnecessary, since a filename could have a string of digits which parses as a date, but isn't the date you're looking for - not ideal. Let me know if it MUST be a date.
I'm unfamiliar with the intricacies of Python, but I would suggest simplifying your regex production by not even using your function:
pattern1 = re.compile(re.sub('([0-9]{6})', '(.*)', fname))
Just make the replacement directly. I would say it would probably be safer to go a little further like so:
# The backslash is doubled because re.sub rejects \d as an escape in a replacement string
pattern1 = re.compile(re.sub('([0-9]{6})', r'(\\d{6})', fname))
...and if there are any other possible restrictions, you could limit 6 digit matches further. For instance, the 6 digit string might always be at the end of the file name, just before the extension:
pattern1 = re.compile(re.sub(r'([0-9]{6})(?=\..*$)', r'(\\d{6})', fname))
# should turn 'kds_docs-120501-151023.zip' into 'kds_docs-120501-(\d{6}).zip'
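Putting the pieces together, here is a rough sketch of picking the latest version. The function name `latest_version` is hypothetical, and it assumes the first 6-digit run in the name is a YYMMDD date and that all dates fall in the same century, so the strings sort chronologically:

```python
import re

def latest_version(fname, candidates):
    """Return the candidate that matches fname's shape and has the newest date."""
    found = re.search(r'\d{6}', fname)
    if found is None:
        raise ValueError('no 6-digit date in ' + fname)
    # re.escape keeps characters such as "." in the name from acting as regex syntax
    pattern = re.compile(
        re.escape(fname[:found.start()]) + r'(\d{6})' + re.escape(fname[found.end():])
    )
    dated = []
    for name in candidates:
        m = pattern.fullmatch(name)
        if m:
            dated.append((m.group(1), name))
    # Tuples compare by their first item, so max() picks the largest date string
    return max(dated)[1] if dated else None

names = ['bla150420.txt', 'bla150425.txt', 'bla150510.txt', 'blaX150601.txt']
print(latest_version('bla150420.txt', names))   # bla150510.txt
```

In the original setting, `candidates` would be os.listdir(dirname).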