Cut string in python by counting characters - python

So I have a string inside of a file:
C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c
What would be the best way to cut it just by counting backslashes from the end:
In this case we need to keep everything after the 4th backslash, counting backward from the end:
Folder1\Folder2\Folder3\Module.c
I always need to count backward because I know my folder structure ends like this. I cannot count from the first character, since the number of backslashes "\" is not always the same when counting from the start.

If your string is always a path, you should be using pathlib.Path to handle it:
import os
from pathlib import Path
path = Path(r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c')
Then we can get the following:
>>> path.parts[-4:]
('Folder1', 'Folder2', 'Folder3', 'Module.c')
>>> os.sep.join(path.parts[-4:])
'Folder1\\Folder2\\Folder3\\Module.c'
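The output above is what you get on Windows. As a minimal sketch, assuming the script might also run on Linux or macOS (where Path does not treat a backslash as a separator), PureWindowsPath parses the string the same way on every platform:

```python
from pathlib import PureWindowsPath

raw = r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c'

# PureWindowsPath always treats '\' as a separator, regardless of the host OS.
path = PureWindowsPath(raw)

# Keep the last four components and rebuild a Windows-style relative path.
tail = PureWindowsPath(*path.parts[-4:])
print(tail)
```

Here str(tail) uses backslashes even when the code runs on a non-Windows machine.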

Try this:
'\\'.join(s.split('\\')[-4:])
To read the file mentioned in your comment:
with open('yourfile') as f:
    for s in f:  # usually better than for s in f.readlines()
        # strip the trailing newline so print() does not emit a blank line
        print('\\'.join(s.rstrip('\n').split('\\')[-4:]))
readlines() loads the whole file into memory, which can be a problem if the file is huge and exceeds the process memory limit.

Try this:
s = r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c'
'\\'.join(s.split('\\')[-4:])
First the string is split on the backslashes and the last 4 components are taken. These are then joined back together with a backslash.
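The split-and-join idea can be wrapped in a small helper. This is only a sketch: the name last_components and its parameters are made up for illustration, and it uses rsplit so that splitting starts from the right and stops after the requested number of separators:

```python
def last_components(path, count=4, sep='\\'):
    """Return the last `count` separator-delimited pieces of `path`."""
    # rsplit works from the right and performs at most `count` splits,
    # so everything left of the count-th separator stays in one piece.
    pieces = path.rsplit(sep, count)
    return sep.join(pieces[-count:])


s = r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c'
print(last_components(s))
```

If the string has fewer separators than count, the helper returns it unchanged.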


How can I easily create a for loop that only continues if the characters are parenthesis followed by integers in Python?

As you can already see, I'm fairly new to coding, specifically Python, and I was wondering how I could loop through a string only if the iteration is a parenthesis followed by integers?
I'm attempting to write a duplicate file finder, and when it comes to file names, I could have the same file copied and named slightly different iryan(1).mp4 and iryan(2).mp4, and using splitext[0] isn't enough for it to be detected as a duplicate. For now, I'm using os.stat().st_size to at least ensure they're the same size before diving into the filename comparison, but I'm struggling with ideas on how to only continue a for loop if the iteration is on an opening parenthesis and is followed by an integer?
Is "regex" something I should deep-dive into to solve this problem?
Assuming you have a list of file names (it could also be any other iterable):
file_names = ['iryan.mp4', 'iryan(1).mp4', 'iryan(2).mp4']
You can do the following to find all the duplicate names:
import re
# This regex only matches names that contain
# a number in brackets followed by `.mp4`
dup_regex = re.compile(r'\(\d+\)\.mp4$')
for duplicate in filter(dup_regex.search, file_names):
    print(duplicate)
Output
iryan(1).mp4
iryan(2).mp4
You can use something like this:
import re
pattern = re.compile(r'\(\d+\)')
name = 'iryan(2).mp4'
if pattern.search(name):
    # the name contains "(number)"; run your loop here
    ...
Using the re module, your code might look something like this:
pattern = re.compile(r'\(\d+\)')
for f in files:
    # search() scans the whole name; match() would only look at its start
    if pattern.search(f):
        # duplicate file found...
        # do something
        pass
Another alternative is to use the filter function. That code might look something like this:
pattern = re.compile(r'\(\d+\)')
filtered_files = filter(lambda f: pattern.search(f) is None, files)
for f in filtered_files:
    # load file
    # do something
    pass
The interesting thing here is that filter returns an iterator, which you can consume directly in the for loop. This avoids the if statement inside the loop, which could make your code look a little nicer.
This version gives you all of the files without the parenthesis+number combination in their names, which ideally means all of the original files and none of the duplicates. If you want just the duplicates (files whose names contain the parenthesis+number combination), you can change the filter call to this:
duplicate_files = filter(pattern.search, files)
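For a duplicate finder it may help to go one step further and group names that differ only by the "(number)" suffix. The following is a rough sketch, not a full solution: group_by_base_name and copy_suffix are illustrative names, and it assumes the suffix always sits directly before the extension:

```python
import re
from collections import defaultdict

# Matches a "(number)" suffix that sits directly before the file extension.
copy_suffix = re.compile(r'\(\d+\)(?=\.[^.]+$)')


def group_by_base_name(file_names):
    """Map each name with its "(number)" suffix removed to the matching files."""
    groups = defaultdict(list)
    for name in file_names:
        base = copy_suffix.sub('', name)
        groups[base].append(name)
    return dict(groups)


groups = group_by_base_name(['iryan.mp4', 'iryan(1).mp4', 'iryan(2).mp4', 'other.mp4'])
for base, names in groups.items():
    if len(names) > 1:
        print(base, '->', names)
```

Any group with more than one entry is a candidate set of duplicates, which you could then confirm by size or content.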

glob syntax working not as expected( [ ] *)

I have a folder containing 4 files.
Keras_entity_20210223-2138.h5
intent_tokens.pickle
word_entity_set_20210223-2138.pickle
LSTM_history.h5
I used code:
NER_MODEL_FILEPATH = glob.glob("model/[Keras_entity]*.h5")[0]
It's working correctly since NER_MODEL_FILEPATH is a list only containing the path of that Keras_entity file. Not picking that other .h5 file.
But when I use this code:
WORD_ENTITY_SET_FILEPATH = glob.glob("model/[word_entity_set]*.pickle")[0]
It's not working as expected, rather than picking up only that word_entity_set file,
this list contains both of those two pickle files.
Why would this happen?
Simply remove the square brackets: word_entity_set*.pickle
Per the docs:
[seq] matches any character in seq
So word_entity_set_20210223-2138.pickle is matched because it starts with a w, and intent_tokens.pickle is matched because it starts with an i.
To be clear, it is working as expected. Your expectations were incorrect.
Your code selects intent_tokens.pickle and word_entity_set_20210223-2138.pickle because your glob is incorrect. Change the glob to "word_entity_set*.pickle"
When you use [<phrase>]*.pickle, you're telling the globber to match exactly one of the characters in <phrase>, followed by any characters, followed by ".pickle". So "wordwordword.pickle" will match, and so will:
wwww.pickle
w.pickle
But
.pickle
xw.pickle
foobar.pickle
will not.
There are truly infinite permutations.
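You can check this behaviour without touching the file system. The glob module applies the same pattern rules as fnmatch, so a quick sketch with fnmatchcase on the four file names from the question shows the difference between the two patterns:

```python
from fnmatch import fnmatchcase

files = [
    'Keras_entity_20210223-2138.h5',
    'intent_tokens.pickle',
    'word_entity_set_20210223-2138.pickle',
    'LSTM_history.h5',
]

# [word_entity_set] is a character class: one character out of w, o, r, d, ...
bracketed = [f for f in files if fnmatchcase(f, '[word_entity_set]*.pickle')]

# Without brackets the text is matched literally as a prefix.
plain = [f for f in files if fnmatchcase(f, 'word_entity_set*.pickle')]

print(bracketed)
print(plain)
```

The bracketed pattern picks up both pickle files because "i" and "w" are both in the character class; the plain pattern picks up only the intended file.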

Batch file rename: zero padding time with regex?

I have a whole set of files (10.000+) that include the date and time in the filename. The problem is that the date and time are not zero padded, causing problems with sorting.
The filenames are in the format: output 5-11-2018 9h0m.xml
What I would like is it to be in the format: output 05-11-2018 09h00m.xml
I've searched for different solutions, but most seem to split the strings and then recombine them. That seems pretty cumbersome, since in my case day, month, hour and minute would each need to be separated, padded and then recombined.
I thought regex might give me some better solution, but I can't quite figure it out.
I've edited my original code based on the suggestion of Wiktor Stribiżew that you can't use regex in the replacement and to use groups instead:
import os
import glob
import re
old_format = 'output [1-9]-11-2018 [1-2]?[1-9]h[0-9]m.xml'
# a raw string cannot end in a backslash; os.path.join adds the separator
dir = r'D:\Gebruikers\<user>\Documents\datatest'
old_pattern = re.compile(r'([1-9])-11-2018 ([1-2][1-9])h([0-9])m')
filelist = glob.glob(os.path.join(dir, old_format))
for file in filelist:
    print(file)
    newfile = re.sub(old_pattern, r'0\1-11-2018 \2h0\3m', file)
    os.rename(file, newfile)
But this still doesn't function completely as I would like, since it wouldn't change hours under 10. What else could I try?
You can pad the numbers in your file names with .zfill(2) using a lambda expression passed as the replacement argument to the re.sub method.
Also, fix the regex pattern to allow 1 or 2 digits: (3[01]|[12][0-9]|0?[1-9]) for a date, (2[0-3]|[10]?\d) for an hour (24h), and ([0-5]?[0-9]) for minutes:
old_pattern = re.compile(r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m')
Then use:
for file in filelist:
    newfile = re.sub(old_pattern, lambda x: '{}-11-2018 {}h{}m'.format(x.group(1).zfill(2), x.group(2).zfill(2), x.group(3).zfill(2)), file)
    os.rename(file, newfile)
See Python re.sub docs:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
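To try the replacement without renaming anything, here is a small sketch that applies the same pattern to a plain string. The helper name pad_fields is made up for illustration; the pattern is the one given above:

```python
import re

old_pattern = re.compile(
    r'\b(3[01]|[12][0-9]|0?[1-9])-11-2018 (2[0-3]|[10]?\d)h([0-5]?[0-9])m'
)


def pad_fields(match):
    """Replacement callback: zero-pad day, hour and minute to two digits."""
    day, hour, minute = (group.zfill(2) for group in match.groups())
    return '{}-11-2018 {}h{}m'.format(day, hour, minute)


padded = old_pattern.sub(pad_fields, 'output 5-11-2018 9h0m.xml')
print(padded)
```

Names that are already padded pass through unchanged, so running the rename twice should be harmless.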
I suggest a more generic old_pattern for simplicity, assuming the only problem in your filenames is single-digit fields.
Listing every combination of single-digit and double-digit fields explicitly would need a long regex, so I suggest this much simpler one to select the files to rename. It assumes the directory contains only files of this type, since it matches more widely in exchange for being easier to write and to read at a glance: it finds any filename with at least one single-digit field, i.e. non-digit, digit, non-digit:
old_format = r'output.*\D\d\D.*\.xml'
The fixing re.sub statement could then be:
newfile = re.sub(r'(?<!\d)(\d)(?=[hm-])', lambda x: x.group(1).zfill(2), file)
The lookarounds keep the neighbouring characters out of the match, so adjacent fields such as 9h0m are both padded.
This would also catch Unicode non-ASCII digits unless the re.ASCII flag is set.
If the year (2018 in the example) might be given as just '18', it would need special handling. That could be a separate case, together with adding a space to the character set in the re.sub pattern (i.e. [-hm ]).

re.sub doesn't replace the string when I execute the file

I am trying to write a Python script to practice the re.sub method. But when I run the script with python3, I find that the string in the file doesn't change.
Here is my location.txt file,
34.3416,108.9398
this is what regex.py contains,
import re
with open('location.txt', 'r+') as second:
    content = second.read()
    content = re.sub('([-+]?\d{2}\.\d{4},[-+]?\d{2}\.\d{4})', '44.9740,-93.2277', content)
    print(content)
I set up a print statement to test the output, and it gives me
34.3416,108.9398
which is not what I want.
Then I changed the "r+" to "w+", and it completely removed the content of location.txt. Can anyone tell me the reason?
Your regexp has a problem, as pointed out by Andrej Kesely in the other answer: \d{2} should be \d{2,3}:
content = re.sub(r'([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})', '44.9740,-93.2277', content)
After fixing that, the string changes, but you never write it back to the file; you are only changing the variable in memory.
second.seek(0)  # return to beginning of file
second.write(content)  # write the data back to the file
second.truncate()  # remove extraneous bytes (in case the content shrank)
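Put together, the read, substitute, seek, write and truncate steps might look like the sketch below. It works on a throwaway copy of location.txt in a temporary directory (an assumption made so the example is safe to run). On the "w+" part of the question: that mode truncates the file as soon as it is opened, so there is nothing left to read, which is why the content disappeared.

```python
import os
import re
import tempfile

pattern = re.compile(r'[-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4}')

with tempfile.TemporaryDirectory() as tmp:
    # Throwaway file standing in for the real location.txt.
    path = os.path.join(tmp, 'location.txt')
    with open(path, 'w') as f:
        f.write('34.3416,108.9398\n')

    with open(path, 'r+') as f:
        content = pattern.sub('44.9740,-93.2277', f.read())
        f.seek(0)         # go back to the start before writing
        f.write(content)  # overwrite the old data
        f.truncate()      # drop leftover bytes if the new text is shorter

    with open(path) as f:
        result = f.read()

print(result)
```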
The second number in your location.txt is 108.9398, which has 3 digits before the dot, so it doesn't match your regexp. Change your regexp to:
([-+]?\d{2,3}\.\d{4},[-+]?\d{2,3}\.\d{4})

Python 3 HTML parser

I'm sure everyone will groan, and tell me to look at the documentation (which I have) but I just don't understand how to achieve the same as the following:
curl -s http://www.maxmind.com/app/locate_my_ip | awk '/align="center">/{getline;print}'
All I have in python3 so far is:
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for lines in f.readlines():
    print(lines)
f.close()
Seriously, any suggestions (please don't tell me to read http://docs.python.org/release/3.0.1/library/html.parser.html as I have been learning python for 1 day, and get easily confused) a simple example would be amazing!!!
This is based off of larsmans's answer, above.
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
for line in f:
    if b'align="center">' in line:
        print(next(f).decode().rstrip())
f.close()
Explanation:
for line in f iterates over the lines in the file-like object, f. Python lets you iterate over the lines in a file like you would over the items in a list.
if b'align="center">' in line looks for the string 'align="center">' in the current line. The b indicates that this is a buffer of bytes, rather than a string. It appears that urllib.request.urlopen interprets the results as binary data, rather than unicode strings, and an unadorned 'align="center">' would be interpreted as a unicode string. (That was the source of the TypeError above.)
next(f) takes the next line of the file, because your original awk script printed the line after 'align="center">' rather than the current line. The decode method (bytes objects have methods in Python) takes the binary data and converts it to a printable unicode object. The rstrip() method strips any trailing whitespace (namely, the newline at the end of each line).
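The same loop can be tried offline. In this sketch an io.BytesIO object stands in for the HTTP response, the helper name lines_after is made up, and the HTML snippet and IP address (taken from a documentation-only range) are invented test data:

```python
import io


def lines_after(stream, marker):
    """Yield the decoded line that follows each line containing `marker`."""
    for line in stream:
        if marker in line:
            # Pull the following line from the same iterator, as next(f) does.
            following = next(stream, None)
            if following is not None:
                yield following.decode().rstrip()


# Invented sample data standing in for the page returned by urlopen.
fake_response = io.BytesIO(b'<td align="center">\n203.0.113.7\n</td>\n')
found = list(lines_after(fake_response, b'align="center">'))
print(found)
```

Passing None as the default to next() avoids an exception if the marker happens to be on the last line.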
# no need for .readlines here
for ln in f:
    # urlopen yields bytes, so compare against a bytes literal
    if b'align="center">' in ln:
        print(ln)
But be sure to read the Python tutorial.
I would probably use regular expressions to get the IP itself:
import re
import urllib.request
f = urllib.request.urlopen('http://www.maxmind.com/app/locate_my_ip')
html_text = f.read().decode()
re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_text)[0]
which will return the first string of the format: 1-3 digits, period, 1-3 digits, ...
If you were looking for the whole line, you could extend the pattern passed to findall() to cover it (see the Python docs for re for more details).
By the way, the r in front of the pattern makes it a raw string, so you don't need to escape Python escape characters inside it (but you still need to escape regex special characters).
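One detail to watch in Python 3: read() on the response returns bytes, and a str pattern cannot be applied to bytes. A short sketch of the two ways around that, using an invented HTML snippet in place of the real page:

```python
import re

# Invented sample data standing in for f.read().
html_bytes = b'<td align="center">\n203.0.113.7\n</td>'

# Option 1: decode the bytes first, then use an ordinary str pattern.
ip_pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
ips = ip_pattern.findall(html_bytes.decode('utf-8'))

# Option 2: keep the bytes and use a bytes pattern (note the rb prefix).
ip_bytes_pattern = re.compile(rb'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
raw_ips = ip_bytes_pattern.findall(html_bytes)

print(ips, raw_ips)
```

Decoding first is usually more convenient, since the result can be printed or compared with ordinary strings.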
Hope that helps
