Grab part of filename with Python - python

Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs parts of filenames corresponding to a certain pattern, and outputs it to a textfile.
So in my case, let's say I have four .pdf like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?

Here's a simple solution using the re module, as mentioned in other answers.
# Libraries
import re
# Example filenames. Use glob as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']  # glob.glob("*.pdf")
for fname in file_list:
    # Raw string with an escaped dot, so "." only matches a literal period
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
Good luck!
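Since the question also asks for the results to end up in a text file, here is a minimal sketch of how the pieces could fit together. The names extract_ids and write_ids are placeholders made up for this example, not part of any library.

```python
import re

# Matches the digits between "ID_" and a trailing ".pdf".
ID_PATTERN = re.compile(r"ID_(\d+)\.pdf$")


def extract_ids(filenames):
    """Return the ID part of every filename that matches ID_PATTERN."""
    ids = []
    for fname in filenames:
        match = ID_PATTERN.search(fname)
        if match:
            ids.append(match.group(1))
    return ids


def write_ids(ids, out_path):
    """Write one ID per line to out_path (a hypothetical output file)."""
    with open(out_path, "w") as out:
        for id_ in ids:
            out.write(id_ + "\n")


# The four filenames from the question.
example = ["aaa_ID_8423.pdf", "bbbb_ID_8852.pdf",
           "ccccc_ID_7413.pdf", "dddddd_ID_4421.pdf"]
print(extract_ids(example))  # ['8423', '8852', '7413', '4421']
```

For real files you could pass glob.glob("*.pdf") to extract_ids and hand the result to write_ids together with whatever output filename you prefer.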

Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>

If the numbers are variable length, you'll want the regex module "re"
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by one or more digits ("[0-9]+"), followed by a period and a run of non-period characters up to the end of the string ("\.[^\.]+$"), i.e. the file extension.
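If it helps, the same pattern can be written in verbose mode so that each piece carries its own comment. This is only a restatement of the regex above; the sample filenames are made up for illustration.

```python
import re

pattern = re.compile(
    r"""
    _           # a literal underscore
    ([0-9]+)    # one or more digits, captured as group 1
    \.          # a literal period
    [^.]+       # the extension: one or more characters that are not a period
    $           # end of the string
    """,
    re.VERBOSE,
)

for name in ["abc_ID_8423.pdf", "x_ID_12.tar.gz", "no_digits.pdf"]:
    match = pattern.search(name)
    # Prints the captured digits, or None when the name does not fit the pattern.
    print(name, "->", match.group(1) if match else None)
```

Only the first name matches: the second has two periods after the digits, and the third has no digits after an underscore.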

You can use the os module in Python and call listdir to get a list of the filenames present in a path, like so:
import os
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern you need using regular expressions:
import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:  # skip names that do not contain "ID_"
        print(m.group(0))
The above snippet prints the part of the filename following ID_. Because \w does not match a period, the match stops before the extension, so for your example it would print 8423, 8852 etc.

You probably want to use glob, which is a Python module for file globbing. From the Python help page, the usage is as follows:
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']

Related

Python Regex to extract file where filename contains and also should not contain specific pattern from a zip folder

I want to extract just one specific single file from the zip folder which has the below 3 files.
Basically it should start with 'kpidata_nfile' and should not contain 'fileheader'
kpidata_nfile_20220919-20220925_fileheader.csv
kpidata_nfile_20220905-20220911.csv
othername_kpidata_nfile_20220905-20220911.csv
Below is the code I have tried:
from zipfile import ZipFile
import re
import os
for x in os.listdir('.'):
    if re.match('.*\.(zip)', x):
        with ZipFile(x, 'r') as zip:
            for info in zip.infolist():
                if re.match(r'^kpidata_nfile_', info.filename):
                    zip.extract(info)
Output required - kpidata_nfile_20220905-20220911.csv
This regex does what you require:
^kpidata_nfile(?:(?!fileheader).)*$
See this answer for more about the (?:(?!fileheader).)*$ part.
You can see the regex working on your example filenames here.
The regex is not particularly readable, so it might be better to use plain Python expressions instead. Something like:
fname = info.filename
if fname.startswith('kpidata_nfile') and 'fileheader' not in fname:
    zip.extract(info)
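For completeness, the startswith / not in test could be dropped into the extraction loop roughly like this. It's a sketch, not a definitive version: the function names wanted and extract_wanted are made up here, and checking only the base name (ignoring any folder prefix inside the archive) is an assumption about what you want.

```python
import os
from zipfile import ZipFile


def wanted(name):
    """True for names that start with 'kpidata_nfile' and lack 'fileheader'."""
    # Assumption: only the base name matters, not folders inside the archive.
    base = os.path.basename(name)
    return base.startswith("kpidata_nfile") and "fileheader" not in base


def extract_wanted(zip_path, dest_dir="."):
    """Extract matching members of zip_path into dest_dir; return their names."""
    extracted = []
    with ZipFile(zip_path, "r") as archive:
        for info in archive.infolist():
            if wanted(info.filename):
                archive.extract(info, dest_dir)
                extracted.append(info.filename)
    return extracted
```

With the three files from the question, only kpidata_nfile_20220905-20220911.csv passes the wanted check.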

Python string alphabet removal?

So in my program, I am reading in files and processing them.
My output should show just the file name and then display some data.
When I loop through the files and print their names and data,
it displays, for example, myfile.txt. I don't want the .txt part, just myfile.
How can I remove the .txt from the end of this string?
The best way to do it is with os.path.splitext, as in this example:
import os
filename = 'myfile.txt'
print(filename)
print(os.path.splitext(filename))
print(os.path.splitext(filename)[0])
More info about this very useful standard-library module:
https://docs.python.org/3.8/library/os.path.html
The answers given are totally right, but if you have other possible extensions, or don't want to import anything, try this:
name = file_name.rsplit(".", 1)[0]
You can use pathlib.Path which has a stem attribute that returns the filename without the suffix.
>>> from pathlib import Path
>>> Path('myfile.txt').stem
'myfile'
Well, if you only have .txt files, you can do this:
file_name = "myfile.txt"
file_name.replace('.txt', '')
This uses the built-in str.replace method. Note that it removes every occurrence of '.txt' in the name, not just the one at the end.
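To see where the approaches in these answers agree and where they differ, here is a small comparison. The filenames are invented for the example; nothing here is specific to the asker's program.

```python
import os
from pathlib import Path

names = ["myfile.txt", "notes.txt.bak.txt", "archive.tar.gz", "README"]

for name in names:
    print(
        name,
        os.path.splitext(name)[0],   # drops only the last extension
        name.rsplit(".", 1)[0],      # same result for these names
        Path(name).stem,             # same result for these names
        name.replace(".txt", ""),    # drops EVERY ".txt", wherever it occurs
    )
```

The first three columns match for all four names. The last one turns notes.txt.bak.txt into notes.bak and leaves archive.tar.gz untouched, which is why replace is only safe when you know exactly what the names look like.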

Getting file names without file extensions with glob

I'm searching for .txt files only
from glob import glob
result = glob('*.txt')
>> result
['text1.txt','text2.txt','text3.txt']
but I'd like result without the file extensions
>> result
['text1','text2','text3']
Is there a regex pattern that I can use with glob to exclude the file extensions from the output, or do I have to use a list comprehension on result?
There is no way to do that with glob(). You need to take the list it returns and create a new one that stores the values without the extension:
import os
from glob import glob
[os.path.splitext(val)[0] for val in glob('*.txt')]
os.path.splitext(val) splits the file names into file names and extensions. The [0] just returns the filenames.
Since you’re trying to split off a filename extension, not split an arbitrary string, it makes more sense to use os.path.splitext (or the pathlib module). While it’s true that it makes no practical difference on the only platforms that currently matter (Windows and *nix), it’s still conceptually clearer what you’re doing. (And if you later start using path-like objects instead of strings, it will continue to work unchanged, to boot.)
So:
paths = [os.path.splitext(path)[0] for path in paths]
Meanwhile, if this really offends you for some reason, what glob does under the covers is just call fnmatch to turn your glob expression into a regular expression and then apply that to all of the filenames. So, you can replace it by writing the regex yourself and using capture groups:
import os
import re
rtxt = re.compile(r'(.*)\.txt$')
files = (rtxt.match(file) for file in os.listdir(dirpath))  # dirpath: the directory to scan
files = [match.group(1) for match in files if match]
This way, you’re not doing a listcomp on top of the one that’s already in glob; you’re doing one instead of the one that’s already in glob. I’m not sure if that’s a useful win or not, but since you seem to be interested in eliminating a listcomp…
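To make the "glob expression becomes a regular expression" point concrete, here is a tiny illustration using fnmatch.translate on a hand-written list of names. The exact regex text that translate returns varies between Python versions, so the sketch only uses the compiled pattern and never prints it.

```python
import fnmatch
import re

# Turn the shell-style pattern into a compiled regular expression.
glob_regex = re.compile(fnmatch.translate("*.txt"))

names = ["text1.txt", "text2.txt", "image.png"]
matches = [n for n in names if glob_regex.match(n)]
print(matches)  # ['text1.txt', 'text2.txt']
```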
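This glob only selects files without an extension: **/*/!(*.*) (Note that !(...) is shell extended-glob syntax, as in bash with extglob enabled; Python's glob module does not support it.)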
Use index slicing:
result = [i[:-4] for i in result]
Another way using rsplit:
>>> result = ['text1.txt','text2.txt.txt','text3.txt']
>>> [x.rsplit('.txt', 1)[0] for x in result]
['text1', 'text2.txt', 'text3']
You could do as a list-comprehension:
result = [x.rsplit(".txt", 1)[0] for x in glob('*.txt')]
Use str.split
>>> result = [r.split('.')[0] for r in glob('*.txt')]
>>> result
['text1', 'text2', 'text3']
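One thing to watch with split('.')[0] is that it cuts at the first period, so a name like report.v2.txt would come back as report. If that matters, a pathlib-based variant is one option. This is a sketch under that assumption; the helper name stems is made up for the example.

```python
from pathlib import Path


def stems(directory, pattern="*.txt"):
    """Return the names (without the final suffix) of files matching pattern."""
    return sorted(p.stem for p in Path(directory).glob(pattern))
```

Calling stems('.') on a folder containing text1.txt and report.v2.txt would give ['report.v2', 'text1'], since stem removes only the last suffix.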

Delete all files with partial filename python

I have files in my present working directory that I would like to delete. They all have a filename that starts with the string 'words' (for example, files words_1.csv and words_2.csv). I want to match all files in the current directory that start with 'words' and delete them. What would the search pattern be?
I found this from here, but it doesn't quite answer the question.
import os, re
def purge(dir, pattern):
    for f in os.listdir(dir):
        if re.search(pattern, f):
            os.remove(os.path.join(dir, f))
t = 'words_1.csv'
print(t.startswith('words'))
That's all you need to test a name. As a regex, the pattern could be '^words.*\.csv$', but I suggest you read the Python re documentation.
If I'm understanding your question correctly, you have this function and you are asking how it may be used. You should be able to call simply:
purge('/path/to/your/dir', '^words')
This will remove any files starting with the string "words". The ^ anchor matters: purge uses re.search, so without it the pattern would match "words" anywhere in the name.
pattern is a regular expression pattern. In your case, it's simply anything beginning with "words" and ending with ".csv", so you can use
pattern = r"^words.*\.csv$"
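Because deleting files is hard to undo, it can be worth checking what a pattern matches before removing anything. Below is a sketch of a purge variant with a dry-run switch. The dry_run parameter, the use of fullmatch, and the isfile check are all additions for this example, not part of the function quoted in the question.

```python
import os
import re


def purge(directory, pattern, dry_run=True):
    """Delete files in directory whose whole name matches the regex pattern.

    With dry_run=True (the default here, an added safety assumption),
    nothing is deleted; the matching paths are only returned.
    """
    regex = re.compile(pattern)
    matched = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # fullmatch requires the regex to cover the entire filename.
        if os.path.isfile(path) and regex.fullmatch(name):
            matched.append(path)
            if not dry_run:
                os.remove(path)
    return sorted(matched)
```

You could call purge('.', r'words.*\.csv') first to inspect the returned list, then repeat the call with dry_run=False once it looks right.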

batch search and replace strings in filenames with python

I am trying to write a small python script to rename a bunch of filenames by searching and replacing. For example:
Original filename:
MyMusic.Songname.Artist-mp3.iTunes.mp3
Intended result:
Songname.Artist.mp3
What I've got so far is:
#!/usr/bin/env python
from os import rename, listdir
mustgo = "MyMusic."
fnames = listdir('.')
for fname in fnames:
    if fname.startswith(mustgo):
        rename(fname, fname.replace(mustgo, '', 1))
(I got it from this site, as far as I can remember.)
Anyway, this will only get rid of the string at the beginning, but not of those in the middle of the filename.
Also, I would like to use a separate file (e.g. badwords.txt) containing all the strings that should be searched for and replaced, so that I can update them without having to edit the code.
Content of badwords.txt
MyMusic.
-mp3
-MP3
.iTunes
.itunes
I have been searching for quite some time now but haven't found anything. Would appreciate any help!
Thank you!
import fnmatch
import re
import os
with open('badwords.txt','r') as f:
    pat='|'.join(fnmatch.translate(badword)[:-1] for badword in
                 f.read().splitlines())
for fname in os.listdir('.'):
    new_fname=re.sub(pat,'',fname)
    if fname != new_fname:
        print('{o} --> {n}'.format(o=fname,n=new_fname))
        os.rename(fname, new_fname)
# MyMusic.Songname.Artist-mp3.iTunes.mp3 --> Songname.Artist.mp3
Note that it is possible for some files to be overwritten (and thus
lost) if two names get reduced to the same shortened name after
badwords have been removed. A set of new fnames could be kept and
checked before calling os.rename to prevent losing data through
name collisions.
fnmatch.translate takes shell-style patterns and returns the
equivalent regular expression. It is used above to convert badwords
(e.g. '.iTunes') into regular expressions (e.g. r'\.iTunes').
Your badwords list seems to indicate you want to ignore case. You
could ignore case by adding '(?i)' to the beginning of pat:
with open('badwords.txt','r') as f:
    pat='(?i)'+'|'.join(fnmatch.translate(badword)[:-1] for badword in
                        f.read().splitlines())
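One caveat, as far as I can tell: the [:-1] slice assumes fnmatch.translate ends its output with a single character that can be cut off, and on recent Python 3 releases the suffix is longer, so the slice may leave an invalid pattern. Since the badwords are literal strings rather than wildcards, one way around this is re.escape. The sketch below swaps in that technique (it is not the method of the answer above) and also adds the collision check suggested earlier; build_pattern and plan_renames are names made up for the example.

```python
import re


def build_pattern(badwords, ignore_case=True):
    """Compile a regex matching any of the literal badwords."""
    # Longest first, so a longer word wins over a shorter prefix of itself.
    words = sorted((w for w in badwords if w), key=len, reverse=True)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile("|".join(re.escape(w) for w in words), flags)


def plan_renames(filenames, pattern):
    """Return {old: new} for names that change, skipping collisions."""
    taken = set(filenames)  # names that already exist or are already planned
    plan = {}
    for old in filenames:
        new = pattern.sub("", old)
        if new and new != old and new not in taken:
            plan[old] = new
            taken.add(new)
    return plan


pattern = build_pattern(["MyMusic.", "-mp3", ".iTunes"])
print(plan_renames(["MyMusic.Songname.Artist-mp3.iTunes.mp3"], pattern))
```

The badwords list here is hard-coded for illustration; in practice it could come from reading badwords.txt line by line, and each old/new pair in the returned plan could then be passed to os.rename.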
