I have files in my present working directory that I would like to delete. They all have a filename that starts with the string 'words' (for example, files words_1.csv and words_2.csv). I want to match all files in the current directory that start with 'words' and delete them. What would the search pattern be?
I found this from here, but it doesn't quite answer the question.
import os, re

def purge(dir, pattern):
    for f in os.listdir(dir):
        if re.search(pattern, f):
            os.remove(os.path.join(dir, f))
t = 'words_1.csv'
print(t.startswith('words'))
That prints True - startswith is enough for a simple prefix check.
If you do want a regex, the pattern could be r'^words.*\.csv$', but I suggest you read the Python re docs.
If I'm understanding your question correctly, you have this function and you are asking how it may be used. You should be able to call simply:
purge('/path/to/your/dir','words.*')
This will remove any files starting with the string "words".
pattern is a regular expression pattern, not a shell glob. In your case, it's simply anything beginning with "words" and ending with ".csv", so you can use
pattern = r"^words.*\.csv$"
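As a sanity check before deleting anything, you can dry-run an anchored pattern like r'^words.*\.csv$' against sample names first (the filenames below are hypothetical):

```python
import re

# Anchored regex: starts with 'words', ends with '.csv' (dot escaped).
pattern = r'^words.*\.csv$'

names = ['words_1.csv', 'words_2.csv', 'other.csv', 'words_notes.txt']
matched = [n for n in names if re.search(pattern, n)]
print(matched)  # ['words_1.csv', 'words_2.csv']
```

Once the dry run matches only what you expect, pass the same pattern to purge().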
I want to extract just one specific file from a zip archive that contains the 3 files below.
Basically it should start with 'kpidata_nfile' and should not contain 'fileheader'
kpidata_nfile_20220919-20220925_fileheader.csv
kpidata_nfile_20220905-20220911.csv
othername_kpidata_nfile_20220905-20220911.csv
Below is the code I have tried:
from zipfile import ZipFile
import re
import os

for x in os.listdir('.'):
    if re.match(r'.*\.zip$', x):
        with ZipFile(x, 'r') as zip:
            for info in zip.infolist():
                if re.match(r'^kpidata_nfile_', info.filename):
                    zip.extract(info)
Output required - kpidata_nfile_20220905-20220911.csv
This regex does what you require:
^kpidata_nfile(?:(?!fileheader).)*$
See this answer for more about the (?:(?!fileheader).)*$ part.
You can see the regex working on your example filenames here.
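As a quick sketch of how that pattern behaves on the three sample filenames (re.match anchors at the start, like ^):

```python
import re

pattern = r'^kpidata_nfile(?:(?!fileheader).)*$'
names = [
    'kpidata_nfile_20220919-20220925_fileheader.csv',  # contains 'fileheader'
    'kpidata_nfile_20220905-20220911.csv',             # the one we want
    'othername_kpidata_nfile_20220905-20220911.csv',   # wrong prefix
]
hits = [n for n in names if re.match(pattern, n)]
print(hits)  # ['kpidata_nfile_20220905-20220911.csv']
```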
The regex is not particularly readable, so it might be better to use Python expressions instead of regex. Something like:
fname = info.filename
if fname.startswith('kpidata_nfile') and 'fileheader' not in fname:
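Sketched against a plain list of the sample names (since we don't have the actual zip here), the non-regex filter looks like this:

```python
names = [
    'kpidata_nfile_20220919-20220925_fileheader.csv',
    'kpidata_nfile_20220905-20220911.csv',
    'othername_kpidata_nfile_20220905-20220911.csv',
]

# Keep names with the right prefix that don't mention 'fileheader'.
wanted = [fname for fname in names
          if fname.startswith('kpidata_nfile') and 'fileheader' not in fname]
print(wanted)  # ['kpidata_nfile_20220905-20220911.csv']
```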
Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs the parts of filenames corresponding to a certain pattern and outputs them to a text file.
So in my case, let's say I have four .pdf like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?
Here's a simple solution using the re module as mentioned in other answers.
# Libraries
import re

# Example filenames. Use glob as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']  # glob.glob("*.pdf")

for fname in file_list:
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
Good luck!
Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>
If the numbers are variable length, you'll want the regex module "re"
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
# '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by a variable number of digits ("[0-9]+"), followed by the last period in the string ("\.[^\.]+$")
You can use the os module in python and do a listdir to get a list of filenames present in that path like so:
import os
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:
import re

for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:
        print(m.group(0))

The above snippet prints the part of the filename following ID_, stopping at the first character that is not a word character. Since \w does not match '.', the .pdf extension is excluded automatically, so for your example it would print 4421, 8423 and so on.
You probably want to use glob, which is a python module for file globbing. From the python help page the usage is as follows:
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
I want to match all the files in a folder using regular expressions for some reason:
I used this:
re.compile(r'\.*$')
But this is also matching hidden files and temp files.
Is there a better option?
This makes the assumption that you're wanting to do something with these file names. As someone mentioned in the comments you should use glob. Since I'm not sure what you're going for with the 'temp' files this was the simplest thing. It will return no hidden files. Files is a list of file paths from your current working directory.
import os, glob
files = [f for f in glob.glob('./*') if os.path.isfile(f)]
Try re.compile(r'\w+\.*\w*') to match alphanumeric file names with a possible dot extension.
\w+ matches one or more word characters [a-zA-Z0-9_]
\.* matches zero or more '.' characters
\w* matches zero or more file-extension word characters.
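One caveat: re.match only anchors at the start, so re.fullmatch makes the filter stricter. Here's a small sketch showing the pattern rejecting hidden files (which begin with '.', not a word character); the filenames are hypothetical:

```python
import re

pattern = re.compile(r'\w+\.*\w*')

names = ['report.csv', 'notes', '.bashrc', '.hidden.tmp']
# fullmatch requires the whole name to fit the pattern; dotfiles fail at \w+.
visible = [n for n in names if pattern.fullmatch(n)]
print(visible)  # ['report.csv', 'notes']
```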
Kodos is an excellent Python regular expression developer/debugger.
Get all files from a directory just using this:
import os
files = [f for p, d, f in os.walk('/foo/bar')][0]
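That works, but it walks the entire tree just to keep the first triple. Calling next() on the generator stops after the top directory; here's a sketch using a throwaway temp directory so it's runnable:

```python
import os
import tempfile

# Throwaway directory with one file, standing in for /foo/bar.
target = tempfile.mkdtemp()
open(os.path.join(target, 'a.txt'), 'w').close()

# os.walk yields (dirpath, dirnames, filenames); next() takes the top level only.
_, _, files = next(os.walk(target))
print(files)  # ['a.txt']
```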
I am trying to write a small python script to rename a bunch of filenames by searching and replacing. For example:
Original filename:
MyMusic.Songname.Artist-mp3.iTunes.mp3
Intended result:
Songname.Artist.mp3
What I've got so far is:
#!/usr/bin/env python
from os import rename, listdir

mustgo = "MyMusic."
filenames = listdir('.')

for fname in filenames:
    if fname.startswith(mustgo):
        rename(fname, fname.replace(mustgo, '', 1))
(got it from this site as far as i can remember)
Anyway, this only gets rid of the string at the beginning, not of the strings elsewhere in the filename.
I would also like to use a separate file (e.g. badwords.txt) containing all the strings that should be searched for and replaced, so that I can update them without having to edit the whole script.
Content of badwords.txt
MyMusic.
-mp3
-MP3
.iTunes
.itunes
I have been searching for quite some time now but haven't found anything. I would appreciate any help!
Thank you!
import fnmatch
import re
import os

with open('badwords.txt', 'r') as f:
    # fnmatch.translate appends a trailing \Z anchor; strip it (removesuffix
    # needs Python 3.9+) so the alternatives can match anywhere in the name.
    pat = '|'.join(fnmatch.translate(badword).removesuffix('\\Z')
                   for badword in f.read().splitlines())

for fname in os.listdir('.'):
    new_fname = re.sub(pat, '', fname)
    if fname != new_fname:
        print('{o} --> {n}'.format(o=fname, n=new_fname))
        os.rename(fname, new_fname)

# MyMusic.Songname.Artist-mp3.iTunes.mp3 --> Songname.Artist.mp3
Note that it is possible for some files to be overwritten (and thus
lost) if two names get reduced to the same shortened name after
badwords have been removed. A set of new fnames could be kept and
checked before calling os.rename to prevent losing data through
name collisions.
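One way to implement that check is to plan all renames up front and skip any that would collide; the helper below and the sample names are hypothetical:

```python
import re

def plan_renames(fnames, pat):
    """Map old name -> new name, skipping renames that would collide."""
    taken = set(fnames)   # existing names count as collisions too
    plan = {}
    for fname in fnames:
        new_fname = re.sub(pat, '', fname)
        if new_fname == fname or new_fname in taken:
            continue      # unchanged, or would clobber another file
        taken.add(new_fname)
        plan[fname] = new_fname
    return plan

pat = r'MyMusic\.|-mp3|\.iTunes'
plan = plan_renames(['MyMusic.A-mp3.mp3', 'MyMusic.B.mp3', 'B.mp3'], pat)
print(plan)  # {'MyMusic.A-mp3.mp3': 'A.mp3'} -- 'MyMusic.B.mp3' is skipped
```

The renames in plan can then be applied with os.rename, knowing none of them clobber an existing or newly created file.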
fnmatch.translate takes shell-style patterns and returns the
equivalent regular expression. It is used above to convert badwords
(e.g. '.iTunes') into regular expressions (e.g. r'\.iTunes').
Your badwords list seems to indicate you want to ignore case. You
could ignore case by adding '(?i)' to the beginning of pat:
with open('badwords.txt', 'r') as f:
    pat = '(?i)' + '|'.join(fnmatch.translate(badword).removesuffix('\\Z')
                            for badword in f.read().splitlines())
Ant has a nice way to select groups of files, most handily using ** to indicate a directory tree. E.g.
**/CVS/* # All files immediately under a CVS directory.
mydir/mysubdir/** # All files recursively under mysubdir
More examples can be seen here:
http://ant.apache.org/manual/dirtasks.html
How would you implement this in python, so that you could do something like:
files = get_files("**/CVS/*")
for file in files:
    print(file)
=>
CVS/Repository
mydir/mysubdir/CVS/Entries
mydir/mysubdir/foo/bar/CVS/Entries
Sorry, this is quite a long time after your OP. I have just released a Python package which does exactly this - it's called Formic and it's available at the PyPI Cheeseshop. With Formic, your problem is solved with:
import formic

fileset = formic.FileSet(include="**/CVS/*", default_excludes=False)
for file_name in fileset.qualified_files():
    print(file_name)
There is one slight complexity: default_excludes. Formic, just like Ant, excludes CVS directories by default (for the most part, collecting files from them for a build is dangerous), so the default settings would return no files for this question. Setting default_excludes=False disables this behaviour.
As soon as you come across a **, you're going to have to recurse through the whole directory structure, so I think at that point, the easiest method is to iterate through the directory with os.walk, construct a path, and then check if it matches the pattern. You can probably convert to a regex by something like:
import os
import re

def glob_to_regex(pat, dirsep=os.sep):
    dirsep = re.escape(dirsep)
    regex = (re.escape(pat).replace("\\*\\*" + dirsep, ".*")
                           .replace("\\*\\*", ".*")
                           .replace("\\*", "[^%s]*" % dirsep)
                           .replace("\\?", "[^%s]" % dirsep))
    return re.compile(regex + "$")
(Though note that this isn't fully featured - it doesn't support [a-z]-style glob patterns, for instance, though that could probably be added.) (The first "**" + dirsep replacement is to cover cases like **/CVS matching ./CVS, as well as a bare ** matching at the tail.)
However, obviously you don't want to recurse through everything below the current dir when not processing a ** pattern, so I think you'll need a two-phase approach. I haven't tried implementing the below, and there are probably a few corner cases, but I think it should work:
Split the pattern on your directory separator, i.e. pat.split('/') -> ['**', 'CVS', '*']
Recurse through the directories, and look at the relevant part of the pattern for this level, i.e. n levels deep -> look at pat[n].
If pat[n] == '**' switch to the above strategy:
Reconstruct the pattern with dirsep.join(pat[n:])
Convert to a regex with glob_to_regex()
Recursively os.walk through the current directory, building up the path relative to the level you started at. If the path matches the regex, yield it.
If pat[n] is not '**' and it is the last element in the pattern, then yield all files/dirs matching glob.glob(os.path.join(curpath, pat[n]))
If pat[n] is not '**' and it is NOT the last element in the pattern, then for each directory, check whether it matches pat[n] (with glob); if so, recurse down through it, incrementing depth (so it will look at pat[n+1])
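For what it's worth, on Python 3.5+ the stdlib glob module understands ** natively via recursive=True, which covers the common cases without hand-rolling the recursion. A runnable sketch that builds a throwaway tree mirroring the question's example:

```python
import glob
import os
import tempfile

# Build a throwaway tree mirroring the question's example.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'CVS'))
os.makedirs(os.path.join(root, 'mydir', 'mysubdir', 'CVS'))
open(os.path.join(root, 'CVS', 'Repository'), 'w').close()
open(os.path.join(root, 'mydir', 'mysubdir', 'CVS', 'Entries'), 'w').close()

os.chdir(root)
# '**' matches zero or more intermediate directories, just as in Ant.
matches = sorted(glob.glob('**/CVS/*', recursive=True))
for name in matches:
    print(name)
```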
os.walk is your friend. Look at the example in the Python manual
(https://docs.python.org/2/library/os.html#os.walk) and try to build something from that.
To match "**/CVS/*" against a file name you get, you can do something like this:
import fnmatch

def match(pattern, filename):
    if pattern.startswith("**"):
        return fnmatch.fnmatch(filename, pattern[1:])
    else:
        return fnmatch.fnmatch(filename, pattern)
In fnmatch.fnmatch, "*" matches anything (including slashes).
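That's easy to verify with a one-line sketch:

```python
import fnmatch

# fnmatch's '*' is not directory-aware: it happily crosses '/'.
print(fnmatch.fnmatch('mydir/mysubdir/CVS/Entries', '*/CVS/*'))  # True
```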
There's an implementation in the 'waf' build system source code.
http://code.google.com/p/waf/source/browse/trunk/waflib/Node.py?r=10755#471
May be this should be wrapped up in a library of its own?
Yup. Your best bet is, as has already been suggested, to work with os.walk. Or perhaps write wrappers around the glob and fnmatch modules.
os.walk is your best bet for this. I did the example below with .svn because I had that handy, and it worked great:
import os
import re

for (dirpath, dirnames, filenames) in os.walk("."):
    if re.search(r'\.svn$', dirpath):
        for file in filenames:
            print(file)