Find 'all' files in a directory, not all files found - python

Using Python, I'm trying to find all files under /sys and match a certain file. The problem is that not all files are being found. It's not a matter of access: I know Python can read and write the file, which I've tested manually using open("file_path", "w") and file.write(). Is there some trick to locating files that I'm missing here?
import os, re
for roots, dirs, files in os.walk('/sys'):
    match = re.search(r'\S+/rq_affinity', roots)
    if match:
        print(match.group())
I've already tried writing every single file found using os.walk() to a file and then using the shell and grep to see if the file I'm looking for is there, so the problem isn't with matching.
FIXED search:
import os, re
for roots, dirs, files in os.walk('/sys'):
    for file in files:
        match = re.search(r'\S+/rq_affinity', os.path.join(roots, file))
        if match:
            print(match.group())

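rq_affinity is a file, isn't it? Why would you expect it to show up in roots? os.walk yields file names in the third element of each tuple, not in the directory path.
Also, the entries under /sys/dev/block are symlinks, so you need to tell os.walk to follow them with followlinks=True.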
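Putting both points together, a minimal sketch might look like the following. The helper name find_files is made up for illustration, and the loop guard is an addition that the answers above do not mention: following symlinks can revisit the same directory, so the sketch remembers the real path of every directory it has already seen.

```python
import os

def find_files(root, name, followlinks=False):
    """Yield full paths of files called `name` anywhere under `root`.

    `find_files` is a hypothetical helper, not part of the standard library.
    """
    seen = set()  # real paths of directories already visited
    for dirpath, dirnames, filenames in os.walk(root, followlinks=followlinks):
        real = os.path.realpath(dirpath)
        if real in seen:
            dirnames[:] = []  # already visited via another link: do not descend
            continue
        seen.add(real)
        if name in filenames:
            yield os.path.join(dirpath, name)
```

For the question above, the call would be something like find_files('/sys', 'rq_affinity', followlinks=True).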

Related

Python Match Portion of File Name

I am trying to match file names within a folder using python so that I can run a secondary process on the files that match. My file names are such that they begin differently but match strings at some point as below:
3322_VEGETATION_AREA_2009_09
3322_VEGETATION_LINE_2009_09
4522_VEGETATION_POINT_2009_09
4422_VEGETATION_AREA_2009_09
8722_VEGETATION_LINE_2009_09
2522_VEGETATION_POINT_2009_09
4222_VEGETATION_AREA_2009_09
3522_VEGETATION_LINE_2009_09
3622_VEGETATION_POINT_2009_09
Would regex be the right approach to matching those files after the first underscore or am I overthinking this?
import glob
files = glob.glob("*VEGETATION*")
should do the trick: it finds all files in the current directory that contain "VEGETATION" somewhere in the filename.
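If you want to check a pattern before pointing it at a real folder, the fnmatch module (which glob uses for its wildcard matching) can be applied to a plain list of names. This is only a sketch; the helper name match_names and the stricter default pattern are illustrative assumptions.

```python
import fnmatch

def match_names(names, pattern="*_VEGETATION_*"):
    """Return the names that match a shell-style wildcard pattern, sorted."""
    return sorted(fnmatch.filter(names, pattern))

names = [
    "3322_VEGETATION_AREA_2009_09",
    "3322_VEGETATION_LINE_2009_09",
    "4522_VEGETATION_POINT_2009_09",
    "README.txt",
]
all_vegetation = match_names(names)
areas_only = match_names(names, "*_VEGETATION_AREA_*")
```

The same patterns can then be passed to glob.glob() unchanged.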

Python NLTK Make corpus from zip files

I'm trying to create my own corpus in NLTK from ca. 200k text files, each stored in its own zip archive. The layout looks like this:
Parent_dir
    text1.zip
        text1.txt
I'm using the following code to try to access all the text files from the parent directory:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus_path="parent_dir"
corpus=PlaintextCorpusReader(corpus_path,".*")
file_ids=corpus.fileids()
print(file_ids)
But Python just returns an empty list because it probably can't access the text files due to the zipping. Is there an easy way to fix this? Unfortunately, I can't unzip the files because of the size of the files.
If all you're trying to do is get the file IDs, just use the glob module, which doesn't care about file types.
Import the module (glob is part of the standard library, so there is nothing to install with pip):
from glob import glob
List your directory, using * as a wildcard to match everything in it:
directory = glob('/path/to/your/corpus/*')
The glob() function returns a list of strings (file paths, in this case).
You can simply iterate over these to print each path:
for file in directory:
    print(file)
This article looks like an answer to your question about reading the contents of a zipped file: How to read text files in a zipped folder in Python
I think a combination of these methods makes an answer to your problem.
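As a rough sketch of that combination: the standard zipfile module can read a member straight into memory, so nothing has to be unpacked to disk. The function name iter_zipped_texts, the UTF-8 encoding, and the assumption that every archive sits directly in the parent directory are all illustrative guesses about your data.

```python
import glob
import os
import zipfile

def iter_zipped_texts(parent_dir, encoding="utf-8"):
    """Yield (zip_path, member_name, text) for each .txt file in each zip."""
    for zip_path in sorted(glob.glob(os.path.join(parent_dir, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if member.endswith(".txt"):
                    # read() decompresses in memory; no file is extracted to disk
                    text = archive.read(member).decode(encoding)
                    yield zip_path, member, text
```

Because it is a generator, only one text is held in memory at a time, which matters with around 200k archives.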
Good luck!

Is it possible to download just part of a ZIP file using python zipfile library

Is there any way to download only part of a .rar or .zip file without downloading the whole file? Say a zip file contains files A, B, C and D, and I only need A. Can I somehow use the zipfile module to download just that one file?
I am trying the code below:
r = c.get(file)
z = ZipFile.ZipFile(BytesIO(r.content))
for file1 in z.namelist():
    if 'time' not in file1:
        print("hi")
        z.extractall(file1,download_path + filename)
This code downloads the whole zip file and then extracts only the specific member. Can I somehow download only the file I need?
There is a similar question here, but it only shows an approach using the Linux command line. That question doesn't address how it can be done using Python libraries.
The question @Juggernaut mentioned in a comment is actually very helpful, as it points you in the direction of the solution.
You need to create a replacement for BytesIO that returns the necessary information to ZipFile. You will need to get the length of the file, and then fetch whatever sections ZipFile asks for.
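A minimal sketch of such a replacement object is below. The class name RangeReader and the fetch(start, end) callable are hypothetical: over HTTP, fetch would presumably issue a request with a Range header for those bytes, and size would come from the server's reported content length. Neither the network code nor any error handling is shown.

```python
import io
import zipfile

class RangeReader(io.RawIOBase):
    """Read-only, seekable file object that fetches bytes on demand.

    `size` is the total length of the remote file. `fetch(start, end)` is a
    hypothetical callable returning the bytes from start up to (not
    including) end.
    """

    def __init__(self, size, fetch):
        super().__init__()
        self._size = size
        self._fetch = fetch
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        if new_pos < 0:
            # mimic real files, which refuse to seek before the start
            raise OSError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, n=-1):
        if n is None or n < 0:
            n = self._size - self._pos  # read to the end
        end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data
```

Passing an instance to zipfile.ZipFile(...) lets ZipFile seek to the end for the central directory and then read only the member you ask for, so the other members are never fetched.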
How large are those files? Is it really worth the trouble?
Use remotezip: https://github.com/gtsystem/python-remotezip. You can install it using pip:
pip install remotezip
Usage example:
from remotezip import RemoteZip
with RemoteZip("https://path/to/zip/file.zip") as zip_file:
    for file in zip_file.namelist():
        if 'time' not in file:
            print("hi")
            zip_file.extract(file, path="/path/to/extract")
Note that to use this approach, the web server from which you receive the file needs to support the Range header.

How to input multiple files from a directory

First and foremost, I am new to Unix. I have tried to find a solution to my question online, but I could not find one.
So I am running Python through my Unix terminal, and I have a program that parses xml files and inputs the results into a .dat file.
My program works, but I have to input every single xml file (which number over 50) individually.
For example:
clamshell: python3 my_parser2.py 'items-0.xml' 'items-1.xml' 'items-2.xml' 'items-3.xml' ...
So I was wondering whether my program can read all of the files from the directory that contains them, rather than me typing every xml file name individually.
Any help on this is greatly appreciated.
import glob
listOffiles = glob.glob('directory/*.xml')
The shell itself can expand wildcards so, if you don't care about the order of the input files, just use:
python3 my_parser2.py items-*.xml
If the numeric order is important (you want 0-9, 10-99 and so on, in that order), you may have to adjust the wildcard arguments slightly to guarantee this, such as with:
python3 my_parser2.py items-[0-9].xml items-[1-9][0-9].xml items-[1-9][0-9][0-9].xml
python3 my_parser2.py *.xml should work.
Other than the command line option, you could just use glob from within your script and bypass the need for command arguments:
import glob
filenames = glob.glob("*.xml")
This will return all .xml files (as filenames) in the directory from which you are running the script.
Then, if needed you can simply iterate through all the files with a basic loop:
for file in filenames:
    with open(file, 'r') as f:
        pass  # do stuff to f
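If you collect the names with glob inside the script and the numeric order matters, the list can be sorted by the number embedded in each name rather than alphabetically. This is only a sketch: the helper numeric_key is made up for illustration and assumes the first run of digits in the base name is the index.

```python
import os
import re

def numeric_key(filename):
    """Sort key: first run of digits in the base name as an int (-1 if none)."""
    found = re.search(r"\d+", os.path.basename(filename))
    return int(found.group()) if found else -1

names = ["items-10.xml", "items-2.xml", "items-0.xml", "items-1.xml"]
ordered = sorted(names, key=numeric_key)
```

In the script itself this would be sorted(glob.glob("*.xml"), key=numeric_key).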

python open file matching pattern excluding substring

I need to open some files inside a folder in python
Say, I have the following files in the folder:
text_pbs.fna
text_pdom_fo_oo.fna
text_pdom_fo_oo_aa.fna
text_pdom_fo_oo.ali
text_pdom_ba_ar.fna
text_pdom_ba_ar_aa.fna
text_pdom_ba_ar.ali
text_pdom_ba_az.fna
text_pdom_ba_az_aa.fna
text_pdom_ba_az.ali
I want to open:
text_pdom_fo_oo.fna
text_pdom_ba_ar.fna
text_pdom_ba_az.fna
only.
I tried with glob:
glob.glob('*_pdom_*[^aa].fna')
But it doesn't work.
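Many thanks in advance to anyone who can point out the problem in the above pattern. Is there any other workaround for this?
Python's glob does not treat ^ as negation inside brackets (it is matched as a literal character); use ! instead. Note that [!aa] is a character class matching any single character other than a, so it works here only because the unwanted names end in a. You should try this code:
import glob
glob.glob('*_pdom_*[!aa].fna')
which gives the result (the order may vary):
['text_pdom_fo_oo.fna','text_pdom_ba_ar.fna','text_pdom_ba_az.fna']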
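Since glob patterns cannot express "does not end in _aa", a somewhat sturdier sketch is to match broadly and then drop the unwanted suffix in Python. The helper name select_fna is made up for illustration, and it works on a plain list of names so it can be tried without touching the folder.

```python
from fnmatch import fnmatchcase

def select_fna(names):
    """Keep *_pdom_*.fna names, excluding the ones that end in _aa.fna."""
    return [
        n for n in names
        if fnmatchcase(n, "*_pdom_*.fna") and not n.endswith("_aa.fna")
    ]
```

With real files, the input list could come from glob.glob('*_pdom_*.fna') or os.listdir().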
