Grabbing all files in a query using Python

I'm processing a large number of files using Python. The files are related to each other through their file names.
If I were to do this with a CMD command (on Windows), it would look something like:
DIR filePrefix_??.txt
And this would return all the file names I would need for that group.
Is there a similar function that I can use in Python?

Have a look at the glob module.
glob.glob("filePrefix_??.txt")
returns a list of matching file names.
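For example, a minimal sketch (assuming the files sit in the current working directory):
import glob

# ? matches exactly one character, just like the CMD wildcard.
for path in sorted(glob.glob("filePrefix_??.txt")):
    print(path)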

Related

read a csv file with specific pattern dynamically in pandas

I have some CSV files whose filename suffix keeps changing all the time.
Active_Count_1618861363072
Deposit_1618861402104
Game_Type_Wise_Net_Sell_1618861383176
Total_Count_1618861351976
I want to read these files automatically:
df1 = pd.read_csv('Active_Count_*.csv')
df2 = pd.read_csv('Deposit_*.csv')
df3 = pd.read_csv('Game_Type_Wise_Net_Sell_*.csv')
df4 = pd.read_csv('Total_Count_*.csv')
I want to keep the part after the underscore dynamic, so the CSV files load regardless of the suffix.
Is there a way I can achieve this?
This can be achieved outside Pandas using only standard Python functionality:
import glob
active_count_filename = glob.glob('Active_Count_*.csv')[0]
df1 = pd.read_csv(active_count_filename)
This assumes that there is exactly one Active_Count_* file: if none exists, the [0] index will raise an IndexError, and if more than one exists, an arbitrary one will be chosen.
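Extending that to all four prefixes, a sketch along these lines should work (the dataframes dict is an illustrative name, not part of the original question):
import glob
import pandas as pd

prefixes = ['Active_Count_', 'Deposit_', 'Game_Type_Wise_Net_Sell_', 'Total_Count_']
dataframes = {}
for prefix in prefixes:
    matches = glob.glob(prefix + '*.csv')
    if matches:
        # Take the first match; adjust if several files can share a prefix.
        dataframes[prefix] = pd.read_csv(matches[0])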

File path/name under Databricks file system

I use the glob function to grab directory/file names under regular Python.
For example:
glob.glob("/dbfs/mnt/.../*/A*.txt")
However, I just realized that under DBFS the full path name starts with /mnt. Is there a way under PySpark, similar to glob, to get the file directory/name list?
Thanks,
If you only want to get the directory/file name list, you can only do it in Python.
PySpark can process files matching a pattern, e.g. sc.textFile("/dbfs/mnt/.../*/A*.txt"), but it does not return the list of matching names.
PySpark is a processing engine, not a framework for filesystem tasks.
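If you do need the list on the Databricks driver, a minimal sketch (assuming a hypothetical mount point /dbfs/mnt/mydata, and that on Databricks the local /dbfs FUSE mount corresponds to the dbfs:/ scheme):
import glob

# Enumerate matching files through the local /dbfs mount (runs on the driver only).
paths = glob.glob('/dbfs/mnt/mydata/*/A*.txt')

# Rewrite local-mount paths to the dbfs: scheme before handing them to Spark.
# Assumes at least one match; sc.textFile accepts comma-separated paths.
spark_paths = [p.replace('/dbfs/', 'dbfs:/', 1) for p in paths]
rdd = sc.textFile(','.join(spark_paths))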

How can I read all files from a directory and do operations on them in parallel?

Suppose I have some files in a directory, and I want to read each file and extract the file name and the first row of the file (i.e. the header) for some validation. How can we do this in Spark (using Python)?
input_file = sc.textFile(sourceFileDir)
With sc.textFile() we can read all files in parallel, and with map we can apply any rule or function to each element in the RDD. What I don't understand is how to fetch only the file name and one row of each file using sc.textFile().
Currently, I am meeting this requirement (mentioned above) using a for loop:
files = os.listdir(sourceFileDir)
for x in files:
    # ...operations...
How can I do the same for all files in a parallel manner? That would save some time, as there are lots of files in the directory.
Thanks in advance.
textFile is not what you are looking for. You should use wholeTextFiles. It creates an RDD with the file name as the key and the file content as the value. Then you apply a map to keep only the first line:
sc.wholeTextFiles(sourceFileDir).map(lambda x: (x[0], x[1].split('\n')[0]))
By doing that, the output of your map is the file name and the first line.
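A sketch of how the result might be used for the validation (collect() is safe here because only one line per file comes back to the driver):
headers = (sc.wholeTextFiles(sourceFileDir)
             .map(lambda x: (x[0], x[1].split('\n')[0]))
             .collect())

for file_name, header in headers:
    # Replace the print with the actual validation rules.
    print(file_name, header)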

How to have multiple programs access the same file without manually giving them all the file path?

I'm writing several related Python programs that need to access the same file. However, this file will be updated/replaced intermittently, and I need them all to access the new file. My current idea is to have a specific folder where the latest file is placed whenever it needs to be replaced, and I am curious how I could have Python select whatever text file is in that folder.
Or would I be better off creating a program with a class entirely dedicated to holding the information of the file, and having each program reference the file through that class? I could have the class use tkinter.filedialog to select a new file whenever necessary, and perhaps keep a text file holding the path or name of the file I need to access, and have the other programs reference that.
Edit: I don't need to write to the file at all, just read from it. However, I would like it set up so that I do not need to manually update the file path every time I run the program or replace the file.
Edit2: Changed title to suit the question better.
If the requirement is to get the most recently modified file in a specific directory:
import os

mypath = r'C:\path\to\wherever'
# Pair each file name with its last-modified time.
myfiles = [(f, os.stat(os.path.join(mypath, f)).st_mtime) for f in os.listdir(mypath)]
# Sort newest first.
mysortedfiles = sorted(myfiles, key=lambda x: x[1], reverse=True)
print('Most recently updated: %s' % mysortedfiles[0][0])
Basically, get the files in the directory together with their modified times as a list of tuples, sort on modified date, then take the one you want.
It sounds like you're looking for a singleton pattern, which is a neat way of hiding a lot of logic into an 'only one instance' object.
This means the logic for identifying, retrieving, and delivering the file is all in one place, and your programs interact with it by saying 'give me the one instance of that thing'. If you need to alter how it identifies, retrieves, or delivers what that one thing is, you can keep that hidden.
It's worth noting that the singleton pattern can be considered an antipattern, as it is a form of global state; whether that is a deal-breaker depends on the context of the program.
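As a minimal sketch of the idea (the class name, folder, and *.txt pattern are assumptions; in Python, a module-level instance is a simple way to get singleton-like behaviour):
import glob
import os

class LatestFile:
    """Holds the logic for locating the current version of the shared file."""

    def __init__(self, folder):
        self.folder = folder

    def path(self):
        # Re-scan on every call so a replaced file is picked up automatically.
        # Assumes the folder contains at least one .txt file.
        candidates = glob.glob(os.path.join(self.folder, '*.txt'))
        return max(candidates, key=os.path.getmtime)

# Every program imports this single instance instead of hard-coding a path.
shared_file = LatestFile(r'C:\path\to\shared_folder')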
To "have python select whatever text file is in the folder", you could use the glob library to get a list of file(s) in the directory, see: https://docs.python.org/2/library/glob.html
You can also use os.listdir() to list all of the files in a directory, without pattern matching on the names.
Then, open() and read() whatever file or files you find in that directory.
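For example, a small sketch (the folder path and *.txt pattern are assumptions):
import glob
import os

folder = r'C:\path\to\shared_folder'
matches = glob.glob(os.path.join(folder, '*.txt'))

if matches:
    # Assumes the folder holds a single text file, per the question.
    with open(matches[0]) as f:
        data = f.read()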

Python-2.x: list directory without os.listdir()

With os.listdir(some_dir) we can get all the files from some_dir, but sometimes there are 20M files (no sub-dirs) under some_dir, and it takes a long time for os.listdir() to return 20M strings.
(We don't think it's wise to put 20M files under a single directory, but it's really there and out of my control...)
Is there any other generator-like method to do the list operation, i.e. once a file is found, yield it so we can fetch it and then move on to the next one?
I have tried os.walk(); it's really a generator-style tool, but it also calls os.listdir() to do the listing, and it cannot handle unicode file names well (UTF-8 names mixed with GBK names).
If you have Python 3.5+, you can use os.scandir(); see the documentation for scandir.
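os.scandir() returns a lazy iterator of DirEntry objects rather than building the whole 20M-element list up front. A sketch:
import os

for entry in os.scandir(some_dir):   # yields entries one at a time
    if entry.is_file():
        handle(entry.name)  # handle() is a placeholder for your own processing
Since the question targets Python 2.x: the same API is available as the third-party scandir package on PyPI, the backport that os.scandir() grew out of.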
