How to extract files from across multiple folders in Python

How to extract files from across multiple folders in Python - python

Essentially... I have 27 folders on a local drive, each comprising tens of thousands of .jpgs. I have generated a random list of 5,000 of these images, distributed accross the 27 folders, that I wish to transfer into one folder.
I have a .csv list of all the 5,000 filenames I need, and was wondering what the easiest way to get these all into one folder would be?
I see plenty of onine resources explaing how to achieve this using 'glob.glob' and 'os.walk' methods to extract all files with a specific file extension (e.g. .txt), however I need to extract them based on the specific filenames that I have in a list.
I generally work in Python, but if there's another obvious non-coding way of doing this without it that I'm missing, please do suggest that too.
Thanks, R

This is a relatively basic Python solution, but maybe you can tweak it to something more useful.
import csv
import os
import shutil
# Generator for all files in root_dir
def list_files(my_root_dir):
for path, _, files in os.walk(my_root_dir):
yield path, files
def copy_files(my_csv_file, my_root_dir, my_target_dir):
with open(my_csv_file, "r") as csvfile:
csv_reader = csv.reader(csvfile, delimiter=",")
for line in csv_reader:
for image in line:
for path, files in list_files(my_root_dir):
# Careful not to copy files already copied to target_dir
if image in files and path != my_target_dir:
shutil.copy(os.path.join(path, image), my_target_dir)
break
If you want to stick to the shell, there are some ideas here: https://unix.stackexchange.com/questions/402728/how-to-move-files-specified-in-a-text-file-to-another-directory-on-bash.

Related

Copying files from one location of a server to another using python

Say I have a file that contains the different locations where some '.wav' files are present on a server. For example say the content of the text file location.txt containing the locations of the wav files is this
/home/user/test_audio_folder_1/audio1.wav
/home/user/test_audio_folder_2/audio2.wav
/home/user/test_audio_folder_3/audio3.wav
/home/user/test_audio_folder_4/audio4.wav
/home/user/test_audio_folder_5/audio5.wav
Now what I want to do is that I want to copy these files from different locations within the server to a particular directory within that server, for example say /home/user/final_audio_folder/ and this directory will contain all the audio files from audio1.wav to audio5.wav
I am trying to perform this task by using shutil, but the main problem with shutil that I am facing is that while copying the files, I need to name the file. I have written a demo version of what I am trying to do, but dont know how to scale it when I will be reading the paths of the '.wav' files from the txt file and copy them to my desired location using a loop.
My code for copying a single file goes as follows,
import shutil
original = r'/home/user/test_audio_folder_1/audio1.wav'
target=r'/home/user/final_audio_folder_1/final_audio1.wav'
shutil.copyfile(original,target)
Any suggestions will be really helpful. Thank you.

import shutil
i=0
with open(r'C:/Users/turing/Desktop/location.txt', "r") as infile:
for t in infile:
i+=1
x="audio"+str(i)+".wav"
t=t.rstrip('\n')
original= r'{}'.format(t)
target=r'C:/Users/turing/Desktop/audio_in/' + x
shutil.copyfile(original, target)

Use the built-in string's split() method within a for loop on the location.txt contents & split the name of the directory on the '/' character, then the last element in a new list would be your filename.

How to load data set having multiple 'No-extension files' in python?

I am trying to load a dataset for my machine learning project and it requires me to load files having no extensions.
I tried :
import os
import glob
files = filter(os.path.isfile, glob.glob("./[0-9]*"))
for name in files:
with open(name) as fh:
contents = fh.read()
But doesn't return anything, mainly that glob command has nothing in it.
Also tried :
import os
import glob
path = './dataset1/training_validation/2012-07-10/'
for infile in glob.glob(os.path.join(path, '*')):
print("test")
file = open(infile, 'r')
print(file)
but this returns [] because of that glob command.
I'm stuck in here and couldn't find anything over the internet.
My actual problem is to load 'no extension files in a training and testing set' from two folders, validation, and the test itself. I can iterate through the folder but don't know how to handle those file types.
When I open those files in a text editor. it shows me something like this.
So I know that it's a binary format of an image, but have no idea how can I store and train them.
any help would be appreciated. thanks.

Two things:
File extensions (.txt , .dat , .bat, .f90, etc.) are not meaningful to python, at least when using glob or numpy or something of the sort, because it's just part of a string. Some of us are raised (within Windows) to believe that file extensions mean something (I too fell for it).
The file you are looking at is a text file, containing the ASCII representation of a binary image on 0's and 1's. So, it's not a binary file, and it's not an image file (per-se), but it is a text file, which means we can read it as such from python.
To read this in, you could do either:
1. Use numpy to do data = numpy.loadtxt(<filename>), however you might have trouble delimiting the digits.
2. Use Python's standard open function on the file, and loop through each line using for line in <file_handle>:. This way, each row of data is a string, which can be parsed easily (see documentation on string indexing).
Good luck!

IMO this simply means that your path does not exist.
Perhaps you try in a first test an absolute path to your folder, as you eventually confused the relative position of the folder to your current working directory.

I got it to work with the following code.
fileNames = [f for f in listdir(dirName) if isfile(join(dirName, f))]
random.shuffle(fileNames)
for files in fileNames:
data = open(dirName+'/'+files,'r');
Thanks for your responses.

Iterating over .wav files in subdirectories of parent directory

Cheers everybody,
I need help with something in python 3.6 exactly. So i have structure of data like this:
|main directory
| |subdirectory's(plural)
| | |.wav files
I'm currently working from a directory where main directory is placed so I don't need to specify paths before that. So firstly I wanna iterate over my main directory and find all subdirectorys. Then in each of them I wanna find the .wav files, and when done with processing them I wanna go to next subdirectory and so on until all of them are opened, and all .wav files are processed. Exactly what I wanna do with those .wav files is input them in my program, process them so i can convert them to numpy arrays, and then I convert that numpy array into some other object (working with tensorflow to be exact, and wanna convert to TF object). I wrote about the whole process if anybody has any fast advices on doing that too so why not.
I tried doing it with for loops like:
for subdirectorys in open(data_path, "r"):
for files in subdirectorys:
#doing some processing stuff with the file
The problem is that it always raises error 13, Permission denied showing on that data_path I gave him but when I go to properties there it seems okay and all permissions are fine.
I tried some other ways like with os.open or i replaced for loop with:
with open(data_path, "r") as data:
and it always raises permission denied error.
os.walk works in some way but it's not what I need, and when i tried to modify it id didn't give errors but it also didnt do anything.
Just to say I'm not any pro programmer in python so I may be missing an obvious thing but ehh, I'm here to ask and learn. I also saw a lot of similiar questions but they mainly focus on .txt files and not specificaly in my case so I need to ask it here.
Anyway thanks for help in advance.

Edit: If you want an example for glob (more sane), here it is:
from pathlib import Path
# The pattern "**" means all subdirectories recursively,
# with "*.wav" meaning all files with any name ending in ".wav".
for file in Path(data_path).glob("**/*.wav"):
if not file.is_file(): # Skip directories
continue
with open(file, "w") as f:
# do stuff
For more info see Path.glob() on the documentation. Glob patterns are a useful thing to know.
Previous answer:
Try using either glob or os.walk(). Here is an example for os.walk().
from os import walk, path
# Recursively walk the directory data_path
for root, _, files in walk(data_path):
# files is a list of files in the current root, so iterate them
for file in files:
# Skip the file if it is not *.wav
if not file.endswith(".wav"):
continue
# os.path.join() will create the path for the file
file = path.join(root, files)
# Do what you need with the file
# You can also use block context to open the files like this
with open(file, "w") as f: # "w" means permission to write. If reading, use "r"
# Do stuff
Note that you may be confused about what open() does. It opens a file for reading, writing, and appending. Directories are not files, and therefore cannot be opened.
I suggest that you Google for documentation and do more reading about the functions used. The documentation will help more than I can.
Another good answer explaining in more detail can be seen here.

import glob
import os
main = '/main_wavs'
wavs = [w for w in glob.glob(os.path.join(main, '*/*.wav')) if os.path.isfile(w)]
In terms of permissions on a path A/B/C... A, B and C must all be accessible. For files that means read permission. For directories, it means read and execute permissions (listing contents).

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!

You are looking to avoid extracting to disk, in the zip docs for python there is ZipFile.open() which gives you a file-like object. That is an object that mostly behaves like a regular file on disk, but it is in memory. It gives a bytes array when read, at least in py3.
Something like this...
from zipfile import ZipFile
import csv
with ZipFile('abc.zip') as myzip:
print(myzip.filelist)
for mf in myzip.filelist:
with myzip.open(mf.filename) as myfile:
mc = myfile.read()
c = csv.StringIO(mc.decode())
for row in c:
print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
For some reason csv.BytesIO is not implemented, hence the extra step via csv.StringIO.

Renaming and Saving Excel Files With Python

I've got a pretty simple task but I haven't done too many functions with excel within python and I'm not sure how to go about doing this.
What I need to do:
Look at many excel files within subfolders, rename them according to information within the file and store them in all in one folder somewhere else.
The data is structured like this:
Main Folder
Subfolder1
File1
File2
File3
...
For about a hundred subfolders and several files within each subfolder.
From here, I want to pull the company name, part number, and date from within the file and use those to rename the excel file. Not sure how to rename the file.
Then save it somewhere else. I'm having trouble finding all these functions, any advice?

Check the os and os.path module for listing folder contents (walk, listdir) and working with path names (abspath, basename etc.)
Also, shutil has some interesting functions for copying stuff. Check out copyfile and specify the dst parameter based on the data you read from the excel file.
This page can help you getting at the Excel data: http://www.python-excel.org/
You probably want to have some highlevel code like this:
for subfolder_name in os.listdir(MAIN_FOLDER):
# exercise left to reader: filter out non-folders
subfolder_path = os.path.join(MAIN_FOLDER, subfolder_name)
for excel_file_name in os.listdir(os.path.join(MAIN_FOLDER, subfolder_name)):
# exercise left to reader: filter out non-excel-files
excel_file_path = os.path.join(subfolder_path, excel_file_name)
new_excel_file_name = extract_filename_from_excel_file(excel_file_path)
new_excel_file_path = os.path.join(NEW_MAIN_FOLDER, subfolder_name,
new_excel_file_name)
shutil.copyfile(excel_file_path, new_excel_file_path)
You'll have to provide extract_filename_from_excel_file yourself using the xlrd module from the site I mentioned.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.