I'd like to assign a unique variable name to each file in a directory. I have no idea how this can be done. I'm new to Python, so I'm sorry if the code is scruffy.
def DataFinder(path, extension):
    import os
    count = 0
    extensions = ['.txt', '.csv', '.xls', '.xlsm', '.xlsx']
    allfiles = []
    if extension not in extensions:
        print('Can\'t read data from this file type.\n', 'Allowed file types are\n', str(extensions))
    else:
        #loop through the files
        for root, dirs, files in os.walk(path):
            for file in files:
                #check if the file ends with the extension
                if file.endswith(extension):
                    count += 1
                    print(str(count) + ': ' + file)
                    allfiles.append(file)
        if count == 0:
            print('There are no files with', extension, 'extension in this folder.')
    return allfiles
How can this code be modified to assign a variable name like df_number.of.file with each iteration, as a string?
Thanks
My ultimate goal is to have set of DataFrame objects for each file under unique variable name without a need to create those variables manually.
The suggested duplicate did not answer my question, nor did it work for me.
allfiles = {}
#filter through required data extensions
if extension not in extensions:
    print('Can\'t read data from this file type.\n', 'Allowed file types are\n', str(extensions))
else:
    #loop through the files
    for root, dirs, files in os.walk(path):
        for file in files:
            #check if the file ends with the extension
            if file.endswith(extension):
                #raise counter
                count += 1
                print(str(count) + ': ' + file)
                allfiles.update({'df' + str(count): path + file})
After adjusting the code as suggested my output was a dictionary:
{'df1': 'C:/Users/Bartek/Downloads/First.csv', 'df2': 'C:/Users/Bartek/Downloads/Second.csv', 'df3': 'C:/Users/Bartek/Downloads/Third.csv'}
I achieved a similar thing previously using a list:
['df_1First.csv', 'df_2Second.csv', 'df_3Third.csv']
But my exact question is how to achieve this:
for each object in dict:
    - create a variable with a consecutive object number
so these variables can be passed as the data argument to pandas.DataFrame()
I know this is a very bad idea (http://stupidpythonideas.blogspot.co.uk/2013/05/why-you-dont-want-to-dynamically-create.html), so can you please show me the proper way using a dict?
Many thanks
You should be able to modify this section of the code to accomplish what you desire. Instead of printing out the number of files, use count to create new unique filenames.
if file.endswith(extension):
    count += 1
    newfile = 'df_' + str(count) + file
    allfiles.append(newfile)
count will be unique for each file that matches the extension. You should be able to find the newly created file names in allfiles.
EDIT to use a dictionary (thanks Rory): I would suggest an alternative route. Create a dictionary and use the file name as the key.
allfilesdict = {}
...
if file.endswith(extension):
    count += 1
    newfile = 'df_' + str(count) + file
    allfilesdict[file] = newfile
Then remember to return allfilesdict if you are going to use it somewhere outside of your function.
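For the stated ultimate goal (a set of DataFrame objects without manually created variables), the same dict idea extends naturally to pandas. A minimal sketch, assuming CSV input; the load_dataframes name and the df_<n> key format are illustrative choices, not part of the original code:

import os
import pandas as pd

def load_dataframes(path, extension='.csv'):
    """Read every matching file into a dict of DataFrames keyed 'df_1', 'df_2', ..."""
    dataframes = {}
    count = 0
    for root, dirs, files in os.walk(path):
        for fname in files:
            if fname.endswith(extension):
                count += 1
                # store under a generated key instead of creating a variable
                dataframes['df_' + str(count)] = pd.read_csv(os.path.join(root, fname))
    return dataframes

dfs = load_dataframes('C:/Users/Bartek/Downloads')
# dfs['df_1'] is then the DataFrame for the first file found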
You can modify the first script like this.
from time import gmtime, strftime
import os

def DataFinder(path, extension):
    count = 0
    extensions = ['.txt', '.csv', '.xls', '.xlsm', '.xlsx']
    allfiles = []
    if extension not in extensions:
        print('Can\'t read data from this file type.\n', 'Allowed file types are\n', str(extensions))
    else:
        #loop through the files
        for root, dirs, files in os.walk(path):
            for file in files:
                #check if the file ends with the extension
                if file.endswith(extension):
                    count += 1
                    #take the current date and time
                    date_time = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                    #split the file name on the dot: file_name[0] is the base name
                    #and file_name[1] is the extension
                    file_name = file.split('.')
                    allfiles.append(file_name[0] + date_time + '.' + file_name[1])
        if count == 0:
            print('There are no files with', extension, 'extension in this folder.')
    return allfiles

print(DataFinder('/home/user/tmp/test', '.csv'))
Related
I have multiple text files containing different text.
They all contain a single appearance of the same 2 lines I am interested in:
================================================================
Result: XX/100
I am trying to write a script to collect all those XX values (numerical values between 0 and 100), and paste them in a CSV file with the text file name in column A and the numerical value in column B.
I have considered using Python or PowerShell for this purpose.
How can I identify the line where "Result" appears directly under the "====" string, collect its content up to '\n', and then strip it of "Result: " and "/100"?
"Result" and other numerical values can appear elsewhere in the files, but never in the quoted format below a "=====" line like the line I'm interested in.
Thank you!
Edit: I have written this poor naive attempt to collect the numerical values.
import os

dir_path = os.path.dirname(os.path.realpath(__file__))
for filename in os.listdir(dir_path):
    if filename.endswith(".txt"):
        with open(filename, "r") as f:
            lineFound = False
            for index, line in enumerate(f):
                if lineFound:
                    line = line.replace("Result: ", "")
                    line = line.replace("/100", "")
                    grade = line.strip()  # strip() returns a new string, so assign it
                    lineFound = False
                    print(grade, end='')
                    continue
                if index > 3:
                    if "================================================================" in line:
                        lineFound = True
I'd still be happy to learn if there's a simple way to do this with PowerShell tbh
For the output, I used csv writer to append the results to a file one by one.
So there are two steps involved here. The first is to get a list of files; there's a ton of answers for that one on Stack Overflow, but this one is stupidly complete.
Once you have the list of files, you can load the files one by one and do a simple string .split() to get the value you want.
Finally, write the results into a CSV file. Since the CSV file is a simple one, you don't need to use the CSV library for this.
See the code example below. Note that I copied/pasted the function for generating the list of files from my personal GitHub repo. I reuse that one a lot.
import os

def get_files_from_path(path: str = ".", ext=None) -> list:
    """Find files in path and return them as a list.

    Gets all files in folders and subfolders.
    See the answer on the link below for a ridiculously
    complete answer for this:
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on.
            Defaults to '.'.
        ext (str or list, optional): Optional file extension(s).
            Defaults to None.

    Returns:
        list: list of file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = f"{subdir}{os.sep}{fname}"
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result
filelist = get_files_from_path("path/to/files/", ext=".txt")

split1 = "================================================================\nResult: "
split2 = "/100"

with open("output.csv", "w") as outfile:
    outfile.write('filename, value\n')
    for filename in filelist:
        with open(filename) as infile:
            value = infile.read().split(split1)[1].split(split2)[0]
            print(value)
            outfile.write(f'"{filename}", {value}\n')
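One caveat: if a file happens not to contain the marker block, infile.read().split(split1)[1] will raise an IndexError. A small defensive variation of the inner loop (the parts name is just for illustration):

with open(filename) as infile:
    parts = infile.read().split(split1)
    if len(parts) > 1:  # the marker block was found in this file
        value = parts[1].split(split2)[0]
        outfile.write(f'"{filename}", {value}\n')
    else:
        print(f'no result marker in {filename}')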
You could try this.
In this example the filename written to the CSV will be its full (absolute) path. You may just want the base filename.
Uses the same, albeit seemingly unnecessary, mechanism for deriving the source directory. It would be unusual to have your Python script in the same directory as your data.
import os
import glob

equals = '=' * 64
dir_path = os.path.dirname(os.path.realpath(__file__))
outfile = os.path.join(dir_path, 'foo.csv')

with open(outfile, 'w') as csv:
    print('A,B', file=csv)
    for file in glob.glob(os.path.join(dir_path, '*.txt')):
        prev = None
        with open(file) as indata:
            for line in indata:
                t = line.split()
                # guard on prev so the first line of a file can't crash the check
                if len(t) == 2 and t[0] == 'Result:' and prev and prev.startswith(equals):
                    v = t[1].split('/')
                    if len(v) == 2 and v[1] == '100':
                        print(f'{file},{v[0]}', file=csv)
                        break
                prev = line
I have been trying to merge several .csv files from different subfolders (all with the same name) into one. I tried R first, but it ran out of memory (the merge involves more than 20 million rows). I am now working on a Python script (see below). The files have many columns and I don't need all of them, but I also don't know whether I can choose which columns to add to the new CSV:
import glob
import csv
import os

path = 'C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders'
result = glob.glob('*/certificates.csv')
#for i in result:
#    full_path = "C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders\\" + result
#    print(full_path)
os.chdir(path)

i = 0
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        print(name)
        try:
            i += 1
            if i % 10000 == 0:
                #just to see the progress
                print(i)
            if name == 'certificates.csv':
                creader = csv.reader(open(name))
                cwriter = csv.writer(open('processed_' + name, 'w'))
                for cline in creader:
                    new_line = [val for col, val in enumerate(cline)]
                    cwriter.writerow(new_line)
        except:
            print('problem with file: ' + name)
            pass
but it doesn't work, and it doesn't return any error either, so at the moment I am completely stuck.
Your indentation is wrong, and you are overwriting the output file for each new input file. Also, you are not using the glob result for anything. If the files you want to read are all immediately in subdirectories of path, you can do away with the os.walk() call and do the glob after you os.chdir().
import glob
import csv
import os

# No real need to have a variable for this really
path = 'C:\\path\\to\\folder\\where\\all\\files\\are-allowated-in-subfolders'
os.chdir(path)

# Obviously, can't use the input file name in the output file name
# because there is only one output file for many input files
with open('processed.csv', 'w') as dest:
    cwriter = csv.writer(dest)
    for i, name in enumerate(glob.glob('*/certificates.csv'), 1):
        if i % 10000 == 0:
            #just to see the progress
            print(i)
        try:
            with open(name) as csvin:
                creader = csv.reader(csvin)
                for cline in creader:
                    # no need to enumerate fields
                    cwriter.writerow(cline)
        except:
            print('problem with file: ' + name)
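The question also asked about keeping only some of the columns. Since csv.reader yields each row as a plain list, you can select columns by index before writing them out; a sketch of the inner loop (the indices in wanted are made-up placeholders, adjust them to your data):

wanted = [0, 2, 3]  # hypothetical column indices to keep
for cline in creader:
    # write only the selected columns of each row
    cwriter.writerow([cline[i] for i in wanted])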
You probably just need to keep a merged.csv file open whilst reading in each of the certificates.csv files. glob.glob() can be used to recursively find all suitable files:
import glob
import csv
import os

path = r'C:\path\to\folder\where\all\files\are-allowated-in-subfolders'
os.chdir(path)

with open('merged.csv', 'w', newline='') as f_merged:
    csv_merged = csv.writer(f_merged)
    # the '**' pattern together with recursive=True searches all subfolders
    for filename in glob.glob(os.path.join(path, '**', 'certificates.csv'), recursive=True):
        print(filename)
        try:
            with open(filename) as f_csv:
                csv_merged.writerows(csv.reader(f_csv))
        except:
            print('problem with file: ', filename)
An r prefix can be added to your path to avoid needing to escape each backslash. Also newline='' should be added to the open() when using a csv.writer() to stop extra blank lines being written to your output file.
I have a directory with many files that are named like:
1234_part1.pdf
1234.pdf
5432_part1.pdf
5432.pdf
2323_part1.pdf
2323.pdf
etc.
I am trying to merge the PDFs where the first number part of the file names is the same.
I have code that can do this one pair at a time, but with over 500 files in the directory I am not sure how to loop through them all. Here is what I have so far:
from PyPDF2 import PdfFileMerger, PdfFileReader

merger = PdfFileMerger()
merger.append(PdfFileReader(open('c:/example/1234_part1.pdf', 'rb')))
merger.append(PdfFileReader(open('c:/example/1234.pdf', 'rb')))
merger.write("c:/example/output/1234_combined.pdf")
Ideally the output file would be 'xxxx_combined_<today's date>.pdf'.
i.e. 1234_combined_051719.pdf
Also, if a number has only the part1 file or only the plain file, it should not be combined;
i.e. if there was a 9999_part1.pdf but no 9999.pdf, then there would be no output for '9999_combined_<today's date>.pdf'.
Try using os.listdir() to get all of the files in your directory. Then use .split() on each filename string to isolate the PDF file number, and look for that number pattern in the list of files you made.
import os
from PyPDF2 import PdfFileMerger, PdfFileReader

dir = 'my/dir/of/pdfs/'
file_list = os.listdir(dir)
num_list = []

for fname in file_list:
    if '_' in fname:  # if the filename has an underscore in it
        file_num = fname.split('_')[0]  # gets the first element of the list of splits
    else:
        file_num = fname.split('.')[0]
    if file_num not in num_list:
        num_list.append(file_num)

# now you have a list of all of your file numbers, so you can grab all files
# in the file_list containing each number
for num in num_list:
    pdf_parts = [x for x in file_list if num in x]  # grabs all files with that number
    if len(pdf_parts) < 2:  # if there is only one pdf with that num ...
        continue  # skip it!
    # your pdf append operation here for each item in the pdf_parts list.
    # something like this maybe ...
    merger = PdfFileMerger()
    # sort the list by filename length in descending order so that
    # '_part' files come first
    sorted_pdf_parts = sorted(pdf_parts, key=len, reverse=True)
    for part in sorted_pdf_parts:
        merger.append(PdfFileReader(open(dir + part, 'rb')))
    merger.write('out/dir/' + num + '_combined.pdf')
You can do it like this:
from PyPDF2 import PdfFileMerger, PdfFileReader
from os import listdir
from datetime import datetime

file_names = listdir('D:\Code\python-examples\PDF')

for file_name in file_names:
    if "_" in file_name:
        digits = file_name.split('_')[0]
        if f'{digits}.pdf' in file_names:
            with open(f'{digits}.pdf', 'rb') as digit_file, open(f'{digits}_part1.pdf', 'rb') as part1_file:
                merger = PdfFileMerger()
                merger.append(PdfFileReader(part1_file))
                merger.append(PdfFileReader(digit_file))
                merger.write(f'{digits}_combined_{datetime.now().strftime("%m%d%y")}.pdf')
A couple of notes:
It's recommended to use with when opening files.
You can use datetime.now().strftime("%m%d%y") to get the date format you mentioned.
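As a quick illustration of that format string, using the date from the question's own example:

from datetime import datetime

# May 17, 2019 rendered as mmddyy, matching 1234_combined_051719.pdf
print(datetime(2019, 5, 17).strftime("%m%d%y"))  # prints 051719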
Running the code on a folder containing the example files produces the xxxx_combined_<today's date>.pdf outputs as expected.
I also uploaded the code, along with relevant files, to my GitHub page. If anyone wants to try it themselves, they can check it out.
INPUT: I want to add increasing numbers to file names in a directory, sorted by date. For example, add "01_", "02_", "03_"... to the files below.
test1.txt (oldest text file)
test2.txt
test3.txt
test4.txt (newest text file)
Here's the code so far. I can get the file names, but each character in the file name seems to be its own item in a list.
import os

for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print(file)
The EXPECTED results are:
01_test1.txt
02_test2.txt
03_test3.txt
04_test4.txt
with test1 being the oldest and test 4 being the newest.
How do I add a 01_, 02_, 03_, 04_ to each file name?
I've tried something like this. But it adds a '01_' to every single character in the file name.
new_test_names = ['01_'.format(i) for i in file]
print(new_test_names)
If you want to number your files by age, you'll need to sort them first: call sorted and pass key=os.path.getmtime, which sorts by modification time in ascending order (oldest to newest).
Use glob.glob to get all text files in a given directory. It is not recursive as of now, but a recursive extension is a minimal addition if you are using Python 3.
Use str.zfill to build prefixes of the form 0x_.
Use os.rename to rename your files.
import glob
import os

sorted_files = sorted(
    glob.glob('path/to/your/directory/*.txt'), key=os.path.getmtime)

for i, f in enumerate(sorted_files, 1):
    try:
        head, tail = os.path.split(f)
        os.rename(f, os.path.join(head, str(i).zfill(2) + '_' + tail))
    except OSError:
        print('Invalid operation')
It always helps to make a check using try-except, to catch any errors that shouldn't be occurring.
This should work:
import glob

new_test_names = ["{:02d}_{}".format(i, filename)
                  for i, filename in enumerate(glob.glob("/Users/Admin/Documents/Test/*.txt"), start=1)]
Or without list comprehension:
for i, filename in enumerate(glob.glob("/Users/Admin/Documents/Test/*.txt"), start=1):
    print("{:02d}_{}".format(i, filename))
Three things to learn about here:
glob, which makes this sort of file matching easier.
enumerate, which lets you write a loop with an index variable.
format, specifically the 02d modifier, which prints two-digit numbers (zero-padded).
Two methods to format an integer with a leading zero:
1. Use .format:
import os

i = 1
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print('{0:02d}'.format(i) + '_' + file)
        i += 1
2. Use .zfill:
import os

i = 1
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print(str(i).zfill(2) + '_' + file)
        i += 1
The easiest way is to simply have a variable, such as i, which will hold the number and prepend it to the string using some kind of formatting that guarantees it will have at least 2 digits:
import os

i = 1
for file in os.listdir("/Users/Admin/Documents/Test"):
    if file.endswith(".txt"):
        print('%02d_%s' % (i, file))  # %02d means your number will have at least 2 digits
        i += 1
You can also take a look at enumerate and glob to make your code even shorter (but make sure you understand the fundamentals before using it).
test_dir = '/Users/Admin/Documents/Test'

txt_files = [file
             for file in os.listdir(test_dir)
             if file.endswith('.txt')]
numbered_files = ['%02d_%s' % (i + 1, file)
                  for i, file in enumerate(txt_files)]
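Since the goal is ultimately to rename the files rather than just print the new names, here is a short sketch combining glob, enumerate, and os.rename (the directory path comes from the question; sorting by name is an assumption here, swap in key=os.path.getmtime to sort by date instead):

import glob
import os

# prefix each .txt file with a zero-padded counter: 01_, 02_, ...
for i, path in enumerate(sorted(glob.glob('/Users/Admin/Documents/Test/*.txt')), start=1):
    head, tail = os.path.split(path)
    os.rename(path, os.path.join(head, '%02d_%s' % (i, tail)))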
I have a directory with a large number of files that I want to move into folders based on part of the file name. My list of files looks like this:
ID1_geneabc_species1.fa
ID1_genexy_species1.fa
ID2_geneabc_species1.fa
ID3_geneabc_species2.fa
ID3_genexy_species2.fa
ID4_genexy_species3.fa
I want to move the files I have into separate folders based on the last part of the file name (species1, species2, species3). The first parts of the file name do not always have the same number of numbers and/or letters but are always in 3 parts separated by an underscore '_'.
This is what I have tried from looking online but it does not work:
import os
import glob

dirs = glob.glob('*_*')
files = glob.glob('*.fa')

for file in files:
    name = os.path.splitext(file)[0]
    matchdir = next(x for x in dirs if name == x.rsplit('_')[0])
    os.rename(file, os.path.join(matchdir, file))
I have the list of names (species1, species2, species3) in the script below, which correspond to the third part of my file name. I am able to create a set of directories in my current working directory from these names. Is there a better way to do this after the following script, like looping through the list of species, matching each file, then moving it into the correct directory? THANKS.
from Bio import SeqIO
import os
import itertools

#get a list of all the species in the genbank file
all_species = []
for seq_record in SeqIO.parse("sequence.gb", "genbank"):
    all_species.append(seq_record.annotations["organism"])

#get unique names and change from set to list
Unique_species = set(all_species)
Species = list(Unique_species)

#send to file
f = open('speciesnames.txt', 'w')
for names in Species:
    f.write(names + '\n')
f.close()

print('There are ' + str(len(Species)) + ' species.')

#make a directory for each species
path = os.path.dirname(os.path.abspath(__file__))
for item in itertools.product(Species):
    os.makedirs(os.path.join(path, *item))
So, you want a function which gets the folder name from the file name. Then you iterate over the files, create the directories which don't exist, and move the files there. Something like the following should work.
import glob
import os

def get_dir_name(filename):
    #the species name sits between the last underscore and the dot
    pos1 = filename.rfind('_')
    pos2 = filename.find('.')
    return filename[pos1 + 1:pos2]

for f in glob.glob('*.fa'):
    cwd = os.getcwd()
    dir_name = cwd + '/' + get_dir_name(f)
    print(dir_name)
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    os.rename(f, dir_name + '/' + f)
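A quick check of the helper against the example names from the question (expected output shown in comments):

# the species name is extracted from between the last underscore and the extension
for name in ['ID1_geneabc_species1.fa', 'ID3_genexy_species2.fa', 'ID4_genexy_species3.fa']:
    print(get_dir_name(name))
# species1
# species2
# species3

One caveat: os.rename only works within a single filesystem; shutil.move is a drop-in alternative if the target directory might live elsewhere.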