I know this type of question is asked all the time. But I am having trouble figuring out the best way to do this.
I wrote a script that reformats a single excel file using pandas.
It works great.
Now I want to loop through multiple excel files, perform the same reformat, and place the newly reformatted data from each excel sheet at the bottom, one after another.
I believe the first step is to make a list of all excel files in the directory.
There are so many different ways to do this so I am having trouble finding the best way.
Below is the code I am currently using to import multiple .xlsx files and create a list.
import os
import glob

os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
I am not sure if the previous glob code actually created the list that I need.
Then I have trouble understanding where to go from there.
The code below fails at pd.ExcelFile(File)
I believe I am missing something....
# create for loop
for File in FileList:
    for x in File:
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(File)
        xlsx_file
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('Data', header=None)
        # select important rows,
        df_NoHeader = df[4:]
        # then it does some more reformatting.
Any help is greatly appreciated
I solved my problem. Instead of using the glob function I used os.listdir to read all my excel sheets, loop through each excel file, reformat it, and then append the final data to the end of the table.
# first create an empty appended_data list to store the info.
appended_data = []
for WorkingFile in os.listdir(r'C:\ExcelFiles'):
    # os.listdir returns bare filenames, so join them with the folder path
    FullPath = os.path.join(r'C:\ExcelFiles', WorkingFile)
    if os.path.isfile(FullPath):
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(FullPath)
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('sheet1', header=None)
        # .... do some reformatting, call the finished sheet reformatedDataSheet
        appended_data.append(reformatedDataSheet)
appended_data = pd.concat(appended_data)
And that's it; it does everything I wanted.
You need to change
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
to just
os.chdir('C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)
Why does this fix it? glob returns a single list. Since you put for FileList in glob.glob(...), you're going to walk that list one by one and put the result into FileList. At the end of your loop, FileList is a single filename - a single string.
When you do this code:
for File in FileList:
    for x in File:
the first line will assign File to the first character of the last filename (as a string). The second line will assign x to the first (and only) character of File. This is not likely to be a valid filename, so it throws an error.
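To make the difference concrete, here is a small sketch with made-up filenames:

```python
# a list like the one glob.glob('*.xlsx') returns (filenames are made up)
FileList = ["a.xlsx", "b.xlsx"]
for File in FileList:
    print(File)  # prints whole filenames: a.xlsx, then b.xlsx

# the buggy loop instead leaves FileList holding a single string,
# and iterating a string yields one character at a time
FileList = "b.xlsx"
for File in FileList:
    print(File)  # prints b, ., x, l, s, x
```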
I have a list of csv file pathnames, and I am trying to save them as dataframes. How can I do it?
import pandas as pd
import os
import glob
# use glob to get all the csv files
# in the folder
path = "/Users/azmath/Library/CloudStorage/OneDrive-Personal/Projects/LESA/2022 HY/All"
csv_files = glob.glob(os.path.join(path, "*.xlsx"))
# loop over the list of files
for f in csv_files:
    # read the excel file
    df = pd.read_excel(f)
    display(df)
    print()
The issue is that it only prints, but I don't know how to save the results. I would like to save all the dataframes as variables, preferably named after their file names.
By “save” I think you mean store dataframes in variables. I would use a dictionary for this instead of separate variables.
import os

data = {}
for f in csv_files:
    name = os.path.basename(f)
    # read the excel file into a dataframe keyed by its filename
    data[name] = pd.read_excel(f)
    display(data[name])
    print()
Now all your dataframes are stored in the data dictionary, where you can iterate over them (and easily handle all of them together if needed). Their key in the dictionary is the basename (filename) of the input file.
Recall that dictionaries remember insertion order, so the order the files were inserted is also preserved. I'd probably recommend sorting the input files before parsing - this way you get a reproducible script and sequence of operations!
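A minimal sketch of that suggestion; load_frames and its arguments are names I made up, and the reader parameter just makes it easy to swap pd.read_excel for pd.read_csv:

```python
import glob
import os

import pandas as pd

def load_frames(folder, pattern="*.xlsx", reader=pd.read_excel):
    """Read every matching file in folder into a dict keyed by filename.

    Sorting the paths first keeps the processing order reproducible.
    """
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        frames[os.path.basename(path)] = reader(path)
    return frames
```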
Try this:
a = [pd.read_excel(file) for file in csv_files]
Then a will be a list of all your dataframes. If you want a dictionary instead of a list (note read_excel, since the files are .xlsx despite the variable name):
a = {file: pd.read_excel(file) for file in csv_files}
I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")]  # get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))]  # add directory path

import pandas as pd
df = pd.DataFrame(columns=['FileName', 'Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({'FileName': pdf_files[i], 'Text': scraped_text[i]}, ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from #AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters in the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the csv module from the standard library.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
import csv
from pathlib import Path

folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        # flatten newlines so each record stays on one CSV row
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])
Another thing to bear in mind is to be sure pdf_text is a single line, or else your csv file will be broken. One way to work around that is to pick an arbitrary character to use in place of the newline marks. If you pick the pipe character, for example, then you can do something like this prior to writer.writerow (note the assignment, since strings are immutable):
pdf_text = pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.
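For completeness, the original pandas approach can also work once the two bugs from the update are fixed: DataFrame.append returned a new frame rather than modifying df in place (and was removed in pandas 2.0), and scraped_text[i] indexes the characters of a string instead of taking the whole text. A hedged sketch, with scrape_to_csv as a made-up name and convert_pdf_to_txt passed in as a parameter:

```python
import pandas as pd

def scrape_to_csv(pdf_names, pdf_paths, convert_pdf_to_txt, out="output.csv"):
    # collect one row per PDF, then build the frame in one go
    rows = []
    for name, path in zip(pdf_names, pdf_paths):
        text = convert_pdf_to_txt(path)  # the whole string, not text[i]
        rows.append({"FileName": name, "Text": text})
    df = pd.DataFrame(rows, columns=["FileName", "Text"])
    df.to_csv(out, index=False)
    return df
```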
My work is in the process of switching from SAS to Python and I'm trying to get a head start on it. I'm trying to write separate .txt files, each one representing all of the values of an Excel file column.
I've been able to upload my Excel sheet and create a .txt file fine, but for practical uses, I need to find a way to create a loop that will go through and make each column into its own .txt file, and name the file "ColumnName.txt".
Uploading Excel Sheet:
import pandas as pd
wb = pd.read_excel('placements.xls')
Creating single .txt file: (Named each column A-Z for easy reference)
with open("A.txt", "w") as f:
    for item in wb['A']:
        f.write("%s\n" % item)
Trying my hand at a for loop (to no avail):
import glob
for file in glob.glob("*.txt"):
    f = open((file.rsplit(".", 1)[0]) + ".txt", "w")
    f.write("%s\n" % item)
    f.close()
The first portion worked like a charm and gave me a .txt file with all of the relevant data.
When I used the glob command to attempt making some iterations, it doesn't error out, but only gives me one output file (A.txt) and the only data point in A.txt is the letter A. I'm sure my inputs are way off... after scrounging around forever it's what I found that made sense and ran, but I don't think I'm understanding the inputs going in to the command, or if what I'm running is just totally inaccurate.
Any help anyone would give would be much appreciated! I'm sure it's a simple loop, just hard to wrap your head around when you're so new to python programming.
Thanks again!
I suggest using pandas to write the files with to_csv, only changing the extension to .txt:
# Uploading Excel Sheet:
import pandas as pd
df = pd.read_excel('placements.xls')

# Creating a .txt file per column: (named each column A-Z for easy reference)
for col in df.columns:
    print(col)
    # python 3.6+
    df[col].to_csv(f"{col}.txt", index=False, header=None)
    # python below 3.6
    # df[col].to_csv("{}.txt".format(col), index=False, header=None)
I have this script that reads in a list of csv's from a text file; the files are hosted online. The script then makes changes to each of the files and creates a new csv from the changes. At the moment it works for a single file, but I can't get it to run through all the links and make a new csv file from each.
import pandas as pd

f = open("urls.txt", "r")
lines = []
url = lines
for line in f:
    lines.append(line)
    df = pd.read_csv(url, skiprows=7)
    df = df.rename(columns={'*.conradbrothers.com/*_20180501': 'Ranking', '*.conradbrothers.com/*_difference': 'Difference'})  # Rename the top column names
    df = df.replace(to_replace=["-"], value="0")  # find - and replace with 0
    print(df)
    df.to_csv('seo.csv', index=False)  # output file as a csv
    print(line)
f.close()
Seeing the code, it seems like you are reading an empty list:
df = pd.read_csv(url,skiprows=7)
wouldn't it be
df = pd.read_csv(line,skiprows=7)
I am just guessing; the information provided in the question is limited.
Oh, and by the way: you are saving the csv each time with the same name, so you are overwriting it. Consider changing the name in each iteration of the loop.
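A sketch of that last point, deriving a distinct output name per iteration; split_outputs and the seo_ prefix are made-up names, and urls.txt is the input file from the question:

```python
import pandas as pd

def split_outputs(url_file, out_prefix="seo"):
    """Read each csv listed in url_file and write it to its own output file.

    Writing to seo.csv inside the loop overwrites the previous result;
    numbering the outputs keeps one file per input.
    """
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    written = []
    for i, url in enumerate(urls):
        df = pd.read_csv(url, skiprows=7)
        df = df.replace(to_replace=["-"], value="0")
        out = f"{out_prefix}_{i}.csv"
        df.to_csv(out, index=False)
        written.append(out)
    return written
```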
I have a set of data saved across multiple .csv files with a fixed number of columns. Each column corresponds to a different measurement.
I would like to add a header to each file. The header will be identical for all files, and is comprised of three rows. Two of these rows are used to identify their corresponding columns.
I am thinking that I could save the header in a separate .csv file, then iteratively merge it with each data file using a for loop.
How can I do this in python? I am new to the language.
Yeah, you can do that easily with pandas. It will be faster and easier than what you're currently planning, which may create problems.
Three simple commands will be used for reading, merging and putting that in a new file and they are:
pandas.read_csv()
pandas.merge()
pandas.to_csv()
You can read what arguments you have to use and more details about them here.
For your case, you may first need to create new files with the headers in them. Then you would do another loop to add the rows, skipping each file's header.
with open("data_out.csv", "a") as fout:
    # first file:
    with open("data.csv") as f:  # your header file
        for line in f:
            fout.write(line)
    with open("data_2.csv") as f:
        next(f)  # this will skip the first line
        for line in f:
            fout.write(line)
Instead of running a for loop appending two files for multiple files, an easier solution would be to put all the csv files you want to merge into a single folder and feed the path to the program. This will merge all the csv files into a single csv file.
(Note: the attributes of each file must be the same.)
import os
import pandas as pd

# give the path to the folder containing the multiple csv files
path = 'path/to/csv/folder'  # placeholder: point this at your folder
dirList = os.listdir(path)

# Put all their names into a list
filenames = []
for item in dirList:
    if item.endswith(".csv"):
        filenames.append(item)

# Create a dataframe and make sure it's empty (not required but safe practice if appending)
df1 = pd.DataFrame()

# Convert each file to a dataframe and append it to dataframe df1
for f in filenames:
    df = pd.read_csv(os.path.join(path, f))
    df1 = df1.append(df)

# Write the combined dataframe out as a single csv file
df1.to_csv('merged.csv', encoding='utf-8', index=False)  # placeholder output name
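One caveat: DataFrame.append was deprecated and removed in pandas 2.0, so on current pandas the loop above should collect the frames and concatenate them once. A hedged sketch of that variant; merge_csv_folder is a made-up name, and the folder path and output name are placeholders:

```python
import glob
import os

import pandas as pd

def merge_csv_folder(path, out_csv):
    """Concatenate every .csv in path into one csv file (one concat, no append)."""
    files = sorted(glob.glob(os.path.join(path, "*.csv")))
    frames = [pd.read_csv(f) for f in files]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_csv, encoding='utf-8', index=False)
    return merged
```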