Combining Excel worksheets over multiple loops - python

I've got a number of Excel workbooks, each with multiple worksheets, that I'd like to combine.
I've set up two sets of loops (one while, one for) to read in rows for each sheet in a given workbook and then do the same for all workbooks.
I tried to do it on a subset of these, and it appears to work until I try to combine the two sets using the pd.concat function. Error given is
TypeError: first argument must be an iterable of pandas objects, you
passed an object of type "DataFrame"
Any idea what I'm doing incorrectly?
import pandas as pd

d = 2013
numberOfSheets = 5
while d < 2015:
    # print(str(d) + ' beginning')
    f = 'H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        data = pd.read_excel(f, sheetname='Table ' + str(i), header=None)
        print(i)
        df.append(data)
    print(str(d) + ' complete')
    print(df)
    d += 1
df = pd.concat(df)
print(df)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df.to_excel(final)

As the error says, pd.concat() requires an iterable, like a list: pd.concat([df1, df2]) will concatenate df1 and df2 along the default axis of 0, which means df2 is appended to the bottom of df1.
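A minimal sketch of the difference (the toy frames here are invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# A list (any iterable of pandas objects) is what concat wants:
combined = pd.concat([df1, df2], ignore_index=True)

# Passing a single DataFrame instead, e.g. pd.concat(df1), raises exactly
# the TypeError quoted in the question.
```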
Two issues need fixing:
The for loop refers to df before assigning anything to it.
The variable df is overwritten with each iteration of the for loop.
One workaround is to create an empty list of DataFrames before the loops, then append DataFrames to that list, and finally concatenate all the DataFrames in that list. Something like this:
import pandas as pd

d = 2013
numberOfSheets = 5
dfs = []
while d < 2015:
    # print(str(d) + ' beginning')
    f = 'H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        # 'sheetname' was renamed to 'sheet_name' in pandas 0.21
        data = pd.read_excel(f, sheet_name='Table ' + str(i), header=None)
        print(i)
        dfs.append(data)
    print(str(d) + ' complete')
    d += 1

# ignore_index=True gives the result a default integer index starting from 0
df_final = pd.concat(dfs, ignore_index=True)
print(df_final)

final_path = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final_path)

Since I can't comment, I'll leave this as an answer: you can speed up this code by opening the file once then parsing the workbook to get each sheet. Should save a second or two off each iteration, since opening the Excel file takes the longest. Here's some code that might help.
Note: setting sheet_name=None will return ALL the sheets in the workbook:
dfs = {<sheetname1>: <DataFrame1>, <sheetname2>: <DataFrame2>, etc.}
Here's the code:
xl = pd.ExcelFile(fpath)
dfs = xl.parse(sheet_name=None, header=None)  # dict of {sheet name: DataFrame}
for i, (name, df) in enumerate(dfs.items()):  # iterating the dict directly would yield only the keys
    # <do stuff with each sheet, if you want>
    print('Sheet {0} looks like:\n{1}'.format(i + 1, df))

Thank you, both. I accepted the answer that addressed the specific question, but I was able to use the second answer and some additional googling thereafter (e.g. glob) to amend the original code and automate it more fully, independent of the number of workbooks or worksheets.
The final version of the above is now below:
import pandas as pd
import glob

fpath = "H:/MyDocuments/Z Project Work/"
dfs = []
files = glob.glob(fpath + '*.xlsx')
for f in files:
    xl = pd.ExcelFile(f)
    xls = xl.parse(sheet_name=None, header=0)  # dict of {sheet name: DataFrame}
    for i, name in enumerate(xls):
        print(i)
        dfs.append(xls[name])
    print(f + ' complete')

df_final = pd.concat(dfs, ignore_index=True)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final)

Related

How to save specific columns from different files into one file

I am new to python and I'm using Python 3.9.6. I have about 48 different files all with "Cam_Cantera_IDT_output_800K_*.csv" in the name. For example: Cam_Cantera_IDT_output_800K_7401.csv and Cam_Cantera_IDT_output_800K_8012.csv. All the files have a time column t followed by many columns named X_oh, X_ch2, X_h20 etc. I want to write a code that goes to each of these 48 files and takes only the t and X_ch2 columns and saves them in a new file. Since the time column is the same for all 48 files I just want 1 time column t followed by 48 columns of X_ch2, so 49 columns in total.
My first attempt was using append:
# Using .append, this code runs
import pandas as pd
import glob

require_cols = ['t', 'X_ch2']
all_data = pd.DataFrame()
for f in glob.glob("Cam_Cantera_IDT_output_800K_*.csv"):
    df = pd.read_csv(f, usecols=require_cols)
    all_data = all_data.append(df, ignore_index=True)
all_data.to_csv("new_combined_file.csv")
This code ran but it appended each file one underneath the other, so I had only 2 columns, one for t and one for X_ch2, but many rows. I later read that that's what append does, it adds the files one underneath the other.
I then tried using pd.concat to add the columns vertically next to each other. The code I used can be seen below. I unfortunately got the same result as with append. I got just 2 columns, one for time t and one for X_ch2 and all the files were added underneath each other.
#Attempting using pd.concat, this code runs
import glob
import pandas as pd
file_extension = 'Cam_Cantera_IDT_output_800K_*.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
require_cols = ['t', 'X_ch2']
combined_csv_data = pd.concat([pd.read_csv(f, usecols = require_cols) for f in all_filenames], axis=0)
print(combined_csv_data)
combined_csv_data.to_csv('combined_csv_data.csv')
My last attempt was using pd.merge, I added on='t' for it to merge on the time column so that I only have 1 time column. I keep getting the error that I am missing 'right'. But when I add it to the line it tells me that right is not defined:
combined_csv_data = pd.merge(right, [pd.read_csv(f, usecols = require_cols) for f in all_filenames], on='t') => gives 'right' is not defined.
I tried using right_on = X_ch2 or right_index=True but nothing seems to work.
The original code I tried, without any 'right' is shown below.
# A merge attempt
#TypeError: merge() missing 1 required positional argument: 'right'
import glob
import pandas as pd
file_extension = 'Cam_Cantera_IDT_output_800K_*.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
require_cols = ['t', 'X_ch2']
combined_csv_data = pd.merge([pd.read_csv(f, usecols = require_cols) for f in all_filenames], on='t')
print(combined_csv_data)
combined_csv_data.to_csv('combined_csv_data.csv')
Any help would be highly appreciated, thank you.
Use axis=1 in concat, convert the t column to the index, and select the X_ch2 column to get a Series:
require_cols = ['t', 'X_ch2']
L = [pd.read_csv(f,usecols = require_cols, index_col=['t'])['X_ch2'] for f in all_filenames]
combined_csv_data = pd.concat(L, axis=1)
If you need distinct column names, rename each Series by its filename to avoid 48 duplicated column names:
import os

L = [pd.read_csv(f, usecols=require_cols, index_col=['t'])['X_ch2']
     .rename(os.path.basename(f))
     for f in all_filenames]
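Putting the two snippets above together, a minimal end-to-end sketch; the two toy CSVs, their contents, and the temporary directory are invented for illustration:

```python
import os
import tempfile
import pandas as pd

tmp = tempfile.mkdtemp()
# Two toy input files sharing the same time column 't'
pd.DataFrame({"t": [0.0, 0.1], "X_ch2": [1.0, 2.0], "X_oh": [9, 9]}).to_csv(
    os.path.join(tmp, "Cam_Cantera_IDT_output_800K_7401.csv"), index=False)
pd.DataFrame({"t": [0.0, 0.1], "X_ch2": [3.0, 4.0], "X_oh": [9, 9]}).to_csv(
    os.path.join(tmp, "Cam_Cantera_IDT_output_800K_8012.csv"), index=False)

all_filenames = sorted(
    os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith(".csv"))

require_cols = ["t", "X_ch2"]
# One Series per file, indexed by t and named after the file
L = [pd.read_csv(f, usecols=require_cols, index_col=["t"])["X_ch2"]
     .rename(os.path.basename(f))
     for f in all_filenames]
combined = pd.concat(L, axis=1)  # one shared t index, one column per file
```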
If installing the convtools library is an option:
import glob

from convtools.contrib.tables import Table

required_cols = ["t", "X_ch2"]

table = None
for number, f in enumerate(glob.glob("Cam_Cantera_IDT_output_800K_*.csv")):
    table_ = (
        Table.from_csv(f, header=True)
        .take(*required_cols)
        .rename({"X_ch2": f"X_ch2__{number}"})
    )
    if table is None:
        table = table_
    else:
        table = table.join(table_, on="t", how="full")

table.into_csv("new_combined_file.csv", include_header=True)

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to Excel, but when I run the print function, it prints all of them. Is there an issue with my code causing it not to export all the data to Excel?
I've also tried exporting as a .csv file, with no luck.
import pandas as pd

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index=False)
        # print(df)
Your problem is that each time df.to_excel is called, you overwrite the file, so only the last df is left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
url = 'https://www.vegasinsider.com/college-football/matchups/'

writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)

counter = 0
for df in dfs:
    if len(df.columns) > 4:
        counter += 1
        df.to_excel(writer, sheet_name=f"sheet_{counter}", index=False)

writer.save()  # in newer pandas, use writer.close() (or a with-block) instead
You might need pip install xlsxwriter to make it work.
Exporting to a csv will never work, since a csv is a single data table (like a single sheet in Excel), so in that case you would need a new csv for each df.
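If csv output were still wanted, the same filter loop could write one file per table. A sketch with toy frames standing in for the read_html results (the frames, filenames, and output directory are invented):

```python
import os
import tempfile
import pandas as pd

outdir = tempfile.mkdtemp()

# Stand-ins for the tables pd.read_html(url) would return
dfs = [pd.DataFrame({"team": ["A", "B"], "spread": [-3.5, 3.5]}),
       pd.DataFrame({"team": ["C", "D"], "spread": [-7.0, 7.0]})]

# One csv per table, since a single csv cannot hold multiple sheets
for counter, df in enumerate(dfs, start=1):
    df.to_csv(os.path.join(outdir, f"VegasInsiderCFB_{counter}.csv"), index=False)
```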
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]

columns = ["gameid", "game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
# np.object was removed in NumPy 1.24; the builtin object is the same dtype
values = np.empty((2 * N, len(columns)), dtype=object)
for i, df in enumerate(dfs):
    time = df.iloc[0, 0].replace(" Game Time", "")
    values[2 * i:2 * i + 2, 2:] = df.iloc[2:, :]
    values[2 * i:2 * i + 2, :2] = np.array([[i, time], [i, time]])

newdf = pd.DataFrame(values, columns=columns)
newdf.to_excel("output.xlsx", index=False)
I used a numpy array of object dtype to be able to copy a submatrix from the original dataframes easily into its intended place. I also needed to create a gameid that connects the games across rows. It should now be trivial to rewrite this so you loop through a list of urls and write each result to a separate sheet.
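That separate-sheets idea might look roughly like this, with toy frames standing in for each page's scraped table (the page names, frame contents, and path are invented), assuming an Excel engine such as openpyxl is installed:

```python
import os
import tempfile
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "all_pages.xlsx")

# Stand-ins for pd.read_html(url) results from several pages
pages = {"week1": pd.DataFrame({"team": ["A", "B"], "total": [41.5, 38.0]}),
         "week2": pd.DataFrame({"team": ["C", "D"], "total": [44.0, 51.5]})}

# One ExcelWriter, one sheet per page; closing the with-block saves the file
with pd.ExcelWriter(path) as writer:
    for name, df in pages.items():
        df.to_excel(writer, sheet_name=name, index=False)
```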

Python Pandas ExcelWriter append to sheet creates a new sheet

I would really appreciate some help.
I'm trying to use a loop to create sheets, and add data to those sheets on every pass. The position of my data is correct, however pandas' ExcelWriter creates a new sheet instead of appending to the one created the first time through the loop.
I'm a beginner, and right now function takes priority over form, so forgive me.
My code:
import pandas as pd

# initial files for dataframes
excel_file = 'output.xlsx'
setup_file = 'setup.xlsx'
# write to excel
output_filename = 'output_final.xlsx'

df = pd.read_excel(excel_file)  # create dataframe of entire sheet
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')  # clean dataframe titles
df_setup = pd.read_excel(setup_file)
df_setup.columns = df_setup.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')  # clean dataframe titles

df_2 = pd.merge(df, df_setup)  # merge data with setup to have krymp size for each wire in dataframe
# creates column for the wirelabel by joining columns with set delimiters.
# TODO: delimiters to be set by inputs.
df_2['wirelabel'] = "'" + df_2['cable'] + "_" + df_2['function_code'] + "-" + df_2['terminal_strip'] + ":" + df_2['terminal']
df_2.sort_values(by=['switchboard'])  # sort so we get proper order
switchboard_unique = df.switchboard.unique().tolist()  # create variable containing unique switchboards for printing to excel sheets

def createsheets(output_filename, sheetname, row_start, column_start, df_towrite):
    with pd.ExcelWriter(output_filename, engine='openpyxl', mode='a') as writer:
        df_towrite.to_excel(writer, sheet_name=sheetname, columns=['wirelabel'],
                            startrow=row_start, startcol=column_start,
                            index=False, header=False)
        writer.save()
        writer.close()

def sorter():
    for s in switchboard_unique:
        df_3 = df_2.loc[df_2['switchboard'] == s]
        krymp_unique = df_3.krymp.unique().tolist()
        krymp_unique.sort()
        # print(krymp_unique)
        column_start = 0
        row_start = 0
        for k in krymp_unique:
            df_3.loc[df_3['krymp'] == k]
            # print(k)
            # print(s)
            # print(df_3['wirelabel'])
            createsheets(output_filename, s, row_start, column_start, df_3)
            column_start = column_start + 1

sorter()
Current behavior: if sheetname = "sheet", then my script creates sheet1, sheet2, sheet3, etc.
Wanted behavior: create a sheet for each item in "df_3", and put data into columns according to the position calculated in column_start. The positioning in my code works; the data just goes to the wrong sheet.
I hope it's clear what I'm trying to accomplish, and all help is appreciated.
I tried all the example code I have found regarding writing to Excel.
I know my code is not a work of art, but I will update this post with the answer to my own question for the sake of completeness, in case anyone stumbles on this post.
It turns out I misunderstood the capabilities of append mode in pandas' pd.ExcelWriter: it is not possible to append to an already-existing sheet; the sheet gets overwritten even though mode is set to 'a'.
Realizing this, I changed my code to build a dataframe for the entire sheet (df_sheet) and then call the createsheets function once per sheet. The first version wrote my data column by column.
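For what it's worth: if I'm not mistaken, pandas 1.4+ added an if_sheet_exists parameter to pd.ExcelWriter, which does allow writing into an already-existing sheet with mode='a'. A minimal sketch (the path, column names, and frames are invented; requires openpyxl):

```python
import os
import tempfile
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "output_final.xlsx")
pd.DataFrame({"col_a": [1, 2]}).to_excel(path, index=False)  # existing workbook

# mode='a' with if_sheet_exists='overlay' writes into the existing 'Sheet1'
# instead of spawning new sheets with suffixed names
with pd.ExcelWriter(path, engine="openpyxl", mode="a",
                    if_sheet_exists="overlay") as writer:
    pd.DataFrame({"col_b": [3, 4]}).to_excel(
        writer, sheet_name="Sheet1", startcol=2, index=False)
```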
"Final" code:
import pandas as pd

# initial files for dataframes
excel_file = 'output.xlsx'
setup_file = 'setup.xlsx'
# write to excel
output_filename = 'output_final.xlsx'
column_name = 0

df = pd.read_excel(excel_file)  # create dataframe of entire sheet
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')  # clean dataframe titles
df_setup = pd.read_excel(setup_file)
df_setup.columns = df_setup.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')  # clean dataframe titles

df_2 = pd.merge(df, df_setup)  # merge data with setup to have krymp size for each wire in dataframe
# creates column for the wirelabel by joining columns with set delimiters.
# TODO: delimiters to be set by inputs.
df_2['wirelabel'] = "'" + df_2['cable'] + "_" + df_2['function_code'] + "-" + df_2['terminal_strip'] + ":" + df_2['terminal']
df_2.sort_values(by=['switchboard'])  # sort so we get proper order
switchboard_unique = df.switchboard.unique().tolist()  # create variable containing unique switchboards for printing to excel sheets

def createsheets(output_filename, sheetname, df_towrite):
    with pd.ExcelWriter(output_filename, engine='openpyxl', mode='a') as writer:
        df_towrite.to_excel(writer, sheet_name=sheetname, index=False, header=True)

def to_csv_file(output_filename, df_towrite):
    df_towrite.to_csv(output_filename, mode='w', index=False)

def sorter():
    for s in switchboard_unique:
        df_3 = df_2.loc[df_2['switchboard'] == s]
        krymp_unique = df_3.krymp.unique().tolist()
        krymp_unique.sort()
        column_start = 0
        row_start = 0
        df_sheet = pd.DataFrame([])
        for k in krymp_unique:
            df_5 = df_3.loc[df_3['krymp'] == k]
            df_4 = df_5.filter(['wirelabel'])
            column_name = "krymp " + str(k) + " Tavle: " + str(s)
            df_4 = df_4.rename(columns={"wirelabel": column_name})
            df_4 = df_4.reset_index(drop=True)
            df_sheet = pd.concat([df_sheet, df_4], axis=1)
            column_start = column_start + 1
            row_start = row_start + len(df_5.index) + 1
        createsheets(output_filename, s, df_sheet)
        to_csv_file(s + ".csv", df_sheet)

sorter()
Thank you.

Concatenating dataframes adding additional columns

I'm trying to create a combined dataframe from a series of 12 individual CSVs (12 months to combine for the year). All the CSVs have the same format and column layout.
When I first ran it, it appeared to work and I was left with a combined dataframe with 6 columns (as expected). Upon looking at it, I found that the header row was applied as actual data in all the files, so I had some bad rows I needed to eliminate. I could manually make these changes but I'm looking to have the code take care of this automatically.
So to that end, I updated the code so it only read in the first CSV with headers and the remaining CSVs without headers and concatenate everything together. This appears to work BUT I end up with 12 columns instead of 6 with the first 6 columns having NaNs for the first CSV and the last 6 columns having NaNs for the other 11 CSVs, which is obviously NOT what I want (see image below).
The code is similar, I just use the header=None parameter in pd.read_csv() for the 11 CSVs after the first (and I don't use that parameter for the first CSV). Can anyone give me a hint as to why I'm getting 12 columns (with the data placement as described) when I run this code? The layout of the CSV file is shown below.
Appreciate any help.
import pandas as pd
import numpy as np
import os

# Need to include the header row only for the first csv (otherwise the header row
# will be included for each csv read, which places improperly formatted rows into
# the combined dataframe).
totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)

# Now read the remaining csv files (without header row) and concatenate their
# values into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    dfSD = pd.concat([dfSD, df])
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")

print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))
One simple solution is the following.
import pandas as pd
import os

totrows = 0
files = os.listdir('c:/data/datasets')

columns = None
dfSD = []
for file in files:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
    if columns is None:
        columns = df.columns  # remember the header of the first file
    df.columns = columns      # force the same column labels on every file
    dfSD.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")

dfSD = pd.concat(dfSD, axis=0)
dfSD = dfSD.reset_index(drop=True)
Another possibility is:
import pandas as pd
import os

# Need to include the header row only for the first csv (otherwise the header row
# would be read as a data row in every subsequent csv).
totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]

# Now read the remaining csv files (without header row), give them the same
# column labels as the first file, and collect them for one concat at the end.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    df.columns = dfSD.columns  # reuse the labels of the first file
    df_comb.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")

# df_comb is already a list, so it is passed to concat directly (not wrapped in [])
dfSD = pd.concat(df_comb, axis=0).reset_index(drop=True)
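As to why 12 columns appear in the first place: with header=None the later frames get integer column labels 0-5, and pd.concat aligns frames by column label, so the named columns and the numbered columns end up side by side with NaNs. A tiny invented example:

```python
import pandas as pd

named = pd.DataFrame([[1, 2]], columns=["a", "b"])  # header row was parsed
numbered = pd.DataFrame([[3, 4]])                   # header=None -> labels 0, 1

wide = pd.concat([named, numbered])
# concat aligns on column labels, and {'a', 'b'} shares nothing with {0, 1},
# so the result has 4 columns, each half-filled with NaN
```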

How to append all output from for loop using pandas python

I have an Excel file (df2) to which I've applied a for loop to get multiple outputs, and I want to append all of those outputs into one single Excel file. Please find my code below and suggest some ideas so I can complete it.
import os
from os import path
import pandas as pd

src = "C:\\ASPAIN-ORANGE\\test_wind3\\udc\\folder\\"
df = pd.read_excel('check_.xlsx', sheet_name='Align_pivot')
files = [i for i in os.listdir(src)
         if i.startswith("_Verification_") and path.isfile(path.join(src, i))]

for f in files:
    slice1 = 19
    file_slice = f[slice1:].replace(".csv", "")
    df1 = pd.read_csv(f)
    total_rows_df1 = len(df1.axes[0])
    df2 = df[df['MO'] == (file_slice)]
    total_rows_df2 = sum(df2.To_Align)
    print("filename : " + str(file_slice))
    print("Number of Rows_df1: " + str(total_rows_df1))
    print("Number of Rows_df2: " + str(total_rows_df2))
    if total_rows_df1 == total_rows_df2:
        print('True')
    else:
        print('False')
    df2.to_excel('output.xlsx', index=False, na_rep='NA', header=True)
1st Iteration output
2nd Iteration output
3rd Iteration output
and so on
Final appended Output
Your kind help would really be appreciated.
You can use the DataFrame.append method (it appends the rows of other to the end of the caller, returning a new object):
df = df.append(sheet, ignore_index=True)
Once all rows are added, you can call the to_excel method to write the Excel file.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
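One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same pattern is written by collecting the pieces in a list and calling pd.concat once, roughly (the toy frames here stand in for each loop's output):

```python
import pandas as pd

pieces = []
for sheet in [pd.DataFrame({"x": [1]}), pd.DataFrame({"x": [2]})]:
    pieces.append(sheet)  # collect every loop output in a list

df = pd.concat(pieces, ignore_index=True)  # one concat instead of many appends
# df.to_excel('output.xlsx', index=False) would then write the combined result
```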
