I have an Excel file (df2) and I have used a for loop on it to get multiple outputs. I then want to append all the outputs obtained for that file so that I can get them into one single Excel file. Please find my code below and suggest some ideas so that I can complete it.
import os
from os import path
import pandas as pd
src = "C:\\ASPAIN-ORANGE\\test_wind3\\udc\\folder\\"
df = pd.read_excel('check_.xlsx',sheet_name='Align_pivot')
files = [i for i in os.listdir(src) if i.startswith("_Verification_") and path.isfile(path.join(src, i))]
for f in files:
slice1 = 19
file_slice = f[slice1:].replace(".csv", "")
df1 = pd.read_csv(f)
total_rows_df1 = len(df1.axes[0])
df2 = df[df['MO'] == (file_slice)]
total_rows_df2 = sum(df2.To_Align)
print("filename : "+str(file_slice))
print("Number of Rows_df1: "+str(total_rows_df1))
print("Number of Rows_df2: "+str(total_rows_df2))
if total_rows_df1 == total_rows_df2:
print('True')
else:
print('False')
df2.to_excel('output.xlsx', index=False, na_rep = 'NA', header = True)
Each iteration currently produces its own output (1st iteration output, 2nd iteration output, 3rd iteration output, and so on); what I need is one final appended output.
Your kind help would really be appreciated.
You can use the DataFrame.append method (append rows of other to the end of caller, returning a new object):
df = df.append(sheet, ignore_index=True)
Once all rows are added, you can call the to_excel method to write the Excel file.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
(Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement.)
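Applied to the loop above, a minimal sketch (reusing the question's src, df, and slicing logic) would collect each df2 slice in a list and write the Excel file once at the end:

import os
from os import path
import pandas as pd

src = "C:\\ASPAIN-ORANGE\\test_wind3\\udc\\folder\\"
df = pd.read_excel('check_.xlsx', sheet_name='Align_pivot')
files = [i for i in os.listdir(src) if i.startswith("_Verification_")]

chunks = []  # one df2 slice per file
for f in files:
    file_slice = f[19:].replace(".csv", "")
    chunks.append(df[df['MO'] == file_slice])

# concatenate all slices and write a single Excel file
pd.concat(chunks, ignore_index=True).to_excel('output.xlsx', index=False, na_rep='NA')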
Very new to this, so please go easy on me :)
Trying to take multiple Excel spreadsheets, extract data from specific cells, add it all to one dataframe, and save it as a CSV file.
The csv output only contains the data from the last excel file. Please could you help?
import pandas as pd
import os
from pathlib import Path
ip = "//NETWORKLOCATION/In"
op = "//NETWORKLOCATION/Out"
file_exist = False
dir_list = os.listdir(ip)
print(dir_list)

for xlfile in dir_list:
    if xlfile.endswith('.xlsx') or xlfile.endswith('.xls'):
        file_exist = True
        str_file = os.path.join(ip, xlfile)
        df1 = pd.read_excel(str_file)
        columns1 = {*VARIOUSDATA -*
        }
        #creates an empty dataframe for the data to all sequentially be added into
        df1a = pd.DataFrame([])
        #appends the array to the new dataframe df1a
        df1a = df1a.append(pd.DataFrame(columns1, columns=['*VARIOUS COLUMNS*']))
        if not file_exist:
            print('cannot find any valid excel file in the folder ' + ip)
            print(str_file)

df1a.to_csv('//NETWORKLOCATION/Out/Test.csv')
print(df1a)
I think you should put:
#creates an empty dataframe for the data to all sequentially be added into
df1a = pd.DataFrame([])
before the for xlfile in dir_list: loop, not inside it.
Otherwise df1a is recreated empty on each file iteration.
A couple of things. First, you'll never encounter:
if not file_exist:
print('cannot find any valid excel file in the folder ' + ip)
print(str_file)
as written, because it's a nested if statement, so file_exist is always set to True before it's reached.
Second, you're creating df1a inside your for loop, so you keep setting it back to empty.
Finally, why import Path and then use os.path and os.listdir?
Why not just use Path(ip).glob('*.xls*')?
This would look like:
import pandas as pd
from pathlib import Path

ip = "//NETWORKLOCATION/In"
op = "//NETWORKLOCATION/Out"

#creates an empty dataframe for the data to all sequentially be added into
df1a = pd.DataFrame([])

for xlfile in Path(ip).glob('*.xls*'):
    df1 = pd.read_excel(xlfile)
    columns1 = {"VARIOUSDATA"}
    #appends the array to the new dataframe df1a
    df1a = df1a.append(pd.DataFrame(columns1, columns=['VARIOUS_COLUMNS']))

if df1a.empty:
    print('cannot find any valid excel file in the folder ' + ip)
else:
    df1a.to_csv(op + '/Test.csv')
    print(df1a)
The csv output only contains the data from the last excel file.
You create the df1a DataFrame inside the for loop. Each time you read a new xlfile you create a new empty DataFrame.
You have to move df1a = pd.DataFrame([]) above the loop, right after the path definitions.
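In outline, a minimal runnable sketch of that fix (keeping the question's os.listdir approach; paths as in the question):

import os
import pandas as pd

ip = "//NETWORKLOCATION/In"

df1a = pd.DataFrame([])  # create the accumulator once, before the loop

for xlfile in os.listdir(ip):
    if xlfile.endswith(('.xlsx', '.xls')):
        df1 = pd.read_excel(os.path.join(ip, xlfile))
        df1a = df1a.append(df1)  # each file's rows accumulate instead of being overwritten

df1a.to_csv('//NETWORKLOCATION/Out/Test.csv')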
Something like this should work for you.
import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("C:\\your_path\\*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
Check out this link.
https://pbpython.com/excel-file-combine.html
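If you're on a recent pandas (DataFrame.append was removed in 2.0), the same idea as a sketch with pd.concat:

import glob
import pandas as pd

# read every workbook into a list, then concatenate once
frames = [pd.read_excel(f) for f in glob.glob("C:\\your_path\\*.xlsx")]
all_data = pd.concat(frames, ignore_index=True)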
I have written code (thanks to) that groups the column that I need to remain as it is and sums the targeted columns:
import pandas as pd
import glob as glob
import numpy as np
#Read excel and Create DF
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    df = pd.read_excel(f, index_col=None, na_values=['NA'])
    df['filename'] = f
    data = all_data.append(df, ignore_index=True)

#Group and Sum
result = data.groupby(["Date"])["Families","Individuals"].agg([np.sum])

#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
The problem is here:
#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
The code gives me the result that I want, however it only takes into account the last file it iterates through. I need to save the sums from all the files.
Thank you.
You never change all_data in the loop, since it is never re-assigned: each iteration appends the current df to the still-empty frame initialized outside the loop and stores the result in data, so only the very last file is retained. A quick (non-recommended) fix would be:
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    ...
    all_data = all_data.append(df, ignore_index=True)  # CHANGE LAST LINE IN LOOP

# USE all_data (NOT data) FOR AGGREGATION
result = all_data.groupby(...)
However, reconsider growing a data frame inside a loop. As @unutbu warns: never call DataFrame.append or pd.concat inside a for-loop, as it leads to quadratic copying. Instead, build a list of data frames and concatenate once outside the loop, which you can do with a list comprehension, even assigning the filename:
# BUILD LIST OF DFs
df_list = [pd.read_excel(f, index_col=None, na_values=['NA']).assign(filename=f)
           for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx')]

# CONCATENATE ALL DFs
data = pd.concat(df_list, ignore_index=True)

# AGGREGATE DATA (double brackets: select a list of columns, not a tuple)
result = data.groupby(["Date"])[["Families", "Individuals"]].agg([np.sum])

file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
I've got a number of Excel workbooks, each with multiple worksheets, that I'd like to combine.
I've set up two sets of loops (one while, one for) to read in rows for each sheet in a given workbook and then do the same for all workbooks.
I tried this on a subset, and it appears to work until I try to combine the two sets using the pd.concat function. The error given is:
TypeError: first argument must be an iterable of pandas objects, you
passed an object of type "DataFrame"
Any idea what I'm doing incorrectly?
import pandas as pd

d = 2013
numberOfSheets = 5

while d < 2015:
    #print(str(d) + ' beginning')
    f = 'H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        data = pd.read_excel(f, sheetname='Table ' + str(i), header=None)
        print(i)
        df.append(data)
    print(str(d) + ' complete')
    print(df)
    d += 1

df = pd.concat(df)
print(df)

final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df.to_excel(final)
As the error says, pd.concat() requires an iterable, like a list: pd.concat([df1, df2]) will concatenate df1 and df2 along the default axis of 0, which means df2 is appended to the bottom of df1.
Two issues need fixing:
The for loop refers to df before assigning anything to it.
The variable df is overwritten with each iteration of the for loop.
One workaround is to create an empty list of DataFrames before the loops, then append DataFrames to that list, and finally concatenate all the DataFrames in that list. Something like this:
import pandas as pd

d = 2013
numberOfSheets = 5
dfs = []

while d < 2015:
    #print(str(d) + ' beginning')
    f = 'H:/MyDocuments/Z Project Work/scriptTest ' + str(d) + '.xlsx'
    for i in range(1, numberOfSheets + 1):
        data = pd.read_excel(f, sheetname='Table ' + str(i), header=None)
        print(i)
        dfs.append(data)
    print(str(d) + ' complete')
    d += 1

# ignore_index=True gives the result a default IntegerIndex
# starting from 0
df_final = pd.concat(dfs, ignore_index=True)
print(df_final)

final_path = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final_path)
Since I can't comment, I'll leave this as an answer: you can speed up this code by opening the file once then parsing the workbook to get each sheet. Should save a second or two off each iteration, since opening the Excel file takes the longest. Here's some code that might help.
Note: setting sheetname=None (sheet_name in newer pandas) will return ALL the sheets in the workbook as a dict:
dfs = {<sheetname1>: <DataFrame1>, <sheetname2>: <DataFrame2>, etc.}
Here's the code:
xl = pd.ExcelFile(fpath)
dfs = xl.parse(sheetname=None, header=None)  # dict of {sheet name: DataFrame}
for i, (name, df) in enumerate(dfs.items()):
    # <do stuff with each sheet, if you want>
    print('Sheet {0} looks like:\n{1}'.format(i + 1, df))
Thank you, both. I accepted the answer that addressed the specific question, but was able to use the second answer and some additional googling (e.g., glob) to amend the original code and automate it fully, independent of the number of workbooks or worksheets.
The final version of the above is now below:
import pandas as pd
import glob
#import numpy as np
#import os, collections, csv
#from os.path import basename

fpath = "H:/MyDocuments/Z Project Work/"
dfs = []
files = glob.glob(fpath + '*.xlsx')

for f in files:
    xl = pd.ExcelFile(f)
    xls = xl.parse(sheetname=None, header=0)  # dict keyed by sheet name
    for i, df in enumerate(xls):  # iterating the dict yields sheet names
        print(i)
        dfs.append(xls[df])
    print(f + ' complete')

df_final = pd.concat(dfs, ignore_index=True)
final = "H:/MyDocuments/Z Project Work/mergedfile.xlsx"
df_final.to_excel(final)
Excel caps the number of columns it will display (16,384 in current versions). I am trying to write 125,000 columns in the following format:
O1
MA1
MI1
C1
V1
...
O125000
MA125000
MI125000
C125000
V125000
import pandas as pd

def formatting(i):
    return tuple(map(lambda x: x + str(i), ("O", "MA", "MI", "C", "V")))

l = []
for i in range(1, 125001):
    l.extend(formatting(i))

f = pd.read_csv('file.csv')
f.columns = l
f.to_csv('new_file.csv')
I tried this script, but it's too slow and prone to errors. However, it should give you an idea of what I am trying to do.
The current script I use to generate a CSV (containing 2 rows and 125,000+ columns) is the following:
import pandas as pd
import glob

allfiles = glob.glob('*.csv')
index = 0

def testing(file):
    #file = file.loc[:,'Open':'Volume']
    file = file.values.reshape(1, -1)
    return file

for _fileT in allfiles:
    nFile = pd.read_csv(_fileT, header=0, usecols=range(1, 6))
    fFile = testing(nFile)
    df = pd.DataFrame(fFile)
    new_df = df.iloc[:125279]
    new_df = new_df.shift(1, axis=1)
    new_df.to_csv('HeadCSV/FinalCSV.csv', mode='a', index=False, header=0)
This script reads the CSV files in the directory and aggregates them into one file. However, how can I make sure that it writes the header described above and labels the two rows it prints out?
I'd basically like to combine these two scripts in the most logical way possible: write the header, get all the data from the files into the dataframe, do the row indexing as mentioned, and finally write it all to a CSV.
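A minimal sketch of that combination (assumptions: every input CSV yields the same five usecols as above, so every flattened row has equal length; the helper name build_header is illustrative, not from the original scripts):

import glob
import pandas as pd

def build_header(n):
    # generate O1, MA1, MI1, C1, V1, ..., On, MAn, MIn, Cn, Vn
    header = []
    for i in range(1, n + 1):
        header.extend(prefix + str(i) for prefix in ("O", "MA", "MI", "C", "V"))
    return header

rows = []
for f in glob.glob('*.csv'):
    nFile = pd.read_csv(f, header=0, usecols=range(1, 6))
    rows.append(nFile.values.reshape(-1))  # flatten each file into one wide row

out = pd.DataFrame(rows)
out.columns = build_header(len(out.columns) // 5)  # header written once, up front
out.to_csv('FinalCSV.csv', index=True)             # the index labels each row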
# Program to combine data from 2 CSV files
The cdc_list gets updated after the second call of read_csv:
overall_list = []

def read_csv(filename):
    file_read = open(filename, "r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return(overall_list)

cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list)) #3652

total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list)) #9131

print(len(cdc_list)) #9131
The reason cdc_list changes is that overall_list is a module-level list shared across calls: read_csv appends to it and returns a reference to that same list, so cdc_list and total_list both end up pointing at one growing list.
However, if all you want to do is merge two CSVs (assuming they both have the same columns), you can use pandas' read_csv and the DataFrame methods append and to_csv to achieve this in 3 lines of code (not including imports):
import pandas as pd
# Read CSV file into a Pandas DataFrame object
df = pd.read_csv("first.csv")
# Read and append the 2nd CSV file to the same DataFrame object
df = df.append( pd.read_csv("second.csv") )
# Write merged DataFrame object (with both CSV's data) to file
df.to_csv("merged.csv")