Join multiple excel files into one excel file via python - python

I need to combine all the excel files in my directory into one excel file.
for example, I've 3 excel files:
file1:
file 2:
file 3:
I need to concatenate them and get an output as follows
but instead, they were appended one after another
this is my code :
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob('C:/Users/test-results/FinalResult/05-01-2019/*.xlsx'):
df = pd.read_excel(f)
all_data = all_data.append(df, ignore_index=True)
writer = pd.ExcelWriter('mycollected_data.xlsx', engine='xlsxwriter')
all_data.to_excel(writer, sheet_name='Sheet1')
writer.save()
during my quest, all I found was how to append dfs,as shown in my code and I didn't figure out how too use join

You could use
files = glob.glob('C:/Users/test-results/FinalResult/05-01-2019/*.xlsx')
dfs = (pd.read_excel(f, index_col=0) for f in files)
all_data = pd.concat(dfs, axis=1)

Try this:
all_data = pd.concat([all_data,df],axis=1)

all_data = all_data.merge(df, on = ['first_column_name'], how = 'left')

Related

Read from mutiple excel and write to one file

I am trying to read data from multiple xls files and write it to one single file.
My code below is writing only the first file. Not sure what I am missing.
import glob import os import pandas as pd
def list_files(dir):
r = []
for root, dirs, files in os.walk(dir):
for name in files:
r.append(os.path.join(root, name))
return r
files = list_files("C:\\Users\\12345\\BOFS")
for file in files:
df = pd.read_excel(file)
new_header = df.iloc[1]
df = df[2:]
df.columns = new_header
with pd.ExcelWriter("C:\\Users\\12345\\Test\\Test.xls", mode='a') as writer:
df.to_excel(writer,index=False, header=True,)
Documentation says:
ExcelWriter can also be used to append to an existing Excel file:
with pd.ExcelWriter('output.xlsx',
mode='a') as writer:
df.to_excel(writer, sheet_name='Sheet_name_3')
And that probably replaces given sheet
But you could use pd.concat(<dataframes>) to concatenate dataframes and write all data at once in a single sheet.
I tested this piece of code, hopefully its work in your case.
import glob, os
os.chdir("D:/Data Science/stackoverflow")
for file in glob.glob("*.xlsx"):
df = pd.read_excel(file)
all_data = all_data.append(df,ignore_index=True)
# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer,'sheet1')
writer.save()

Python Pandas join a few files

I import a few xlsx files into pandas dataframe. It works fine, but my problem that it copies all the data under each other (so I have 10 excel file with 100 lines = 1000 lines).
I need the Dataframe with 100 lines and 10 columns, so each file will be copied next to each other and not below.
Are there any ideas how to do it?
import os
import pandas as pd
os.chdir('C:/Users/folder/')
path = ('C:/Users/folder/')
files = os.listdir(path)
allNames = pd.DataFrame()
for f in files:
info = pd.read_excel(f,'Sheet1')
allNames = allNames.append(info)
writer = pd.ExcelWriter ('Output.xlsx')
allNames.to_excel(writer, 'Copy')
writer.save()
You can feed your spreadsheets as an array of dataframes directly to pd.concat():
import os
import pandas as pd
os.chdir('C:/Users/folder/')
path = ('C:/Users/folder/')
files = os.listdir(path)
allNames = pd.concat([pd.read_excel(f,'Sheet1') for f in files], axis=1)
writer = pd.ExcelWriter ('Output.xlsx')
allNames.to_excel(writer, 'Copy')
writer.save()
Instead of stacking the tables vertically like this:
allNames = allNames.append(info)
You'll want to concatenate them horizontally like this:
allNames = pd.concat([allNames , info], axis=1)

Read selected data from multiple files

I have 200 .txt files and need to extract one row data from each file and create a different dataframe.
For example (abc1.txt,abc2.txt, .etc) set of files and i need to extract 5th row data from each file and create a dataframe. When reading files, columns need to be separated by '/t' sign.
like this
data = pd.read_csv('abc1.txt', sep="\t", header=None)
I can not figure out how to do all this with a loop. Can you help?
Here is my answer:
import pandas as pd
from pathlib import Path
path = Path('path/to/dir')
files = path.glob('*.txt')
to_concat = []
for f in files:
df = pd.read_csv(f, sep="\t", header=None, nrows=5).loc[4:4]
to_concat.append(df)
result = pd.concat(to_concat)
I have used nrows to read only first 5 rows and then .loc[4:4] to get dataframe rather than series (when you use .loc[4].
Here you go:
import os
import pandas as pd
directory = 'C:\\Users\\PC\\Desktop\\datafiles\\'
aggregate = pd.DataFrame()
for filename in os.listdir(directory):
if filename.endswith(".txt"):
data = pd.read_csv(directory+filename, sep="\t", header=None)
row5 = pd.DataFrame(data.iloc[4]).transpose()
aggregate = aggregate.append(row5)

Excel file overwritten instead of concat - Python - Pandas

I'm trying to contact all excel files and worksheets in them into one using the below script. It kinda works but then the excel file c.xlsx is overwritten per file, so only the last excel file is concated not sure why?
import pandas as pd
import os
import ntpath
import glob
dir_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(dir_path)
cdf = None
for excel_names in glob.glob('*.xlsx'):
print(excel_names)
df = pd.read_excel(excel_names, sheet_name=None, ignore_index=True)
cdf = pd.concat(df.values())
cdf.to_excel("c.xlsx", header=False, index=False)
Idea is create list of DataFrames in list comprehension, but because working with orderdict is necessary concat in loop and then again concat for one big final DataFrame:
cdf = [pd.read_excel(excel_names, sheet_name=None, ignore_index=True).values()
for excel_names in glob.glob('files/*.xlsx')]
df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
#print (df)
df.to_excel("c.xlsx", index=False)
I just tested the code below. It merges data from all Excel files in a folder into one, single, Excel file.
import pandas as pd
import numpy as np
import glob
glob.glob("C:\\your_path\\*.xlsx")
all_data = pd.DataFrame()
for f in glob.glob("C:\\your_path\\*.xlsx"):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
print(all_data)
df = pd.DataFrame(all_data)
df.shape
df.to_excel("C:\\your_path\\final.xlsx", sheet_name='Sheet1')
I got it working using the below script which uses #ryguy72's answer but works on all worksheets as well as the header row.
import pandas as pd
import numpy as np
import glob
all_data = pd.DataFrame()
for f in glob.glob("my_path/*.xlsx"):
df = pd.read_excel(f, sheet_name=None, ignore_index=True)
cdf = pd.concat(df.values())
all_data = all_data.append(cdf,ignore_index=True)
print(all_data)
df = pd.DataFrame(all_data)
df.shape
df.to_excel("my_path/final.xlsx", sheet_name='Sheet1')

Df export all into one larger file

I have a df reading in multiple .xlsx files. I have manipulated what I need in the files and the export view is exact. However, I need the data to export into one larger 2 column file rather than multiple individual files.
Any help is appreciated. I haven't been able to figure the problem out on my own.
import os
import glob
import pandas as pd
folder = input('Enter the folder name: ')
os.chdir('C:/Users/PCTR261010/Desktop/' + folder)
FileList = glob.glob('*.xlsx')
for fname in FileList:
df = pd.read_excel(fname).assign(New=os.path.basename('mpcc_' + (fname.split('-', 1)[0]).split('#', 1)[1]))
df1 = df[['New', '<ID>']]
writer = pd.ExcelWriter('ParttoMPCC_Import.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='Import', index=False, header=False)
writer.save()
You can append desired columns in a single DataFrame and write that DataFrame to an excel file. Below code should do the job.
import os
import glob
import pandas as pd
folder = input('Enter the folder name: ')
os.chdir('C:/Users/PCTR261010/Desktop/' + folder)
FileList = glob.glob('*.xlsx')
df1 = pd.DataFrame() # create an empty df
for fname in FileList:
df = pd.read_excel(fname).assign(New=os.path.basename('mpcc_' + (fname.split('-', 1)[0]).split('#', 1)[1]))
df1 = df1.append(df[['New', '<ID>']]) # append columns data to the df1
writer = pd.ExcelWriter('ParttoMPCC_Import.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='Import', index=False, header=False)
writer.save()
You can use pd.concat as follows:
data = []
for fname in FileList:
df = pd.read_excel(fname).assign(New=os.path.basename('mpcc_' + (fname.split('-', 1)[0]).split('#', 1)[1]))
df1 = df[['New', '<ID>']]
data.append(df1)
writer = pd.ExcelWriter('ParttoMPCC_Import.xlsx', engine='xlsxwriter')
df = pd.concat(data)
df.to_excel(writer, sheet_name='Import', index=False, header=False)
writer.save()

Categories