Trouble merging .xlsx files with pandas - python

I am operating with python 2.7 and I wrote a script that should take the names of two .xlsx files, use pandas to convert them into two dataframes, and then concatenate them.
The two files under consideration have the same rows and different columns.
Basically, I have these two Excel files:
I would like to keep the same rows and just unite the columns.
The code is the following:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = pd.concat([sheet10, sheet20], sort = False)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, 'Sheet 1')
output.save()
Instead of doing what I expected (given the examples I read online), the output becomes something like this:
Does anyone know how I could improve my script?
Thank you very much.

The best answer here really depends on the exact shape of your data. Based on the example you have provided, it looks like the data is indexed identically between the two dataframes, with differing column headers that you want preserved. If this is the case, this would be the best solution:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = sheet10.merge(sheet20, how="left", left_index=True, right_index=True)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, sheet_name='Sheet 1')
output.save()
Since there is a direct match between the number of rows in the two initial dataframes it doesn't really matter if a left, right, outer, or inner join is used. In this example I used a left join.
If the rows in the two data frames do not perfectly line up though, the join method selected can have a huge impact on your output. I recommend looking at pandas documentation on merge/join/concatenate before you go any further.
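A small runnable sketch (with made-up frames, not the asker's files) of how the join type changes the output once the indexes only partly overlap:

```python
import pandas as pd

# Two toy frames whose indexes only partially overlap (hypothetical data).
a = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({"y": [10, 20]}, index=[1, 2])

left = a.merge(b, how="left", left_index=True, right_index=True)
inner = a.merge(b, how="inner", left_index=True, right_index=True)

print(len(left))   # 3 rows: all of a kept, y is NaN where b has no match
print(len(inner))  # 2 rows: only the indexes present in both frames
```

With perfectly aligned indexes, all four join types give the same result, which is why the choice didn't matter above.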

To get the expected output using pd.concat, the column names in both dataframes should be the same. Here's how to do it:
# Create a 1:1 mapping of sheet10 and sheet20 columns
cols_mapping = dict(zip(sheet20.columns, sheet10.columns))
# Rename the columns in sheet20 to match with that of sheet10
sheet20_renamed = sheet20.rename(cols_mapping, axis=1)
concatenated = pd.concat([sheet10, sheet20_renamed])
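A self-contained sketch of this rename-then-concat approach, with two hypothetical frames standing in for sheet10 and sheet20:

```python
import pandas as pd

# Hypothetical stand-ins for sheet10 and sheet20: same shape, different headers.
sheet10 = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
sheet20 = pd.DataFrame({"label": ["c", "d"], "amount": [3, 4]})

# 1:1 positional mapping of sheet20's columns onto sheet10's
cols_mapping = dict(zip(sheet20.columns, sheet10.columns))
sheet20_renamed = sheet20.rename(cols_mapping, axis=1)
concatenated = pd.concat([sheet10, sheet20_renamed], ignore_index=True)

print(concatenated)
# Four rows under the two shared headers, with no NaN padding.
```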

Related

merging two excel files and then removing duplicates that it creates

I've just started using python so could do with some help.
I've merged data in two excel files using the following code:
# Import pandas library
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
#export new dataframe to excel
df.to_excel('WLM module data_test4.xlsx')
This does merge the data, but where dataframe 1 has multiple entries for a module, it duplicates the df2 data in the new merged file so that the entry counts match. Here's an example:
output
So I want to only have one entry for the moderation of the module, whereas I have two at the moment (highlighted in red).
I also want to remove the additional columns : "term_y", "semester_y", "credits_y" and "students_y" in the final output as they are just repeats of data I already have in df1.
Thanks!
I think what you want is duplicated(), gathered from
Pandas - Replace Duplicates with Nan and Keep Row
&
Replace duplicated values with a blank string
So what you want is this after your merge: df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
Please read both stackoverflow link examples to understand how this works better.
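A minimal illustration (made-up data) of why this blanks only the repeats: duplicated() marks every occurrence after the first as True.

```python
import pandas as pd

# Toy frame with a repeated key (hypothetical, mirroring module_id).
df = pd.DataFrame({"module_id": ["M1", "M1", "M2"], "load": [5, 6, 7]})

# duplicated() is False for the first occurrence and True for repeats,
# so only the second "M1" gets blanked.
df.loc[df["module_id"].duplicated(), "module_id"] = pd.NA
print(df["module_id"].tolist())
```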
So the full code would look like this:
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
#export new dataframe to excel
df.to_excel('WLM module data_test5-working.xlsx')
There are many ways to drop columns too.
I've chosen, for lack of more time, to do this:
df.drop(df.columns[2], axis=1, inplace=True)
from https://www.stackvidhya.com/drop-column-in-pandas/
Change df.columns[2] to the Nth column you want to drop (since my working data was different to yours*).
This goes after the merge, so the full code will look like this:
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
#https://www.stackvidhya.com/drop-column-in-pandas/
df.drop(df.columns[2], axis=1, inplace=True)
#export new dataframe to excel
df.to_excel('WLM module data_test6-working.xlsx')
Hope I've helped. I'm just very happy I got you somewhere with this; it was to both of our benefit.
Happy you have a working answer.
And if you want to create a new df out of the merged, duplicate-blanked, column-dropped df, you can do this:
new = df.drop(df.iloc[: , [1, 2, 7]], axis=1)
from Extracting specific selected columns to new DataFrame as a copy
So the full code would look something like this (please adjust the column numbers as you need), which is what I wanted:
# Import pandas library
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
df.loc[df['module_id'].duplicated(), 'module_id'] = pd.NA
new = df.drop(df.iloc[: , [1, 2, 7]], axis=1)
#new=pd.DataFrame(df.drop(df.columns[2], axis=1, inplace=True))
print(new)
#export new dataframe to excel
df.to_excel('WLM module data_test12.xlsx')
new.to_excel('WLM module data_test13.xlsx')
Note: *When I did mine above, I deliberately didn't have any headers in the columns, to make it as generic as possible, so I used iloc to specify column numbers initially. (Your original question was not that descriptive or clear, but I got the point.) I think you should include copyable draft data (not screenshots) next time, plus clearer whys and hows, to make it easier for people here to engage with the post. And SO isn't a free code-writing service, you know, but it was hugely to my benefit as well to delve into this.
Could you provide a sample of desired output?
Otherwise, choosing the right type of merge should resolve your issue. Have a look at the documentation, there are the possible options and their corresponding SQL statements listed:
https://pandas.pydata.org/docs/reference/api/pandas.merge.html
Regarding the additional columns you have two options:
Again from the documentation: Select the suffixes with the suffixes parameter. To add suffixes only to duplicate columns from df1, you could set them to something like suffixes=("B2", "").
Use df2 within the merge only with the columns needed in the output. E.g.
df = df1.merge(df2[['module_id', 'moderator']], on = 'module_id', how='outer')
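A self-contained sketch of that second option, with hypothetical column names standing in for the asker's data:

```python
import pandas as pd

# Hypothetical frames: df2 shares several columns with df1, but only
# "moderator" is actually new information.
df1 = pd.DataFrame({"module_id": ["M1", "M2"], "term": [1, 2]})
df2 = pd.DataFrame({"module_id": ["M1", "M2"], "term": [1, 2],
                    "moderator": ["Ann", "Bob"]})

# Selecting the needed columns before the merge avoids _x/_y suffix clutter.
df = df1.merge(df2[["module_id", "moderator"]], on="module_id", how="outer")
print(list(df.columns))  # ['module_id', 'term', 'moderator']
```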
And further to the three working code samples below, each answering a part of your question,
you could do the whole thing using iloc, which is what I prefer.
# Import pandas library
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("Moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
#apply duplicated() on one of the columns (see more about duplicated() in my post below)
df.loc[df[df.iloc[:0,8].name].duplicated(), 'module_id'] = pd.NA
# drop the columns you don't want and save to a new df
new = df.drop(df.iloc[: , [11, 13, 14, 15]], axis=1)
#new=pd.DataFrame(df.drop(df.columns[2], axis=1, inplace=True))
print(new)
#export new dataframe to excel
df.to_excel('WLM module data_test82.xlsx')
new.to_excel('WLM module data_test83.xlsx') #<-- this is your data after dropping the columns. Wanted to do it separately so you can see what is happening. The default is just to modify the old df.
print(df.iloc[:0,8].name) # to show you how to get the header name from iloc
print(df.iloc[:0,8]) # to show you what iloc gives on its own
"term_y", "semester_y", "credits_y" and "students_y" are columns 12, 14, 15 & 16, the ones you want to remove, so I've done that here.
iloc starts from 0, so: new = df.drop(df.iloc[:, [11, 13, 14, 15]], axis=1)
So, like the 3rd piece of code before, this does what you wanted. All you have to do is change the column numbers it refers to. (If you'd given us dummy text replicating your use case instead of a screenshot, we would have copied that to work with, instead of having no time and writing it out ourselves.) Post edit 14:48 24/04/22: just done that here for you. Just copy the code and run it.
You have Module (col 3), Module_Id (col 4) and module name (col 13) in your data. (In my dummy data that was column 9, iloc 8; as said, I didn't have time to replicate it perfectly, just the idea.) But I think it's the module_id column (column 9, iloc 8) that you want not just to merge on but also to apply .duplicated() to, so you can run the code as is if that's the case.
If it's not, just change df.loc[df[df.iloc[:0,8].name].duplicated(), 'module_id'] = pd.NA from number 8 to 2, 3 or 12 for your use case/columns.
I think I prefer this answer, because counting columns by number frees you from having to call them by name and allows for a different type of automation. You can still use contains or a regex to locate and work with column data later on, but this is another method with its own power over having to rely on names; more precise, I feel.
Literally plug this code in, run it, play with it, and let me know how it goes. It all works for me.
Thanks everyone for your help, this is my final code which seems to work:
# Import pandas library
import pandas as pd
#import excel files
df1 = pd.read_excel("B2 teaching.xlsx")
df2 = pd.read_excel("moderation.xlsx")
#merge dataframes 1 and 2
df = df1.merge(df2, on = 'module_id', how='outer')
#drop columns not needed
df.drop('term_y', inplace=True, axis=1)
df.drop('semester_y', inplace=True, axis=1)
df.drop('credits_y', inplace=True, axis=1)
df.drop('n_students_y', inplace=True, axis=1)
#drop duplicated rows
df.loc[df['module_name'].duplicated(), 'module_name'] = pd.NA
df.loc[df['moderation_wl'].duplicated(), 'moderation_wl'] = pd.NA
#export new dataframe to excel
df.to_excel('output.xlsx')

How to concatenate multiple selected sheets from many Excel spreadsheets

I'm relatively new to python and pandas and I face the following problem: I have 20+ spreadsheets with multiple sheets each. I'd like to concatenate the second sheet from each spreadsheet into a single spreadsheet. I'm using the code below, which works to the point that it creates a list of sheets but doesn't concatenate them correctly: the combined file has only the sheet from the first file. Each sheet has the same header row and the same structure.
Any help would be appreciated. The code I'm using is below:
import os
import glob
import pandas as pd
os.chdir(r"C:\Users\Site_Users")
extension = 'xlsx'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
xl_list=[]
for f in all_filenames:
    df = pd.read_excel(f, sheet_name = 1)
    xl_list.append(df)
combined = pd.concat(xl_list, ignore_index = True)
combined.to_excel("combined.xlsx", index=False)
Working under the assumption that you have a list of df's, try adding axis=0 to your concat.
i.e.
combined = pd.concat(xl_list, axis = 0, ignore_index = True)
Just to close the loop on this: I found the answer. The code was correct, but there were a number of rows which looked empty yet had formulas in them, which to the code looked like non-empty cells, so it was adding those rows to the combined sheet. Because of this I missed the added rows, as they were 400 rows below the empty ones.
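If anyone hits the same issue, one way to filter out such rows after reading is to treat empty strings as missing and drop rows with no key. A sketch with made-up data and a hypothetical key column called "name":

```python
import pandas as pd

# Hypothetical frame mimicking rows that "looked empty": the key column
# is blank (empty string or None) even though the row was read in.
combined = pd.DataFrame({"name": ["a", "", None, "b"], "value": [1, 2, 3, 4]})

# Treat empty strings as missing, then drop rows with no key.
cleaned = combined.replace("", pd.NA).dropna(subset=["name"])
print(len(cleaned))  # 2
```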

How to join multiple dataframes within a loop using python pandas

I have 3 tables on each excel sheet: sheet1 - Gross, sheet2 - Margin, sheet3 - Revenue
So I was able to iterate through each sheet and unpivot it.
But how can I join them together?
sheet_names = ['Gross','Margin','Revenue']
full_table = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel('BudgetData.xlsx', sheet_name = sheet, index=False)
    unpvt = pd.melt(df, id_vars=['Company'], var_name='Month', value_name=sheet)
    # how can I join unpivoted dataframes here?
    print(unpvt)
Desirable result:
UPDATE:
Thanks #Celius Stingher.
I think this is what I need. It just gives me weird sorting:
and gives me this warning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
from ipykernel import kernelapp as app
So it seems you are doing the pivoting but not saving each unpivoted dataframe anywhere. Let's create a list of dataframes, that will store each unpivoted dataframe. Later, we will pass that list of dataframes as argument for the pd.concat function to perform the concatenation.
sheet_names = ['Gross','Margin','Revenue']
list_of_df = []
full_table = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel('BudgetData.xlsx', sheet_name = sheet, index=False)
    df = pd.melt(df, id_vars=['Company'], var_name='Month', value_name=sheet)
    list_of_df.append(df)
full_df = pd.concat(list_of_df,ignore_index=True)
full_df = full_df.sort_values(['Company','Month'])
print(full_df)
Edit:
Now that I understand what you need, let's try a different approach. After the loop, try the following code instead of the pd.concat:
full_df = list_of_df[0].merge(list_of_df[1],on=['Company','Month']).merge(list_of_df[2],on=['Company','Month'])
A pd.concat will just pile everything together; you want to actually merge the DataFrames using pd.merge. This works similarly to a SQL Join statement (based on the 'desired' image in your post).
https://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html
You just want to use a list of columns to merge on. If you get them all into tidy data frames with the same names as your sheets, you would want to do something like:
gross.merge(margin, on=['Company', 'Month']).merge(revenue, on=['Company', 'Month'])
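A runnable sketch of that chained merge, using tiny made-up frames in place of the melted sheets:

```python
import pandas as pd

# Hypothetical tidy frames, one per sheet, as they'd look after melting.
gross = pd.DataFrame({"Company": ["A", "B"], "Month": ["Jan", "Jan"],
                      "Gross": [100, 200]})
margin = pd.DataFrame({"Company": ["A", "B"], "Month": ["Jan", "Jan"],
                       "Margin": [10, 20]})
revenue = pd.DataFrame({"Company": ["A", "B"], "Month": ["Jan", "Jan"],
                        "Revenue": [110, 220]})

# Chain the merges on the shared key columns.
full = (gross.merge(margin, on=["Company", "Month"])
             .merge(revenue, on=["Company", "Month"]))
print(list(full.columns))
# One row per (Company, Month) with all three measures side by side.
```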

How do I quickly append or concatenate many pandas dataframes from pkl files?

I have approx 50,000 .pkl files, which each contain two pandas, which I want to append to two large pandas.
I tried to loop over the files, reading them in, and appending one by one which gets painfully slow (why? see here):
DF_a = pd.DataFrame()
DF_b = pd.DataFrame()
for appended_file in os.listdir(folderwithallfiles):
    with open(appenddirectory + appended_file, 'rb') as data:
        df_a, df_b = pickle.load(data)
    DF_a = pd.concat([DF_a, df_a], axis=0)
    DF_b = pd.concat([DF_b, df_b], axis=0)
As suggested in the linked post, I am trying to build a list of dataframes to concatenate, but the only way I can think of doing it would be to rename the dataframes in the loop (like here), which is advised against. I do not see how I can fit them into a dictionary and concat from there. Any advice?
This works:
DF_a = pd.concat([pd.read_pickle(appenddirectory+filename)[0] for filename in appendedfiles])
DF_b = pd.concat([pd.read_pickle(appenddirectory+filename)[1] for filename in appendedfiles])
since pd.read_pickle returns the tuple of dataframes stored in each pkl file

Concat pandas dataframes without following a certain sequence

I have data files which are converted to pandas dataframes. Some share column names while others share a time-series index, and I wish to combine them all into one dataframe based on both column and index whenever they match. Since there is no sequence in the naming, they appear in random order for concatenation. If two dataframes with different columns are concatenated along axis=1, it works well; but if the resulting dataframe is then combined with a new df whose column name matches one of the earlier merged dataframes, the concat fails. For example, with these data files:
import pandas as pd
df1 = pd.read_csv('0.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df2 = pd.read_csv('1.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df3 = pd.read_csv('2.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
data1 = pd.DataFrame()
file_list = [df1, df2, df3] # fails
# file_list = [df2, df3,df1] # works
for fn in file_list:
    if data1.empty == True or fn.columns[1] in data1.columns:
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
I get ValueError: Plan shapes are not aligned when I try to do that. In my case there is no way to first load all the DataFrames and check their column names. If I could, I would combine all dfs with the same column names first and only then concat the resulting dataframes with different column names along axis=1, which I know always works as shown below. However, a solution which requires preloading all the DataFrames and rearranging the sequence of concatenation is not possible in my case (it was only done for the working example above). I need flexibility: in whichever sequence the information comes, it should be concatenated with the larger dataframe data1. Please let me know if you have a suitable approach.
If you go through the loop step by step, you can find that in the first iteration it goes into the if, so data1 is equal to df1. In the second iteration it goes to the else, since data1 is not empty and 'Temperature product barrel ValueY' is not in data1.columns.
After the else, data1 has some duplicated column names, and in every row of the duplicated columns one of the two values is NaN while the other is a float. This is the reason why pd.concat() fails.
You can aggregate the duplicate columns before you try to concatenate to get rid of it:
import numpy as np

for fn in file_list:
    if data1.empty == True or fn.columns[1] in data1.columns:
        # new: collapse duplicate columns before concatenating
        data1 = data1.groupby(data1.columns, axis=1).agg(np.nansum)
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
After that, you would get
data1.shape
(30, 23)
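The same duplicate-column collapse can be sketched on a tiny made-up frame. This variant transposes and groups on the index rather than using an axis=1 groupby, which newer pandas versions deprecate:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a duplicated column name, where exactly one of
# each pair is NaN per row (the situation described above).
df = pd.DataFrame([[1.0, np.nan], [np.nan, 2.0]], columns=["T", "T"])

# Collapse duplicate columns by summing while ignoring NaN;
# min_count=1 keeps an all-NaN pair as NaN instead of 0.
collapsed = df.T.groupby(level=0).sum(min_count=1).T
print(collapsed["T"].tolist())  # [1.0, 2.0]
```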
