How to join multiple dataframes within a loop using python pandas - python

I have 3 tables on each excel sheet: sheet1 - Gross, sheet2 - Margin, sheet3 - Revenue
So I was able to iterate through each sheet and unpivot it.
But how can I join them together?
sheet_names = ['Gross', 'Margin', 'Revenue']
full_table = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel('BudgetData.xlsx', sheet_name=sheet)
    unpvt = pd.melt(df, id_vars=['Company'], var_name='Month', value_name=sheet)
    # how can I join the unpivoted dataframes here?
    print(unpvt)
Desired result:
UPDATE:
Thanks @Celius Stingher.
I think this is what I need. It just gives me weird sorting:
and gives me this warning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.

So it seems you are doing the unpivoting but not saving each unpivoted dataframe anywhere. Let's create a list that stores each unpivoted dataframe, then pass that list to the pd.concat function to perform the concatenation.
sheet_names = ['Gross', 'Margin', 'Revenue']
list_of_df = []
for sheet in sheet_names:
    df = pd.read_excel('BudgetData.xlsx', sheet_name=sheet)
    df = pd.melt(df, id_vars=['Company'], var_name='Month', value_name=sheet)
    list_of_df.append(df)
full_df = pd.concat(list_of_df, ignore_index=True)
full_df = full_df.sort_values(['Company', 'Month'])
print(full_df)
Edit:
Now that I understand what you need, let's try a different approach. After the loop, try the following code instead of the pd.concat:
full_df = (
    list_of_df[0]
    .merge(list_of_df[1], on=['Company', 'Month'])
    .merge(list_of_df[2], on=['Company', 'Month'])
)

A pd.concat will just pile everything together; based on the 'desired' image in your post, you actually want to merge the DataFrames using pd.merge, which works similarly to a SQL JOIN statement.
https://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html
You just want to use a list of columns to merge on. If you get them all into tidy dataframes with the same names as your sheets, you would do something like:
gross.merge(margin, on=['Company', 'Month']).merge(revenue, on=['Company', 'Month'])
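To make the chained merge concrete, here is a minimal runnable sketch using small made-up frames (the values are hypothetical; only the Company/Month key structure matches the question):

```python
import pandas as pd

# Hypothetical tidy frames, one per sheet, each keyed by Company and Month
gross = pd.DataFrame({'Company': ['A', 'B'], 'Month': ['Jan', 'Jan'], 'Gross': [100, 200]})
margin = pd.DataFrame({'Company': ['A', 'B'], 'Month': ['Jan', 'Jan'], 'Margin': [10, 20]})
revenue = pd.DataFrame({'Company': ['A', 'B'], 'Month': ['Jan', 'Jan'], 'Revenue': [1, 2]})

# Each merge matches rows on the key columns and adds the new value column
full = gross.merge(margin, on=['Company', 'Month']).merge(revenue, on=['Company', 'Month'])
print(full.columns.tolist())  # ['Company', 'Month', 'Gross', 'Margin', 'Revenue']
```

Each sheet contributes one value column, so the result has one row per Company/Month pair rather than the stacked rows pd.concat would produce.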

Related

Pandas to_excel() - can't figure out how to export multiple dataframes as different sheets in one excel using for loop

There are many similar questions to this. I have looked through them all and I can't figure out how to fix my issue.
I have 11 dataframes. I would like to export all of these dataframes to one Excel file, with one sheet per dataframe. I have two lists: one is a list of dataframe objects, and one is a list of the names I want for each df. The lists are ordered so that iterating through both at the same time pairs each df with the name I want for it.
Here is my code:
for (df, df_name) in zip(df_list, df_name_list):
    sheetname = "{}".format(df_name)
    df.to_excel(r"myfolder\myfile.xlsx", index=False, sheet_name=sheetname)
It exports to Excel, but it appears to overwrite the file each time. The final sheet has the same name as the final dataframe, so it looped through both lists but it won't save separate sheets. Any help would be much appreciated!
UPDATE - ISSUE FIXED - EDITING TO ADD THE CODE THAT WORKED
with pd.ExcelWriter(r"myfolder\myfile.xlsx") as writer:
    for (df, df_name) in zip(df_list, df_name_list):
        sheetname = "{}".format(df_name)
        df.to_excel(writer, sheet_name=sheetname)
I just tried something based on the docs example and it seems to work ok:
import pandas as pd
data = [(f'Sheet {j}', pd.DataFrame({'a': [i for i in range(20)], 'b': [j for i in range(20)]})) for j in range(10)]
with pd.ExcelWriter('output.xlsx') as writer:
    for sheet, df in data:
        df.to_excel(writer, sheet_name=sheet)

Can't figure out why pandas.concat is creating extra column when concatenating two frames

I've been trying to concatenate two sheets while preserving the original indices of both dataframes. However, upon concatenation I can't seem to get the result to output the way I expect or want.
If I use ignore_index = True, the old indices are replaced by an index that spans both sheets' total rows.
If I use ignore_index = False, the indices are preserved, but there is a new empty column preceding them, shifting the rest of the columns over by one.
How can I concatenate my sheets without this excess column?
import os
import pandas as pd
import easygui

path = easygui.fileopenbox("Select a file")
xls = pd.ExcelFile(path)
potential_names = [sheet for sheet in xls.sheet_names if sheet.startswith('Numbers_')]
df = pd.concat(pd.read_excel(xls, sheet_name=potential_names), ignore_index=True)
command_list = ["AA", "dd", "DD"]
warning_list = ["Dd", "dD"]
ingenium_list = ["CC", "BB"]
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_source = 'SOURCE'
df_filtered_command = df[df[col_name_type].isin(command_list)]
df_filtered_warnings = df[df[col_name_type].isin(warning_list)]
df_filtered_other = df[df[col_name_other].isin(ingenium_list)]
with pd.ExcelWriter('(Processed) ' + os.path.basename(path)) as writer:
    df_filtered_command.to_excel(writer, sheet_name="Command", index=True)
    df_filtered_warnings.to_excel(writer, sheet_name="Warnings", index=True)
    df_filtered_other.to_excel(writer, sheet_name="Issues", index=True)
I suspect how the concat function is working, but I've not been able to figure out how to fix it.
Any help or direction would be amazing.
Edit: adding an example of my df after running.
I was mistaken before; it seems the first column is empty aside from the sheet names, but I'm still not able to find a way to prevent pandas from making that first column without remaking the index.
Since you passed a dictionary (not a list) of data frames (pandas.read_excel returns a dict when sheet_name is a list), pandas.concat preserves the dict keys, as documented for its first argument:
objs: a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below).
Consequently, the sheet names (i.e., dict keys) and the index values of each data frame combine into a MultiIndex. Consider using the names argument to name the index levels:
names: list, default None
Names for the levels in the resulting hierarchical index.
sheets_df = pd.concat(
    pd.read_excel(xls, sheet_name=potential_names),
    names=["sheet_name", "index"]
)
If you want to convert the dual index into new columns, simply run reset_index afterwards:
sheets_df = (
    pd.concat(
        pd.read_excel(xls, sheet_name=potential_names),
        names=["sheet_name", "index"]
    ).reset_index()
)

How to merge multiple sheets in a single workbook using Python when the first column is named differently across the workbook

I have been using the following code from another StackOverflow answer to concatenate data from multiple excel sheets in the same workbook into one sheet.
This works great when the column names are uniform across all sheets in a workbook. However, I'm running into an issue with one specific workbook where only the first column is named differently (or not named at all, so it is blank) but the rest of the columns are the same.
How do I merge such sheets? Is there a way to rename the first column of each sheet into one name so that I can then use the steps from the answer linked above?
Yes, you can rename all the columns as:
# read excel (sheet_name=None returns a dict of all sheets)
dfs = pd.read_excel('tmp.xlsx', sheet_name=None)
# rename columns
column_names = ['col1', 'col2', ...]
for df in dfs.values():
    df.columns = column_names
# concat
total_df = pd.concat(dfs.values(), ignore_index=True)
Or, you can ignore the header in read_excel so that the columns are labeled as 0,1,2,...:
# read ignoring the header so the columns are labeled 0, 1, 2, ...
dfs = pd.read_excel('tmp.xlsx', sheet_name=None,
                    header=None, skiprows=1)
total_df = pd.concat(dfs.values())
# rename
total_df.columns = column_names

How Do I Sort Same Columns Across Multiple Sheets?

I have a spreadsheet with 12 tabs, one for each month. They have the exact same columns, but are possibly in a different order. Eventually, I want to combine all 12 tabs into one dataset and Export a file. I know how to do everything but make sure the columns match before merging the datasets together.
Here's what I have so far:
Import Excel File and Create Ordered Dictionary of All Sheets
sheets_dict = pd.read_excel("Monthly Campaign Data.xlsx", sheet_name = None, parse_dates = ["Date", "Create Date"])
I want to iterate this
sorted(sheets_dict["January"].columns)
and combine it with this and capitalize each column:
new_df = pd.DataFrame()
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.title().split('\n')[-1])
    new_df = new_df.append(sheet)
new_df.reset_index(inplace=True, drop=True)
print(new_df)
If all the sheets have exactly the same columns, the pd.concat() function can align those columns by name and concatenate all the DataFrames, regardless of the column order in each sheet.
Then you can group the combined DataFrame by year and sort each group.
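A minimal sketch of that alignment behavior, using two hypothetical sheets whose columns appear in different orders:

```python
import pandas as pd

# Hypothetical monthly sheets: same columns, different order
jan = pd.DataFrame({'Date': ['2021-01-01'], 'Clicks': [10], 'Cost': [1.5]})
feb = pd.DataFrame({'Cost': [2.0], 'Date': ['2021-02-01'], 'Clicks': [20]})

# pd.concat matches columns by name, so per-sheet order does not matter
combined = pd.concat([jan, feb], ignore_index=True)
print(combined[['Date', 'Clicks', 'Cost']])
```

The rows from both frames land under the correct headers, so no manual column sorting is needed before combining the tabs.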

Trouble merging .xlsx files with pandas

I am operating with Python 2.7 and I wrote a script that takes the names of two .xlsx files, uses pandas to convert them into two dataframes, and then concatenates them.
The two files under consideration have the same rows and different columns.
Basically, I have these two Excel files:
I would like to keep the same rows and just unite the columns.
The code is the following:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = pd.concat([sheet10, sheet20], sort = False)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, 'Sheet 1')
output.save()
Instead of doing what I expected (given the examples I read online), the output becomes something like this:
Does anyone know how I could improve my script?
Thank you very much.
The best answer here really depends on the exact shape of your data. Based on the example you provided, it looks like the data is indexed identically between the two dataframes, with differing column headers that you want preserved. If this is the case, this would be the best solution:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = sheet10.merge(sheet20, how="left", left_index=True, right_index=True)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, sheet_name='Sheet 1')
output.save()
Since there is a direct match between the number of rows in the two initial dataframes it doesn't really matter if a left, right, outer, or inner join is used. In this example I used a left join.
If the rows in the two data frames do not perfectly line up though, the join method selected can have a huge impact on your output. I recommend looking at pandas documentation on merge/join/concatenate before you go any further.
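To illustrate how much the join method matters when the rows do not line up, here is a small sketch with hypothetical frames whose indexes only partially overlap:

```python
import pandas as pd

# Hypothetical frames with partially overlapping indexes
left = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'B': [3, 4]}, index=[1, 2])

# Left join keeps only the left frame's index; outer keeps the union
left_join = left.merge(right, how='left', left_index=True, right_index=True)
outer_join = left.merge(right, how='outer', left_index=True, right_index=True)
print(len(left_join), len(outer_join))  # 2 3
```

With identical indexes all four methods give the same row count, which is why the choice was harmless in the answer above; with mismatched indexes the row count (and the NaN padding) differs per method.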
To get the expected output using pd.concat, the column names in both dataframes should be the same. Here's how to do it:
# Create a 1:1 mapping of sheet10 and sheet20 columns
cols_mapping = dict(zip(sheet20.columns, sheet10.columns))
# Rename the columns in sheet20 to match with that of sheet10
sheet20_renamed = sheet20.rename(cols_mapping, axis=1)
concatenated = pd.concat([sheet10, sheet20_renamed])
