I have a spreadsheet with 12 tabs, one for each month. They have exactly the same columns, but possibly in a different order. Eventually, I want to combine all 12 tabs into one dataset and export a file. I know how to do everything except make sure the columns match before merging the datasets together.
Here's what I have so far:
Import Excel File and Create Ordered Dictionary of All Sheets
sheets_dict = pd.read_excel("Monthly Campaign Data.xlsx", sheet_name = None, parse_dates = ["Date", "Create Date"])
I want to iterate over this:
sorted(sheets_dict["January"].columns)
and combine it with this loop, capitalizing each column:
new_df = pd.DataFrame()
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.title().split('\n')[-1])
    new_df = new_df.append(sheet)
new_df.reset_index(inplace=True, drop=True)
print(new_df)
If all the sheets have the same columns, even in a different order, the pd.concat() function can align those columns by name and concatenate all of the DataFrames.
Then you can group the combined DataFrame by year and sort within each group if you need to.
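A minimal sketch of that idea, reusing the file name, date columns, and column normalization from your snippet (the rename logic is just carried over from your loop):
import pandas as pd

# read every tab into an ordered dict of DataFrames (tab name -> DataFrame)
sheets_dict = pd.read_excel("Monthly Campaign Data.xlsx", sheet_name=None,
                            parse_dates=["Date", "Create Date"])

frames = []
for name, sheet in sheets_dict.items():
    # normalize the column names so every tab matches
    sheet = sheet.rename(columns=lambda x: x.title().split('\n')[-1])
    sheet['sheet'] = name  # remember which tab each row came from
    frames.append(sheet)

# concat aligns columns by name, so differing column order per tab is fine
new_df = pd.concat(frames, ignore_index=True, sort=False)

# e.g. sort chronologically; new_df.groupby(new_df["Date"].dt.year) would split it per year
new_df = new_df.sort_values("Date").reset_index(drop=True)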
I am trying to combine dataframes with 2 columns into a single dataframe. The initial dataframes are generated through a for loop and stored in a list. I am having trouble getting the data from the list of dataframes into a single dataframe. Right now when I run my code, it treats each full dataframe as a row.
def linear_reg_function(category):
    df = pd.read_csv(file)
    df = df[df['category_column'] == category]
    df1 = df[['category_column', 'value_column']]
    df_export.append(df1)

df_export = []
for category in category_list:
    linear_reg_function(category)
when I run this block of code I get a list of dataframes that have 2 columns. When I try to convert df_export to a dataframe, it ends up with 12 rows (the number of categories in category_list). I tried:
df_export = pd.DataFrame()
but the result still was not a single combined dataframe.
I would like to have a single dataframe with 2 columns, [Category, Value] that includes the values of all 12 categories generated in the for loop.
You can use pd.concat to merge a list of DataFrames into a single big DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store each DataFrame in a list
    appended_data.append(data)

# see the pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write the combined DataFrame to an excel sheet
appended_data.to_excel('appended.xlsx')
You can then adapt this to your particular needs.
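Applied to the pattern in your question, a minimal sketch (assuming file and category_list are defined as in your code, and returning the filtered frame instead of appending to a global list):
import pandas as pd

def linear_reg_function(category):
    df = pd.read_csv(file)
    df = df[df['category_column'] == category]
    return df[['category_column', 'value_column']]

# build the list of 2-column frames, then concatenate them row-wise
df_export = pd.concat([linear_reg_function(category) for category in category_list],
                      ignore_index=True)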
I have been using the following code from another StackOverflow answer to concatenate data from multiple excel sheets in the same workbook into one sheet.
This works great when the column names are uniform across all sheets in a workbook. However, I'm running into an issue with one specific workbook where only the first column is named differently (or not named at all, so it is blank) but the rest of the columns are the same.
How do I merge such sheets? Is there a way to rename the first column of each sheet into one name so that I can then use the steps from the answer linked above?
Yes, you can rename all the columns as:
# read excel
dfs = pd.read_excel('tmp.xlsx', sheet_name=None)
# rename columns
column_names = ['col1', 'col2', ...]
for df in dfs.values():
    df.columns = column_names
# concat
total_df = pd.concat(dfs.values())
Or, you can ignore the header in read_excel so that the columns are labeled as 0,1,2,...:
# read, ignoring the header
dfs = pd.read_excel('tmp.xlsx', sheet_name=None,
                    header=None, skiprows=1)
total_df = pd.concat(dfs.values())
# rename
total_df.columns = column_names
I am doing some analysis on several different categories. I want all the analysis to be on the same tab in a spreadsheet. So I have two dataframes for the information, but the columns and the information are different.
dataframe 1
colA colB calC
row 1
row 2
row 3
dataframe 2
colD colE calD
row 1
row 2
row 3
I want to export both of these dataframes on one excel sheet one after the other. The analysis are different lengths and I want dataframe 2 to be right below dataframe1 on a sheet.
import pandas
from openpyxl import load_workbook

book = load_workbook('test.xlsx')
writer = pandas.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = book
df1.to_excel(writer, sheet_name=sheetname,
             startrow=writer.sheets["Sheet1"].max_row,
             index=False, header=False)
writer.save()
# then do the same steps for any number of additional dataframes
You can add an extra row to the second DataFrame with values equal to its column names, and then simply use pd.concat().
df2.columns = df1.columns
pd.concat([df1, df2])
First make the columns of both dataframes the same, then use pd.concat to append df2 to the end of df1.
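If you also want df2's original column names preserved as a data row (the extra-row suggestion above), a minimal sketch, assuming df1 and df2 have the same number of columns:
import pandas as pd

# turn df2's header into a one-row frame labelled with df1's column names
header_row = pd.DataFrame([list(df2.columns)], columns=df1.columns)
# align df2's columns with df1's, then stack: df1, the old header, df2
df2.columns = df1.columns
stacked = pd.concat([df1, header_row, df2], ignore_index=True)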
You can create a new dataframe from this and export it to csv:
df = pd.concat([df1,df2])
df.to_csv('filename.csv')
If you want the header of the second dataframe to also appear in your final csv file, create df2 as: df2 = pd.read_csv('df2.csv', names=df1.columns)
import numpy as np

# this will convert the column names into the first data row
df1 = pd.DataFrame(np.vstack([df1.columns, df1]))
# same with the other dataframe
df2 = pd.DataFrame(np.vstack([df2.columns, df2]))
# concat these dataframes and save as excel without index or column labels
pd.concat((df1, df2)).to_excel('filename.xlsx', header=False, index=False)
I have 3 tables, one on each Excel sheet: sheet1 - Gross, sheet2 - Margin, sheet3 - Revenue
So I was able to iterate through each sheet and unpivot it.
But how can I join them together?
sheet_names = ['Gross', 'Margin', 'Revenue']
full_table = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel('BudgetData.xlsx', sheet_name=sheet)
    unpvt = pd.melt(df, id_vars=['Company'], var_name='Month', value_name=sheet)
    # how can I join the unpivoted dataframes here?
    print(unpvt)
Desired result:
UPDATE:
Thanks @Celius Stingher.
I think this is what I need. It just gives me weird sorting:
and gives me this warning:
Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
from ipykernel import kernelapp as app
It seems you are doing the unpivoting but not saving each unpivoted dataframe anywhere. Let's create a list that stores each unpivoted dataframe; later we pass that list of dataframes to the pd.concat function to perform the concatenation.
sheet_names = ['Gross', 'Margin', 'Revenue']
list_of_df = []
for sheet in sheet_names:
    df = pd.read_excel('BudgetData.xlsx', sheet_name=sheet)
    df = pd.melt(df, id_vars=['Company'], var_name='Month', value_name=sheet)
    list_of_df.append(df)
full_df = pd.concat(list_of_df, ignore_index=True)
full_df = full_df.sort_values(['Company', 'Month'])
print(full_df)
Edit:
Now that I understand what you need, let's try a different approach. After the loop, try the following code instead of the pd.concat:
full_df = list_of_df[0].merge(list_of_df[1],on=['Company','Month']).merge(list_of_df[2],on=['Company','Month'])
pd.concat will just pile everything together; based on the 'desired' image in your post, you actually want to merge the DataFrames using pd.merge, which works similarly to a SQL JOIN statement.
https://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html
You just want to use a list of columns to merge on. If you get them all into tidy dataframes with the same names as your sheets, you would do something like:
gross.merge(margin, on=['Company', 'Month']).merge(revenue, on=['Company', 'Month'])
I am working with Python 2.7 and I wrote a script that should take the names of two .xlsx files, use pandas to convert them into two dataframes, and then concatenate them.
The two files under consideration have the same rows and different columns.
Basically, I have these two Excel files:
I would like to keep the same rows and just unite the columns.
The code is the following:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = pd.concat([sheet10, sheet20], sort = False)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, 'Sheet 1')
output.save()
Instead of doing what I expected (given the examples I read online), the output becomes something like this:
Does anyone know how I could improve my script?
Thank you very much.
The best answer here really depends on the exact shape of your data. Based on the example you have provided, it looks like the data is indexed identically between the two dataframes, with differing column headers that you want preserved. If this is the case, this would be the best solution:
import pandas as pd
file1 = 'file1.xlsx'
file2 = 'file2.xlsx'
sheet10 = pd.read_excel(file1, sheet_name = 0)
sheet20 = pd.read_excel(file2, sheet_name = 0)
conc1 = sheet10.merge(sheet20, how="left", left_index=True, right_index=True)
output = pd.ExcelWriter('output.xlsx')
conc1.to_excel(output, sheet_name='Sheet 1')
output.save()
Since there is a direct match between the number of rows in the two initial dataframes it doesn't really matter if a left, right, outer, or inner join is used. In this example I used a left join.
If the rows in the two data frames do not perfectly line up though, the join method selected can have a huge impact on your output. I recommend looking at pandas documentation on merge/join/concatenate before you go any further.
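A toy illustration (hypothetical data) of how the join type changes the result when the indexes do not line up:
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'y': [10, 20]}, index=[1, 2])

left = a.merge(b, how='left', left_index=True, right_index=True)    # keeps all rows of a
inner = a.merge(b, how='inner', left_index=True, right_index=True)  # keeps only rows present in both
outer = a.merge(b, how='outer', left_index=True, right_index=True)  # keeps all rows from either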
To get the expected output using pd.concat, the column names in both dataframes should be the same. Here's how to do that:
# Create a 1:1 mapping of sheet10 and sheet20 columns
cols_mapping = dict(zip(sheet20.columns, sheet10.columns))
# Rename the columns in sheet20 to match with that of sheet10
sheet20_renamed = sheet20.rename(cols_mapping, axis=1)
concatenated = pd.concat([sheet10, sheet20_renamed])
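To mirror the output step from the question, you could then write the result back out the same way (a sketch reusing the ExcelWriter pattern already shown in the question):
output = pd.ExcelWriter('output.xlsx')
concatenated.to_excel(output, 'Sheet 1', index=False)
output.save()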