pandas Combine Excel Spreadsheets - python

I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can I pass something for the parse argument that means "all sheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a = xl.sheet_names
b = pd.DataFrame()
for i in a:
    b.append(xl.parse(i))
b
But it's not "working".

This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd

# Set sheet_name to None in order to load all sheets into a dict of dataframes
dfs = pd.read_excel('tmp.xlsx', sheet_name=None)

# Then concatenate all dataframes; ignore_index avoids overlapping
# index values later (see comment by #bunji)
cdf = pd.concat(dfs.values(), ignore_index=True)
print(cdf)
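The loop in the question's update appears to do nothing because DataFrame.append returned a new frame rather than modifying b in place (and the method was removed entirely in pandas 2.0). A minimal sketch of the fix, using in-memory frames to stand in for the parsed sheets:

```python
import pandas as pd

# Stand-ins for the frames xl.parse(name) would return for each sheet name.
sheets = {
    'Sheet1': pd.DataFrame({'x': [1, 2]}),
    'Sheet2': pd.DataFrame({'x': [3, 4]}),
}

# Collect the frames and concatenate once; ignore_index renumbers the rows
# so the per-sheet indices do not overlap.
b = pd.concat(sheets.values(), ignore_index=True)
print(list(b['x']))  # [1, 2, 3, 4]
```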

import pandas as pd

f = 'file.xlsx'
dfs = pd.read_excel(f, sheet_name=None)   # dict of dataframes, one per sheet
df2 = pd.concat(dfs, ignore_index=True, sort=True)
df2.to_excel('merged.xlsx',
             engine='xlsxwriter',
             sheet_name='Merged',
             header=True,
             index=False)

Related

Extra column appears when appending selected row from one csv to another in Python

I have this code which appends a column of a csv file as a row to another csv file:
def append_pandas(s, d):
    import pandas as pd
    df = pd.read_csv(s, sep=';', header=None)
    df_t = df.T
    df_t.iloc[0:1, 0:1] = 'Time Point'
    df_t.at[1, 0] = 1
    df_t.columns = df_t.iloc[0]
    df_new = df_t.drop(0)
    pdb = pd.read_csv(d, sep=';')
    newpd = pdb.append(df_new)
    newpd.to_csv(d, sep=';')
The result is supposed to look like this:
Instead, every time the row is appended, there is an extra "Unnamed" column appearing on the left:
Do you know how to fix that?..
Please, help :(
My csv documents from which I select a column look like this:
You have to add index=False to your to_csv() method
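The extra column comes from the row index that to_csv writes by default; on the next read it reappears as "Unnamed: 0". A small self-contained sketch, using an in-memory buffer instead of the question's files:

```python
import io
import pandas as pd

df = pd.DataFrame({'Time Point': [1], 'A': [0.5]})

# index=False stops to_csv from writing the row index as a leading column,
# which is what would come back as "Unnamed: 0" on the next read.
buf = io.StringIO()
df.to_csv(buf, sep=';', index=False)
buf.seek(0)

print(pd.read_csv(buf, sep=';').columns.tolist())  # ['Time Point', 'A']
```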

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to excel, but when I run the print function, it will print all of them. Is there an issue with my code causing not to export all data to excel?
I've also tried exporting as .csv file with no luck.
import pandas as pd

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)

for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index=False)
        #print(df)
Your problem is that each time df.to_excel is called, you are overwriting the file, so only the last df will be left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)

counter = 0
for df in dfs:
    if len(df.columns) > 4:
        counter += 1
        df.to_excel(writer, sheet_name=f"sheet_{counter}", index=False)

writer.save()
You might need pip install xlsxwriter xlwt to make it work.
Exporting to a csv will never work, since a csv is a single data table (like a single sheet in excel), so in that case you would need to use a new csv for each df.
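A sketch of the one-file-per-table approach, with small frames standing in for the tables pd.read_html would return (the filename pattern is just an illustration):

```python
import pandas as pd

# Stand-ins for the tables returned by pd.read_html(url).
dfs = [pd.DataFrame({'a': [1], 'b': [2]}),
       pd.DataFrame({'a': [3], 'b': [4]})]

# A CSV holds exactly one table, so each frame gets its own file.
for i, df in enumerate(dfs):
    df.to_csv(f'table_{i}.csv', index=False)
```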
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]

columns = ["gameid", "game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
# plain object dtype (np.object was removed in newer NumPy releases)
values = np.empty((2*N, len(columns)), dtype=object)

for i, df in enumerate(dfs):
    time = df.iloc[0, 0].replace(" Game Time", "")
    values[2*i:2*i+2, 2:] = df.iloc[2:, :]
    values[2*i:2*i+2, :2] = np.array([[i, time], [i, time]])

newdf = pd.DataFrame(values, columns=columns)
newdf.to_excel("output.xlsx", index=False)
I used a numpy.array of object type to be able to copy a submatrix from the original dataframes easily into their intended place. I also needed to create a gameid, that connects the games across rows. It should be now trivial to rewrite this so you loop through a list of urls and write these to separate sheets.

Error when trying to read multiple sheets from excel file

In the following code I'm trying to read multiple sheets from an Excel file, remove the empty cells, group the columns and store the result in another Excel file:
import pandas as pd

sheets = ['R9_14062021', 'LOGS R9', 'LOGS R7 01032021']
df = pd.read_excel('LOGS.xlsx', sheet_name=sheets)
df.dropna(inplace=True)
df['Dt'] = pd.to_datetime(df['Dt']).dt.date
df1 = df.groupby(['Dt', 'webApp', 'mw'])[['chgtCh', 'accessRecordModule',
    'playerPlay startOver', 'playerPlay PdL', 'playerPlay PVR',
    'contentHasAds', 'pdlComplete', 'lirePdl', 'lireVod']].sum()
df1.reset_index(inplace=True)
df1.to_excel(r'logs1.xlsx', index=False)
When I execute my script I get the following error:
AttributeError: 'dict' object has no attribute 'dropna'
How can I fix it?
When you provide a list of sheets for the sheet_name param, the return object is a dict of DataFrames, as described here.
dropna is a method of DataFrame, so you have to select a sheet first, for example
df['R9_14062021'].dropna(inplace=True)
Taken from pandas documentation for pd.read_excel:
If you give sheet_name a list, you will receive a dict of DataFrames keyed by sheet name.
Meaning you'll have to go over each DataFrame and dropna() separately, because you can't dropna() on a dictionary. Your code will look like this:
import pandas as pd

sheets = ['R9_14062021', 'LOGS R9', 'LOGS R7 01032021']
dfs_list = pd.read_excel('LOGS.xlsx', sheet_name=sheets)

for i in dfs_list:
    df = dfs_list[i]
    df.dropna(inplace=True)
    df['Dt'] = pd.to_datetime(df['Dt']).dt.date
    df1 = df.groupby(['Dt', 'webApp', 'mw'])[['chgtCh', 'accessRecordModule',
        'playerPlay startOver', 'playerPlay PdL', 'playerPlay PVR',
        'contentHasAds', 'pdlComplete', 'lirePdl', 'lireVod']].sum()
    df1.reset_index(inplace=True)
    df1.to_excel(r'logs1.xlsx', index=False)
The main difference here is the loop
for i in dfs_list:
    df = dfs_list[i]
which applies your changes to each DataFrame in the dict. If you want a specific DataFrame, select it by its sheet name, e.g. dfs_list['R9_14062021'].dropna().
Hope this helps and this is what you were aiming for.

Can I export a dataframe to excel as the very first sheet?

Running dataframe.to_excel() automatically saves the dataframe as the last sheet in the Excel file.
Is there a way to save a dataframe as the very first sheet, so that, when you open the spreadsheet, Excel shows it as the first on the left?
The only workaround I have found is to first export an empty dataframe to the tab with the name I want as first, then export the others, then export the real dataframe I want to the tab with the name I want. Example in the code below. Is there a more elegant way? More generically, is there a way to specifically choose the position of the sheet you are exporting to (first, third, etc)?
Of course this arises because the dataframe I want as first is the result of some calculations based on all the others, so I cannot export it.
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
pd.DataFrame().to_excel(writer,'this should be the 1st')
other_df.to_excel(writer,'other df')
first_df.to_excel(writer,'this should be the 1st')
writer.save()
writer.close()
It is possible to re-arrange the sheets after they have been created:
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
other_df.to_excel(writer,'Sheet2')
first_df.to_excel(writer,'Sheet1')
writer.save()
This will give you the sheets in the order they were written (Sheet2 before Sheet1).
Add this before you save the workbook to sort the sheets by name (this relies on the xlsxwriter engine):
workbook = writer.book
workbook.worksheets_objs.sort(key=lambda x: x.name)
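If the workbook is written with the openpyxl engine instead, the sheet can be moved to the front after the fact with Workbook.move_sheet. A sketch; the file and sheet names here are illustrative:

```python
import pandas as pd
from openpyxl import load_workbook

# Write the sheets in whatever order is convenient...
with pd.ExcelWriter('My excel test.xlsx', engine='openpyxl') as writer:
    pd.DataFrame({'z': [1]}).to_excel(writer, sheet_name='other df')
    pd.DataFrame({'x': [2]}).to_excel(writer, sheet_name='first')

# ...then shift the target sheet left by its current position, landing it at index 0.
wb = load_workbook('My excel test.xlsx')
wb.move_sheet('first', offset=-wb.index(wb['first']))
wb.save('My excel test.xlsx')
print(wb.sheetnames)  # ['first', 'other df']
```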

Removing duplicates for a row with duplicates in one column dynamic data

I am attempting to remove duplicates in Column D for dynamic data with no headers or identifying features, deleting every row where Column D has a duplicated value. I am converting the Excel file to a dataframe, removing duplicates and then writing it back to Excel. However, I keep getting an assortment of errors, or no duplicates are removed. I come from a VBA background, but we are migrating to Python.
Attempted:
df.drop_duplicates(["C"])
df = pd.DataFrame({"C"})
df.groupby(["C"]).filter(lambda df:df.shape[0] == 1)
As well as an assortment of other variations. I was able to do this in VBA with one line. Any idea why this keeps failing?
import pandas as pd
df = pd.DataFrame({"C"})
df.drop_duplicates(subset=['C'], keep=False)
DG = df.groupby(['C'])
print(pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value) == 1]))
Code template:
import pandas as pd

df = pd.read_excel("C:/wadwa.xlsx", sheet_name=0)
columns_to_drop = ['d.1']
#columns_to_drop = ['d.1', 'b.1', 'e.1', 'f.1', 'g.1']
df = df[[col for col in df.columns if col not in columns_to_drop]]
print(df)
writer = pd.ExcelWriter('C:/dadwa/dwad.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
print(df)
Code:
import pandas as pd

df = pd.read_excel("C:/Users/Documents/Book1.xlsx", sheet_name=0)
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
writer = pd.ExcelWriter('C:/Users//Documents/Book2.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
print(df)
I think you need to assign the result back and select the 4th column by position:
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
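A minimal sketch: with no header row, read_excel(header=None) yields integer column labels, so df.columns[3] addresses Column D by position, and keep=False drops every row whose value there is duplicated. Note the result must be assigned back; drop_duplicates does not modify the frame in place by default:

```python
import pandas as pd

# Headerless data: integer column labels, as pd.read_excel(..., header=None)
# would produce; column 3 is "Column D".
df = pd.DataFrame({0: ['a', 'b', 'c'],
                   1: [1, 2, 3],
                   2: [4, 5, 6],
                   3: ['x', 'x', 'y']})

# keep=False removes every row whose column-3 value occurs more than once.
df = df.drop_duplicates(subset=[df.columns[3]], keep=False)
print(df[3].tolist())  # ['y']
```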
