Error when trying to read multiple sheets from excel file - python

In the following code I'm trying to write multiple sheets from an Excel file,remove the empty cells, group the columns and store the result in another excel file:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
df = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
df.dropna(inplace = True)
df['Dt'] = pd.to_datetime(df['Dt']).dt.date
df1 = df.groupby(['Dt','webApp','mw'])['chgtCh','accessRecordModule','playerPlay
startOver','playerPlay PdL','playerPlay
PVR','contentHasAds','pdlComplete','lirePdl','lireVod'].sum()
df1.reset_index(inplace=True)
df1.to_excel(r'logs1.xlsx', index = False)
When I execute my script iget the following error:
AttributeError: 'dict' object has no attribute 'dropna'
how can I fix it?

When you provide a list of sheets for sheet_name param, your return object is dict of DataFrame as described here
dropna is method of DataFrame so you have to select the sheet first. for example
df['R9_14062021'].dropna(inplace=True)

Taken from pandas documentation for pd.read_excel:
If you give sheet_name a list, you will receive a list of dataframes.
Meaning you'll have to go over each dataframe and dropna() separately because you can't dropna() on a dictionary, your code will look like this:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
dfs_list = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
for i in dfs_list:
df = dfs_list[i]
df.dropna(inplace = True)
df['Dt'] = pd.to_datetime(df['Dt']).dt.date
df1 = df.groupby(['Dt','webApp','mw'])['chgtCh','accessRecordModule','playerPlay
startOver','playerPlay PdL','playerPlay
PVR','contentHasAds','pdlComplete','lirePdl','lireVod'].sum()
df1.reset_index(inplace=True)
df1.to_excel(r'logs1.xlsx', index = False)
The main difference here is the usage of
for i in dfs_list:
df = dfs_list[i]
in order to apply each change you are doing to each dataframe, if you want a specific dataframe you should do: df[0].dropna() for example.
Hope this helps and this is what you were aiming for.

Related

Prevent pandas from changing int to float/date?

I'm trying to merge a series of xlsx files into one, which works fine.
However, when I read a file, columns containing ints are transformed into floats (or dates?) when I merge and output them to csv. I have tried to visualize this in the picture. I have seen some solutions to this where dtype is used to "force" specific columns into int format. However, I do not always know the index nor the title of the column, so i need a more scalable solution.
Anyone with some thoughts on this?
Thank you in advance
#specify folder with xlsx-files
xlsFolder = "{}/system".format(directory)
dfMaster = pd.DataFrame()
#make a list of all xlsx-files in folder
xlsFolderContent = os.listdir(xlsFolder)
xlsFolderList = []
for file in xlsFolderContent:
if file[-5:] == ".xlsx":
xlsFolderList.append(file)
for xlsx in xlsFolderList:
print(xlsx)
xl = pd.ExcelFile("{}/{}".format(xlsFolder, xlsx))
for sheet in xl.sheet_names:
if "_Errors" in sheet:
print(sheet)
dfSheet = xl.parse(sheet)
dfSheet.fillna(0, inplace=True)
dfMaster = dfMaster.append(dfSheet)
print("len of dfMaster:", len(dfMaster))
dfMaster.to_csv("{}/dfMaster.csv".format(xlsFolder),sep=";")
Data sample:
Try to use dtype='object' as parameter of pd.read_csv or (ExcelFile.parse) to prevent Pandas to infer the data type of each column. You can also simplify your code using pathlib:
import pandas as pd
import pathlib
directory = pathlib.Path('your_path_directory')
xlsFolder = directory / 'system'
data = []
for xlsFile in xlsFolder.glob('*.xlsx'):
sheets = pd.read_excel(xlsFile, sheet_name=None, dtype='object')
for sheetname, df in sheets.items():
if '_Errors' in sheetname:
data.append(df.fillna('0'))
pd.concat(data).to_csv(xlsxFolder / dfMaster.csv, sep=';')

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to excel, but when I run the print function, it will print all of them. Is there an issue with my code causing not to export all data to excel?
I've also tried exporting as .csv file with no luck.
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
if len(df.columns) > 1:
df.to_excel(r'VegasInsiderCFB.xlsx', index = False)
#print(df)
Your problem is that each time df.to_excel is called, you are overwriting the file, so only the last df will be left. What you need to do is use a writer and specify a sheet name for each separate df e.g:
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)
counter = 0
for df in dfs:
if len(df.columns) > 4:
counter += 1
df.to_excel(writer, sheet_name = f"sheet_{counter}", index = False)
writer.save()
You might need pip install xlsxwriter xlwt to make it work.
Exporting to a csv will never work, since a csv is a single data table (like a single sheet in excel), so in that case you would need to use a new csv for each df.
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid","game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
values = np.empty((2*N,len(columns)),dtype=np.object)
for i,df in enumerate(dfs):
time = df.iloc[0,0].replace(" Game Time","")
values[2*i:2*i+2,2:] = df.iloc[2:,:]
values[2*i:2*i+2,:2] = np.array([[i,time],[i,time]])
newdf = pd.DataFrame(values,columns = columns)
newdf.to_excel("output.xlsx",index = False)
I used a numpy.array of object type to be able to copy a submatrix from the original dataframes easily into their intended place. I also needed to create a gameid, that connects the games across rows. It should be now trivial to rewrite this so you loop through a list of urls and write these to separate sheets.

Adding data to pandas dataframe with no header

I am trying to add a row(list of int) to pandas dataframe but i am somehow unable to append it. The thing is the dataframe has no header and everywhere the put data through specifying column names. I dont understand how to do it without header. Below is my dataframe named sheet
sheet = pd.read_csv("labels.csv",header = None) #sheet = [[1,1,1,2,2 ...]]
i want to append resulting list named simple_list = [1,2,3,4,2,1,1,1] to it.
Does the following work?
sheet = sheet.append(pd.DataFrame([[1,2,3,4,2,1,1,1]]), ignore_index=True)
You need to create a series out of simple_list.
simple_series = pd.Series(simple_list)
Then you can append a row to your dataframe with argument ignore_index=True:
sheet = sheet.append(simple_series, ignore_index=True)
This work for me. sheet.appends returns a dataframe, you need to assign it to the same var or a new one to us it. If not "sheet" will have always the original CSV content
import pandas as pd
if __name__ == '__main__':
sheet = pd.read_csv("labels.csv", header=None)
sheet = sheet.append([[41, 42, 42, 44, 45]], ignore_index=True)

Convert excel file with many sheets (with spaces in the name of the shett) in pandas data frame

I would like to convert an excel file to a pandas dataframe. All the sheets name have spaces in the name, for instances, ' part 1 of 22, part 2 of 22, and so on. In addition the first column is the same for all the sheets.
I would like to convert this excel file to a unique dataframe. However I dont know what happen with the name in python. I mean I was hable to import them, but i do not know the name of the data frame.
The sheets are imported but i do not know the name of them. After this i would like to use another 'for' and use a pd.merge() in order to create a unique dataframe
for sheet_name in Matrix.sheet_names:
sheet_name = pd.read_excel(Matrix, sheet_name)
print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
sheet_name = pd.read_excel(Matrix, sheet_name)
all_my_sheets.append(sheet_name)
Or, even better, using list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename=file_path, read_only=True)
all_my_sheets = wb.sheetnames
# Assuming your sheets have the same headers and footers
n = 1
for ws in all_my_sheets:
records = []
for row in ws._cells_by_row(min_col=1,
min_row=n,
max_col=ws.max_column,
max_row=n):
rec = [cell.value for cell in row]
records.append(rec)
# Make sure you don't duplicate the header
n = 2
# ------------------------------
# Set the column names
records = records[header_row-1:]
header = records.pop(0)
# Create your df
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once, and save the contents into a list.
So, the first step would look like this:
dfs = pd.read_excel(["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the excel file. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs, axis=1)
Note that this solution would result in a final_df that includes column headers from all three sheets. So, ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!

pandas Combine Excel Spreadsheets

I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can use something for the parse argument that will mean "all spreadsheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a=xl.sheet_names
b = pd.DataFrame()
for i in a:
b.append(xl.parse(i))
b
But it's not "working".
This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd
Set sheetname to None in order to load all sheets into a dict of dataframes
and ignore index to avoid overlapping values later (see comment by #bunji)
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)
Then concatenate all dataframes
cdf = pd.concat(df.values())
print(cdf)
import pandas as pd
f = 'file.xlsx'
df = pd.read_excel(f, sheet_name=None, ignore_index=True)
df2 = pd.concat(df, sort=True)
df2.to_excel('merged.xlsx',
engine='xlsxwriter',
sheet_name=Merged,
header = True,
index=False)

Categories