My program reads in an Excel file that contains multiple sheets and concatenates them together. The issue is that the last row of each sheet is a Totals row, and I don't want that row. Is there an argument that will drop the last row when I read the sheets in? And will I need to first read the sheets in and remove this last row before I run the concat function, to avoid deleting the wrong rows? I've tried using skipfooter=0 and skipfooter=1, but this threw an error message.
I assume you are using pandas to read an .xlsx file that has multiple sheets with different lengths of data, and you want to drop the last row from each sheet. You can slice each parsed sheet with [:-1], like this:
xls = pd.ExcelFile('report.xlsx', engine='openpyxl')
data = [xls.parse(name)[:-1] for name in xls.sheet_names]  # [:-1] drops the last (Totals) row
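To finish the original task, the parsed sheets can then be stacked with pd.concat; a minimal sketch building on the snippet above (pd.read_excel also accepts skipfooter=1 as an alternative to the slice):
import pandas as pd

xls = pd.ExcelFile('report.xlsx', engine='openpyxl')
# Parse each sheet, drop its trailing Totals row, then stack the sheets
data = [xls.parse(name)[:-1] for name in xls.sheet_names]
combined = pd.concat(data, ignore_index=True)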
I'm puzzled by an error from pandas/Excel data saving. My code reads data from an Excel file with multiple sheets into dataframes in a for loop. Some data subsetting is carried out on the dataframe during each iteration, and the results (2 subset dataframes) are appended to the bottom of the original data in specific sheets (at the same row position but different columns).
My code:
import os
import pandas as pd
import re
from excel import append_df_to_excel

path = 'orig_vers.xlsx'
xls = pd.ExcelFile(path, engine='openpyxl')

# Loop over every sheet in the workbook
for sheet in xls.sheet_names:
    try:
        df = pd.read_excel(xls, sheet_name=sheet)
        # Drop empty columns, then drop rows with missing values
        df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
        # Subset data
        ver1 = ...  # some calculation on df1
        ver2 = ...  # some calculation on df1
        lastrow = len(df1)  # number of rows left after the dropna calls
        append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
        append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)
    except Exception:
        continue
"append_df_to_excel" is a helper function from here link
The code works well for the first sheet, appending the two result dataframes at the bottom of the original data at the specified positions, but no data is appended to the other sheets. If I remove the try/except lines and run the code, I get this error: "Error -3 while decompressing data: invalid distance too far back".
My suspicion is that because, from sheet number 2 onward, the original data has some empty rows which my code removes before subsetting, the Excel writer has issues with this line: lastrow = len(df1). Does anyone know the answer to this issue?
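A hedged guess rather than a confirmed diagnosis: that zlib error usually means the .xlsx (a zip archive) changed underneath a still-open handle, and each append_df_to_excel call rewrites orig_vers.xlsx while xls is still reading it. A minimal sketch that loads every sheet into memory and closes the handle before any writing (keeping the question's placeholder calculations):
import pandas as pd
from excel import append_df_to_excel  # same helper as above

path = 'orig_vers.xlsx'

# Read all sheets up front, then release the file handle,
# so the later writes never touch a workbook that is still open for reading.
with pd.ExcelFile(path, engine='openpyxl') as xls:
    frames = {sheet: pd.read_excel(xls, sheet_name=sheet) for sheet in xls.sheet_names}

for sheet, df in frames.items():
    df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
    ver1 = ...  # some calculation on df1
    ver2 = ...  # some calculation on df1
    lastrow = len(df1)
    append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
    append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)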
I have an Excel file which contains two sheets, i.e. an old sheet and a new sheet. The task is to generate the difference between the two sheets and print the difference in a third sheet.
Old Sheet
New Sheet
Note: each row contains one file's data, and the data is of dictionary type.
I am comparing the sheets with this code:
import numpy as np
import pandas as pd

writer2 = pd.ExcelWriter(fvc, engine='openpyxl', mode='a')  # taking the value of fvc from the GUI
olddf = pd.read_excel(fvc, sheet_name=0, na_filter=False)
newdf = pd.read_excel(fvc, sheet_name=1, na_filter=False)

comparison_values = olddf.values == newdf.values  # elementwise equality mask
rows, cols = np.where(comparison_values == False)
# Rewrite each changed cell as "old --> new"
for row, col in zip(rows, cols):
    olddf.iloc[row, col] = '{} --> {}'.format(olddf.iloc[row, col], newdf.iloc[row, col])

olddf.to_excel(writer2, sheet_name='Delta Sheet', index=False, header=True)
writer2.save()
Now this works the way I want when the two dataframes have the same number of rows, but if one dataframe has extra rows, it does not print them and shows ValueError: not enough values to unpack (expected 2, got 1).
So is it possible that if one dataframe has some extra rows, those rows can be appended to the other dataframe, making both dataframes equal in size so that the comparison is possible?
For example, the old sheet has 15 rows of data and the new sheet has 17 rows, which means some new files have been added to the new sheet, and that's why they can't be compared. Can I print those rows in the final output?
I've tried merging, isin, difference, etc., different solutions from Stack Overflow, but nothing seems to work for me.
If you need more clarification about the question, please do ask.
Thanks for your Help :)
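Not a definitive fix, but a minimal sketch of the padding idea from the question: reindex both frames to the longer one's row range before comparing, so olddf.values == newdf.values is always a 2-D mask and np.where unpacks cleanly. The padded cells then surface as " --> value" entries in the delta sheet. This assumes both sheets share the same columns; fvc is the workbook path as in the question:
import numpy as np
import pandas as pd

olddf = pd.read_excel(fvc, sheet_name=0, na_filter=False)
newdf = pd.read_excel(fvc, sheet_name=1, na_filter=False)

# Pad the shorter frame with empty rows so both have the same shape
n = max(len(olddf), len(newdf))
olddf = olddf.reindex(range(n), fill_value='')
newdf = newdf.reindex(range(n), fill_value='')

comparison_values = olddf.values == newdf.values
rows, cols = np.where(~comparison_values)
for row, col in zip(rows, cols):
    olddf.iloc[row, col] = '{} --> {}'.format(olddf.iloc[row, col], newdf.iloc[row, col])
# olddf now holds the delta, including the extra rows, and can be
# written to 'Delta Sheet' exactly as in the original code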
I'm trying to add rows from a dataframe into Google Sheets. I'm using Python 2 and pygsheets. I have 10 columns in the Google Sheet and 6 columns in my dataframe, and my problem is that when I add the rows in Sheets, it deletes the 4 extra columns of my sheet.
So this code should add the number of rows of the df to the worksheet (rows without any content):
import pygsheets
import pandas as pd

gc = pygsheets.authorize()  # credentials setup assumed
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet_by_title('RawData')

rows = df.shape[0]        # number of rows in the dataframe
worksheet.add_rows(rows)  # add that many empty rows to the worksheet
The code does work, but it is fitting the grid of the sheet to that of the df.
Does anyone know a solution for adding the exact number of rows to a worksheet while keeping the worksheet's columns intact?
I believe your goal is as follows.
In your situation, there are 10 columns in the Google Spreadsheet.
For this Spreadsheet, you want to append the values from a dataframe which has 6 columns.
In this situation, you don't want to remove the other 4 columns in the Spreadsheet.
You want to achieve this using pygsheets in Python.
In this case, how about the following flow?
Convert the dataframe to a list.
Put the list into the Spreadsheet.
When this flow is reflected in the script, it becomes as follows.
Sample script:
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet_by_title('RawData')
values = df.values.tolist()
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
If you want to include the header row, please use the following script.
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet_by_title('RawData')
values = [df.columns.values.tolist()]
values.extend(df.values.tolist())
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
In the above script, the values are put starting from the first empty row of the sheet RawData.
And when overwrite=False is used, new rows are added to match the number of rows in values.
Reference:
append_table
I am quite new to Python programming. I need to combine 1000+ files into one file. Each file has 3 sheets in it, and I need to get data only from sheet2 to make a final Excel file. I am facing a problem picking a value from a specific cell of sheet2 in each Excel file to create a column: Python is picking the value from the first file and creating the column from that.
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsm'):
        df = pd.read_excel(file, sheet_name=1, header=None)
        df['REPORT_NO'] = df.iloc[1][4]    # Report Number
        df['SUPPLIER'] = df.iloc[2][4]     # Supplier
        df['REPORT_DATE'] = df.iloc[0][4]  # Report Date
        df2 = df2.dropna(thresh=15)
        df2 = df.append(df, ignore_index=True)
        df = df.reset_index()
        del df['index']
df2.to_excel('FINAL_FILES.xlsx')
How can I solve this issue so Python takes the values from each Excel file and puts the information in the right rows?
I. df.iloc[2][4] refers to the 3rd row and 5th column (iloc is zero-indexed) of the one sheet you read. You imported with sheet_name=1, i.e. the second sheet, and never read a different one, though you mentioned each .xlsm has 3 sheets, so check that this is really the sheet you want.
II. Your scoping could be wrong. Why define df outside of the loop? It will change per file, so there is no need for an external one. All info from the loop should be put into your df2 before the next iteration of the loop.
III. Have you checked whether append is adding a row or a column?
Even though
df['REPORT_NO'] = df.iloc[1][4] #Report Number
df['SUPPLIER'] = df.iloc[2][4] #Supplier
df['REPORT_DATE'] = df.iloc[0][4] #Report Date
are written as columns, they have Report Number/Supplier/Report Date repeated for every row in that column.
When you use df2 = df.append(df, ignore_index=True), check the output. It might not be appending in the way you intend.
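A hedged sketch pulling points I-III together: read each file's second sheet, stamp the three header cells as columns, and collect the per-file frames in a list so there is a single concatenation at the end (files is the list of paths from the question; the cell positions are taken from its code):
import pandas as pd

frames = []
for file in files:
    if file.endswith('.xlsm'):
        sheet = pd.read_excel(file, sheet_name=1, header=None)  # second sheet, zero-indexed
        sheet['REPORT_NO'] = sheet.iloc[1][4]    # Report Number
        sheet['SUPPLIER'] = sheet.iloc[2][4]     # Supplier
        sheet['REPORT_DATE'] = sheet.iloc[0][4]  # Report Date
        frames.append(sheet.dropna(thresh=15))

# One concatenation at the end, one block of rows per source file
df2 = pd.concat(frames, ignore_index=True)
df2.to_excel('FINAL_FILES.xlsx', index=False)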
I have an Excel workbook with multiple sheets. I need to loop through each sheet, get the same certain columns from each sheet, and combine them. I am struggling to:
1. combine and store the tables together, and
2. get rid of the column headers after the first table is looped through.
This is what I currently have:
sheets = ["sheet1", "sheet2", "sheet3"]
df1 = pd.read_excel("Blank.xlsx")
for x in sheets:
df = pd.read_excel("Final.xlsx", sheetname=x, skiprows=1, header='true')
y = df[['Col1','Col2', 'Col3','Col4', 'Col5']]
print(y)
Bonus:
storing everything back into an Excel sheet.
I have currently tried using a blank workbook to pd.merge whatever is being looped through with the blank Excel file, but it's not working. Help!
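Not the only way to do it, but a minimal sketch under the question's names (Final.xlsx, sheet1-sheet3, Col1-Col5): collect the wanted columns from each sheet in a list and concatenate once, so only one set of headers survives, then write the result back out. Note that header='true' is not a valid pandas argument; a normal header row at the top of each sheet is assumed, so skiprows is dropped:
import pandas as pd

sheets = ["sheet1", "sheet2", "sheet3"]
cols = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']

# Read the wanted columns from every sheet, then stack them once;
# concat keeps a single set of column headers.
frames = [pd.read_excel("Final.xlsx", sheet_name=x)[cols] for x in sheets]
combined = pd.concat(frames, ignore_index=True)

# Bonus: write everything back to a new workbook
combined.to_excel("Combined.xlsx", index=False)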