I'm puzzled by an error from pandas/Excel data saving. My code reads data from an Excel file with multiple sheets into dataframes in a for loop. During each iteration, some subsetting is carried out on the dataframe, and the results (two subset dataframes) are appended to the bottom of the original data in the corresponding sheet (at the same row position but in different columns).
My code:
import os
import pandas as pd
import re
from excel import append_df_to_excel

path = 'orig_vers.xlsx'
xls = pd.ExcelFile(path, engine='openpyxl')

# Loop over every sheet in the workbook
for sheet in xls.sheet_names:
    try:
        df = pd.read_excel(xls, sheet_name=sheet)
        # Drop empty columns, then drop rows with missing values
        df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
        # Subset data
        ver1 = some calculation on df1
        ver2 = some calculation on df1
        lastrow = len(df1)  # number of remaining rows, i.e. the first row index after the data
        append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
        append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)
    except Exception:
        continue
"append_df_to_excel" is a helper function from here link
The code works well for the first sheet, appending the two result dataframes at the bottom of the original data at the specified positions, but no data is appended to the other sheets. If I remove the try/except lines and run the code again, I get this error: "Error -3 while decompressing data: invalid distance too far back".
My suspicion is that, from sheet number 2 onward, the original data have some empty rows which my code removes before subsetting, and the Excel writer then has issues with this line: lastrow = len(df1). Does anyone know the answer to this issue?
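One possible cause, offered as an assumption rather than a confirmed diagnosis: pd.ExcelFile(path) keeps the workbook (a zip archive) open for the whole loop, while each append_df_to_excel call rewrites that same file on disk, so later reads go through a handle whose underlying compressed data has changed, which is the kind of situation that produces zlib's "Error -3 while decompressing data". A minimal sketch of a workaround that reads every sheet into memory before any writing happens:
import pandas as pd
from excel import append_df_to_excel  # same helper as above

path = 'orig_vers.xlsx'

# Read all sheets up front (sheet_name=None returns a dict of dataframes)
# so no open handle points at the file while append_df_to_excel rewrites it.
sheets = pd.read_excel(path, sheet_name=None, engine='openpyxl')

for sheet, df in sheets.items():
    df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
    ver1 = df1  # placeholder for your first calculation on df1
    ver2 = df1  # placeholder for your second calculation on df1
    lastrow = len(df1)
    append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
    append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)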
Related
My program reads in an Excel file that contains multiple sheets and concatenates them together. The issue is that the last row of each sheet is a Totals row, and I don't want that row. Is there an argument that will drop the last row when I read the sheets in? Or will I need to read the sheets in and remove the last row from each before running the concat, to avoid deleting the wrong rows? I've tried using skipfooter=0 and skipfooter=1, but that threw an error message.
I assume you are using pandas to read an xlsx file that has multiple sheets with different lengths of data, and you want to drop the last row from each sheet. You can use [:-1] like this:
df = pd.ExcelFile('report.xlsx', engine='openpyxl')
# Parse each sheet and slice off its last (Totals) row
data = [df.parse(name)[:-1] for name in df.sheet_names]
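If you then want one dataframe rather than a list, pd.concat stitches the pieces together (ignore_index=True renumbers the rows so the indices don't repeat):
combined = pd.concat(data, ignore_index=True)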
I'm trying to add rows from a dataframe into Google Sheets. I'm using Python 2 and pygsheets. I have 10 columns in the Google Sheet and 6 columns in my dataframe, and my problem is that when I add the rows in Sheets, it deletes the 4 extra columns of my sheet.
So this code should add the number of rows of the df to the worksheet (rows without any content):
import pygsheets
import pandas as pd

gc = pygsheets.authorize()  # credentials setup, omitted in the original post
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
rows = df.shape[0]  # number of rows in the dataframe
worksheet.add_rows(df)
The code does work, but it fits the sheet's grid to that of the df.
Does anyone know a solution for adding the exact number of rows to a worksheet while keeping the worksheet's columns intact?
I believe your goal is as follows.
In your situation, there are 10 columns in the Google Spreadsheet.
Into this Spreadsheet, you want to append the values from a dataframe which has 6 columns.
In this situation, you don't want to remove the other 4 columns in the Spreadsheet.
You want to achieve this using pygsheets in Python.
In this case, how about the following flow?
1. Convert the dataframe to a list.
2. Put the list into the Spreadsheet.
When this flow is reflected in a script, it becomes as follows.
Sample script:
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = df.values.tolist()
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
If you want to include the header row, please use the following script.
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = [df.columns.values.tolist()]
values.extend(df.values.tolist())
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
In the above script, the values are put starting from the first empty row of the sheet RawData.
And when overwrite=False is used, new rows matching the number of rows in values are added.
Reference:
append_table
I am trying to create a CSV from a dataframe based on conditions, e.g. if a particular column is not null, the row needs to be added to the CSV file. My code does convert the file based on the criteria, but at the end it adds an extra null row (see the screenshot).
Here is my code:
# Keep only rows where TRUCK_ID is present
df = df[pd.notnull(df['TRUCK_ID'])]
df[['FACILITY', 'TRUCK_ID', 'LICENSES']].to_csv(r'E:\Truck.txt', header=None, index=None, sep=',')
How can I eliminate the last blank row from the CSV file?
You can select every row except the last using iloc:
df.iloc[:-1][['FACILITY', 'TRUCK_ID', 'LICENSES']].to_csv(r'E:\Truck.txt', header=None, index=None, sep=',')
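A variant of the same idea, assuming the unwanted row really is a row with missing values rather than just the file's trailing newline: drop incomplete rows across the three columns before writing, instead of slicing by position:
cols = ['FACILITY', 'TRUCK_ID', 'LICENSES']
# dropna(subset=cols) removes any row missing a value in those columns
df.dropna(subset=cols)[cols].to_csv(r'E:\Truck.txt', header=None, index=None, sep=',')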
I have an Excel workbook with multiple sheets. I need to loop through each sheet, get the same certain columns from each, and combine them. I am struggling to:
1. combine and store the tables together, and
2. get rid of the column headers after the first table is looped through.
This is what I currently have:
sheets = ["sheet1", "sheet2", "sheet3"]
df1 = pd.read_excel("Blank.xlsx")
for x in sheets:
    df = pd.read_excel("Final.xlsx", sheet_name=x, skiprows=1)
    y = df[['Col1', 'Col2', 'Col3', 'Col4', 'Col5']]
    print(y)
Bonus: storing everything back into an Excel sheet.
I have tried using a blank workbook and pd.merge to combine whatever is being looped through with the blank Excel file, but it's not working. Help!
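A minimal sketch of one way to do this, assuming every sheet has the five columns and the real header sits on the second row of each sheet (matching the skiprows=1 above); the output filename Combined.xlsx is made up for illustration:
import pandas as pd

sheets = ["sheet1", "sheet2", "sheet3"]
cols = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5']

# Read the same columns from every sheet and collect them in a list;
# concatenating once at the end keeps only a single set of headers.
frames = [pd.read_excel("Final.xlsx", sheet_name=x, skiprows=1)[cols] for x in sheets]
combined = pd.concat(frames, ignore_index=True)

# Bonus: store everything back into a single Excel sheet.
combined.to_excel("Combined.xlsx", index=False)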
(screenshots: table 1, the first table; table 2, the second table in a single sheet)
I want to read Excel data and fill in the missing values, but I have many tables in a single sheet. How can I split them and fill in values only within each table's own data?
Here's my code:
# Read the Excel file
import pandas as pd
import numpy as np

stations_data = pd.read_excel('filename', sheet_name=0, skiprows=6)

# Get a dataframe with the selected columns
FORMAT = ['S.No.', 'YEAR', 'JUNE']
df_selected = stations_data[FORMAT]

# Fill missing values with the column means
for col in FORMAT:
    for idx, rows in df_selected.iterrows():
        if pd.isnull(df_selected.loc[idx, col]):
            df_selected = df_selected.fillna(df_selected.mean())
print(df_selected)
You could use pd.read_excel with the keyword argument skiprows to start at the 'correct' row for the specific table, and skipfooter to stop at the correct row. Of course, this may not be so practical if the number of rows in the tables changes in the future. Maybe it is easier to just restructure the Excel file to have one table per sheet, and then use the sheet_name kwarg. See the documentation.
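A minimal sketch of that idea, with made-up row positions (the skiprows/skipfooter numbers below are placeholders, not read from your file):
import pandas as pd

# First table: the data starts after 6 title rows; skipfooter drops the
# rows below it (the gap plus the whole second table) - placeholder: 20.
table1 = pd.read_excel('filename', sheet_name=0, skiprows=6, skipfooter=20)

# Second table: its header row sits further down the sheet - placeholder: row 30.
table2 = pd.read_excel('filename', sheet_name=0, skiprows=29)

# Fill each table's gaps with that table's own column means, so values
# from one table never leak into the other.
table1 = table1.fillna(table1.mean(numeric_only=True))
table2 = table2.fillna(table2.mean(numeric_only=True))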