I have an Excel workbook with multiple sheets. I need to loop through each sheet, take the same set of columns from each one, and combine them. I am struggling to:
1. Combine and store the tables together.
2. Get rid of the column headers after the first table is looped through.
This is what I currently have:
sheets = ["sheet1", "sheet2", "sheet3"]
df1 = pd.read_excel("Blank.xlsx")
for x in sheets:
    df = pd.read_excel("Final.xlsx", sheetname=x, skiprows=1, header='true')
    y = df[['Col1','Col2', 'Col3','Col4', 'Col5']]
    print(y)
Bonus:
storing everything back into an excel sheet.
I have tried using a blank workbook and pd.merge to merge whatever is being looped through with the blank Excel file, but it's not working. Help!
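One way to approach both points, offered as a minimal sketch rather than a definitive answer: let pandas parse the header row into column names, keep the wanted columns from each sheet, collect the pieces in a list, and stack them once with pd.concat. Because the header becomes the column index rather than a data row, it is not repeated between tables. The function names combine_columns and combine_workbook are hypothetical, and skiprows=1 is copied from the question on the assumption that the real header sits on the second row.

```python
import pandas as pd

COLUMNS = ["Col1", "Col2", "Col3", "Col4", "Col5"]

def combine_columns(frames, columns=COLUMNS):
    """Stack the same columns from several DataFrames into one table."""
    parts = [df[columns] for df in frames]
    # ignore_index=True renumbers the rows 0..n-1 across all sheets
    return pd.concat(parts, ignore_index=True)

def combine_workbook(path, sheets, out_path):
    # Hypothetical wrapper: a list of sheet names returns {name: DataFrame}.
    frames = pd.read_excel(path, sheet_name=sheets, skiprows=1)
    combined = combine_columns(frames.values())
    # The "bonus": write the combined table back to an Excel file.
    combined.to_excel(out_path, index=False)
    return combined
```

It could be called as combine_workbook("Final.xlsx", ["sheet1", "sheet2", "sheet3"], "Combined.xlsx"), where the output file name is only an example.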
I'm puzzled by an error when saving data from pandas to Excel. My code reads data from an Excel file with multiple sheets into dataframes in a for loop. During each iteration some data subsetting is carried out on a dataframe, and the results (2 subset dataframes) are appended to the bottom of the original data in specific sheets (at the same row position but in different columns).
My code:
import os
import pandas as pd
import re
from excel import append_df_to_excel
path = 'orig_vers.xlsx'
xls = pd.ExcelFile(path, engine='openpyxl')
# Loop
for sheet in xls.sheet_names:
    try:
        df = pd.read_excel(xls, sheet_name=sheet)
        # drop empty columns, then drop rows with missing values
        df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
        # Subset data
        ver1= some calculation on df1
        ver2= some calculation on df1
        lastrow = len(df1)  # return the last row value
        append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow+2, index=False)
        append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow+2, startcol=5, index=False)
    except Exception:
        continue
"append_df_to_excel" is a helper function taken from a linked answer.
The code works well for the first sheet, appending the two result dataframes below the original data at the specified positions, but no data is appended to the other sheets. If I remove the try and except lines and run the code, I get this error: "Error -3 while decompressing data: invalid distance too far back".
My suspicion is that, from the second sheet onward, the original data has some empty rows, which my code removes before subsetting, so the Excel writer has issues with this line: lastrow = len(df1). Does anyone know the answer to this issue?
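The exact cause cannot be determined from the post. One plausible explanation, stated here as an assumption, is that the workbook is rewritten on disk while the pd.ExcelFile opened before the loop is still reading from it, which could produce a decompression error on the next sheet. A sketch of a workaround under that assumption: read every sheet up front, then append with pandas' own writer. It also computes the start row from the data as read rather than from the cleaned frame, since len(df1) no longer matches the sheet once rows are dropped. The names clean, start_row, append_results and the compute callback are hypothetical, and this replaces the append_df_to_excel helper rather than reproducing it.

```python
import pandas as pd

def clean(df):
    """Drop all-empty columns, then rows with any missing value."""
    return df.dropna(how="all", axis=1).dropna(how="any", axis=0)

def start_row(df, gap=1):
    """0-based row at which to write below a sheet read with a header row."""
    # 1 header row + len(df) data rows as read (before cleaning) + blank rows
    return 1 + len(df) + gap

def append_results(path, compute):
    # Read every sheet first so no reader is open while the file is rewritten.
    sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")
    # mode="a" with if_sheet_exists="overlay" needs pandas >= 1.4.
    with pd.ExcelWriter(path, engine="openpyxl", mode="a",
                        if_sheet_exists="overlay") as writer:
        for name, df in sheets.items():
            # compute is a hypothetical callback returning (ver1, ver2)
            ver1, ver2 = compute(clean(df))
            row = start_row(df)
            ver1.to_excel(writer, sheet_name=name, startrow=row, index=False)
            ver2.to_excel(writer, sheet_name=name, startrow=row, startcol=5,
                          index=False)
```

Dropping the blanket except Exception: continue while debugging also helps, because it hides the error on every sheet after the first.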
My program reads in an Excel file that contains multiple sheets and concatenates them together. The issue is that the last row of each sheet is a Totals row, and I don't want that row. Is there an argument that will drop the last row when I read the sheets in? Or will I need to read the sheets in first and remove this last row before I run the concat function, to avoid deleting the wrong rows? I've tried using skipfooter = 0 and skipfooter = 1, but this threw an error message.
I assume you are using pandas to read an xlsx file with multiple sheets of different lengths, and you want to drop the last row from each sheet. In that case you can slice with [:-1], like this:
df = pd.ExcelFile('report.xlsx',engine = 'openpyxl')
data = [df.parse(name)[:-1] for name in df.sheet_names]
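The list above still has to be concatenated. A minimal sketch of the full flow, assuming every sheet ends with exactly one totals row: trim each frame first, then stack them, so only the intended rows are removed. The function names are hypothetical. read_excel also documents a skipfooter parameter for skipping rows at the end of a sheet; why it raised an error in the question is not clear from the post.

```python
import pandas as pd

def concat_without_last_row(frames):
    """Drop the final (totals) row of each DataFrame, then stack them."""
    trimmed = [df.iloc[:-1] for df in frames]
    return pd.concat(trimmed, ignore_index=True)

def read_without_totals(path):
    # sheet_name=None returns {sheet name: DataFrame} for every sheet.
    sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")
    return concat_without_last_row(sheets.values())
```

Trimming before the concat matters: after concatenation the totals rows are scattered through the combined table and are harder to identify.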
I'm trying to add rows from a dataframe to Google Sheets. I'm using Python 2 and pygsheets. I have 10 columns in the Google Sheet and 6 columns in my dataframe. My problem is that when I add the rows to the sheet, it deletes the 4 extra columns of the sheet.
This code should add as many (empty) rows to the worksheet as there are rows in the df:
import pygsheets
import pandas as pd
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
rows= df.shape[0]
worksheet.add_rows(df)
The code works, but it resizes the sheet's grid to match that of the df.
Does anyone know a solution for adding the exact number of rows to a worksheet while keeping the worksheet columns intact?
I believe your goal is as follows.
In your situation, there are 10 columns in the Google Spreadsheet.
To this Spreadsheet, you want to append the values from the dataframe, which has 6 columns.
In doing so, you don't want to remove the other 4 columns in the Spreadsheet.
You want to achieve this using pygsheets in Python.
In this case, how about the following flow?
Convert dataframe to a list.
Put the list to Spreadsheet.
When this flow is applied to the script, it becomes as follows.
Sample script:
df = ...  # <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = df.values.tolist()
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
If you want to include the header row, please use the following script.
df = ...  # <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = [df.columns.values.tolist()]
values.extend(df.values.tolist())
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
In the above scripts, the values are written starting from the first empty row of the sheet RawData.
With overwrite=False, as many new rows as there are rows in values are inserted into the sheet.
Reference:
append_table
I am quite new to Python programming. I need to combine 1000+ files into one file. Each file has 3 sheets, and I need to get data only from sheet2 and build a final Excel file. My problem is picking a value from a specific cell on sheet2 of each Excel file and creating a column from it: Python picks the value from the first file and creates the column from that.
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsm'):
        df = pd.read_excel(file, sheet_name=1, header=None)
        df['REPORT_NO'] = df.iloc[1][4] #Report Number
        df['SUPPLIER'] = df.iloc[2][4] #Supplier
        df['REPORT_DATE'] = df.iloc[0][4] #Report Number
        df2 = df2.dropna(thresh=15)
        df2 = df.append(df, ignore_index=True)
df = df.reset_index()
del df['index']
df2.to_excel('FINAL_FILES.xlsx')
How can I solve this so that Python takes the values from each Excel file and puts the information on the right rows?
I. df.iloc[2][4] refers to the 3rd row and 5th column of the sheet you read, because both positions are zero-based. sheet_name=1 is zero-based as well, so it selects the second sheet of each .xlsm file. Check that these are the cells and the sheet you intend.
II. Your scoping could be wrong. Why define df outside of the loop? It will change per file, so there is no need for an external one. All info from the loop should be put into your df2 before the next iteration of the loop.
III. Have you checked whether append is adding a row or a column?
Even though
df['REPORT_NO'] = df.iloc[1][4] #Report Number
df['SUPPLIER'] = df.iloc[2][4] #Supplier
df['REPORT_DATE'] = df.iloc[0][4] #Report Number
are written as columns they have Report Number/Supplier/Report Date repeated for every row in that column.
When you use df2 = df.append(df, ignore_index=True) check the output. It might not be appending in the way you intend.
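The points above can be put together as a minimal sketch, not a definitive fix: build one tagged frame per file inside the loop, collect the frames in a list, and concatenate once at the end. DataFrame.append was removed in pandas 2.0, so pd.concat is used instead. The names tag_report and combine_reports are hypothetical, the cell positions are copied from the question, and the dropna(thresh=15) step is left out because its intent is not clear from the post.

```python
import pandas as pd

def tag_report(raw):
    """Copy three header cells from column index 4 into repeated columns."""
    out = raw.copy()
    out["REPORT_DATE"] = raw.iloc[0, 4]  # 1st row, 5th column
    out["REPORT_NO"] = raw.iloc[1, 4]    # 2nd row, 5th column
    out["SUPPLIER"] = raw.iloc[2, 4]     # 3rd row, 5th column
    return out

def combine_reports(files, out_path="FINAL_FILES.xlsx"):
    parts = []
    for file in files:
        if file.endswith(".xlsm"):
            # sheet_name=1 is zero-based, i.e. the second sheet of each file
            raw = pd.read_excel(file, sheet_name=1, header=None)
            parts.append(tag_report(raw))
    # One concat at the end; raises ValueError if no .xlsm file was found.
    combined = pd.concat(parts, ignore_index=True)
    combined.to_excel(out_path)
    return combined
```

Each file's values stay attached to that file's rows because the columns are added before the frames are combined.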
I have a file input_file_new.xsl and I need to delete all completely empty rows and columns. I have come up with this function:
def DeleteEmptyColumns(filename):
    import pandas as pd
    new_loc = 'input_file_new.xsl'
    df = pd.read_excel(new_loc, 'Person')
    df.drop('Application_ID', 1,inplace=True)
    writer = pd.ExcelWriter('output.xlsx')
    df.to_excel(writer,'Sheet1')
    writer.save()
Which does delete the columns correctly, but only on the first sheet and then it saves that sheet as a whole new worksheet. I need each sheet to remain on the same document after I remove all empty columns. Is there an easier way? I looked into Win32 COM but I want to be able to use Pandas for this.
EDIT: this is a screenshot of the excel. So you can see on the Person tab I need to delete column A because it is completely empty. I need to do this for each tab.
This code should do it:
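import pandas as pd

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}
df = pd.read_excel("input_file_new.xlsx", header=None, sheet_name=None)
# The with block saves and closes the file; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('output_file.xlsx', engine='openpyxl') as writer:
    for key in df:
        # drop rows, then columns, that contain only NaN
        sheet = df[key].dropna(how="all").dropna(axis=1, how="all")
        sheet.to_excel(writer, sheet_name=key, index=False, header=False)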
The for loop iterates over each sheet in the workbook. The rows and columns that contain only NaN cells are then removed, and the resulting table is stored in a sheet with the same name as the original one, but in a new file.
read_excel with sheet_name set to None will read each sheet of the workbook into a dictionary (called df).
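To see what the two dropna calls do without touching a file, here is a small in-memory illustration. The toy data and the helper name drop_empty are made up for the example; only rows and columns in which every cell is NaN are removed, while partially filled ones are kept.

```python
import numpy as np
import pandas as pd

def drop_empty(df):
    """Remove rows, then columns, in which every cell is NaN."""
    return df.dropna(how="all").dropna(axis=1, how="all")

# Toy sheet: the first column and the second row are completely empty.
sheet = pd.DataFrame([
    [np.nan, "Ann", 31],
    [np.nan, np.nan, np.nan],
    [np.nan, "Bob", np.nan],
])

cleaned = drop_empty(sheet)
print(cleaned.shape)  # (2, 2)
```

The row for "Bob" survives even though one of its cells is empty, because how="all" requires every cell in the row to be NaN.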