Read Excel data and fill missing values using pandas - python

Table 1: first table
Table 2: second table in the same sheet.
I want to read an Excel file and fill in its missing values, but I have many tables in a single sheet. How can I split them and fill in the values of each table separately?
Here's my code:
# read the Excel file
import pandas as pd
import numpy as np

stations_data = pd.read_excel('filename', sheet_name=0, skiprows=6)
# get a DataFrame with the selected columns
FORMAT = ['S.No.', 'YEAR', 'JUNE']
df_selected = stations_data[FORMAT]
# fill missing values with the column means
for col in FORMAT:
    for idx, rows in df_selected.iterrows():
        if pd.isnull(df_selected.loc[idx, col]):
            df_selected = df_selected.fillna(df_selected.mean())
print(df_selected)

You could use pd.read_excel with the keyword argument skiprows to start at the 'correct' row for the specific table and skipfooter to stop at the correct row. Of course, this may not be practical if the number of rows in the tables changes in the future. It may be easier to restructure the Excel file to have one table per sheet, and then just use the sheet_name kwarg. See the documentation.
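Alternatively, a rough sketch of splitting one sheet on blank separator rows and filling each table's missing values with its own column means. The sample data and the blank-row layout below are assumptions standing in for the real file, not the asker's actual workbook:

```python
import numpy as np
import pandas as pd

# stand-in for pd.read_excel(..., header=None): two small tables
# separated by a fully blank row, as they might sit in one sheet
raw = pd.DataFrame([
    ['S.No.', 'YEAR', 'JUNE'],
    [1, 2001, 10.0],
    [2, 2002, np.nan],
    [np.nan, np.nan, np.nan],   # blank row between the tables
    ['S.No.', 'YEAR', 'JUNE'],
    [1, 2001, 30.0],
    [2, 2002, np.nan],
])

blank = raw.isnull().all(axis=1)          # True on separator rows
tables = []
for _, chunk in raw[~blank].groupby(blank.cumsum()):
    t = chunk.iloc[1:].reset_index(drop=True)   # rows below the header
    t.columns = chunk.iloc[0]                   # first row becomes the header
    t = t.apply(pd.to_numeric, errors='coerce')
    tables.append(t.fillna(t.mean()))           # per-table column means
```

Each entry of tables is now one table with its NaNs replaced by that table's own column averages, so the two tables never contaminate each other's statistics.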

Related

Write dataframes to multiple excel sheets without overwriting data using python/pandas

I'm puzzled by an error from pandas/Excel data saving. My code reads data from an Excel file with multiple sheets into dataframes in a for loop. Some subsetting is carried out on a dataframe during each iteration, and the results (two subset dataframes) are appended to the bottom of the original data in specific sheets (at the same row position but in different columns).
My code:
import os
import pandas as pd
import re
from excel import append_df_to_excel

path = 'orig_vers.xlsx'
xls = pd.ExcelFile(path, engine='openpyxl')
# loop over every sheet
for sheet in xls.sheet_names:
    try:
        df = pd.read_excel(xls, sheet_name=sheet)
        # drop empty columns, then drop rows with missing values
        df1 = df.dropna(how='all', axis=1).dropna(how='any', axis=0)
        # subset the data
        ver1 = ...  # some calculation on df1
        ver2 = ...  # some calculation on df1
        lastrow = len(df1)  # the last row of the cleaned data
        append_df_to_excel(path, ver1, sheet_name=sheet, startrow=lastrow + 2, index=False)
        append_df_to_excel(path, ver2, sheet_name=sheet, startrow=lastrow + 2, startcol=5, index=False)
    except Exception:
        continue
"append_df_to_excel" is a helper function from here: link
The code works well for the first sheet, appending the two result dataframes at the bottom of the original data at the specified positions, but no data is appended to the other sheets. If I remove the try/except lines and run the code again, I get this error: "Error -3 while decompressing data: invalid distance too far back".
My suspicion is that, from sheet number 2 onward, the original data has some empty rows which my code removes before subsetting, so the Excel writer has issues with this line: lastrow = len(df1). Does anyone know the answer to this issue?

Use Python to export Excel column data when column isn't in first row

I need to pull data from a column based on the column header. My only problem is that the input files aren't consistent: the column appears in different locations and the data doesn't start on row one.
Above is an example Excel file. I want to pull the data for Market. I've got this to work using pandas if the data starts at A1, but I can't get it to pull the data if it doesn't start in the first position.
How about using this just after your pd.read_excel() statement?
df = df.dropna(how='all', axis='columns').dropna(how='all', axis='rows')
You can then set the first row as header:
df.columns = df.iloc[0]
df = df[1:]
df
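Putting those pieces together on a made-up frame (the layout below is an assumption standing in for the real file), you can then pull the Market column by name:

```python
import numpy as np
import pandas as pd

# stand-in for read_excel output where the table does not start at A1
df = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, 'Region', 'Market', np.nan],
    [np.nan, 'East', 'NYC', np.nan],
    [np.nan, 'West', 'LA', np.nan],
])

# trim fully empty rows/columns, then promote the first row to header
df = df.dropna(how='all', axis='columns').dropna(how='all', axis='rows')
df.columns = df.iloc[0]
df = df[1:]

market = df['Market'].tolist()
```

Because the trimming is done by emptiness rather than by position, it works no matter where in the sheet the table actually starts.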

How to read a main row or main column in a DataFrame and search its cells for a string?

I have an Excel workbook that I convert into a dataframe, and I'm trying to read each row and column and assign variables. The problem is the data is not always in the same place, so I can't hardcode its location. Instead I capture the first row and first column, and I'm trying to search them to find the data and get its location.
Here is my code:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(strTotalFile, data_only=True)
for sheet in wb.sheetnames:
    ws = wb[sheet]
    data = ws.values
    pd.set_option('display.max_columns', None)
    df = pd.DataFrame(data)
    mainRow = df.iloc[[17]]
    mainCol = df.iloc[:, 1]
This part of your question really bothers me:
"the data is not always in the same place"
I take this to mean that your columns (or rows) are not consistently formatted, which would cause no end of problems for you. But assuming I'm wrong about that, you can essentially use simple commands; the accepted answer I link below covers what you need, not my answer.
How do I select rows from a DataFrame based on column values?
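As a minimal sketch of the locate-by-search idea (the search string 'Total' and the tiny frame are made up for illustration), you can compare the whole frame against the string and read off the matching positions:

```python
import pandas as pd

# hypothetical frame where the label's position is not fixed
df = pd.DataFrame([
    ['x', 'x', 'x'],
    ['x', 'Total', 'x'],
])

# boolean mask of matching cells, stacked into (row, column) pairs
hits = df.eq('Total').stack()
locations = list(hits[hits].index)
```

locations then holds every (row, column) position where the string occurs, so you never have to hardcode where the label sits.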

How can I add rows in a sheet from a dataframe (using pygsheets) without changing the number of columns in the worksheet?

I'm trying to add rows from a dataframe into Google Sheets. I'm using Python 2 and pygsheets. I have 10 columns in the Google Sheet and 6 columns in my dataframe. My problem is that when I add the rows to the sheet, it deletes the 4 extra columns of my sheet.
This code should add the number of rows of the df to the worksheet (rows without any content):
import pygsheets
import pandas as pd

gc = pygsheets.authorize()  # assumes authorization is already set up
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
rows = df.shape[0]
worksheet.add_rows(df)
The code does work, but it shrinks the sheet's grid to match the df's.
Does anyone know a solution for adding the exact number of rows to a worksheet while keeping the worksheet's columns intact?
I believe your goal is as follows:
Your Google Spreadsheet has 10 columns.
To this Spreadsheet, you want to append the values from a dataframe which has 6 columns.
In doing so, you don't want to remove the other 4 columns in the Spreadsheet.
You want to achieve this using pygsheets in Python.
In this case, how about the following flow?
Convert dataframe to a list.
Put the list into the Spreadsheet.
When this flow is reflected in a script, it becomes as follows.
Sample script:
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = df.values.tolist()
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
If you want to include the header row, please use the following script.
df = ### <--- Please set your dataframe.
sh = gc.open_by_key('xxxx')
worksheet = sh.worksheet('RawData')
values = [df.columns.values.tolist()]
values.extend(df.values.tolist())
worksheet.append_table(values, start='A1', end=None, dimension='ROWS', overwrite=False)
In the above script, the values are put starting from the first empty row of the sheet RawData.
And when overwrite=False is used, new rows are added to the sheet for each row of values.
Reference:
append_table

Flattening Table From Excel into Csv with Pandas

I'm trying to take the data from a table in Excel and put it into a single row in a CSV. I have the data imported from Excel into a dataframe using pandas, but now I need to write this data to a CSV in a single row. Is this possible, and if so, what would the syntax generally look like for taking a 50-row, 3-column table and flattening it into a 1-row, 150-column CSV table? My code so far is below:
import pandas as pd

# note: with a list of sheet names, read_excel returns a dict of DataFrames
df = pd.read_excel('filelocation.xlsx',
                   sheet_name=['pnl1 Data ', 'pnl2 Data', 'pnl3 Data', 'pnl4 Data'],
                   skiprows=8, parse_cols="B:D", keep_default_na='FALSE',
                   na_values=['NULL'], header=3)
df.to_csv("outputFile.csv")
Another question that would help me understand how to transform this data: is there any way to select a piece of data from a specific row and column?
You can simply set the line_terminator to a comma, like so:
df.to_csv('ouputfile.csv', line_terminator=',', index=False, header=False)
Or you can translate your dataframe into a numpy array and use the reshape function:
import numpy as np
import pandas as pd
arr = df.values.reshape(1,-1)
You can then use numpy.savetxt() to save as CSV.
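A sketch of that route on a made-up 50x3 frame (the numbers are placeholders standing in for the real Excel data):

```python
import numpy as np
import pandas as pd

# placeholder 50-row, 3-column table standing in for the Excel data
df = pd.DataFrame(np.arange(150).reshape(50, 3), columns=['B', 'C', 'D'])

# flatten row by row into one 1 x 150 row, then write it out as CSV
arr = df.values.reshape(1, -1)
np.savetxt('outputFile.csv', arr, delimiter=',', fmt='%d')
```

reshape(1, -1) keeps the values in row-major order, so the first three CSV fields are the first row of the original table, the next three are the second row, and so on.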
Try this:
df.to_csv("outputFile.csv", line_terminator=',')
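As for the side question about selecting a single value from a specific row and column, pandas' indexers handle that directly; a tiny example on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

pos = df.iloc[1, 0]    # by integer position: row 1, column 0
lab = df.at[1, 'B']    # by row label and column name
```

Use .iloc when you know the position and .at (or .loc) when you know the labels.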
