I have a dataframe with 45 columns and 1000 rows. My requirement is to create a single Excel sheet with the top 2 values of each column and their percentages (for example, if col1 contains the value 'python' 500 times, its percentage should be 50).
I used:
import pandas as pd

writer = pd.ExcelWriter('abc.xlsx')
df = pd.read_sql('select * from table limit 1000', <db connection string>)
column_list = df.columns.tolist()
df.fillna("NULL", inplace=True)
for obj in column_list:
    df1 = pd.DataFrame(df[obj].value_counts().nlargest(2))
    df1.to_excel(writer, sheet_name=obj)
writer.save()
This writes the output in separate excel tabs of the same document. I need them in a single sheet in the below format:
Column Name Value Percentage
col1 abc 50
col1 def 30
col2 123 40
col2 456 30
....
Let me know any other functions as well to get to this output.
The first thing that jumps out at me is that you are changing the sheet name each time with sheet_name=obj. If you get rid of that, that alone might fix your problem.
If not, I would suggest concatenating the results into one large DataFrame and then writing that DataFrame to Excel:
df_master = None
for obj in column_list:
    df1 = pd.DataFrame(df[obj].value_counts().nlargest(2))
    if df_master is None:
        df_master = df1
    else:
        df_master = pd.concat([df_master, df1])
df_master.to_excel("abc.xlsx")
Here's more information on stacking/concatenating dataframes in Pandas
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
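To get the single-sheet layout asked for in the question (Column Name / Value / Percentage), here is a minimal sketch; it assumes df is the frame read from SQL above and reuses the file name abc.xlsx:
rows = []
for col in df.columns:
    for value, count in df[col].value_counts().nlargest(2).items():
        rows.append({"Column Name": col,
                     "Value": value,
                     "Percentage": 100 * count / len(df)})
pd.DataFrame(rows).to_excel("abc.xlsx", sheet_name="Sheet1", index=False)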
I have a pandas dataframe and I need to append 3 blank rows above the column headers before exporting it to xlsx.
I'm using this code, based on this question:
df1 = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = pd.concat([df1, df], ignore_index=True)
But it adds the rows at index 0, and I need the blank rows above the row with the column names in the xlsx.
Is this possible to do?
Use the startrow parameter to leave the first N rows empty:
N = 3
df.to_excel('data.xlsx', sheet_name='Sheet1', startrow=N, index=False)
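With N = 3, rows 1-3 of the sheet stay blank and the header lands in row 4, since startrow is 0-indexed while spreadsheet rows are 1-indexed. A quick sanity check with openpyxl (assuming it is installed), for the file written above:
from openpyxl import load_workbook

ws = load_workbook('data.xlsx')['Sheet1']
print(ws.cell(row=4, column=1).value)  # first column name; rows 1-3 are blank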
I am reading an Excel file using pandas. I want to create multiple data frames from the original data frame; each data frame's name should be the row 1 heading. Also, how do I skip the single column between each transaction?
Expected result:
transaction_1:
name id available capacity completed all
transaction_2:
name id available capacity completed all
transaction_3:
name id available capacity completed all
What I tried:
import pandas as pd
import pprint as pp
pd.options.display.width = 0
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
df = pd.read_excel(r'capacity.xlsx', sheet_name='Sprint Details', header=0)
df1 = df.iloc[:, 0:3]
print(df1)
You can try this (works with pd.__version__ == 1.1.1):
df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
      )
      .dropna(axis=1, how="all")
      .rename_axis(index=["name", "id"], columns=[None, None]))
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
Essentially, we need to read the sheet in as a DataFrame with a MultiIndex. The first 2 rows are our column names (header=[0, 1]), whereas the first 2 columns are our index that will be used for each "subtable" (index_col=[0, 1]).
Because there are blank columns between the tables, we will have columns that are entirely NaN, so we drop those with .dropna(axis=1, how="all").
Because pandas does not expect the index names and columns to be in the same row, it will incorrectly parse your index column names ["name", "id"] as the name of the second level of the column index. To remedy this, we manually assign the correct index names while also removing the column index names via rename_axis(index=["name", "id"], columns=[None, None]).
Now that we have a nicely formatted table with MultiIndex columns, we can simply slice out each table and call .reset_index() on each to ensure that "name" and "id" appear as columns in each table.
Edit: Seems we have a parsing difference between our versions of pandas.
Option 1.
If you can directly modify the Excel sheet to include another row (to better separate the columns from the index names), that will provide the most robust results.
The following code works:
df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
      )
      .dropna(axis=1, how="all"))
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
Option 2.
If you cannot modify the Excel file, we'll need a more roundabout method, unfortunately.
df = pd.read_excel("capacity.xlsx", header=[0,1]).dropna(axis=1, how="all")
index = pd.MultiIndex.from_frame(df.iloc[:, :2].droplevel(0, axis=1))
df = df.iloc[:, 2:].set_axis(index)
transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
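If there are more transactions than the three shown, a hypothetical generalization (assuming the df built in either option above) collects every top-level table into a dict instead of slicing each one by hand:
tables = {
    name: df[name].reset_index()
    for name in df.columns.get_level_values(0).unique()
}
transaction_1 = tables["transaction_1"]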
I have an Excel file like the one below.
I have to read the Excel file and do some operations. The problem is that I have to skip the empty rows and columns. In the above example it should read only from B3:D6, but with the code below it considers all the empty rows as well, like below.
The code I'm using:
import pandas as pd
user_input = input("Enter the path of your file: ")
user_input_sheet_master = input("Enter the Sheet name : ")
master = pd.read_excel(user_input,user_input_sheet_master)
print(master.head(5))
How do I ignore the empty rows and columns to get the output below?
ColA ColB ColC
0 10 20 30
1 23 NaN 45
2 NaN 30 50
Based on some research I have tried df.dropna(how='all'), but it also deleted ColA and ColB. I cannot hardcode values for skiprows or skipped columns because the file may not have the same format every time; the number of rows and columns to be skipped may vary. Sometimes there may not be any empty rows or columns at all, in which case nothing should be deleted.
You surely need dropna, just applied along both axes:
df = df.dropna(how='all').dropna(axis=1, how='all')
EDIT:
If we have the following file:
and then use this code:
df = pd.read_excel('tst1.xlsx', header=None)
df = df.dropna(how='all').dropna(how='all', axis=1)
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
new_df then looks the following way:
If we instead start with a second layout and use exactly the same code, I get the corresponding cleaned result. Finally, starting from a third layout, we get the same output as in the first case.
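Putting the pieces together with the variable names from the question, a minimal sketch (the header row is detected after dropping the empty rows, so nothing has to be hardcoded):
import pandas as pd

user_input = input("Enter the path of your file: ")
user_input_sheet_master = input("Enter the Sheet name : ")

# read without assuming the header is in the first row
master = pd.read_excel(user_input, sheet_name=user_input_sheet_master, header=None)

# drop fully empty rows and columns, then promote the first remaining row to the header
master = master.dropna(how='all').dropna(axis=1, how='all')
master.columns = master.iloc[0]
master = master.iloc[1:].reset_index(drop=True)
print(master.head(5))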
I am doing some analysis on several different categories, and I want all of the analysis to be on the same tab in a spreadsheet. So I have two dataframes, but their columns and information are different.
dataframe 1
colA colB calC
row 1
row 2
row 3
dataframe 2
colD colE calD
row 1
row 2
row 3
I want to export both of these dataframes to one Excel sheet, one after the other. The analyses are different lengths, and I want dataframe 2 to sit right below dataframe 1 on the sheet.
import pandas as pd
from openpyxl import load_workbook

book = load_workbook('test.xlsx')
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = book
# register the existing sheets so writer.sheets["Sheet1"] can be found
writer.sheets = {ws.title: ws for ws in book.worksheets}
df1.to_excel(writer, sheet_name="Sheet1",
             startrow=writer.sheets["Sheet1"].max_row,
             index=False, header=False)
writer.save()
# then do the same steps for any number of additional dataframes
You can turn the second DataFrame's own column names into a data row so they stay visible, then align the columns and simply use pd.concat():
header_row = pd.DataFrame([df2.columns], columns=df1.columns)  # df2's headers as a data row
df2.columns = df1.columns
pd.concat([df1, header_row, df2], ignore_index=True)
First make the columns of both dataframes the same, then use pd.concat to append df2 to the end of df1.
You can concatenate them into a new dataframe and export it to CSV:
df = pd.concat([df1,df2])
df.to_csv('filename.csv')
If you also want the header of the second dataframe in your final CSV file, create df2 with df1's column names so that its own header row becomes a data row: df2 = pd.read_csv('df2.csv', names=df1.columns)
import numpy as np

df1 = pd.DataFrame(np.vstack([df1.columns, df1]))  # convert the column names into the first row
df2 = pd.DataFrame(np.vstack([df2.columns, df2]))  # same with the other dataframe
# concatenate these dataframes and save as Excel without index or header
pd.concat((df1, df2)).to_excel('filename.xlsx', header=False, index=False)
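Alternatively, a minimal sketch using startrow to place the second frame below the first, with both headers kept (assuming a fresh output file, filename.xlsx here):
with pd.ExcelWriter('filename.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    # leave one blank row, then write df2 below df1 (header + data rows)
    df2.to_excel(writer, sheet_name='Sheet1', index=False, startrow=len(df1) + 2)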
I have a spreadsheet with 12 tabs, one for each month. They have exactly the same columns, but possibly in a different order. Eventually, I want to combine all 12 tabs into one dataset and export a file. I know how to do everything except make sure the columns match before merging the datasets.
Here's what I have so far:
Import Excel File and Create Ordered Dictionary of All Sheets
sheets_dict = pd.read_excel("Monthly Campaign Data.xlsx", sheet_name = None, parse_dates = ["Date", "Create Date"])
I want to iterate over this:
sorted(sheets_dict["January"].columns)
and combine it with the following, capitalizing each column:
frames = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.title().split('\n')[-1])
    frames.append(sheet)
new_df = pd.concat(frames, ignore_index=True)
print(new_df)
As long as all the sheets have exactly the same column names, pd.concat() aligns those columns by name, even when they appear in a different order, and concatenates all the DataFrames.
Then you can group the combined DataFrame by year and sort each part.
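A minimal sketch of that approach, reusing the workbook name and date columns from the question (the output file name combined.xlsx is an assumption):
import pandas as pd

sheets_dict = pd.read_excel("Monthly Campaign Data.xlsx", sheet_name=None,
                            parse_dates=["Date", "Create Date"])
combined = pd.concat(
    (sheet.assign(sheet=name) for name, sheet in sheets_dict.items()),
    ignore_index=True,  # concat aligns columns by name, whatever their order
)
combined.to_excel("combined.xlsx", index=False)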