How to adjust table header when saving dataframe to excel using Pandas? - python

The objective is to save a df in xlsx format using the code below.
import pandas as pd
from pandas import DataFrame
list_me = [['A','A','A','A','A','B','C','D','D','D','D'],
['TT','TT','UU','UU','UU','UU','UU','TT','TT','TT','TT'],
['5','2','1','1','1','40','10','2','2','2','2'],
['1','1','1','2','3','3','1','2','2','2','1']]
df = DataFrame(list_me).transpose()
df.columns = ['Name','Activity','Hour','Month']
df_tab=pd.crosstab(df.Name, columns=[df.Month, df.Activity], values=df.Hour, aggfunc='sum').fillna(0)
df_tab.reset_index(level=0, inplace=True)
df_tab.to_excel("output.xlsx")
The code works fine and outputs an xlsx file as below:
However, I notice that adding the index as the first column separates the text Month, Activity, Name into separate columns.
May I know whether there is a built-in setting within pandas that can produce the output as below?
Thanks in advance
p.s.: Please ignore the yellow line; it is just there to indicate that there should be a blank row.
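For reference, the crosstab above produces two column levels (Month, Activity) plus a named index (Name), and to_excel writes one header row per level, which is why the three labels land in separate cells. A minimal sketch reproducing that structure (Hour is cast to int here so the 'sum' aggregation is numeric):

```python
import pandas as pd

list_me = [['A','A','A','A','A','B','C','D','D','D','D'],
           ['TT','TT','UU','UU','UU','UU','UU','TT','TT','TT','TT'],
           ['5','2','1','1','1','40','10','2','2','2','2'],
           ['1','1','1','2','3','3','1','2','2','2','1']]
df = pd.DataFrame(list_me).transpose()
df.columns = ['Name', 'Activity', 'Hour', 'Month']
df['Hour'] = df['Hour'].astype(int)  # cast so 'sum' aggregates numbers

df_tab = pd.crosstab(df.Name, columns=[df.Month, df.Activity],
                     values=df.Hour, aggfunc='sum').fillna(0)

# Two column levels plus a named index: to_excel writes one header row
# per level, so Month, Activity, and Name each land in their own cell.
print(list(df_tab.columns.names))  # ['Month', 'Activity']
print(df_tab.index.name)           # Name
```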

Related

How to replace the blank cells in Excel with 0 using Python?

I'm trying to replace the blank cells in Excel with 0 using Python. I have a loop script that checks 2 Excel files with the same worksheet, column headers, and values. Now, from the picture attached,
the script will write the count to the Count column in Excel 2 if the value of Column A in Excel 2 matches the value of Column A in Excel 1. The problem I have is that for values in Column A in Excel 2 that don't have a match in Column A in Excel 1, the Count column in Excel 2 is left blank.
Below is the part of the script that checks the 2 Excel files. I tried the suggestion from this link (Pandas: replace empty cell to 0), but it doesn't work for me: result.fillna(0, inplace=True) raises NameError: name 'result' is not defined. Guidance on how to achieve my goal would be very nice. Thank you in advance.
import pandas as pd
import os
import openpyxl
daily_data = openpyxl.load_workbook('C:/Test.xlsx')
master_data = openpyxl.load_workbook('C:/Source.xlsx')
daily_sheet = daily_data['WorkSheet']
master_sheet = master_data['WorkSheet']
for i in daily_sheet.iter_rows():
    Column_A = i[0].value
    row_number = i[0].row
    for j in master_sheet.iter_rows():
        if j[0].value == Column_A:
            daily_sheet.cell(row=row_number, column=6).value = j[1].value
            #print(j[1].value)
daily_data.save('C:/Test.xlsx')
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
It seems you've made a few fundamental mistakes in your approach. First off, result is an object: specifically, it's a DataFrame that someone else created in that other post; it is not your DataFrame. You need to call the method on your own DataFrame. Python takes an object-oriented approach, meaning that objects are the key players, and .fillna() is a method that operates on your object. The usage for a toy example is as follows:
my_df = pd.read_csv(my_path_to_my_df_)
my_df.fillna(0, inplace=True)
Also, this method is for DataFrames, so you will need to convert from the object the openpyxl library creates (at least, that's what I would assume; I haven't used that library before). For your data, you would do this:
daily_data = pd.read_excel('C:/Test.xlsx')
daily_data.fillna(0, inplace=True)
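To put the answer together in a runnable form: call .fillna on the DataFrame you actually created, not on result. A minimal sketch using an in-memory frame in place of the Excel files (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for the merged sheet: rows with no match in the master file
# are left with a missing value in the Count column (made-up data).
daily_data = pd.DataFrame({
    "Item": ["apple", "banana", "cherry"],
    "Count": [5, np.nan, np.nan],
})

# Call .fillna on *your* DataFrame (not "result") to zero the blanks.
daily_data.fillna(0, inplace=True)
print(daily_data["Count"].tolist())  # [5.0, 0.0, 0.0]
```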

How to eliminate "blank" rows that show up after importing an Excel file using pd.read_excel()

I read in an Excel file from an external source:
import pandas as pd
df = pd.read_excel('https://www.sharkattackfile.net/spreadsheets/GSAF5.xls')
When I call df.tail(), I see that there are 25,841 rows in this dataframe.
Also, notably, I see that there is a value of 'xx' in the Case Number column. This is not valid data.
But, looking at the file itself, I see that there are only 6807 rows of valid data:
How do I get a dataframe that only has the valid data (i.e. rows 1-6807), noting that as cases are added to this file, the range would need to be dynamic?
Thanks for your help!
You could use the pandas DataFrame replace function, then call dropna to drop every row that contains np.nan values.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
import numpy as np
df = df.replace('', np.nan)
df = df.replace('xx', np.nan)
df = df.dropna()
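A minimal sketch of that idea on a toy frame (the column names and placeholder values are assumed; the real GSAF file's schema may differ). Because no fixed row range is used, this stays dynamic as new cases are added:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sheet: the trailing rows carry '' or 'xx'
# placeholders instead of real data (column names are made up).
df = pd.DataFrame({
    "Case Number": ["2023.01.01", "2023.02.14", "xx", ""],
    "Country": ["USA", "AUS", None, None],
})

# Turn the placeholders into NaN, then drop any row containing NaN.
cleaned = df.replace(["", "xx"], np.nan).dropna()
print(cleaned)
```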

Pandas read_csv does not separate values after comma

I am trying to load some .csv data in the Jupyter notebook but for some reason, it does not separate my data but puts everything in a single column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\leonm\Documents\Fontys\Semester 4\GriefBot_PracticeChallenge\DummyDataGriefbot.csv')
df.head()
My csv data
In this picture there is the data I am using.
And now I do not understand why my values all end up in a single column instead of being separated where the commas are.
I have also tried both sep=',' and sep=';' but they do not change anything.
This is what I am getting
I would really appreciate your help.
If that's how your data looks in a CSV reader like Excel, then each row likely looks like one big string in a text editor.
"ID,PERSON,DATE"
"1,A. Molina,1593147221"
"2,A. Moran, 16456"
"3,Action Marquez,15436"
You could of course do "text to columns" within Excel and resave your file, or if you have many of these files, you can use the pandas str.split method.
# split the header by comma to get the new column names, then split each
# row's single string by comma, expanding each piece into a new column
df[df.columns[0].split(',')] = df.iloc[:, 0].str.split(',', expand=True)
df.head()
> ID,PERSON,DATE ID PERSON DATE
> 0 1,A. Molina,1593147221 1 A. Molina 1593147221
> 1 2,A. Moran, 16456 2 A. Moran 16456
> 2 3,Action Marquez,15436 3 Action Marquez 15436
You can then use df.drop to drop that first column if you have no use for it:
df.drop(df.columns[0], axis=1, inplace=True)
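A self-contained sketch of the whole fix, using an in-memory CSV where every line is wrapped in quotes (hypothetical data standing in for the real file):

```python
import io
import pandas as pd

# Hypothetical CSV in which every physical line is wrapped in quotes,
# so pandas parses each line as one big string in a single column.
raw = '"ID,PERSON,DATE"\n"1,A. Molina,1593147221"\n"2,A. Moran,16456"\n'
df = pd.read_csv(io.StringIO(raw))

# Split the lone column into real columns, then drop the original.
df[df.columns[0].split(',')] = df.iloc[:, 0].str.split(',', expand=True)
df.drop(df.columns[0], axis=1, inplace=True)
print(df)
```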

Use multiple rows as column header for pandas

I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
    "./Data.xlsx",
    sheet_name="Customer Care",
    header=[0, 1, 2]
)
This will tell pandas to read the first three rows of the excel file as multiindex column labels.
If you want to modify the rows after you load them, then set them as the columns:
# set the first three rows as the columns
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# delete the first three rows (because they are now the columns)
df = df.iloc[3:]
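A minimal sketch of that second approach on a toy frame (the header values are made up), showing the first three rows being promoted to a three-level MultiIndex:

```python
import pandas as pd

# Toy frame standing in for a sheet read with header=None; the first
# three rows hold the (made-up) header levels.
df = pd.DataFrame([
    ["Region", "Region", "Totals"],
    ["North",  "South",  "All"],
    ["Sales",  "Sales",  "Sum"],
    [10,       20,       30],
    [40,       50,       90],
])

# Promote rows 0-2 to a three-level MultiIndex header, then drop them.
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
df = df.iloc[3:].reset_index(drop=True)
print(df[("Region", "North", "Sales")].tolist())  # [10, 40]
```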

Pandas - Drop function error (label not contained in axis) [duplicate]

This question already has answers here:
Delete a column from a Pandas DataFrame
(20 answers)
Closed 5 years ago.
I have a CSV file that is as the following:
index,Avg,Min,Max
Build1,56.19,39.123,60.1039
Build2,57.11,40.102,60.2
Build3,55.1134,35.129404123,60.20121
Based off my question here I am able to add some relevant information to this csv via this short script:
import pandas as pd
df = pd.read_csv('newdata.csv')
print(df)
df_out = (pd.concat([df.set_index('index'),
                     df.set_index('index').agg(['max', 'min', 'mean'])])
            .rename(index={'max': 'Max', 'min': 'Min', 'mean': 'Average'})
            .reset_index())
with open('newdata.csv', 'w') as f:
    df_out.to_csv(f, index=False)
This results in this CSV:
index,Avg,Min,Max
Build1,56.19,39.123,60.1039
Build2,57.11,40.102,60.2
Build3,55.1134,35.129404123,60.20121
Max,57.11,40.102,60.20121
Min,55.1134,35.129404123,60.1039
Average,56.1378,38.1181347077,60.16837
I would like to now have it so I can update this csv. For example if I ran a new build (build4 for instance) I could add that in and then redo the Max, Min, Average rows. My idea is that I therefore delete the rows with labels Max, Min, Average, add my new row, redo the stats. I believe the code I need is as simple as (just for Max but would have lines for Min and Average as well):
df = pd.read_csv('newdata.csv')
df = df.drop('Max')
However, this always results in ValueError: labels ['Max'] not contained in axis.
I created the csv files in Sublime Text; could this be part of the issue? I have read other SO posts about this and none seem to help with my issue.
I am unsure if this is allowed, but here is a download link to my csv just in case something is wrong with the file itself.
I would be okay with two possible answers:
How to fix this drop issue
How to add more builds and update the statistics (a method without drop)
You must specify the axis argument. The default is axis=0, which refers to rows; columns are axis=1.
So this should be your code:
df = df.drop('Max',axis=1)
Edit: looking at this piece of code:
df = pd.read_csv('newdata.csv')
df = df.drop('Max')
The code you used does not specify that the first column of the csv file contains the index for the dataframe. Thus pandas creates an index on the fly. This index is purely a numerical one. So your index does not contain "Max".
try the following:
df = pd.read_csv("newdata.csv",index_col=0)
df = df.drop("Max",axis=0)
This forces pandas to use the first column in the csv file to be used as index. This should mean the code works now.
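A sketch of the full update cycle described in the question (drop the old summary rows, append a new build, recompute), using an in-memory copy of the CSV since the real file isn't available; the Build4 numbers are made up:

```python
import io
import pandas as pd

# In-memory copy of the CSV described above (Build4 numbers are made up).
csv_text = """index,Avg,Min,Max
Build1,56.19,39.123,60.1039
Build2,57.11,40.102,60.2
Build3,55.1134,35.129404123,60.20121
Max,57.11,40.102,60.20121
Min,55.1134,35.129404123,60.1039
Average,56.1378,38.1181347077,60.16837
"""

# index_col=0 turns the first column into the index, so dropping the old
# summary rows by label works.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
df = df.drop(["Max", "Min", "Average"], axis=0)

# Append a new build, then recompute the summary rows.
df.loc["Build4"] = [58.0, 41.0, 61.0]
stats = (df.agg(["max", "min", "mean"])
           .rename(index={"max": "Max", "min": "Min", "mean": "Average"}))
df_out = pd.concat([df, stats])
print(df_out)
```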
Alternatively, to delete a particular column in pandas, simply do:
del df['Max']
