I would like to be able to show the sub-total column of a multi-index pivot table in different ways: for example, show the sum for one selected row and the max for another. Is this possible?
I managed to get half of the code right, but I am stuck replicating it without overwriting the previous assignment, and I am not able to loop over it.
In my example I want the max value for Toyota and the sum for Honda shown in the newly created Total column.
import numpy as np
import pandas as pd

cars = {'Brand': ['Honda','Toyota', 'Honda','Toyota'],
'Target': ['A','B', 'A','B'],
'Speed': [20, 80, 30 , 10],
'Date' : ['13/02/2019', '18/02/2019', '18/02/2019', '13/02/2019']
}
df = pd.DataFrame(cars)
table = pd.pivot_table(df, values=['Speed'],
index=['Target', 'Brand'],
columns=['Date'],
fill_value=0, aggfunc=np.sum, dropna=True)
table
The code I created (which works only for the last line, as it overwrites the first one):
table['Total'] = table.loc(axis=0)[:, ['Toyota']].max(axis=1)
table['Total'] = table.loc(axis=0)[:, ['Honda']].sum(axis=1)
Current output: the Total column only contains the Honda sum (the Toyota row ends up NaN).
Desired output: I would like to also see the max value for Toyota in the Total column, which would be 80.
Use slicers for setting new values; here : means all values of the first level:
idx = pd.IndexSlice
table.loc[idx[:, 'Toyota'], 'Total'] = table.max(axis=1)
table.loc[idx[:, 'Honda'], 'Total'] = table.sum(axis=1)
print (table)
Speed Total
Date 13/02/2019 18/02/2019
Target Brand
A Honda 20 30 50.0
B Toyota 10 80 80.0
You can also slice on both sides, selecting as well as setting:
idx = pd.IndexSlice
table.loc[idx[:, 'Toyota'], 'Total'] = table.loc[idx[:, 'Toyota'], :].max(axis=1)
table.loc[idx[:, 'Honda'], 'Total'] = table.loc[idx[:, 'Honda'], :].sum(axis=1)
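If you need to extend this to more brands without repeating a line per brand, a minimal sketch (assuming the same table as above and a hypothetical agg_map dictionary mapping each brand to its aggregation) loops over the slicer:
import pandas as pd
# hypothetical mapping: which aggregation to use for each brand
agg_map = {'Toyota': 'max', 'Honda': 'sum'}
idx = pd.IndexSlice
for brand, func in agg_map.items():
    rows = idx[:, brand]
    # aggregate only this brand's rows across the Speed columns
    # and write the result back into the same rows of Total
    table.loc[rows, 'Total'] = table.loc[rows, 'Speed'].agg(func, axis=1)
print(table)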
The first dataframe has a column containing categories, which are the same as the headers of the second. df1 has multiple rows per name (one row per category per name), and all rows for one name occur in sequence. df2 will have one row per name.
The layouts of df1, df2, and the desired output are given in the sample code below.
How can I map data from df1 to df2?
More specifically, how can I map multiple rows from df1 to a single row and the respective columns of df2, more efficiently than looping twice to check each category under each name?
Any help is appreciated,
Have a great day
Code:
import pandas as pds
df1 = pds.DataFrame({'Client': ['Rick', 'Rick', 'John'], 'Category': ['Service1', 'Service2', 'Service1'], 'Amount': [250, 6, 79]})
df2 = pds.DataFrame(columns = ['Client', 'Due_Date', 'Service1', 'Service2'])
output = pds.DataFrame({'Client': ['Rick', 'John'], 'Due_Date': [None,None] , 'Service1': [250, 79], 'Service2': [6, 0]})
This is an alternative approach using .pivot() and .assign()
df1_pivot = (df1.pivot(index='Client', columns='Category', values='Amount')
.reset_index().assign(Due_Date=None))
df_out = df2.assign(**df1_pivot)
print(df_out)
Client Due_Date Service1 Service2
0 John None 79.0 NaN
1 Rick None 250.0 6.0
You're looking for pandas.DataFrame.pivot:
out = (df1.pivot(index="Client", columns="Category")
.reset_index()
.set_axis(["Client", "Service1", "Service2"], axis=1)
.assign(Due_Date= None)
)
NB: I suggest you use import pandas as pd, as per the import convention.
Output :
print(out)
Client Service1 Service2 Due_Date
0 John 79.0 NaN None
1 Rick 250.0 6.0 None
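As a side note, DataFrame.pivot raises an error if df1 ever contains duplicate Client/Category pairs. A hedged variant (a sketch, not part of the answers above) uses pivot_table with an explicit aggfunc to cover that case:
import pandas as pd
df1 = pd.DataFrame({'Client': ['Rick', 'Rick', 'John'],
                    'Category': ['Service1', 'Service2', 'Service1'],
                    'Amount': [250, 6, 79]})
# pivot_table tolerates duplicate index/column pairs by aggregating them
out = (df1.pivot_table(index='Client', columns='Category',
                       values='Amount', aggfunc='sum', fill_value=0)
          .reset_index()
          .assign(Due_Date=None))
print(out)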
I have a list of names, and I want to retrieve the corresponding information from several dataframes to form a new dataframe.
I converted the list into a one-column dataframe, intending to look up its corresponding values in the other dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David","Mike","Peter", "Lucy"],
'Hobby': ['Music','Sports','Cooking','Reading'],
'Member': ['Yes','Yes','Yes','No']}
data_s = {'Name': ["David","Lancy", "Mike","Lucy"],
'Speed': [56, 42, 35, 66],
'Location': ['East','East','West','West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter only the columns needed for merging and the columns to append, like:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use this solution with functools.reduce, filtering the columns the same way:
dfList = [df,
df_hobby[['Name','Hobby']],
df_speed[['Name','Location']]]
from functools import reduce
df = reduce(lambda df1,df2: pd.merge(df1,df2,on='Name'), dfList)
print (df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
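If each lookup dataframe contributes a single column keyed by Name, another sketch (assuming the df, df_hobby and df_speed defined in the question, and that Name is unique within each lookup frame) uses Series.map instead of merging:
# build a Name-indexed Series per attribute and map it onto df
df['Hobby'] = df['Name'].map(df_hobby.set_index('Name')['Hobby'])
df['Location'] = df['Name'].map(df_speed.set_index('Name')['Location'])
print(df)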
I need to remove rows with the same name when any of the rows with that name has missing data.
I would like to remove BOTH rows for Belize when any row for Belize has missing info. Here Belize is missing data for 2011, so the 2012 row for Belize needs to be removed too.
What's an efficient way to code this so it applies to the whole dataset in Python?
Try this:
df.dropna(subset = ["Factor A"], inplace=True)
As mentioned in the comments, you can use transform to create a series and use it as a boolean mask to drop the desired rows.
# sample data, please always provide in this form so we can paste in our tests
# you could get it with `df.head().to_dict('list')`
df = pd.DataFrame({
'Country': ['Afghanistan', 'Afghanistan', 'Belize', 'Belize'],
'Factor A': [153, 141, None, 50],
'Factor B': [3.575, 3.794, None, 5.956],
'Year': [2011, 2012, 2011, 2012]
})
droprows = (
df.groupby('Country') # group the rows by Country
.transform(lambda x: x.isna().any())
# .transform applies a function and returns the same scalar value
# for all rows in the group
# x.isna() returns True if a cell contains NaN, element-wise
# .any() aggregates and returns a scalar True/False per group
# the line returns a dataframe shaped as df.shape
# with one less column: 'Country'
.any(axis=1) # collapse that result into a single column
)
print(droprows)
# 0 False
# 1 False
# 2 True
# 3 True
# dtype: bool
df = df[~droprows]
print(df)
Output
Country Factor A Factor B Year
0 Afghanistan 153.0 3.575 2011
1 Afghanistan 141.0 3.794 2012
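An equivalent, more compact sketch uses groupby().filter to keep only the countries whose rows contain no missing values at all:
# keep a country's rows only if none of its cells are NaN
df_clean = df.groupby('Country').filter(lambda g: g.notna().all().all())
print(df_clean)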
When I'm working in SQL, I find almost all the things I do with a column are related to the following four operations:
Add a column.
Remove a column.
Change a column type.
Rename a column.
What is the preferred way to do these four operations in pandas? For example, let's suppose I am starting with the following DataFrame:
import pandas as pd
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
How would I:
Change the df.id column from a string (or object as it says) to an int64 ?
Rename the column product to product_type ?
Add a new column called 'cost' with values [2.99, 3.99] ?
Remove the column called brand ?
Simple and complete:
import numpy as np
import pandas as pd
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
# Change the df.id column from a string (or object as it says) to an int64 ?
df['id'] = df['id'].astype(np.int64)
# Rename the column product to product_type ?
df = df.rename(columns={'product':'product_type'})
# Add a new column called 'cost' with values [2.99, 3.99] ?
df['cost'] = pd.Series([2.99, 3.99])
# Remove the column called brand ?
df = df.drop(columns='brand')
These functions can also be chained together, although I would not recommend it, as it is not as readable as the step-by-step version above:
# do all the steps above with a single line
df = (df.astype({'id': np.int64})
        .rename(columns={'product': 'product_type'})
        .assign(cost=[2.99, 3.99])
        .drop(columns='brand'))
There is also another way, in which you use inplace=True. This does the assignment in place. I don't recommend it, as it is not as explicit as the first method:
# Using inplace=True where possible
df['id'] = df['id'].astype(np.int64)  # Series.astype has no inplace option
df.rename(columns={'product':'product_type'}, inplace=True)
# No change from previous
df['cost'] = pd.Series([2.99, 3.99])
# pop brand out
df.pop('brand')
print(df)
You can perform these steps like this (starting with your original data frame):
# add a column
df = pd.concat([df, pd.Series([2.99, 3.99], name='cost')], axis=1)
# change column name
df = df.rename(columns={'product': 'product_type'})
# remove brand
df = df.drop(columns='brand')
# change data type
df['id'] = df['id'].astype('int')
print(df)
product_type id cost
0 drink 19 2.99
1 cup 11 3.99
You could do:
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
df = (df.assign(cost=[2.99, 3.99],
id=lambda d: d.id.astype(int))
.drop(columns=['brand'])
.rename({"product": 'product_type'}, axis=1))
This should work:
# change datatype
>>> df['id'] = df['id'].astype('int64')
>>> df.dtypes
brand object
id int64
product object
# rename column
df.rename(columns={'product': 'product_type'}, inplace=True)
>>> df
brand id product_type
0 spindrift 19 drink
1 None 11 cup
# create new column
df['Cost'] = pd.Series([2.99, 3.99])
>>> df
brand id product_type Cost
0 spindrift 19 drink 2.99
1 None 11 cup 3.99
# drop column
>>> df.drop(['brand'], axis=1, inplace=True)
>>> df
id product_type Cost
0 19 drink 2.99
1 11 cup 3.99
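For the type change specifically, a hedged alternative is pd.to_numeric, which also copes with ids that fail to parse (not something the question requires, just a sketch):
# coerce unparseable ids to NaN instead of raising, then use the nullable Int64 dtype
df['id'] = pd.to_numeric(df['id'], errors='coerce').astype('Int64')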
I already have an idea as to how I'm going to do this - I'm just curious about whether my method is the most efficient.
So for instance, let's say that for whatever reason, I have the following table:
The first 4 columns in the table are all repeated; they just contain info about the employee. These rows repeat because the employee handles multiple clients.
In some cases, I am missing info on the Age and Employment Duration of an employee. A colleague gave me this information in an Excel sheet.
So now, I have info on Brian's and Dennis' age and employment duration, and I need to fill all rows with their employee IDs based on the information. My plan for doing that is this:
data = {"14": # Brian's Employee ID
{"Age":31,
:"Employment Duration":3},
"21": # Dennis' Employee ID
{"Age":45,
"Employment Duratiaon":12}
}
After making the above dictionary of dictionaries with the necessary values, my plan is to iterate over each row in the above dataframe, and fill in the 'Age' and 'Employment Duration' columns based on the value in 'Employee ID':
for index, row in df.iterrows():
    if row["Employee ID"] in data:
        df.loc[index, "Age"] = data[row["Employee ID"]]["Age"]
        df.loc[index, "Employment Duration"] = data[row["Employee ID"]]["Employment Duration"]
That's my plan for populating the missing values!
I'm curious about whether there's a simpler way that's just not presenting itself to me, because this was the first thing that sprang to mind!
Don't iterate over rows in pandas when you can avoid it. Instead, make the most of the pandas library with vectorized operations like these:
Assume we have a dataframe:
data = pd.DataFrame({
'name' : ['john', 'john', 'mary', 'mary'],
'age' : ['', '', 25, 25]
})
Which looks like:
name age
0 john
1 john
2 mary 25
3 mary 25
We can apply a lambda function like so:
data['age'] = data.apply(lambda x: 27 if x['name'] == 'john' else x.age, axis=1)
Or we can use pandas .loc:
data.loc[data['name'] == 'john', 'age'] = 27
Test them out and compare how long each takes to execute vs. iterating over rows.
Ensure missing values are represented as null values (np.NaN). The second set of information should be stored in another DataFrame with the same column labels.
Then, by setting the index to 'Employee ID', update will align on the indices and fill in the missing values.
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
'Age': [14,14, np.NaN, np.NaN],
'Employment Duration': [3,3, np.NaN, np.NaN],
'Clients Handled': ['A', 'B', 'C', 'G']})
data = {"14": {"Age": 31, "Employment Duration": 3},
"21": {"Age": 45, "Employment Duration": 12}}
df2 = pd.DataFrame.from_dict(data, orient='index')
Code
#df = df.replace('', np.NaN)  # if blanks are not stored as null in your dataset
df = df.set_index('Employee ID')
df.update(df2, overwrite=False)
print(df)
Name Age Employment Duration Clients Handled
Employee ID
11 Alan 14.0 3.0 A
11 Alan 14.0 3.0 B
14 Brian 31.0 3.0 C
21 Dennis 45.0 12.0 G
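Another sketch, assuming the data dictionary from the question and the sample df above before its index is set: map Employee ID onto each column and fill only the missing cells with fillna, which avoids changing the index:
import pandas as pd
filler = pd.DataFrame.from_dict(data, orient='index')
for col in ['Age', 'Employment Duration']:
    # fill NaN cells from the lookup table, leave existing values untouched
    df[col] = df[col].fillna(df['Employee ID'].map(filler[col]))
print(df)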