Pandas: Group people into households to generate descriptives - python

My problem can be simplified as having two dataframes;
Dataframe 1 contains people and the household that they live in:
Person ID | Household ID
1 1
2 2
3 2
4 3
5 1
Dataframe 2 contains individual characteristics of people:
Person ID | Age | Workstatus | Education
1 20 Working High
2 29 Working Medium
3 31 Unemployed Low
4 45 Unemployed Medium
5 30 Working Medium
The goal is to group people belonging to the same Household ID together in order to generate descriptives about the family, e.g. "average age of persons in household", "average education level", etc.
I tried:
df1.groupby('Household ID')
but I'm not sure where to go from there, how to do it the 'pandas' way. The 'real' dataset is very large so working with lists takes too long.
The ideal output would be:
Household ID | Avg Age of persons | Education
1 25 High/med
2 25.7 High/High
3 28 Low/Low

We can use .map to get the household IDs, then groupby with named aggregations:
df3 = (
    df2.assign(houseID=df2["Person ID"].map(df1.set_index("Person ID")["Household ID"]))
    .groupby("houseID")
    .agg(avgAgeOfPerson=("Age", "mean"), Education=("Education", "/".join))
)
print(df3)
avgAgeOfPerson Education
houseID
1 25 High/Medium
2 30 Medium/Low
3 45 Medium

You can merge the two datasets and then group by household ID:
import pandas as pd

df1 = pd.DataFrame([[1,1],[2,2],[3,2],[4,3],[5,1]],columns = ['Person ID', 'Household ID'])
df2 = pd.DataFrame([[1,20,'Working', 'High'],[2,29,'Working','Medium'],[3,31,'Unemployed','Low'],[4,45,'Unemployed','Medium'],[5,30,'Working','Medium']],columns = ['Person ID','Age','Workstatus','Education'])
# a left merge keeps every person from df1, even without a match in df2
merged = pd.merge(df1,df2, on = 'Person ID', how = 'left')
merged.groupby('Household ID').agg({'Age':'mean', 'Education':list})
Result:
Age Education
Household ID
1 25 [High, Medium]
2 30 [Medium, Low]
3 45 [Medium]
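The question also asks for an "average education level", which neither answer computes. A minimal sketch, assuming Education is ordinal and can be scored Low=1, Medium=2, High=3 (the edu_rank mapping and the column names avg_age, avg_edu_score and share_working are my own, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"Person ID": [1, 2, 3, 4, 5],
                    "Household ID": [1, 2, 2, 3, 1]})
df2 = pd.DataFrame({"Person ID": [1, 2, 3, 4, 5],
                    "Age": [20, 29, 31, 45, 30],
                    "Workstatus": ["Working", "Working", "Unemployed",
                                   "Unemployed", "Working"],
                    "Education": ["High", "Medium", "Low", "Medium", "Medium"]})

# Hypothetical ordinal scoring; adjust to whatever scale fits the real data.
edu_rank = {"Low": 1, "Medium": 2, "High": 3}

merged = df1.merge(df2, on="Person ID", how="left")
merged["edu_score"] = merged["Education"].map(edu_rank)
merged["working"] = merged["Workstatus"].eq("Working")

# One row per household, one named aggregation per descriptive.
summary = merged.groupby("Household ID").agg(
    avg_age=("Age", "mean"),
    avg_edu_score=("edu_score", "mean"),
    share_working=("working", "mean"),
)
print(summary)
```

Averaging a boolean column gives the share of True values, so share_working is the fraction of household members who are working.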

Related

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client | Salesperson | Amount | Salesperson 2 | Amount2
1 | John | 1000 | Bob | 400
2 | John | 1000 | Richard | 200
3 | Bob | 0 | John | 300
4 | Richard | 500 | Tom | 500
And I just need to create some sort of "sumif" statement (the one from excel) that will add the amount each salesperson is due. I don't know how to iterate over each row, but I want to have it so that it adds the values in "Amount" and "Amount2" for each one of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person | Total Amount
John | 2300
Bob | 400
Richard | 700
Tom | 500
There are multiple ways of solving this. One option is to use pd.concat to stack the salesperson/amount column pairs on top of each other and then use groupby:
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'}), df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3':'Salesperson','Amount3':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
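If the number of salesperson/amount pairs keeps growing, the concat call can be built in a loop instead of by hand. A rough sketch, assuming you list the column pairs yourself (the pairs list and the totals name are mine):

```python
import pandas as pd

d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

# Hypothetical list of (salesperson column, amount column) pairs.
pairs = [('Salesperson', 'Amount'),
         ('Salesperson 2', 'Amount2'),
         ('Salesperson 3', 'Amount3')]

# Give every pair the same two column names so the pieces stack cleanly.
parts = [
    df[[person, amount]].rename(columns={person: 'Salesperson', amount: 'Amount'})
    for person, amount in pairs
]
totals = (pd.concat(parts, ignore_index=True)
            .groupby('Salesperson', as_index=False)['Amount']
            .sum())
print(totals)
```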
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson','Amount'], i='Client',j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you required output
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
    .pivot_longer(
        index="Client",
        names_to=".value",
        names_pattern=r"([a-zA-Z]+).*",
    )
    .groupby("Salesperson", as_index=False)
    .Amount
    .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only those parts of the column names that match it as headers. The columns follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern. .value is paired with the regex group in the brackets; the parts outside the group do not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
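To see what the capture group in names_pattern picks out, you can run the same regex over the column names with Python's re module. This only illustrates the regex; it is not how pyjanitor is implemented internally (the headers dict is my own name):

```python
import re

# Same pattern as in names_pattern above: letters are captured,
# whatever follows (a space and/or digits) is matched but discarded.
pattern = re.compile(r"([a-zA-Z]+).*")

columns = ["Salesperson", "Amount", "Salesperson 2", "Amount2"]
headers = {col: pattern.match(col).group(1) for col in columns}

for col, header in headers.items():
    print(f"{col!r} -> {header!r}")
```

Both 'Salesperson' and 'Salesperson 2' reduce to the same header, which is why their values end up stacked in a single column.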

Pandas: How to match / filter same key / id values (duplicates) from 2 different dataframes and replace values?

I have 2 dataframes of different sizes. The first dataframe(df1) has 4 columns, but two of those columns have the same name as the columns in the second dataframe(df2), which is only comprised of 2 columns. The columns in common are ['ID'] and ['Department'].
I want to check if any ID from df2 are in df1. If so, I want to replace df1['Department'] value with df2['Department'] value.
For instance, df1 looks something like this:
ID Department Yrs Experience Education
1234 Science 1 Bachelors
2356 Art 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
And df2 looks something like this:
ID Department
1098 P.E.
1234 Technology
2356 History
I want to check if the ID from df2 is in df1 and, if so, update Department. The output should look something like this:
ID Department Yrs Experience Education
1234 **Technology** 1 Bachelors
2356 **History** 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
The expected updates to df1 are in bold
Is there an efficient way to do this?
Thank you for taking the time to read this and help.
You can use ID of df1 to map with the Pandas series formed by setting ID on df2 as index and taking the column of Department from df2 (this acts as a mapping table).
Then, in case of no match of ID from df2, we fill-in the original values of Department from df1 (to retain original values in case of no match):
df1['Department'] = (df1['ID'].map(df2.set_index('ID')['Department'])
.fillna(df1['Department'])
)
Result:
print(df1)
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
Try:
df1["Department"].update(
df1[["ID"]].merge(df2, on="ID", how="left")["Department"]
)
print(df1)
Prints:
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
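Another option uses a boolean mask with isin. Note that the assignment aligns on the index, not on ID, so it only works when matching rows sit at the same index labels in both frames, as they do here:
df_1 = pd.DataFrame(data={'ID':[1234, 2356, 2456, 4657], 'Department':['Science', 'Art', 'Math', 'Science']})
df_2 = pd.DataFrame(data={'ID':[1234, 2356], 'Department':['Technology', 'History']})
df_1.loc[df_1['ID'].isin(df_2['ID']), 'Department'] = df_2['Department']
Output: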
ID Department
0 1234 Technology
1 2356 History
2 2456 Math
3 4657 Science
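With the df2 from the question (which starts with ID 1098, not present in df1), the two frames no longer line up by index, so it is worth comparing the index-aligned assignment with the ID-keyed map. A small sketch using the question's data (by_index, by_id and lookup are names I made up):

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1234, 2356, 2456, 4657],
    "Department": ["Science", "Art", "Math", "Science"],
    "Yrs Experience": [1, 3, 2, 4],
    "Education": ["Bachelors", "Bachelors", "Masters", "Masters"],
})
df2 = pd.DataFrame({
    "ID": [1098, 1234, 2356],
    "Department": ["P.E.", "Technology", "History"],
})

# Index-aligned assignment: values are matched on row labels, not on ID.
by_index = df1.copy()
by_index.loc[by_index["ID"].isin(df2["ID"]), "Department"] = df2["Department"]

# ID-keyed lookup: map each ID to its new department, keep the old one if absent.
by_id = df1.copy()
lookup = df2.set_index("ID")["Department"]
by_id["Department"] = by_id["ID"].map(lookup).fillna(by_id["Department"])

print(by_index[["ID", "Department"]])
print(by_id[["ID", "Department"]])
```

Only the map version gives the output the question asks for; the index-aligned one ends up assigning departments that belong to other IDs.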

Adding columns in one dataframe from calculations based on other dataframe using pandas library

I have a dataframe df1 like:
cycleName quarter product qty price sell/buy
0 2020 q3 wood 10 100 sell
1 2020 q3 leather 5 200 buy
2 2020 q3 wood 2 200 buy
3 2020 q4 wood 12 40 sell
4 2020 q4 leather 12 40 sell
5 2021 q1 wood 12 80 sell
6 2021 q2 leather 12 90 sell
And another dataframe df2 as below. It has unique products of df1:
product currentValue
0 wood 20
1 leather 50
I want to create a new column in df2, called income (e.g. income2020), based on calculations on df1 data. For example, if the product is wood, income2020 is built from the rows where cycleName is 2020: if sell/buy is sell, add quantity * price; otherwise subtract quantity * price.
product currentValue income2020
0 wood 20 10 * 100 - 2 * 200 + 12 * 40 (=1080)
1 leather 50 -5 * 200 + 12 * 40 (= -520)
I have a problem statement in python, which I am trying to do using pandas dataframes, which I am very new to.
I am not able to understand how to create that column in df2 based on different conditions on df1.
You can map sell to 1 and buy to -1 using pd.Series.map, then multiply the columns qty, price and sell/buy using df.prod. To get only the 2020 cycleName values use df.query, then group by product and take the sum using GroupBy.sum:
# df here is df1 from the question
df_2020 = df.query('cycleName == 2020').copy() # or `df[df['cycleName'] == 2020].copy()`
df_2020['sell/buy'] = df_2020['sell/buy'].map({'sell':1, 'buy':-1})
df_2020[['qty', 'price', 'sell/buy']].prod(axis=1).groupby(df_2020['product']).sum()
product
leather -520
wood 1080
dtype: int64
Note:
Use .copy(), otherwise you would get a SettingWithCopyWarning.
To maintain the original order, use sort=False in groupby:
(
    df_2020[['qty', 'price', 'sell/buy']]
    .prod(axis=1)
    .groupby(df_2020['product'], sort=False)
    .sum()
)
product
wood 1080
leather -520
dtype: int64
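The answer above stops at the per-product sums; getting them into df2 as the income2020 column takes one more step. A minimal sketch using the data from the question (the names df_2020, sign and income_2020 are mine):

```python
import pandas as pd

df1 = pd.DataFrame({
    "cycleName": [2020, 2020, 2020, 2020, 2020, 2021, 2021],
    "quarter": ["q3", "q3", "q3", "q4", "q4", "q1", "q2"],
    "product": ["wood", "leather", "wood", "wood", "leather", "wood", "leather"],
    "qty": [10, 5, 2, 12, 12, 12, 12],
    "price": [100, 200, 200, 40, 40, 80, 90],
    "sell/buy": ["sell", "buy", "buy", "sell", "sell", "sell", "sell"],
})
df2 = pd.DataFrame({"product": ["wood", "leather"], "currentValue": [20, 50]})

# Keep only the 2020 rows, then turn sell/buy into +1/-1.
df_2020 = df1[df1["cycleName"] == 2020]
sign = df_2020["sell/buy"].map({"sell": 1, "buy": -1})

# Signed turnover per row, summed per product.
income_2020 = (df_2020["qty"] * df_2020["price"] * sign).groupby(df_2020["product"]).sum()

# Look up each product's total; products without 2020 rows would get NaN.
df2["income2020"] = df2["product"].map(income_2020)
print(df2)
```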

Python - Creating a data frame,transpose and merge it to get a table

I am learning Python and I have a question related to creating a data frame for every 5 rows, transpose and merge the data frames.
I have a .txt file with the following input. It has thousands of rows and I need to go through each line until the end of the file.
Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle
I need to get this as my output:
Name,Age,Sex,Company,Vehicle
Kamath,23,Male,ACC,Car
Ram,32,Male,CCA,Bike
Reena,26,Female,BARC,Cycle
Use read_csv to build the DataFrame, then pivot, using cumcount as a counter for the new index:
import pandas as pd
temp=u"""Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""
# after testing, replace StringIO(temp) with 'filename.txt'
from io import StringIO  # pd.compat.StringIO is no longer available in newer pandas
df = pd.read_csv(StringIO(temp), names=['a','b'])
print (df)
a b
0 Name Kamath
1 Age 23
2 Sex Male
3 Company ACC
4 Vehicle Car
5 Name Ram
6 Age 32
7 Sex Male
8 Company CCA
9 Vehicle Bike
10 Name Reena
11 Age 26
12 Sex Female
13 Company BARC
14 Vehicle Cycle
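# older pandas accepted arrays in pd.pivot(index=..., columns=..., values=...);
# current versions need a DataFrame, so build the same table with set_index + unstack
df = df.set_index([df.groupby('a').cumcount(), 'a'])['b'].unstack()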
print (df)
a Age Company Name Sex Vehicle
0 23 ACC Kamath Male Car
1 32 CCA Ram Male Bike
2 26 BARC Reena Female Cycle
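If you would rather not rely on every record having exactly the same keys, the file can also be read line by line, starting a new record whenever a Name line appears. This is only a sketch and assumes each record begins with Name (raw, records and table are my own names):

```python
import csv
from io import StringIO

import pandas as pd

raw = """Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""

records = []
# For a real file, use open('filename.txt', newline='') instead of StringIO(raw).
for key, value in csv.reader(StringIO(raw)):
    if key == "Name":          # assumption: "Name" marks the start of a record
        records.append({})
    records[-1][key] = value

table = pd.DataFrame(records)
print(table.to_csv(index=False))
```

The column order follows the order in which keys first appear, so it matches the requested header without extra sorting.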

Is there an "ungroup by" operation opposite to .groupby in pandas?

Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string into pieces and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
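For reference, splitting the joined names back into rows can be sketched with str.split and DataFrame.explode. This is not necessarily what the linked answer does, just one way to get the same result (agg and ungrouped are my own names):

```python
import pandas as pd

df = pd.DataFrame({"name": ["john", "jason", "jane", "jack", "james"],
                   "age": [1, 36, 32, 26, 30],
                   "family": [1, 1, 1, 2, 2]})

# The aggregated table from the question: one row per family.
agg = df.groupby("family").agg({"name": "-".join, "age": "mean"})

# Split each joined string into a list, then give every list item its own row.
ungrouped = (
    agg.assign(name=agg["name"].str.split("-"))
       .explode("name")
       .reset_index()[["name", "age", "family"]]
)
print(ungrouped)
```

Because 'family' is the index of the aggregated table, every exploded row keeps its family label, and reset_index() turns it back into a column.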
It turns out that df.groupby() returns an object with the original data stored in its obj attribute. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)
print('create dataframe\n')
df = pandas.DataFrame({'name':['john', 'jason', 'jane', 'jack', 'james'], 'age':[1,36,32,26,30], 'family':[1,1,1,2,2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby. One way is to call .filter(lambda x: True) on the grouped object, e.g. df.groupby('family').filter(lambda x: True), which gets back the original DataFrame.
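If the real goal is the table from the question (one row per person, carrying the group's average age), you may not need to aggregate and ungroup at all. A minimal sketch with GroupBy.transform, which returns a result aligned to the original rows (out is my own name):

```python
import pandas as pd

df = pd.DataFrame({"name": ["john", "jason", "jane", "jack", "james"],
                   "age": [1, 36, 32, 26, 30],
                   "family": [1, 1, 1, 2, 2]})

# transform("mean") computes the mean per family and broadcasts it
# back to every row of that family, keeping the original index.
out = df.assign(age=df.groupby("family")["age"].transform("mean"))
print(out)
```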
