Let's say I have a pandas dataframe that looks like this:
df = pd.read_json('{"id":{"0":"21 Delta","1":"38 Bravo","2":"Charlie 37","3":"Alpha 56"},"name_1":{"0":"Tom","1":"Nick","2":"Chris","3":"David 56"},"name_2":{"0":"Peter 17","1":"Emma 53","2":"Jeff 11","3":"Oscar"},"name_3":{"0":"Jeffrey","1":"Olivier 12","2":null,"3":null},"name_4":{"0":"Henry 23","1":null,"2":null,"3":null}}')
df
id name_1 name_2 name_3 name_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23
1 38 Bravo Nick Emma 53 Olivier 12 None
2 Charlie 37 Chris Jeff 11 None None
3 Alpha 56 David 56 Oscar None None
What I would like to do is iterate over the columns of this df and check whether the column name starts with "name". If so, I would like to extract the number after the whitespace in each row of that column into an extra column called age_N, where the suffix N increments along with the name columns, like so:
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 None 17 None 23
1 38 Bravo Nick Emma 53 Olivier 12 None None 53 12 None
2 Charlie 37 Chris Jeff 11 None None None 11 None None
3 Alpha 56 David 56 Oscar None None 56 None None None
So far I have come up with this, but I'm struggling to get to the end result:
for column in df.columns:
    if column.startswith("name"):
        age = df[column].str.split(" ").str.get(1)
ages = (df.filter(like="name")
          .apply(lambda col: col.str.extract(r" (\d+)$", expand=False))
          .rename(columns=lambda c: c.replace("name", "age")))
That is, we:
- get the "name"-involving columns,
- for each of them, extract the number near the end with a regex,
- since the column names are still "name_*", replace "name" with "age" there,
- and lastly join with the original frame to get:
>>> df.join(ages)
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 NaN 17 NaN 23
1 38 Bravo Nick Emma 53 Olivier 12 None NaN 53 12 None
2 Charlie 37 Chris Jeff 11 None None NaN 11 None None
3 Alpha 56 David 56 Oscar None None 56 NaN None None
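Note that str.extract returns strings, not numbers; if you need numeric ages, a small extra step such as ages = ages.apply(pd.to_numeric) before the join would convert them.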
Besides Mustafa Aydın's approach, here is a fixed version of your for-loop:
for column in df.columns:
    if column.startswith("name"):
        age = f"age_{column[-1]}"
        df[age] = df[column].str.extract(r"(\d+)")
print(df)
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 NaN 17 NaN 23
1 38 Bravo Nick Emma 53 Olivier 12 None NaN 53 12 NaN
2 Charlie 37 Chris Jeff 11 None None NaN 11 NaN NaN
3 Alpha 56 David 56 Oscar None None 56 NaN NaN NaN
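One caveat with the loop above: column[-1] takes only the last character, so it would break once the columns reach name_10. A small sketch that derives the full suffix instead:

for column in df.columns:
    if column.startswith("name"):
        # "name_12" -> "age_12", keeping multi-digit suffixes intact
        df[column.replace("name", "age")] = df[column].str.extract(r"(\d+)")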
Suppose I have a DataFrame that looks something like this:
    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58
Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)
import pandas as pd
df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (i.e. total) amount spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
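For reference, this single combination can be checked directly on the reproducible frame above (a quick sanity check, not part of the desired solution):

x = df[(df.country == 'it') & (df.grade == 'c') & (df.category == 'mango')]
print(x.groupby('id')['amount'].sum().mean())  # (70 + 79) / 2 = 74.5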
The second column I want to add is the same, but for the mean annual purchase count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math

combos = [[i, j, k]
          for i in set(df["country"])
          for j in set(df["grade"])
          for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"] == c[0]) & (df["grade"] == c[1]) & (df["category"] == c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos, columns=["country", "grade", "category", "mean_totals", "mean_counts"])
df = df.merge(temp_grouping, on=["country", "grade", "category"], how="left")
Which gives the desired output:
    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52         52            1
1    5      fr     b   banana      68        167            2
2    7      fr     a   banana      73         73            1
3    4      it     c    mango      70         74.5          2
4    5      fr     b   banana      99        167            2
5    9      uk     a    apple      29         52.5          1
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59         59            1
8    2      it     c    mango      11         74.5          2
9    9      uk     a   banana      91        108            2
10   0      uk     b    mango      95         95            1
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78         78            1
14   3      uk     a    apple      76         52.5          1
15   6      it     c    apple      76         76            1
16   2      it     c    mango      10         74.5          2
17   1      it     b    mango      30         30            1
18   9      uk     a   banana      17        108            2
19   2      it     c    mango      58         74.5          2
The above code works, but it is too slow to be usable on my real data. I'm therefore looking for a faster/more efficient solution to my problem. Thanks very much.
You can create the mean_totals column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives
0 7 fr a mango 52 52.0
1 5 fr b banana 68 167.0
2 7 fr a banana 73 73.0
3 4 it c mango 70 74.5
4 5 fr b banana 99 167.0
5 9 uk a apple 29 52.5
6 3 uk a mango 83 97.5
7 0 uk b banana 59 59.0
8 2 it c mango 11 74.5
9 9 uk a banana 91 108.0
10 0 uk b mango 95 95.0
11 8 uk a mango 30 97.5
12 3 uk a mango 82 97.5
13 1 it b banana 78 78.0
14 3 uk a apple 76 52.5
15 6 it c apple 76 76.0
16 2 it c mango 10 74.5
17 1 it b mango 30 30.0
18 9 uk a banana 17 108.0
19 2 it c mango 58 74.5
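If the row-wise apply in the second step is slow on larger data, the lookup can be replaced with a merge; a minimal sketch, reusing the same mean_total_df:

# Turn the grouped Series into a frame and merge it back onto df
mean_totals = mean_total_df.rename('mean_totals').reset_index()
df = df.merge(mean_totals, on=['country', 'category', 'grade'], how='left')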
It looks like you need a double groupby. Once for the sum, once for the mean:
out = (df
       .groupby(['country', 'grade', 'category', 'id']).sum()
       .groupby(['country', 'grade', 'category']).mean()
       )
output:
amount
country grade category
fr a banana 73.0
mango 52.0
b banana 167.0
it b banana 78.0
mango 30.0
c apple 76.0
mango 74.5
uk a apple 52.5
banana 108.0
mango 97.5
b banana 59.0
mango 95.0
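To also get the mean counts and attach both as columns of the original frame, the same double groupby extends naturally; a sketch, with the column names mean_totals/mean_counts chosen to match the question:

# Per-id totals and purchase counts, then the mean of each across ids
per_id = df.groupby(['country', 'grade', 'category', 'id'])['amount'].agg(['sum', 'count'])
means = (per_id.groupby(['country', 'grade', 'category']).mean()
               .rename(columns={'sum': 'mean_totals', 'count': 'mean_counts'})
               .reset_index())
df = df.merge(means, on=['country', 'grade', 'category'], how='left')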
I hope this will run fast. First group and compute the required values, then merge with the existing df.
import pandas as pd
df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = (df.groupby(by=['country', 'grade', 'category', 'id'], as_index=False)
                     .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
                          int_counts=pd.NamedAgg(column='id', aggfunc='count'))
                     .groupby(by=['country', 'grade', 'category'], as_index=False)
                     .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
                          mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean')))
output_df = pd.merge(df, intermediate_df, on=['country', 'grade', 'category'], how='left')
print(output_df)
I've been trying to crack this for a while, but I'm stuck now.
This is my code:
l = list()
column_name = [col for col in df.columns if 'SalesPerson' in col]
filtereddf = pd.DataFrame(columns=['Item', 'SerialNo', 'Location',
                                   'SalesPerson01', 'SalesPerson02', 'SalesPerson03',
                                   'SalesPerson04', 'SalesPerson05', 'SalesPerson06',
                                   'PredictedSales01', 'PredictedSales02', 'PredictedSales03',
                                   'PredictedSales04', 'PredictedSales05', 'PredictedSales06'])
for i, r in df.iterrows():
    if len(r['Name'].split(';')) > 1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is = y[-2:]
                    filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                    filtereddf.at[i, 'Location'] = r['Location']
                    filtereddf.at[i, y] = r[y]
                    filtereddf.at[i, 'Item'] = r['Item']
                    filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                    # The below statement prints the values correctly, but I want to
                    # filter the values and use them in a dataframe
                    # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'], r[f'PredictedSales{number_is}'], r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is = y[-2:]
                filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                filtereddf.at[i, 'Location'] = r['Location']
                filtereddf.at[i, y] = r[y]
                filtereddf.at[i, 'Item'] = r['Item']
                filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                # Same print as above
                # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'], r[f'PredictedSales{number_is}'], r['Definition'])
                l.append(filtereddf)
finaldf = pd.concat(l, ignore_index=True)
It eventually throws an error
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample dataset is below (the actual CSV file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure whether my approach using dataframe.at is correct. Any pointers on how to efficiently keep only the column values that match a name in the Name column would be appreciated.
I would recommend changing from a column-focused dataframe to a row-focused dataframe. You can rewrite your dataset using melt:
# Index by the identifying columns first so melt carries them along
df = df.set_index(['Item', 'SerialNo', 'Location'])

df_person = df.loc[:, 'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:, 'PredictedSales01':'PredictedSales06']

df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
predicted_sales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
# The two melted frames share the same row order, so align positionally
df_person['PredictedSales'] = predicted_sales['PredictedSales'].values

index_cols = ['Item', 'SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we split the names and create a separate dataframe using explode:
# Split the semicolon-separated names, then explode to one name per row
df_names = df['Name'].str.split(';').explode().rename('SalesPerson').to_frame()
df_names = df_names.reset_index().set_index(['Item', 'SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
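As an aside, if you'd rather keep the original wide shape shown in your desired output (NaN wherever a salesperson is not listed in Name), a masking sketch along these lines could also work; it starts from the original, unindexed frame, and the column names are taken from the sample:

names = df['Name'].str.split(';')
for n in range(1, 7):
    sp = f'SalesPerson{n:02d}'
    ps = f'PredictedSales{n:02d}'
    # Keep the pair only when this salesperson appears in the row's Name list
    keep = [person in row_names for person, row_names in zip(df[sp], names)]
    df[sp] = df[sp].where(keep)
    df[ps] = df[ps].where(keep)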
I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, containing the mean of Average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
for example:
new_df = df.groupby(['Resion_Name','Date']).mean()
dict3 = {'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
         'Date': ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
         'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3 = pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:
new_df = df3.groupby(['Region_Name', 'Date'])
new1 = new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
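Note that transform returns one value per original row, aligned to df3. If you instead want the collapsed frame the question describes (one row per region/quarter), a plain groupby mean does it directly; the grouping key must be spelled Region_Name, and the Resion_Name typo in the question's attempt is the likely cause of the traceback:

# One row per (Region_Name, Date), with the mean of Average
new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()
print(new_df)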