converting duplicated data from row to columns - python

I have data like:
name val trc
jin 23 apb
tim 52 nmq
tim 61 apb
tim 92 rrc
ron 13 apq
stark 34 rrc
stark 34 apq
ron 4 apq
sia 6 wer
I am looking for output like:
name val_1 trc1 val_2 trc2 val_3 trc3
jin 23 apb
tim 92 rrc 61 apb 52 nmq
ron 13 apq 4 apq
stark 34 rrc 34 apq
sia 6 wer
I want to move the duplicated names from rows into columns, with the highest val in val_1, the next highest in val_2, and so on. Each trc value should stay paired with its val (trc1 corresponds to val_1). Please let me know how to achieve this.
I tried this approach:
d = {k: v.reset_index(drop=True) for k, v in df.groupby('name')}
pd.concat(d, axis=1).reset_index()
index jin ron sia stark tim \
name val trc name val trc name val trc name val trc name
0 0 jin 23.0 apb ron 13.0 apq sia 6.0 wer stark 34.0 rrc tim
1 1 NaN NaN NaN ron 4.0 apq NaN NaN NaN stark 34.0 apq tim
2 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN tim

Use:
df1 = df.sort_values(['name','val'], ascending=False)
df1 = df1.set_index('name').stack().groupby(level=0).apply(list).apply(pd.Series)
df1 = df1.reset_index().fillna("")
print(df1)
name 0 1 2 3 4 5
0 jin 23 apb
1 ron 13 apq 4 apq
2 sia 6 wer
3 stark 34 rrc 34 apq
4 tim 92 rrc 61 apb 52 nmq

Convert your object into a dictionary with the names as keys and your vals and trcs as connected values in a tuple or list.
You want to end up with something like this:
yourDict[name] = [ [val_1, trc1] , [val_2, trc2] ]
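The dictionary idea above could be sketched as follows. This is only one possible reading of the suggestion: the names `ordered`, `pairs_by_name`, `rows` and `wide` are made up for illustration, and the frame is rebuilt from the sample data in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["jin", "tim", "tim", "tim", "ron", "stark", "stark", "ron", "sia"],
    "val": [23, 52, 61, 92, 13, 34, 34, 4, 6],
    "trc": ["apb", "nmq", "apb", "rrc", "apq", "rrc", "apq", "apq", "wer"],
})

# Highest val first, so the first pair of every name becomes val_1/trc1.
ordered = df.sort_values("val", ascending=False, kind="stable")

# name -> [[val_1, trc1], [val_2, trc2], ...]
pairs_by_name = {
    name: group[["val", "trc"]].values.tolist()
    for name, group in ordered.groupby("name", sort=False)
}

# Optional: flatten each list of pairs into one row; shorter rows are padded
# with missing values.
rows = {name: [x for pair in pairs for x in pair]
        for name, pairs in pairs_by_name.items()}
wide = pd.DataFrame.from_dict(rows, orient="index")
```

The resulting columns of `wide` are numbered 0, 1, 2, ... and would still need renaming to val_1, trc1, and so on.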

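Here is an option using pivot. It numbers the rows in their original order within each name, so sort df by val in descending order first if the highest value must come first: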
df['index'] = df.groupby('name').cumcount()
df_vals = df.pivot(index='name', columns='index', values='val').rename(columns=lambda x: 'val_'+str(x))
df_trcs = df.pivot(index='name', columns='index', values='trc').rename(columns=lambda x: 'trc_'+str(x))
df_vals.join(df_trcs).fillna('').reset_index()
index name val_0 val_1 val_2 trc_0 trc_1 trc_2
0 jin 23.0 apb
1 ron 13.0 4 apq apq
2 sia 6.0 wer
3 stark 34.0 34 rrc apq
4 tim 52.0 61 92 nmq apb rrc
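The pivot output above keeps the original row order and groups all val columns before the trc columns. A minimal sketch of a variant that sorts first and interleaves the columns as in the requested layout is shown below. The helper names `ordered`, `pos`, `cols` and `wide` are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["jin", "tim", "tim", "tim", "ron", "stark", "stark", "ron", "sia"],
    "val": [23, 52, 61, 92, 13, 34, 34, 4, 6],
    "trc": ["apb", "nmq", "apb", "rrc", "apq", "rrc", "apq", "apq", "wer"],
})

# Sort so that the highest val comes first within each name.
ordered = df.sort_values(["name", "val"], ascending=[True, False])

# 1-based position of each row within its name.
ordered["pos"] = ordered.groupby("name").cumcount() + 1

# Pivot both value columns at once; the columns become (field, pos) pairs.
wide = ordered.pivot(index="name", columns="pos", values=["val", "trc"])

# Reorder to val_1, trc1, val_2, trc2, ... and flatten the column names.
n = int(ordered["pos"].max())
cols = [(field, i) for i in range(1, n + 1) for field in ("val", "trc")]
wide = wide[cols]
wide.columns = [f"val_{i}" if field == "val" else f"trc{i}" for field, i in cols]
```

Names with fewer rows get missing values in the trailing columns; `wide.fillna("")` would blank them out for display, as in the answer above.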

Related

Check if column name of a pandas df starts with "name" and split that column based on existing white space

Let's say I have a pandas dataframe that looks like this:
df = pd.read_json('{"id":{"0":"21 Delta","1":"38 Bravo","2":"Charlie 37","3":"Alpha 56"},"name_1":{"0":"Tom","1":"Nick","2":"Chris","3":"David 56"},"name_2":{"0":"Peter 17","1":"Emma 53","2":"Jeff 11","3":"Oscar"},"name_3":{"0":"Jeffrey","1":"Olivier 12","2":null,"3":null},"name_4":{"0":"Henry 23","1":null,"2":null,"3":null}}')
df
id name_1 name_2 name_3 name_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23
1 38 Bravo Nick Emma 53 Olivier 12 None
2 Charlie 37 Chris Jeff 11 None None
3 Alpha 56 David 56 Oscar None None
What I would like to do is to iterate over the columns in this df and check if the column name starts with name. If so, I would like to add the number after the white space in each row of that particular column in an extra column called age_ which increments by one like so:
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 None 17 None 23
1 38 Bravo Nick Emma 53 Olivier 12 None None 53 12 None
2 Charlie 37 Chris Jeff 11 None None None 11 None None
3 Alpha 56 David 56 Oscar None None 56 None None None
So far I came up with this, but I struggle how to get to the end result:
for column in df.columns:
    if column.startswith("name"):
        age = df[column].str.split(" ").str.get(1)
ages = (df.filter(like="name")
          .apply(lambda col: col.str.extract(r" (\d+)$", expand=False))
          .rename(columns=lambda c: c.replace("name", "age")))
get the "name" involving columns
for each of them, extract the numbers near end with a regex
column names are still "name_*", so replace "name" with "age" there
and lastly join with the original frame to get
>>> df.join(ages)
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 NaN 17 NaN 23
1 38 Bravo Nick Emma 53 Olivier 12 None NaN 53 12 None
2 Charlie 37 Chris Jeff 11 None None NaN 11 None None
3 Alpha 56 David 56 Oscar None None 56 NaN None None
Besides Mustafa Aydın's approach, here is a fixed version of your for loop:
for column in df.columns:
    if column.startswith("name"):
        # column[-1] is the last character, so this assumes single-digit suffixes
        age = f"age_{column[-1]}"
        df[age] = df[column].str.extract(r"(\d+)")
print(df)
id name_1 name_2 name_3 name_4 age_1 age_2 age_3 age_4
0 21 Delta Tom Peter 17 Jeffrey Henry 23 NaN 17 NaN 23
1 38 Bravo Nick Emma 53 Olivier 12 None NaN 53 12 NaN
2 Charlie 37 Chris Jeff 11 None None NaN 11 NaN NaN
3 Alpha 56 David 56 Oscar None None 56 NaN NaN NaN

Efficient mean and total aggregation over multiple Pandas DataFrame columns

Suppose I have a DataFrame that looks something like this:
    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58
Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (i.e. total) amount spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
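The arithmetic for this one combination can be checked with a short sketch, using the sample frame from the question. The variable names `subset`, `per_id_totals` and `per_id_counts` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [7, 5, 7, 4, 5, 9, 3, 0, 2, 9, 0, 8, 3, 1, 3, 6, 2, 1, 9, 2],
    "country": ["fr", "fr", "fr", "it", "fr", "uk", "uk", "uk", "it", "uk",
                "uk", "uk", "uk", "it", "uk", "it", "it", "it", "uk", "it"],
    "grade": ["a", "b", "a", "c", "b", "a", "a", "b", "c", "a",
              "b", "a", "a", "b", "a", "c", "c", "b", "a", "c"],
    "category": ["mango", "banana", "banana", "mango", "banana", "apple", "mango",
                 "banana", "mango", "banana", "mango", "mango", "mango", "banana",
                 "apple", "apple", "mango", "mango", "banana", "mango"],
    "amount": [52, 68, 73, 70, 99, 29, 83, 59, 11, 91,
               95, 30, 82, 78, 76, 76, 10, 30, 17, 58],
})

# Rows for Italy, grade c, mango only.
subset = df[(df["country"] == "it") & (df["grade"] == "c") & (df["category"] == "mango")]

# Total and number of purchases per id, then the mean over ids.
per_id_totals = subset.groupby("id")["amount"].sum()
per_id_counts = subset.groupby("id")["amount"].count()
mean_total = per_id_totals.mean()
mean_count = per_id_counts.mean()
```

Here id 4 has one purchase and id 2 has three, so the mean count for this combination is 2.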
The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping,on=["country","grade","category"],how="left")
Which gives the desired output:
    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52         52            1
1    5      fr     b   banana      68        167            2
2    7      fr     a   banana      73         73            1
3    4      it     c    mango      70       74.5            2
4    5      fr     b   banana      99        167            2
5    9      uk     a    apple      29       52.5            1
6    3      uk     a    mango      83       97.5          1.5
7    0      uk     b   banana      59         59            1
8    2      it     c    mango      11       74.5            2
9    9      uk     a   banana      91        108            2
10   0      uk     b    mango      95         95            1
11   8      uk     a    mango      30       97.5          1.5
12   3      uk     a    mango      82       97.5          1.5
13   1      it     b   banana      78         78            1
14   3      uk     a    apple      76       52.5            1
15   6      it     c    apple      76         76            1
16   2      it     c    mango      10       74.5            2
17   1      it     b    mango      30         30            1
18   9      uk     a   banana      17        108            2
19   2      it     c    mango      58       74.5            2
The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create mean_totals column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives:
id country grade category amount mean_totals
0 7 fr a mango 52 52.0
1 5 fr b banana 68 167.0
2 7 fr a banana 73 73.0
3 4 it c mango 70 74.5
4 5 fr b banana 99 167.0
5 9 uk a apple 29 52.5
6 3 uk a mango 83 97.5
7 0 uk b banana 59 59.0
8 2 it c mango 11 74.5
9 9 uk a banana 91 108.0
10 0 uk b mango 95 95.0
11 8 uk a mango 30 97.5
12 3 uk a mango 82 97.5
13 1 it b banana 78 78.0
14 3 uk a apple 76 52.5
15 6 it c apple 76 76.0
16 2 it c mango 10 74.5
17 1 it b mango 30 30.0
18 9 uk a banana 17 108.0
19 2 it c mango 58 74.5
It looks like you need a double groupby. Once for the sum, once for the mean:
out = (df
       .groupby(['country', 'grade', 'category', 'id']).sum()
       .groupby(['country', 'grade', 'category']).mean()
      )
output:
amount
country grade category
fr a banana 73.0
mango 52.0
b banana 167.0
it b banana 78.0
mango 30.0
c apple 76.0
mango 74.5
uk a apple 52.5
banana 108.0
mango 97.5
b banana 59.0
mango 95.0
This should run quickly: first group and compute the required aggregates, then merge them back into the existing df.
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = (
    df.groupby(by=['country', 'grade', 'category', 'id'], as_index=False)
    .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
         int_counts=pd.NamedAgg(column='id', aggfunc='count'))
    .groupby(by=['country', 'grade', 'category'], as_index=False)
    .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
         mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean'))
)
output_df = pd.merge(df, intermediate_df, on=['country', 'grade', 'category'], how='left')
print(output_df)

How to update multiple column values in pandas

I've been trying to crack this for a while, but I'm stuck now.
This is my code
l=list()
column_name=[col for col in df.columns if 'SalesPerson' in col]
filtereddf=pd.DataFrame(columns=['Item','SerialNo','Location','SalesPerson01','SalesPerson02','SalesPerson03','SalesPerson04','SalesPerson05','SalesPerson06','PredictedSales01','PredictedSales02','PredictedSales03','PredictedSales04','PredictedSales05','PredictedSales06'])
for i,r in df.iterrows():
    if len(r['Name'].split(';'))>1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is=y[-2:]
                    filtereddf.at[i,'SerialNo']=r['SerialNo']
                    filtereddf.at[i,'Location']=r['Location']
                    filtereddf.at[i,y]=r[y]
                    filtereddf.at[i,'Item']=r['Item']
                    filtereddf.at[i,f'PredictedSales{number_is}']=r[f'PredictedSales{number_is}']
                    #The below statement however prints the values correctly. But I want to filter the values and use in a dataframe
                    #print(r['SerialNo'],r['Location'],r[f'SalesPerson{number_is}'],r[f'PredictedSales{number_is}'],r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is=y[-2:]
                filtereddf.at[i,'SerialNo']=r['SerialNo']
                filtereddf.at[i,'Location']=r['Location']
                filtereddf.at[i,y]=r[y]
                filtereddf.at[i,'Item']=r['Item']
                filtereddf.at[i,f'PredictedSales{number_is}']=r[f'PredictedSales{number_is}']
                #The below statement however prints the values correctly. But I want to filter the values and use in a dataframe
                #print(r['SerialNo'],r['Location'],r[f'SalesPerson{number_is}'],r[f'PredictedSales{number_is}'],r['Definition'])
                l.append(filtereddf)
finaldf=pd.concat(l,ignore_index=True)
It eventually throws an error
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample of the dataset (the actual CSV file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure if my approach using dataframe.at is correct, and I would appreciate any pointers on how to efficiently keep only those column values that match a value in the Name column.
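I would recommend changing from a column-focused dataframe to a row-focused dataframe. You can rewrite your dataset using melt. The code below relies on Item, SerialNo and Location being in the index, for example after df = df.set_index(['Item', 'SerialNo', 'Location']):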
df_person = df.loc[:,'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:,'PredictedSales01':'PredictedSales06']
df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
PredictedSales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
df_person['PredictedSales'] = PredictedSales
index_cols = ['Item','SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we create a separate dataframe using explode, after splitting the semicolon-separated names into lists:
df_names = df[['Name']].assign(Name=df['Name'].str.split(';')).explode('Name').rename({'Name':'SalesPerson'}, axis=1)
df_names = df_names.reset_index().set_index(['Item','SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
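If the wide layout from the question is preferred over the long one, a different technique is to blank out non-matching cells column by column. The sketch below is not the melt approach above but a masking variant. It assumes the two-digit column suffixes from the sample and exact name matches, and the names `names`, `keep` and `out` are illustrative. It uses the first three sample rows.

```python
import pandas as pd

columns = ["Item", "Name", "SerialNo", "Location",
           *[f"SalesPerson{i:02d}" for i in range(1, 7)],
           *[f"PredictedSales{i:02d}" for i in range(1, 7)]]
rows = [
    ["TV", "Joe;Mary;Philip", 11111, "NY",
     "Tom", "Julie", "Joe", "Sara", "Mary", "Philip", 90, 80, 30, 98, 99, 100],
    ["WashingMachine", "Mike", 22222, "NJ",
     "Tom", "Julie", "Joe", "Mike", "Mary", "Philip", 80, 70, 40, 74, 88, 42],
    ["Dishwasher", "Tony;Sue", 33333, "NC",
     "Margaret", "Tony", "William", "Brian", "Sue", "Bert", 58, 49, 39, 59, 78, 89],
]
df = pd.DataFrame(rows, columns=columns)

# One list of wanted names per row.
names = df["Name"].str.split(";")

out = df.copy()
for person_col in [c for c in df.columns if c.startswith("SalesPerson")]:
    sales_col = f"PredictedSales{person_col[-2:]}"
    # True where this column's sales person is listed in the row's Name.
    keep = pd.Series(
        [person in wanted for person, wanted in zip(df[person_col], names)],
        index=df.index,
    )
    # Keep matching cells; everything else becomes a missing value.
    out[person_col] = df[person_col].where(keep)
    out[sales_col] = df[sales_col].where(keep)
```

Because nothing is appended inside the loop, memory use stays close to the size of the original frame.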

Left shift with condition in pandas

I have a problem with a CSV file. I have tried several solutions with the pandas library, but none has worked for me. I want to shift the values of a row 3 columns to the left whenever a certain code (in this case 11 or 22) appears in one of the code columns. For example, this would be my input:
code  name   %  code 2   name 2  % 2  code 3  name 3  % 3
  11  John  34      44      Rob   23      33   Peter   15
  22   Ken  45      33    Peter   45      44     Rob   25
  33 Peter  34      66  Abraham   37      77   Harry   67
  11  John  45      77    Harry   39      88    Mary   20
And I expect something like this:
code  name   %  code 2   name 2  % 2  code 3  name 3  % 3
  44   Rob  23      33    Peter   15
  33 Peter  45      44      Rob   25
  33 Peter  34      66  Abraham   37      77   Harry   67
  77 Harry  39      88     Mary   20
any idea how I could solve my problem with pandas?
Thanks in advance!
Do you want this?
mask = df['code'].isin([11,22])
df.loc[mask] = df.loc[mask].shift(-3,axis=1)
Output -
code name % code 2 name 2 % 2 code 3 name 3 % 3
0 44.0 Rob 23.0 33.0 Peter 15.0 NaN NaN NaN
1 33.0 Peter 45.0 44.0 Rob 25.0 NaN NaN NaN
2 33.0 Peter 34.0 66.0 Abraham 37.0 77.0 Harry 67.0
3 77.0 Harry 39.0 88.0 Mary 20.0 NaN NaN NaN
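For reference, a self-contained sketch that rebuilds the input from the question and applies the same mask-and-shift idea. Casting to object dtype first is an assumption added here so that numbers and text can move between columns without dtype upcasting. It is not part of the original answer.

```python
import pandas as pd

columns = ["code", "name", "%", "code 2", "name 2", "% 2", "code 3", "name 3", "% 3"]
rows = [
    [11, "John", 34, 44, "Rob", 23, 33, "Peter", 15],
    [22, "Ken", 45, 33, "Peter", 45, 44, "Rob", 25],
    [33, "Peter", 34, 66, "Abraham", 37, 77, "Harry", 67],
    [11, "John", 45, 77, "Harry", 39, 88, "Mary", 20],
]
# object dtype lets every column hold numbers, text or missing values.
df = pd.DataFrame(rows, columns=columns).astype(object)

# Rows whose first code is 11 or 22 get shifted three columns to the left;
# the vacated columns on the right become missing values.
mask = df["code"].isin([11, 22])
df.loc[mask] = df.loc[mask].shift(-3, axis=1)
```

Rows that do not match the mask (here the third row) are left untouched.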

New dataframe from grouping together two columns

I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the dates into quarters and wish to create a new dataframe in which rows with matching quarters and region names are grouped together, with the mean of Average for each group.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
for example:
new_df = df.groupby(['Resion_Name','Date']).mean()
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
code looks as follows:
new_df = df3.groupby(['Region_Name','Date'])
new1=new_df['Average'].transform('mean')
Result (new1 is a Series holding the group mean for each original row):
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
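Note that transform returns one value per original row. If a new dataframe with one row per region and quarter is wanted, as the question describes, aggregating with mean instead is one option. The sketch below reuses the df3 sample from this answer. The misspelled 'Resion_Name' key in the question's attempt would raise a KeyError, which may be the traceback mentioned there.

```python
import pandas as pd

df3 = pd.DataFrame({
    "Region_Name": ["London", "Newyork", "London", "Newyork", "London", "London",
                    "Newyork", "Newyork", "Newyork", "Newyork", "London"],
    "Date": ["1990Q1", "1990Q1", "1990Q2", "1990Q2", "1991Q1", "1991Q1",
             "1991Q2", "1992Q2", "1993Q1", "1993Q1", "1994Q1"],
    "Average": [34, 56, 45, 67, 23, 89, 12, 45, 67, 34, 67],
})

# One row per (Region_Name, Date) pair with the mean of Average;
# as_index=False keeps the group keys as ordinary columns.
new_df = df3.groupby(["Region_Name", "Date"], as_index=False)["Average"].mean()
```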
