I have the dataset below. I need to group subsets by the Name and location columns and fill the missing values using the mode. Specifically, the missing values for Tom from UK need to be filled: within the Tom/UK group, the most frequent value should replace each NaN.
Here is the dataset:
Name   location  Value
Tom    USA       20
Tom    UK        NaN
Tom    USA       NaN
Tom    UK        20
Jack   India     NaN
Nihal  Africa    30
Tom    UK        NaN
Tom    UK        20
Tom    UK        30
Tom    UK        20
Tom    UK        30
Sam    UK        30
Sam    UK        30
Try:

df = (df.set_index(['Name', 'location'])
        .fillna(df[df.Name.eq('Tom') & df.location.eq('UK')]
                  .groupby(['Name', 'location'])
                  .agg(pd.Series.mode)   # the group's most frequent value
                  .to_dict())            # {'Value': {('Tom', 'UK'): 20.0}}
        .reset_index())
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
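If every (Name, location) group should be filled from its own mode, not just Tom/UK, here is a minimal sketch using GroupBy.transform (the helper name fill_mode is mine, not from the question):

import pandas as pd

def fill_mode(s):
    m = s.mode()  # mode() ignores NaN; may be empty if a group is all-NaN
    return s.fillna(m.iloc[0]) if not m.empty else s

# fill each group's NaNs with that group's most frequent value
df['Value'] = df.groupby(['Name', 'location'])['Value'].transform(fill_mode)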
I have the following dataset:
Name Loc Site Date Total
Alex Italy A 12.31.2020 30
Alex Italy B 12.31.2020 40
Alex Italy B 12.30.2020 100
Alex Italy A 12.30.2020 80
Alex France A 12.28.2020 10
Alex France B 12.28.2020 20
Alex France B 12.27.2020 10
I want to add, for each row, the average of Total on the day before its date, per Name, Loc and Date.
This is the outcome I'm looking for:
Name Loc Site Date Total Prv_Avg
Alex Italy A 12.31.2020 30 90
Alex Italy B 12.31.2020 40 90
Alex Italy B 12.30.2020 100 NULL
Alex Italy A 12.30.2020 80 NULL
Alex France A 12.28.2020 10 10
Alex France B 12.28.2020 20 10
Alex France B 12.27.2020 10 NULL
The NULLs are for rows where the previous date couldn't be found.
I've tried rolling but got mixed up with the index.
First aggregate the mean per the three columns, then add one day to the Date level of the MultiIndex so it matches the previous day, and finally use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby(['Name','Loc','Date'])['Total'].mean().rename('Prv_Avg')
print (s)
Name Loc Date
Alex France 2020-12-27 10
2020-12-28 15
Italy 2020-12-30 90
2020-12-31 35
Name: Prv_Avg, dtype: int64
s = s.rename(lambda x: x + pd.Timedelta('1 day'), level=2)
print (s)
Name Loc Date
Alex France 2020-12-28 10
2020-12-29 15
Italy 2020-12-31 90
2021-01-01 35
Name: Prv_Avg, dtype: int64
df = df.join(s, on=['Name','Loc','Date'])
print (df)
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 2020-12-31 30 90.0
1 Alex Italy B 2020-12-31 40 90.0
2 Alex Italy B 2020-12-30 100 NaN
3 Alex Italy A 2020-12-30 80 NaN
4 Alex France A 2020-12-28 10 10.0
5 Alex France B 2020-12-28 20 10.0
6 Alex France B 2020-12-27 10 NaN
Another possible solution:
import numpy as np

grp = ['Name', 'Loc', 'Date']
s = df.groupby(grp)['Total'].mean().shift().rename('Prv_Avg')

# mask values where the shift crossed a (Name, Loc) group boundary
idx1 = s.index.get_level_values('Name').to_list()
idx2 = s.index.get_level_values('Loc').to_list()

df.merge(s.where((idx1 == np.roll(idx1, 1)) &
                 (idx2 == np.roll(idx2, 1))).reset_index(),
         on=grp)
Output:
Name Loc Site Date Total Prv_Avg
0 Alex Italy A 12.31.2020 30 90.0
1 Alex Italy B 12.31.2020 40 90.0
2 Alex Italy B 12.30.2020 100 NaN
3 Alex Italy A 12.30.2020 80 NaN
4 Alex France A 12.28.2020 10 10.0
5 Alex France B 12.28.2020 20 10.0
6 Alex France B 12.27.2020 10 NaN
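For what it's worth, a group-aware shift avoids the np.roll masking entirely. A sketch, with the caveat that it picks the previous *available* date per (Name, Loc) rather than strictly the previous calendar day, so it only matches the join approach when dates are consecutive:

# previous available date's mean per (Name, Loc); groupby sorts Date ascending
s = df.groupby(['Name', 'Loc', 'Date'])['Total'].mean()
prv = s.groupby(level=['Name', 'Loc']).shift(1).rename('Prv_Avg')

df = df.join(prv, on=['Name', 'Loc', 'Date'])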
I have a pandas DataFrame as below:
id age gender country sales_year
1 None M India 2016
2 23 F India 2016
1 20 M India 2015
2 25 F India 2015
3 30 M India 2019
4 36 None India 2019
I want to group by id and take the latest row per sales_year, with all non-null elements.
Expected output:
id age gender country sales_year
1 20 M India 2016
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
In PySpark I would do:
df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))
But I need the same solution in pandas.
EDIT: This can be the case with any of the columns, not just age. I need it to pick up the latest non-null data (where the id exists) for all the ids.
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
print(df.replace('None',np.NaN).groupby('id').first())
first replace the string 'None' with NaN
next use groupby() to group by 'id'
finally take the first non-null value per column with first()
Use -
df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()
Output
id
1 20
2 23
3 30
4 36
Name: age, dtype: object
Remove the ['age'] to get full rows -
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()
Output
age gender country sales_year
id
1 20 M India 2015
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
You can put the id back as a column with reset_index() -
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()
Output
id age gender country sales_year
0 1 20 M India 2015
1 2 23 F India 2016
2 3 30 M India 2019
3 4 36 None India 2019
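Putting the two answers together, here is a sketch that normalizes the string 'None' before grouping (assuming, as in the sample, that missing values arrive as the string 'None') and sorts so the latest year wins:

import numpy as np

out = (df.replace('None', np.nan)
         .sort_values('sales_year', ascending=False)
         .groupby('id', as_index=False)
         .first())

Note that first() takes the first non-null value per column independently, so age can come from an older row than sales_year, which is exactly what the expected output shows for id 1.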
I've been trying to crack this for a while, but I'm stuck now. This is my code:
l = list()
column_name = [col for col in df.columns if 'SalesPerson' in col]
filtereddf = pd.DataFrame(columns=['Item', 'SerialNo', 'Location',
                                   'SalesPerson01', 'SalesPerson02', 'SalesPerson03',
                                   'SalesPerson04', 'SalesPerson05', 'SalesPerson06',
                                   'PredictedSales01', 'PredictedSales02', 'PredictedSales03',
                                   'PredictedSales04', 'PredictedSales05', 'PredictedSales06'])
for i, r in df.iterrows():
    if len(r['Name'].split(';')) > 1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is = y[-2:]
                    filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                    filtereddf.at[i, 'Location'] = r['Location']
                    filtereddf.at[i, y] = r[y]
                    filtereddf.at[i, 'Item'] = r['Item']
                    filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                    # The statement below prints the values correctly,
                    # but I want to filter the values and use them in a dataframe
                    # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'], r[f'PredictedSales{number_is}'], r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is = y[-2:]
                filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                filtereddf.at[i, 'Location'] = r['Location']
                filtereddf.at[i, y] = r[y]
                filtereddf.at[i, 'Item'] = r['Item']
                filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                # The statement below prints the values correctly,
                # but I want to filter the values and use them in a dataframe
                # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'], r[f'PredictedSales{number_is}'], r['Definition'])
                l.append(filtereddf)
finaldf = pd.concat(l, ignore_index=True)
It eventually throws an error
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample dataset is below (the actual CSV file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure if my approach using DataFrame.at is correct; are there any pointers as to what I can use to efficiently keep only those column values which match the names in the Name column?
I would recommend changing from a column-focused dataframe to a row-focused dataframe. You can reshape your dataset using melt:
# the identifying columns need to be in the index so melt (ignore_index=False) keeps them
df = df.set_index(['Item', 'SerialNo', 'Location'])

df_person = df.loc[:, 'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:, 'PredictedSales01':'PredictedSales06']

df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
PredictedSales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
df_person['PredictedSales'] = PredictedSales

index_cols = ['Item', 'SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we create a separate dataframe, splitting Name on ';' and using explode:

df_names = df['Name'].str.split(';').explode().rename('SalesPerson').to_frame()
df_names = df_names.reset_index().set_index(['Item', 'SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
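If you need to keep the original wide layout from the expected output, here is an alternative sketch (not the melt approach above) that masks each SalesPersonNN/PredictedSalesNN pair directly, assuming no missing values in Name as in the sample:

# keep a cell only when that salesperson appears in the row's ';'-separated Name list
names = df['Name'].str.split(';')

for nn in ['01', '02', '03', '04', '05', '06']:
    keep = [p in ns for p, ns in zip(df[f'SalesPerson{nn}'], names)]
    df[f'SalesPerson{nn}'] = df[f'SalesPerson{nn}'].where(keep)
    df[f'PredictedSales{nn}'] = df[f'PredictedSales{nn}'].where(keep)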
I have a dataframe and I'm trying to group by the Name and Destination columns, calculate the sum of Sales per Destination for each Name, and then get the top 2 destinations for each name.
data=
Name Origin Destination Sales
John Italy China 2
Dan UK China 3
Dan UK India 2
Sam UK India 5
Sam Italy Malaysia 1
John Italy Malaysia 1
Dan France India 4
Dan Italy China 2
Sam Italy Malaysia 2
John France Malaysia 1
Sam Italy China 2
Dan UK Malaysia 4
Dan France India 2
John France Malaysia 4
John Italy China 4
John UK Malaysia 1
Sam UK China 4
Sam France China 5
I have tried to do this but I keep getting it sorted by Destination and not by Sales. Below is the code I tried.
data.groupby(['Name', 'Destination'])['Sales'].sum().groupby(level=0).head(2).reset_index(name='Total_Sales')
This code gives me this dataframe:
Name Destination Total_Sales
Dan China 5
Dan India 8
John China 6
John Malaysia 7
Sam China 11
Sam India 5
But it is sorted on the wrong column (Destination); I would like to sort by the sum of the sales (Total_Sales). The expected result I want to achieve is:
Name Destination Total_Sales
Dan India 8
Dan China 5
John Malaysia 7
John China 6
Sam China 11
Sam India 5
Your code:
grouped_df = data.groupby(['Name', 'Destination'])['Sales'].sum().groupby(level=0).head(2).reset_index(name='Total_Sales')
To sort the result:
sorted_df = grouped_df.sort_values(by=['Name','Total_Sales'], ascending=(True,False))
print(sorted_df)
Output:
Name Destination Total_Sales
1 Dan India 8
0 Dan China 5
3 John Malaysia 7
2 John China 6
4 Sam China 11
5 Sam India 5
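One caveat worth hedging: head(2) runs before any sorting, so it keeps the first two destinations per name in groupby order, not necessarily the two largest (they happen to coincide in this sample). If the true top 2 by sales is wanted, sort the sums first; a sketch:

top2 = (data.groupby(['Name', 'Destination'])['Sales'].sum()
            .sort_values(ascending=False)
            .groupby(level=0).head(2)          # now truly the 2 largest per Name
            .reset_index(name='Total_Sales')
            .sort_values(['Name', 'Total_Sales'], ascending=[True, False]))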
I have a dataframe and want the output to be formatted to save paper for printing.
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8
London 40
France 2 20
France 2 22
France 3
France 3
France 3
USA 10
Is there a way to format the dataframe to look like this:
GameA GameB
Country
London 5 London 20
London 5 London 10
London 3 London 5
London 3 London 6
London London 8
London London 40
GameA GameB
France 2 France 20
France 2 France 22
France 3
France 3
France 3
GameA
USA 10
The formatting above is a bit off because of how the text results were copied and pasted (due to the missing values), but this should work with your actual data.
countries = df.index.unique()

for country in countries:
    print(df.loc[df.index == country])
    print(' ')
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8 NaN
London 40 NaN
GameA GameB
Country
France 2 20
France 2 22
France 3 NaN
France 3 NaN
France 3 NaN
GameA GameB
Country
USA 10 NaN
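A slightly different sketch of the same idea, iterating with GroupBy and blanking the NaNs to save ink (assuming the index is named Country, as shown):

# one block per country, with NaNs printed as blanks
for country, block in df.groupby(level='Country', sort=False):
    print(block.fillna(''))
    print()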