Combine rows containing blanks with data from other rows in pandas [duplicate]

I have a pandas DataFrame as below:
id age gender country sales_year
1 None M India 2016
2 23 F India 2016
1 20 M India 2015
2 25 F India 2015
3 30 M India 2019
4 36 None India 2019
I want to group by id and take the latest row per sales_year, keeping the latest non-null value in each column.
Expected output:
id age gender country sales_year
1 20 M India 2016
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
In PySpark I would use:
df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))
but I need the same solution in pandas.
EDIT:
This can be the case for any column, not just age. I need to pick up the latest non-null data for every column, for all ids.

Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
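Both snippets rely on GroupBy.first skipping nulls within each group. A self-contained sketch with the question's data (typed in by hand here) illustrates the behaviour:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id": [1, 2, 1, 2, 3, 4],
    "age": [np.nan, 23, 20, 25, 30, 36],
    "gender": ["M", "F", "M", "F", "M", None],
    "country": ["India"] * 6,
    "sales_year": [2016, 2016, 2015, 2015, 2019, 2019],
})

# Sort so the latest year comes first, then take the first
# non-null value per column within each id group.
out = (df.sort_values("sales_year", ascending=False)
         .groupby("id", as_index=False)
         .first())
```

For id 1 the 2016 row has a null age, so first() falls through to the 2015 value (20) while still reporting sales_year 2016.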

If the missing values are the string 'None', replace them with NaN first:
print(df.replace('None', np.nan).groupby('id').first())
first, replace the string 'None' with np.nan
next, group by 'id' using groupby()
finally, take the first non-null value in each column with first()

Use:
df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()
Output
id
1 20
2 23
3 30
4 36
Name: age, dtype: object
Remove the ['age'] to get full rows:
df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id').first()
Output
age gender country sales_year
id
1 20 M India 2015
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
You can put the id back as a column with reset_index():
df.dropna(subset=['age']).sort_values('sales_year', ascending=False).groupby('id').first().reset_index()
Output
id age gender country sales_year
0 1 20 M India 2015
1 2 23 F India 2016
2 3 30 M India 2019
3 4 36 None India 2019

Related

Missing value replacement using mode in pandas in a subgroup of a group

I have a data set as below. I need to group subsets of the data and fill the missing values using the mode. Specifically, the missing Value entries for Tom from the UK need filling: group the Tom/UK rows, and replace each NaN in that group with the group's most frequent value.
The dataset:
Name location Value
Tom USA 20
Tom UK NaN
Tom USA NaN
Tom UK 20
Jack India NaN
Nihal Africa 30
Tom UK NaN
Tom UK 20
Tom UK 30
Tom UK 20
Tom UK 30
Sam UK 30
Sam UK 30
try:
df = (
    df.set_index(['Name', 'location'])
      .fillna(
          df[df.Name.eq('Tom') & df.location.eq('UK')]
          .groupby(['Name', 'location'])
          .agg(pd.Series.mode)
          .to_dict()
      )
      .reset_index()
)
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
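A more general variant (a sketch, not the answer above): fill every group's NaNs with that group's mode via transform, so the fill is not hard-coded to Tom/UK. Note this also fills the Tom/USA rows, unlike the snippet above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Tom", "Tom", "Tom", "Tom", "Jack", "Nihal", "Tom",
             "Tom", "Tom", "Tom", "Tom", "Sam", "Sam"],
    "location": ["USA", "UK", "USA", "UK", "India", "Africa", "UK",
                 "UK", "UK", "UK", "UK", "UK", "UK"],
    "Value": [20, np.nan, np.nan, 20, np.nan, 30, np.nan,
              20, 30, 20, 30, 30, 30],
})

def fill_with_mode(s):
    # Fill NaNs with the group's most frequent value, if one exists.
    m = s.mode()
    return s.fillna(m.iloc[0]) if not m.empty else s

df["Value"] = df.groupby(["Name", "location"])["Value"].transform(fill_with_mode)
```

Groups with no non-null values at all (Jack/India here) have no mode and are left as NaN.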

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples", "Number of People"],
                     aggfunc=lambda x: len(x.unique()),
                     margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning to a new column name, but that does not work with a MultiIndex on the columns: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
The assign method, tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]), which fails with SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
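For reference, a self-contained run of the stack/divide/unstack idea, with the MultiIndex columns constructed by hand from the example numbers in the question:

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018, 2019, 2020]])
df = pd.DataFrame(
    [[10, 18, 20, 25, 2, 3, 4, 5],
     [8, 35, 25, 12, 2, 5, 5, 4]],
    index=["California", "West Virginia"], columns=cols)

# stack() moves the Year level into the row index, so the two
# measures become plain columns that can be divided directly.
s = df.stack()
s["Number of Apples per Person"] = s["Number of Apples"] / s["Number of People"]
df = s.unstack()
```

unstack() then restores Year as the innermost column level, so the new measure lines up beside the original two.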
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
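A runnable sketch of this alignment-based division, using a trimmed two-year version of the example data (the new level is labelled "Result" as above):

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018]])
df = pd.DataFrame([[10, 18, 2, 3], [8, 35, 2, 5]],
                  index=["California", "West Virginia"], columns=cols)

# Selecting a first-level label yields a sub-frame per measure;
# the division aligns automatically on the Year columns.
ratio = df["Number of Apples"] / df["Number of People"]

# Re-wrap the result under a new first-level label and append it.
out = pd.concat([df, pd.concat([ratio], keys=["Result"], axis=1)], axis=1)
```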

compare two value from different dataframe and based on that add value in pandas

I need to compare two different DataFrames and, based on the result, add a value to a column.
country = {'Year':[2020,2021],'Host':['Mexico','Panama'],'Winners':['Canada','Japan']}
country_df = pd.DataFrame(country,columns=['Year','Host','Winners'])
Year Host Winners
0 2020 Mexico Canada
1 2021 Panama Japan
all_country = {'Country': ['USA','Mexico','USA','Panama','Japan'],'Year':[2021,2020,2020,2021,2021]}
all_country_df = pd.DataFrame(all_country, columns=['Country','Year'])
Country Year
0 USA 2021
1 Mexico 2020
2 USA 2020
3 Panama 2021
4 Japan 2021
I want to compare all_country_df with country_df to find which country was the host in a given year, as well as the winner, something like:
all_country = {'Country':['USA','Mexico','USA','Panama','Japan'],'Year':[2021,2020,2020,2021,2021],'Winner':[None,None,None,None,'Winner'],'Host':[None,'Host',None,'Host',None]}
all_country_df = pd.DataFrame(all_country, columns=['Country','Year','Winner','Host'])
Like this
Country Year Winner Host
0 USA 2021 None None
1 Mexico 2020 None Host
2 USA 2020 None None
3 Panama 2021 None Host
4 Japan 2021 Winner None
Try with merge and np.where. Note that after the merge the winners column is named Winners, and using None instead of np.nan avoids the literal string 'nan' appearing in the string columns:
newdf = all_country_df.merge(country_df, on='Year')
newdf['Winner'] = np.where(newdf['Country'].ne(newdf['Winners']), None, 'Winner')
newdf['Host'] = np.where(newdf['Country'].ne(newdf['Host']), None, 'Host')
newdf = newdf.drop(columns='Winners')
print(newdf)
Output:
Country Year Host Winner
0 USA 2021 None None
1 Panama 2021 Host None
2 Japan 2021 None Winner
3 Mexico 2020 Host None
4 USA 2020 None None
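A self-contained sketch of the merge + np.where approach, building both frames from the question's data:

```python
import pandas as pd
import numpy as np

country_df = pd.DataFrame({'Year': [2020, 2021],
                           'Host': ['Mexico', 'Panama'],
                           'Winners': ['Canada', 'Japan']})
all_country_df = pd.DataFrame(
    {'Country': ['USA', 'Mexico', 'USA', 'Panama', 'Japan'],
     'Year': [2021, 2020, 2020, 2021, 2021]})

# Merge on the shared Year column, then mark host/winner rows.
newdf = all_country_df.merge(country_df, on='Year')
newdf['Winner'] = np.where(newdf['Country'].eq(newdf['Winners']), 'Winner', None)
newdf['Host'] = np.where(newdf['Country'].eq(newdf['Host']), 'Host', None)
newdf = newdf.drop(columns='Winners')
```

Passing None rather than np.nan to np.where produces an object array, so the unmatched cells hold a real None instead of the string 'nan'.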

Sort Sales Data by Customer Name and Year

I have a data set which contains Customer Name, Ship Date, and PO Amount.
I would like to reshape the data frame to output a table with:
cols: [Customer Name, 2016, 2017, 2018, 2019, 2020, 2021]
rows: one row for each customer, with the sum of POs within a given year.
This is what I have tried:
The data is coming in from an Excel sheet, but assume ShipToName is a string, BillAmount is a float, and SellDate is a year taken from a datetime.datetime.
ShipToName = ['Bob', 'Joe', 'Josh', 'Bob', 'Joe', 'Josh']
BillAmount = [30.02, 23.2, 20, 45.32, 54.23, 65]
SellDate = [2016, 2016, 2018, 2020, 2021, 2018]
dfSales = {'Customer': ShipToName, 'Total Sales': BillAmount,
           'Year': SellDate}
dfSales = pd.DataFrame(dfSales, columns=['Customer', 'Year', 'Total Sales'])
dfbyyear = dfSales.groupby(['Customer', 'Year'],
                           as_index=False).sum().sort_values('Total Sales',
                                                             ascending=False)
This gives me a new row for each customer/year combo.
I would like the output to look like:
Customer Name   2016   2017   2018   2019   2020   2021
Bob            30.02                       45.32
Joe            23.20                              54.23
Josh                         85.00
Edit v2
Using the data from the original version, we can create a temp dataframe dx that groups the data by Customer Name and Year. Then we can pivot the data to the format you wanted.
dx = df.groupby(['Customer Name','Year'])['PO Amount'].agg(Total_Amt=sum).reset_index()
dp = dx.pivot(index='Customer Name',columns='Year',values='Total_Amt')
print (dp)
The output of this will be:
Year 2020 2021
Customer Name
Boby 6754 6371
Jack 5887 6421
Jane 5161 4411
Jill 5857 5641
Kate 6205 6457
Suzy 5027 4561
Original v1
I am making some assumptions with the data as you haven't provided me with any.
Assumptions:
- There are many customers in the dataframe - my example has 6 customers
- Each customer has more than one Ship Date - my example has 1 shipment each month for 2 years
- The shipment amount is a dollar amount - I used random integers from 100 to 900
- The total dataframe size is 144 rows with 3 columns - Customer Name, Ship Date, and PO Amount
- You are looking for an output by Customer, by Year, with the sum of all POs for that year
With these assumptions, here's the dataframe and the output.
import pandas as pd
import random
df = pd.DataFrame({'Customer Name': ['Jack'] * 24 + ['Jill'] * 24 + ['Jane'] * 24 +
                                    ['Kate'] * 24 + ['Suzy'] * 24 + ['Boby'] * 24,
                   'Ship Date': pd.date_range('2020-01-01', periods=24, freq='MS').tolist() * 6,
                   'PO Amount': [random.randint(100, 900) for _ in range(144)]})
print (df)
df['Year'] = df['Ship Date'].dt.year
print (df.groupby(['Customer Name','Year'])['PO Amount'].agg(Total_Amt=sum).reset_index())
Customer Name Ship Date PO Amount
0 Jack 2020-01-01 310
1 Jack 2020-02-01 677
2 Jack 2020-03-01 355
3 Jack 2020-04-01 571
4 Jack 2020-05-01 875
.. ... ... ...
139 Boby 2021-08-01 116
140 Boby 2021-09-01 822
141 Boby 2021-10-01 751
142 Boby 2021-11-01 109
143 Boby 2021-12-01 866
Each customer has data from 2020-01-01 through 2021-12-01.
The summary report will be as follows:
Customer Name Year Total_Amt
0 Boby 2020 7176
1 Boby 2021 6049
2 Jack 2020 6187
3 Jack 2021 5240
4 Jane 2020 4919
5 Jane 2021 6105
6 Jill 2020 6556
7 Jill 2021 5963
8 Kate 2020 6300
9 Kate 2021 6360
10 Suzy 2020 5969
11 Suzy 2021 4866
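Alternatively, with the exact sample lists from the question, pivot_table gets the wide one-row-per-customer layout in a single step (column names here follow the question's variables):

```python
import pandas as pd

dfSales = pd.DataFrame({
    'Customer': ['Bob', 'Joe', 'Josh', 'Bob', 'Joe', 'Josh'],
    'Total Sales': [30.02, 23.2, 20, 45.32, 54.23, 65],
    'Year': [2016, 2016, 2018, 2020, 2021, 2018],
})

# One row per customer, one column per year, summing the amounts.
wide = dfSales.pivot_table(index='Customer', columns='Year',
                           values='Total Sales', aggfunc='sum')
```

Customer/year combinations with no sales come out as NaN, which can be filled with fillna(0) if preferred.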

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different lengths. When I try to do this using pandas, I get the error ValueError: Length of values does not match length of index. I assume this is because I can't add varying-length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
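The full cumcount-then-pivot recipe as a self-contained sketch, with the data typed in from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# cumcount gives each row its position within its Year group,
# which becomes the shared row index for the pivot.
df["yindex"] = df.groupby("Year").cumcount()
wide = df.pivot(index="yindex", columns="Year", values="Money")
```

Shorter years are padded with NaN, which is exactly what lets columns of different lengths coexist in one frame.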
