Take value from a duplicate row and create a new column pandas - python

I have the following dataframe:
Year Name Town Vehicle
2000 John NYC Truck
2000 John NYC Car
2010 Jim London Bike
2010 Jim London Car
I would like to condense this dataframe to one row per Year/Name/Town, so that my end result is:
Year Name Town Vehicle Vehicle2
2000 John NYC Truck Car
2010 Jim London Bike Car
I'm guessing it is some sort of df.groupby statement, but I'm not sure how to create the new column. Any help would be much appreciated!

Use GroupBy.cumcount to build a counter, then reshape with Series.unstack:
g = df.groupby(['Year', 'Name', 'Town']).cumcount()
df1 = (df.set_index(['Year', 'Name', 'Town', g])['Vehicle']
         .unstack()
         .add_prefix('Vehicle')
         .reset_index())
print(df1)
Year Name Town Vehicle0 Vehicle1
0 2000 John NYC Truck Car
1 2010 Jim London Bike Car
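The unstacked columns come out as Vehicle0 and Vehicle1. If the exact names from the question (Vehicle, Vehicle2) are wanted, one option is to rename the counter-based columns; a minimal sketch, assuming the same df as above:
g = df.groupby(['Year', 'Name', 'Town']).cumcount()
df1 = (df.set_index(['Year', 'Name', 'Town', g])['Vehicle']
         .unstack()
         .rename(columns=lambda i: 'Vehicle' if i == 0 else f'Vehicle{i + 1}')
         .reset_index())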


Add column to DataFrame and assign number to each row

I have the following table:
Father Son   Year
James  Harry 1999
James  Alfi  2001
Corey  Kyle  2003
I would like to add a fourth column that makes the table look like below. It's supposed to show which child of each father was born first, second, third, and so on. How can I do that?
Father Son   Year Child
James  Harry 1999 1
James  Alfi  2001 2
Corey  Kyle  2003 1
Here is one way to do it, using cumcount:
# group by Father and take a cumcount, offset by 1
df['Child'] = df.groupby(['Father'])['Son'].cumcount() + 1
df
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
It assumes that df is sorted by Father and Year. If not, then:
df['Child'] = df.sort_values(['Father', 'Year']).groupby(['Father'])['Son'].cumcount() + 1
df
(The cumcount result aligns back to df by index, so the assignment works even though the sort was only temporary.)
Here is an idea for solving this using the groupby and cumsum functions.
This assumes that the rows are ordered so that the younger sibling is always below their elder brother and all children of the same father are in a continuous pack of rows.
Assume we have the following setup
import pandas as pd

df = pd.DataFrame({'Father': ['James', 'James', 'Corey'],
                   'Son': ['Harry', 'Alfi', 'Kyle'],
                   'Year': [1999, 2001, 2003]})
Then here is the trick: we group the siblings with the same father into a groupby object and then compute the cumulative sum of ones to assign a sequential number to each row.
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
df = df.drop(columns='temp_column')
The result would look like this:
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
Now to make the solution more general consider reordering the rows to satisfy the preconditions before applying the solution and then if necessary restore the dataframe to the original order.
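For instance, a minimal sketch of that more general flow, assuming the default RangeIndex so that sort_index can restore the original order afterwards:
# sort so each father's children are contiguous and ordered by year
df = df.sort_values(['Father', 'Year'])
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
# drop the helper and restore the original row order
df = df.drop(columns='temp_column').sort_index()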

Calculating a grouped-by % where the numerator checks for contained values and the denominator is the number of unique column values

I am trying to compute a ratio (%) per Rep: the number of Businesses whose Service rows contain at least one of two possible values (Food or Beverage), divided by the number of unique Business values in the df for that Rep, but I am having trouble.
Original df:
Rep   Business   Service
Cindy Shakeshake Food
Cindy Shakeshake Outdoor
Kim   Burgerking Beverage
Kim   Burgerking Phone
Kim   Burgerking Car
Nate  Tacohouse  Food
Nate  Tacohouse  Car
Tim   Coffeeshop Coffee
Tim   Coffeeshop Seating
Cindy Italia     Seating
Cindy Italia     Coffee
Desired Output:
Rep   %
Cindy 0.5
Kim   1
Nate  1
Tim   0
Where % is the number of Businesses Cindy has with at least 1 Food or Beverage row, divided by the number of unique Businesses in the df for her.
I am trying something like below:
s = (df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
       .groupby('Rep')
       .agg({'Business':'nunique','Service':'count'}))
s['Service']/s['Business']
But this doesn't give me what I'm looking for: Service just counts all rows in the df for Cindy (4 in this case), and the Business column isn't giving me an accurate count of the businesses where she has Food or Beverage.
Thanks for looking and possible help in advance.
I think you need to aggregate with sum to count the matched values:
df1 = (df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
         .groupby('Rep')
         .agg({'Business':'nunique','Service':'sum'}))
print(df1)
       Business  Service
Rep
Cindy         2        1
Kim           1        1
Nate          1        1
Tim           1        0
s = df1['Service']/df1['Business']
print(s)
Rep
Cindy    0.5
Kim      1.0
Nate     1.0
Tim      0.0
dtype: float64
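One caveat: sum counts matched rows, so a business with more than one Food or Beverage row would be overcounted. A minimal sketch of a variant that counts businesses with at least one match instead, assuming the same df as above and using a hypothetical helper flag 'hit':
matched = (df.assign(hit=df['Service'].isin(['Food', 'Beverage']))
             .groupby(['Rep', 'Business'])['hit'].any())
s = matched.groupby('Rep').mean()
The mean of the True/False flags per Rep is exactly the number of matched businesses divided by the number of unique businesses.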
There is a small mistake that you made in your code here:
s = (df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
       .groupby('Rep')
       .agg({'Business':'nunique','Service':'count'}))
s['Service']/s['Business']
You would need to change 'Service':'count' to 'Service':'sum'. count just counts the number of rows that each Rep has; sum counts the number of rows for each Rep that are either Food or Beverage service.
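A toy illustration of the difference on a hypothetical 0/1 flag series:
import pandas as pd

flags = pd.Series([1, 0, 0, 1], index=['a', 'a', 'b', 'b'])
print(flags.groupby(level=0).count())  # a: 2, b: 2 -> all rows per group
print(flags.groupby(level=0).sum())    # a: 1, b: 1 -> only the matched rows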

How can I fill cells of a new column based on a substring match with the original data using pandas?

There are 2 dataframes, and they have similar data.
Dataframe A:
Index Business Address
1 Oils Moskva, Russia
2 Foods Tokyo, Japan
3 IT California, USA
... etc.
Dataframe B:
Index Country Country Calling Codes
1 USA +1
2 Egypt +20
3 Russia +7
4 Korea +82
5 Japan +81
... etc.
I want to add a column named 'Country Calling Codes' to dataframe A as well.
The 'Country' column of dataframe B should be compared with the data in the 'Address' column: if the string in 'A.Address' includes the string in 'B.Country', then 'B.Country Calling Codes' should be inserted into 'A.Country Calling Codes' for that row.
Result is:
Index Business Address Country Calling Codes
1 Oils Moskva, Russia +7
2 Foods Tokyo, Japan +81
3 IT California, USA +1
I don't know how to deal with this issue because I don't have much experience using pandas. I would be very grateful for any help.
Use Series.str.extract to get the possible country strings from the Country column, and then map them to the codes with Series.map:
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False)
A['Country Calling Codes'] = s.map(d)
print(A)
Index Business Address Country Calling Codes
0 1 Oils Moskva, Russia +7
1 2 Foods Tokyo, Japan +81
2 3 IT California, USA +1
Detail:
print(A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False))
0 Russia
1 Japan
2 USA
Name: Address, dtype: object
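One caveat with the pattern above: if a country name ever contained a regex metacharacter, the joined pattern could misbehave. A sketch of a safer variant, assuming the same A and d as above:
import re

pattern = f'({"|".join(re.escape(c) for c in d.index)})'
A['Country Calling Codes'] = A['Address'].str.extract(pattern, expand=False).map(d)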

pandas fill missing country values based on city if it exists

I'm trying to fill in missing country names in my dataframe based on city/country pairs that already exist. For example, in the dataframe below I want to replace NaN for the city Bangalore with the country India, since that city already appears elsewhere with a country.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1['Country'] = df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward filling missing values within each group. Note that this only helps when a non-null value appears earlier in the group.
One of the ways could be:
# keep one row per city that has a known country
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
# fill the missing countries from the helper column, then drop it
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it:
first use forward fill and then backward fill (for the case where the NaN occurs first in a group):
df = df.groupby('City')[['City','Country']].fillna(method='ffill').groupby('City')[['City','Country']].fillna(method='bfill')
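A more compact variant of the same idea, as a sketch (filling each city group's series in both directions inside transform):
df['Country'] = df.groupby('City')['Country'].transform(lambda s: s.ffill().bfill())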

how to remove an entire row if a particular column has duplicate values in a dataframe in python

I have a dataframe like this:
df
Name City
0 sri chennai
1 pedhci pune
2 bahra pune
There is a duplicate in the City column.
I tried:
df["City"].drop_duplicates()
but that gives only the particular column.
my desired output should be
output_df
Name City
0 sri chennai
1 pedhci pune
You can use:
df2 = df.drop_duplicates(subset='City')
if you want to store the result in a new dataframe, or:
df.drop_duplicates(subset='City',inplace=True)
if you want to update df.
This produces:
>>> df
City Name
0 chennai sri
1 pune pedhci
2 pune bahra
>>> df.drop_duplicates(subset='City')
City Name
0 chennai sri
1 pune pedhci
This will thus only take duplicates for City into account; duplicates in Name are ignored.
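If the last occurrence should be kept instead of the first, drop_duplicates also takes a keep parameter:
df.drop_duplicates(subset='City', keep='last')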
