I have the following table:

Father  Son    Year
James   Harry  1999
James   Alfi   2001
Corey   Kyle   2003

I would like to add a fourth column that makes the table look like the one below. It is supposed to show which child of each father was born first, second, third, and so on. How can I do that?
Father  Son    Year  Child
James   Harry  1999  1
James   Alfi   2001  2
Corey   Kyle   2003  1
Here is one way to do it, using cumcount:
# group by Father and take a cumcount, offset by 1
df['Child'] = df.groupby('Father')['Son'].cumcount() + 1
df
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
This assumes that df is sorted by Father and Year. If not, then:
df['Child'] = df.sort_values(['Father', 'Year']).groupby('Father')['Son'].cumcount() + 1
df
Here is an idea for solving this using the groupby and cumsum functions.
This assumes that the rows are ordered so that a younger sibling always appears below their elder brother, and that all children of the same father form a contiguous block of rows.
Assume we have the following setup:
import pandas as pd

df = pd.DataFrame({'Father': ['James', 'James', 'Corey'],
                   'Son': ['Harry', 'Alfi', 'Kyle'],
                   'Year': [1999, 2001, 2003]})
Then here is the trick: we group the siblings with the same father into a groupby object and compute the cumulative sum of ones to assign a sequential number to each row.
# helper column of ones; the cumulative sum within each Father group yields 1, 2, ...
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
df = df.drop(columns='temp_column')
df
The result would look like this:
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
Now, to make the solution more general, consider reordering the rows to satisfy the preconditions before applying the solution, and then, if necessary, restoring the DataFrame to its original order, as sketched below.
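A minimal sketch of that more general version, using the same cumulative-sum trick and assuming the df from the setup above:

# sort so that children of the same father are grouped and appear in birth order
df_sorted = df.sort_values(['Father', 'Year'])

# same trick as above: cumulative sum of ones within each Father group
df_sorted['temp_column'] = 1
df_sorted['Child'] = df_sorted.groupby('Father')['temp_column'].cumsum()
df_sorted = df_sorted.drop(columns='temp_column')

# restore the original row order via the original index
df = df_sorted.sort_index()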
I have a dataset like this:

state  sex  birth  player
QLD    m    1993   Dave
QLD    m    1992   Rob
Now I would like to create an additional column, ID. The ID should be equal to the row number plus 1.
df = df.assign(ID=range(len(df)))
But unfortunately the first ID is zero. How can I make the first ID start with 1, and so on?
I want this output:

state  sex  birth  player  ID
QLD    m    1993   Dave    1
QLD    m    1992   Rob     2
but I got this:

state  sex  birth  player  ID
QLD    m    1993   Dave    0
QLD    m    1992   Rob     1
How can I add an additional column in pandas that starts with 1 and gives every row a unique number, so 2 for the second row, 3 for the third, and so on?
You can try this:
import pandas as pd
df['ID'] = pd.Series(range(len(df))) + 1
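Note that pd.Series aligns on the index, so this assumes df still has its default RangeIndex; if the index is something else, a plain array assigns positionally and sidesteps the alignment. A minimal alternative sketch:

import numpy as np

# positional assignment: 1, 2, ..., len(df), regardless of the index labels
df['ID'] = np.arange(1, len(df) + 1)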
If there is someone who understands, please help me resolve this. I want to label user data using Python pandas. My dataset has two columns, author and retweeted_screen_name. I want to assign a label with the following criterion: rows whose users in the author column share the same value in the retweeted_screen_name column get 1, and the others that do not share a value get 0.
Author  RT_Screen_Name  Label
Alice   John            1
Sandy   John            1
Lisa    Mario           0
Luna    Mark            0
Luna    John            1
Luke    Anthony         0
df['Label'] = 0
df.loc[df["RT_Screen_Name"] == "John", ["Label"]] = 1
It is unclear what condition you are using to decide the Label variable, but once you are clear on your condition you can swap out the conditional statement within this code. Also, if you edit your question to clarify the condition, notify me and I will adjust my answer.
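As an illustration of swapping in a different condition, here is a sketch that labels any row whose RT_Screen_Name appears in a list of target names (target_names is a hypothetical placeholder, not something taken from the question):

# hypothetical list of screen names that should receive Label 1
target_names = ["John"]

df['Label'] = 0
df.loc[df["RT_Screen_Name"].isin(target_names), "Label"] = 1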
IIUC, try with groupby:
df["Label"] = (df.groupby("RT_Screen_Name")["Author"].transform("count")>1).astype(int)
>>> df
Author RT_Screen_Name Label
0 Alice John 1
1 Sandy John 1
2 Lisa Mario 0
3 Luna Mark 0
4 Luna John 1
5 Luke Anthony 0
I have the following dataframe:
Year Name Town Vehicle
2000 John NYC Truck
2000 John NYC Car
2010 Jim London Bike
2010 Jim London Car
I would like to condense this dataframe to one row per Year/ Name /Town so that my end result is:
Year Name Town Vehicle Vehicle2
2000 John NYC Truck Car
2010 Jim London Bike Car
I'm guessing it is some sort of df.groupby statement, but I'm not sure how to create the new column. Any help would be much appreciated!
Use GroupBy.cumcount for a counter, then reshape with Series.unstack:
g = df.groupby(['Year', 'Name','Town']).cumcount()
df1 = (df.set_index(['Year', 'Name','Town', g])['Vehicle']
.unstack()
.add_prefix('Vehicle')
.reset_index())
print (df1)
Year Name Town Vehicle0 Vehicle1
0 2000 John NYC Truck Car
1 2010 Jim London Bike Car
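If the column names from the desired output are wanted (Vehicle and Vehicle2 rather than Vehicle0 and Vehicle1), one possible follow-up, assuming df1 was built as above, is a rename:

# map the counter-based names onto the requested ones
df1 = df1.rename(columns={'Vehicle0': 'Vehicle', 'Vehicle1': 'Vehicle2'})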
I am trying to compute a ratio (or %): for each Rep, the number of grouped-by Business values that have at least one of two possible Service values (Food or Beverage), divided by the number of unique Business values in the df for that Rep. I am having trouble with it.
Original df:

Rep    Business    Service
Cindy  Shakeshake  Food
Cindy  Shakeshake  Outdoor
Kim    BurgerKing  Beverage
Kim    Burgerking  Phone
Kim    Burgerking  Car
Nate   Tacohouse   Food
Nate   Tacohouse   Car
Tim    Cofeeshop   Coffee
Tim    Coffeeshop  Seating
Cindy  Italia      Seating
Cindy  Italia      Coffee
Desired output:

Rep    %
Cindy  0.5
Kim    1
Nate   1
Tim    0
Where % is the number of businesses Cindy has with at least one Food or Beverage row, divided by the number of unique businesses in the df for her.
I am trying something like the below:
s = (df.assign(Service=df.Service.isin(['Food', 'Beverage']).astype(int))
       .groupby('Rep')
       .agg({'Business': 'nunique', 'Service': 'count'}))
s['Service'] / s['Business']
But this doesn't give me what I'm looking for: Service just counts all the rows in the df for Cindy (4 in this case), and the Business column isn't giving me an accurate count of the businesses where she has a Food or Beverage row.
Thanks for looking, and thanks in advance for any help.
I think you need to aggregate with sum to count the matched values:
df1 = (df.assign(Service=df.Service.isin(['Food', 'Beverage']).astype(int))
         .groupby('Rep')
         .agg({'Business': 'nunique', 'Service': 'sum'}))
print (df1)
Business Service
Rep
Cindy 2 1
Kim 2 1
Nate 1 1
Tim 2 0
s = df1['Service']/df1['Business']
print (s)
Cindy 0.5
Kim 0.5
Nate 1.0
Tim 0.0
dtype: float64
There is a small mistake that you made in your code here:
s=(df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
.groupby('Rep')
.agg({'Business':'nunique','Service':'count'}))
s['Service']/s['Business']
You would need to change 'Service': 'count' to 'Service': 'sum'. count just counts the number of rows each Rep has, while sum counts the number of rows for each Rep whose Service is Food or Beverage (those rows were converted to 1 by the assign).
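To see the difference between the two aggregations, here is a small sketch (assuming df is the original DataFrame from the question) that compares count and sum on the 0/1 flag column:

# 1 where the row's Service is Food or Beverage, else 0
flags = df.Service.isin(['Food', 'Beverage']).astype(int)

# count = total rows per Rep; sum = rows per Rep flagged as Food or Beverage
print(flags.groupby(df['Rep']).agg(['count', 'sum']))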
From a pandas DataFrame with two string columns, looking like:
import pandas as pd

d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship from NAME to SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see the "Will" case).
So far I have:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last one.
This line returns the SCHOOL column as a np.array instead of a string, which makes it very difficult to work further with this df.
Both problems were solved based on #IanS comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA