Pandas groupby transform to get not null date value - python

I have a dataframe constructed like so:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 1, 2, 3, 4],
                   'birthdate': ['01-01-01', '02-02-02', '03-03-03', '04-04-04',
                                 '', '02-02-02', '03-04-04', '04-03-04']})
df['birthdate'] = pd.to_datetime(df['birthdate'])
I want to do a groupby and change the original data using pandas .transform.
The condition is that, for each id, I want to pick the birthdate value of the first non-null row.
I know I could use max to get rid of the null entries if there were no other option, but if there are inconsistencies I don't necessarily want the maximum date, just the one that occurs first in the dataframe.
As such:
df['birthdate'] = df.groupby('id')['birthdate'].transform(max)
This is how the output looks using max:
id birthdate
0 1 2001-01-01
1 2 2002-02-02
2 3 2003-03-03
3 4 2004-04-04
4 1 2001-01-01
5 2 2002-02-02
6 3 2004-03-04
7 4 2004-04-04
This is how I actually want it to look:
id birthdate
0 1 2001-01-01
1 2 2002-02-02
2 3 2003-03-03
3 4 2004-04-04
4 1 2001-01-01
5 2 2002-02-02
6 3 2003-03-03
7 4 2004-04-04
I'm pretty sure I have to create a custom lambda to put inside the .transform, but I am unsure what condition to use.

You can try the following. Your dataframe definition and suggested outputs contain different dates, so I assumed your dataframe definition was correct:
df['birthdate'] = df.groupby('id')['birthdate'].transform('first')
which outputs:
id birthdate
0 1 2001-01-01
1 2 2002-02-02
2 3 2003-03-03
3 4 2004-04-04
4 1 2001-01-01
5 2 2002-02-02
6 3 2003-03-03
7 4 2004-04-04
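For what it's worth, groupby's 'first' aggregation skips nulls, which is why the NaT birthdate for id 1 gets filled in. If you did want the custom lambda the question mentions, an explicit equivalent might look like this (a sketch only; the pd.NaT fallback for all-null groups is an assumption):
df['birthdate'] = df.groupby('id')['birthdate'].transform(
    lambda s: s.dropna().iloc[0] if s.notna().any() else pd.NaT  # first non-null value, else NaT
)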

Related

Pandas forward fill entire rows according to missing integers in specific column

I can't seem to figure out a clean way to forward fill entire rows based on missing integers in a specific column. For example I have the dataframe:
df = pd.DataFrame({'frame':[0,3,5], 'value': [1,2,3]})
frame value
0 0 1
1 3 2
2 5 3
And I'd like to create new rows where the frame column has integer gaps, and the values are forward filled for all other columns. For example the output in this case would look like:
frame value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
You can set frame as index and reindex with ffill:
import numpy as np

frame_range = np.arange(df['frame'].min(), df['frame'].max() + 1)
df.set_index('frame').reindex(frame_range).ffill().reset_index()
Alternatively, you can use merge and then ffill:
# frame_range as above
df.merge(pd.DataFrame({'frame':frame_range}),
on='frame', how='outer').ffill()
Output:
frame value
0 0 1.0
1 1 1.0
2 2 1.0
3 3 2.0
4 4 2.0
5 5 3.0
Update: merge_asof is actually a better choice:
pd.merge_asof(pd.DataFrame({'frame':frame_range}),
df,
on='frame')
Output:
frame value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
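For completeness, a self-contained version of the merge_asof approach might look like this (a sketch; the data is just the example from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'frame': [0, 3, 5], 'value': [1, 2, 3]})

# Every integer frame between the observed minimum and maximum.
frame_range = np.arange(df['frame'].min(), df['frame'].max() + 1)

# merge_asof matches each frame to the nearest earlier (or equal) existing frame,
# since its default direction is 'backward', which effectively forward fills the
# other columns without converting them to float.
filled = pd.merge_asof(pd.DataFrame({'frame': frame_range}), df, on='frame')
print(filled)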

How could I replace null value In a group?

I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the value 0 with the difference from the next lower price in the same group?
for example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, nan, and I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
You can first remove duplicates across all the columns used for the groupby together with price, create the new column in that filtered data, and finally use merge with a left join back to the original:
df1 = (df.dropna()
.sort_values('price',ascending=False)
.drop_duplicates(['neighborhoodname','beds','baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Alternatively, you can use a lambda function that back-fills the 0 values within each group; this avoids wrong outputs for one-row groups (where data could otherwise be pulled in from another group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
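To see what the lambda is doing, here is a minimal sketch on group a alone (prices already sorted descending, values taken from the example data):
import numpy as np
import pandas as pd

prices = pd.Series([5, 5, 4, 2])        # group 'a', sorted descending
raw = prices.diff(-1)                   # [0.0, 1.0, 2.0, NaN] -- equal prices give 0
fixed = raw.replace(0, np.nan).bfill()  # [1.0, 1.0, 2.0, NaN] -- ties take the next gap below
print(raw.tolist(), fixed.tolist())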

Elegant way to drop records in pandas based on size/count of a record

This isn't a duplicate; I am not trying to drop rows based on the index.
I have a dataframe like as shown below
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00',
               '2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00',
               '2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00',
               '2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,2,3,1,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
I would like to drop records based on subject_id if their count is <=5.
This is what I tried
df1 = df.groupby(['subject_id']).size().reset_index(name='counter')
df1[df1['counter'] > 5] # this gives the valid subject_id (subject_id = 1 has a count of more than 5)
Now, using this subject_id, I have to get the rows of the base dataframe for that subject_id.
There might be an elegant way to do this.
I would like to get the output shown below, keeping the rows of my base dataframe.
Use:
df[df.groupby('subject_id')['subject_id'].transform('size')>5]
Output:
subject_id time_1 val day
0 1 2173-04-03 12:35:00 5 3
1 1 2173-04-03 12:50:00 2 3
2 1 2173-04-05 12:59:00 3 5
3 1 2173-05-04 13:14:00 1 4
4 1 2173-05-05 13:37:00 1 5
5 1 2173-07-06 13:39:00 6 6
6 1 2173-07-08 11:30:00 5 8
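An alternative with the same result is groupby().filter, which keeps only the groups that satisfy a condition; it reads nicely, though it is usually slower than the vectorized transform on large frames:
df.groupby('subject_id').filter(lambda g: len(g) > 5)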

Assign the frequency of each value to dataframe with new column

I am trying to set up a DataFrame that contains a column called Frequency.
For every row, this column should show how often that row's value occurs in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example.
I already tried it with value_counts(); however, I only get a value in the last row in which each number appears.
In the case of the example:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, and it should preferably be appended to the same dataframe.
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
Use pandas.Series.map:
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency'] = df['Category'].replace(df['Category'].value_counts())
Output:
   Index  Category  Frequency
0      0         1          1
1      1         3          2
2      2         3          2
3      3         4          1
4      4         7          3
5      5         7          3
6      6         7          3
7      7         8          1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index contains the category values and whose values are the counts. You can then use map or pandas.Series.replace to create a Series in which each category value is replaced by its count, and finally assign this Series to the Frequency column.
You can also do it with groupby, as shown below:
df.groupby("Category") \
.apply(lambda g: g.assign(frequency = len(g))) \
.reset_index(level=0, drop=True)
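For reference, here is a self-contained version of the map approach, with the example data reconstructed as code (the construction itself is an assumption, since the question only shows the data as a table):
import pandas as pd

df = pd.DataFrame({'Index': range(8),
                   'Category': [1, 3, 3, 4, 7, 7, 7, 8]})
# Map each Category value to the number of times it occurs in the column.
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
print(df)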

Pandas calculate average value of column for rows satisfying condition

I have a dataframe containing information about users rating items over a period of time.
In the dataframe I have a number of rows with identical 'user_id' and 'business_id', which I retrieve using the following code:
mask = reviews_df.duplicated(subset=['user_id','business_id'], keep=False)
dup = reviews_df[mask]
obtaining the subset of duplicated rows.
I now need to remove all such duplicates from the original dataframe and substitute them with their average. Is there a fast and elegant way to achieve this? Thanks!
Say you have a dataframe that looks like this:
review_id user_id business_id stars date
0 1 0 3 2.0 2019-01-01
1 2 1 3 5.0 2019-11-11
2 3 0 2 4.0 2019-10-22
3 4 3 4 3.0 2019-09-13
4 5 3 4 1.0 2019-02-14
5 6 0 2 5.0 2019-03-17
Then the solution should be something like this:
df.loc[df.duplicated(['user_id', 'business_id'], keep=False)]\
.groupby(['user_id', 'business_id'])\
.apply(lambda x: x.stars - x.stars.mean())
With the following result:
user_id business_id
0 2 2 -0.5
5 0.5
3 4 3 1.0
4 -1.0
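The snippet above shows each duplicated rating's deviation from its group mean. If the goal is literally to replace the duplicate rows with a single row carrying their average rating, one way to do it might be the following (a sketch under an assumed interpretation; column names are taken from the example above):
import pandas as pd

reviews_df = pd.DataFrame({
    'review_id': [1, 2, 3, 4, 5, 6],
    'user_id': [0, 1, 0, 3, 3, 0],
    'business_id': [3, 3, 2, 4, 4, 2],
    'stars': [2.0, 5.0, 4.0, 3.0, 1.0, 5.0],
    'date': pd.to_datetime(['2019-01-01', '2019-11-11', '2019-10-22',
                            '2019-09-13', '2019-02-14', '2019-03-17'])})

# Collapse each (user_id, business_id) pair to one row: average the stars,
# keep the first review_id and date. Pairs without duplicates are unchanged.
deduped = (reviews_df
           .groupby(['user_id', 'business_id'], as_index=False)
           .agg({'review_id': 'first', 'stars': 'mean', 'date': 'first'}))
print(deduped)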
