How could I replace null values in a group? - python

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 with the difference to the next lower price in the same group?
For example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, nan, and I'm looking for 2, 2, 1, nan (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
import pandas as pd

data = [
    [1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],
    [5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns=['id','neighborhoodname','beds','baths','price'])
df['difference_price'] = (df.dropna()
    .sort_values('price', ascending=False)
    .groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))

I think you can first remove duplicates across all the columns used for the groupby plus price, create the new column on the filtered data, and last merge it back to the original with a left join:
df1 = (df.dropna()
    .sort_values('price', ascending=False)
    .drop_duplicates(['neighborhoodname','beds','baths','price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
# the left join broadcasts each unique price's difference back to the duplicate rows
df = df.merge(df1[['neighborhoodname','beds','baths','price','difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back fills the 0 values per group; this avoids wrong output for one-row groups, where a value would otherwise be moved in from another group:
import numpy as np

# group_keys=False keeps the original index so the result aligns with df
df['difference_price'] = (df.sort_values('price', ascending=False)
    .groupby(['neighborhoodname','beds','baths'], group_keys=False)['price']
    .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
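To make the one-row-group point concrete, here is a minimal sketch (the tiny frame and the extra group 'c' are invented for illustration) showing that nothing leaks between groups:
import numpy as np
import pandas as pd

data = [[1,'a',1,1,5],[2,'a',1,1,5],[3,'c',2,2,7]]
df = pd.DataFrame(data, columns=['id','neighborhoodname','beds','baths','price'])

# group_keys=False keeps the original index so the assignment aligns
df['difference_price'] = (df.sort_values('price', ascending=False)
    .groupby(['neighborhoodname','beds','baths'], group_keys=False)['price']
    .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))

print (df)
# both 'a' rows stay NaN (no lower price in their group) and the lone
# 'c' row stays NaN instead of borrowing a difference from another group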

Related

fill nan values with values from another row with common values in two or more columns [duplicate]

I am trying to impute/fill values using rows that have matching values in other columns.
For example, I have this dataframe:
one | two | three
  1 |   1 |    10
  1 |   1 |   nan
  1 |   1 |   nan
  1 |   2 |   nan
  1 |   2 |    20
  1 |   2 |   nan
  1 |   3 |   nan
  1 |   3 |   nan
I want to use columns one and two as keys: if column three is not entirely NaN within a group, fill the missing values from a row with the same keys that does have a value in column three.
Here is my desired result:
one | two | three
  1 |   1 |    10
  1 |   1 |    10
  1 |   1 |    10
  1 |   2 |    20
  1 |   2 |    20
  1 |   2 |    20
  1 |   3 |   nan
  1 |   3 |   nan
You can see that the group with keys 1 and 3 stays NaN because no existing value is available for it.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill, which gives me a rather strange result where it forward fills column two instead. I am using this code for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward filling) and bfill (backward filling) per group, which needs apply with a lambda:
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
    .apply(lambda x: x.ffill().bfill()))
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if there are multiple values per group and you need to replace NaN by some constant - e.g. the mean by group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = (df.groupby(['one','two'], sort=False, group_keys=False)['three']
    .apply(lambda x: x.fillna(x.mean())))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
You can also sort the data by the column with missing values, then groupby and forward fill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
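One caveat with the sort-based approach: sort_values reorders the frame. A minimal sketch of the full round trip, assuming you want the original row order back afterwards:
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 1, 1, 1, 1, 1, 1, 1],
                   'two': [1, 1, 1, 2, 2, 2, 3, 3],
                   'three': [10, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan]})

# NaN sorts last, so each group's real value comes first and ffill suffices
df = df.sort_values('three')
df['three'] = df.groupby(['one','two'])['three'].ffill()
df = df.sort_index()   # restore the original row order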


pandas number of items in one column per value in another column

I have two dataframes. Say, for example, frame 1 is the student info:
student_id course
1 a
2 b
3 c
4 a
5 f
6 f
Frame 2 is each interaction the student has with a program:
student_id day number_of_clicks
1 4 60
1 5 34
1 7 87
2 3 33
2 4 29
2 8 213
2 9 46
3 2 103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database and the sum of all their clicks on those days, e.g.:
student_id course no_days total_clicks
1 a 3 181
2 b 4 321
3 c 1 103
4 a 0 0
5 f 0 0
6 f 0 0
I've tried to do this with groupby, but I couldn't add the information back into frame 1 or figure out how to sum the number of clicks. Any ideas?
First we aggregate your df2 to the desired information using GroupBy.agg. Then we merge that information into df1:
agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),
    total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
student_id course no_days total_clicks
0 1 a 3.0 181.0
1 2 b 4.0 321.0
2 3 c 1.0 103.0
3 4 a 0.0 0.0
4 5 f 0.0 0.0
5 6 f 0.0 0.0
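Note that the left join introduces NaN for students without interactions, so fillna(0) leaves the new columns as floats (as in the output above). If you want integer counts back, a small follow-up:
# the NaNs from the left join upcast to float; cast back once filled
df1[['no_days', 'total_clicks']] = df1[['no_days', 'total_clicks']].astype(int)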
Or if you like one-liners, here's the same method as above, but in one line of code and more of an SQL style:
df1.merge(
    df2.groupby('student_id').agg(
        no_days=('day', 'size'),
        total_clicks=('number_of_clicks', 'sum')
    ),
    on='student_id',
    how='left'
).fillna(0)
Or use merge, fillna the null values, and then aggregate using groupby.agg (as_index=False already gives a flat index, so no reset_index is needed):
import numpy as np

df = df1.merge(df2, how='left').fillna(0, downcast='infer')\
        .groupby(['student_id', 'course'], as_index=False)\
        .agg({'day': np.count_nonzero, 'number_of_clicks': np.sum})
print(df)
student_id course day number_of_clicks
0 1 a 3 181
1 2 b 4 321
2 3 c 1 103
3 4 a 0 0
4 5 f 0 0
5 6 f 0 0

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where each 'a' column has the same total as the other.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique items in the index. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
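For comparison, here is a minimal sketch of the aggregate-and-merge route that transform replaces (the totals frame name is just for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

# aggregate to one row per group, then broadcast back with a left join
totals = df.groupby('a', as_index=False)['b'].sum().rename(columns={'b': 'total'})
df = df.merge(totals, on='a', how='left')
# same values as df.groupby('a')['b'].transform('sum'), in two steps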

Loop over groups Pandas Dataframe and get sum/count

I am using Pandas to structure and process Data.
This is my DataFrame (shown as an image in the original post), and this is the code which produced it:
(data[['time_bucket', 'beginning_time', 'bitrate', 2, 3]].groupby(['time_bucket', 'beginning_time', 2, 3])).aggregate(np.mean)
Now I want the sum (ideally, the sum and the count) of my 'bitrate' values grouped in the same time_bucket. For example, for the first time_bucket (2016-07-08 02:00:00, 2016-07-08 02:05:00), the sum should be 93750000 and the count 25 across all 'bitrate' values.
I did this :
data[['time_bucket', 'bitrate']].groupby(['time_bucket']).agg(['sum', 'count'])
And this is the result (also shown as an image in the original post):
But I really want to have all my data in one DataFrame.
Can I do a simple loop over 'time_bucket' and apply a function which calculates the sum of all bitrates?
Any ideas? Thanks!
I think you need merge, but both DataFrames must have the same index levels, so first use reset_index; at the end restore the original MultiIndex with set_index:
import numpy as np
import pandas as pd

data = pd.DataFrame({'A':[1,1,1,1,1,1],
                     'B':[4,4,4,5,5,5],
                     'C':[3,3,3,1,1,1],
                     'D':[1,3,1,3,1,3],
                     'E':[5,3,6,5,7,1]})
print (data)
A B C D E
0 1 4 3 1 5
1 1 4 3 3 3
2 1 4 3 1 6
3 1 5 1 3 5
4 1 5 1 1 7
5 1 5 1 3 1
df1 = data[['A', 'B', 'C', 'D','E']].groupby(['A', 'B', 'C', 'D']).aggregate(np.mean)
print (df1)
E
A B C D
1 4 3 1 5.5
3 3.0
5 1 1 7.0
3 3.0
df2 = data[['A', 'C']].groupby(['A'])['C'].agg(['sum', 'count'])
print (df2)
sum count
A
1 12 6
print (pd.merge(df1.reset_index(['B','C','D']), df2, left_index=True, right_index=True)
.set_index(['B','C','D'], append=True))
E sum count
A B C D
1 4 3 1 5.5 12 6
3 3.0 12 6
5 1 1 7.0 12 6
3 3.0 12 6
I tried another solution that builds the output from df1 alone, but df1 is already aggregated, so it is impossible to recover the right data: if you sum column C from it, you get 8 instead of 12.
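A quick sketch of that pitfall, reusing df1 from above: the mean aggregation collapsed duplicate rows, so re-deriving the sum of C from df1 undercounts:
# df1 kept one row per (A, B, C, D) combination, so C appears as
# [3, 3, 1, 1] instead of the raw data's [3, 3, 3, 1, 1, 1]
print (df1.reset_index().groupby('A')['C'].sum())
# A
# 1    8
# Name: C, dtype: int64   <- 8 instead of the correct 12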
