Hi there.
I have a pandas DataFrame (df) like this:
   foo  id1   bar  id2
0  8.0    1  NULL    1
1  5.0    1  NULL    1
2  3.0    1  NULL    1
3  4.0    1     1    2
4  7.0    1     3    2
5  9.0    1     4    3
6  5.0    1     2    3
7  7.0    1     3    1
...
I want to group by id1 and id2 and try to get the mean of foo and bar.
My code:
res = df.groupby(["id1","id2"])["foo","bar"].mean()
What I get is almost what I expect:
              foo
id1 id2
1   1    5.750000
    2    7.000000
2   1    3.500000
    2    1.500000
3   1    6.000000
    2    5.333333
The values in column "foo" are exactly the average values (means) that I am looking for but where is my column "bar"?
If this were SQL, I would be looking for the result of:
"select avg(foo), avg(bar) from dataframe group by id1, id2;"
(Sorry, I am more of an SQL person and new to pandas, but I need this now.)
What I alternatively tried:
groupedFrame = res.groupby(["id1","id2"])
aggrFrame = groupedFrame.aggregate(numpy.mean)
Which gives me exactly the same result, still missing column "bar".
Sites I read:
http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html
and documentation for group-by but I cannot post the link here.
What am I doing wrong? Thanks in advance.
The problem is that your column bar is not numeric, so the aggregation function omits it.
You can check the dtype of the omitted column - it is not numeric:
print(df['bar'].dtype)
object
See automatic exclusion of nuisance columns in the pandas docs.
The solution is to convert the string values to numeric before aggregating and, where that is not possible, insert NaNs by using to_numeric with errors='coerce':
df['bar'] = pd.to_numeric(df['bar'], errors='coerce')
res = df.groupby(["id1","id2"])[["foo","bar"]].mean()
print (res)
          foo  bar
id1 id2
1   1    5.75  3.0
    2    5.50  2.0
    3    7.00  3.0
But if you have mixed data - numeric values together with strings - you can use replace first:
df['bar'] = df['bar'].replace("NULL", np.nan)
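Putting the pieces together, here is a self-contained sketch of the fix, using data modeled on the question (note the double brackets, which current pandas requires when selecting multiple columns from a groupby):

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question: "bar" holds strings, including "NULL"
df = pd.DataFrame({
    "foo": [8.0, 5.0, 3.0, 4.0, 7.0, 9.0, 5.0, 7.0],
    "id1": [1, 1, 1, 1, 1, 1, 1, 1],
    "bar": ["NULL", "NULL", "NULL", "1", "3", "4", "2", "3"],
    "id2": [1, 1, 1, 2, 2, 3, 3, 1],
})

# "bar" is object dtype, so mean() silently drops it
print(df["bar"].dtype)  # object

# Convert: unparseable strings like "NULL" become NaN and are skipped by mean()
df["bar"] = pd.to_numeric(df["bar"], errors="coerce")
res = df.groupby(["id1", "id2"])[["foo", "bar"]].mean()
print(res)
```

This matches the SQL `select avg(foo), avg(bar) ... group by id1, id2`, since both `avg` and `mean()` skip NULL/NaN values.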
As stated earlier, you should replace your NULL values before taking the mean; replacing them with NaN lets mean() skip them:
df["bar"] = df["bar"].replace("NULL", np.nan).astype(float)
df.groupby(["id1","id2"])[["foo","bar"]].mean()
Output:
          foo  bar
id1 id2
1   1    5.75  3.0
    2    5.50  2.0
    3    7.00  3.0
Related
I am trying to fix one issue with this dataset.
The link is here.
So, I loaded the dataset this way.
df = pd.read_csv('ratings.csv', sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'])
num_of_unique_users = len(df['user_id'].unique())
The number of unique users is 69878.
If we print out the last rows of the dataset, we can see that the user ids go above 69878, so there must be missing user ids in between.
The same goes for movie id: the ids go beyond the actual number of unique movie ids.
I want to remap the ids so that they match up with the existing ones and do not exceed 69878.
For example, the user id 71567 would become 69878, the last number of unique user ids, and the movie id 65133 would become 10677, the last unique movie id.
Actual
user_id movie_id rating timestamp
0 1 122 5.0 838985046
1 1 185 5.0 838983525
2 1 231 5.0 838983392
3 1 292 5.0 838983421
4 1 316 5.0 838983392
... ... ... ... ...
10000044 71567 1984 1.0 912580553
10000045 71567 1985 1.0 912580553
10000046 71567 1986 1.0 912580553
10000047 71567 2012 3.0 912580722
10000048 71567 2028 5.0 912580344
Desired
user_id movie_id rating timestamp
0 1 122 5.0 838985046
1 1 185 5.0 838983525
2 1 231 5.0 838983392
3 1 292 5.0 838983421
4 1 316 5.0 838983392
... ... ... ... ...
10000044 69878 1984 1.0 912580553
10000045 69878 1985 1.0 912580553
10000046 69878 1986 1.0 912580553
10000047 69878 2012 3.0 912580722
10000048 69878 2028 5.0 912580344
Is there any way to do this with pandas?
Here's a way to do this:
df2 = df.groupby('user_id').count().reset_index()
df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')
df = df.join(df2['new_user_id'], on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})
df2 = df.groupby('movie_id').count().reset_index()
df2 = df2.assign(new_movie_id=df2.index + 1).set_index('movie_id')
df = df.join(df2['new_movie_id'], on='movie_id').drop(columns=['movie_id']).rename(columns={'new_movie_id':'movie_id'})
df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)
Sample input:
user_id movie_id rating timestamp
0 1 2 5.0 838985046
1 1 4 5.0 838983525
2 3 4 5.0 838983392
3 3 6 5.0 912580553
4 5 2 5.0 912580722
5 5 6 5.0 912580344
Sample output:
user_id movie_id rating timestamp
0 1 1 5.0 838985046
1 1 2 5.0 838983525
2 2 2 5.0 838983392
3 2 3 5.0 912580553
4 3 1 5.0 912580722
5 3 3 5.0 912580344
Here are intermediate results and explanations.
First we do this:
df2 = df.groupby('user_id').count().reset_index()
Output:
user_id movie_id rating timestamp
0 1 2 2 2
1 3 2 2 2
2 5 2 2 2
What we have done above is to use groupby to get one row per unique user_id. We call count just to convert the output (a groupby object) back to a dataframe. We call reset_index to create a new integer range index with no gaps. (NOTE: the only column we care about for future use is user_id.)
Next we do this:
df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')
Output:
movie_id rating timestamp new_user_id
user_id
1 2 2 2 1
3 2 2 2 2
5 2 2 2 3
The assign call creates a new column named new_user_id, which we fill using the 0-based index plus 1 (so that we will not have id values < 1). The set_index call replaces our index with user_id in anticipation of using this dataframe's index as the target of a later call to join.
The next step is:
df = df.join(df2['new_user_id'], on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})
Output:
movie_id rating timestamp user_id
0 2 5.0 838985046 1
1 4 5.0 838983525 1
2 4 5.0 838983392 2
3 6 5.0 912580553 2
4 2 5.0 912580722 3
5 6 5.0 912580344 3
Here we have taken just the new_user_id column of df2 and called join on the df object, directing the method to use the user_id column (the on argument) in df to join with the index (which was originally the user_id column in df2). This creates a df with the desired new-paradigm user_id values in the column named new_user_id. All that remains is to drop the old-paradigm user_id column and rename new_user_id to be user_id, which is what the calls to drop and rename do.
The logic for changing the movie_id values to the new paradigm (i.e., eliminating gaps in the unique value set) is completely analogous. When we're done, we have this output:
rating timestamp user_id movie_id
0 5.0 838985046 1 1
1 5.0 838983525 1 2
2 5.0 838983392 2 2
3 5.0 912580553 2 3
4 5.0 912580722 3 1
5 5.0 912580344 3 3
To finish up, we reorder the columns to look like the original using this code:
df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)
Output:
user_id movie_id rating timestamp
0 1 1 5.0 838985046
1 1 2 5.0 838983525
2 2 2 5.0 838983392
3 2 3 5.0 912580553
4 3 1 5.0 912580722
5 3 3 5.0 912580344
UPDATE:
Here is an alternative solution which uses Series.unique() instead of groupby and saves a couple of lines:
df2 = pd.DataFrame(df.user_id.unique(), columns=['user_id']
).reset_index().set_index('user_id').rename(columns={'index':'new_user_id'})['new_user_id'] + 1
df = df.join(df2, on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})
df2 = pd.DataFrame(df.movie_id.unique(), columns=['movie_id']
).reset_index().set_index('movie_id').rename(columns={'index':'new_movie_id'})['new_movie_id'] + 1
df = df.join(df2, on='movie_id'
).drop(columns=['movie_id']).rename(columns={'new_movie_id':'movie_id'})
df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)
The idea here is:
Line 1:
use unique to get the unique values of user_id without bothering to count duplicates or maintain other columns (which is what groupby did in the original solution above)
create a new dataframe containing these unique values in a column named new_user_id
call reset_index to get an index that is a non-gapped integer range (one integer for each unique user_id)
call set_index which will create a column named 'index' containing the previous index (0..number of unique user_id values) and make user_id the new index
rename the column labeled 'index' to be named new_user_id
access the new_user_id column and add 1 to convert from 0-offset to 1-offset id value.
Line 2:
call join just as we did in the original solution, except that the other dataframe is simply df2 (which is fine since it has only a single column, new_user_id).
The logic for movie_id is completely analogous, and the final line using concat is the same as in the original solution above.
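As a further alternative (my suggestion, not part of the solutions above), pd.factorize does the renumbering in one call: it assigns each unique value a 0-based code in order of first appearance, which is exactly a gap-free id.

```python
import pandas as pd

# Same sample input as above
df = pd.DataFrame({
    "user_id":  [1, 1, 3, 3, 5, 5],
    "movie_id": [2, 4, 4, 6, 2, 6],
    "rating":   [5.0] * 6,
})

# factorize returns (codes, uniques); the codes are a gap-free 0-based
# renumbering in order of first appearance, so we just add 1 for 1-based ids
df["user_id"] = pd.factorize(df["user_id"])[0] + 1
df["movie_id"] = pd.factorize(df["movie_id"])[0] + 1
print(df)
```

Because both unique() and factorize order values by first appearance, this produces the same mapping as the join-based solutions, without building an intermediate dataframe.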
Say I have a huge DataFrame that contains only a handful of cells matching the filter I apply. How can I end up with just the matching values (plus their indexes and columns) in a new dataframe, without the rest of the DataFrame turning into NaN? Dropping NaNs with dropna removes the whole column or row, and filtering replaces non-matches with NaNs.
Here's my code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((1000, 1000)))
# this one is almost filled with Nans
df[df<0.01]
If you need the non-missing values in another format, you can use DataFrame.stack:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
# this one is almost filled with Nans
df1 = df[df<7]
print (df1)
0 1 2
0 0.0 NaN 3.0
1 6.0 3.0 3.0
2 NaN NaN 0.0
3 0.0 NaN NaN
4 3.0 NaN 2.0
df2 = df1.stack().rename_axis(('a','b')).reset_index(name='c')
print (df2)
a b c
0 0 0 0.0
1 0 2 3.0
2 1 0 6.0
3 1 1 3.0
4 1 2 3.0
5 2 2 0.0
6 3 0 0.0
7 4 0 3.0
8 4 2 2.0
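The same idea applied to the question's large random frame: keep only the cells below the threshold, with their row/column labels, in long format (column names row/col/value are my choice).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((1000, 1000)))

matches = (df[df < 0.01]
           .stack()                      # drops the NaN cells
           .rename_axis(("row", "col"))
           .reset_index(name="value"))
print(matches.shape)  # roughly 10,000 rows, i.e. about 1% of the cells
```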
This is the dataframe I used.
token name ltp change
0 12345.0 abc 2.0 NaN
1 12345.0 abc 5.0 1.500000
2 12345.0 abc 3.0 -0.400000
3 12345.0 abc 9.0 2.000000
4 12345.0 abc 5.0 -0.444444
5 12345.0 abc 16.0 2.200000
6 6789.0 xyz 1.0 NaN
7 6789.0 xyz 5.0 4.000000
8 6789.0 xyz 3.0 -0.400000
9 6789.0 xyz 13.0 3.333333
10 6789.0 xyz 9.0 -0.307692
11 6789.0 xyz 20.0 1.222222
While trying to solve this question, I encountered this weird behaviour of pd.NamedAgg.
#Worked as intended
df.groupby('name').agg(pos=pd.NamedAgg(column='change',aggfunc=lambda x: x.gt(0).sum()),\
neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.lt(0).sum()))
# Output
pos neg
name
abc 3.0 2.0
xyz 3.0 2.0
When doing it over a specific column:
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.lt(0).sum()))
#Output
pos neg
name
abc 2.0 2.0
xyz 2.0 2.0
The pos column values are over-written with the neg column values.
Another example below:
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()))
#Output
pos neg
name
abc 4.855556 4.855556
xyz 7.847863 7.847863
Even weirder results:
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
max = pd.NamedAgg(column='ltp',aggfunc='max'))
# I'm applying on Series `'change'` but I passed `column='ltp'`, which should
# raise `KeyError: "Column 'ltp' does not exist!"`, yet it produces these results:
pos neg max
name
abc 4.855556 4.855556 2.2
xyz 7.847863 7.847863 4.0
The problem also shows up when using it with a pd.Series:
s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one = pd.NamedAgg(column='new',aggfunc='sum'))
one
1 2
2 4
3 6
4 4
5 5
Shouldn't it raise a KeyError?
Some more weird results: the values of the one column are not over-written when we use different column names.
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
second=pd.NamedAgg(column='something',aggfunc='max'))
one second
1 2 1
2 4 2
3 6 3
4 4 4
5 5 5
Values are over-written when we use the same column name in pd.NamedAgg
s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
second=pd.NamedAgg(column='weird',aggfunc='max'))
one second # Values of column `one` are over-written
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
My pandas version
pd.__version__
# '1.0.3'
From the pandas documentation:
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
In [82]: animals.groupby("kind").height.agg(
....: min_height='min',
....: max_height='max',
....: )
....:
Out[82]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
But I couldn't find out why using it with a column produces weird results.
UPDATE:
A bug report was filed by jezrael as GitHub issue #34380.
EDIT: This is a bug confirmed by pandas-dev, and it has been resolved in PR #30858 (BUG: aggregations were getting overwritten if they had the same name).
If columns are specified after groupby, use the approach described in this documentation paragraph:
Named aggregation is also valid for Series groupby aggregations. In this case there's no column selection, so the values are just the functions.
df = df.groupby('name')['change'].agg(pos = lambda x:x.gt(0).sum(),\
neg = lambda x:x.lt(0).sum())
print (df)
pos neg
name
abc 3.0 2.0
xyz 3.0 2.0
As for why using it with a column produces weird results: I think it is a bug; instead of returning wrong output it should raise an error.
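A runnable version of the workaround: after selecting ['change'] there is no column left to pick, so pass keyword=function pairs instead of pd.NamedAgg. The data below is a simplified stand-in for the question's dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["abc", "abc", "abc", "xyz", "xyz", "xyz"],
    "change": [1.5, -0.4, 2.0, 4.0, -0.4, 3.3],
})

# Series groupby: keyword names become output columns, values are the functions
res = df.groupby("name")["change"].agg(
    pos=lambda x: x.gt(0).sum(),   # count of positive changes
    neg=lambda x: x.lt(0).sum(),   # count of negative changes
)
print(res)
```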
I have a dataframe containing information about users rating items over a period of time.
In the dataframe I have a number of rows with identical 'user_id' and 'business_id', which I retrieve using the following code:
mask = reviews_df.duplicated(subset=['user_id','business_id'], keep=False)
dup = reviews_df[mask]
obtaining the subset of duplicated rows.
I now need to remove all such duplicates from the original dataframe and substitute them with their average. Is there a fast and elegant way to achieve this? Thanks!
So, if you have a dataframe that looks like:
review_id user_id business_id stars date
0 1 0 3 2.0 2019-01-01
1 2 1 3 5.0 2019-11-11
2 3 0 2 4.0 2019-10-22
3 4 3 4 3.0 2019-09-13
4 5 3 4 1.0 2019-02-14
5 6 0 2 5.0 2019-03-17
Then the solution could be something like this:
df.loc[df.duplicated(['user_id', 'business_id'], keep=False)]\
.groupby(['user_id', 'business_id'])\
.apply(lambda x: x.stars - x.stars.mean())
With the following result:
user_id business_id
0 2 2 -0.5
5 0.5
3 4 3 1.0
4 -1.0
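The snippet above returns each duplicate's deviation from its group mean. If the goal is instead to collapse the duplicates into a single averaged row, as the question asks, a minimal sketch on the same sample data could look like this (columns such as review_id or date would need their own aggregation choice, e.g. 'first'; here only stars is kept):

```python
import pandas as pd

df = pd.DataFrame({
    "review_id": [1, 2, 3, 4, 5, 6],
    "user_id": [0, 1, 0, 3, 3, 0],
    "business_id": [3, 3, 2, 4, 4, 2],
    "stars": [2.0, 5.0, 4.0, 3.0, 1.0, 5.0],
})

# One row per (user_id, business_id) pair, stars averaged over the group;
# non-duplicated pairs pass through unchanged since their mean is themselves
out = (df.groupby(["user_id", "business_id"], as_index=False)
         .agg({"stars": "mean"}))
print(out)
```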
I have a data frame
ID 2014-01-01 2015-01-01 2016-01-01
1 NaN 0.1 0.2
2 0.1 0.3 0.5
3 0.2 NaN 0.7
4 0.8 0.4 0.1
For each date(col), I want to get the rank of each id. For example, in col '2014-01-01', id = 4 has greatest value, so we assign rank 1 to id = 4. id = 3 has second greatest value, so we give it rank 2. If the data is NaN, just ignore it.
ID 2014-01-01 2015-01-01 2016-01-01
1 NaN 3 3
2 3 2 2
3 2 NaN 1
4 1 1 4
Next step is to get the average rank of each id. For example, AvgRank of id1 = (4+3)/2 = 3.5 and AvgRank of id2 = (3+2+2)/3 = 2.33
ID AvgRank
1 3
2 2.33
3 1.5
4 2
My algorithm is:
create a dictionary for each id ({str: list}) -> loop through all the columns -> for each column, calculate the rank and update the list in the dictionary
but I think this is too complicated for such a simple problem.
Is there an easy way to get the AvgRank table?
Here is the code to create the dataframe
df = pd.DataFrame({'ID':[1,2,3,4],'2014-01-01':[float('NaN'),0.1,0.2,0.8],
'2015-01-01':[0.1,0.3,float('NaN'),0.4],'2016-01-01':[0.2,0.5,0.7,0.1]})
It's unclear why you think the rank should be 4 for the first row value in the second column, but the following gives you what you want. Here we call rank on the columns of interest and pass method='dense' and ascending=False so it ranks in the desired order:
In [60]:
df.drop(columns='ID').rank(method='dense', ascending=False)
Out[60]:
2014-01-01 2015-01-01 2016-01-01
0 NaN 3 3
1 3 2 2
2 2 NaN 1
3 1 1 4
We then concat the single column from the orig df and rename the result of mean with axis=1 for row-wise mean:
In [67]:
pd.concat([df['ID'], df.drop(columns='ID').rank(method='dense', ascending=False).mean(axis=1)], axis=1).rename(columns={0:'AvgRank'})
Out[67]:
ID AvgRank
0 1 3.000000
1 2 2.333333
2 3 1.500000
3 4 2.000000
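Since .ix has been removed from pandas, a sketch of the same approach in current pandas: set ID as the index, rank every date column, then take the row-wise mean (which skips NaNs per row).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   '2014-01-01': [np.nan, 0.1, 0.2, 0.8],
                   '2015-01-01': [0.1, 0.3, np.nan, 0.4],
                   '2016-01-01': [0.2, 0.5, 0.7, 0.1]})

# Rank each date column, largest value = rank 1; NaNs stay NaN
ranks = df.set_index('ID').rank(method='dense', ascending=False)
avg_rank = ranks.mean(axis=1).rename('AvgRank')  # row-wise mean, NaNs skipped
print(avg_rank)
```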