I have a dataframe containing information about users rating items over a period of time. It looks like this:
In the dataframe I have a number of rows with identical 'user_id' and 'business_id', which I retrieve using the following code:
mask = reviews_df.duplicated(subset=['user_id','business_id'], keep=False)
dup = reviews_df[mask]
obtaining something like this:
I now need to remove all such duplicates from the original dataframe and substitute them with their average. Is there a fast and elegant way to achieve this? Thanks!
So if you have a dataframe that looks like
review_id user_id business_id stars date
0 1 0 3 2.0 2019-01-01
1 2 1 3 5.0 2019-11-11
2 3 0 2 4.0 2019-10-22
3 4 3 4 3.0 2019-09-13
4 5 3 4 1.0 2019-02-14
5 6 0 2 5.0 2019-03-17
Then a solution could look something like this (it computes how far each duplicated rating is from the mean of its group):
df.loc[df.duplicated(['user_id', 'business_id'], keep=False)]\
.groupby(['user_id', 'business_id'])\
.apply(lambda x: x.stars - x.stars.mean())
With the following result:
user_id business_id
0 2 2 -0.5
5 0.5
3 4 3 1.0
4 -1.0
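If the goal is to replace each set of duplicates with a single row holding their average, as the question asks, one possible sketch is to group on the two key columns and take the mean of stars. This assumes the sample dataframe shown above, and it assumes that review_id and date can be dropped, since they differ between the duplicated rows; the name averaged is only illustrative.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    "review_id": [1, 2, 3, 4, 5, 6],
    "user_id": [0, 1, 0, 3, 3, 0],
    "business_id": [3, 3, 2, 4, 4, 2],
    "stars": [2.0, 5.0, 4.0, 3.0, 1.0, 5.0],
    "date": pd.to_datetime([
        "2019-01-01", "2019-11-11", "2019-10-22",
        "2019-09-13", "2019-02-14", "2019-03-17",
    ]),
})

# One row per (user_id, business_id) pair; pairs that appear only once
# keep their single rating, duplicated pairs get the mean of their ratings.
averaged = df.groupby(["user_id", "business_id"], as_index=False)["stars"].mean()
print(averaged)
```

Because non-duplicated pairs are groups of size one, there is no need to split the dataframe with the duplicated mask first.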
I would like to process the grouped dataset below (the command that produces it is df.groupby(['ids','category'])['counts'].sum()):
ids|category|counts
1 A 3
2 B 5
3 A 1
B 1
C 1
4 C 3
What I am trying to get is below (every unique id with one row so that I can merge with another data table later):
ids|A_n|B_n|C_n
1 3 0 0
2 0 5 0
3 1 1 1
4 0 0 3
Is this possible? Any thoughts are welcome, thank you in advance!
Just unstack and fillna
group_df.unstack('category').fillna(0)
Output
counts
category A B C
ids
1 3.0 0.0 0.0
2 0.0 5.0 0.0
3 1.0 1.0 1.0
4 0.0 0.0 3.0
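To get closer to the layout requested in the question (integer counts, columns named A_n, B_n, C_n, and ids as an ordinary column for merging), a sketch along these lines may help. The input dataframe is reconstructed from the question, and the name wide is only illustrative.

```python
import pandas as pd

# Reconstructed from the grouped output shown in the question.
df = pd.DataFrame({
    "ids": [1, 2, 3, 3, 3, 4],
    "category": ["A", "B", "A", "B", "C", "C"],
    "counts": [3, 5, 1, 1, 1, 3],
})
group_df = df.groupby(["ids", "category"])["counts"].sum()

# fill_value=0 fills the missing combinations during the unstack itself,
# so the counts are not converted to floats by NaN.
wide = group_df.unstack("category", fill_value=0)

# Rename the columns to A_n, B_n, C_n and turn ids back into a column.
wide = wide.add_suffix("_n").reset_index()
wide.columns.name = None
print(wide)
```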
I am trying to fix one issue with this dataset.
The link is here.
So, I loaded the dataset this way.
df = pd.read_csv('ratings.csv', sep='::', engine='python', names=['user_id', 'movie_id', 'rating', 'timestamp'])
num_of_unique_users = len(df['user_id'].unique())
The number of unique users is 69878.
If we print the last rows of the dataset, we can see that the largest user id is above 69878.
So there are gaps in the user ids.
The same holds for movie ids: the largest movie id exceeds the number of unique movie ids.
I want to renumber the ids so that there are no gaps and no user id exceeds 69878.
For example, the user id 71567 should become 69878, the number of unique user ids, and the movie id 65133 should become 10677, the number of unique movie ids.
Actual
user_id movie_id rating timestamp
0 1 122 5.0 838985046
1 1 185 5.0 838983525
2 1 231 5.0 838983392
3 1 292 5.0 838983421
4 1 316 5.0 838983392
... ... ... ... ...
10000044 71567 1984 1.0 912580553
10000045 71567 1985 1.0 912580553
10000046 71567 1986 1.0 912580553
10000047 71567 2012 3.0 912580722
10000048 71567 2028 5.0 912580344
Desired
user_id movie_id rating timestamp
0 1 122 5.0 838985046
1 1 185 5.0 838983525
2 1 231 5.0 838983392
3 1 292 5.0 838983421
4 1 316 5.0 838983392
... ... ... ... ...
10000044 69878 1984 1.0 912580553
10000045 69878 1985 1.0 912580553
10000046 69878 1986 1.0 912580553
10000047 69878 2012 3.0 912580722
10000048 69878 2028 5.0 912580344
Is there any way to do this with pandas?
Here's a way to do this:
df2 = df.groupby('user_id').count().reset_index()
df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')
df = df.join(df2['new_user_id'], on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})
df2 = df.groupby('movie_id').count().reset_index()
df2 = df2.assign(new_movie_id=df2.index + 1).set_index('movie_id')
df = df.join(df2['new_movie_id'], on='movie_id').drop(columns=['movie_id']).rename(columns={'new_movie_id':'movie_id'})
df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)
Sample input:
user_id movie_id rating timestamp
0 1 2 5.0 838985046
1 1 4 5.0 838983525
2 3 4 5.0 838983392
3 3 6 5.0 912580553
4 5 2 5.0 912580722
5 5 6 5.0 912580344
Sample output:
user_id movie_id rating timestamp
0 1 1 5.0 838985046
1 1 2 5.0 838983525
2 2 2 5.0 838983392
3 2 3 5.0 912580553
4 3 1 5.0 912580722
5 3 3 5.0 912580344
Here are intermediate results and explanations.
First we do this:
df2 = df.groupby('user_id').count().reset_index()
Output:
user_id movie_id rating timestamp
0 1 2 2 2
1 3 2 2 2
2 5 2 2 2
What we have done above is to use groupby to get one row per unique user_id. We call count just to convert the output (a groupby object) back to a dataframe. We call reset_index to create a new integer range index with no gaps. (NOTE: the only column we care about for future use is user_id.)
Next we do this:
df2 = df2.assign(new_user_id=df2.index + 1).set_index('user_id')
Output:
movie_id rating timestamp new_user_id
user_id
1 2 2 2 1
3 2 2 2 2
5 2 2 2 3
The assign call creates a new column named new_user_id, which we fill using the 0-offset index plus 1 (so that we will not have id values < 1). The set_index call replaces our index with user_id in anticipation of using the index of this dataframe as the target for a later call to join.
The next step is:
df = df.join(df2['new_user_id'], on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})
Output:
movie_id rating timestamp user_id
0 2 5.0 838985046 1
1 4 5.0 838983525 1
2 4 5.0 838983392 2
3 6 5.0 912580553 2
4 2 5.0 912580722 3
5 6 5.0 912580344 3
Here we have taken just the new_user_id column of df2 and called join on the df object, directing the method to use the user_id column (the on argument) in df to join with the index (which was originally the user_id column in df2). This creates a df with the desired new-paradigm user_id values in the column named new_user_id. All that remains is to drop the old-paradigm user_id column and rename new_user_id to be user_id, which is what the calls to drop and rename do.
The logic for changing the movie_id values to the new paradigm (i.e., eliminating gaps in the unique value set) is completely analogous. When we're done, we have this output:
rating timestamp user_id movie_id
0 5.0 838985046 1 1
1 5.0 838983525 1 2
2 5.0 838983392 2 2
3 5.0 912580553 2 3
4 5.0 912580722 3 1
5 5.0 912580344 3 3
To finish up, we reorder the columns to look like the original using this code:
df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)
Output:
user_id movie_id rating timestamp
0 1 1 5.0 838985046
1 1 2 5.0 838983525
2 2 2 5.0 838983392
3 2 3 5.0 912580553
4 3 1 5.0 912580722
5 3 3 5.0 912580344
UPDATE:
Here is an alternative solution which uses Series.unique() instead of groupby and saves a couple of lines:
df2 = pd.DataFrame(df.user_id.unique(), columns=['user_id']
).reset_index().set_index('user_id').rename(columns={'index':'new_user_id'})['new_user_id'] + 1
df = df.join(df2, on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id':'user_id'})
df2 = pd.DataFrame(df.movie_id.unique(), columns=['movie_id']
).reset_index().set_index('movie_id').rename(columns={'index':'new_movie_id'})['new_movie_id'] + 1
df = df.join(df2, on='movie_id'
).drop(columns=['movie_id']).rename(columns={'new_movie_id':'movie_id'})
df = pd.concat([df[['user_id', 'movie_id']], df.drop(columns=['user_id', 'movie_id'])], axis=1)
The idea here is:
Line 1:
use unique to get the unique values of user_id without bothering to count duplicates or maintain other columns (which is what groupby did in the original solution above)
create a new dataframe containing these unique values in a column named user_id
call reset_index, which copies the existing non-gapped integer range index (0..number of unique user_id values - 1) into a column named 'index'
call set_index to make user_id the new index
rename the column labeled 'index' to be named new_user_id
access the new_user_id column and add 1 to convert from 0-offset to 1-offset id value.
Line 2:
call join just as we did in the original solution, except that the other dataframe is simply df2 (which is fine since it has only a single column, new_user_id).
The logic for movie_id is completely analogous, and the final line using concat is the same as in the original solution above.
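As a further alternative (not part of the answer above), pandas.factorize can produce the same gap-free numbering directly. This is a minimal sketch on the sample input from this answer; passing sort=True is an assumption made so that the new ids follow the sorted order of the old ones, matching the groupby-based solution.

```python
import pandas as pd

# Sample input copied from the answer above.
df = pd.DataFrame({
    "user_id": [1, 1, 3, 3, 5, 5],
    "movie_id": [2, 4, 4, 6, 2, 6],
    "rating": [5.0] * 6,
    "timestamp": [838985046, 838983525, 838983392,
                  912580553, 912580722, 912580344],
})

for col in ["user_id", "movie_id"]:
    # factorize returns 0-based integer codes, one per distinct value;
    # adding 1 converts them to 1-based ids.
    codes, _ = pd.factorize(df[col], sort=True)
    df[col] = codes + 1

print(df)
```

Since the columns are overwritten in place, the column order is preserved and no final reordering step is needed.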
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is, find the rate of change of data_col by using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff within each group:
df.groupby('ID_col').apply(
lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
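The same per-group differences can also be obtained without apply, by calling diff on the grouped columns. The following is a sketch using the sample data from the question; the result is aligned with the original index, so it can be assigned straight back as a new column.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID_col": [1, 1, 1, 2, 2, 3],
    "time_in_hours": [62.5, 40, 20, 30, 20, 50],
    "data_col": [4, 3, 3, 1, 5, 6],
})

grouped = df.groupby("ID_col")

# diff() on a grouped column restarts at every new ID, so the first row of
# each group is NaN and no values leak across group boundaries.
df["rate_of_change"] = (
    grouped["data_col"].diff() / grouped["time_in_hours"].diff().abs()
)
print(df)
```

Unlike the shift-based mask shown earlier, this does not rely on rows with the same ID being adjacent.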
I am working on pandas data frame, something like below:
id vals
0 1 11
1 1 5.5
2 1 -2
3 1 8
4 2 3
5 2 4
6 2 19
7 2 20
The above is just a small part of the df. The vals are grouped by id, and there will always be an equal number of vals per id; in the case above, there are 4 values each for id = 1 and id = 2.
What I am trying to achieve is to add the value at index 0 to the value at index 4, then the value at index 1 to the value at index 5, and so on.
The following is the expected df/series, say df2:
total
0 14
1 9.5
2 17
3 28
The real df has hundreds of ids, not just two as above.
groupby() can be used, but I don't see how to get the specific indices from each group.
Please let me know if anything is unclear.
Group by the modulo of the df.index values (this relies on the default 0..n-1 index) and take the sum of vals:
In [805]: df.groupby(df.index % 4).vals.sum()
Out[805]:
0 14.0
1 9.5
2 17.0
3 28.0
Name: vals, dtype: float64
Since there are exactly 4 values per ID, we can simply reshape the underlying 1D array data into a 2D array and sum along the appropriate axis (axis=0 in this case) -
pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Sample run -
In [192]: df
Out[192]:
id vals
0 1 11.0
1 1 5.5
2 1 -2.0
3 1 8.0
4 2 3.0
5 2 4.0
6 2 19.0
7 2 20.0
In [193]: pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Out[193]:
total
0 14.0
1 9.5
2 17.0
3 28.0
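Both approaches above depend on the rows being in order and on every id having exactly 4 values. If that cannot be relied on, one option is to compute each row's position within its id using groupby(...).cumcount() and group on that position instead. This is a sketch on the sample data from the question; the names position and total are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "vals": [11, 5.5, -2, 8, 3, 4, 19, 20],
})

# cumcount numbers the rows inside each id as 0, 1, 2, ...
position = df.groupby("id").cumcount()

# Summing by that position adds the first value of every id together,
# then the second value of every id, and so on.
total = df.groupby(position)["vals"].sum().rename("total")
print(total)
```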
I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where every row with the same 'a' value gets the same total.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique items in the index. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
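To illustrate why the original attempt produced NaN values, and how transform relates to an explicit lookup, here is a small sketch. Assigning a Series to a column aligns on the index, so the three group sums (indexed 0, 1, 2 after reset_index) only filled the first three rows. Mapping the group sums back through column 'a' gives the same result as transform; the column names total_map and total_transform are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 3, 3, 3], "b": [1, 2, 3, 4, 5, 6]})

# One sum per distinct value of 'a'; the result is indexed by 'a' (1, 2, 3).
sums = df.groupby("a")["b"].sum()

# Look up each row's group sum by its 'a' value.
df["total_map"] = df["a"].map(sums)

# transform does the same broadcast in one step.
df["total_transform"] = df.groupby("a")["b"].transform("sum")
print(df)
```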