This seems like a really easy task, but I have been struggling with it for a while.
I want to calculate (number of sessions with PageViews==1) / (total number of sessions) per day. Sample data below:
session_df
Date/Timestamp Session ID PageViews
2/14/2016 a 1
2/14/2016 b 5
2/14/2016 c 8
3/23/2016 d 1
3/23/2016 e 1
3/23/2016 f 2
and the expected output:
Date/Timestamp BounceRate
2/14/2016 0.333333333
3/23/2016 0.666666667
I tried first adding a Bounced? column based on the PageViews number, then grouping by date and calculating the percentage, but then I still need to filter out the Bounced?==False rows, which is very cumbersome. If anyone can suggest a better way of doing this, that would be great!
sessions_df['Bounced?'] = sessions_df['PageViews']>1
dt = pd.DatetimeIndex(sessions_df['Date/Timestamp'])
daily_session_bounce_rate = sessions_df.groupby([dt.date, 'Bounced?']).agg({'Session ID':'count'})
daily_session_bounce_rate = daily_session_bounce_rate.groupby(level=0).apply(lambda x: x / float(x.sum()))
daily_session_bounce_rate
# this is my output
Bounced? Session ID
2016-01-01 False 0.459893
True 0.540107
#filter data
daily_session_bounce_rate.loc[daily_session_bounce_rate['Bounced?']==True,['level_0','Session ID']]
You don't need to define a separate Bounced? column. Take the count of grouped rows where PageViews==1 and divide by the number of all rows for that date to get the fraction:
daily_session_bounce_rate = \
df[df['PageViews']==1].groupby('Date/Timestamp').agg({'Session ID':'count'}) /\
df.groupby('Date/Timestamp').agg({'Session ID':'count'})
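One caveat worth noting (my own addition, not part of the answer above): a date with no PageViews==1 sessions is missing from the numerator, so the division leaves NaN for that date. Assuming you want such days to show a 0% bounce rate:
# Hypothetical follow-up: fill dates that had no single-page sessions
daily_session_bounce_rate = daily_session_bounce_rate.fillna(0)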
You could try it like this:
bouncerate = (df.loc[df['PageViews'] == 1]
.groupby('Date/Timestamp')['Session ID'].count()
.div(df.groupby('Date/Timestamp')['Session ID']
.count())
.to_frame('Bounce Rate'))
Or:
bouncerate = (df.groupby('Date/Timestamp')
.apply(lambda x: sum(x.PageViews == 1) / x.PageViews.count())
.to_frame('Bounce Rate'))
Both result in:
>>> bouncerate
Bounce Rate
Date/Timestamp
2/14/2016 0.333333
3/23/2016 0.666667
You need:
grp = session_df.groupby(session_df['Date/Timestamp'].dt.date)['Session ID'].count()
session_1 = session_df.loc[session_df['PageViews']==1].groupby(session_df['Date/Timestamp'].dt.date)['Session ID'].count()
pd.DataFrame({'bouncerate': list(session_1/grp)}, index=session_df['Date/Timestamp'].unique())
Output:
bouncerate
2016-02-14 0.333333
2016-03-23 0.666667
sessions_df['bounced?'] = sessions_df['PageViews']==1
daily_session_bounce_rate = sessions_df.groupby('Date/Timestamp').mean()['bounced?']
The first line creates a column based of whether PageViews is equal to 1. This is opposite from how you did it; you had bounced? be True when PageViews is more than 1 ... which, if I understand your use of "bounced" correctly, is a case where the user didn't bounce.
The second line groups by Date/Timestamp and then takes the mean. Whenever you do math with booleans like this, Python treats them as ints: every time someone bounces, that's True/1, and whenever they don't, that's False/0. So the sum of the booleans is the same as the count of True values. When you tell pandas to take the mean of a Series of booleans, it divides that sum (the count of True) by the total number of values, which is exactly the fraction of times you have True.
Thus, grouping by Date and taking the mean gives you a dataframe where the rows are dates and the columns are the means for that date. The resulting dataframe has a column for each column of the original dataframe (in this case, you have a column consisting of the mean PageViews for each date, and a column of the mean bounced? for each date). If you just want the bounce percentage, you can subset the data frame with ['bounced?']
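A quick illustration of the boolean-mean point (toy values I made up, not data from the question):
import pandas as pd

s = pd.Series([True, False, False, True])
print(s.sum())   # 2   -> number of True values
print(s.mean())  # 0.5 -> fraction of True values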
Related
I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows, but this could grow to several thousand, with a column of interest called 'value'.
value total bucket
300048137 3.0741 3.0741 0
352969997 2.1024 5.1765 0
abc13.com 4.5237 9.7002 0
abc7.com 5.8202 15.5204 0
abcnews.go.com 6.7270 22.2474 0
........
www.legacy.com 12.6609 263.0797 1
www.math-aids.com 10.9832 274.0629 1
So far I tried using a cumulative sum, for which the total column was created, and then I essentially made the split based on where the mid-point of the total column is. Based on this solution.
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)
If I try to group them and take the average of each group, the difference is quite significant:
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will have bucket=0 for the half of the rows with the smaller value values, and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as it is <= the number of rows).
Edit:
New attempt, now that I think I understand your aim better:
import numpy as np
df = pd.DataFrame({'value': np.arange(100)})
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0 49
# 1 50
# Name: value, dtype: int64
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0 49.5
# 1 49.0
# 2 50.0
# Name: value, dtype: float64
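Applied to the original question's dataframe (a sketch on my part, assuming it is called test and has the 'value' column; rank is used instead of argsort so it also works when the values are not already sorted):
# alternate rows between the two buckets in order of increasing 'value',
# which keeps the two group means close together
test['bucket'] = test['value'].rank(method='first').astype(int).mod(2)
print(test.groupby('bucket')['value'].mean())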
The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs, with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676 6836
98391 4650
153662 1630
189835 1524
23902 1123
...
258717 1
242214 1
55947 1
256110 1
252621 1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series, with the User-ID as index, and the value is number of books read (or rather, number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram:
np.log(s).hist()
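Here s is assumed to be the value_counts Series from above; a fuller sketch of the plotting (my own wiring, not part of the original answer):
import numpy as np
import matplotlib.pyplot as plt

s = ratings2['User-ID'].value_counts()   # books rated per user
np.log(s).hist(bins=30)                  # log-transform spreads out the skewed counts
plt.xlabel('log(books rated per user)')
plt.show()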
First filter the Book-Rating column to remove 0 values, then count values with Series.value_counts and convert to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
.sort_index()
.rename_axis('User_ID')
.reset_index(name='booksRated'))
print (usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]
Hi, I have a dataset like the one below:
df = pd.DataFrame({"price" :[250,200,100,400,200,110], "segment": ["A","A","C","B","C","B"]})
I want to know what percentage of the total each segment spent,
like:
A = 35.71%
B = 40.47%
C = 23.82%
I have done it by subsetting each segment and then computing the percentage for each, but I want to do it in a single line.
Thanks in advance.
Maybe you can try with groupby and applying a lambda to each group. Something like:
first apply groupby('segment'),
then for each group take the segment's sum multiplied by 100,
and divide by the total sum of df.
As below:
df.groupby('segment')['price'].apply(lambda g: sum(g)*100.0/df.price.sum())
Result:
segment
A 35.714286
B 40.476190
C 23.809524
Name: price, dtype: float64
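An equivalent without the lambda, in case a fully vectorized version is preferred (same df as above; just a suggestion, not from the original answer):
df.groupby('segment')['price'].sum().div(df['price'].sum()).mul(100)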
I have a DataFrame with information from every single March Madness game since 1985. Now I am trying to calculate the percentage of wins by the higher seed by round. The main DataFrame looks like this:
I thought that the best way to do it is by creating separate functions. The first one deals with the case where Score is higher than Score.1 (return Team) and where Score.1 is higher than Score (return Team.1), then appends those at the end of the function. The next one needs to do the same with seeds: when Seed.1 is higher than Seed return Team, when Seed is higher than Seed.1 return Team.1, then append. Last, make a function for when those are equal.
def func1(x):
    if tourney.loc[tourney['Score']] > tourney.loc[tourney['Score.1']]:
        return tourney.loc[tourney['Team']]
    elif tourney.loc[tourney['Score.1']] > tourney.loc[tourney['Score']]:
        return tourney.loc[tourney['Team.1']]

func1(tourney.loc[tourney['Score']])
You can apply a row-wise function by applying a lambda function to the entire dataframe with axis=1. This will give you a True/False column 'low_seed_wins'.
With the new column of True/False you can take the count and the sum (count being the number of games, and sum being the number of lower-seed victories). You can then divide the sum by the count to get the win ratio.
This only works because your lower-seed teams are always on the left. If they are not, it will be a little more complex.
import pandas as pd
df = pd.DataFrame([[1987,3,1,74,68,5],[1987,3,2,87,81,6],[1987,4,1,84,81,2],[1987,4,1,75,79,2]], columns=['Year','Round','Seed','Score','Score.1','Seed.1'])
df['low_seed_wins'] = df.apply(lambda row: row['Score'] > row['Score.1'], axis=1)
df = df.groupby(['Year','Round'])['low_seed_wins'].agg(['count','sum']).reset_index()
df['ratio'] = df['sum'] / df['count']
df.head()
Year Round count sum ratio
0 1987 3 2 2.0 1.0
1 1987 4 2 1.0 0.5
You should be able to calculate this by checking both conditions, for both the first and second team. This returns a boolean Series, the sum of which is the number of cases where it is true. Then just divide by the length of the whole dataframe to get the percentage. Without test data it is hard to check exactly:
(
    ((tourney['Seed'] > tourney['Seed.1']) &
     (tourney['Score'] > tourney['Score.1'])) |
    ((tourney['Seed.1'] > tourney['Seed']) &
     (tourney['Score.1'] > tourney['Score']))
).sum() / len(tourney)
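Since the question asks for the rate by round, a hedged variation that groups the same boolean condition by Round (assuming the same tourney columns; untested without the real data):
cond = (
    ((tourney['Seed'] > tourney['Seed.1']) & (tourney['Score'] > tourney['Score.1'])) |
    ((tourney['Seed.1'] > tourney['Seed']) & (tourney['Score.1'] > tourney['Score']))
)
# mean of the boolean Series within each round = fraction of games where it is True
rate_by_round = cond.groupby(tourney['Round']).mean()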
I want to do almost the same thing as this question.
However, the approach in the accepted answer by @jezrael takes way too long on my dataset -- I have ~300k rows in the original dataframe, and it takes a few minutes to run the nlargest(1) command. Furthermore, when I tried it on a head(1000)-limited dataframe, I didn't get only 1 row per group within the value_counts -- I got back exactly the same Series as the value_counts.
In my own words: Basically, my dataset has two columns like this:
Session Rating
A Positive
A Positive
A Positive
A Negative
B Negative
B Negative
C Positive
C Negative
Using counts = df.groupby('Session')['Rating'].value_counts() I get a Series object like this:
Session Rating
A Positive 3
Negative 1
B Negative 2
C Positive 1
Negative 1
How do I get a dataframe where just the Rating with the max count is included? And in cases where there are multiple maxes (such as C), I would like to exclude that one from the returned table.
I think you want something like this.
df.groupby('Session')['Rating'].apply(lambda x: x.value_counts().head(1))
Output:
Session
A Positive 3
B Negative 2
C Negative 1
Name: Rating, dtype: int64
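Note that head(1) still returns a row for Session C even though Positive and Negative are tied there. If you really do want to drop tied sessions, as asked, one possible sketch (my own addition, assuming the same df as in the question):
def top_if_unique(vc):
    # vc is the value_counts Series for one session, sorted descending
    if len(vc) > 1 and vc.iloc[0] == vc.iloc[1]:
        return vc.iloc[0:0]   # tie for the max count -> drop this session
    return vc.iloc[:1]

df.groupby('Session')['Rating'].apply(lambda x: top_if_unique(x.value_counts()))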