Pandas - partition a dataframe into two groups with an approximate mean value - python

I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows (but this could grow to several thousand) with a column of interest called 'value'.
                     value     total  bucket
300048137           3.0741    3.0741       0
352969997           2.1024    5.1765       0
abc13.com           4.5237    9.7002       0
abc7.com            5.8202   15.5204       0
abcnews.go.com      6.7270   22.2474       0
...
www.legacy.com     12.6609  263.0797       1
www.math-aids.com  10.9832  274.0629       1
So far I have tried a cumulative-sum approach: I created the total column with cumsum and then split the rows where total crosses the mid-point of the overall sum, based on this solution.
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)
If I group them and take the average of each group, the difference between the two means is quite significant:
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.

I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will set bucket=0 for the half of the rows with the smaller value values, and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as q <= number of rows).
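For example, a minimal sketch (with made-up numbers, not the asker's data) of how q controls the number of buckets:
import pandas as pd

# hypothetical values; q=4 gives four roughly equal-sized buckets
test = pd.DataFrame({'value': [3.1, 2.1, 4.5, 5.8, 6.7, 12.7, 11.0, 8.4]})
test['bucket'] = pd.qcut(test['value'], q=4, labels=False)
print(test.groupby('bucket')['value'].mean())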
Edit:
A new attempt, now that I think I understand your aim better:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(100)})
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0    49.0
# 1    50.0
# Name: value, dtype: float64

df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
# group
# 0    49.5
# 1    49.0
# 2    50.0
# Name: value, dtype: float64
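Applied to the asker's data, a rough sketch (a few rows recreated by hand from the question; it uses rank() to alternate rows by sorted order, which matches argsort() on the already-sorted arange example but also handles unsorted values):
import pandas as pd

# a few rows recreated from the question
test = pd.DataFrame(
    {'value': [3.0741, 2.1024, 4.5237, 5.8202, 6.7270, 12.6609, 10.9832]},
    index=['300048137', '352969997', 'abc13.com', 'abc7.com',
           'abcnews.go.com', 'www.legacy.com', 'www.math-aids.com'])

# alternate rows between the two buckets in order of increasing value
test['bucket'] = test['value'].rank(method='first').astype(int).mod(2)
print(test.groupby('bucket')['value'].mean())
# the two bucket means come out around 6.63 and 6.50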

Related

Pandas - using group by and including value counts which are larger than n

I have a table which includes salary and company_location.
I was trying to calculate the mean salary per country, and it works:
wage = df.groupby('company_location').mean()['salary']
However, many company_location values have fewer than 5 entries, and I would like to exclude them from the report.
I know how to calculate countries with the top 5 entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having a problem connecting those two variables into one and making a graph out of it...
Thank you.
You can remove rows whose value occurrence is below a threshold:
df = df[df.groupby('company_location')['company_location'].transform('size') > 5]
You can do the following to only apply the groupby and aggregation to those with more than 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) > 5)
wage = df[mask].groupby('company_location')['salary'].mean()
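If the goal is also to graph the result, a quick sketch (assuming matplotlib is available; wage is the filtered Series from the snippet above):
import matplotlib.pyplot as plt

# bar chart of mean salary per company_location (locations with more than 5 rows only)
wage.sort_values(ascending=False).plot(kind='bar')
plt.ylabel('mean salary')
plt.tight_layout()
plt.show()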

Why is Pandas DataFrame Function 'isin()' taking so much time?

The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the number of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676 6836
98391 4650
153662 1630
189835 1524
23902 1123
...
258717 1
242214 1
55947 1
256110 1
252621 1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series, with the User-ID as index, and the value is number of books read (or rather, number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram:
s = ratings2['User-ID'].value_counts()
np.log(s).hist()
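The log-scale-axis alternative mentioned above could look roughly like this (a sketch, assuming s is the value_counts() Series defined above and matplotlib is imported as plt):
import numpy as np
import matplotlib.pyplot as plt

# log-spaced bins so the heavily skewed counts spread out visibly
bins = np.logspace(0, np.log10(s.max()), 30)
s.hist(bins=bins)
plt.xscale('log')
plt.xlabel('books rated per user')
plt.ylabel('number of users')
plt.show()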
First filter the Book-Rating column to remove 0 values, then count values with Series.value_counts and convert to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
.sort_index()
.rename_axis('User_ID')
.reset_index(name='booksRated'))
print (usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]

How to optimally update cells based on previous cell value / How to elegantly spread values of cell to other cells?

I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration. My current implementation is something like the following; as you can see, using 2 loops is not optimal, and I guess I could get rid of one by using apply(row):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension because the loops are not optimal.
The apply function, however, regardless of the axis, does not seem to let me dynamically get the index/column (which I need in order to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small, fixed set of column names but rather a wide range of years (all the examples I see use a handful of named columns...).
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of your dataframe) with the previous non-zero value per column.
If you want to do it row-wise, the .replace() function unfortunately does not accept an axis argument. But you can transpose your dataframe, replace the zeros, and transpose it again: df.T.replace(0, method='ffill').T
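A small sketch of the transpose trick on the example row from the question (a made-up one-row frame; the year column names are only illustrative, and newer pandas versions may warn that the method argument of replace is deprecated):
import pandas as pd

# one row of the example: 0 0 1 0 0 3 0 0 2 1
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
                  index=['ABC'],
                  columns=[str(y) for y in range(1991, 2001)])

# transpose, forward-fill the zeros down the (now) single column, transpose back
spread = df.T.replace(0, method='ffill').T
print(spread.loc['ABC'].tolist())
# [0, 0, 1, 1, 1, 3, 3, 3, 2, 1]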

How to append dataframe with selected columns having higher feature score

Hi, I am new to Python; let me know if the question is not clear.
Here is my dataframe:
df = pd.DataFrame(df_test)
age bmi children charges
0 19 27.900 0 16884.92400
1 18 33.770 1 1725.55230
2 28 33.000 3 4449.46200
3 33 22.705 0 21984.47061
I am applying SelectKBest feature selection using the chi-squared test to this numerical data:
from sklearn.feature_selection import SelectKBest, chi2

X_clf = numeric_data.iloc[:,0:(col_len-1)]
y_clf = numeric_data.iloc[:,-1]
bestfeatures = SelectKBest(score_func=chi2, k=2)
fit = bestfeatures.fit(X_clf,y_clf)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_clf.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
This is my output:
Feature Score
0 age 6703.764216
1 bmi 1592.481991
2 children 1752.136519
I now want my dataframe to contain only the features with the 2 highest scores. However, I want to do so without hardcoding the column names when building the new dataframe.
I have tried storing the column names in a list and appending those with the highest scores, but I get a ValueError. Is there any method/function I could try that stores the selected columns and then keeps them based on their scores?
Expected output (column 'bmi' is gone, as it has the lowest of the 3 scores):
age children charges
0 19 0 16884.92400
1 18 1 1725.55230
2 28 3 4449.46200
3 33 0 21984.47061
So first you want to find out which features have the largest scores, then get the feature names of the columns you do not want to keep.
colToDrop = featureScores.loc[~featureScores['Score'].isin(featureScores['Score'].nlargest(2)), 'Feature'].values
Next we just filter the original df and remove those columns from the columns list
df[df.columns.drop(colToDrop)]
I believe you need to work on the dataframe featureScores to keep the 2 features with the highest Score and then use these values as a list to filter the columns in the original dataframe. Something along the lines of:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + ['charges']
filtered_df = df[important_features]
The sort_values() call makes sure the features (in case there are more) are sorted from highest score to lowest score. We then create a list of the first 2 values of the column Feature (which has already been sorted) with .values.tolist()[:2]. Since you seem to also want to include the column charges in your output, we append it manually with + ['charges'] to our list of important_features.
Finally, we create a filtered_df by selecting only the important_features columns from the original df.
Edit based on comments:
If you can guarantee charges will be the last column in your original df then you can simply do:
important_features = featureScores.sort_values('Score',ascending=False)['Feature'].values.tolist()[:2] + [df.columns[-1]]
filtered_df = df[important_features]
I see you have previously defined your y column with y_clf = numeric_data.iloc[:,-1]; you can then use [y_clf.name] or [df.columns[-1]] in place of ['charges'], either should work fine.
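As a side note, a hedged alternative sketch that skips building featureScores entirely: scikit-learn's SelectKBest can report the selected columns itself via get_support() (reusing X_clf, y_clf and df from the question):
from sklearn.feature_selection import SelectKBest, chi2

# fit the selector and pull the names of the k best columns directly
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X_clf, y_clf)
best_cols = X_clf.columns[selector.get_support()].tolist()

# keep the selected features plus the target column (the last column of df)
filtered_df = df[best_cols + [df.columns[-1]]]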

calculate bounce rate by day

This seems like a really easy task, but I have been struggling with it for a while.
I want to calculate (number of sessions with PageViews == 1) / (total number of sessions) per day; sample data below:
session_df
Date/Timestamp Session ID PageViews
2/14/2016 a 1
2/14/2016 b 5
2/14/2016 c 8
3/23/2016 d 1
3/23/2016 e 1
3/23/2016 f 2
and the expected output:
Date/Timestamp BounceRate
2/14/2016 0.333333333
3/23/2016 0.666666667
I first tried adding a Bounced? column based on the PageViews number, then grouping and calculating the percentage, but then I need to filter out Bounced? == False, which is very cumbersome. If anyone can suggest a better way of doing this, that would be great!
sessions_df['Bounced?'] = sessions_df['PageViews']>1
dt = pd.DatetimeIndex(sessions_df['Date/Timestamp'])
daily_session_bounce_rate = sessions_df.groupby([dt.date, 'Bounced?']).agg({'Session ID':'count'})
daily_session_bounce_rate = daily_session_bounce_rate.groupby(level=0).apply(lambda x: x / float(x.sum()))
daily_session_bounce_rate
# this is my output
                      Session ID
            Bounced?
2016-01-01  False       0.459893
            True        0.540107
#filter data
daily_session_bounce_rate.loc[daily_session_bounce_rate['Bounced?']==True,['level_0','Session ID']]
You do not need to define a separate Bounced? column. Take the count of grouped rows where PageViews == 1 and divide by the number of all rows for that date to get the fraction:
daily_session_bounce_rate = \
df[df['PageViews']==1].groupby('Date/Timestamp').agg({'Session ID':'count'}) /\
df.groupby('Date/Timestamp').agg({'Session ID':'count'})
You could try like this:
bouncerate = (df.loc[df['PageViews'] == 1]
.groupby('Date/Timestamp')['Session ID'].count()
.div(df.groupby('Date/Timestamp')['Session ID']
.count())
.to_frame('Bounce Rate'))
Or:
bouncerate = (df.groupby('Date/Timestamp')
.apply(lambda x: sum(x.PageViews == 1) / x.PageViews.count())
.to_frame('Bounce Rate'))
Both result in:
>>> bouncerate
Bounce Rate
Date/Timestamp
2/14/2016 0.333333
3/23/2016 0.666667
You need:
session_df['Date/Timestamp'] = pd.to_datetime(session_df['Date/Timestamp'])
grp = session_df.groupby(session_df['Date/Timestamp'].dt.date)['Session ID'].count()
session_1 = session_df.loc[session_df['PageViews']==1].groupby(session_df['Date/Timestamp'].dt.date)['Session ID'].count()
pd.DataFrame({'bouncerate': list(session_1/grp)}, index=session_df['Date/Timestamp'].unique())
Output:
bouncerate
2016-02-14 0.333333
2016-03-23 0.666667
sessions_df['bounced?'] = sessions_df['PageViews']==1
daily_session_bounce_rate = sessions_df.groupby('Date/Timestamp').mean()['bounced?']
The first line creates a column based of whether PageViews is equal to 1. This is opposite from how you did it; you had bounced? be True when PageViews is more than 1 ... which, if I understand your use of "bounced" correctly, is a case where the user didn't bounce.
The second line groups by Date/Timestamp and then takes the mean. Whenever you do math with booleans like this, Python casts them as int, so every time someone bounces, that's True/1, and whenever they don't, that's False/0. Thus, the sum of the values of the booleans as int is the same as the count of True. When you tell pandas to take the mean of a Series of booleans, it takes the sum/count of True, and divides by the total number of values, which is the same as finding the percentage of times you have True.
Thus, grouping by Date and taking the mean gives you a dataframe where the rows are dates and the columns are the means for that date. The resulting dataframe has a column for each column of the original dataframe (in this case, you have a column consisting of the mean PageViews for each date, and a column of the mean bounced? for each date). If you just want the bounce percentage, you can subset the data frame with ['bounced?']
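For reference, a quick sketch running this idea on the sample data from the question (frame recreated by hand; here the mean is taken on the bounced? column directly, which gives the same numbers as the expected output):
import pandas as pd

sessions_df = pd.DataFrame({
    'Date/Timestamp': ['2/14/2016', '2/14/2016', '2/14/2016',
                       '3/23/2016', '3/23/2016', '3/23/2016'],
    'Session ID': list('abcdef'),
    'PageViews': [1, 5, 8, 1, 1, 2],
})

sessions_df['bounced?'] = sessions_df['PageViews'] == 1
print(sessions_df.groupby('Date/Timestamp')['bounced?'].mean())
# Date/Timestamp
# 2/14/2016    0.333333
# 3/23/2016    0.666667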
