Pandas - question about Groupby - 2 columns - python

I'm trying to compare two columns against one another to see how much one depends on the other, i.e. how the chance of admission depends on research experience - in other words, print the average chance of admit against research.
I'm not sure I have the command right, or whether size() or something else should be used at the end:
df.groupby(['Chance of Admit', 'Research']).size()
this is the result when I run the above:
Chance of Admit Research
0.34 0 2
0.36 0 1
1 1
0.37 0 1
0.38 0 2
..
0.93 1 12
0.94 1 13
0.95 1 5
0.96 1 8
0.97 1 4
Length: 99, dtype: int64

To see the mean admission rate by research, you should group by 'Research' and take the mean:
df.groupby('Research').mean()
To see additional stats grouped by the number of research papers, use .describe():
df.groupby('Research').describe()
Finally, it may be useful to plot a correlation of the chance of admission vs. the amount of research:
df.plot.scatter(x='Research', y='Chance of Admit')

Syntactically speaking, to take the mean of the chance to admit column grouped by research, all you need is the following:
df[['Chance of Admit', 'Research']].groupby('Research').mean()
But be wary here, as the number may not actually tell you what the chance of getting admitted is given a particular amount of research. That is because you don't know the denominators in those probabilities.
For example,
Suppose I had a dataset that contained the following two rows:
Chance of Admit Research
0.75 1
0.25 1
The 'mean' "chance of admit" for Research==1 would be 0.50 when calculated this way. But suppose the first chance came from a population of 100 students, of whom 75% (75) were admitted, and the second from a population of 900, of whom 25% (225) were admitted.
Then, over all the data we have, we would see 300 admitted out of a total of 1000 students with Research==1, i.e. a 30% chance to admit with Research==1.
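If a column with the underlying applicant counts were available, the properly weighted mean could be computed directly. A minimal sketch, assuming a hypothetical 'Applicants' column that is not in the original dataset:
import pandas as pd

# Hypothetical example: each row's chance comes from a different number of applicants.
df = pd.DataFrame({'Chance of Admit': [0.75, 0.25],
                   'Research': [1, 1],
                   'Applicants': [100, 900]})  # 'Applicants' is assumed, not in the real data
df['Admitted'] = df['Chance of Admit'] * df['Applicants']
sums = df.groupby('Research')[['Admitted', 'Applicants']].sum()
print(sums['Admitted'] / sums['Applicants'])  # Research == 1 -> 0.30, matching the 300/1000 above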

Related

Rolling calculation across a column - array-wise

I'm trying to get a rolling n-day annualized equity return volatility but am having trouble implementing it. Basically, I would want to see in the last row (index 10) an implementation of sorts that does np.std(df["log returns"]) * np.sqrt(252) for a rolling n-day window (e.g. index 6-10 for a 5-day window). If there aren't n values left, leave empty/fill with np.nan.
Index   log returns   annualized volatility
0        0.01
1       -0.005
2        0.021
3        0.01
4       -0.01
5        0.02
6        0.012
7        0.022
8       -0.001
9       -0.01
10       0.01
I thought about doing this with a while loop, but since I'm working with a lot of data I thought an array-wise operation may be smarter. Unfortunately I can't come up with one for the life of me.
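One possible vectorised approach (not from the original thread) is pandas' rolling window. A sketch assuming a 5-day window, with ddof=0 to match np.std's default:
import numpy as np
import pandas as pd

df = pd.DataFrame({'log returns': [0.01, -0.005, 0.021, 0.01, -0.01, 0.02,
                                   0.012, 0.022, -0.001, -0.01, 0.01]})
window = 5  # assumed window length n
# Rows with fewer than `window` preceding values are left as NaN automatically.
df['annualized volatility'] = df['log returns'].rolling(window).std(ddof=0) * np.sqrt(252)
print(df)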

Correlation between strings within variable

How could I assess the correlation between each type within each variable?
df
level job
0 good golfer
1 bad footballer
2 intermediate musician
...
Expected Output is a correlation table or something similar to this:
golfer footballer musician ...
good
bad
intermediate
I tried:
df['level']=df['level'].astype('category').cat.codes
df['job']=df['job'].astype('category').cat.codes
df.corr()
You can use pd.crosstab
df1 = pd.crosstab(df.level, df.job)
df1
For my example data you get the output
job footballer golfer musician
level
bad 1 3 3
good 3 3 2
intermediate 1 2 2
And divide by the column totals (df1.sum() adds up the rows of each column):
df1 / df1.sum()
Output
job footballer golfer musician
level
bad 0.2 0.375 0.428571
good 0.6 0.375 0.285714
intermediate 0.2 0.250 0.285714
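If it helps, pd.crosstab can also do the normalisation in one step; its normalize argument set to 'columns' reproduces the division by the column totals shown above:
# Equivalent to pd.crosstab(df.level, df.job) followed by dividing by the column sums
pd.crosstab(df.level, df.job, normalize='columns')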
Judging from the expected output you want to have a table of frequencies. I guess this could be done better but one approach is:
count_combos = pd.Series(zip(df.level, df.job)).value_counts()
count_combos.index = pd.MultiIndex.from_tuples(count_combos.index)
count_combos.unstack()

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results with a column for the date a test was completed and a column for the number of hours it took to complete the test, i.e.:
df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]})
I have a month's worth of test data and the tests can take anywhere from a couple of hours to a couple of days. I want to work out, for each day, what percentage of tests took within 24hrs/48hrs/72hrs etc. to complete, up to the percentage of tests that took longer than a week.
I've been able to work it out generally without taking the dates into account like so:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] > 168
one = Lab_tests['one-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]
print(one/10407 * 100)
print(two/10407 * 100)
print(eight/10407 * 100)
Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the data ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?
This question, Counting qualitative values based on the date range in Pandas is quite similar but the fact that I'm counting the occurrences in specified ranges is throwing me off and I haven't been able to get a solution out of it.
Bonus Question
I'm sure you've noticed my current code is not the most elegant thing in the world. Is there a cleaner way to do what I've done above, as I'm repeating it for every data range that I want?
Edit:
So the output for the sample data given would look like this:
df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])
You're almost there. You just need to do a few final steps:
First, cast your bools to ints, so that you can sum them.
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Completed hours one-day two-day GreaterThanWeek
0 21/03/2020 23 1 0 0
1 22/03/2020 32 0 1 0
2 21/03/2020 8 1 0 0
3 24/03/2020 73 0 0 0
4 24/03/2020 41 0 1 0
Then, drop the hours column and roll the rest up to the level of Completed:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
one-day two-day GreaterThanWeek
Completed
21/03/2020 2 0 0
22/03/2020 0 1 0
24/03/2020 0 1 0
EDIT: To get to percentages, you just need to divide each day's counts by that day's total across all three columns. You can sum across the columns by setting the axis of the sum:
...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)
Completed
21/03/2020 2
22/03/2020 1
24/03/2020 1
dtype: int64
Then divide the daily totals dataframe by those row totals (again, axis determines whether each value of the series acts as the divisor for a row or a column):
daily_totals.div(daily_totals.sum(axis=1), axis=0)
one-day two-day GreaterThanWeek
Completed
21/03/2020 1.0 0.0 0.0
22/03/2020 0.0 1.0 0.0
24/03/2020 0.0 1.0 0.0
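As a possible answer to the bonus question (not part of the original answer), pd.cut can bin the hours and pd.crosstab can produce the per-day percentages in a couple of lines. A sketch, assuming 24-hour buckets up to one week plus an open-ended 'GreaterThanWeek' bucket:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Completed': ['21/03/2020', '22/03/2020', '21/03/2020', '24/03/2020', '24/03/2020'],
                   'Hours_taken': [23, 32, 8, 73, 41]})
# Bin edges every 24 hours up to 168h (one week), then an open-ended bucket.
bins = list(range(0, 169, 24)) + [np.inf]
labels = [f'{d}-day' for d in range(1, 8)] + ['GreaterThanWeek']
df['bucket'] = pd.cut(df['Hours_taken'], bins=bins, labels=labels)
# normalize='index' turns each row (date) into fractions; multiply by 100 for percentages.
print(pd.crosstab(df['Completed'], df['bucket'], normalize='index') * 100)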

Solving multiple linear equations using Pandas

I have what I think is a very interesting problem here but have little idea how I can go about solving it computationally or whether a Python dataframe is appropriate for this purpose. I have data like so:
SuperGroup Group Code Weight Income
8 E1 E012 a 0.5 1000
9 E1 E012 b 0.2 1000
10 E1 E013 b 0.2 1000
11 E1 E013 c 0.3 1000
Effectively, 'Code' has a one-to-one relationship with 'Weight'.
'SuperGroup' has a one-to-one relationship with 'Income'.
A SuperGroup is composed of many Groups and a Group has many Codes.
I am attempting to distribute the income according to the combined weights of the codes within each group, so for E012 this is 0.5*0.2 = 0.1 and for E013 this is 0.2*0.3 = 0.06. As a proportion of their total, E012's share becomes 0.625 (0.1/(0.1+0.06)) and E013's becomes 0.375 (0.06/(0.1+0.06)).
The dataframe can be collapsed and re-written as:
SuperGroup Group Code CombinedWeight Income
8 E1 E012 a,b 0.625 1000
10 E1 E013 b,c 0.375 1000
I am capable of producing the above dataframe, but my next step is to apply the weights to the income so that it is distributed in a way that still averages to 1000 but reflects the weight of the group it is associated with.
Letting x and y be the weighted incomes for E012 and E013, their ratio follows the weights 0.625 and 0.375, so x = 1.67y.
Additionally, (x+y)/2 = 1000. Note: my data often has several groups in a supergroup, so there could be more than two unknowns, resulting in a system of linear equations if my understanding is correct.
Solving simultaneously produces 1250 and 750 as the weighted incomes. The dataframe can be re-written as:
SuperGroup Group Code Income
8 E1 E012 a,b 1250
10 E1 E013 b,c 750
which is effectively how I need it. Any guidance is warmly appreciated.
First we agg the DataFrame on ['SuperGroup', 'Group']
res = (df.groupby(['SuperGroup', 'Group'])
         .agg({'Weight': lambda x: x.cumprod().iloc[-1],
               'Code': ','.join,
               'Income': 'first'}))
Then we re-adjust the Income within each SuperGroup with the help of transform:
s = res.groupby(level='SuperGroup')
res['Income'] = s.Income.transform('sum')*res.Weight/s.Weight.transform('sum')
Weight Code Income
SuperGroup Group
E1 E012 0.10 a,b 1250.0
E013 0.06 b,c 750.0
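For reference, a self-contained sketch that rebuilds the sample data and reproduces both steps (using 'prod' in place of the cumprod lambda, which gives the same per-group product):
import pandas as pd

df = pd.DataFrame({'SuperGroup': ['E1', 'E1', 'E1', 'E1'],
                   'Group': ['E012', 'E012', 'E013', 'E013'],
                   'Code': ['a', 'b', 'b', 'c'],
                   'Weight': [0.5, 0.2, 0.2, 0.3],
                   'Income': [1000, 1000, 1000, 1000]})
# Collapse each Group: product of its weights, concatenated codes, one income value.
res = (df.groupby(['SuperGroup', 'Group'])
         .agg({'Weight': 'prod', 'Code': ','.join, 'Income': 'first'}))
# Redistribute each SuperGroup's income in proportion to the group weights.
s = res.groupby(level='SuperGroup')
res['Income'] = s['Income'].transform('sum') * res['Weight'] / s['Weight'].transform('sum')
print(res)  # E012 -> 1250.0, E013 -> 750.0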

Does the test set need data cleaning in machine learning?

I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv). The target is predicting the tip amount. The raw data looks like this (2 data samples):
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \
0 2 2017-04-01 00:03:54 2017-04-01 00:20:51 N
1 2 2017-04-01 00:00:29 2017-04-01 00:02:44 N
RatecodeID PULocationID DOLocationID passenger_count trip_distance \
0 1 25 14 1 5.29
1 1 263 75 1 0.76
fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \
0 18.5 0.5 0.5 1.00 0.0 NaN
1 4.5 0.5 0.5 1.45 0.0 NaN
improvement_surcharge total_amount payment_type trip_type
0 0.3 20.80 1 1.0
1 0.3 7.25 1 1.0
There are five different 'payment_type' values, indicated by the numbers 1, 2, 3, 4, 5.
I find that the 'tip_amount' is only meaningful when 'payment_type' is 1; 'payment_type' 2, 3, 4 and 5 all have zero tip:
for i in range(1,6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?
Second question: We know that the 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5; it is just not recorded correctly. If I drop these samples and only keep 'payment_type' == 1, then on an unseen test dataset the model cannot predict zero tip for 'payment_type' 2, 3, 4, 5, so I have to keep 'payment_type' as an important feature, right?
Third question: Let's say I keep samples of all the different 'payment_type' values and the model is able to predict a zero tip amount for 'payment_type' 2, 3, 4, 5 - but is this what we really want? The underlying true tip should not be zero; that's just how the data was recorded.
A common saying in machine learning goes: garbage in, garbage out. Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment_type values 2, 3, 4 and 5 all result in a tip of 0, why not keep it simple? Replace all payment types that are not 1 with 0. This lets your model easily associate 1 with a recorded tip and 0 with no recorded tip, and it reduces the number of things your model has to learn.
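A minimal sketch of that replacement, using the raw frame name from the question:
# Collapse payment_type to a binary flag: 1 where a tip can be recorded, 0 otherwise.
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)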
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is what we want or not what we want is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides making some changes to your payment_type feature, you should also look into normalizing your data, which will help your machine learning algorithm generalize the relationships in your data better.
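As one illustration of the normalisation step (an assumption, not part of the original answer), scikit-learn's StandardScaler can standardise the numeric features; 'trip_distance' and 'fare_amount' are example columns from the dataset:
from sklearn.preprocessing import StandardScaler

numeric_cols = ['trip_distance', 'fare_amount']  # example numeric feature columns
scaler = StandardScaler()
# In practice, fit the scaler on the training split only and reuse it on the test set.
raw[numeric_cols] = scaler.fit_transform(raw[numeric_cols])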
