Correlation between strings within variable - python

How could I assess the correlation between each type within each variable?
df
          level         job
0          good      golfer
1           bad  footballer
2  intermediate    musician
...
Expected Output is a correlation table or something similar to this:
              golfer  footballer  musician  ...
good
bad
intermediate
I tried:
df['level'] = df['level'].astype('category').cat.codes
df['job'] = df['job'].astype('category').cat.codes
df.corr()

You can use pd.crosstab
df1 = pd.crosstab(df.level, df.job)
df1
For my example data you get the output
job           footballer  golfer  musician
level
bad                    1       3         3
good                   3       3         2
intermediate           1       2         2
Then divide each column by its total (df1.sum() sums down the columns by default):
df1 / df1.sum()
Output
job           footballer  golfer  musician
level
bad                  0.2   0.375  0.428571
good                 0.6   0.375  0.285714
intermediate         0.2   0.250  0.285714
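Alternatively, pd.crosstab can count and normalize in a single call via its normalize argument; normalize='columns' should reproduce the column-wise division above, while normalize='index' would normalize each row instead:
pd.crosstab(df.level, df.job, normalize='columns')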

Judging from the expected output, you want a table of frequencies. This could probably be done more cleanly, but one approach is:
# count every (level, job) combination
count_combos = pd.Series(list(zip(df.level, df.job))).value_counts()
# turn the tuple index into a MultiIndex so it can be unstacked into a table
count_combos.index = pd.MultiIndex.from_tuples(count_combos.index)
count_combos.unstack()
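If you prefer zeros over NaN for combinations that never occur (matching the crosstab output above), unstack accepts a fill value:
count_combos.unstack(fill_value=0)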


Pandas - question about Groupby - 2 columns

I'm trying to compare two columns against one another to see how much one depends on the other, i.e. how the chance of admission depends on research experience: print the average chance of admit against research.
I'm not sure I have the command right, or whether size() (or something else) should be used at the end:
df.groupby(['Chance of Admit', 'Research']).size()
this is the result when I run the above:
Chance of Admit  Research
0.34             0            2
0.36             0            1
                 1            1
0.37             0            1
0.38             0            2
                             ..
0.93             1           12
0.94             1           13
0.95             1            5
0.96             1            8
0.97             1            4
Length: 99, dtype: int64
To see the mean admission rate by number of research papers, you should group by 'research' and take the mean:
df.groupby('Research').mean()
To see additional stats grouped by the number of research papers, use .describe():
df.groupby('Research').describe()
Finally, it may be useful to plot a correlation of the chance of admission vs. the amount of research:
df.plot.scatter(x='Research', y='Chance of Admit')
Syntactically speaking, to take the mean of the chance to admit column grouped by research, all you need is the following:
df[['Chance of Admit', 'Research']].groupby('Research').mean()
But be wary here, as the number may not actually tell you what the chance of getting admitted is given a particular amount of research. That is because you don't know the denominators in those probabilities.
For example,
Suppose I had a dataset that contained the following two rows:
Chance of Admit Research
0.75 1
0.25 1
The 'mean' "chance of admit" for research==1 would be 0.50 when calculated this way. But suppose the first chance came from a population of 100 students, of whom 75% (75) were admitted, and the second from a population of 900, of whom 25% (225) were admitted.
Then, over all the data we have, 300 out of 1000 students with research==1 were admitted, i.e. a 30% chance to admit with research==1.
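To make that concrete, here is a small sketch of the weighted calculation, using a hypothetical 'Applicants' column that is not in the original dataset:
import pandas as pd

df = pd.DataFrame({
    'Chance of Admit': [0.75, 0.25],
    'Research': [1, 1],
    'Applicants': [100, 900],  # hypothetical cohort sizes
})

# weighted mean: total admitted / total applicants per Research value
admitted = df['Chance of Admit'] * df['Applicants']
weighted = admitted.groupby(df['Research']).sum() / df['Applicants'].groupby(df['Research']).sum()
print(weighted)  # Research == 1 -> 0.30, not the unweighted 0.50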

Does the test set need data cleaning in machine learning?

I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv). The target is to predict the tip amount. The raw data looks like this (2 data samples):
   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \
0         2  2017-04-01 00:03:54   2017-04-01 00:20:51                  N
1         2  2017-04-01 00:00:29   2017-04-01 00:02:44                  N

   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \
0           1            25            14                1           5.29
1           1           263            75                1           0.76

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \
0         18.5    0.5      0.5        1.00           0.0        NaN
1          4.5    0.5      0.5        1.45           0.0        NaN

   improvement_surcharge  total_amount  payment_type  trip_type
0                    0.3         20.80             1        1.0
1                    0.3          7.25             1        1.0
There are five different 'payment_type' values, indicated by the numbers 1, 2, 3, 4, 5.
I find that 'tip_amount' is only meaningful when 'payment_type' is 1; 'payment_type' 2, 3, 4 and 5 all have a zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
   tip_amount  payment_type
0        1.00             1
1        1.45             1
   tip_amount  payment_type
5         0.0             2
8         0.0             2
     tip_amount  payment_type
100         0.0             3
513         0.0             3
     tip_amount  payment_type
59          0.0             4
102         0.0             4
       tip_amount  payment_type
46656         0.0             5
53090         0.0             5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?
Second question: We know that 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5; it just is not recorded correctly. If I drop these data samples and only keep 'payment_type' == 1, then the model cannot predict a zero tip for 'payment_type' 2, 3, 4, 5 on an unseen test dataset, so I have to keep 'payment_type' as an important feature, right?
Third question: Let's say I keep data samples for all the different 'payment_type' values and the model is able to predict a zero tip amount for 'payment_type' 2, 3, 4, 5. But is this what we really want? The underlying true tip should not be zero; that is just how the data was recorded.
A common saying in machine learning goes: garbage in, garbage out. Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment_type values of 2, 3, 4, 5 all result in 0, why not keep it simple? Replace all payment types that are not 1 with 0. This will let your model easily correlate 1 with being paid and 0 with not being paid. It also reduces the amount of things your model has to learn.
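For example, a minimal sketch of that replacement (assuming the question's DataFrame is loaded as raw):
# 1 = card payment (tips recorded), 0 = everything else
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)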
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is what we want or not what we want is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides making some changes to your payment_type feature, you should also look into normalizing your data, which will help your machine learning algorithm better generalize the relations in your data.
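As a sketch of that normalization step, scikit-learn's StandardScaler is one common option; the column list below is only an assumption, so substitute whichever numeric features you actually keep:
from sklearn.preprocessing import StandardScaler

num_cols = ['trip_distance', 'fare_amount', 'tolls_amount']  # assumed feature subset
raw[num_cols] = StandardScaler().fit_transform(raw[num_cols])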

Pandas Concat Different Sized DataFrame to End of Column

Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to be aligned with the latest sales, i.e. at the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome:
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if this is much faster, but you could do
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
   forecast
6       5.0
7       5.5
8       6.0
9       5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
   sales  forecast
0      5       NaN
1      3       NaN
2      5       NaN
3      6       NaN
4      4       NaN
5      4       NaN
6      5       5.0
7      6       5.5
8      7       6.0
9      5       5.0
A one-line solution using the same idea, as mentioned by #Dark in the comments, would be:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], inplace=False)], axis=1)
giving the same output.
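Note that in newer pandas versions the inplace argument of set_axis has been removed, so the same one-liner would presumably read:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):])], axis=1)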

Filling in missing values with values that may exist elsewhere in DataFrame?

I have an aviation dataset that I am trying to clean. There are some missing values for the NumEngines feature, but in some instances a missing value can be derived from an entry elsewhere in the dataframe (this is not always the case). Below is a mini example of my dataset to illustrate both cases. Note that the first Cessna entry can be used to fill in the second, but this is not possible for Piper.
df = pd.DataFrame()
df["Make"] = ["Cessna","Piper","Cessna","Boeing"]
df["Model"] = ["Citation","PA32RT","Citation","737-300"]
df["NumEngines"] = [2,None,None,2]
How can I make it so that the resulting DataFrame would be
     Make     Model  NumEngines
0  Cessna  Citation         2.0
1   Piper    PA32RT         NaN
2  Cessna  Citation         2.0
3  Boeing   737-300         2.0
I would bet transform('first') does the trick here:
df.groupby(['Make', 'Model']).transform('first')
Out[179]:
   NumEngines
0         2.0
1         NaN
2         2.0
3         2.0
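To actually write the filled values back into the original frame, one way (a sketch of the same idea) is to assign the grouped 'first' to the column; groups that are entirely missing, like Piper PA32RT, simply stay NaN:
df['NumEngines'] = df.groupby(['Make', 'Model'])['NumEngines'].transform('first')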

Efficient operation over grouped dataframe Pandas

I have a very big Pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over the groups, do an operation on each group, and union all those groups back into one dataframe, but this is slow and I feel there is a better way to achieve this. Here is the input and what I want out of it. Input:
ID  price
 1  100.00
 1   80.00
 1   90.00
 2   40.00
 2   40.00
 2   50.00
Output:
ID  price   order
 1  100.00      3
 1   80.00      1
 1   90.00      2
 2   40.00      1
 2   40.00      2   (could be 1, doesn't matter too much)
 2   50.00      3
Since this is over about 5 million records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the package numpy-groupies.
import numpy as np

# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j * df['price'])
# get the first position of each ID within that sort order
# (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from the position within the global sort order
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how #ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
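As a cheap sanity check (a sketch, not a full test suite): after computing df['rank'], sorting by ID and rank should give non-decreasing prices within each ID:
chk = df.sort_values(['ID', 'rank'])
same_id = chk['ID'].eq(chk['ID'].shift())
assert not (same_id & chk['price'].diff().lt(0)).any()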
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
   ID  price  order
0   1  100.0    3.0
1   1   80.0    1.0
2   1   90.0    2.0
3   2   40.0    1.0
4   2   40.0    2.0
5   2   50.0    3.0
It takes about 30s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
