Does the test set need data cleaning in machine learning?

I am working on a machine learning project using the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv). The target is to predict the tip amount. The raw data looks like this (2 samples):
   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \
0         2  2017-04-01 00:03:54   2017-04-01 00:20:51                  N
1         2  2017-04-01 00:00:29   2017-04-01 00:02:44                  N
   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \
0           1            25            14                1           5.29
1           1           263            75                1           0.76
   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \
0         18.5    0.5      0.5        1.00           0.0        NaN
1          4.5    0.5      0.5        1.45           0.0        NaN
   improvement_surcharge  total_amount  payment_type  trip_type
0                    0.3         20.80             1        1.0
1                    0.3          7.25             1        1.0
There are five different values of 'payment_type', coded numerically as 1, 2, 3, 4, 5.
I find that 'tip_amount' is meaningful only when 'payment_type' is 1; 'payment_type' 2, 3, 4 and 5 all have zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model handle this kind of behavior automatically?
Second question: We know that 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5; it is just not recorded correctly. If I drop these samples and keep only 'payment_type' == 1, then the model cannot predict zero tip for 'payment_type' 2, 3, 4, 5 on an unseen test dataset. So I have to keep 'payment_type' as an important feature, right?
Third question: Say I keep the samples for all 'payment_type' values and the model is able to predict zero tip for 'payment_type' 2, 3, 4, 5. Is this what we really want? The underlying true tip should not be zero; that is just how the data looks.

A common saying in machine learning goes: garbage in, garbage out. Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment_type 2, 3, 4 and 5 all result in a tip of 0, why not keep it simple? Replace every payment type that is not 1 with 0. This lets your model easily associate 1 with a recorded tip and 0 with no recorded tip. It also reduces the number of things your model has to learn.
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is acceptable is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides recoding your payment_type feature, you should also look into normalizing your data, which helps many machine learning algorithms learn the relations in your data.
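A minimal sketch of the two preprocessing steps mentioned above, assuming a pandas DataFrame named raw with the columns from the question. The sample values, the new column name tip_recorded and the choice of z-score standardization are assumptions made for illustration only:

```python
import pandas as pd

# Hypothetical sample with some of the columns from the question
raw = pd.DataFrame({
    "payment_type": [1, 2, 3, 4, 5, 1],
    "trip_distance": [5.29, 0.76, 1.20, 3.40, 2.10, 0.90],
    "fare_amount": [18.5, 4.5, 6.0, 12.0, 9.0, 5.0],
    "tip_amount": [1.00, 0.0, 0.0, 0.0, 0.0, 1.45],
})

# Collapse payment_type into a binary feature:
# 1 where payment_type == 1 (tip is recorded), 0 for every other type
raw["tip_recorded"] = (raw["payment_type"] == 1).astype(int)

# Z-score standardization of the numeric features.
# In practice, compute the mean and std on the training set only
# and reuse the same values to transform the test set.
numeric_cols = ["trip_distance", "fare_amount"]
mean = raw[numeric_cols].mean()
std = raw[numeric_cols].std()
scaled = (raw[numeric_cols] - mean) / std

print(raw[["payment_type", "tip_recorded"]])
print(scaled.round(3))
```

Whatever transformation is applied to the training data has to be applied in the same way to the test data, otherwise the model sees inputs on a different scale than the one it was trained on.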

Related

Pandas - question about Groupby - 2 columns

I'm trying to print 2 columns against one another to see how much one depends on the other, i.e. how the chance of admission depends on research experience: print the average chance of admit against research.
I'm not sure I have the command right, or whether size or something else should be used at the end:
df.groupby(['Chance of Admit', 'Research']).size()
this is the result when I run the above:
Chance of Admit  Research
0.34             0            2
0.36             0            1
                 1            1
0.37             0            1
0.38             0            2
                             ..
0.93             1           12
0.94             1           13
0.95             1            5
0.96             1            8
0.97             1            4
Length: 99, dtype: int64

To see the mean chance of admission by research experience, group by 'Research' and take the mean:
df.groupby('Research').mean()
To see additional stats grouped by research experience, use .describe():
df.groupby('Research').describe()
Finally, it may be useful to plot the chance of admission against research experience:
df.plot.scatter(x='Research', y='Chance of Admit')

Syntactically speaking, to take the mean of the chance to admit column grouped by research, all you need is the following:
df[['Chance of Admit', 'Research']].groupby('Research').mean()
But be wary here, as the number may not actually tell you the chance of being admitted given a particular amount of research, because you don't know the denominators behind those probabilities.
For example, suppose I had a dataset that contained the following two rows:
Chance of Admit Research
0.75 1
0.25 1
The mean "Chance of Admit" for Research == 1 would be 0.50 when calculated this way. But suppose that the first chance came from a population of 100 students, of whom 75% (75) were admitted, and the second from a population of 900, of whom 25% (225) were admitted.
Then over all the data we would see 300 students with Research == 1 admitted out of a total of 1000, i.e. a 30% chance of admission with Research == 1.
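The arithmetic above can be reproduced in pandas. This is only a sketch: the 'Population' and 'Admitted' columns are hypothetical and are not part of the original dataset, which does not record how many students stand behind each rate.

```python
import pandas as pd

# The two rows from the example, plus an assumed 'Population' column
# giving the number of students behind each admission rate
df = pd.DataFrame({
    "Chance of Admit": [0.75, 0.25],
    "Research": [1, 1],
    "Population": [100, 900],
})

# Unweighted mean of the rates, as computed by groupby(...).mean()
naive = df.groupby("Research")["Chance of Admit"].mean()

# Weighted mean: total admitted divided by total students in each group
df["Admitted"] = df["Chance of Admit"] * df["Population"]
totals = df.groupby("Research")[["Admitted", "Population"]].sum()
weighted = totals["Admitted"] / totals["Population"]

print(naive.loc[1])     # 0.5
print(weighted.loc[1])  # 0.3
```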

Forecasting future occurrences with Random Forest

I'm currently exploring the use of Random Forests to predict future values of occurrences (my ARIMA model gave me really bad forecasting so I'm trying to evaluate other options). I'm fully aware that the bad results might be due to the fact that I don't have a lot of data and the quality isn't the greatest. My initial data consisted simply of the number of occurrences per date. I then added separate columns representing the day, month, year, day of the week (which was later one-hot encoded) and then I also added two columns with lagged values (one of them with the value observed in the day before and another with the value observed two days before). The final data is like this:
Count  Year    Month  Day   Count-1  Count-2  Friday  Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday
196.0  2017.0  7.0    10.0  196.0    196.0    0       1       0         0       0         0        0
264.0  2017.0  7.0    11.0  196.0    196.0    0       0       0         0       0         1        0
274.0  2017.0  7.0    12.0  264.0    196.0    0       0       0         0       0         0        1
286.0  2017.0  7.0    13.0  274.0    264.0    0       0       0         0       1         0        0
502.0  2017.0  7.0    14.0  286.0    274.0    1       0       0         0       0         0        0
...    ...     ...    ...   ...      ...      ...     ...     ...       ...     ...       ...      ...
I then trained a random forest with the count as the label (what I'm trying to predict) and all the rest as features. I also made a 70/30 train/test split, trained the model on the train data, and then used the test set to evaluate it (code below):
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The results I obtained were pretty good: MAE=1.71 and Accuracy of 89.84%.
First question: is there any possibility that I'm crazily overfitting the data? I just want to make sure I'm not making some big mistake that's giving me better results than I should get.
Second question: with the model trained, how do I use RF to predict future values? My goal is to give weekly forecasts for the number of occurrences, but I'm stuck on how to do that.
If someone who's a bit better and more experienced than me at this could help, I'd very much appreciate it! Thanks

Addressing your first question: a random forest can overfit, and you can check this by comparing the MAE, MSE or RMSE on your training set with the same metric on your test set. What do you mean by accuracy? Your R squared? A common way to work with models is to let them overfit at first, so that you get a decent accuracy/MSE/RMSE, and then rein in the overfitting by constraining the trees, for example with a higher min_samples_leaf or a lower max_depth. A high n_estimators is also good.
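One way to run that check is to compare training and test error for an unconstrained forest and for a constrained one. The sketch below uses synthetic data as a stand-in for the real features, and the specific values max_depth=4 and min_samples_leaf=10 are arbitrary assumptions, not tuned recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy data standing in for the real features and counts
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=500)

# shuffle=False keeps the row order, so the test set is the last 30% of rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

def train_test_mae(**params):
    """Fit a forest with the given constraints and return (train MAE, test MAE)."""
    model = RandomForestRegressor(n_estimators=100, random_state=42, **params)
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    return train_mae, test_mae

deep_train, deep_test = train_test_mae()
small_train, small_test = train_test_mae(max_depth=4, min_samples_leaf=10)

print(f"unconstrained: train MAE={deep_train:.2f}, test MAE={deep_test:.2f}")
print(f"constrained:   train MAE={small_train:.2f}, test MAE={small_test:.2f}")
```

A training error far below the test error suggests the model is memorizing the training rows. For data ordered by date, keeping the split chronological (as shuffle=False does here) avoids evaluating on days that lie between training days.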
Secondly, to use your model to predict future values, apply the exact same trained model to the dataset you want predictions for. The features given at training time must match the inputs given when forecasting. Keep in mind that, as time passes, newly observed data is very valuable: add it to your training set and retrain to improve the model.
forecasting = rf.predict(dataset_to_be_forecasted)
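Because the features in the question include lagged counts (Count-1 and Count-2), those values are not known for days beyond tomorrow. One possible way around this, which is an assumption here and not something stated in the answer, is to forecast one day at a time and feed each prediction back in as the next day's lag. The sketch below uses synthetic counts and a hypothetical helper make_features that mirrors the column layout from the question:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One-hot weekday columns, in the order shown in the question
DAYS = ["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"]

def make_features(date, lag1, lag2):
    """Build one feature row with the same columns used for training."""
    row = {"Year": date.year, "Month": date.month, "Day": date.day,
           "Count-1": lag1, "Count-2": lag2}
    row.update({day: int(date.day_name() == day) for day in DAYS})
    return row

# Hypothetical daily counts standing in for the real series
dates = pd.date_range("2017-07-10", periods=120, freq="D")
rng = np.random.default_rng(42)
counts = 200 + 50 * np.sin(np.arange(120) * 2 * np.pi / 7) + rng.normal(0, 10, 120)

# Training rows start at index 2 so that both lags exist
X = pd.DataFrame(
    [make_features(dates[i], counts[i - 1], counts[i - 2]) for i in range(2, len(dates))]
)
y = counts[2:]

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

def forecast(model, last_date, lag1, lag2, steps=7):
    """Predict `steps` days ahead, reusing each prediction as the next lag."""
    index, values = [], []
    date = last_date
    for _ in range(steps):
        date = date + pd.Timedelta(days=1)
        row = pd.DataFrame([make_features(date, lag1, lag2)])
        pred = float(model.predict(row)[0])
        index.append(date)
        values.append(pred)
        lag1, lag2 = pred, lag1  # shift the lags forward by one day
    return pd.Series(values, index=index)

weekly = forecast(rf, dates[-1], counts[-1], counts[-2])
print(weekly.round(1))
```

Errors can accumulate with this approach, since later predictions depend on earlier ones, so the forecast for day 7 is usually less reliable than the forecast for day 1.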

Linear regression containing Nans

I'm working on a model which will predict a number from others' opinions. For this I will use Linear Regression from sklearn.
For example, I have 5 agents from which I collect data over time: at each iteration I record each agent's latest value. If an agent has not entered a value yet, the data contains NaN until their first change. The data looks something like this:
a1 a2 a3 a4 a5 target
1 nan nan nan nan 3 4.5
2 4 nan nan nan 3 4.5
3 4 5 nan nan 3 4.5
4 4 5 5 nan 3 4.5
5 4 5 5 4 3 4.5
6 5 5 5 4 3 4.5
So in each iteration/change I want to predict the final number. As we know, linear regression doesn't allow any NaNs in the data. I replace them with 0, which doesn't ruin the answer, because the formula of linear regression is: result = a1*w1 + a2*w2 + ... + an*wn + c.
My current questions:
Does my solution somehow affect the fit? Is there a better solution for my problem? Should I train my model only on full data and then use it with the current solution?

Setting NaNs to 0 and training a linear regression to find coefficients for each of the variables may be fine, depending on the use case.
Why?
You are essentially training the model on many rows in which variables a1, a2, etc. have the value 0 (wherever the value was NaN), and the coefficients are fitted to those zeros.
If the NaNs are there because data has not been filled in yet, then setting them to 0 and training your model is wrong. It's better to train your model after all the data has been entered (at least for all the agents who have entered some data). This can later be used to predict for new agents. Otherwise, your coefficients will be overfit to the 0s (NaNs) if many agents have not yet entered their data.
Given the end target (which is a continuous variable), linear regression is a good approach.
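A minimal sketch of training only on complete rows, as suggested above. The data is synthetic, and the relationship between the opinions and the target (a plain average) is an assumption chosen so the fitted coefficients are easy to inspect:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["a1", "a2", "a3", "a4", "a5"]

# Hypothetical opinions from 5 agents on a 1-5 scale
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.integers(1, 6, size=(50, 5)).astype(float), columns=features
)
# Assumed relationship, for illustration only: the target is the mean opinion
data["target"] = data[features].mean(axis=1)

# Mimic agents a1 and a2 not having answered yet in the first 10 rows
data.loc[:9, ["a1", "a2"]] = np.nan

# Train only on rows where every agent has entered a value
complete = data.dropna(subset=features)
model = LinearRegression().fit(complete[features], complete["target"])

print(len(complete))         # rows left after dropping incomplete ones
print(model.coef_.round(3))  # one coefficient per agent
```

Filling the missing values with 0 instead of dropping those rows would pull the coefficients toward explaining the zeros, which is the overfitting effect described above.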

How to reverse Geocode (county level name) from lat-long in Python

I am using NYC trips data. I wanted to convert the lat-long present in the data to respective boroughs in NYC. I especially want to know if there is some NYC airport (Laguardia/JFK) present in one of those trips.
I know that Google Maps API and even libraries like Geopy get the reverse geocoding. However, most of them give city and country level codings.
I wanted to extract the borough or airport (like Queens, Manhattan, JFK, Laguardia etc) name from the lat-long. I have lat-long for both pickup and dropoff locations.
Here is a sample dataset in pandas dataframe.
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
You can find the data here too:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
After a bit of research I found I can leverage the Google Maps API to get county and even establishment level data.
Here is the code I wrote:
import requests

# Mapper function to get the geocode data from the Google API for the lat-long passed
def reverse_geocode(latlng):
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    data = requests.get(request).json()
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

# Geocode data for pickup lat-long
trip_data_sample["est_pickup"] = [y["address_components"][0]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
trip_data_sample["locality_pickup"] = [y["address_components"][2]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
However, I initially had 1.4 million records, and it was taking a lot of time to get this done. So I reduced it to 200K. Even that was taking a lot of time to run, so I reduced it to 115K, and even that took too much time.
So now I have reduced it to 50K. But this sample would hardly have a representative distribution of the whole data.
I was wondering if there is a better and faster way to reverse geocode the lat-long. I am not using Spark since I am running this on a local Mac, so Spark might not give much of a speedup on a single machine. Please advise.

Efficient operation over grouped dataframe Pandas

I have a very big pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over groups, do an operation on each group and union all those groups back into one dataframe. However, this is slow and I feel like there is a better way to achieve this. Here is the input and what I want out of it. Input:
ID price
1 100.00
1 80.00
1 90.00
2 40.00
2 40.00
2 50.00
Output:
ID price order
1 100.00 3
1 80.00 1
1 90.00 2
2 40.00 1
2 40.00 2 (could be 1, doesn't matter too much)
2 50.00 3
Since this is over about 5 million records with around 250,000 IDs, efficiency is important.

If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the package numpy-groupies.
import numpy as np

# work on plain arrays (IDs must be non-negative integers for np.bincount)
ids = df['ID'].to_numpy()
price = df['price'].to_numpy()
# get global sort order, for sorting by ID then price
full_idx = np.argsort(ids + 1j*price)
# get the sorted position of the first row of each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(ids)
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from the global sorted position to get the rank within each ID
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[ids[full_idx]]
df['rank'] = rank + 1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how @ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
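One way to do that testing is to compare the result against pandas' own groupby rank on a small random dataset. The sketch below wraps the steps above in a hypothetical helper named rank_within_id. The test data uses distinct prices, since the order of tied prices within an ID is not guaranteed to match rank(method="first"):

```python
import numpy as np
import pandas as pd

def rank_within_id(ids, price):
    """Rank price within each ID using a single complex-number argsort."""
    # complex values sort by real part (ID) first, then imaginary part (price)
    full_idx = np.argsort(ids + 1j * price)
    n_for_id = np.bincount(ids)
    first_of_idx = np.cumsum(n_for_id) - n_for_id
    rank = np.empty(len(ids), dtype=int)
    rank[full_idx] = np.arange(len(ids)) - first_of_idx[ids[full_idx]]
    return rank + 1

# Hypothetical test data: 1000 rows, 50 IDs, all prices distinct
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ID": rng.integers(0, 50, size=1000),
    "price": rng.permutation(1000).astype(float),
})

fast = rank_within_id(df["ID"].to_numpy(), df["price"].to_numpy())
reference = df.groupby("ID")["price"].rank(method="first").to_numpy().astype(int)

print(np.array_equal(fast, reference))
```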

You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
ID price order
0 1 100.0 3.0
1 1 80.0 1.0
2 1 90.0 2.0
3 2 40.0 1.0
4 2 40.0 2.0
5 2 50.0 3.0
It takes about 30s on a dataset of 5m rows with 250000 ID's (i5-3330) :
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
