How to reverse geocode (county-level name) from lat-long in Python

I am using NYC trips data. I want to convert the lat-long pairs present in the data to their respective boroughs in NYC. In particular, I want to know whether an NYC airport (LaGuardia/JFK) appears in any of those trips.
I know that the Google Maps API and libraries like Geopy can do reverse geocoding. However, most of them return city- and country-level results.
I want to extract the borough or airport name (like Queens, Manhattan, JFK, LaGuardia, etc.) from the lat-long. I have lat-long for both pickup and dropoff locations.
Here is a sample dataset in pandas dataframe.
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
You can find the data here too:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
After a bit of research I found I can leverage the Google Maps API to get county- and even establishment-level data.
Here is the code I wrote:
A mapper function that fetches the geocode data from the Google Geocoding API for the lat-long passed in:
import requests

def reverse_geocode(latlng):
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    data = requests.get(request).json()
    if len(data['results']) > 0:
        result = data['results'][0]
    return result
# Geo_code data for pickup-lat-long
trip_data_sample["est_pickup"] = [y["address_components"][0]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
trip_data_sample["locality_pickup"]=[y["address_components"][2]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
However, I initially had 1.4MM records, and it was taking a lot of time. So I reduced it to 200K records. Even that was taking a long time to run, so I reduced it to 115K, and then to 50K. But a sample that small would hardly have a representative distribution of the whole data.
I was wondering if there is any better and faster way to get the reverse geocode of a lat-long. I am not using Spark since I am running this on a local Mac, and Spark might not give much of a speed advantage on a single machine. Please advise.
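One commonly suggested alternative (not from the post) is to skip the web API entirely and do the borough lookup locally with a point-in-polygon test against NYC borough boundaries, for example with geopandas; the file name below is hypothetical:
import geopandas as gpd
from shapely.geometry import Point

# hypothetical local file with NYC borough polygons (e.g. exported from NYC Open Data)
boroughs = gpd.read_file('nyc_boroughs.geojson')

# build point geometries from the pickup coordinates
pickup_points = gpd.GeoDataFrame(
    trip_data_sample,
    geometry=[Point(xy) for xy in zip(trip_data_sample['Pickup_longitude'],
                                      trip_data_sample['Pickup_latitude'])],
    crs=boroughs.crs)

# spatial join: each pickup gets the borough polygon it falls inside
pickups_with_borough = gpd.sjoin(pickup_points, boroughs, how='left', predicate='within')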

Related

Pandas.pivot replace some existing values with NaN

I have to read some tables from a SQL Server and merge their values into a single data structure for a machine learning project. I'm using Pandas, in particular pd.read_sql_query to read the values and pd.merge to fuse them.
One table has at least 80 million rows, and if I try to read it entirely it fills all my memory (it's not much, only 20 GB, and I've got a small SSD), so I decided to divide it into chunks of 100,000 rows:
df_bilanci = pd.read_sql_query('SELECT [IdBilancioRiclassificato], [IdBilancio], [Importo] FROM dbo.Bilanci', conn, chunksize=100000)
One single chunk will be like this:
   IdBilancioRiclassificato  IdBilancio   Importo
0                      10.0      7001.0       0.0
1                      11.0      7001.0  502643.0
2                      12.0      7001.0   -4550.0
3                      10.0      7002.0  654654.0
4                      11.0      7002.0       0.0
5                      12.0      7002.0       0.0
I want to have the values of IdBilancioRiclassificato as columns (there are 97 unique values for this column, so there should be 97 columns), so I used pd.pivot on every chunk and then pd.concat plus merge to put all the data together:
for chunk in df_bilanci:
    chunk.reset_index()
    chunk_pivoted = pd.pivot(data=chunk,
                             index='IdBilancio',
                             columns='IdBilancioRiclassificato',
                             values='Importo')
    df_aziende_bil = pd.concat([df_aziende_bil,
                                pd.merge(left=df_aziende_anagrafe, right=chunk_pivoted,
                                         left_on='ID', right_index=True)])
At this point, however, the chunk_pivoted dataframe has some values replaced with NaN, even though the values exist in the source table.
The result expected is a table like this one:
                10.0      11.0     12.0
IdBilancio
7001.0           0.0  502643.0  -4550.0
7002.0      654654.0       0.0      0.0
but I've got something like this:
                10.0  11.0     12.0
IdBilancio
7001.0           0.0   NaN  -4550.0
7002.0      654654.0   NaN      NaN
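A likely cause (my reading, not stated in the post): rows for the same IdBilancio can be split across chunks, so each chunk's pivot only contains the IdBilancioRiclassificato values that happened to fall in that chunk, and the other columns come out as NaN. A rough sketch that pivots each chunk and then collapses the duplicated IdBilancio rows after concatenation, assuming the same query, connection and dataframes as above:
import pandas as pd

pivoted_chunks = []
for chunk in pd.read_sql_query(
        'SELECT [IdBilancioRiclassificato], [IdBilancio], [Importo] FROM dbo.Bilanci',
        conn, chunksize=100000):
    pivoted_chunks.append(chunk.pivot(index='IdBilancio',
                                      columns='IdBilancioRiclassificato',
                                      values='Importo'))

# the same IdBilancio may appear in several chunk pivots with NaN in different
# columns; groupby + first keeps the first non-NaN value per column and per ID
full_pivot = pd.concat(pivoted_chunks).groupby(level=0).first()

df_aziende_bil = pd.merge(left=df_aziende_anagrafe, right=full_pivot,
                          left_on='ID', right_index=True)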

How to remove NaN values from corr() function output

EDITED TO SHOW EXAMPLE OF ORIGINAL DATAFRAME:
df.head(4)
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
I've successfully generated the following dataframe using get_dummies():
wedding_seasons = pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )
wedding_seasons.head(3)
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
The goal of the above is to help assess frequency of weddings across seasons, so I've used corr() to generate the following result:
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
I'm unsure why the wedding column is generating NaN values, but my gut feeling is that it originates from how I originally created wedding_seasons. Any guidance would be greatly appreciated so that I can properly assess column correlations.
I don't think what you're interested in seeing here is the "correlation".
All of the columns in the dataframe wedding_seasons contain floating point values; however, if my suspicions are correct, the rows in your original dataframe df contain something like transaction records, where each row corresponds to an individual.
Please tell me if I'm incorrect, but I'll proceed with my reasoning.
Correlation measures, intuitively, the tendency of values to vary together or against each other within the same observation (e.g. if X and Y are negatively correlated, then when we see X go above its mean, we'd expect Y to appear below its mean).
However, what you have here is data where, if one transaction is summer, then categorically it cannot also be winter. When you create wedding_seasons, Pandas creates dummy variables that are treated as floating point values when computing your correlation matrix; since no row can contain two 1.0 entries among the season columns at the same time, the resulting correlation matrix is going to be dominated by negative entries.
As for the NaN column: after you filter to category == 'weddings', the weddings dummy is 1.0 in every row, so it has zero variance and its correlation with anything is undefined, which pandas reports as NaN. You could drop that column before doing corr(). Note that drop returns a new dataframe, so either chain the call or reassign:
wedding_seasons.drop(columns=['weddings']).corr()
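If the underlying goal is just to see how wedding counts are distributed over the seasons, a simpler route (a sketch, not part of the original answer) is to count rows directly in the original df rather than correlating dummy columns:
# number of wedding transactions per season
season_counts = df.loc[df['category'] == 'weddings', 'season'].value_counts()
print(season_counts)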

Does the test set need data cleaning in machine learning?

I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv). The target is predicting the tip amount. The raw data looks like this (2 data samples):
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \
0 2 2017-04-01 00:03:54 2017-04-01 00:20:51 N
1 2 2017-04-01 00:00:29 2017-04-01 00:02:44 N
RatecodeID PULocationID DOLocationID passenger_count trip_distance \
0 1 25 14 1 5.29
1 1 263 75 1 0.76
fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \
0 18.5 0.5 0.5 1.00 0.0 NaN
1 4.5 0.5 0.5 1.45 0.0 NaN
improvement_surcharge total_amount payment_type trip_type
0 0.3 20.80 1 1.0
1 0.3 7.25 1 1.0
There are five different 'payment_type' values, indicated by the numbers 1, 2, 3, 4, 5.
I find that 'tip_amount' is only meaningful when 'payment_type' is 1; 'payment_type' 2, 3, 4, 5 all have zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?
Second question: We know that 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5; it just is not correctly recorded. If I drop these data samples and only keep 'payment_type' == 1, then the model cannot predict a zero tip for 'payment_type' 2, 3, 4, 5 on an unseen test dataset, so I have to keep 'payment_type' as an important feature, right?
Third question: Let's say I keep data samples for all the different 'payment_type' values and the model is able to predict a zero tip amount for 'payment_type' 2, 3, 4, 5. But is this what we really want? The underlying true tip should not be zero; that is just how the data was recorded.
A common saying in machine learning goes: garbage in, garbage out. Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment_type values of 2, 3, 4, 5 all result in a zero tip, why not just keep it simple: replace all payment types that are not 1 with 0. This will let your model easily associate 1 with trips where a tip is recorded and 0 with trips where it is not. It also reduces the amount of things your model has to learn.
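A minimal sketch of that replacement, assuming the dataframe is still called raw as in the question:
# collapse payment types: 1 (tip recorded) stays 1, everything else becomes 0
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)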
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is what we want or not what we want is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides making some changes to your payment_type feature, you should also look into normalizing your data, which will help your machine learning algorithm better generalize the relations in your data.
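As one illustration of the normalization point (a sketch, not part of the original answer; X_train, X_test and the column list are hypothetical names), scalers such as scikit-learn's StandardScaler should be fit on the training split only and then reused on the test split, so the test set is cleaned with the training statistics:
from sklearn.preprocessing import StandardScaler

# hypothetical names: X_train / X_test are the train and test splits,
# numeric_cols is a subset of the numeric feature columns
numeric_cols = ['trip_distance', 'fare_amount']
scaler = StandardScaler().fit(X_train[numeric_cols])            # fit on training data only
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])   # reuse the training statistics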

Transform Pandas DataFrame to LIBFM format txt file

I want to transform a pandas DataFrame in Python into a sparse-matrix text file in the LIBFM format.
The format needs to look like this:
4 0:1.5 3:-7.9
2 1:1e-5 3:2
-1 6:1
This file contains three cases. The first column states the target of each of the three cases: 4 for the first case, 2 for the second and -1 for the third. After the target, each line contains the non-zero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = -7.9, etc. That means the left side of INDEX:VALUE states the index within x, whereas the right side states the value of x.
In total the data from the example describes the following design matrix X and target vector y:
X:  1.5   0.0   0.0  -7.9   0.0   0.0   0.0
    0.0  1e-5   0.0   2.0   0.0   0.0   0.0
    0.0   0.0   0.0   0.0   0.0   0.0   1.0

y:   4
     2
    -1
This is also explained in the Manual file under chapter 2.
Now here is my problem: I have a pandas dataframe that looks like this:
overall reviewerID asin brand Positive Negative \
0 5.0 A2XVJBSRI3SWDI 0000031887 Boutique Cutie 3.0 -1
1 4.0 A2G0LNLN79Q6HR 0000031887 Boutique Cutie 5.0 -2
2 2.0 A2R3K1KX09QBYP 0000031887 Boutique Cutie 3.0 -2
3 1.0 A19PBP93OF896 0000031887 Boutique Cutie 2.0 -3
4 4.0 A1P0IHU93EF9ZK 0000031887 Boutique Cutie 2.0 -2
LDA_0 LDA_1 ... LDA_98 LDA_99
0 0.000833 0.000833 ... 0.000833 0.000833
1 0.000769 0.000769 ... 0.000769 0.000769
2 0.000417 0.000417 ... 0.000417 0.000417
3 0.000137 0.014101 ... 0.013836 0.000137
4 0.000625 0.000625 ... 0.063125 0.000625
Where "overall" is the target column and all other 105 columns are features.
The 'ReviewerId', 'Asin' and 'Brand' columns needs to be changed to dummy variables. So each unique 'ReviewerID', 'Asin' and brand gets his own column. This means if 'ReviewerID' has 100 unique values you get 100 columns where the value is 1 if that row represents the specific Reviewer and else zero.
All other columns don't need to get reformatted. So the index for those columns can just be the column number.
So the first 3 rows in the above pandas data frame need to be transformed to the following output:
5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.00833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.00769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417
The LIBFM package includes a program that can transform User - Item - Rating data into the LIBFM input format. However, this program can't cope with this many columns.
Is there an easy way to do this? I have 1 million rows in total.
The LibFM executable expects the input in the libSVM format that you have described here. If the file converter in the LibFM package does not work for your data, try scikit-learn's sklearn.datasets.dump_svmlight_file function.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
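A rough sketch of that route (assuming the dataframe is called df as printed above and that a sparse design matrix is acceptable; column names follow the printed header):
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import dump_svmlight_file

id_cols = ['reviewerID', 'asin', 'brand']
num_cols = [c for c in df.columns if c not in id_cols + ['overall']]

# one-hot encode the ID-like columns into a sparse matrix and append the
# numeric feature columns, so the full design matrix is never densified
enc = OneHotEncoder()
X = sparse.hstack([enc.fit_transform(df[id_cols]),
                   sparse.csr_matrix(df[num_cols].values)]).tocsr()
y = df['overall'].values

# dump_svmlight_file writes the "target index:value" lines that LibFM reads
dump_svmlight_file(X, y, 'reviews.libfm', zero_based=True)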

Efficient operation over grouped dataframe Pandas

I have a very big Pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over the groups, do an operation on each group and union all those groups back into one dataframe, but this is slow and I feel there is a better way to achieve this. Here is the input and what I want out of it. Input:
ID price
1 100.00
1 80.00
1 90.00
2 40.00
2 40.00
2 50.00
Output:
ID price order
1 100.00 3
1 80.00 1
1 90.00 2
2 40.00 1
2 40.00 2 (could be 1, doesn't matter too much)
2 50.00 3
Since this is over about 5 million records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the numpy-groupies package.
import numpy as np

# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get min of full_idx for each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from full_idx
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how #ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
ID price order
0 1 100.0 3.0
1 1 80.0 1.0
2 1 90.0 2.0
3 2 40.0 1.0
4 2 40.0 2.0
5 2 50.0 3.0
It takes about 30s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
