I am currently working on a Big Data project that requires merging a number of files into a single table that can be analysed through SAS. The majority of the work is complete, with the final fact tables still needing to be added to the final output.
I have hit a snag when attempting to combine one fact table with the final output. The csv file, which has been loaded into its own dataframe, contains the following columns.
table name: POP
year | Borough | Population
Within the dataset they are to be joined onto, these fields also exist, along with around 26 others. When I first attempted the merge with the following line:
Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')
the following error was returned
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
I understood this to simply be a data type mismatch, so I added the following line ahead of the merge command:
POP['year'] = POP['year'].astype(object)
Doing so allows "successful" execution of the program; however, the output file has the population column filled with NaN, when it should have the appropriate population for each row where a combination of "year" and "Borough" matches one found in the POP table.
Any help would be greatly appreciated. I provide below a fuller excerpt of the code for those who find that easier to parse:
import pandas as pd
#
# Add Population Data
#
#rename columns for easier joining
POP.rename(columns={"Area name":"Borough"}, inplace=True)
POP.rename(columns={"Persons":"Population"}, inplace=True)
POP.rename(columns={"Year":"year"}, inplace=True)
#convert type of output column to allow join
POP['year'] = POP['year'].astype(object)
#add to output file
Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')
Additionally, find below some information about the data types and shapes of the tables involved, in case it is of use:
> Output table info
>
> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 34241 entries, 0 to 38179
> Data columns (total 2 columns):
> year       34241 non-null object
> Borough    34241 non-null object
> dtypes: object(2)
> memory usage: 535.0+ KB
> None
> table shape: (34241, 36)
> ----------
>
> POP table info
>
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 357 entries, 0 to 356
> Data columns (total 3 columns):
> year          357 non-null object
> Borough       357 non-null object
> Population    357 non-null object
> dtypes: object(3)
> memory usage: 4.2+ KB
> None
> table shape: (357, 3)
Finally, my apologies if any of this is asked or presented incorrectly; I'm relatively new to Python and it's my first time posting on Stack Overflow as a contributor.
EDITS:
(1)
As requested, here are samples from the data.
This is the Output dataframe:
Borough Date totalIncidents Calculated Mean Closure \
0 Camden 2013-11-06 2.0 613.5
1 Kensington and Chelsea 2013-11-06 NaN NaN
2 Westminster 2013-11-06 1.0 113.0
PM2.5 (ug/m3) PM10 (ug/m3) (R) SO2 (ug/m3) (R) CO (mg m-3) (R) \
0 9.55 16.200 5.3 NaN
1 10.65 21.125 1.7 0.2
2 19.90 30.600 NaN 0.7
NO (ug/m3) (R) NO2 (ug/m3) (R) O3 (ug/m3) (R) Bus Stops \
0 135.866670 82.033333 24.4 447.0
1 80.360000 65.680000 29.3 270.0
2 171.033333 109.000000 21.3 489.0
Cycle Parking Points \
0 67.0
1 27.0
2 45.0
Average Public Transport Access Index 2015 (AvPTAI2015) \
0 24.316782
1 23.262691
2 39.750796
Public Transport Accessibility Levels (PTAL) Catagorisation \
0 5
1 5
2 6a
Underground Stations in Borough \
0 16.0
1 12.0
2 31.0
PM2.5 Daily Air Quality Index(DAQI) classification \
0 Low
1 Low
2 Low
PM2.5 Above WHO 24hr mean Guidline PM10 Above WHO 24hr mean Guidline \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
O3 Above WHO 8hr mean Guidline* NO2 Above WHO 1hr mean Guidline* \
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
SO2 Above WHO 24hr mean Guidline SO2 Above EU 24hr mean Allowence \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
NO2 Above EU 24hr mean Allowence CO Above EU 8hr mean Allowence \
0 1.0 0.0
1 1.0 0.0
2 1.0 0.0
O3 Above EU 8hr mean Allowence year NO2 Year Mean (ug/m3) \
0 0.0 2013 50.003618
1 0.0 2013 50.003618
2 0.0 2013 50.003618
PM2.5 Year Mean (ug/m3) PM10 Year Mean (ug/m3) \
0 15.339228 24.530299
1 15.339228 24.530299
2 15.339228 24.530299
NO2 Above WHO Annual mean Guidline NO2 Above EU Annual mean Allowence \
0 0.0 1.0
1 0.0 1.0
2 0.0 1.0
PM2.5 Above EU Annual mean Allowence PM10 Above EU Annual mean Allowence \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
Number of Bicycle Hires (All Boroughs)
0 18,431
1 18,431
2 18,431
Here is the Population (POP) dataframe:
year Borough Population
0 2010 Barking and Dagenham 182,838
1 2011 Barking and Dagenham 187,029
2 2012 Barking and Dagenham 190,560
(2)
So this does seem to have been a data type issue, but I'm still not entirely sure why, as I had attempted recasting the data types. The solution that finally got me going was to save the Output dataframe to a csv and reload it into the program; from there the merge started working again.
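For anyone who hits the same thing: as far as I can tell, astype(object) only relabels the dtype and does not change the underlying values, so an integer 2013 on one side still will not match the string '2013' on the other. A minimal sketch (untested here, and assuming the dataframes are named as above) of casting both key columns to one concrete type before merging:

# Cast both sides of each merge key to the same concrete type so the
# values themselves (not just the dtype labels) compare equal.
Output['year'] = Output['year'].astype(int)
POP['year'] = POP['year'].astype(int)
Output['Borough'] = Output['Borough'].astype(str).str.strip()
POP['Borough'] = POP['Borough'].astype(str).str.strip()

Output = pd.merge(Output, POP, on=['year', 'Borough'], how='outer')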
Related
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
# .drop(columns = 'gestationalAgeInWeeks') # don't need this
.groupby(['MotherID', 'PregnancyID','tm'])['abdomCirc'] # change here
.max().add_prefix('abdomCirc_') # here
.unstack()
.reset_index() # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 abdomCirc_0 abdomCirc_0 NaN 200.0 NaN
1 abdomCirc_1 abdomCirc_1 NaN 315.0 350.0
2 abdomCirc_2 abdomCirc_2 180.0 NaN NaN
I have two sets of continuous data that I would like to pass into a contour plot. The x-axis would be time, the y-axis would be mass, and the z-axis would be frequency (as in how many times that data point appears). However, most data points are not identical but rather very similar. Thus, I suspect it's easiest to discretize both the x-axis and y-axis.
Here's the data I currently have:
INPUT
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Dates'].head(5)
df['Mass'].head(5)
OUTPUT
13 2003-05-09
14 2003-09-09
15 2010-01-18
16 2010-11-21
17 2012-06-29
Name: Date, dtype: datetime64[ns]
13 2500.0
14 3500.0
15 4000.0
16 4500.0
17 5000.0
Name: Mass, dtype: float64
I'd like to convert the data such that it groups up data points within the year (ex: all datapoints taken in 2003) and it groups up data points within different levels of mass (ex: all datapoints between 3000-4000 kg). Next, the code would count how many data points are within each of these blocks and pass that as the z-axis.
Ideally, I'd also like to be able to adjust the levels of slices. Ex: grouping points up every 100kg instead of 1000kg, or passing a custom list of levels that aren't equally distributed. How would I go about doing this?
I think the function you are looking for is pd.cut
import pandas as pd
import numpy as np
import datetime
n = 10
scale = 1e3
Min = 0
Max = 1e4
np.random.seed(6)
Start = datetime.datetime(2000, 1, 1)
Dates = np.array([Start + datetime.timedelta(days=i*180) for i in range(n)])
Mass = np.random.rand(n)*10000
df = pd.DataFrame(index = Dates, data = {'Mass':Mass})
print(df)
gives you:
Mass
2000-01-01 8928.601514
2000-06-29 3319.798053
2000-12-26 8212.291231
2001-06-24 416.966257
2001-12-21 1076.566799
2002-06-19 5950.520642
2002-12-16 5298.173622
2003-06-14 4188.074286
2003-12-11 3354.078493
2004-06-08 6225.194322
If you want to group your Masses by, say, 1000, or implement your own custom bins, you can do this:
Bins = np.arange(Min, Max + 0.1, scale)
Labels = np.arange(Min, Max, scale) + scale/2
EqualBins = pd.cut(df['Mass'], bins=Bins, labels=Labels)
df.insert(1, 'Equal Bins', EqualBins)
Bins = [0, 1000, 5000, 10000]
Labels = ['Small', 'Medium', 'Big']
CustomBins = pd.cut(df['Mass'], bins=Bins, labels=Labels)
df.insert(2, 'Custom Bins', CustomBins)
If you want to just show the year, month, etc it is very simple:
df['Year'] = df.index.year
df['Month'] = df.index.month
but you can also do custom date ranges if you like:
Bins=[datetime.datetime(1999, 12, 31),datetime.datetime(2000, 9, 1),
datetime.datetime(2002, 1, 1),datetime.datetime(2010, 9, 1)]
Labels = ['Early','Middle','Late']
CustomDateBins = pd.cut(df.index,bins=Bins,labels=Labels)
df.insert(3,'Custom Date Bins',CustomDateBins)
print(df)
This yields something like what you want:
Mass Equal Bins Custom Bins Custom Date Bins Year Month
2000-01-01 8928.601514 8500.0 Big Early 2000 1
2000-06-29 3319.798053 3500.0 Medium Early 2000 6
2000-12-26 8212.291231 8500.0 Big Middle 2000 12
2001-06-24 416.966257 500.0 Small Middle 2001 6
2001-12-21 1076.566799 1500.0 Medium Middle 2001 12
2002-06-19 5950.520642 5500.0 Big Late 2002 6
2002-12-16 5298.173622 5500.0 Big Late 2002 12
2003-06-14 4188.074286 4500.0 Medium Late 2003 6
2003-12-11 3354.078493 3500.0 Medium Late 2003 12
2004-06-08 6225.194322 6500.0 Big Late 2004 6
The .groupby function is probably of interest to you as well:
yeargroup = df.groupby(df.index.year).mean()
massgroup = df.groupby(df['Equal Bins']).count()
print(yeargroup)
print(massgroup)
Mass Year Month
2000 6820.230266 2000.0 6.333333
2001 746.766528 2001.0 9.000000
2002 5624.347132 2002.0 9.000000
2003 3771.076389 2003.0 9.000000
2004 6225.194322 2004.0 6.000000
Mass Custom Bins Custom Date Bins Year Month
Equal Bins
500.0 1 1 1 1 1
1500.0 1 1 1 1 1
2500.0 0 0 0 0 0
3500.0 2 2 2 2 2
4500.0 1 1 1 1 1
5500.0 2 2 2 2 2
6500.0 1 1 1 1 1
7500.0 0 0 0 0 0
8500.0 2 2 2 2 2
9500.0 0 0 0 0 0
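From there, a minimal sketch of getting to the contour plot itself, assuming matplotlib is available: count the rows per (mass bin, year) cell and use that grid as the z-axis.

import matplotlib.pyplot as plt

# frequency of observations in each (mass bin, year) cell
counts = df.groupby(['Equal Bins', 'Year'])['Mass'].count().unstack(fill_value=0)

# x: years, y: mass bin centres, z: counts
plt.contourf(counts.columns, counts.index.astype(float), counts.values)
plt.xlabel('Year')
plt.ylabel('Mass bin centre')
plt.colorbar(label='Frequency')
plt.show()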
I want to transform a Pandas Data frame in python to a sparse matrix txt file in the LIBFM format.
Here the format needs to look like this:
4 0:1.5 3:-7.9
2 1:1e-5 3:2
-1 6:1
This file contains three cases. The first column states the target of each of the three cases: 4 for the first case, 2 for the second and -1 for the third. After the target, each line contains the non-zero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = -7.9, etc. That means the left side of INDEX:VALUE states the index within x, whereas the right side states the value of x.
In total the data from the example describes the following design matrix X and target vector y:
     1.5   0.0    0.0  -7.9   0.0   0.0   0.0
X:   0.0   1e-5   0.0   2.0   0.0   0.0   0.0
     0.0   0.0    0.0   0.0   0.0   0.0   1.0

      4
Y:    2
     -1
This is also explained in the Manual file under chapter 2.
Now here is my problem: I have a pandas dataframe that looks like this:
overall reviewerID asin brand Positive Negative \
0 5.0 A2XVJBSRI3SWDI 0000031887 Boutique Cutie 3.0 -1
1 4.0 A2G0LNLN79Q6HR 0000031887 Boutique Cutie 5.0 -2
2 2.0 A2R3K1KX09QBYP 0000031887 Boutique Cutie 3.0 -2
3 1.0 A19PBP93OF896 0000031887 Boutique Cutie 2.0 -3
4 4.0 A1P0IHU93EF9ZK 0000031887 Boutique Cutie 2.0 -2
LDA_0 LDA_1 ... LDA_98 LDA_99
0 0.000833 0.000833 ... 0.000833 0.000833
1 0.000769 0.000769 ... 0.000769 0.000769
2 0.000417 0.000417 ... 0.000417 0.000417
3 0.000137 0.014101 ... 0.013836 0.000137
4 0.000625 0.000625 ... 0.063125 0.000625
Where "overall" is the target column and all other 105 columns are features.
The 'reviewerID', 'asin' and 'brand' columns need to be changed to dummy variables, so each unique reviewerID, asin and brand gets its own column. This means that if 'reviewerID' has 100 unique values you get 100 columns, where the value is 1 if that row represents the specific reviewer and 0 otherwise.
All other columns don't need to be reformatted, so the index for those columns can just be the column number.
So the first 3 rows in the above pandas data frame need to be transformed to the following output:
5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.00833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.00769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417
In the LIBFM package there is a program that can transform User - Item - Rating data into the LIBFM output format. However, this program can't cope with this many columns.
Is there an easy way to do this? I have 1 million rows in total.
The LibFM executable expects the input in the libSVM format you have explained here. If the file converter in the LibFM package does not work for your data, try the scikit-learn sklearn.datasets.dump_svmlight_file method.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
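A minimal sketch of that route, assuming the dataframe is called df with the columns shown in the question (the dummy encoding and the sparse stacking below are one possible way to prepare the matrix, not part of the converter itself):

import pandas as pd
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file

y = df['overall'].values
id_cols = ['reviewerID', 'asin', 'brand']

# one 0/1 column per unique reviewer / item / brand
# (for very many unique IDs, pd.get_dummies(..., sparse=True) keeps this step smaller)
dummies = sp.csr_matrix(pd.get_dummies(df[id_cols]).values)

# keep the remaining numeric feature columns as they are
numeric = sp.csr_matrix(df.drop(columns=['overall'] + id_cols).values)

# stack into one sparse matrix, then write zero-based INDEX:VALUE lines
X = sp.hstack([dummies, numeric], format='csr')
dump_svmlight_file(X, y, 'train.libfm', zero_based=True)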
I am using NYC trips data. I want to convert the lat-long values present in the data to the respective boroughs of NYC. In particular I want to know whether an NYC airport (LaGuardia/JFK) appears in any of those trips.
I know that the Google Maps API and even libraries like Geopy do reverse geocoding. However, most of them give city- and country-level results.
I want to extract the borough or airport name (like Queens, Manhattan, JFK, LaGuardia etc.) from the lat-long. I have lat-longs for both pickup and dropoff locations.
Here is a sample dataset in pandas dataframe.
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
You can find the data here too:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
After a bit of research I found I can leverage the Google Maps API to get county- and even establishment-level data.
Here is the code I wrote:
A mapper function to get the geocode data from the Google API for the lat-long passed:
import requests

def reverse_geocode(latlng):
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    data = requests.get(request).json()
    if len(data['results']) > 0:
        result = data['results'][0]
    return result
# Geo_code data for pickup-lat-long
trip_data_sample["est_pickup"] = [y["address_components"][0]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
trip_data_sample["locality_pickup"]=[y["address_components"][2]["long_name"] for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)]
However, I initially had 1.4MM records, and it was taking a long time to get this done. So I reduced it to 200K; even that took a long time to run. I then reduced it to 115K, which was still too slow.
So now I have reduced it to 50K, but that sample would hardly have a representative distribution of the whole data.
I was wondering if there is a better and faster way to reverse geocode these lat-longs. I am not using Spark since I am running this on a local Mac, so Spark might not give much of a speed advantage on a single machine. Please advise.
I want to find the pct_change of Dew_P Temp (C) from the initial value of -3.9. I want the pct_change in a new column.
Source here:
weather = pd.read_csv('https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/weather_2012.csv')
weather[weather.columns[:4]].head()
Date/Time Temp (C) Dew_P Temp (C) Rel Hum (%)
0 2012-01-01 -1.8 -3.9 86
1 2012-01-01 -1.8 -3.7 87
2 2012-01-01 -1.8 -3.4 89
3 2012-01-01 -1.5 -3.2 88
4 2012-01-01 -1.5 -3.3 88
I have tried variations of this for loop (even going as far as adding an index, as shown here) but to no avail:
for index, dew_point in weather['Dew_P Temp (C)'].iteritems():
    new = weather['Dew_P Temp (C)'][index]
    old = weather['Dew_P Temp (C)'][0]
    pct_diff = (new - old) / old * 100
    weather['pct_diff'] = pct_diff
I think the problem is the weather['pct_diff'] assignment: it doesn't take each new value, it takes the last value of the data frame and subtracts it from old.
So it's always (2.1 - 3.9) / 3.9 * 100, and thus my percent change is always -46%.
The end result I want is this:
Date/Time Temp (C) Dew_P Temp (C) Rel Hum (%) pct_diff
0 2012-01-01 -1.8 -3.9 86 0.00%
1 2012-01-01 -1.8 -3.7 87 5.12%
2 2012-01-01 -1.8 -3.4 89 12.82%
Any ideas? Thanks!
You can use iat to access the scalar value (e.g. iat[0] accesses the first value in the series).
df = weather
df['pct_diff'] = df['Dew_P Temp (C)'] / df['Dew_P Temp (C)'].iat[0] - 1
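If you then want absolute percentages in roughly the format of the expected output, one way is to take the absolute value and scale to percent:

# turn the fractional change into an absolute percentage, e.g. 0.0513 -> 5.13
df['pct_diff'] = (df['pct_diff'].abs() * 100).round(2)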
IIUC you can do it this way:
In [88]: ((weather['Dew Point Temp (C)'] - weather.loc[0, 'Dew Point Temp (C)']).abs() / weather.loc[0, 'Dew Point Temp (C)']).abs() * 100
Out[88]:
0 0.000000
1 5.128205
2 12.820513
3 17.948718
4 15.384615
5 15.384615
6 20.512821
7 7.692308
8 7.692308
9 20.512821
I find this more graceful
weather['Dew_P Temp (C)'].pct_change().fillna(0).add(1).cumprod().sub(1)
0 0.000000
1 -0.051282
2 -0.128205
3 -0.179487
4 -0.153846
Name: Dew_P Temp (C), dtype: float64
To get your expected output with absolute values
weather['pct_diff'] = weather['Dew_P Temp (C)'].pct_change().fillna(0).add(1).cumprod().sub(1).abs()
weather