I'm learning python and am currently trying to parse out the longitude and latitude from a "Location" column and assign them to the 'lat' and 'lon' columns. I currently have the following code:
def getlatlong(cell):
    dd['lat'] = cell.split('\n')[2].split(',')[0][1:]
    dd['lon'] = cell.split('\n')[2].split(',')[1][1:-1]

dd['Location'] = dd['Location'].apply(getlatlong)
dd.head()
The splitting portion of the code works. The problem is that this code copies the lat and lon from the last cell in the dataframe to all of the 'lat' and 'lon' rows. I want it to split the current row it is iterating through, assign the 'lat' and 'lon' values for that row, and then do the same on every subsequent row.
I get that assigning dd['lat'] to the split value assigns it to the whole column, but I don't know how to assign to just the row currently being iterated over.
Data sample upon request:
Index,Location
0,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"
1,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67931141, -121.7765988)"
2,"138 14TH ST\nOAKLAND, CA 94612\n(37.80140803, -122.26369831)"
3,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968061, -122.19690846)"
4,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968557, -122.19692165)"
Please see my approach below. It is based on creating a DataFrame with lat and lon columns and then adding it to the existing dataframe.
def getlatlong(x):
    return pd.Series([x.split('\n')[2].split(',')[0][1:],
                      x.split('\n')[2].split(',')[1][1:-1]],
                     index=["lat", "lon"])

df = pd.concat((df, df.Location.apply(getlatlong)), axis=1)
This addresses another technique you can use to get the answer, but isn't the exact code you need. If you add sample data I can tailor it.
Using pandas's built-in str methods you can save yourself some headache, as follows:
temp_df = df['Location'].str.split().apply(pd.Series)
The above splits the Location column on whitespace and then turns the split values into columns. You can then assign just the Latitude and Longitude columns to the original df.
df[['Latitude', 'Longitude']] = temp_df[[<selection1>, <selection2>]]
str.split() also has an expand parameter so that you can write .str.split("char", expand=True) to spread out the columns without the apply.
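For instance, a brief sketch of that variant on the sample format above (splitting on newlines so that column 2 of the result holds the "(lat, lon)" string):
parts = df['Location'].str.split('\n', expand=True)   # one column per line of the address
df[['Latitude', 'Longitude']] = (parts[2]             # the "(lat, lon)" string
                                 .str[1:-1]           # strip the parentheses
                                 .str.split(',', expand=True)
                                 .astype(float))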
Update
Given your example, this works for your specific case:
import pandas as pd

df = pd.DataFrame({"Location": ["1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"]})
df[["Latitude", "Longitude"]] = (df['Location']
                                 .str.split('\n')
                                 .apply(pd.Series)[2]          # column 2 has the "(lat, lon)" string
                                 .str[1:-1]                    # strip the ()
                                 .str.split(",", expand=True)  # expand latitude and longitude into two columns
                                 .astype(float))               # make sure latitude and longitude are floats
Out:
                                            Location   Latitude   Longitude
0  1554 FIRST ST\nLIVERMORE, CA 94550\n(37.679306...  37.679306 -121.776586
Update #2
@Abhishek Mishra's answer is faster (it takes only about 55% of the time, since it goes through the data fewer times). Worth noting that the output from that example has strings in each column, so you might want to modify it to get the values back to floats.
for ind, row in dd.iterrows():
    dd.loc[ind, 'lat'] = dd.loc[ind, 'Location'].split('\n')[2].split(',')[0][1:]
    dd.loc[ind, 'lon'] = dd.loc[ind, 'Location'].split('\n')[2].split(',')[1][1:-1]
PS: iterrows() is slow.
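If speed matters, one vectorized alternative (a sketch, assuming every Location ends with a "(lat, lon)" pair as in the sample data) is to pull both numbers out with a single regular expression:
dd[['lat', 'lon']] = (dd['Location']
                      .str.extract(r'\(([-\d.]+),\s*([-\d.]+)\)')  # capture the two numbers inside the parentheses
                      .astype(float))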
Related
I am using the code below to identify the US county. The data is taken from Yelp, which provides lat/lon coordinates.
id  latitude   longitude
1   40.017544  -105.283348
2   45.588906  -122.593331
import pandas
df = pandas.read_json("/Users/yelp/yelp_academic_dataset_business.json", lines=True, encoding='utf-8')
# Identify county
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="http")
df['county'] = geolocator.reverse(df['latitude'],df['longitude'])
The error was "TypeError: reverse() takes 2 positional arguments but 3 were given".
Nominatim.reverse takes a single coordinate pair; the issue is that you are passing it pandas DataFrame columns. df['latitude'] here refers to the entire column in your data, not just one value, and since geopy is independent of pandas, it doesn't support processing an entire column and instead just sees that the input isn't a valid coordinate.
Instead, try looping through the rows:
county = []
for row in range(len(df)):
    county.append(geolocator.reverse((df['latitude'][row], df['longitude'][row])))
(Note the double brackets.)
Then, insert the column into the dataframe:
df.insert(index, 'county', county, True)
(index should be the column position where you want county inserted, and the boolean at the end (allow_duplicates) permits inserting the column even if one with that name already exists.)
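Equivalently, you could broadcast the per-row call with apply instead of an explicit loop (just a sketch, reusing geolocator from above):
df['county'] = df.apply(
    lambda r: geolocator.reverse((r['latitude'], r['longitude'])),
    axis=1)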
I have a dataframe that I convert to a pivot table, perform some imputation for missing data on, and then convert back to the original form. The code I have appears to work, in that it does not produce errors, but the output does not have the expected number of rows. I suspect the problem is something to do with how I specify the melting/stacking, but I don't quite know what. I would be very grateful if someone was able to provide some help/support. Code and further info are below.
Thankyou in advance to anyone who helps.
The initial dataframe (data) contains 4 columns (geocode/country, variablename, year and value). There are 290,038 rows x 4 columns.
I convert data into the following form (country-year pairs in each row, with each column being a variable) using the following code:
data_temp = data.copy()
data_temp_grouped = pd.pivot_table(data_temp, index=(['geocode','year']),columns="variablename",values="value")
After performing some operations/imputation, I want to convert data_temp_grouped back to the original form of data. I have tried a few different methods, but the code does not produce the expected number of rows (290,038).
This produces 4 columns but 827,929 rows.
data_temp_grouped2 = data_temp_grouped.copy()
data_temp_grouped3 = data_temp_grouped2.stack(0).reset_index(name='value')
This produces 1,115,712 rows x 4 columns
data_temp_grouped4 = data_temp_grouped2.copy()
data_temp_grouped4 = data_temp_grouped4.reset_index()
data_temp_grouped4 = pd.melt(data_temp_grouped4, id_vars=["geocode","year"])
data_temp_grouped4
TLDR: I failed to account for "missing" data in wide format that was "added" to long format.
I just realized why I was having these problems. In the initial long format, there were ~290,000 rows. When converted into wide format, there are 7,748 rows x 144 columns. When this is squished back into long format, there are a total of 1,115,712 rows (7,748 x 144). This increase comes from the fact that missing data (country-year pairs for certain variables) was not present in the initial data and only "emerged" during the conversion to wide format. Reconverting it again from long to wide, the dimensions match: 7,748 x 144, as expected.
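A tiny, self-contained illustration of that inflation (hypothetical two-country, two-variable data; three long rows come back as four because the missing pair appears as NaN in the wide table):
import pandas as pd

long_df = pd.DataFrame({"geocode": ["A", "A", "B"],
                        "variablename": ["x", "y", "x"],
                        "value": [1.0, 2.0, 3.0]})
# Wide format: 2 geocodes x 2 variables, with NaN for the missing (B, y) pair
wide = pd.pivot_table(long_df, index="geocode", columns="variablename", values="value")
# Melting back yields 2 x 2 = 4 rows, one more than the original 3
back = pd.melt(wide.reset_index(), id_vars="geocode",
               var_name="variablename", value_name="value")
print(len(long_df), len(back))  # 3 vs 4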
For anyone else who might encounter the same problem, I've included my code below.
# grouping country year pairs
data_temp = data.copy()
# converts into multi indexed wide format (country year pairs)
data_temp_grouped = pd.pivot_table(data_temp, index=(['geocode','year']),columns="variablename",values="value")
# linearly interpolates the data for each country year pair
data_temp_grouped=data_temp_grouped.groupby("geocode").apply(lambda x : x.interpolate(method="linear",limit_direction="both"))
# Make a copy of the dataframe
data_temp_grouped2 = data_temp_grouped.copy()
# reset the index
data_temp_grouped2=data_temp_grouped2.reset_index()
data_temp_grouped2_melted=pd.melt(data_temp_grouped2,id_vars=['geocode',"year"],var_name='variablename', value_name='value')
data_temp_grouped2_melted
# to double check and convert back to multi index wide format
data_temp_grouped_check = pd.pivot_table(data_temp_grouped2_melted,index=(['geocode','year']),columns="variablename",values="value")
I am trying to extract the location codes / product codes from a SQL table using pandas. The field is an array type, i.e. it has multiple values as a list within each row. I need to extract the numeric product/location codes from those strings.
Here is a sample of the table
df.head()
Target_Type Constraints
45 ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1
45 ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1
45 ti_8894,trad_8894_0.2
Now I want to extract the numeric values of the codes. I also want to ignore the trailing float values after the 2nd underscore in the entries, i.e. ignore the _1, _0.2, etc.
Here is a sample of the output I want to achieve. It should be a unique list/df column of all the extracted values -
Target_Type_45_df.head()
Constraints
8188
9258
22420
8894
I have never worked with a nested/array-type column before. Any help would be appreciated.
You can use explode to bring each variable into a single cell, under one column:
df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
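If you then need the unique numeric codes the question asks for, one possible follow-up (a sketch, assuming the split above succeeded for every element) is:
unique_codes = df['newConst'].astype(int).drop_duplicates().reset_index(drop=True)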
I would think the following overall strategy would work well (you'll need to debug):
Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
In this function, set my_list = row['Constraints'].
Then do my_list = my_list.split(','). Now you have a list, with no commas.
Next, split with the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
Finally, convert to set: return set(numbers)
The output for each row will be a set - just union all these sets together to get the final result.
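A minimal, self-contained sketch of that strategy, assuming Constraints holds comma-separated strings as in the sample (if it is already a list per row, skip the split on commas):
import pandas as pd

df = pd.DataFrame({
    "Target_Type": [45, 45, 45],
    "Constraints": [
        "ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1",
        "ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1",
        "ti_8894,trad_8894_0.2",
    ],
})

def extract_codes(row):
    # Split the cell on commas, then keep the piece right after the first underscore
    my_list = row["Constraints"].split(",")
    numbers = [int(element.split("_")[1]) for element in my_list]
    return set(numbers)

# Union the per-row sets into one unique, sorted list of codes
sets = df.apply(extract_codes, axis=1)
unique_codes = sorted(set().union(*sets))
print(unique_codes)  # [8188, 8894, 9258, 22420]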
I have a dataframe (df3)
df3 = pd.DataFrame({
    'Origin': ['DEL','BOM','AMD'],
    'Destination': ['BOM','AMD','DEL']})
comprising travel data that contains Origin/Destination columns, and I'm trying to map latitude and longitude for the origin and destination airports using 3-letter city codes (df_s3).
df_s3 = pd.DataFrame({
    'iata_code': ['AMD','BOM','DEL'],
    'Lat': ['72.6346969603999','72.8678970337','77.103104'],
    'Lon': ['23.0771999359','19.0886993408','28.5665']})
I've tried mapping them one at a time, i.e.
df4=pd.merge(left=df3,right=df_s3,how='left',left_on=['Origin'],right_on=['iata_code'],suffixes=['_origin','_origin'])
df5=pd.merge(left=df4,right=df_s3,how='left',left_on=['Destination'],right_on=['iata_code'],suffixes=['_destination','_destination'])
This updates the values in the dataframe, but the columns corresponding to the origin lat/lon have '_destination' as their suffix.
I've even taken an aspirational long shot by combining the two, i.e.
df4=pd.merge(left=df3,right=df_s3,how='left',left_on=['Origin','Destination'],right_on=['iata_code','iata_code'],suffixes=['_origin','_destination'])
Both of these don't seem to be working. Any suggestions on how to make this work on a larger dataset while keeping processing time low?
Your solution was almost correct. But you need to specify the suffixes in the second merge:
df4 = pd.merge(left=df3,
               right=df_s3, how='left',
               left_on=['Origin'],
               right_on=['iata_code'])
df5 = pd.merge(left=df4,
               right=df_s3, how='left',
               left_on=['Destination'],
               right_on=['iata_code'],
               suffixes=['_origin', '_destination'])
In the first merge you don't need to specify any suffixes, as there is no column overlap. In the second merge you need to specify a suffix for both the left and the right side: the left side carries the latitude and longitude of the origin (from the first merge), and the right side brings in those of the destination.
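With the sample frames above, the merged result should end up with columns along these lines (a quick sanity check; the exact order may differ):
print(df5.columns.tolist())
# ['Origin', 'Destination', 'iata_code_origin', 'Lat_origin', 'Lon_origin',
#  'iata_code_destination', 'Lat_destination', 'Lon_destination']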
You can also apply a function like this one to each of the two columns:
def from_place_to_coord(place: str):
    if place in df_s3['iata_code'].to_list():
        Lat = df_s3[df_s3['iata_code'] == place]['Lat'].values[0]
        Lon = df_s3[df_s3['iata_code'] == place]['Lon'].values[0]
        return Lat, Lon
    else:
        print('Not found')
and then:
df3['origin_loc'] = df3['Origin'].apply(from_place_to_coord)
df3['destination_loc'] = df3['Destination'].apply(from_place_to_coord)
It will give you two more columns, each holding a (Lat, Lon) tuple for the corresponding location.
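If you prefer separate columns rather than tuples, one possible follow-up (a sketch; the origin_lat/origin_lon names are made up here, and it assumes every code was found so no row holds None):
df3[['origin_lat', 'origin_lon']] = pd.DataFrame(df3['origin_loc'].tolist(), index=df3.index)
df3[['dest_lat', 'dest_lon']] = pd.DataFrame(df3['destination_loc'].tolist(), index=df3.index)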
I am using GeoPandas and Pandas.
I have a DataFrame, df, with, say, 300,000 rows and 4 columns plus the index column.
   id    lat        lon        geometry
0  2009  40.711174  -73.99682  0
1  536   40.741444  -73.97536  0
2  228   40.754601  -73.97187  0
However, there are only a handful of unique ids (~200).
I want to generate a shapely.geometry.point.Point object for each (lat, lon) combination, similarly to what is shown here: http://nbviewer.ipython.org/gist/kjordahl/7129098
(see cell #5),
where it loops through all rows of the dataframe; but for such a big dataset, I wanted to limit the loop to the much smaller number of unique ids.
Therefore, for a given id value, idvalue (e.g., 2009 from the first row), I want to create the GeoSeries and assign it directly to ALL rows that have id == idvalue.
My code looks like:
for count, iunique in enumerate(df['id'].unique()):
    sc_start = GeoSeries([Point(np.array(df[df['id'] == iunique].lon)[0],
                                np.array(df[df['id'] == iunique].lat)[0])])
    df.loc[iunique, ['geometry']] = sc_start
However, things don't work - the geometry field does not change - and I think it's because the indexes of sc_start don't match the indexes of df.
How can I solve this? Should I just stick to looping through the whole df?
I would take the following approach:
First find the unique id's and create a GeoSeries of Points for this:
unique_ids = df.groupby('id', as_index=False).first()
unique_ids['geometry'] = GeoSeries([Point(x, y) for x, y in zip(unique_ids['lon'], unique_ids['lat'])])
Then merge these geometries with the original dataframe on matching ids:
df = df.merge(unique_ids[['id', 'geometry']], how='left', on='id')
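Putting it together, a minimal runnable sketch of this approach (with a hypothetical four-row frame standing in for the 300,000-row one, and dropping the placeholder geometry column before the merge so no geometry_x/geometry_y suffixes appear):
import pandas as pd
from geopandas import GeoSeries
from shapely.geometry import Point

# Hypothetical miniature version of the frame in the question
df = pd.DataFrame({"id": [2009, 536, 228, 2009],
                   "lat": [40.711174, 40.741444, 40.754601, 40.711174],
                   "lon": [-73.99682, -73.97536, -73.97187, -73.99682],
                   "geometry": [0, 0, 0, 0]})

# One Point per unique id
unique_ids = df.groupby('id', as_index=False).first()
unique_ids['geometry'] = GeoSeries([Point(x, y) for x, y in zip(unique_ids['lon'], unique_ids['lat'])])

# Drop the placeholder column first so the merge does not create geometry_x / geometry_y
df = df.drop(columns='geometry').merge(unique_ids[['id', 'geometry']], how='left', on='id')
print(df.head())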