I have a pandas dataframe that looks like this:
salesperson  company  sales_amount  area
abc          AA       54000         bandung
def          BB       23000         bogor
ghi          BC       54000         bandung
jkl          DE       23000         bogor
abc          IJ       54000         bandung
ghi          KL       23000         bogor
I want to visualize the number of companies per salesperson, based on the area. The thing is, I don't have the coordinates of the areas.
This is the output that I want to plot on a map:
salesperson  area     count
abc          bandung  2
def          bogor    1
ghi          bandung  1
ghi          bogor    1
jkl          bogor    1
Is it possible to plot a geospatial visualization without the coordinate data? Or should I look for the area coordinate data to plot a geospatial visualization?
You need to get the coordinates of these places first, so run the following to append that data to your dataframe:
import pandas as pd
import geopy
from geopy.geocoders import Nominatim
data = {
'salesperson': ['abc', 'def', 'ghi', 'jkl', 'abc', 'ghi'],
'company': ['AA', 'BB', 'BC', 'DE', 'IJ', 'KL'],
'sales_amount': [54000, 23000, 54000, 23000, 54000, 23000],
'area': ['bandung', 'bogor', 'bandung', 'bogor', 'bandung', 'bogor']
}
df = pd.DataFrame(data)
geolocator = Nominatim(user_agent='my_application')
def get_lat_lon(location):
    try:
        # Use the geolocator to obtain the location information
        loc = geolocator.geocode(location)
        # Extract the latitude and longitude
        lat, lon = loc.latitude, loc.longitude
        return lat, lon
    except Exception:
        # Geocoding failed or returned no result
        return None, None
df['latitude'], df['longitude'] = zip(*df['area'].apply(get_lat_lon))
## If you get a SettingWithCopyWarning, replace the above line with:
#df['latitude'] = None
#df['longitude'] = None
#for index, row in df.iterrows():
# lat, lon = get_lat_lon(row['area'])
# df.loc[index, 'latitude'] = lat
# df.loc[index, 'longitude'] = lon
# Print the dataframe with latitude and longitude columns
print(df)
which returns:
salesperson company sales_amount area latitude longitude
0 abc AA 54000 bandung -6.934469 107.604954
1 def BB 23000 bogor -6.596299 106.797242
2 ghi BC 54000 bandung -6.934469 107.604954
3 jkl DE 23000 bogor -6.596299 106.797242
4 abc IJ 54000 bandung -6.934469 107.604954
5 ghi KL 23000 bogor -6.596299 106.797242
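Note that the question's target table is a company count per salesperson and area, which df itself does not contain. A minimal sketch of that aggregation (the column name n_companies is my own choice):
counts = (df.groupby(['salesperson', 'area', 'latitude', 'longitude'], as_index=False)
            .agg(n_companies=('company', 'nunique'),
                 sales_amount=('sales_amount', 'sum')))
print(counts)
If you want the map popups below to show these counts, iterate over counts instead of df and use row['n_companies'].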
and then map:
import folium
from folium.plugins import MarkerCluster
center = [-6.1754, 106.8272] # Jakarta, Indonesia
map_sales = folium.Map(location=center, zoom_start=8)
marker_cluster = MarkerCluster().add_to(map_sales)
for index, row in df.iterrows():
    lat, lon = row['latitude'], row['longitude']
    # Create a popup with the salesperson, the area and the sales amount
    popup_text = (f"Salesperson: {row['salesperson']}<br>"
                  f"Area: {row['area']}<br>"
                  f"Sales amount: {row['sales_amount']}")
    folium.Marker(location=[lat, lon], popup=popup_text).add_to(marker_cluster)
    folium.CircleMarker(location=[lat, lon], radius=row['sales_amount']/10000,
                        fill_color='red', color=None, fill_opacity=0.2).add_to(map_sales)
map_sales
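If you are running this outside a notebook, the map will not display on its own; save it to HTML and open it in a browser (the filename is arbitrary):
map_sales.save('sales_map.html')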
I need help in pandas to group rows based on a specific condition. I have a dataset as follows:
Name Source Country Severity
ABC XYZ USA Low
DEF XYZ England High
ABC XYZ India Medium
EFG XYZ Algeria High
DEF XYZ UK Medium
I want to group these rows on the Name field so that the Country values are joined into a comma-separated string and Severity is set to the highest severity in the group.
After that output table looks like this:
Name Source Country Severity
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
I am able to aggregate the first 3 columns using the code below, but I can't find a solution for merging Severity.
df = df.groupby('Name').agg({'Source': 'first', 'Country': ', '.join})
You should convert your Severity to an ordered Categorical.
This enables you to use a simple max:
df['Severity'] = pd.Categorical(df['Severity'],
                                categories=['Low', 'Medium', 'High'],
                                ordered=True)
out = (df
       .groupby('Name')
       .agg({'Source': 'first',
             'Country': ', '.join,
             'Severity': 'max'})
      )
Output:
Source Country Severity
Name
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
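If you want Name back as a regular column rather than as the index, reset it afterwards (or pass as_index=False to groupby):
out = out.reset_index()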
You can try converting the Severity column to a number, aggregating that number with max, and then converting it back to a word, like so:
import pandas
dataframe = pandas.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'EFG', 'DEF'], 'Source': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'], 'Country': ['USA', 'England', 'India', 'Algeria', 'UK'], 'Severity': ['Low', 'High', 'Medium', 'High', 'Medium']})
severity_to_number = {'Low': 1, 'Medium': 2, 'High': 3}
severity_to_word = {v: k for k, v in severity_to_number.items()}
dataframe['Severity-number'] = dataframe['Severity'].replace(severity_to_number)
dataframe = dataframe.groupby('Name').agg({'Source':'first', 'Country': ', '.join, 'Severity-number':'max' })
dataframe['Severity'] = dataframe['Severity-number'].replace(severity_to_word)
del dataframe['Severity-number']
print(dataframe)
Output
Source Country Severity
Name
ABC XYZ USA, India Medium
DEF XYZ England, UK High
EFG XYZ Algeria High
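A side note: since every severity value has an entry in the mapping, Series.map is an equivalent and slightly clearer alternative to replace here (unmapped values would become NaN, which does not occur in this data):
dataframe['Severity-number'] = dataframe['Severity'].map(severity_to_number)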
I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df

def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country' : ['USA', 'GE', 'Russia', 'BR', 'France'],
'ID' : ['11', '22', '33','44', '55'],
'City' : ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
'short_name' : ['NY', 'Ber', 'Mosc','Lon', 'Pa']
})
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
Now, apply the function for the first time. Move the row with index 0 to the bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply function again. This time move row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when applying the function a second time. The problem is analogous for shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a list, and idx.pop(index_to_shift) removes the item at position index_to_shift, not the item whose value equals index_to_shift. After the first shift, positions and index labels no longer coincide, so the second call removes the wrong label.
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[idx + [index_to_shift]]

# call the function twice
for i in range(2):
    df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
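The same label-based pattern covers the shift-to-top case the question also asks about (a sketch following the same idea):
def shift_row_to_top(df, index_to_shift):
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]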
I've tried to convert the output of the flightradar24 API, which gives a list of airlines, into a pandas dataframe, without success. This is what I've done:
import flightradar24
import pandas as pd
fr = flightradar24.Api()
airlines = fr.get_airlines()
items = airlines.items()
list_items = list(items)
df = pd.DataFrame(list_items)
print(airlines)
print(df.head())
And this was the result:
0 1
0 version 1594656446
1 rows [{'Name': '21 Air', 'Code': '2I', 'ICAO': 'CSB...
That being said, could you please help me convert the flightradar24 API output into a pandas dataframe?
Thanks in advance.
This should do it:
fr = flightradar24.Api()
airlines = fr.get_airlines()
df = pd.json_normalize(airlines['rows'])
print(df)
Name Code ICAO
0 21 Air 2I CSB
1 40-Mile Air Q5 MLA
2 9 Air AQ JYH
3 ABX Air GB ABX
4 ACE Belgium Freighters X7 FRH
... ... ... ...
1337 Zambian Airways Q3 MBN
1338 Zanair B4 TAN
1339 Zimex Aviation XM IMX
1340 ZIPAIR ZG TZP
1341 Zorex ORZ
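pd.json_normalize works here because get_airlines() returns a dict of the form {'version': ..., 'rows': [...]}. Judging from the printout above, each row is already a flat dict, so under that assumption passing the list straight to the DataFrame constructor gives the same result:
df = pd.DataFrame(airlines['rows'])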
I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6,],
'YY': [97,78,47,110,67,88],
'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations: group by ID, extract the size and the mean of the column ZZ, and put the results in a new df.
New df that looks like this
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of individual groups
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when transferring the results to the new table: it does not cover all the cities, and the results must be matched via the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris":df_grouped.loc[2],
"Madrid":df_grouped.loc[4],
"Warsaw":df_grouped.loc[6],
"Berlin":df_grouped.loc[8],
"London":df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 df's from which I have to extract this data and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with it using groupby and the dictionary or knows a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
4: "Madrid",
6: "Warsaw",
8: "Berlin",
10: "London"}
Once this is done, the processing is as simple as a groupby (numpy is needed for np.mean):
import numpy as np

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
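A loop-free alternative, assuming the same setup (df2 indexed on 'Cities' and the reversed dic_cities above), is to aggregate once, relabel the index, and update df2 in place:
agg = df.groupby('ID')['ZZ'].agg(length='size', mean='mean')
agg.index = agg.index.map(dic_cities)  # relabel IDs as city names
df2.update(agg)                        # Berlin and London keep their 0 defaults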
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN
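The desired table has zeros rather than NaN for Berlin and London, so fill the gaps afterwards:
df2 = df2.fillna({'length': 0, 'mean': 0})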
I want to get the latitude of ~100k entries in a pandas dataframe. Since I can query geopy only with a one-second delay, I want to make sure I do not query duplicates (most should be duplicates, since there are not that many cities).
import time
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="xxx")
df['loc'] = 0
for x in range(1, len(df)):
    for y in range(1, x):
        if df['Location'][y] == df['Location'][x]:
            df['lat'][x] = df['lat'][y]
        else:
            location = geolocator.geocode(df['Location'][x])
            time.sleep(1.2)
            df.at[x, 'lat'] = location.latitude
The idea is to check whether the location has already been seen, and query geopy only if it has not. Somehow it is painfully slow and does not seem to be doing what I intended. Any help or tip is appreciated.
Prepare the initial dataframe:
import pandas as pd
df = pd.DataFrame({
'some_meta': [1, 2, 3, 4],
'city': ['london', 'paris', 'London', 'moscow'],
})
df['city_lower'] = df['city'].str.lower()
df
Out[1]:
some_meta city city_lower
0 1 london london
1 2 paris paris
2 3 London london
3 4 moscow moscow
Create a new DataFrame with unique cities:
df_uniq_cities = df['city_lower'].drop_duplicates().to_frame()
df_uniq_cities
Out[2]:
city_lower
0 london
1 paris
3 moscow
Run geopy's geocode on that new DataFrame:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df_uniq_cities['location'] = df_uniq_cities['city_lower'].apply(geocode)
# Or, instead, do this to get a nice progress bar:
# from tqdm import tqdm
# tqdm.pandas()
# df_uniq_cities['location'] = df_uniq_cities['city_lower'].progress_apply(geocode)
df_uniq_cities
Out[3]:
city_lower location
0 london (London, Greater London, England, SW1A 2DU, UK...
1 paris (Paris, Île-de-France, France métropolitaine, ...
3 moscow (Москва, Центральный административный округ, М...
Merge the initial DataFrame with the new one:
df_final = pd.merge(df, df_uniq_cities, on='city_lower', how='left')
df_final['lat'] = df_final['location'].apply(lambda location: location.latitude if location is not None else None)
df_final['long'] = df_final['location'].apply(lambda location: location.longitude if location is not None else None)
df_final
Out[4]:
some_meta city city_lower location lat long
0 1 london london (London, Greater London, England, SW1A 2DU, UK... 51.507322 -0.127647
1 2 paris paris (Paris, Île-de-France, France métropolitaine, ... 48.856610 2.351499
2 3 London london (London, Greater London, England, SW1A 2DU, UK... 51.507322 -0.127647
3 4 moscow moscow (Москва, Центральный административный округ, М... 55.750446 37.617494
The key to resolving your issue with timeouts is the geopy's RateLimiter class. Check out the docs for more details: https://geopy.readthedocs.io/en/1.18.1/#usage-with-pandas
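Since Nominatim allows roughly one request per second, even deduplicated lookups take a while when there are many unique cities. One option (a sketch; the filename is arbitrary) is to persist the geocoded frame and reuse it across runs:
df_uniq_cities.to_pickle('geocoded_cities.pkl')
# later, in another run:
# df_uniq_cities = pd.read_pickle('geocoded_cities.pkl')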
Imports (see the geopy documentation for how to instantiate the Nominatim geocoder):
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here") # specify your application name
Generate some data with locations
d = ['New York, NY', 'Seattle, WA', 'Philadelphia, PA',
'Richardson, TX', 'Plano, TX', 'Wylie, TX',
'Waxahachie, TX', 'Washington, DC']
df = pd.DataFrame(d, columns=['Location'])
print(df)
Location
0 New York, NY
1 Seattle, WA
2 Philadelphia, PA
3 Richardson, TX
4 Plano, TX
5 Wylie, TX
6 Waxahachie, TX
7 Washington, DC
Use a dict to geocode only the unique Locations, per this SO post, and extract both parameters simultaneously:
first, get lat and lon in the same step (as tuples in a single column of the DataFrame)
second, split the column of tuples into separate columns
locations = df['Location'].unique()
# Create dict of geocodings; timeout=10 gives the service more time to respond
d = dict(zip(locations,
             pd.Series(locations)
             .apply(lambda loc: geolocator.geocode(loc, timeout=10))
             .apply(lambda x: (x.latitude, x.longitude))  # get tuple of latitude and longitude
             ))
# Map dict to `Location` column
df['city_coord'] = df['Location'].map(d)
# Split single column of tuples into multiple (2) columns
df[['lat','lon']] = pd.DataFrame(df['city_coord'].tolist(), index=df.index)
print(df)
Location city_coord lat lon
0 New York, NY (40.7308619, -73.9871558) 40.730862 -73.987156
1 Seattle, WA (47.6038321, -122.3300624) 47.603832 -122.330062
2 Philadelphia, PA (39.9524152, -75.1635755) 39.952415 -75.163575
3 Richardson, TX (32.9481789, -96.7297206) 32.948179 -96.729721
4 Plano, TX (33.0136764, -96.6925096) 33.013676 -96.692510
5 Wylie, TX (33.0151201, -96.5388789) 33.015120 -96.538879
6 Waxahachie, TX (32.3865312, -96.8483311) 32.386531 -96.848331
7 Washington, DC (38.8950092, -77.0365625) 38.895009 -77.036563
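Note that the dict-building step assumes every location geocodes successfully; a None result would raise AttributeError on x.latitude. A defensive variant of the same idea:
d = {}
for loc in locations:
    res = geolocator.geocode(loc, timeout=10)
    d[loc] = (res.latitude, res.longitude) if res else (None, None)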