Scenario
If column1 = 'Value' then column2 = 'AAA'
How can we use Faker to generate mock data for these dependent columns? We need to consider both positive and negative data.
You can use Faker's built-in data like this:
import pandas as pd
from faker.providers import date_time
df = (pd.DataFrame(date_time.Provider.countries, columns=['name', 'alpha-2-code'])
        .rename(columns={'name': 'country', 'alpha-2-code': 'country_code'})
        .sample(n=1000, replace=True, ignore_index=True, random_state=2022))
Output:
>>> df
country country_code
0 Rwanda RW
1 Grenada GD
2 Oman OM
3 Moldova MD
4 Saint Vincent and the Grenadines VC
.. ... ...
995 Iceland IS
996 Seychelles SC
997 Israel IL
998 Equatorial Guinea GQ
999 Republic of Ireland IE
[1000 rows x 2 columns]
Or use pycountry.
Related
I have a list of dictionaries that also consist of lists and would like to create a dataframe using this list. For example, the data looks like this:
lst = [{'France': [[12548, 'ABC'], [45681, 'DFG'], [45684, 'HJK']]},
       {'USA': [[84921, 'HJK'], [28917, 'KLESA']]},
       {'Japan': [[38292, 'ASF'], [48902, 'DSJ']]}]
And this is the dataframe I'm trying to create:
Country Amount Code
France 12548 ABC
France 45681 DFG
France 45684 HJK
USA 84921 HJK
USA 28917 KLESA
Japan 38292 ASF
Japan 48902 DSJ
As you can see, the keys became column values of the country column and the numbers and the strings became the amount and code columns. I thought I could use something like the following, but it's not working.
df = pd.DataFrame(lst)
You probably need to transform the data into a format that Pandas can read.
Original data
data = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
Transforming the data
new_data = []
for country_data in data:
    for country, values in country_data.items():
        new_data += [{"Country": country, "Amount": amt, "Code": code} for amt, code in values]
Create the dataframe
df = pd.DataFrame(new_data)
Output
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
df = pd.concat([pd.DataFrame(elem) for elem in lst])
df = df.apply(lambda x: pd.Series(x.dropna().values)).stack()
df = df.reset_index(level=[0], drop=True).to_frame(name = 'vals')
df = pd.DataFrame(df["vals"].to_list(),index= df.index, columns=['Amount', 'Code']).sort_index()
print(df)
output:
Amount Code
France 12548 ABC
USA 84921 HJK
Japan 38292 ASF
France 45681 DFG
USA 28917 KLESA
Japan 48902 DSJ
France 45684 HJK
Use a nested list comprehension to flatten the data and pass the result to the DataFrame constructor:
lst = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
L = [(country, *x) for country_data in lst
                   for country, values in country_data.items()
                   for x in values]
df = pd.DataFrame(L, columns=['Country','Amount','Code'])
print (df)
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
Build a new dictionary that combines the individual dicts into one, before concatenating the dataframes:
new_dict = {}
for ent in lst:
    for key, value in ent.items():
        new_dict[key] = pd.DataFrame(value, columns=['Amount', 'Code'])
pd.concat(new_dict, names=['Country']).droplevel(1).reset_index()
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
The table looks like this:
id  ADDRESS
0   6101 SUMMITVIEW AVE STE 200 YAKIMA
1   527 CEDAR WAY SUITE 105 OAKMONT
2   1700 N ROSE AVE SUITE 460 OXNARD
3   1275 YORK AVE NEW YORK
4   2300 MANCHESTER EXPY A SUITE 101 A COLUMBUS
5   401 N MICHIGAN AVE CHICAGO
6   111 GROSSMAN DR INTERNAL MEDICINE BRAINTREE
7   1850 N CENTRAL AVE STE 1600 PHOENIX
8   47 NEW SCOTLAND AVENUE ALBANY MEDICAL CENTER A...
9   201 N VINE ST EL DORADO
10  4420 LAKE BOONE TRL RALEIGH
11  2727 W HOLCOMBE BLVD HOUSTON
12  850 PETER BRYCE BLVD TUSCALOOSA
13  1803 WEHRLI RD NAPERVILLE
14  4321 N MACDILL AVE STE 203 TAMPA
15  111 CONTINENTAL DR SUITE 412 NEWARK
16  1834 E INNOVATION PARK DR ORO VALLEY
17  880 KEMPSVILLE RD SUITE 2200 NORFOLK
18  701 PRINCETON AVE SW BIRMINGHAM
19  4729 COUNTY ROAD 101 MINNETONKA
import pandas as pd
import geopandas as gpd
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import matplotlib.pyplot as plt
import folium
from folium.plugins import FastMarkerCluster
locator = Nominatim(user_agent="myGeocoder")
geocode = RateLimiter(locator.geocode, min_delay_seconds=0.0, error_wait_seconds=1.0, swallow_exceptions=True, return_value_on_exception=None)
apprix_1_na['location'] = apprix_1_na['ADDRESS'].apply(geocode)
apprix_1_na['point'] = apprix_1_na['location'].apply(lambda loc: tuple(loc.point) if loc else None)
I want this code to work in PySpark to get the longitude and latitude.
I'll show a more "complex" example using the GoogleV3 API. It is easy to adapt to your case.
from geopy.geocoders import GoogleV3
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType, ArrayType
df = spark.createDataFrame([("123 Fake St, Springfield, 12345, USA",),("1000 N West Street, Suite 1200 Wilmington, DE 19801, USA",)], ["address"])
df.display()
address
123 Fake St, Springfield, 12345, USA
1000 N West Street, Suite 1200 Wilmington, DE 19801, USA
@udf(returnType=ArrayType(FloatType()))
def geoloc(address):
    api = 'your_api_key_here'
    geolocator = GoogleV3(api)
    # get lat_long
    return geolocator.geocode(address)[1]
# find coord
df = df.withColumn('geocode', geoloc(col('address')))
# separate tuple
df = df.withColumn("latitude", col('geocode').getItem(0))\
       .withColumn("longitude", col('geocode').getItem(1))
df.display()
address                                                    geocode                    latitude   longitude
123 Fake St, Springfield, 12345, USA                       [44.046238, -123.022026]   44.046238  -123.022026
1000 N West Street, Suite 1200 Wilmington, DE 19801, USA   [39.74717, -75.54999]      39.74717   -75.54999
import pandas as pd
dane= pd.read_csv('WHO-COVID-19-global-data _2.csv')
dane
dane.groupby('Country')[['Cumulative_cases']].sum()
KeyError: 'Country'
I don't know why this code doesn't run.
The column names in dane have leading spaces.
Remove them with the following line:
dane.rename(columns=lambda x: x.strip(), inplace=True)
dane.groupby('Country')[['Cumulative_cases']].sum()
Cumulative_cases
Country
Afghanistan 5702767
Albania 1300156
Algeria 5561691
American Samoa 0
Andorra 273756
... ...
Wallis and Futuna 14
Yemen 256353
Zambia 1323403
Zimbabwe 692447
occupied Palestinian territory, including east ... 4057017
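To see the problem and the fix in isolation, here is a small sketch. The CSV text is a made-up stand-in for the WHO file, with the same kind of leading spaces in the header. Printing the column names as a list makes the stray spaces visible, and `columns.str.strip()` is an equivalent way to remove them.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV: note the space after each comma in the header.
csv_text = (
    "Date_reported, Country, Cumulative_cases\n"
    "2020-01-03,Afghanistan,0\n"
    "2020-01-04,Afghanistan,2\n"
    "2020-01-03,Albania,1\n"
)

dane = pd.read_csv(io.StringIO(csv_text))

# The list repr shows the quotes, so ' Country' stands out from 'Country'.
raw_columns = dane.columns.tolist()
print(raw_columns)

# Strip whitespace from every column name, then group as intended.
dane.columns = dane.columns.str.strip()
totals = dane.groupby("Country")[["Cumulative_cases"]].sum()
print(totals)
```

Checking `dane.columns.tolist()` first is a quick way to diagnose a `KeyError` on a column that appears to exist.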
I've tried to convert the output of the flightradar24 API, which returns a list of airlines, into a pandas DataFrame, without success. This is what I've done:
import flightradar24
import pandas as pd
fr = flightradar24.Api()
airlines = fr.get_airlines()
items = airlines.items()
list_items = list(items)
df = pd.DataFrame(list_items)
print(airlines)
print(df.head())
And this was the result:
0 1
0 version 1594656446
1 rows [{'Name': '21 Air', 'Code': '2I', 'ICAO': 'CSB...
That being said, could you please help me convert flightradar24 api into a pandas dataframe?
Thanks in advance.
This should do it:
fr = flightradar24.Api()
airlines = fr.get_airlines()
df = pd.json_normalize(airlines['rows'])
print(df)
Name Code ICAO
0 21 Air 2I CSB
1 40-Mile Air Q5 MLA
2 9 Air AQ JYH
3 ABX Air GB ABX
4 ACE Belgium Freighters X7 FRH
... ... ... ...
1337 Zambian Airways Q3 MBN
1338 Zanair B4 TAN
1339 Zimex Aviation XM IMX
1340 ZIPAIR ZG TZP
1341 Zorex ORZ
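To illustrate why the first attempt produced only two rows, here is a sketch that works without the API. The `airlines` dict is a hand-written stand-in with the same shape as the printed result above (a `version` key and a `rows` list of records), and the API response is assumed to look like this.

```python
import pandas as pd

# Stand-in for fr.get_airlines(): a dict with metadata plus a list of records.
airlines = {
    "version": 1594656446,
    "rows": [
        {"Name": "21 Air", "Code": "2I", "ICAO": "CSB"},
        {"Name": "40-Mile Air", "Code": "Q5", "ICAO": "MLA"},
    ],
}

# Passing the dict items gives one row per top-level key: 'version' and 'rows'.
wrong = pd.DataFrame(list(airlines.items()))
print(wrong.shape)

# Normalizing only the list under 'rows' gives one row per airline.
df = pd.json_normalize(airlines["rows"])
print(df)
```

For flat records like these, `pd.DataFrame(airlines["rows"])` should give the same frame. `json_normalize` mainly helps when the records contain nested dicts.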
I want to get the latitude of ~100k entries in a pandas DataFrame. Since I can query geopy only once per second, I want to make sure I do not query duplicates (most entries should be duplicates, since there are not that many cities).
import time
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="xxx")
df['loc']=0
for x in range(1, len(df)):
    for y in range(1, x):
        if df['Location'][y] == df['Location'][x]:
            df['lat'][x] = df['lat'][y]
        else:
            location = geolocator.geocode(df['Location'][x])
            time.sleep(1.2)
            df.at[x, 'lat'] = location.latitude
The idea is to check if the location is already in the list, and only if not query geopy. Somehow it is painfully slow and seems not to be doing what I intended. Any help or tip is appreciated.
Prepare the initial dataframe:
import pandas as pd
df = pd.DataFrame({
    'some_meta': [1, 2, 3, 4],
    'city': ['london', 'paris', 'London', 'moscow'],
})
df['city_lower'] = df['city'].str.lower()
df
Out[1]:
some_meta city city_lower
0 1 london london
1 2 paris paris
2 3 London london
3 4 moscow moscow
Create a new DataFrame with unique cities:
df_uniq_cities = df['city_lower'].drop_duplicates().to_frame()
df_uniq_cities
Out[2]:
city_lower
0 london
1 paris
3 moscow
Run geopy's geocode on that new DataFrame:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df_uniq_cities['location'] = df_uniq_cities['city_lower'].apply(geocode)
# Or, instead, do this to get a nice progress bar:
# from tqdm import tqdm
# tqdm.pandas()
# df_uniq_cities['location'] = df_uniq_cities['city_lower'].progress_apply(geocode)
df_uniq_cities
Out[3]:
city_lower location
0 london (London, Greater London, England, SW1A 2DU, UK...
1 paris (Paris, Île-de-France, France métropolitaine, ...
3 moscow (Москва, Центральный административный округ, М...
Merge the initial DataFrame with the new one:
df_final = pd.merge(df, df_uniq_cities, on='city_lower', how='left')
df_final['lat'] = df_final['location'].apply(lambda location: location.latitude if location is not None else None)
df_final['long'] = df_final['location'].apply(lambda location: location.longitude if location is not None else None)
df_final
Out[4]:
some_meta city city_lower location lat long
0 1 london london (London, Greater London, England, SW1A 2DU, UK... 51.507322 -0.127647
1 2 paris paris (Paris, Île-de-France, France métropolitaine, ... 48.856610 2.351499
2 3 London london (London, Greater London, England, SW1A 2DU, UK... 51.507322 -0.127647
3 4 moscow moscow (Москва, Центральный административный округ, М... 55.750446 37.617494
The key to resolving your issue with timeouts is geopy's RateLimiter class. Check out the docs for more details: https://geopy.readthedocs.io/en/1.18.1/#usage-with-pandas
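The deduplication idea can also be checked offline. The sketch below swaps the real geocoder for a stub, `stub_geocode`, which is a hypothetical helper that looks up coordinates copied from the output above and records every call. This makes it easy to confirm that each distinct city is queried only once, however many rows there are.

```python
import pandas as pd

# Coordinates copied from the output above; the stub replaces the network call.
KNOWN = {
    "london": (51.507322, -0.127647),
    "paris": (48.856610, 2.351499),
    "moscow": (55.750446, 37.617494),
}
calls = []  # every query sent to the "geocoder"


def stub_geocode(query):
    """Hypothetical stand-in for geolocator.geocode; returns (lat, long) or None."""
    calls.append(query)
    return KNOWN.get(query)


df = pd.DataFrame({"city": ["london", "paris", "London", "moscow", "Paris"]})
df["city_lower"] = df["city"].str.lower()

# Geocode each distinct city exactly once and keep the results in a dict.
lookup = {city: stub_geocode(city) for city in df["city_lower"].unique()}

# Fan the cached results back out to every row.
df["lat"] = df["city_lower"].map(lambda c: lookup[c][0] if lookup[c] else None)
df["long"] = df["city_lower"].map(lambda c: lookup[c][1] if lookup[c] else None)

print(len(calls))  # number of geocoder calls, not number of rows
print(df)
```

With a real geocoder, the stub would be replaced by the rate-limited `geocode` callable, and the dict lookup would stay the same.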
Imports
See the geopy documentation for how to instantiate the Nominatim geocoder.
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here") # specify your application name
Generate some data with locations
d = ['New York, NY', 'Seattle, WA', 'Philadelphia, PA',
'Richardson, TX', 'Plano, TX', 'Wylie, TX',
'Waxahachie, TX', 'Washington, DC']
df = pd.DataFrame(d, columns=['Location'])
print(df)
Location
0 New York, NY
1 Seattle, WA
2 Philadelphia, PA
3 Richardson, TX
4 Plano, TX
5 Wylie, TX
6 Waxahachie, TX
7 Washington, DC
Use a dict to geocode only the unique Locations, per this SO post.
Extract all parameters simultaneously:
first, get lat and lon in the same step (as tuples in a single column of the DataFrame)
second, split the column of tuples into separate columns
locations = df['Location'].unique()
# Create dict of geocodings
d = dict(zip(
    locations,
    pd.Series(locations)
      # assuming the 10 was meant as the timeout in seconds; passed by keyword
      .apply(geolocator.geocode, timeout=10)
      .apply(lambda x: (x.latitude, x.longitude))  # get tuple of latitude and longitude
))
# Map dict to `Location` column
df['city_coord'] = df['Location'].map(d)
# Split single column of tuples into multiple (2) columns
df[['lat','lon']] = pd.DataFrame(df['city_coord'].tolist(), index=df.index)
print(df)
Location city_coord lat lon
0 New York, NY (40.7308619, -73.9871558) 40.730862 -73.987156
1 Seattle, WA (47.6038321, -122.3300624) 47.603832 -122.330062
2 Philadelphia, PA (39.9524152, -75.1635755) 39.952415 -75.163575
3 Richardson, TX (32.9481789, -96.7297206) 32.948179 -96.729721
4 Plano, TX (33.0136764, -96.6925096) 33.013676 -96.692510
5 Wylie, TX (33.0151201, -96.5388789) 33.015120 -96.538879
6 Waxahachie, TX (32.3865312, -96.8483311) 32.386531 -96.848331
7 Washington, DC (38.8950092, -77.0365625) 38.895009 -77.036563