How to get street name from osm.pbf file in OpenStreetMap - python

You can download any dataset from here https://download.geofabrik.de/australia-oceania.html
Here's my code
import osmium as osm
import pandas as pd

class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            self.osm_data.append([elem_type,
                                  elem.id,
                                  elem.version,
                                  elem.visible,
                                  pd.Timestamp(elem.timestamp),
                                  elem.uid,
                                  elem.user,
                                  elem.changeset,
                                  len(elem.tags),
                                  tag.k,
                                  tag.v])

    def node(self, n):
        self.tag_inventory(n, "node")

    def way(self, w):
        self.tag_inventory(w, "way")

    def relation(self, r):
        self.tag_inventory(r, "relation")

osmhandler = OSMHandler()
# scan the input file and fill the handler list accordingly
osmhandler.apply_file("/DATA/user/nabih/pitcairn-islands-latest.osm.pbf")

# transform the list into a pandas DataFrame
data_colnames = ['type', 'id', 'version', 'visible', 'ts', 'uid',
                 'user', 'chgset', 'ntags', 'tagkey', 'tagvalue']
df_osm = pd.DataFrame(osmhandler.osm_data, columns=data_colnames)
Here's the df_osm

Street names are the values of the name key on highway elements (see https://wiki.openstreetmap.org/wiki/Map_features#Highway for all possible highway types; you may want to filter further in the query). You can then self-join all highway rows with their name rows on id:
df_osm.loc[df_osm.tagkey=='highway', ['id', 'tagvalue']].merge(
    df_osm.loc[df_osm.tagkey=='name', ['id', 'tagvalue']],
    on='id', suffixes=['_kind', '_name'])
Result for pitcairn-islands-latest.osm.pbf:
           id tagvalue_kind            tagvalue_name
0  1034153953   residential                Main Road
1  1034161481   residential  Hill of Difficulty Road
If you also want to include other name variants such as localized names (name:en, name:fr, ...), you can replace df_osm.tagkey=='name' with df_osm.tagkey.str.startswith('name'). See https://wiki.openstreetmap.org/wiki/Key:name for details and other possible names.
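As a concrete illustration (a sketch building on the merge above, not part of the original answer), keeping tagkey in the right-hand frame also shows which name variant each row came from:

# looser filter: any key starting with "name" (name, name:en, name:fr, ...)
df_osm.loc[df_osm.tagkey == 'highway', ['id', 'tagvalue']].merge(
    df_osm.loc[df_osm.tagkey.str.startswith('name'), ['id', 'tagkey', 'tagvalue']],
    on='id', suffixes=['_kind', '_name'])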

Related

Using regular expression to remove commonly used company suffixes from a list of companies

I have the following code that I use to generate a list of common company suffixes:
import re
from cleanco import typesources
import string

def generate_common_suffixes():
    unique_items = []
    company_suffixes_raw = typesources()
    for item in company_suffixes_raw:
        for i in item:
            if i.lower() not in unique_items:
                unique_items.append(i.lower())
    unique_items.extend(['holding'])
    return unique_items
I'm then trying to use the following code to remove those suffixes from a list of company names
company_name = ['SAMSUNG ÊLECTRONICS Holding, LTD', 'Apple inc',
                'FIIG Securities Limited Asset Management Arm',
                'First Eagle Alternative Credit, LLC', 'Global Credit Investments',
                'Seatown', 'Sona Asset Management']

suffixes = generate_common_suffixes()
cleaned_names = []

for company in company_name:
    for suffix in suffixes:
        new = re.sub(r'\b{}\b'.format(re.escape(suffix)), '', company)
        cleaned_names.append(new)
I keep getting a list of unchanged company names despite knowing that the suffixes are there.
Alternate Attempt
I've also tried an alternate method where I'd look for the word and replace it without regex, but I couldn't figure out why it was removing parts of the company name itself; for example, it would remove the first three letters of Samsung.
for word in common_words:
    name = name.replace(word, "")
Any help is greatly appreciated!
import unicodedata
from cleanco import basename
import re

company_names = ['SAMSUNG ÊLECTRONICS Holding, LTD',
                 'Apple inc',
                 'FIIG Securities Limited Asset Management Arm',
                 'First Eagle Alternative Credit, LLC',
                 'Global Credit Investments',
                 'Seatown',
                 'Sona Asset Management']

suffix = ["holding"]  # "Common words"? You can add more

cleaned_names = []
for company_name in company_names:
    # To Lower
    company_name = company_name.lower()
    # Fix unicode
    company_name = unicodedata.normalize('NFKD', company_name).encode('ASCII', 'ignore').decode()
    # Remove punctuation
    company_name = re.sub(r'[^\w\s]', '', company_name)
    # Remove suffixes
    company_name = basename(company_name)
    # Remove common words
    for word in suffix:
        company_name = re.sub(fr"\b{word}\b", '', company_name)
    # Save
    cleaned_names.append(company_name)

print(cleaned_names)
Output:
['samsung aalectronics ', 'apple', 'fiig securities limited asset management arm', 'first eagle alternative credit', 'global credit investments', 'seatown', 'sona asset management']
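One small optional addition (not in the original answer): the removal steps can leave stray whitespace, as in 'samsung aalectronics ' above. Collapsing it before saving keeps the names tidy:

# collapse leftover double/trailing spaces left by the removals
company_name = ' '.join(company_name.split())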

Trying to create a streamlit app that uses user-provided URLs to scrape and return a downloadable df

I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax but I just cannot get this to work as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I run the same script with Streamlit syntax in streamlit.py I get the above error. Once you run streamlit run streamlit.py in your terminal and look at the UI I've created, it should be quite clear what I'm aiming at: providing the user with a CSV of all ingredients in the recipe URLs they provided, for a convenient grocery shopping list.
Any help would be greatly appreciated!
import numpy as np
import pandas as pd
from recipe_scrapers import scrape_me
# replace_measurement_symbols() and justify() are helper functions defined elsewhere in the repo

def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk", "snufje zout" have been omitted from the ingredient list
    """
    df_list = []

    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())

        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")

        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name

                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)

                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2

                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError) as e:
                pass

    df = pd.concat(df_list)

    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui'),
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook'),
        'Kipworst': ('Gekruide kipworst'),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma')
    }
    reverse_label_ing = {x: k for k, v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)

    print("Assigning ingredient categories")
    category_dict = {
        'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
        'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
                   'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
                   'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
                   'Spaghetti', 'Witte tortilla'),
        'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
                     'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
                     'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
                     'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
                     'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
                     'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
                     'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
                     'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
                     'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
                     'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
        'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
                    'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
                    'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
                    'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
                    'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
                    'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
                    'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
                    'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
                    'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
                    'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
                    'Zonnebloemolie', 'Zwarte balsamicoazijn'),
        'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
                  'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
                  'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
        'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
                   'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
                   'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
    }
    reverse_label_cat = {x: k for k, v in category_dict.items() for x in v}
    df["Category"] = df["Ingredients"].map(reverse_label_cat)
    col = "Category"
    first_col = df.pop(col)
    df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])

    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)

    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False).apply(
        lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))

    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
           .dropna(how='all', subset=oth_cols, axis=0))

    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # To drop rows that are being duplicated with 0 for some reason; will check later

    print("Processing complete!")
    return res
Your function create_df needs a list as an argument, but st.text_input always returns a string, so for recipe in recipes iterates over the characters of that string and the scraper is called with "h" (hence the Website (h) not supported error).
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). But if you need to handle multiple URLs, you should use str.split like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
# Output :

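For reference, here is a minimal sketch of the Streamlit side; the widget labels and the recs/download variable names are assumptions based on the description, not the actual streamlit.py:

import streamlit as st

recs = st.text_input("Recipe URLs (comma-separated)")   # returns a string
download = st.button("Create shopping list")

if download and recs:
    df_download = create_df(recs)  # create_df now splits the string itself
    st.download_button("Download CSV",
                       data=df_download.to_csv(index=False).encode("utf-8"),
                       file_name="shopping_list.csv",
                       mime="text/csv")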
Matching thousands of rows takes too much time with Pandas

Every day I receive a report with some values, and I have to match postal codes from countries all over the world to get the right region. Then I upload the result to my Django app.
Here's a look at my report:
Order Number    Date          City      Postal code
930276          27/09/2022    Madrid    cp: 28033
929670          27/09/2022    Lisboa    cp: 1600-812
I have thousands of rows like this. The objective is to retrieve the region in ISO 3166-2 format. To help me, I went to the Geonames page and downloaded all the countries' information (for example "FR.txt", "ES.txt", ...).
Because these are huge txt files, I chose to store them on an S3 server.
Here is what I tried:
def access_scaleway(region_name, endpoint_url, access_key, secret_key):
    """ Accessing Scaleway Bucket """
    scaleway = boto3.client('s3', region_name=region_name, endpoint_url=endpoint_url,
                            aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    return scaleway


def get_region_code_accessing_scaleway(countries, regions):
    ''' Retrieves the region code from the region name. '''
    list_countries = countries
    list_regions = regions
    list_regions_codes = []

    scaleway_session = access_scaleway(region_name=settings.SCALEWAY_S3_REGION_NAME,
                                       endpoint_url=settings.SCALEWAY_S3_ENDPOINT_URL,
                                       access_key=settings.SCALEWAY_ACCESS_KEY_ID,
                                       secret_key=settings.SCALEWAY_SECRET_ACCESS_KEY)

    for country, region in zip(list_countries, list_regions):
        try:
            obj = scaleway_session.get_object(Bucket=settings.SCALEWAY_STORAGE_BUCKET_NAME, Key=f'countries/{country}.txt')
            df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='\t', header=None)
            df.columns = ['country code', 'postal code', 'place name', 'admin name1', 'admin code1',
                          'admin name2', 'admin code2', 'admin name3', 'admin code3',
                          'latitude', 'longitude', 'accuracy']
            df['postal code'] = df['postal code'].astype(str)
            df['postal code'] = df['postal code'].str.zfill(5)

            # Removing all spaces and special characters
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            region_code = country + "-" + df[df['postal code'] == postal_code]['admin code1'].values[0]
            list_regions_codes.append(region_code)
        except AttributeError:
            list_regions_codes.append(None)
        except ValueError:
            list_regions_codes.append(None)

    return list_regions_codes
But it takes way too long: for a simple report of 1,000 rows, it takes about 30 minutes.
My second try was to go with the OpenDataSoft public API. Here is what I tried:
def fetch_data(url, params, headers=None):
    response = requests.get(url=url, params=params, headers=headers)
    return response


def get_region_code_accessing_scaleway(countries, regions):
    ''' Retrieves the region code from the region name. '''
    list_countries = countries
    list_regions = regions
    list_regions_codes = []

    for country, region in zip(list_countries, list_regions):
        try:
            # Get response from API
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            response = fetch_data(
                url="https://data.opendatasoft.com/api/v2/catalog/datasets/geonames-postal-code%40public/records?",
                params="select=country_code%2C%20postal_code%2C%20admin_code1&where=country_code%3D%22" + country + "%22%20and%20postal_code%3D%22" + postal_code + "%22")
            if response.status_code == 200:
                data = response.json()
                if len(data['records']) > 0:
                    list_regions_codes.append(country + "-" + data['records'][0]['record']['fields']['admin_code1'])
                else:
                    list_regions_codes.append(None)
            else:
                print("Error: " + str(response.status_code))
                list_regions_codes.append(None)
        except (KeyError, IndexError):
            # error handling and return were cut off in the original post;
            # this mirrors the first function so the snippet runs
            list_regions_codes.append(None)

    return list_regions_codes
But once again, it takes forever to get matching values.
The last thing I tried was pgeocode, but it was also too slow.
I don't understand why it takes so long, because the desired output is simply this:
Order Number    Date          City      Postal code     Region code
930276          27/09/2022    Madrid    cp: 28033       ES-MD
929670          27/09/2022    Lisboa    cp: 1600-812    PT-08
Do you have any idea to speed up the process?
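Not a full answer, but a likely culprit: the first approach downloads and re-parses the same country file once per row. A sketch that caches each country's table in a dict and does the lookup in memory (helper and variable names here are illustrative, reusing the question's column layout and bucket key pattern) could look like this:

import io
import re
import pandas as pd

GEONAMES_COLS = ['country code', 'postal code', 'place name', 'admin name1', 'admin code1',
                 'admin name2', 'admin code2', 'admin name3', 'admin code3',
                 'latitude', 'longitude', 'accuracy']

def get_region_codes_cached(countries, regions, scaleway_session, bucket):
    country_lookup = {}   # country code -> {postal code: admin code1}
    codes = []
    for country, region in zip(countries, regions):
        if country not in country_lookup:
            # fetch and parse each country file only once
            obj = scaleway_session.get_object(Bucket=bucket, Key=f'countries/{country}.txt')
            df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='\t', header=None, names=GEONAMES_COLS)
            df['postal code'] = df['postal code'].astype(str).str.zfill(5)
            country_lookup[country] = dict(zip(df['postal code'], df['admin code1']))
        postal_code = re.sub("[^0-9^-]", '', region).strip()
        admin_code = country_lookup[country].get(postal_code)
        codes.append(f"{country}-{admin_code}" if admin_code is not None else None)
    return codes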

Sorting API response with Python to Excel or CSV

I'm trying to sort the free UK Police API response into a readable format (CSV or Excel).
I'm using the Requests library. My initial code gets the response in JSON format:
import requests

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')
r_json = r.json()

for i in r_json:
    for key, value in i.items():
        print(key, ":", value)
The code above produces the following:
category : anti-social-behaviour
location_type : Force
location : {'latitude': '51.196818', 'street': {'id': 1147343, 'name': 'On or near Parking Area'}, 'longitude': '-0.605146'}
context :
outcome_status : None
persistent_id :
id : 79955592
location_subtype :
month : 2019-12
How can I create a table with correct headers for the response I get? Headers would be 'category', 'latitude', 'street', 'name', 'longitude', 'month'.
You need to go deeper into the dictionary tree to get some of the data, like latitude. The results are collected into a list of lists, then loaded into a DataFrame and saved as a CSV file.
import requests
import pandas as pd

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')
r_json = r.json()

# collect data into list of lists
collected_data = []
for data in r_json:
    category = data.get('category')
    month = data.get('month')
    latitude = ''
    longitude = ''
    street = ''
    for key, value in data.items():
        if key == 'location':
            latitude = value.get('latitude')
            longitude = value.get('longitude')
            street = value.get('street').get('name')
    collected_data.append([category, latitude, longitude, street, month])

# load data into data frame
df = pd.DataFrame(collected_data, columns=['Category', 'Latitude', 'Longitude', 'Street', 'Month'])

# save data frame into csv
df.to_csv('data.csv')
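As a shorter alternative (a sketch, not part of the original answer), pandas.json_normalize flattens the nested location dict directly; the nested keys become dotted column names that can then be renamed:

import requests
import pandas as pd

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')

# nested dicts become columns such as 'location.latitude' and 'location.street.name'
df = pd.json_normalize(r.json())
df = df[['category', 'location.latitude', 'location.street.name', 'location.longitude', 'month']]
df.columns = ['Category', 'Latitude', 'Street', 'Longitude', 'Month']
df.to_csv('data.csv', index=False)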

pandas dataframe apply function over column creating multiple columns

I have the pandas df below, with a few columns, one of which is ip_address
df.head()

        my_id  someother_id           created_at       ip_address     state
308074 309115       2859690  2014-09-26 22:55:20   67.000.000.000  rejected
308757 309798       2859690  2014-09-30 04:16:56  173.000.000.000  approved
309576 310619       2859690  2014-10-02 20:13:12  173.000.000.000  approved
310347 311390       2859690  2014-10-05 04:16:01  173.000.000.000  approved
311784 312827       2859690  2014-10-10 06:38:39   69.000.000.000  approved
For each ip_address I'm trying to return the description, city, and country.
I wrote the function below and tried to apply it:
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return [description, city, country]

# ---

suspect['ipwhois'] = suspect['ip_address'].apply(returnIP)
My problem is that this returns a list; I want three separate columns.
Any help is greatly appreciated. I'm new to Pandas/Python, so if there's a better way to write the function and use Pandas, that would be very helpful.
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return (description, city, country)

suspect['description'], suspect['city'], suspect['country'] = \
    suspect['ip_address'].apply(returnIP)
I was able to solve it with another stackoverflow solution
cols = ['description', 'city', 'country']  # the three new columns to create

for n, col in enumerate(cols):
    suspect[col] = suspect['ipwhois'].apply(lambda ipwhois: ipwhois[n])
If there's a more elegant way to solve this, please share!
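One slightly more compact option (a sketch, not from the original post): expand the list column in a single assignment instead of looping over the column names:

import pandas as pd

# each cell of 'ipwhois' is a [description, city, country] list
suspect[['description', 'city', 'country']] = pd.DataFrame(
    suspect['ipwhois'].tolist(), index=suspect.index)

# equivalent, but usually slower on large frames:
# suspect[['description', 'city', 'country']] = suspect['ipwhois'].apply(pd.Series)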
