Find number greater than given parameter in regex - python

I am trying to write out the whole line if it contains the city name and the number of rooms is greater than the value passed to my function. So far I have written the following regex, but it only finds listings whose room count is exactly equal to the parameter, not those with more rooms.
if re.search(r'R[0-9]{7},([-\w ]{1,30}%s[\w ]{1,30}),[0-9]{1,9},%d' % (city_name, number_of_bedrooms), string, re.IGNORECASE):
The file that I am looking into is:
R2507956,2242 Grant Street Vancouver BC V5L 2Z7,1699000,5,2,House,13
R2500627,305-1006 Beach Avenue Vancouver BC V6E 1T7,981000,2,2,Condo,34
R2512107,680 W 6th Avenue Vancouver BC V5Z 1A3,989000,2,2,Townhouse,1
R2512000,208-607 E 8th Avenue Vancouver BC V5T 1T2,574900,1,1,Condo,1
R2511923,2146 W 14th Avenue Vancouver BC V6K 2V7,2248000,3,3,House,31
R2511301,2638 Charles Street Vancouver BC V5K 3A5,1890000,8,8,House,18
R2511809,307-2080 E Kent Avenue Vancouver BC V5P 4X2,449000,1,1,Condo,2
R2511747,1408-1775 Quebec Street Vancouver BC V5T 0E3,679900,1,1,Condo,5
R2511972,306-7180 Linden Avenue Burnaby BC V5E 3G6,448800,1,1,Condo,30
R2511059,7760 Berkley Street Burnaby BC V5E 2J7,1150000,2,1,House,20
R2511262,1106-9222 University Crescent Burnaby BC V5A 0A6,629800,2,2,Condo,4
R2510818,5190 Fulwell Street Burnaby BC V5G 1P2,1390000,7,4,House,15
R2510183,5712 Grant Street Burnaby BC V5B 2K4,1698000,3,4,House,18
R2512071,8154 Gilley Avenue Burnaby BC V5J 4Y5,2488000,9,9,House,1
R2510573,5059 Norfolk Street Burnaby BC V5G 1E9,1299000,4,4,House,7
R2512173,11226 236 Street Maple Ridge BC V2W 0C8 ,900000,4,4,House,35
R2512052,21560 Ashbury Court Maple Ridge BC V2X 8Z7,775000,3,2,House,43
R2508895,227-12258 224 Street Maple Ridge BC V2X 8Y7,474900,2,2,Condo,12
R2512451,102 Croteau Court Coquitlam BC V3K 6E2,948000,4,2,House,20
R2512494,1803-1185 The High Street Coquitlam BC V3B 0A9,968000,3,2,Condo,10

You can match and capture the number of bedrooms and then compare if a match occurred.
Also, you can match city names as whole words, that is where regex comes in handy.
Here is a snippet:
import re

number_of_bedrooms = 3
city_name = 'Vancouver'
rx = r'^R[0-9]{7},([^,]*\b%s\b[^,]*),\d{1,9},(\d+)' % (city_name)
# filepath is the path to the listings file
with open(filepath, 'r') as f:
    for line in f:
        m = re.search(rx, line, re.IGNORECASE)
        if m:
            if int(m.group(2)) >= number_of_bedrooms: # Nr of bedrooms is in Group 2
                print(line)
Here, as number_of_bedrooms = 3, the output is:
R2507956,2242 Grant Street Vancouver BC V5L 2Z7,1699000,5,2,House,13
R2511923,2146 W 14th Avenue Vancouver BC V6K 2V7,2248000,3,3,House,31
R2511301,2638 Charles Street Vancouver BC V5K 3A5,1890000,8,8,House,18
Since the city field is delimited by commas, the [\w ]{1,30} patterns can be replaced with [^,]*.
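As a self-contained illustration of the same idea, here is a minimal sketch that wraps the match-then-compare step in a function and runs it on a few of the sample lines. The function name `filter_listings` is hypothetical, and the use of `re.escape` is an addition that is not in the answer above; it is only there so that a city name containing regex metacharacters is treated as literal text.

```python
import re

def filter_listings(lines, city_name, min_bedrooms):
    """Return the lines whose address mentions city_name and whose
    bedroom count (4th comma-separated field) is >= min_bedrooms."""
    # re.escape is an assumption added here; the rest of the pattern follows the answer.
    rx = re.compile(
        r'^R[0-9]{7},[^,]*\b%s\b[^,]*,\d{1,9},(\d+)' % re.escape(city_name),
        re.IGNORECASE,
    )
    matches = []
    for line in lines:
        m = rx.search(line)
        # The regex only captures the number; the comparison happens in Python.
        if m and int(m.group(1)) >= min_bedrooms:
            matches.append(line)
    return matches

sample = [
    "R2507956,2242 Grant Street Vancouver BC V5L 2Z7,1699000,5,2,House,13",
    "R2500627,305-1006 Beach Avenue Vancouver BC V6E 1T7,981000,2,2,Condo,34",
    "R2510818,5190 Fulwell Street Burnaby BC V5G 1P2,1390000,7,4,House,15",
]
for hit in filter_listings(sample, "vancouver", 3):
    print(hit)
```

The point of the design is that a regex cannot easily express "a number greater than N", so the pattern just captures the digits and the numeric comparison is left to ordinary Python.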

Related

Draw a Map of cities in python

I have a ranking of cities across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is draw a map of the world where those cities are colored according to their position in the ranking above. I am open to other representations (such as bubbles whose size depends on the position of each city in the ranking or, if necessary, showing only a sample of cities taken from the top, the middle and the bottom of the ranking).
Thank you,
Federico
Your question has two parts: finding the location of each city, and then drawing the cities on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with its position in the list used to determine the size of that circle.
import folium

cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]

# a low zoom level keeps the whole world in view
m = folium.Map(zoom_start=2)

for counter, city in enumerate(cities):
    # the radius grows with the position in the list
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)

m.save('map.html')
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
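One possible way to do that adjustment, and to colour the circles by rank as the question asks, is sketched below. The helper names `rank_to_radius` and `rank_to_color`, the radius limits and the red-to-blue colour scale are all assumptions made for illustration; they are not part of Folium. The sketch deliberately avoids importing Folium so that it can be run on its own.

```python
def rank_to_radius(rank, total, min_radius=3.0, max_radius=15.0):
    """Map a 0-based rank to a radius; rank 0 gets the largest circle."""
    if total <= 1:
        return max_radius
    fraction = rank / (total - 1)  # 0.0 for the top city, 1.0 for the last
    return max_radius - fraction * (max_radius - min_radius)

def rank_to_color(rank, total):
    """Map a 0-based rank to a hex colour from red (top) to blue (bottom)."""
    fraction = 0.0 if total <= 1 else rank / (total - 1)
    red = round(255 * (1 - fraction))
    blue = round(255 * fraction)
    return "#{:02x}00{:02x}".format(red, blue)

names = ["Seoul", "Tokyo", "Paris", "New_York_Greater"]
for rank, name in enumerate(names):
    print(name, rank_to_radius(rank, len(names)), rank_to_color(rank, len(names)))
```

The returned values could then be passed as `radius=` and as `color=` / `fill_color=` in the `folium.CircleMarker` call shown earlier. Because the radius is bounded, the circles stay a sensible size no matter how many cities are in the list.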

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city

df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""

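The output above shows two rough edges: row 6 comes out as "nue Waco" because "Ave" is found inside "Avenue", and row 4 picks up the house number 14857 as its Zip. A possible way around both, sketched here with plain `re` rather than pandas, is to match the street type as a whole word and to anchor the state and the optional postcode at the end of the string. The names `ADDRESS_RX` and `split_address` are hypothetical, and the pattern assumes every address ends with a two-letter state optionally followed by a five-digit postcode.

```python
import re

STREET_SYNONYMS = ["Circle", "Street", "Drive", "Road", "Avenue", "Ave", "St"]

# Whole-word street type, then the city, then a two-letter state
# and an optional 5-digit postcode anchored at the end of the string.
ADDRESS_RX = re.compile(
    r'\b(?:%s)\b\s+(?P<city>.+?)\s+(?P<state>[A-Z]{2})(?:\s+(?P<zip>\d{5}))?\s*$'
    % "|".join(map(re.escape, STREET_SYNONYMS))
)

def split_address(address):
    """Return (city, state, zip); all three are None if the address does not fit."""
    m = ADDRESS_RX.search(address)
    if not m:
        return None, None, None
    return m.group("city"), m.group("state"), m.group("zip")

for text in [
    "11 North Warren Circle Lisbon Falls ME 04252",
    "14857 Martinsville Road Van Buren MI",
    "2525 Washington Avenue Waco TX 76710",
]:
    print(split_address(text))
```

With a DataFrame, the function could be applied to the column, for example with `df['Address Text'].apply(split_address)`, which yields a Series of tuples that can then be split into three columns.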
How to apply regex to get the exact house number with approximate residual address match

import re
list = []
for element in address1:
    z = re.match("^\d+", element)
    if z:
        list.append(z.string)
get_best_fuzzy("SATYAGRAH;OPP. RAJ SUYA BUNGLOW", list)
I have tried the code above; it gives me the approximate address match for the addresses in my text file. How can I get an exact match on the house number combined with an approximate match on the rest of the address? My addresses are in this format:
1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India
101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India
You can try this.
Code :
import re
address = ["1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
"101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
"101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India"]
for i in address:
    z = re.match("^([^ ;]+)", i)
    print(z.group())
Output :
1004
101
101/32-B
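To combine this with the approximate matching the question asks about, one option is to require the extracted house number to be identical and to score only the remainder of the address. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for the asker's `get_best_fuzzy`, whose implementation is not shown; the function names `split_house_number` and `best_match` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def split_house_number(address):
    """Split an address into (house_number, rest) at the first space or semicolon."""
    m = re.match(r'^([^ ;]+)[ ;]*(.*)$', address)
    return (m.group(1), m.group(2)) if m else ("", address)

def best_match(query, candidates):
    """Return the candidate with the same house number whose remainder
    is most similar to the query, or None if no house number matches."""
    q_num, q_rest = split_house_number(query)
    best, best_score = None, -1.0
    for cand in candidates:
        c_num, c_rest = split_house_number(cand)
        if c_num != q_num:  # the house number must match exactly
            continue
        score = SequenceMatcher(None, q_rest.lower(), c_rest.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best

addresses = [
    "1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
    "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
    "101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India",
]
query = "1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India"
print(best_match(query, addresses))
```

Filtering on the house number first means that "101" and "101/32-B" are never confused with each other, however similar the rest of the two addresses might be.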

AWK reformat portion of results (names) within larger string

My goal is to reformat names from Last First Middle (LFM) to First Middle Last (FML), which are part of a larger string. Here's some sample data:
Name, Address1, Address2
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
What I would like is:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
I know how to reorder the first column, but I end up dropping the rest of the row data:
# Return the first column via comma separation (name), then separate by spaces
# If there are two strings but not three (only a last and first name),
# then change the order to first last.
awk -F, '{print $1}'| awk -F" " '$2!="" && $3=="" {print $2,$1}' >> names.txt
awk -F, '{print $1}'| awk -F" " '$3!="" && $4=="" {print $3,$1,$2}' >> names.txt
...# Continue to iterate column numbers
If there's an easier way to put the last string found and move it to the front I'd like to hear about it, but here's my real interest...
My problem is that I want to reorder the space separated fields of the 1st comma separated field (what I did above), but then also print the rest of the comma separated data.
Is there a way I can store the address info in a variable and append it after the space-separated names?
Alternatively, could I do some kind of nested split?
I'm currently doing this with awk in bash, but am willing to use python/pandas or any other efficient methods.
Thanks for the help!
Using sed, looks terrible but works:
sed -E '2,$s/^([^ ,]*) ([^ ,]*)( [^,]*)?/\2\3 \1/' in
and POSIX version:
sed '2,$s/^\([^ ,]*\) \([^ ,]*\)\( [^,]*\)*/\2\3 \1/' in
output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The following AWK script, as ugly as it is, works for your inputs (run with awk -F, -f script.awk):
{
    split($1, names, " ");
    for (i=2; i<=length(names); i++)
        printf("%s ", names[i]);
    printf("%s, ", names[1]);
    for(i=2; i<NF; i++)
        printf("%s,", $i);
    print($NF)
}
Input:
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
Output:
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The same solution in Python:
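import sys
import re

for line in sys.stdin:
    # strip the trailing newline so print() does not emit blank lines
    parts = re.split(r'\s*,\s*', line.rstrip('\n'))
    names = parts[0].split()
    # move the first word (last name) to the end of the name field
    print(", ".join([" ".join(names[1:] + names[:1])] + parts[1:]))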
Another awk. This one works with the header line and with Madonna (i.e. single-word names):
$ awk ' # using awk
BEGIN{FS=OFS=","} # csv
{
n=split($1,a," ") # split the first field to a
for(i=n;i>1;i--) # iterate back from the last element of a
a[1]=a[i] " " a[1] # prepending to the first element of a
$1=a[1] # replace the first field with the first element of a
}1' file # output
Output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
Madonna, ...
$ awk '
BEGIN { FS=OFS=", " }
$1 ~ / / {
last = rest = $1
sub(/ .*/,"",last)
sub(/[^ ]+ /,"",rest)
$1 = rest " " last
}
{ print }
' file
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO

How to Nest If Statement Within For Loop When Scraping Div Class HTML

Below is a scraper that uses Beautiful Soup to scrape physician information off of this webpage. As you can see from the html code directly below, each physician has an individual profile on the webpage that displays the physician's name, clinic, profession, taxonomy, and city.
<div class="views-field views-field-title practitioner__name" >Marilyn Adams</div>
<div class="views-field views-field-field-pract-clinic practitioner__clinic" >Fortius Sport & Health</div>
<div class="views-field views-field-field-pract-profession practitioner__profession" >Physiotherapist</div>
<div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region" >Fraser River Delta</div>
<div class="views-field views-field-city practitioner__city" ></div>
As you can see from the sample html code, the physician profiles occasionally have information missing. If this occurs, I would like the scraper to print 'N/A'. I need the scraper to print 'N/A' because I would eventually like to put each div class category (name, clinic, profession, etc.) into an array where the lengths of each column are exactly the same so I can properly export the data to a CSV file. Here is an example of what I want the output to look like compared to what is actually showing up.
Actual                Expected
[Names]               [Names]
Greg                  Greg
Bob                   Bob
[Clinic]              [Clinic]
Sport/Health          Sport/Health
                      N/A
[Profession]          [Profession]
Physical Therapist    Physical Therapist
Physical Therapist    Physical Therapist
[Taxonomy]            [Taxonomy]
Fraser River          Fraser River
                      N/A
[City]                [City]
Vancouver             Vancouver
Vancouver             Vancouver
I have tried writing an if statement nested within each for loop, but the code does not seem to be looping correctly as the "N/A" only shows up once for each div class section. Does anyone know how to properly nest an if statement with a for loop so I am getting the proper amount of "N/As" in each column? Thanks in advance!
import requests
import re
from bs4 import BeautifulSoup
page=requests.get('https://sportmedbc.com/practitioners')
soup=BeautifulSoup(page.text, 'html.parser')
#Find Doctor Info
for doctor in soup.find_all('div',attrs={'class':'views-field views-field-title practitioner__name'}):
    for a in doctor.find_all('a'):
        print(a.text)
for clinic_name in soup.find_all('div',attrs={'class':'views-field views-field-field-pract-clinic practitioner__clinic'}):
    for b in clinic_name.find_all('a'):
        if b==(''):
            print('N/A')
profession_links=soup.findAll('div',attrs={'class':'views-field views-field-field-pract-profession practitioner__profession'})
for profession in profession_links:
    if profession.text==(''):
        print('N/A')
    print(profession.text)
taxonomy_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for taxonomy in taxonomy_links:
    if taxonomy.text==(''):
        print('N/A')
    print(taxonomy.text)
city_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for city in city_links:
    if city.text==(''):
        print('N/A')
    print(city.text)
For this problem you can use ChainMap from the collections module. That way you can define your default values, in this case 'n/a', and only grab the information that exists for each doctor:
from bs4 import BeautifulSoup
import requests
from collections import ChainMap
url = 'https://sportmedbc.com/practitioners'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
def get_data(soup):
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a', 'region': 'n/a', 'city': 'n/a'}
    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}
        if doctor.select_one('.practitioner__name').text.strip():
            doctor_data['name'] = doctor.select_one('.practitioner__name').text
        if doctor.select_one('.practitioner__clinic').text.strip():
            doctor_data['clinic'] = doctor.select_one('.practitioner__clinic').text
        if doctor.select_one('.practitioner__profession').text.strip():
            doctor_data['profession'] = doctor.select_one('.practitioner__profession').text
        if doctor.select_one('.practitioner__region').text.strip():
            doctor_data['region'] = doctor.select_one('.practitioner__region').text
        if doctor.select_one('.practitioner__city').text.strip():
            doctor_data['city'] = doctor.select_one('.practitioner__city').text
        yield ChainMap(doctor_data, default_data)

for doctor in get_data(soup):
    print('name:\t\t', doctor['name'])
    print('clinic:\t\t',doctor['clinic'])
    print('profession:\t',doctor['profession'])
    print('city:\t\t',doctor['city'])
    print('region:\t\t',doctor['region'])
    print('-' * 80)
Prints:
name: Jaimie Ackerman
clinic: n/a
profession: n/a
city: n/a
region: n/a
--------------------------------------------------------------------------------
name: Marilyn Adams
clinic: Fortius Sport & Health
profession: Physiotherapist
city: n/a
region: Fraser River Delta
--------------------------------------------------------------------------------
name: Mahsa Ahmadi
clinic: Wellpoint Acupuncture (Sports Medicine)
profession: Acupuncturist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Tracie Albisser
clinic: Pacific Sport Northern BC, Tracie Albisser
profession: Strength and Conditioning Specialist, Exercise Physiologist
city: n/a
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Christine Alder
clinic: n/a
profession: n/a
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Steacy Alexander
clinic: Go! Physiotherapy Sports and Wellness Centre
profession: Physiotherapist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Page Allison
clinic: AET Clinic, .
profession: Athletic Therapist
city: Victoria
region: Vancouver Island - Central Coast
--------------------------------------------------------------------------------
name: Dana Alumbaugh
clinic: n/a
profession: Podiatrist
city: Squamish
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Manouch Amel
clinic: Mountainview Kinesiology Ltd.
profession: Strength and Conditioning Specialist
city: Anmore
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Janet Ames
clinic: Dr. Janet Ames
profession: Physician
city: Prince George
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Sandi Anderson
clinic: n/a
profession: n/a
city: Coquitlam
region: Fraser Valley
--------------------------------------------------------------------------------
name: Greg Anderson
clinic: University of the Fraser Valley
profession: Exercise Physiologist
city: Mission
region: Fraser Valley
--------------------------------------------------------------------------------
EDIT:
For getting the output in columns, you can use this example:
def print_data(header_text, data, key):
    print(header_text)
    for d in data:
        print(d[key])
    print()

data = list(get_data(soup))
print_data('[Names]', data, 'name')
print_data('[Clinic]', data, 'clinic')
print_data('[Profession]', data, 'profession')
print_data('[Taxonomy]', data, 'region')
print_data('[City]', data, 'city')
This prints:
[Names]
Jaimie Ackerman
Marilyn Adams
Mahsa Ahmadi
Tracie Albisser
Christine Alder
Steacy Alexander
Page Allison
Dana Alumbaugh
Manouch Amel
Janet Ames
Sandi Anderson
Greg Anderson
[Clinic]
n/a
Fortius Sport & Health
Wellpoint Acupuncture (Sports Medicine)
Pacific Sport Northern BC, Tracie Albisser
n/a
Go! Physiotherapy Sports and Wellness Centre
AET Clinic, .
n/a
Mountainview Kinesiology Ltd.
Dr. Janet Ames
n/a
University of the Fraser Valley
[Profession]
n/a
Physiotherapist
Acupuncturist
Strength and Conditioning Specialist, Exercise Physiologist
n/a
Physiotherapist
Athletic Therapist
Podiatrist
Strength and Conditioning Specialist
Physician
n/a
Exercise Physiologist
[Taxonomy]
n/a
Fraser River Delta
Vancouver & Sea to Sky
Cariboo - North East
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Vancouver Island - Central Coast
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Cariboo - North East
Fraser Valley
Fraser Valley
[City]
n/a
n/a
Vancouver
n/a
Vancouver
Vancouver
Victoria
Squamish
Anmore
Prince George
Coquitlam
Mission
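Since the question's end goal is a CSV file with columns of equal length, here is a minimal sketch of that last step using only the standard library. It assumes each practitioner is a plain dict that simply omits the fields that were empty; `csv.DictWriter` then fills the gaps through its `restval` argument. The function name `write_csv` and the field order are assumptions, and the two records reuse values from the output above.

```python
import csv
import io

FIELDS = ["name", "clinic", "profession", "region", "city"]

def write_csv(records, out):
    """Write dict records to out; keys missing from a record become 'N/A'."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="N/A")
    writer.writeheader()
    writer.writerows(records)

records = [
    {"name": "Marilyn Adams", "clinic": "Fortius Sport & Health",
     "profession": "Physiotherapist", "region": "Fraser River Delta"},
    {"name": "Jaimie Ackerman"},
]

# An in-memory buffer stands in for a real file opened with newline=''.
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

Every row ends up with exactly five cells, so the columns line up even when a profile is missing several fields.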
