Scrape website to only show populated categories - python

I am in the process of scraping a website. The code pulls the contents of the page, but some categories are technically empty, and their headers still show. I would like to see only categories with events in them. Ideally I could even have the components of each transaction so I can choose which elements I want displayed.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
for events in NHsoup.findAll('tr')[2:]:
    print(events.text)
    NHlist.append(events.text)
print(' '.join(NHlist))
Like I said, this works to get all of the information, but there are a lot of headers/empty spaces that don't need to be pulled. For example, at the time I'm writing this the 'acquisitions', 'conversions', and 'change in control' categories are empty, but the headers still come in and there's a relatively large blank space after them. I feel like I need some sort of loop to go through each header ('td') and then get its contents ('tr'), but I'm just not quite sure how to do it.

You can use itertools.groupby to group elements and then filter out empty rows:
import requests
from itertools import groupby
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
# The lambda keeps a counter in its (mutable) default dict and bumps it on every
# header row (a row containing <th>), so each header starts a new group:
for _, g in groupby(NHsoup.select('tr'), lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.select('th') else (None, d['g'])):
    s = [tag.get_text(strip=True, separator=' ') for tag in g]
    if any(i == '' for i in s):
        continue
    NHlist.append(s)

# This is just pretty printing, all the data are already in NHlist:
l = max(map(len, (j for i in NHlist for j in i))) + 5
for item in NHlist:
    print('{: <4} {}'.format(' ', item[0]))
    print('-' * l)
    for i, ev in enumerate(item[1:], 1):
        print('{: <4} {}'.format(i, ev))
    print()
Prints:
Scraping NH Dept of Banking...
New Bank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/11/18 The Millyard Bank
Interstate Bank Combination
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/16/19 Optima Bank & Trust Company with and into Cambridge Trust Company Portsmouth, NH 03/29/19
Amendment to Articles of Agreement or Incorporation; Business or Capital Plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
2 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
3 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
4 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
5 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
6 5/10/19 AB Trust Company New York, NY 06/04/19
Reduction in Capital
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 03/07/19 Primary Bank Bedford, NH 04/10/19
Amendment to Bylaws
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
2 2/25/19 Members First Credit Union Manchester, NH 04/05/19
3 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
4 6/28/19 Bellwether Community Credit Union
Interstate Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
2 03/08/19 One Credit Union Newport, NH 03/29/19
3 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
4 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
5 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
6 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
7 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
Interstate Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
New Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
2 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
3 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
4 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
New Loan Production Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH
03766-1430 04/15/19
Loan Production Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
Trade Name Requests
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Management Trust Company" 04/24/19
New Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/19/19 Janney Trust Co., LLC
2 02/25/19 Darwin Trust Company of New Hampshire, LLC
3 07/15/`9 Harbor Trust Company
Dissolution of Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
Trust Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/10/19 Charter Trust Company Rochester, NH 05/20/19
New Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 06824 03/22/19
Relocation of Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Floor, Boston MA To: 100 Summer Street, 12th Flr, Boston, MA 02/01/19
2 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New York, NY 10017 To: 410 Park Avenue, Suite 900 New York, NY 10022 03/29/19
3 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY 10022 To: One Grand Central Place 60 East 42nd Street, Ste 1550 New York, NY 10165 04/23/19
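If the stateful lambda above feels opaque, the same grouping can be done with a plain loop that opens a new group at every header row. This is only a minimal sketch under the same assumption that header rows contain th cells; as in the question, the first page-header rows may need skipping:
import requests
from bs4 import BeautifulSoup

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHsoup = BeautifulSoup(requests.get(NHurl).text, 'html.parser')

groups = []
for tr in NHsoup.select('tr'):
    text = tr.get_text(strip=True, separator=' ')
    if tr.select('th'):        # a header row starts a new category
        groups.append([text])
    elif text and groups:      # keep only non-empty event rows
        groups[-1].append(text)

# A category whose only member is its own header is empty, so skip it.
for group in (g for g in groups if len(g) > 1):
    print(group[0])
    for i, event in enumerate(group[1:], 1):
        print('{: <4} {}'.format(i, event))
    print()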

You could test which rows consist entirely of '\xa0' (and so appear blank) and exclude them. I append to a list and convert to a pandas DataFrame, but you could just print the rows directly.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.nh.gov/banking/corporate-activities/index.htm')
soup = bs(r.content, 'lxml')
results = []
for tr in soup.select('tr'):
    row = [i.text for i in tr.select('th,td')]
    if row.count('\xa0') != len(row):
        results.append(row)
pd.set_option('display.width', 100)
df = pd.DataFrame(results)
# Note: the Styler returned here is discarded; it only affects notebook display.
df.style.set_properties(**{'text-align': 'left'})
df.columns = df.iloc[0]
df = df[1:]
df.fillna(value='', inplace=True)
print(df.head(20))

Not sure if this is how you want it, and there is probably a more elegant way, but what I basically did was:
use Pandas to get the table;
since Pandas automatically assigns columns, move the column row back to the first data row;
find where rows are all nulls;
drop the rows with all nulls and the row before each (its sub-header).
import pandas as pd
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]
top_row = pd.DataFrame([df.columns], index=[-1])
df.columns = top_row.columns
df = df.append(top_row, sort=True).sort_index().reset_index(drop=True)
null_rows = df[df.isnull().values.all(axis=1)].index.tolist()
drop_hdr_rows = [x - 1 for x in null_rows ]
drop_rows = drop_hdr_rows + null_rows
new_df = df[~df.index.isin(drop_rows)]
print(new_df.to_string())
Output:
0 1 2 3
2 New Bank New Bank New Bank New Bank
3 12/11/18 The Millyard Bank NaN NaN
4 Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination
5 01/16/19 Optima Bank & Trust Company with and into Camb... Portsmouth, NH 03/29/19
12 Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor...
13 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
14 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
15 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
16 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
17 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
18 5/10/19 AB Trust Company New York, NY 06/04/19
19 Reduction in Capital Reduction in Capital Reduction in Capital Reduction in Capital
20 03/07/19 Primary Bank Bedford, NH 04/10/19
21 Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws
22 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
23 2/25/19 Members First Credit Union Manchester, NH 04/05/19
24 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
25 6/28/19 Bellwether Community Credit Union NaN NaN
26 Interstate Branch Office Interstate Branch Office Interstate Branch Office Interstate Branch Office
27 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
28 03/08/19 One Credit Union Newport, NH 03/29/19
29 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
30 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
31 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
32 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
33 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
34 Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure
35 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
36 New Branch Office New Branch Office New Branch Office New Branch Office
37 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
38 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
39 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
40 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
41 Branch Office Closure Branch Office Closure Branch Office Closure Branch Office Closure
42 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
43 New Loan Production Office New Loan Production Office New Loan Production Office New Loan Production Office
44 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH 03766-1430 04/15/19
45 Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure
46 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
51 Trade Name Requests Trade Name Requests Trade Name Requests Trade Name Requests
52 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Managem... 04/24/19
53 New Trust Company New Trust Company New Trust Company New Trust Company
54 02/19/19 Janney Trust Co., LLC NaN NaN
55 02/25/19 Darwin Trust Company of New Hampshire, LLC NaN NaN
56 07/15/`9 Harbor Trust Company NaN NaN
57 Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company
58 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
59 Trust Office Closure Trust Office Closure Trust Office Closure Trust Office Closure
60 5/10/19 Charter Trust Company Rochester, NH 05/20/19
61 New Trust Office New Trust Office New Trust Office New Trust Office
62 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 0... 03/22/19
63 Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office
64 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Fl... 02/01/19
65 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New Y... 03/29/19
66 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY ... 04/23/19
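Note that DataFrame.append was removed in pandas 2.0, so the header-shuffling step above fails on a current install. A sketch of the same move with pd.concat, assuming the same read_html result:
import pandas as pd

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]
top_row = pd.DataFrame([df.columns], index=[-1])
df.columns = top_row.columns
# pd.concat replaces the removed DataFrame.append; index -1 sorts first.
df = pd.concat([df, top_row]).sort_index().reset_index(drop=True)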

Related

Beautiful soup not loading new page after Selenium click

The first page is loaded and parsed as expected, but after clicking on Next page, BS4 does not get the new page from driver.page_source.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
from urllib.request import urlopen as ureq
import random
import time
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
def parse_html(pagesource, count):
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tables = soup.findChildren('table')
    # This will get the first (and only) table. Your page may have more.
    my_table = tables[0]
    table_body = my_table.find('tbody')
    all_rows = table_body.find_all('tr')
    # print (all_rows[0])
    for row in all_rows:
        print (count)
        count += 1
        try:
            path_body = row.find("td", class_="views-field-company-name")
            path = path_body.find("a")['href']
            company_name = path_body.find("a").text
            company_name = company_name.strip()
            print (company_name)
            issue_datetime = row.find("td", class_="views-field-field-letter-issue-datetime")
            # print (type(issue_datetime.find("time")['datetime']))
            issue_recepient_office = row.find("td", class_="views-field-field-building").string
            issue_recepient_office = issue_recepient_office.strip()
            # print (issue_recepient_office)
            detailed_description = row.find("td", class_="views-field-field-detailed-description-2").string
            if detailed_description:
                detailed_description = detailed_description.strip()
            else:
                detailed_description = ""
            #print (detailed_description)
        except:
            pass
url = 'https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/warning-letters'
driver.get(url)
count = 1
parse_html(driver.page_source, count)
for i in range(0,3):
    time.sleep(10)
    #print(driver.page_source.encode('utf-8'))
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#datatable_next a'))).click()
    time.sleep(30)
    parse_html(driver.page_source, count)
driver.quit()
Output:
1
Ruth Special Food Store LLC
Foreign Supplier Verification Program (FSVP)
2
EarthLab, Inc., dba Wise Woman Herbals
3
Big Olaf Creamery LLC dba Big Olaf
CGMP/Food/Prepared, Packed or Held Under Insanitary Conditions/Adulterated/L. monocytogenes
4
Bainbridge Beverage West, LLC
Juice HACCP/CGMP for Foods/Adulterated/Insanitary Conditions
5
VapeL1FE, LLC
Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded
6
Mike Millenkamp Dairy Cattle
7
Empowered Diagnostics LLC
Unapproved Products Related to the Coronavirus Disease 2019 (COVID-19)
8
RoyalVibe Health Ltd.
CGMP/QSR/Medical Devices/PMA/Adulterated/Misbranded
9
Land View, Inc.
CGMP/Medicated Feeds/Adulterated
10
Green Pharmaceuticals Inc.
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
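One likely cause is reading page_source before the table has re-rendered after the click. A minimal sketch of waiting for the old table to go stale before re-parsing; the '#datatable' selector is an assumption inferred from the question's '#datatable_next' pagination selector:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_table = driver.find_element(By.CSS_SELECTOR, '#datatable')  # assumed selector
WebDriverWait(driver, 30).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '#datatable_next a'))).click()
# Block until the DOM node from the previous page is detached,
# then re-read page_source so BeautifulSoup sees the new rows.
WebDriverWait(driver, 30).until(EC.staleness_of(old_table))
parse_html(driver.page_source, count)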
Non-Selenium solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd
PAGE_LENGTH = 50

def get_letters(page: int):
    start = page * PAGE_LENGTH
    url = f"https://www.fda.gov/datatables/views/ajax?field_letter_issue_datetime=All&field_change_date_2=All&draw={page}&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=2&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=3&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=4&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=5&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=6&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=7&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=false&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&start={start}&length={PAGE_LENGTH}&search%5Bregex%5D=false&_drupal_ajax=1&_wrapper_format=drupal_ajax&view_base_path=inspections-compliance-enforcement-and-criminal-investigations%2Fcompliance-actions-and-activities%2Fwarning-letters%2Fdatatables-data&view_display_id=warning_letter_solr_block&view_dom_id=4605f153788b3a17043d0e031eb733846503177581602cd9fd58ecd78629801b&view_name=warning_letter_solr_index&view_path=%2Finspections-compliance-enforcement-and-criminal-investigations%2Fcompliance-actions-and-activities%2Fwarning-letters&total_items=3433"
    letters = []
    for letter in requests.get(url).json()['data']:
        letters.append([BeautifulSoup(row, 'lxml').get_text(strip=True) for row in letter])
    return letters

result = []
for i in range(0, 5):
    result += get_letters(i)
df = pd.DataFrame(result)
print(df)
OUTPUT:
0 1 2 ... 5 6 7
0 12/27/2022 11/07/2022 Land View, Inc. ...
1 12/27/2022 11/22/2022 MD Pharmaceutical Supply, LLC ...
2 12/27/2022 06/01/2022 Supreme Fruit Produce, Inc. ...
3 12/27/2022 10/06/2022 Empowered Diagnostics LLC ...
4 12/27/2022 11/18/2022 RoyalVibe Health Ltd. ...
.. ... ... ... ... .. .. ..
245 08/11/2022 08/11/2022 The Juice Bar ...
246 08/09/2022 06/16/2022 InfuTronix LLC ...
247 08/09/2022 07/12/2022 Zyno Medical LLC ...
248 08/09/2022 07/28/2022 Vitti Labs, LLC ...
249 08/09/2022 07/22/2022 Muscle Feast, LLC ...
UPDATE
To find the request, use dev tools (F12 by default in Chrome).
Now we need to figure out how to work with this data. Each cell is plain HTML text, and bs4 will help us with that. If the link is needed, you can change letters.append to:
letters.append({
    'Posted Date': BeautifulSoup(letter[0], 'lxml').get_text(strip=True),
    'Letter Issue Date': BeautifulSoup(letter[1], 'lxml').get_text(strip=True),
    'Company Name': BeautifulSoup(letter[2], 'lxml').get_text(strip=True),
    'Issuing Office': BeautifulSoup(letter[3], 'lxml').get_text(strip=True),
    'Subject': BeautifulSoup(letter[4], 'lxml').get_text(strip=True),
    'Link': 'https://www.fda.gov/' + BeautifulSoup(letter[2], 'lxml').find('a').get('href'),
})
And new output looks like:
Posted Date Letter Issue Date Company Name Issuing Office Subject Link
0 12/27/2022 11/07/2022 Land View, Inc. Division of Human and Animal Food Operations West VI CGMP/Medicated Feeds/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/land-view-inc-638704-11072022
1 12/27/2022 11/22/2022 MD Pharmaceutical Supply, LLC Division of Pharmaceutical Quality Operations I CGMP/Active Pharmaceutical Ingredient (API)/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/md-pharmaceutical-supply-llc-637815-11222022
2 12/27/2022 06/01/2022 Supreme Fruit Produce, Inc. Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/supreme-fruit-produce-inc-631972-06012022
3 12/27/2022 10/06/2022 Empowered Diagnostics LLC Center for Devices and Radiological Health Unapproved Products Related to the Coronavirus Disease 2019 (COVID-19) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/empowered-diagnostics-llc-638164-10062022
4 12/27/2022 11/18/2022 RoyalVibe Health Ltd. Center for Devices and Radiological Health CGMP/QSR/Medical Devices/PMA/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/royalvibe-health-ltd-639553-11182022
5 12/27/2022 11/28/2022 Bainbridge Beverage West, LLC Division of Human and Animal Food Operations West V Juice HACCP/CGMP for Foods/Adulterated/Insanitary Conditions https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/bainbridge-beverage-west-llc-638942-11282022
6 12/27/2022 12/16/2022 Green Pharmaceuticals Inc. Division of Pharmaceutical Quality Operations IV Drug Product/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/green-pharmaceuticals-inc-635162-12162022
7 12/27/2022 12/16/2022 VapeL1FE, LLC Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapel1fe-llc-648624-12162022
8 12/27/2022 12/09/2022 Ruth Special Food Store LLC Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/ruth-special-food-store-llc-644551-12092022
9 12/27/2022 11/28/2022 Mike Millenkamp Dairy Cattle Division of Human and Animal Food Operations West II New Animal Drug/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/mike-millenkamp-dairy-cattle-640782-11282022
10 12/27/2022 11/10/2022 EarthLab, Inc., dba Wise Woman Herbals Division of Human and Animal Food Operations West VI CGMP/Dietary Supplement/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/earthlab-inc-dba-wise-woman-herbals-634872-11102022
11 12/27/2022 12/09/2022 Big Olaf Creamery LLC dba Big Olaf Division of Human and Animal Food Operations East IV CGMP/Food/Prepared, Packed or Held Under Insanitary Conditions/Adulterated/L. monocytogenes https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/big-olaf-creamery-llc-dba-big-olaf-642758-12092022
12 12/22/2022 12/22/2022 BS Vapes LLC Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/bs-vapes-llc-647308-12222022
13 12/22/2022 12/22/2022 JP & SN Enterprises Inc. d/b/a eCigs International Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/jp-sn-enterprises-inc-dba-ecigs-international-647315-12222022
14 12/20/2022 11/08/2022 Dollar Tree, Inc. Office of Human and Animal Food Operations – West Division 3 Interstate Commerce/Food/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/dollar-tree-inc-629509-11082022
15 12/20/2022 07/27/2022 Sagent Pharmaceuticals, Inc. Division Pharmaceutical Quality Operations I CGMP/Drugs/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/sagent-pharmaceuticals-inc-636636-07272022
16 12/20/2022 11/21/2022 Nature’s Way Farms, LLC Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/natures-way-farms-llc-641201-11212022
17 12/20/2022 12/08/2022 Nortec Quimica SA Center for Drug Evaluation and Research | CDER CGMP/Active Pharmaceutical Ingredient (API)/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/nortec-quimica-sa-639894-12082022
18 12/20/2022 11/30/2022 CHS Inc./CHS River Plains Division of Human and Animal Food Operations West I CGMP/Medicated Feeds/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/chs-incchs-river-plains-642790-11302022
19 12/20/2022 12/02/2022 DuPont Nutrition USA Inc. Division of Pharmaceutical Quality Operations I CGMP/Drug Products/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/dupont-nutrition-usa-inc-627211-12022022
20 12/20/2022 11/01/2022 Del Valle Import Corp. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/del-valle-import-corp-642784-11012022
21 12/20/2022 08/25/2022 Sree Nidhi Corp Center for Food Safety and Applied Nutrition (CFSAN) Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/sree-nidhi-corp-634266-08252022
22 12/20/2022 12/14/2022 Adarsh Daswani, M.D. Center for Drug Evaluation and Research | CDER Clinical Investigator https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/adarsh-daswani-md-648606-12142022
23 12/15/2022 12/15/2022 Vape King Inc. Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vape-king-inc-646625-12152022
24 12/15/2022 12/15/2022 Vapor E-Cigarette, L.L.C. Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapor-e-cigarette-llc-646876-12152022
25 12/13/2022 12/02/2022 SV3, LLC d/b/a Mi-One Brands Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/sv3-llc-dba-mi-one-brands-647624-12022022
26 12/13/2022 12/07/2022 Centrient Pharmaceuticals India Private Limited Center for Drug Evaluation and Research | CDER CGMP/Active Pharmaceutical Ingredient (API)/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/centrient-pharmaceuticals-india-private-limited-640196-12072022
27 12/13/2022 11/22/2022 Cecilia Alvarez Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/cecilia-alvarez-643706-11222022
28 12/13/2022 11/29/2022 Gobwa Exotic Imports Inc. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/gobwa-exotic-imports-inc-641031-11292022
29 12/13/2022 12/05/2022 Thriftmaster Texas, LLC. d/b/a ThriftMaster Global Holdings, Inc. and TM Global Biosciences, LLC Center for Drug Evaluation and Research | CDER Finished Pharmaceuticals/Unapproved New Drug/Misbranded/Adulterated Human Foods https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/thriftmaster-texas-llc-dba-thriftmaster-global-holdings-inc-and-tm-global-biosciences-llc-641057
30 12/13/2022 11/21/2022 Euphoria Fancy Food Inc. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/euphoria-fancy-food-inc-641801-11212022
31 12/08/2022 12/08/2022 Cloud House Vape Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/cloud-house-vape-647544-12082022
32 12/08/2022 12/08/2022 Vapors of Ohio Inc d/b/a Nostalgic Vapes Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapors-ohio-inc-dba-nostalgic-vapes-644739-12082022
33 12/06/2022 11/28/2022 AG Hair Limited Center for Drug Evaluation and Research | CDER CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/ag-hair-limited-638646-11282022
34 12/06/2022 11/22/2022 Glenmark Pharmaceuticals Limited Center for Drug Evaluation and Research | CDER CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/glenmark-pharmaceuticals-limited-637314-11222022
35 12/06/2022 09/23/2022 Saffron USA LLC Division of Human and Animal Food Operations East IV Unapproved New Drugs/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/saffron-usa-llc-629821-09232022
36 12/06/2022 10/24/2022 Cryos International USA LLC Division of Biological Products Operations I Deviations/CFR/Regulations for Human Cells, Tissues & Cellular Products (HCT/Ps) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/cryos-international-usa-llc-639696-10242022
37 12/06/2022 10/17/2022 Zuland Distributor Corp Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/zuland-distributor-corp-638899-10172022
38 12/06/2022 11/07/2022 Manzela USA, LLC Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/manzela-usa-llc-642268-11072022
39 12/06/2022 11/07/2022 Maliba African Market Corp. Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/maliba-african-market-corp-642698-11072022
40 12/06/2022 11/30/2022 Kari Gran Inc. Division of Pharmaceutical Quality Operations IV CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/kari-gran-inc-640035-11302022
41 12/01/2022 12/01/2022 Vapor Candy Inc d/b/a The Vape Stop Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapor-candy-inc-dba-vape-stop-645475-12012022
42 11/30/2022 11/30/2022 Jayde's Vapor Lounge Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/jaydes-vapor-lounge-645085-11302022
43 11/29/2022 11/10/2022 Vapor Plus OK LLC Center for Tobacco Products Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/vapor-plus-ok-llc-646225-11102022
44 11/29/2022 11/18/2022 "David M. Lubeck, M.D./Arbor Centers for EyeCare Center for Drug Evaluation and Research | CDER Clinical Investigator (Sponsor) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/david-m-lubeck-mdarbor-centers-eyecare-643531-11182022
45 11/29/2022 06/01/2022 Jam Jam Services, Inc. Division of Southeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/jam-jam-services-inc-630847-06012022
46 11/29/2022 09/19/2022 La Serranita Import and Export LLC Division of Northeast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/la-serranita-import-and-export-llc-633743-09192022
47 11/29/2022 11/09/2022 J R Imports LLC Division of Southwest Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/j-r-imports-llc-643214-11092022
48 11/29/2022 09/01/2022 Shuzy Rock Inc. Division of Pharmaceutical Quality Operations I CGMP/Finished Pharmaceuticals/Adulterated https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/shuzy-rock-inc-630110-09012022
49 11/22/2022 10/19/2022 Pepe’s Foods Inc. Division of West Coast Imports Foreign Supplier Verification Program (FSVP) https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/pepes-foods-inc-640716-10192022
50 11/22/2022 11/14/2022 yourtramadol.com Center for Drug Evaluation and Research | CDER Finished Pharmaceuticals/Unapproved New Drug/Misbranded https://www.fda.gov//inspections-compliance-enforcement-and-criminal-investigations/warning-letters/yourtramadolcom-639959-11142022

How to group categorical series (like you can in Tableau)

I don't even know how to ask this question, so forgive me if I'm not using appropriate terminology. I have a dataframe of every judicial case that is filed and disposed of. There is a series in this df called 'Court Name' which, as the name implies, lists the court where each case is filed or disposed in. Here they are:
df_combined['Court Name'].value_counts()
Out[27]:
JP 6-1 143768
JP 6-2 111792
JP 3 98831
JP 7 92768
JP 4 74083
383rd District Court 61505
JP 2 60038
JP 5 51013
JP 1 35475
Jury Duty Court 34033
388th District Court 25713
County Court at Law 7 17788
County Court at Law 1 17389
County Criminal Court 4 16877
County Court at Law 4 16823
County Court at Law 2 16812
County Criminal Court 1 16736
County Criminal Court 3 16180
County Criminal Court 2 16025
County Court at Law 5 13243
65th District Court 12635
327th District Court 11957
409th District Court 11707
County Court at Law 6 10818
120th District Court 10633
41st District Court 10308
243rd District Court 9944
Mental Health Court 1 9415
168th District Court 9252
210th District Court 9122
171st District Court 9079
384th District Court 8637
346th District Court 8470
Criminal District Court 1 8274
34th District Court 8228
205th District Court 6141
County Court at Law 3 5283
Mental Health Court 2 4575
448th District Court 3466
Magistration 1835
Probate Court 2 1597
Probate Court 1 1590
384th Competency Court 568
346th Veterans Treatment Court 153
District Clerk 92
County Clerk 43
County Courts at Law 15
Family Court Services 12
Probate Courts 7
Domestic Relations Office 3
County Criminal Courts 2
Deceptive Trade 1
Name: Court Name, dtype: int64
I'm converting from Tableau to Python/Pandas/Numpy/Plotly/Dash, and in Tableau, you can create groups based on a series. What I need to do is to categorize all of the above outputs into
District Courts
County Courts
JP Courts, and
None of the above courts / courts I'm going to filter out.
The end desired result is a new 'Category' series, so if case number 1 is filed in the 388th District Court, its category should be District, and if case 2 is filed in County Court at Law 1, its category should be County, and so on.
I have already created lists where each of the above 'Court Name' values falls into its proper category, but I don't know what to do with those lists, or even if creating them is appropriate. I'd like to not develop poor coding habits, so I'm relying on your collective expertise for the most efficient/elegant way to accomplish my end goal.
Thank you all so much in advance!
Jacob
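One common way to do this is to map substring patterns in 'Court Name' to category labels with numpy.select. A minimal sketch, assuming substring rules like those implied above (the tiny df_combined here is a stand-in for the real dataframe):
import numpy as np
import pandas as pd

df_combined = pd.DataFrame({'Court Name': ['388th District Court',
                                           'County Court at Law 1',
                                           'JP 6-1',
                                           'Jury Duty Court']})

conditions = [
    df_combined['Court Name'].str.contains('District Court', na=False),
    df_combined['Court Name'].str.contains('County', na=False),
    df_combined['Court Name'].str.startswith('JP', na=False),
]
choices = ['District', 'County', 'JP']

# Anything matching none of the patterns gets the default,
# which can then be filtered out.
df_combined['Category'] = np.select(conditions, choices, default='Other')
print(df_combined)
Alternatively, since the lists of court names already exist, each condition can be written as df_combined['Court Name'].isin(some_list).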

ValueError: All arrays must be of the same length when appending data to a DataFrame

import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
productlink=[]
n=[]
a=[]
re=[]
ra=[]
w=[]
r =requests.get('https://www.houzz.com/professionals/general-contractor')
soup=BeautifulSoup(r.content, 'html.parser')
tra = soup.find_all('div',class_='hz-pro-search-result__info')
for pro in tra:
    name=pro.find('span',class_='mlm header-5 text-unbold').text
    n.append(name)
    address=pro.find('span',class_='hz-pro-search-result__location-info__text').text
    a.append(address)
    reviews=pro.find('span',class_='hz-star-rate__review-string').text
    re.append(reviews)
    rating=pro.find('span',class_='hz-star-rate__rating-number').text
    ra.append(rating)
for links in tra:
    for link in links.find_all('a',href=True)[2:]:
        if link['href'].startswith('https://www.houzz.com/professionals/general-contractors'):
            productlink.append(link['href'])
for link in productlink:
    r =requests.get(link,headers=headers)
    soup=BeautifulSoup(r.content, 'html.parser')
    for web in soup.find_all('a',attrs={'class':'sc-62xgu6-0 jxCcwv mwxddt-0 bSdLOV hui-link trackMe'}):
        w.append(web['href'])
df = pd.DataFrame({'name':n,'address':a,'reviews':re,'rating':ra,'web':w})
print(df)
The code works well until I try to append the data into the DataFrame, at which point it raises ValueError: All arrays must be of the same length. How can I append these data into the DataFrame and resolve this issue? I would be very thankful if you could help me with this matter.
This is my output:
Capital Remodeling Hanover, Maryland 21076, United States 409 Reviews 4.8
SOD Home Group 367 Santana Heights, Unit #3-3021, San Jose, California 95128, United States 238 Reviews 5.0
Innovative Construction Inc. 3040 Amwiler Rd, Suite B, Peachtree Corners, Georgia 30360, United States 100 Reviews 5.0
Baron Construction & Remodeling Co. Saratoga & Los Angeles, California 95070, United States 69 Reviews 4.8
Luxe Remodel 329 N. Wetherly Dr., Suite 205, Los Angeles, California 90211, United States 79 Reviews 4.9
California Home Builders & Remodeling Inc. STUDIO CITY, California 91604, United States 232 Reviews 5.0
Sneller Custom Homes and Remodeling, LLC 17018 Seven Pines Dr Ste 100, Spring, Texas 77379, United States 77 Reviews 4.9
123 Remodeling Inc. 5070 N. Kimberly Ave Suite C, Chicago, Illinois 60630, United States 83 Reviews 4.7
Professional builders & Remodeling, Inc 15335 Morrison St #325, Sherman Oaks, California 91403, United States 203 Reviews 5.0
Rudloff Custom Builders 896 Breezewood Lane, West Chester, Pennsylvania 19382, United States 111 Reviews 5.0
LAR Construction & Remodeling 6371 canby ave, Tarzana, California 91335, United States 191 Reviews 5.0
Erie Construction Mid West 4271 Monroe St., Toledo, Ohio 43606, United States 231 Reviews 4.8
Regal Construction & Remodeling Inc. 19537 ½ Ventura Blvd., Tarzana, California 91356, United States 96 Reviews 4.8
Mr. & Mrs. Construction & Remodeling 2570 N 1st street, ste 212, San Jose, California 95131, United States 75 Reviews 5.0
Bailey Remodeling and Construction LLC 201 Meridian Ave., Suite 201, Louisville, Kentucky 40207, United States 106 Reviews 5.0
https://www.houzz.com/trk/aHR0cDovL3d3dy5iYWlsZXlyZW1vZGVsLmNvbQ/2f005891e940e2c01021b57733580fa3/ue/NDU3NDcxNQ/a3be682e415d6c23590401e416ee1018
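For reference, pandas raises this error whenever the dict handed to the DataFrame constructor has columns of unequal length, which is what happens above when fewer links than names are appended. A tiny reproduction:
import pandas as pd

# Two names scraped, but only one link appended to w,
# so the columns disagree in length.
try:
    df = pd.DataFrame({'name': ['a', 'b'], 'web': ['only-one-link']})
except ValueError as e:
    print(e)  # All arrays must be of the same length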
Make it as simple as possible, and do not store the information from different loops in this bunch of lists; try to store it in one dict per result.
Possible solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get('https://www.houzz.com/professionals/general-contractor')
soup = BeautifulSoup(r.content, 'html.parser')
tra = soup.find_all('div', class_='hz-pro-search-result__info')
data = []
for pro in tra:
    name = pro.find('span', class_='mlm header-5 text-unbold').text
    address = pro.find('span', class_='hz-pro-search-result__location-info__text').text
    reviews = pro.find('span', class_='hz-star-rate__review-string').text
    rating = pro.find('span', class_='hz-star-rate__rating-number').text
    w = pro.find('a')['href']
    data.append({'name': name, 'address': address, 'reviews': reviews, 'rating': rating, 'web': w})
for idx, item in enumerate(data):
    r = requests.get(item['web'], headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    for web in soup.find_all('a', attrs={'class': 'sc-62xgu6-0 jxCcwv mwxddt-0 bSdLOV hui-link trackMe'}):
        data[idx]['web'] = web['href']
df = pd.DataFrame(data)
df
Output
name address reviews rating web
0 Capital Remodeling Hanover, Maryland 21076, United States 409 Reviews 4.8 https://www.houzz.com/trk/aHR0cDovL3d3dy5jYXBp...
1 SOD Home Group 367 Santana Heights, Unit #3-3021, San Jose, C... 238 Reviews 5.0 https://www.houzz.com/trk/aHR0cHM6Ly9zb2RoZy5j...
2 Innovative Construction Inc. 3040 Amwiler Rd, Suite B, Peachtree Corners, G... 100 Reviews 5.0 https://www.houzz.com/trk/aHR0cHM6Ly9pbm5vdmF0...
3 Baron Construction & Remodeling Co. Saratoga & Los Angeles, California 95070, Unit... 69 Reviews 4.8 https://www.houzz.com/trk/aHR0cDovL3d3dy5iYXJv...
4 Luxe Remodel 329 N. Wetherly Dr., Suite 205, Los Angeles, C... 79 Reviews 4.9 https://www.houzz.com/professionals/general-co...
5 California Home Builders & Remodeling Inc. STUDIO CITY, California 91604, United States 232 Reviews 5.0 https://www.houzz.com/trk/aHR0cDovL3d3dy5teWNh...
6 Sneller Custom Homes and Remodeling, LLC 17018 Seven Pines Dr Ste 100, Spring, Texas 77... 77 Reviews 4.9 https://www.houzz.com/trk/aHR0cDovL3NuZWxsZXJj...
7 123 Remodeling Inc. 5070 N. Kimberly Ave Suite C, Chicago, Illinoi... 83 Reviews 4.7 https://www.houzz.com/trk/aHR0cHM6Ly8xMjNyZW1v...
8 Professional builders & Remodeling, Inc 15335 Morrison St #325, Sherman Oaks, Californ... 203 Reviews 5.0 https://www.houzz.com/trk/aHR0cDovL3d3dy5wcm9m...
9 Rudloff Custom Builders 896 Breezewood Lane, West Chester, Pennsylvani... 111 Reviews 5.0 https://www.houzz.com/trk/aHR0cDovL1J1ZGxvZmZj...
10 LAR Construction & Remodeling 6371 canby ave, Tarzana, California 91335, Uni... 191 Reviews 5.0 https://www.houzz.com/trk/aHR0cDovL3d3dy5sYXJy...
11 Erie Construction Mid West 4271 Monroe St., Toledo, Ohio 43606, United St... 231 Reviews 4.8 https://www.houzz.com/trk/aHR0cDovL3d3dy5lcmll...
12 Regal Construction & Remodeling Inc. 19537 ½ Ventura Blvd., Tarzana, California 913... 96 Reviews 4.8 https://www.houzz.com/trk/aHR0cDovL3JlZ2FscmVu...
13 Mr. & Mrs. Construction & Remodeling 2570 N 1st street, ste 212, San Jose, Californ... 75 Reviews 5.0 https://www.houzz.com/trk/aHR0cDovL3d3dy5NcmFu...
14 Bailey Remodeling and Construction LLC 201 Meridian Ave., Suite 201, Louisville, Kent... 106 Reviews 5.0 https://www.houzz.com/trk/aHR0cDovL3d3dy5iYWls...

How to apply regex to get the exact house number with approximate residual address match

import re
list = []
for element in address1:
    z = re.match(r"^\d+", element)
    if z:
        list.append(z.string)
get_best_fuzzy("SATYAGRAH;OPP. RAJ SUYA BUNGLOW", list)
I have tried the above code; it gives me an approximate address match for the addresses in my text file. How can I get an exact house-number match combined with an approximate match on the rest of the address? My addresses are in this format:
1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India
101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India
You can try this.
Code:
import re
address = ["1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
           "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
           "101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India"]
for i in address:
    z = re.match(r"^([^ ;]+)", i)
    print(z.group())
Output:
1004
101
101/32-B
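To combine the exact house-number match with an approximate match on the rest of the address, one option is to split the number off and score the remainder with difflib. This is a sketch under that assumption; get_best_fuzzy from the question is not a standard function, so SequenceMatcher stands in for the fuzzy part:
import re
from difflib import SequenceMatcher

addresses = [
    "1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
    "1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
    "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
]

def split_house_number(addr):
    # Returns (house_number, rest_of_address).
    m = re.match(r"^([^ ;]+)[ ;]*(.*)", addr)
    return m.group(1), m.group(2)

def best_match(query, candidates):
    q_no, q_rest = split_house_number(query)
    best, best_score = None, 0.0
    for cand in candidates:
        c_no, c_rest = split_house_number(cand)
        if c_no != q_no:  # the house number must match exactly
            continue
        score = SequenceMatcher(None, q_rest.lower(), c_rest.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

print(best_match("1004; Jayshiva Tower; Near Azad Soc; Ambawadi Ahmedabad Gujarat 380015 India", addresses))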

How to Nest If Statement Within For Loop When Scraping Div Class HTML

Below is a scraper that uses Beautiful Soup to scrape physician information off of this webpage. As you can see from the html code directly below, each physician has an individual profile on the webpage that displays the physician's name, clinic, profession, taxonomy, and city.
<div class="views-field views-field-title practitioner__name" >Marilyn Adams</div>
<div class="views-field views-field-field-pract-clinic practitioner__clinic" >Fortius Sport & Health</div>
<div class="views-field views-field-field-pract-profession practitioner__profession" >Physiotherapist</div>
<div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region" >Fraser River Delta</div>
<div class="views-field views-field-city practitioner__city" ></div>
As you can see from the sample html code, the physician profiles occasionally have information missing. If this occurs, I would like the scraper to print 'N/A'. I need the scraper to print 'N/A' because I would eventually like to put each div class category (name, clinic, profession, etc.) into an array where the lengths of each column are exactly the same so I can properly export the data to a CSV file. Here is an example of what I want the output to look like compared to what is actually showing up.
Actual Expected
[Names] [Names]
Greg Greg
Bob Bob
[Clinic] [Clinic]
Sport/Health Sport/Health
N/A
[Profession] [Profession]
Physical Therapist Physical Therapist
Physical Therapist Physical Therapist
[Taxonomy] [Taxonomy]
Fraser River Fraser River
N/A
[City] [City]
Vancouver Vancouver
Vancouver Vancouver
I have tried writing an if statement nested within each for loop, but the code does not seem to be looping correctly as the "N/A" only shows up once for each div class section. Does anyone know how to properly nest an if statement with a for loop so I am getting the proper amount of "N/As" in each column? Thanks in advance!
import requests
import re
from bs4 import BeautifulSoup
page=requests.get('https://sportmedbc.com/practitioners')
soup=BeautifulSoup(page.text, 'html.parser')
#Find Doctor Info
for doctor in soup.find_all('div',attrs={'class':'views-field views-field-title practitioner__name'}):
    for a in doctor.find_all('a'):
        print(a.text)
for clinic_name in soup.find_all('div',attrs={'class':'views-field views-field-field-pract-clinic practitioner__clinic'}):
    for b in clinic_name.find_all('a'):
        if b==(''):
            print('N/A')
profession_links=soup.findAll('div',attrs={'class':'views-field views-field-field-pract-profession practitioner__profession'})
for profession in profession_links:
    if profession.text==(''):
        print('N/A')
    print(profession.text)
taxonomy_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for taxonomy in taxonomy_links:
    if taxonomy.text==(''):
        print('N/A')
    print(taxonomy.text)
city_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for city in city_links:
    if city.text==(''):
        print('N/A')
    print(city.text)
For this problem you can use ChainMap from the collections module. That way you can define your default values, in this case 'n/a', and only grab the information that exists for each doctor:
from bs4 import BeautifulSoup
import requests
from collections import ChainMap
url = 'https://sportmedbc.com/practitioners'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
def get_data(soup):
    default_data = {'name': 'n/a', 'clinic': 'n/a', 'profession': 'n/a', 'region': 'n/a', 'city': 'n/a'}
    for doctor in soup.select('.view-practitioners .practitioner'):
        doctor_data = {}
        if doctor.select_one('.practitioner__name').text.strip():
            doctor_data['name'] = doctor.select_one('.practitioner__name').text
        if doctor.select_one('.practitioner__clinic').text.strip():
            doctor_data['clinic'] = doctor.select_one('.practitioner__clinic').text
        if doctor.select_one('.practitioner__profession').text.strip():
            doctor_data['profession'] = doctor.select_one('.practitioner__profession').text
        if doctor.select_one('.practitioner__region').text.strip():
            doctor_data['region'] = doctor.select_one('.practitioner__region').text
        if doctor.select_one('.practitioner__city').text.strip():
            doctor_data['city'] = doctor.select_one('.practitioner__city').text
        yield ChainMap(doctor_data, default_data)

for doctor in get_data(soup):
    print('name:\t\t', doctor['name'])
    print('clinic:\t\t', doctor['clinic'])
    print('profession:\t', doctor['profession'])
    print('city:\t\t', doctor['city'])
    print('region:\t\t', doctor['region'])
    print('-' * 80)
Prints:
name: Jaimie Ackerman
clinic: n/a
profession: n/a
city: n/a
region: n/a
--------------------------------------------------------------------------------
name: Marilyn Adams
clinic: Fortius Sport & Health
profession: Physiotherapist
city: n/a
region: Fraser River Delta
--------------------------------------------------------------------------------
name: Mahsa Ahmadi
clinic: Wellpoint Acupuncture (Sports Medicine)
profession: Acupuncturist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Tracie Albisser
clinic: Pacific Sport Northern BC, Tracie Albisser
profession: Strength and Conditioning Specialist, Exercise Physiologist
city: n/a
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Christine Alder
clinic: n/a
profession: n/a
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Steacy Alexander
clinic: Go! Physiotherapy Sports and Wellness Centre
profession: Physiotherapist
city: Vancouver
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Page Allison
clinic: AET Clinic, .
profession: Athletic Therapist
city: Victoria
region: Vancouver Island - Central Coast
--------------------------------------------------------------------------------
name: Dana Alumbaugh
clinic: n/a
profession: Podiatrist
city: Squamish
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Manouch Amel
clinic: Mountainview Kinesiology Ltd.
profession: Strength and Conditioning Specialist
city: Anmore
region: Vancouver & Sea to Sky
--------------------------------------------------------------------------------
name: Janet Ames
clinic: Dr. Janet Ames
profession: Physician
city: Prince George
region: Cariboo - North East
--------------------------------------------------------------------------------
name: Sandi Anderson
clinic: n/a
profession: n/a
city: Coquitlam
region: Fraser Valley
--------------------------------------------------------------------------------
name: Greg Anderson
clinic: University of the Fraser Valley
profession: Exercise Physiologist
city: Mission
region: Fraser Valley
--------------------------------------------------------------------------------
EDIT:
For getting the output in columns, you can use this example:
def print_data(header_text, data, key):
    print(header_text)
    for d in data:
        print(d[key])
    print()

data = list(get_data(soup))
print_data('[Names]', data, 'name')
print_data('[Clinic]', data, 'clinic')
print_data('[Profession]', data, 'profession')
print_data('[Taxonomy]', data, 'region')
print_data('[City]', data, 'city')
This prints:
[Names]
Jaimie Ackerman
Marilyn Adams
Mahsa Ahmadi
Tracie Albisser
Christine Alder
Steacy Alexander
Page Allison
Dana Alumbaugh
Manouch Amel
Janet Ames
Sandi Anderson
Greg Anderson
[Clinic]
n/a
Fortius Sport & Health
Wellpoint Acupuncture (Sports Medicine)
Pacific Sport Northern BC, Tracie Albisser
n/a
Go! Physiotherapy Sports and Wellness Centre
AET Clinic, .
n/a
Mountainview Kinesiology Ltd.
Dr. Janet Ames
n/a
University of the Fraser Valley
[Profession]
n/a
Physiotherapist
Acupuncturist
Strength and Conditioning Specialist, Exercise Physiologist
n/a
Physiotherapist
Athletic Therapist
Podiatrist
Strength and Conditioning Specialist
Physician
n/a
Exercise Physiologist
[Taxonomy]
n/a
Fraser River Delta
Vancouver & Sea to Sky
Cariboo - North East
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Vancouver Island - Central Coast
Vancouver & Sea to Sky
Vancouver & Sea to Sky
Cariboo - North East
Fraser Valley
Fraser Valley
[City]
n/a
n/a
Vancouver
n/a
Vancouver
Vancouver
Victoria
Squamish
Anmore
Prince George
Coquitlam
Mission
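As a side note on the ChainMap pattern used above: lookups fall through to the defaults only when the first mapping misses, which is why missing fields print as 'n/a'. A tiny standalone example:
from collections import ChainMap

defaults = {'name': 'n/a', 'clinic': 'n/a'}
scraped = {'name': 'Marilyn Adams'}

doctor = ChainMap(scraped, defaults)
print(doctor['name'])    # 'Marilyn Adams' -- found in the first mapping
print(doctor['clinic'])  # 'n/a' -- falls through to the defaults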
