I'm just learning web scraping and want to output the results of this website to a CSV file:
https://www.avbuyer.com/aircraft/private-jets
but I am struggling with the year, s/n and total time fields in the code below.
When I put "soup" in place of "post" it works, but not when I want to put them together.
Any help would be much appreciated.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.avbuyer.com/aircraft/private-jets'
page = requests.get(url)
page
soup = BeautifulSoup(page.text, 'lxml')
soup

df = pd.DataFrame({'Plane':[''], 'Year':[''], 'S/N':[''], 'Total Time':[''], 'Price':[''], 'Location':[''], 'Description':[''], 'Tag':[''], 'Last updated':[''], 'Link':['']})

while True:
    postings = soup.find_all('div', class_ = 'listing-item premium')
    for post in postings:
        try:
            link = post.find('a', class_ = 'more-info').get('href')
            link_full = 'https://www.avbuyer.com'+ link
            plane = post.find('h2', class_ = 'item-title').text
            price = post.find('div', class_ = 'price').text
            location = post.find('div', class_ = 'list-item-location').text
            year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[2]
            year.find_all('li')[0].text
            sn = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            sn.find('li')[1].text
            time = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            time.find('li')[2].text
            desc = post.find('div', classs_ = 'list-item-para').text
            tag = post.find('div', class_ = 'list-viewing-date').text
            updated = post.find('div', class_ = 'list-update').text
            df = df.append({'Plane':plane, 'Year':year, 'S/N':sn, 'Total Time':time, 'Price':price, 'Location':location,
                            'Description':desc, 'Tag':tag, 'Last updated':updated, 'Link':link_full}, ignore_index = True)
        except:
            pass
    next_page = soup.find('a', {'rel':'next'}).get('href')
    next_page_full = 'https://www.avbuyer.com'+next_page
    next_page_full
    url = next_page_full
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
Try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets')
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_ = 'listing-item premium')

temp = []
for post in postings:
    link = post.find('a', class_ = 'more-info').get('href')
    link_full = 'https://www.avbuyer.com' + link
    plane = post.find('h2', class_ = 'item-title').text
    price = post.find('div', class_ = 'price').text
    location = post.find('div', class_ = 'list-item-location').text
    t = post.find_all('div', class_='list-other-dtl')
    for i in t:
        data = [tup.text for tup in i.find_all('li')]
        years = data[0]
        s = data[1]
        total_time = data[2]
        temp.append([plane, price, location, link_full, years, s, total_time])

df = pd.DataFrame(temp, columns=["plane", "price", "location", "link", "Years", "S/N", "Totaltime"])
print(df)
Output:
plane price location link Years S/N Totaltime
0 Dassault Falcon 2000LXS Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2021 S/N 377 Total Time 33
1 Cirrus Vision SF50 G1 Please call North America + Canada, United States - WI, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2018 S/N 0080 Total Time 615
2 Gulfstream IV Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1990 S/N 1148 Total Time 6425
4 Boeing 787-8 Make offer Europe, Monaco, For Sale by Global Jet Monaco https://www.avbuyer.com/aircraft/private-jets/... Year 2010 S/N - Total Time 1
5 Hawker 4000 Make offer South America, Puerto Rico, For Sale by JetHQ https://www.avbuyer.com/aircraft/private-jets/... Year 2009 S/N RC-24 Total Time 2120
6 Embraer Legacy 500 Make offer North America + Canada, United States - NE, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2015 S/N 55000016 Total Time 2607
7 Dassault Falcon 2000LXS Make offer North America + Canada, United States - DE, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2015 S/N 300 Total Time 1909
8 Dassault Falcon 50EX Please call North America + Canada, United States - TX, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2002 S/N 320 Total Time 7091.9
9 Dassault Falcon 2000 Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2001 S/N 146 Total Time 6760
10 Bombardier Learjet 75 Make offer Europe, Switzerland, For Sale by Jetcraft https://www.avbuyer.com/aircraft/private-jets/... Year 2014 S/N 45-491 Total Time 1611
11 Hawker 800B Please call Europe, United Kingdom - England, For Sale by ... https://www.avbuyer.com/aircraft/private-jets/... Year 1985 S/N 258037 Total Time 9621
13 BAe Avro RJ100 Please call North America + Canada, United States - MT, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1996 S/N E3282 Total Time 45996
14 Embraer Legacy 600 Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2007 S/N 14501014 Total Time 4328
15 Bombardier Challenger 850 Make offer North America + Canada, United States - AZ, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2003 S/N 7755 Total Time 12114.1
16 Gulfstream G650 Please call Europe, Switzerland, For Sale by Jetcraft https://www.avbuyer.com/aircraft/private-jets/... Year 2013 S/N 6047 Total Time 2178
17 Bombardier Learjet 55 Price: USD $995,000 North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1982 S/N 020 Total Time 13448
18 Dassault Falcon 8X Please call North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2016 S/N 406 Total Time 1627
19 Hawker 800XP Price: USD $1,595,000 North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2002 S/N 258578 Total Time 10169
Right now, your try/except clauses are not allowing you to see and debug the errors in your script. If you remove them, you will see:
IndexError: list index out of range on the line that assigns year. There are only two elements inside the list, and you are looking for the third one. Therefore, your line should be:
year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[1]
KeyError: 2 on the line that assigns sn. You are using find(), which returns a <class 'bs4.element.Tag'> object, not a list. Here you want to use find_all(), as you did for year. The same happens on the line that assigns time.
However, instead of writing this expression three times, you should store the result in a variable once and reuse it.
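A sketch of what that could look like inside your loop (using the corrected index [1] from above; year, sn and time here are just the raw li texts):
details = post.find_all('ul', class_='fa-no-bullet clearfix')[1]
items = details.find_all('li')
year = items[0].text   # first li
sn = items[1].text     # second li
time = items[2].text   # third li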
AttributeError: 'NoneType' object has no attribute 'text' on the line that assigns desc. There is a typo: you wrote classs_ instead of class_.
AttributeError: 'NoneType' object has no attribute 'text' on the line that assigns tag. There is nothing wrong with your code here; some entries on the webpage simply don't have this element. You should check whether find() actually returned a result:
tag = post.find('div', class_ = 'list-viewing-date')
if tag:
    tag = tag.text
else:
    tag = None
You don't have a way out of your while loop. You should check whether the script can still find a next_page link and break out of the loop when it cannot.
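A minimal sketch of the last few lines inside the while loop (assuming the same requests/BeautifulSoup setup as above):
next_page_tag = soup.find('a', {'rel': 'next'})
if next_page_tag is None:   # no "next" link means the last page was reached
    break
url = 'https://www.avbuyer.com' + next_page_tag.get('href')
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')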
After changing all this, it worked for me to scrape the first page. I used:
Python 3.9.7
bs4 4.10.0
It is very important that you state what versions of Python and the libraries you are using.
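For reference, a quick way to print them (assuming a standard bs4 install):
import sys
import bs4
print(sys.version)        # Python version
print(bs4.__version__)    # bs4 / BeautifulSoup version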
Cheers!
I am trying to scrape multiple pages but they give me nothing. Kindly help me resolve this issue.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}

for page in range(1, 2):
    response = requests.get("https://www.avbuyer.com/aircraft/private-jets={page}".format(
            page=page
        ),
        headers=headers,
    )
    soup = BeautifulSoup(response.content, 'html.parser')
    postings = soup.find_all('div', class_ = 'listing-item premium')
    for post in postings:
        link = post.find('a', class_ = 'more-info').get('href')
        link_full = 'https://www.avbuyer.com' + link
        plane = post.find('h2', class_ = 'item-title').text
        price = post.find('div', class_ = 'price').text
        location = post.find('div', class_ = 'list-item-location').text
        print(location)
The problem was in your URL: "private-jets={page}" should be "private-jets/page-{page}". Now your code is working fine.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}

for page in range(1, 2):
    response = requests.get("https://www.avbuyer.com/aircraft/private-jets/page-{page}".format(page=page), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    postings = soup.find_all('div', class_='listing-item premium')
    for post in postings:
        link = post.find('a', class_='more-info').get('href')
        link_full = 'https://www.avbuyer.com' + link
        plane = post.find('h2', class_='item-title').text
        price = post.find('div', class_='price').text
        location = post.find('div', class_='list-item-location').text
        print(location)
Output:
North America + Canada, United States - MD, For Sale by Avpro Inc.
North America + Canada, United States - WI, For Sale by Lone Mountain Aircraft Sales
North America + Canada, United States - MD, For Sale by Avpro Inc.
North America + Canada, United States - MD, For Sale by Avpro Inc.
Europe, Monaco, For Sale by Global Jet Monaco
South America, Puerto Rico, For Sale by JetHQ
North America + Canada, United States - NE, For Sale by Duncan Aviation
North America + Canada, United States - DE, For Sale by Leading Edge Aviation Solutions
North America + Canada, United States - TX, For Sale by Par Avion Ltd.
North America + Canada, United States - MD, For Sale by Avpro Inc.
Europe, Switzerland, For Sale by Jetcraft
Europe, United Kingdom - England, For Sale by Jets4UDirect Ltd
North America + Canada, United States - MD, For Sale by Avpro Inc.
North America + Canada, United States - MT, For Sale by SkyWorld Aviation
North America + Canada, United States - MD, For Sale by Avpro Inc.
North America + Canada, United States - AZ, For Sale by Hatt & Associates
Europe, Switzerland, For Sale by Jetcraft
North America + Canada, United States - MD, For Sale by Avpro Inc.
North America + Canada, United States - MD, For Sale by Avpro Inc.
North America + Canada, United States - MD, For Sale by Avpro Inc.
I have a dataframe df such that:
df['user_location'].value_counts()
India 3741
United States 2455
New Delhi, India 1721
Mumbai, India 1401
Washington, DC 1354
...
SpaceCoast,Florida 1
stuck in a book. 1
Beirut , Lebanon 1
Royston Vasey - Tralfamadore 1
Langham, Colchester 1
Name: user_location, Length: 26920, dtype: int64
I want to know the frequency of specific countries like USA, India from the user_location column. Then I want to plot the frequencies as USA, India, and Others.
So, I want to apply some operation on that column such that the value_counts() will give the output as:
India (sum of all frequencies of all the locations in India including cities, states, etc.)
USA (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others (sum of all frequencies of the other locations)
It seems I should merge the frequencies of rows containing the same country names and lump the rest together, but it gets complex when handling the names of cities, states, etc. What is the most efficient way to do it?
Adding to @Trenton_McKinney's answer in the comments, if you need to map different countries' states/provinces to the country name, you will have to do a little work to make those associations. For example, for India and the USA, you can grab a list of their states from Wikipedia and map them onto your own data to relabel them with their respective country names, as follows:
import pandas as pd

# Get states of India and USA
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states

# Make a sample dataframe
df = pd.DataFrame({'Country': states})
Country
0 Andhra Pradesh
1 Arunachal Pradesh
2 Assam
3 Bihar
4 Chhattisgarh
... ...
73 Virginia[E]
74 Washington
75 West Virginia
76 Wisconsin
77 Wyoming
Map state names to country names:
# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)
Country
0 India
1 India
2 India
3 India
4 India
... ...
73 USA
74 USA
75 USA
76 USA
77 USA
But from your data sample it looks like you will have a lot of edge cases to deal with as well.
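To then collapse the question's user_location column into just India / USA / Others, one rough option is below. This is a hedged sketch, assuming the states_dict built above and the question's dataframe: .map() only catches exact state-name matches, so cities and free-form strings like "New Delhi, India" still fall into 'Others' and need the extra edge-case handling mentioned above.
# Exact state names map to 'India' / 'USA'; everything else becomes 'Others'
country_counts = (
    df['user_location']
      .map(states_dict)     # exact matches only
      .fillna('Others')     # cities, free text, etc. still need extra handling
      .value_counts()
)
print(country_counts)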
Using the concept of the previous answer, I first tried to get all the locations, including cities, unions, states, districts and territories. Then I made a function checkl() that checks whether a location is in India or the USA and converts it to its country name. Finally, the function is applied to the dataframe column df['user_location']:
# Trying to get all the locations of USA and India
import pandas as pd
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
us_cities = pd.read_html(us_url)[0].iloc[:, 1].tolist() + pd.read_html(us_url)[0].iloc[:, 2].tolist() + pd.read_html(us_url)[0].iloc[:, 3].tolist()
us_Federal_district = pd.read_html(us_url)[1].iloc[:, 0].tolist()
us_Inhabited_territories = pd.read_html(us_url)[2].iloc[:, 0].tolist()
us_Uninhabited_territories = pd.read_html(us_url)[3].iloc[:, 0].tolist()
us_Disputed_territories = pd.read_html(us_url)[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist() + pd.read_html(in_url)[3].iloc[:, 4].tolist() + pd.read_html(in_url)[3].iloc[:, 5].tolist()
in_unions = pd.read_html(in_url)[4].iloc[:, 0].tolist()
ind = in_states + in_unions
usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind])
# Country name checker function
def checkl(T):
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if(ele in T)]
    res_us = [ele for ele in us if(ele in T)]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind) == True:
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us) == True:
        T = 'USA'
    elif len(T.split(',')) > 1:
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr:
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr:
            T = 'USA'
        else:
            T = "Others"
    else:
        T = "Others"
    return T
# Appling the function on the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others 74206
USA 47840
India 20291
Name: user_location, dtype: int64
I am quite new to Python coding. I think this code can be written in a better and more compact form, and as mentioned in the previous answer, there are still a lot of edge cases to deal with. So I have added it to
Code Review Stack Exchange too. Any criticisms and suggestions to improve the efficiency and readability of my code would be greatly appreciated.
I'm trying to scrape data from Hansard, the official verbatim record of everything spoken in the UK House of Parliament. This is the precise link I'm trying to scrape: in a nutshell, I want to scrape every "mention" container on this page and the following 50 pages after that.
But I find that when my scraper is "finished," it's only collected data on 990 containers and not the full 1010. Data on 20 containers is missing, as if it's skipping a page. When I only set the page range to (0,1), it fails to collect any values. When I set it to (0,2), it collects only the first page's values. Asking it to collect data on 52 pages does not help. I thought that this was perhaps due to the fact that I wasn't giving the URLs enough time to load, so I added some delays in the scraper's crawl. That didn't solve anything.
Can anyone provide me with any insight into what I may be missing? I'd like to make sure that my scraper is collecting all available data.
# imports and result lists assumed by the snippet below (not shown in the original post)
import time
from random import randint

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from requests import get

topics, houses, names, dates = [], [], [], []

pages = np.arange(0, 52)

for page in pages:
    hansard_url = "https://hansard.parliament.uk/search/Contributions?searchTerm=%22civilian%20casualties%22&startDate=01%2F01%2F1988%2000%3A00%3A00&endDate=07%2F14%2F2020%2000%3A00%3A00"
    full_url = hansard_url + "&page=" + str(page) + "&partial=true"
    page = get(full_url)
    html_soup = BeautifulSoup(page.text, 'html.parser')
    mention_containers = html_soup.find_all('div', class_="result contribution")
    time.sleep(randint(2, 10))

    for mention in mention_containers:
        topic = mention.div.span.text
        topics.append(topic)
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            houses.append("House of Lords")
        elif house == "Commons Portcullis":
            houses.append("House of Commons")
        else:
            houses.append("N/A")
        name = mention.find('div', class_="secondaryTitle").text
        names.append(name)
        date = mention.find('div', class_="").text
        dates.append(date)

    time.sleep(randint(2, 10))

hansard_dataset = pd.DataFrame(
    {'Date': dates, 'House': houses, 'Speaker': names, 'Topic': topics})

print(hansard_dataset.info())
print(hansard_dataset.isnull().sum())

hansard_dataset.to_csv('hansard.csv', index=False, sep="#")
Any help in solving this problem is appreciated.
The server returns an empty container on page 48, so the total is 1000 results across pages 1 to 51 (inclusive):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://hansard.parliament.uk/search/Contributions'
params = {
    'searchTerm': 'civilian casualties',
    'startDate': '01/01/1988 00:00:00',
    'endDate': '07/14/2020 00:00:00',
    'partial': 'True',
    'page': 1,
}

all_data = []
for page in range(1, 52):
    params['page'] = page
    print('Page {}...'.format(page))

    soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')
    mention_containers = soup.find_all('div', class_="result contribution")

    if not mention_containers:
        print('Empty container!')

    for mention in mention_containers:
        topic = mention.div.span.text
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            house = "House of Lords"
        elif house == "Commons Portcullis":
            house = "House of Commons"
        else:
            house = "N/A"
        name = mention.find('div', class_="secondaryTitle").text
        date = mention.find('div', class_="").get_text(strip=True)
        all_data.append({'Date': date, 'House': house, 'Speaker': name, 'Topic': topic})

df = pd.DataFrame(all_data)
print(df)
Prints:
...
Page 41...
Page 42...
Page 43...
Page 44...
Page 45...
Page 46...
Page 47...
Page 48...
Empty container! # <--- here is the server error
Page 49...
Page 50...
Page 51...
Date House Speaker Topic
0 14 July 2014 House of Lords Baroness Warsi Gaza debate in Lords Chamber
1 3 March 2016 House of Lords Lord Touhig Armed Forces Bill debate in Grand Committee
2 2 December 2015 House of Commons Mr David Cameron ISIL in Syria debate in Commons Chamber
3 3 March 2016 House of Lords Armed Forces Bill debate in Grand Committee
4 27 April 2016 House of Lords Armed Forces Bill debate in Lords Chamber
.. ... ... ... ...
995 18 June 2003 House of Lords Lord Craig of Radley Defence Policy debate in Lords Chamber
996 7 September 2004 House of Lords Lord Rea Iraq debate in Lords Chamber
997 14 February 1994 House of Lords The Parliamentary Under-Secretary of State, Mi... Landmines debate in Lords Chamber
998 12 January 2000 House of Commons The Minister of State, Foreign and Commonwealt... Serbia And Kosovo debate in Westminster Hall
999 26 February 2003 House of Lords Lord Rea Iraq debate in Lords Chamber
[1000 rows x 4 columns]
I'm trying to read in a table using read_html
import requests
import pandas as pd
import numpy as np
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
resp = requests.get(url)
tables = pd.read_html(resp.text)
But I get this error
IndexError: list index out of range
Other Wiki pages work fine. What's up with this page and how do I solve the above error?
It seems the table can't be read because of the jQuery tablesorter.
It's easy to read tables into a df with the selenium library when you're dealing with jQuery instead of plain HTML. You'll still need to do some cleanup, but this will get the table into a df.
You'll need to install the selenium library and download a web browser driver too.
from selenium import webdriver
import pandas as pd

driver_path = r'C:\chromedriver_win32\chromedriver.exe'
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'

driver = webdriver.Chrome(driver_path)
driver.get(url)
the_table = driver.find_element_by_xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr/td[2]/table')
data = the_table.text
df = pd.DataFrame([x.split() for x in data.split('\n')])
driver.close()
print(df)
0 1 2 3 4 5 \
0 Country (or dependent territory, None None
1 subnational area, etc.) Region Subregion Rate
2 listed Source None None None None
3 None None None None None None
4 Burundi Africa Eastern Africa 6.02 635
5 Comoros Africa Eastern Africa 7.70 60
6 Djibouti Africa Eastern Africa 6.48 60
7 Eritrea Africa Eastern Africa 8.04 390
8 Ethiopia Africa Eastern Africa 7.56 7,552
9 Kenya Africa Eastern Africa 5.00 2,466
10 Madagascar Africa Eastern Africa 7.69 1,863
I am in the process of scraping a website and it pulls the contents of the page, but there are categories that are technically empty, yet their headers still show. I would like to only see categories with events in them. Ideally I could even have the components of each transaction so I can choose which elements I want displayed.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping NH Dept of Banking...')
print()

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')

NHlist = []
for events in NHsoup.findAll('tr')[2:]:
    print(events.text)
    NHlist.append(events.text)

print(' '.join(NHlist))
Like I said, this works to get all of the information, but a lot of headers and empty space get pulled that don't need to be. For example, at the time I'm writing this the 'acquisitions', 'conversions', and 'change in control' categories are empty, but the headers still come in and there's a relatively large blank space after them. I feel like I need some sort of loop to go through each header ('td') and then get its contents ('tr'), but I'm just not quite sure how to do it.
You can use itertools.groupby to group elements and then filter out empty rows:
import requests
from itertools import groupby
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping NH Dept of Banking...')
print()

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')

NHlist = []
# the lambda keeps a counter that increments on every header (th) row,
# so each group is a header row plus the data rows that follow it
for _, g in groupby(NHsoup.select('tr'), lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.select('th') else (None, d['g'])):
    s = [tag.get_text(strip=True, separator=' ') for tag in g]
    if any(i == '' for i in s):
        continue
    NHlist.append(s)

# This is just pretty printing, all the data are already in NHlist:
l = max(map(len, (j for i in NHlist for j in i))) + 5
for item in NHlist:
    print('{: <4} {}'.format(' ', item[0]))
    print('-' * l)
    for i, ev in enumerate(item[1:], 1):
        print('{: <4} {}'.format(i, ev))
    print()
Prints:
Scraping NH Dept of Banking...
New Bank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/11/18 The Millyard Bank
Interstate Bank Combination
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/16/19 Optima Bank & Trust Company with and into Cambridge Trust Company Portsmouth, NH 03/29/19
Amendment to Articles of Agreement or Incorporation; Business or Capital Plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
2 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
3 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
4 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
5 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
6 5/10/19 AB Trust Company New York, NY 06/04/19
Reduction in Capital
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 03/07/19 Primary Bank Bedford, NH 04/10/19
Amendment to Bylaws
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
2 2/25/19 Members First Credit Union Manchester, NH 04/05/19
3 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
4 6/28/19 Bellwether Community Credit Union
Interstate Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
2 03/08/19 One Credit Union Newport, NH 03/29/19
3 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
4 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
5 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
6 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
7 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
Interstate Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
New Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
2 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
3 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
4 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
New Loan Production Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH
03766-1430 04/15/19
Loan Production Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
Trade Name Requests
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Management Trust Company" 04/24/19
New Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/19/19 Janney Trust Co., LLC
2 02/25/19 Darwin Trust Company of New Hampshire, LLC
3 07/15/`9 Harbor Trust Company
Dissolution of Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
Trust Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/10/19 Charter Trust Company Rochester, NH 05/20/19
New Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 06824 03/22/19
Relocation of Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Floor, Boston MA To: 100 Summer Street, 12th Flr, Boston, MA 02/01/19
2 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New York, NY 10017 To: 410 Park Avenue, Suite 900 New York, NY 10022 03/29/19
3 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY 10022 To: One Grand Central Place 60 East 42nd Street, Ste 1550 New York, NY 10165 04/23/19
You could test which rows contain all '\xa0' (i.e. appear blank) and exclude them. I append to a list and convert to a pandas dataframe, but you could just print the rows directly.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://www.nh.gov/banking/corporate-activities/index.htm')
soup = bs(r.content, 'lxml')

results = []
for tr in soup.select('tr'):
    row = [i.text for i in tr.select('th,td')]
    if row.count('\xa0') != len(row):
        results.append(row)

pd.set_option('display.width', 100)
df = pd.DataFrame(results)
df.style.set_properties(**{'text-align': 'left'})
df.columns = df.iloc[0]
df = df[1:]
df.fillna(value='', inplace=True)
print(df.head(20))
Not sure if this is exactly how you want it, and there is probably a more elegant way, but what I basically did was:
use pandas to get the table;
pandas automatically assigns the first row as the columns, so move that column row back in as the first data row;
find the rows that are all nulls;
drop the all-null rows and the row before each one (its sub-header).
import pandas as pd
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]
top_row = pd.DataFrame([df.columns], index=[-1])
df.columns = top_row.columns
df = df.append(top_row, sort=True).sort_index().reset_index(drop=True)
null_rows = df[df.isnull().values.all(axis=1)].index.tolist()
drop_hdr_rows = [x - 1 for x in null_rows ]
drop_rows = drop_hdr_rows + null_rows
new_df = df[~df.index.isin(drop_rows)]
Output:
print (new_df.to_string())
0 1 2 3
2 New Bank New Bank New Bank New Bank
3 12/11/18 The Millyard Bank NaN NaN
4 Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination
5 01/16/19 Optima Bank & Trust Company with and into Camb... Portsmouth, NH 03/29/19
12 Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor...
13 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
14 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
15 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
16 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
17 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
18 5/10/19 AB Trust Company New York, NY 06/04/19
19 Reduction in Capital Reduction in Capital Reduction in Capital Reduction in Capital
20 03/07/19 Primary Bank Bedford, NH 04/10/19
21 Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws
22 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
23 2/25/19 Members First Credit Union Manchester, NH 04/05/19
24 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
25 6/28/19 Bellwether Community Credit Union NaN NaN
26 Interstate Branch Office Interstate Branch Office Interstate Branch Office Interstate Branch Office
27 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
28 03/08/19 One Credit Union Newport, NH 03/29/19
29 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
30 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
31 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
32 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
33 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
34 Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure
35 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
36 New Branch Office New Branch Office New Branch Office New Branch Office
37 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
38 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
39 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
40 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
41 Branch Office Closure Branch Office Closure Branch Office Closure Branch Office Closure
42 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
43 New Loan Production Office New Loan Production Office New Loan Production Office New Loan Production Office
44 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH 03766-1430 04/15/19
45 Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure
46 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
51 Trade Name Requests Trade Name Requests Trade Name Requests Trade Name Requests
52 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Managem... 04/24/19
53 New Trust Company New Trust Company New Trust Company New Trust Company
54 02/19/19 Janney Trust Co., LLC NaN NaN
55 02/25/19 Darwin Trust Company of New Hampshire, LLC NaN NaN
56 07/15/`9 Harbor Trust Company NaN NaN
57 Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company
58 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
59 Trust Office Closure Trust Office Closure Trust Office Closure Trust Office Closure
60 5/10/19 Charter Trust Company Rochester, NH 05/20/19
61 New Trust Office New Trust Office New Trust Office New Trust Office
62 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 0... 03/22/19
63 Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office
64 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Fl... 02/01/19
65 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New Y... 03/29/19
66 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY ... 04/23/19