Missing data not being scraped from Hansard - python

I'm trying to scrape data from Hansard, the official verbatim record of everything spoken in the UK Parliament. The precise link I'm trying to scrape is the search URL shown in the code below: in a nutshell, I want to scrape every "mention" container on that page and the 50 pages that follow it.
But I find that when my scraper is "finished," it's only collected data on 990 containers and not the full 1010. Data on 20 containers is missing, as if it's skipping a page. When I only set the page range to (0,1), it fails to collect any values. When I set it to (0,2), it collects only the first page's values. Asking it to collect data on 52 pages does not help. I thought that this was perhaps due to the fact that I wasn't giving the URLs enough time to load, so I added some delays in the scraper's crawl. That didn't solve anything.
Can anyone provide me with any insight into what I may be missing? I'd like to make sure that my scraper is collecting all available data.
import time
from random import randint

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from requests import get

topics = []
houses = []
names = []
dates = []

pages = np.arange(0, 52)

for page in pages:
    hansard_url = "https://hansard.parliament.uk/search/Contributions?searchTerm=%22civilian%20casualties%22&startDate=01%2F01%2F1988%2000%3A00%3A00&endDate=07%2F14%2F2020%2000%3A00%3A00"
    full_url = hansard_url + "&page=" + str(page) + "&partial=true"
    page = get(full_url)
    html_soup = BeautifulSoup(page.text, 'html.parser')
    mention_containers = html_soup.find_all('div', class_="result contribution")
    time.sleep(randint(2, 10))

    for mention in mention_containers:
        topic = mention.div.span.text
        topics.append(topic)

        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            houses.append("House of Lords")
        elif house == "Commons Portcullis":
            houses.append("House of Commons")
        else:
            houses.append("N/A")

        name = mention.find('div', class_="secondaryTitle").text
        names.append(name)

        date = mention.find('div', class_="").text
        dates.append(date)

    time.sleep(randint(2, 10))

hansard_dataset = pd.DataFrame(
    {'Date': dates, 'House': houses, 'Speaker': names, 'Topic': topics}
)

print(hansard_dataset.info())
print(hansard_dataset.isnull().sum())

hansard_dataset.to_csv('hansard.csv', index=False, sep="#")
Any help in solving this problem is appreciated.

The server returns an empty container on page 48, so the total is 1000 results from pages 1 to 51 (inclusive):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://hansard.parliament.uk/search/Contributions'
params = {
    'searchTerm': 'civilian casualties',
    'startDate': '01/01/1988 00:00:00',
    'endDate': '07/14/2020 00:00:00',
    'partial': 'True',
    'page': 1,
}

all_data = []
for page in range(1, 52):
    params['page'] = page
    print('Page {}...'.format(page))

    soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')
    mention_containers = soup.find_all('div', class_="result contribution")

    if not mention_containers:
        print('Empty container!')

    for mention in mention_containers:
        topic = mention.div.span.text
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            house = "House of Lords"
        elif house == "Commons Portcullis":
            house = "House of Commons"
        else:
            house = "N/A"

        name = mention.find('div', class_="secondaryTitle").text
        date = mention.find('div', class_="").get_text(strip=True)

        all_data.append({'Date': date, 'House': house, 'Speaker': name, 'Topic': topic})

df = pd.DataFrame(all_data)
print(df)
Prints:
...
Page 41...
Page 42...
Page 43...
Page 44...
Page 45...
Page 46...
Page 47...
Page 48...
Empty container! # <--- here is the server error
Page 49...
Page 50...
Page 51...
Date House Speaker Topic
0 14 July 2014 House of Lords Baroness Warsi Gaza debate in Lords Chamber
1 3 March 2016 House of Lords Lord Touhig Armed Forces Bill debate in Grand Committee
2 2 December 2015 House of Commons Mr David Cameron ISIL in Syria debate in Commons Chamber
3 3 March 2016 House of Lords Armed Forces Bill debate in Grand Committee
4 27 April 2016 House of Lords Armed Forces Bill debate in Lords Chamber
.. ... ... ... ...
995 18 June 2003 House of Lords Lord Craig of Radley Defence Policy debate in Lords Chamber
996 7 September 2004 House of Lords Lord Rea Iraq debate in Lords Chamber
997 14 February 1994 House of Lords The Parliamentary Under-Secretary of State, Mi... Landmines debate in Lords Chamber
998 12 January 2000 House of Commons The Minister of State, Foreign and Commonwealt... Serbia And Kosovo debate in Westminster Hall
999 26 February 2003 House of Lords Lord Rea Iraq debate in Lords Chamber
[1000 rows x 4 columns]
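If the empty page is only a transient server error, a retry with a short delay may recover it. The sketch below is my addition to the answer above (the max_retries and delay values are arbitrary); if page 48 still comes back empty after the retries, then 1000 rows is simply all the server returns for this query.

import time

import requests
from bs4 import BeautifulSoup


def fetch_mentions(url, params, max_retries=3, delay=5):
    # Re-request a page a few times in case the empty response is transient.
    for attempt in range(1, max_retries + 1):
        resp = requests.get(url, params=params)
        soup = BeautifulSoup(resp.content, 'html.parser')
        containers = soup.find_all('div', class_="result contribution")
        if containers:
            return containers
        print('Page {} empty (attempt {}), retrying...'.format(params['page'], attempt))
        time.sleep(delay)
    return []  # still empty after retrying; treat it as a genuine gap in the results

In the loop above you would then call mention_containers = fetch_mentions(url, params) instead of requesting the page directly.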

Related

Web Scraping & BeautifulSoup <li> parsing

I'm just learning web scraping and want to output the result of this website to a csv file:
https://www.avbuyer.com/aircraft/private-jets
I am struggling with the year, S/N and total time fields in the code below. When I put "soup" in place of "post" it works, but not when I put them together.
Any help would be much appreciated.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.avbuyer.com/aircraft/private-jets'
page = requests.get(url)
page
soup = BeautifulSoup(page.text, 'lxml')
soup

df = pd.DataFrame({'Plane':[''], 'Year':[''], 'S/N':[''], 'Total Time':[''], 'Price':[''], 'Location':[''], 'Description':[''], 'Tag':[''], 'Last updated':[''], 'Link':['']})

while True:
    postings = soup.find_all('div', class_ = 'listing-item premium')
    for post in postings:
        try:
            link = post.find('a', class_ = 'more-info').get('href')
            link_full = 'https://www.avbuyer.com' + link
            plane = post.find('h2', class_ = 'item-title').text
            price = post.find('div', class_ = 'price').text
            location = post.find('div', class_ = 'list-item-location').text
            year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[2]
            year.find_all('li')[0].text
            sn = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            sn.find('li')[1].text
            time = post.find('ul', class_ = 'fa-no-bullet clearfix')[2]
            time.find('li')[2].text
            desc = post.find('div', classs_ = 'list-item-para').text
            tag = post.find('div', class_ = 'list-viewing-date').text
            updated = post.find('div', class_ = 'list-update').text
            df = df.append({'Plane':plane, 'Year':year, 'S/N':sn, 'Total Time':time, 'Price':price, 'Location':location,
                            'Description':desc, 'Tag':tag, 'Last updated':updated, 'Link':link_full}, ignore_index = True)
        except:
            pass
    next_page = soup.find('a', {'rel':'next'}).get('href')
    next_page_full = 'https://www.avbuyer.com' + next_page
    next_page_full
    url = next_page_full
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
Try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets')
soup = BeautifulSoup(response.content, 'html.parser')

postings = soup.find_all('div', class_ = 'listing-item premium')
temp = []
for post in postings:
    link = post.find('a', class_ = 'more-info').get('href')
    link_full = 'https://www.avbuyer.com' + link
    plane = post.find('h2', class_ = 'item-title').text
    price = post.find('div', class_ = 'price').text
    location = post.find('div', class_ = 'list-item-location').text
    t = post.find_all('div', class_='list-other-dtl')
    for i in t:
        data = [tup.text for tup in i.find_all('li')]
        years = data[0]
        s = data[1]
        total_time = data[2]
        temp.append([plane, price, location, link_full, years, s, total_time])

df = pd.DataFrame(temp, columns=["plane", "price", "location", "link", "Years", "S/N", "Totaltime"])
print(df)
output:
plane price location link Years S/N Totaltime
0 Dassault Falcon 2000LXS Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2021 S/N 377 Total Time 33
1 Cirrus Vision SF50 G1 Please call North America + Canada, United States - WI, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2018 S/N 0080 Total Time 615
2 Gulfstream IV Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1990 S/N 1148 Total Time 6425
4 Boeing 787-8 Make offer Europe, Monaco, For Sale by Global Jet Monaco https://www.avbuyer.com/aircraft/private-jets/... Year 2010 S/N - Total Time 1
5 Hawker 4000 Make offer South America, Puerto Rico, For Sale by JetHQ https://www.avbuyer.com/aircraft/private-jets/... Year 2009 S/N RC-24 Total Time 2120
6 Embraer Legacy 500 Make offer North America + Canada, United States - NE, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2015 S/N 55000016 Total Time 2607
7 Dassault Falcon 2000LXS Make offer North America + Canada, United States - DE, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2015 S/N 300 Total Time 1909
8 Dassault Falcon 50EX Please call North America + Canada, United States - TX, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2002 S/N 320 Total Time 7091.9
9 Dassault Falcon 2000 Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2001 S/N 146 Total Time 6760
10 Bombardier Learjet 75 Make offer Europe, Switzerland, For Sale by Jetcraft https://www.avbuyer.com/aircraft/private-jets/... Year 2014 S/N 45-491 Total Time 1611
11 Hawker 800B Please call Europe, United Kingdom - England, For Sale by ... https://www.avbuyer.com/aircraft/private-jets/... Year 1985 S/N 258037 Total Time 9621
13 BAe Avro RJ100 Please call North America + Canada, United States - MT, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1996 S/N E3282 Total Time 45996
14 Embraer Legacy 600 Make offer North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2007 S/N 14501014 Total Time 4328
15 Bombardier Challenger 850 Make offer North America + Canada, United States - AZ, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2003 S/N 7755 Total Time 12114.1
16 Gulfstream G650 Please call Europe, Switzerland, For Sale by Jetcraft https://www.avbuyer.com/aircraft/private-jets/... Year 2013 S/N 6047 Total Time 2178
17 Bombardier Learjet 55 Price: USD $995,000 North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 1982 S/N 020 Total Time 13448
18 Dassault Falcon 8X Please call North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2016 S/N 406 Total Time 1627
19 Hawker 800XP Price: USD $1,595,000 North America + Canada, United States - MD, Fo... https://www.avbuyer.com/aircraft/private-jets/... Year 2002 S/N 258578 Total Time 10169
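If you want the bare values rather than the labelled strings shown above ("Year 2021", "S/N 377", "Total Time 33"), a small post-processing step (my addition, applied to the df built above) can strip the label prefixes:

# Strip the "Year ", "S/N " and "Total Time " labels that are part of the scraped text.
df['Years'] = df['Years'].str.replace('Year', '', regex=False).str.strip()
df['S/N'] = df['S/N'].str.replace('S/N', '', regex=False).str.strip()
df['Totaltime'] = df['Totaltime'].str.replace('Total Time', '', regex=False).str.strip()
print(df.head())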
Right now, your try-except clauses are not allowing you to see and debug the errors in your script. If you remove them, you will see:
IndexError: list index out of range in line 24. There are only two elements in that list, and index [2] asks for a third. Therefore, your line should be:
year = post.find_all('ul', class_ = 'fa-no-bullet clearfix')[1]
KeyError: 2 in line 26. You are using find(), which returns a <class 'bs4.element.Tag'> object, not a list. Here you want to use find_all() as you did in line 24. Same happens for line 28.
However, instead of using this expression three times, you should rather store the result in a variable and use it later.
AttributeError: 'NoneType' object has no attribute 'text' in line 31. There is a typo: you wrote classs_ instead of class_.
AttributeError: 'NoneType' object has no attribute 'text' in line 32. There is nothing wrong with your code here; some entries on the webpage simply don't have this element. You should check whether the find method gave you any result:
tag = post.find('div', class_ = 'list-viewing-date')
if tag:
    tag = tag.text
else:
    tag = None
You don't have a way out of your while loop. You should detect when the script cannot find a new next_page and add a break, as in the sketch below.
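A minimal sketch of that pagination loop with an exit condition could look like this (my addition; it reuses the 'rel': 'next' link and the class names from the question's code, which may need adjusting if the site changes):

import requests
from bs4 import BeautifulSoup

url = 'https://www.avbuyer.com/aircraft/private-jets'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

while True:
    postings = soup.find_all('div', class_='listing-item premium')
    # ... extract the fields from each post here ...

    next_link = soup.find('a', {'rel': 'next'})
    if next_link is None or not next_link.get('href'):
        break  # no "next" link on the last page, so stop instead of looping forever
    page = requests.get('https://www.avbuyer.com' + next_link.get('href'))
    soup = BeautifulSoup(page.text, 'lxml')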
After changing all this, it worked for me to scrape the first page. I used:
Python 3.9.7
bs4 4.10.0
It is very important that you state what versions of Python and the libraries you are using.
Cheers!

Scrape website to only show populated categories

I am in the process of scraping a website. It pulls the contents of the page, but there are categories that are technically empty, and it still shows their headers. I would like to see only categories with events in them. Ideally I could even have the components of each transaction so I can choose which elements I want displayed.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping NH Dept of Banking...')
print()

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')

NHlist = []
for events in NHsoup.findAll('tr')[2:]:
    print(events.text)
    NHlist.append(events.text)

print(' '.join(NHlist))
Like I said, this works to get all of the information, but there are a lot of headers and empty spaces that don't need to be pulled. For example, at the time I'm writing this the 'acquisitions', 'conversions', and 'change in control' sections are empty, but the headers still come in and there is a relatively large blank space after them. I feel like I need some sort of loop to go through each header ('td') and then get its contents ('tr'), but I'm just not quite sure how to do it.
You can use itertools.groupby to group elements and then filter out empty rows:
import requests
from itertools import groupby
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping NH Dept of Banking...')
print()

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')

NHlist = []
for _, g in groupby(NHsoup.select('tr'), lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.select('th') else (None, d['g'])):
    s = [tag.get_text(strip=True, separator=' ') for tag in g]
    if any(i == '' for i in s):
        continue
    NHlist.append(s)

# This is just pretty printing, all the data are already in NHlist:
l = max(map(len, (j for i in NHlist for j in i))) + 5
for item in NHlist:
    print('{: <4} {}'.format(' ', item[0]))
    print('-' * l)
    for i, ev in enumerate(item[1:], 1):
        print('{: <4} {}'.format(i, ev))
    print()
Prints:
Scraping NH Dept of Banking...
New Bank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/11/18 The Millyard Bank
Interstate Bank Combination
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/16/19 Optima Bank & Trust Company with and into Cambridge Trust Company Portsmouth, NH 03/29/19
Amendment to Articles of Agreement or Incorporation; Business or Capital Plan
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
2 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
3 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
4 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
5 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
6 5/10/19 AB Trust Company New York, NY 06/04/19
Reduction in Capital
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 03/07/19 Primary Bank Bedford, NH 04/10/19
Amendment to Bylaws
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
2 2/25/19 Members First Credit Union Manchester, NH 04/05/19
3 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
4 6/28/19 Bellwether Community Credit Union
Interstate Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
2 03/08/19 One Credit Union Newport, NH 03/29/19
3 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
4 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
5 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
6 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
7 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
Interstate Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
New Branch Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
2 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
3 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
4 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
Branch Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
New Loan Production Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH
03766-1430 04/15/19
Loan Production Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
Trade Name Requests
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Management Trust Company" 04/24/19
New Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/19/19 Janney Trust Co., LLC
2 02/25/19 Darwin Trust Company of New Hampshire, LLC
3 07/15/`9 Harbor Trust Company
Dissolution of Trust Company
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
Trust Office Closure
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 5/10/19 Charter Trust Company Rochester, NH 05/20/19
New Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 06824 03/22/19
Relocation of Trust Office
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Floor, Boston MA To: 100 Summer Street, 12th Flr, Boston, MA 02/01/19
2 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New York, NY 10017 To: 410 Park Avenue, Suite 900 New York, NY 10022 03/29/19
3 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY 10022 To: One Grand Central Place 60 East 42nd Street, Ste 1550 New York, NY 10165 04/23/19
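If the mutable-default-argument trick in the groupby key looks too opaque, the same grouping can be written as a plain loop that starts a new section at every row containing a <th> (my rewording of the same idea; it reuses NHsoup from the code above):

# A <th> row starts a new section; the rows that follow belong to that section.
sections = []
for tr in NHsoup.select('tr'):
    text = tr.get_text(strip=True, separator=' ')
    if tr.select('th'):
        sections.append([text])
    elif sections:
        sections[-1].append(text)

# Keep only sections whose rows are all non-empty (drops the empty categories).
NHlist = [sec for sec in sections if all(row != '' for row in sec)]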
You could test which rows contain all '\xa0' (i.e. appear blank) and exclude them. I append to a list and convert to a pandas DataFrame, but you could just print the rows directly.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://www.nh.gov/banking/corporate-activities/index.htm')
soup = bs(r.content, 'lxml')

results = []
for tr in soup.select('tr'):
    row = [i.text for i in tr.select('th,td')]
    if row.count('\xa0') != len(row):
        results.append(row)

pd.set_option('display.width', 100)
df = pd.DataFrame(results)
df.style.set_properties(**{'text-align': 'left'})
df.columns = df.iloc[0]
df = df[1:]
df.fillna(value='', inplace=True)
print(df.head(20))
Not sure if this is how you want it, and there is probably a more elegant way, but what I basically did was:
Used pandas to get the table
Moved the column names into the first row (pandas assigns the columns automatically)
Found where rows are all nulls
Dropped the rows with all nulls along with the previous row (its sub-header)
import pandas as pd

print('Scraping NH Dept of Banking...')
print()

NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]

top_row = pd.DataFrame([df.columns], index=[-1])
df.columns = top_row.columns
df = df.append(top_row, sort=True).sort_index().reset_index(drop=True)

null_rows = df[df.isnull().values.all(axis=1)].index.tolist()
drop_hdr_rows = [x - 1 for x in null_rows]
drop_rows = drop_hdr_rows + null_rows

new_df = df[~df.index.isin(drop_rows)]
Output:
print (new_df.to_string())
0 1 2 3
2 New Bank New Bank New Bank New Bank
3 12/11/18 The Millyard Bank NaN NaN
4 Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination
5 01/16/19 Optima Bank & Trust Company with and into Camb... Portsmouth, NH 03/29/19
12 Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor...
13 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
14 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
15 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
16 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
17 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
18 5/10/19 AB Trust Company New York, NY 06/04/19
19 Reduction in Capital Reduction in Capital Reduction in Capital Reduction in Capital
20 03/07/19 Primary Bank Bedford, NH 04/10/19
21 Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws
22 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
23 2/25/19 Members First Credit Union Manchester, NH 04/05/19
24 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
25 6/28/19 Bellwether Community Credit Union NaN NaN
26 Interstate Branch Office Interstate Branch Office Interstate Branch Office Interstate Branch Office
27 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
28 03/08/19 One Credit Union Newport, NH 03/29/19
29 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
30 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
31 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
32 07/10/19 Mascoma Bank 242-244 North Winooski Avenue Burlington VT 07/18/19
33 07/10/19 Mascoma Bank 431 Pine Street Burlington VT 07/18/19
34 Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure
35 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
36 New Branch Office New Branch Office New Branch Office New Branch Office
37 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
38 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
39 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
40 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
41 Branch Office Closure Branch Office Closure Branch Office Closure Branch Office Closure
42 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
43 New Loan Production Office New Loan Production Office New Loan Production Office New Loan Production Office
44 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH 03766-1430 04/15/19
45 Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure
46 07/22/19 The Provident Bank 20 Trafalgar Square, Suite 447 Nashua NH 03063 07/31/19
51 Trade Name Requests Trade Name Requests Trade Name Requests Trade Name Requests
52 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Managem... 04/24/19
53 New Trust Company New Trust Company New Trust Company New Trust Company
54 02/19/19 Janney Trust Co., LLC NaN NaN
55 02/25/19 Darwin Trust Company of New Hampshire, LLC NaN NaN
56 07/15/`9 Harbor Trust Company NaN NaN
57 Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company
58 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
59 Trust Office Closure Trust Office Closure Trust Office Closure Trust Office Closure
60 5/10/19 Charter Trust Company Rochester, NH 05/20/19
61 New Trust Office New Trust Office New Trust Office New Trust Office
62 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 0... 03/22/19
63 Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office
64 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Fl... 02/01/19
65 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New Y... 03/29/19
66 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY ... 04/23/19

How to read wikipedia table of 2018 in film using python pandas and BeautifulSoup

I was attempting to get the movies of January to March 2018 from the Wikipedia page using pandas read_html.
Here is my code:
import pandas as pd
import numpy as np
link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link)
jan_march = tables[5].iloc[1:]
jan_march.columns = ['Opening1','Opening2','Title','Studio','Cast','Genre','Country','Ref']
jan_march.head()
There is some error in reading the columns. If anybody has already scraped some Wikipedia tables, maybe they can help me solve the problem.
Thanks a lot.
Related links:
Scraping Wikipedia tables with Python selectively
https://roche.io/2016/05/scrape-wikipedia-with-python
Scraping paginated web table with python pandas & beautifulSoup
I am getting a misaligned table, but am expecting the columns to line up with the correct headers (screenshots omitted).
Because of how the table is designed, it is not as simple as pd.read_html(). While that is a start, you will need to do some manipulation to get it into a desirable format:
import pandas as pd

link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link, header=0)[5]

# find na values and shift cells right
i = 0
while i < 2:
    row_shift = tables[tables['Unnamed: 7'].isnull()].index
    tables.iloc[row_shift, :] = tables.iloc[row_shift, :].shift(1, axis=1)
    i += 1

# create new column names
tables.columns = ['Month', 'Day', 'Title', 'Studio', 'Cast and crew', 'Genre', 'Country', 'Ref.']

# forward fill values
tables['Month'] = tables['Month'].ffill()
tables['Day'] = tables['Day'].ffill()
out:
Month Day Title Studio Cast and crew Genre Country Ref.
0 JANUARY 5 Insidious: The Last Key Universal Pictures / Blumhouse Productions Adam Robitel (director); Leigh Whannell (scree... Horror, Thriller US [33]
1 JANUARY 5 The Strange Ones Vertical Entertainment Lauren Wolkstein (director); Christopher Radcl... Drama US [34]
2 JANUARY 5 Stratton Momentum Pictures Simon West (director); Duncan Falconer, Warren... Action, Thriller IT, UK [35]
3 JANUARY 10 Sweet Country Samuel Goldwyn Films Warwick Thornton (director); David Tranter, St... Drama AUS [36]
4 JANUARY 12 The Commuter Lionsgate / StudioCanal / The Picture Company Jaume Collet-Serra (director); Byron Willinger... Action, Crime, Drama, Mystery, Thriller US, UK [37]
5 JANUARY 12 Proud Mary Screen Gems Babak Najafi (director); John S. Newman, Chris... Action, Thriller US [38]
6 JANUARY 12 Acts of Violence Lionsgate Premiere Brett Donowho (director); Nicolas Aaron Mezzan... Action, Thriller US [39]
...
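Since the question only asks for January to March, one way to take it from here (my addition, reusing the tables DataFrame above and assuming the Month values are the upper-case month names shown in the output) is a simple filter:

# Keep only the January-March rows of the cleaned table.
jan_march = tables[tables['Month'].isin(['JANUARY', 'FEBRUARY', 'MARCH'])]
print(jan_march.head())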

How to write data to new columns in csv when webscraping?

I am scraping the Billboard Hot R&B/Hip-Hop Songs chart and I am able to get all my data, but when I write the data to a csv the formatting is all wrong.
The data for Last Week Number, Peak Position and Weeks On Chart all appears under the first three columns of my csv, not under the columns where the respective headers are.
This is my current code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.billboard.com/charts/r-b-hip-hop-songs'

# Opens web connection and grabs page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# HTML parsing
page_soup = soup(page_html, "html.parser")

# Grabs song title, artist and picture
mainContainer = page_soup.findAll("div", {"class":"chart-row__main-display"})

# CSV filename creation
filename = "Billboard_Hip_Hop_Charts.csv"
f = open(filename, "w")

# Creating Headers
headers = "Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position, Weeks On Chart\n"
f.write(headers)

# Get Billboard Number, Artist Name and Song Title
for container in mainContainer:
    # Gets billboard number
    billboard_number = container.div.span.text

    # Gets artist name
    artist_name_a_tag = container.findAll("", {"class":"chart-row__artist"})
    artist_name = artist_name_a_tag[0].text.strip()

    # Gets song title
    song_title = container.h2.text

    print("Billboard Number: " + billboard_number)
    print("Artist Name: " + artist_name)
    print("Song Title: " + song_title)

    f.write(billboard_number + "," + artist_name + "," + song_title + "\n")

# Grabs side container from main container
secondaryContainer = page_soup.findAll("div", {"class":"chart-row__secondary"})

# Get Last Week Number, Peak Position and Weeks On Chart
for container in secondaryContainer:
    # Gets last week number
    last_week_number_tag = container.findAll("", {"class":"chart-row__value"})
    last_week_number = last_week_number_tag[0].text

    # Gets peak position
    peak_position_tag = container.findAll("", {"class":"chart-row__value"})
    peak_position = peak_position_tag[1].text

    # Gets weeks on chart
    weeks_on_chart_tag = container.findAll("", {"class":"chart-row__value"})
    weeks_on_chart = weeks_on_chart_tag[2].text

    print("Last Week Number: " + last_week_number)
    print("Peak Position: " + peak_position)
    print("Weeks On Chart: " + weeks_on_chart)

    f.write(last_week_number + "," + peak_position + "," + weeks_on_chart + "\n")

f.close()
This is what my csv looks like with headers Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position and Weeks On Chart.
1 Drake Nice For What
2 Post Malone Featuring Ty Dolla $ign Psycho
3 Drake God's Plan
4 Post Malone Better Now
5 Post Malone Featuring 21 Savage Rockstar
6 BlocBoy JB Featuring Drake Look Alive
7 Post Malone Paranoid
8 Lil Dicky Featuring Chris Brown Freaky Friday
9 Post Malone Rich & Sad
10 Post Malone Featuring Swae Lee Spoil My Night
11 Post Malone Featuring Nicki Minaj Ball For Me
12 Migos Featuring Drake Walk It Talk It
13 Post Malone Featuring G-Eazy & YG Same Bitches
14 Cardi B| Bad Bunny & J Balvin I Like It
15 Post Malone Zack And Codeine
16 Post Malone Over Now
17 Cardi B Be Careful
18 Post Malone Takin' Shots
19 The Weeknd & Kendrick Lamar Pray For Me
20 Rich The Kid Plug Walk
21 The Weeknd Call Out My Name
22 Bruno Mars & Cardi B Finesse
23 Post Malone Candy Paint
24 Ella Mai Boo'd Up
25 Rae Sremmurd & Juicy J Powerglide
26 Post Malone 92 Explorer
27 J. Cole ATM
28 J. Cole KOD
29 Post Malone Otherside
30 Post Malone Blame It On Me
31 J. Cole Kevin's Heart
32 Kendrick Lamar & SZA All The Stars
33 Nicki Minaj Chun-Li
34 Lil Pump Esskeetit
35 Migos Stir Fry
36 Famous Dex Japan
37 Post Malone Sugar Wraith
38 Cardi B Featuring Migos Drip
39 XXXTENTACION Sad!
40 Jay Rock| Kendrick Lamar| Future & James Blake King's Dead
41 Rich The Kid Featuring Kendrick Lamar New Freezer
42 Logic & Marshmello Everyday
43 J. Cole Motiv8
44 YoungBoy Never Broke Again Outside Today
45 Post Malone Jonestown (Interlude)
46 Cardi B Featuring 21 Savage Bartier Cardi
47 YoungBoy Never Broke Again Overdose
48 J. Cole 1985 (Intro To The Fall Off)
49 J. Cole Photograph
50 Khalid| Ty Dolla $ign & 6LACK OTW
1 1 2
2 1 6
3 1 17
4 2 12
5 3 14
10 6 8
...
Any help on placing the data into the right columns helps!
Your code is unnecessarily messy and very hard to read. You don't need two separate containers at all, because a single container per chart row is sufficient to fetch the required data. Try the approach below and you will find the csv filled in accordingly:
import requests, csv
from bs4 import BeautifulSoup

url = 'https://www.billboard.com/charts/r-b-hip-hop-songs'

with open('Billboard_Hip_Hop_Charts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Billboard Number','Artist Name','Song Title','Last Week Number','peak_position','weeks_on_chart'])

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for container in soup.find_all("article", class_="chart-row"):
        billboard_number = container.find(class_="chart-row__current-week").text
        artist_name_a_tag = container.find(class_="chart-row__artist").text.strip()
        song_title = container.find(class_="chart-row__song").text
        last_week_number_tag = container.find(class_="chart-row__value")
        last_week_number = last_week_number_tag.text
        peak_position_tag = last_week_number_tag.find_parent().find_next_sibling().find(class_="chart-row__value")
        peak_position = peak_position_tag.text
        weeks_on_chart_tag = peak_position_tag.find_parent().find_next_sibling().find(class_="chart-row__value").text
        print(billboard_number, artist_name_a_tag, song_title, last_week_number, peak_position, weeks_on_chart_tag)
        writer.writerow([billboard_number, artist_name_a_tag, song_title, last_week_number, peak_position, weeks_on_chart_tag])
Output are like:
1 Childish Gambino This Is America 1 1 2
2 Drake Nice For What 2 1 6
3 Drake God's Plan 3 1 17
4 Post Malone Featuring Ty Dolla $ign Psycho 4 2 12
5 BlocBoy JB Featuring Drake Look Alive 5 3 14
6 Ella Mai Boo'd Up 10 6 8
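If you do want to keep the two-container structure from the question, another option (my own sketch, not part of the answer above) is to zip the main and secondary containers so each csv row is written in one go; using csv.writer also keeps artist names that contain commas in a single column. The class names are taken from the question's code, so they may need adjusting if the page layout has changed.

import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

page_soup = BeautifulSoup(urlopen('https://www.billboard.com/charts/r-b-hip-hop-songs').read(), "html.parser")
main = page_soup.findAll("div", {"class": "chart-row__main-display"})
secondary = page_soup.findAll("div", {"class": "chart-row__secondary"})

with open("Billboard_Hip_Hop_Charts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Billboard Number", "Artist Name", "Song Title",
                     "Last Week Number", "Peak Position", "Weeks On Chart"])
    # The two lists are parallel (one entry per chart row), so pair them up.
    for m, s in zip(main, secondary):
        values = s.findAll("", {"class": "chart-row__value"})
        writer.writerow([
            m.div.span.text,
            m.findAll("", {"class": "chart-row__artist"})[0].text.strip(),
            m.h2.text,
            values[0].text,
            values[1].text,
            values[2].text,
        ])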

(bs4) trying to differentiate different containers in a HTML page

I have a web page from the Houses of Parliament. It has information on an MP's declared interests, and I would like to store all MP interests for a project I am thinking of.
root = 'https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm'
root is an example webpage. I want my output to be a dictionary, as there are interests under different sub headings and the entry could be a list.
Problem: if you look at the page, the first interest (employment and earnings) is not wrapped in a container; the heading is just a tag and is not connected to the text underneath it. I could call
soup.find_all('p', {'xmlns': 'http://www.w3.org/1999/xhtml'})
but it would return the headings of expenses, and a few other headings like her name, and not the text under them,
which makes it difficult to iterate through the headings and store the information.
What would be the best way of iterating through the page, storing each heading, and the information under each heading?
Something like this may work:
import urllib.request
from bs4 import BeautifulSoup

ret = {}
page = urllib.request.urlopen("https://publications.parliament.uk/pa/cm/cmregmem/160606/abbott_diane.htm")
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'lxml')

valid = False
key = None   # initialised so the final "last entry" check cannot fail
value = ""
for i in soup.findAll('p'):
    if i.find('strong') and i.text is not None:
        # a <p><strong>...</strong></p> paragraph is a heading; store the previous
        # heading's accumulated text (ignore the first pass, when there is none yet)
        if valid:
            ret[key] = value
        value = ""
        valid = True
        key = i.text
    elif i.text is not None:
        value = value + " " + i.text

# get last entry
if key is not None:
    ret[key] = value

for x in ret:
    print(x)
    print(ret[x])
Outputs
4. Visits outside the UK
Name of donor: (1) Stop Aids (2) Aids Alliance Address of donor: (1) Grayston Centre, 28 Charles St, London N1 6HT (2) Preece House, 91-101 Davigdor Rd, Hove BN3 1RE Amount of donation (or estimate of the probable value): for myself and a member of staff, flights £2,784, accommodation £380.52, other travel costs £172, per diems £183; total £3,519.52. These costs were divided equally between both donors. Destination of visit: Uganda Date of visit: 11-14 November 2015 Purpose of visit: to visit the different organisations and charities (development) in regards to AIDS and HIV. (Registered 09 December 2015)Name of donor: Muslim Charities Forum Address of donor: 6 Whitehorse Mews, 37 Westminster Bridge Road, London SE1 7QD Amount of donation (or estimate of the probable value): for a member of staff and myself, return flights to Nairobi £5,170; one night's accommodation in Hargeisa £107.57; one night's accommodation in Borama £36.21; total £5,313.78 Destination of visit: Somaliland Date of visit: 7-10 April 2016 Purpose of visit: to visit the different refugee camps and charities (development) in regards to the severe drought in Somaliland. (Registered 18 May 2016)Name of donor: British-Swiss Chamber of Commerce Address of donor: Bleicherweg, 128002, Zurich, Switzerland Amount of donation (or estimate of the probable value): flights £200.14; one night's accommodation £177, train fare Geneva to Zurich £110; total £487.14 Destination of visit: Geneva and Zurich, Switzerland Date of visit: 28-29 April 2016 Purpose of visit: to participate in a public panel discussion in Geneva in front of British-Swiss Chamber of Commerce, its members and guests. (Registered 18 May 2016) 
2. (b) Any other support not included in Category 2(a)
Name of donor: Ann Pettifor Address of donor: private Amount of donation or nature and value if donation in kind: £1,651.07 towards rent of an office for my mayoral campaign Date received: 28 August 2015 Date accepted: 30 September 2015 Donor status: individual (Registered 08 October 2015)
1. Employment and earnings
Fees received for co-presenting BBC’s ‘This Week’ TV programme. Address: BBC Broadcasting House, Portland Place, London W1A 1AA. (Registered 04 November 2013)14 May 2015, received £700. Hours: 3 hrs. (Registered 03 June 2015)4 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)18 June 2015, received £700. Hours: 3 hrs. (Registered 01 July 2015)16 July 2015, received £700. Hours: 3 hrs. (Registered 07 August 2015)8 January 2016, received £700 for an appearance on 17 December 2015. Hours: 3 hrs. (Registered 14 January 2016)28 July 2015, received £4,000 for taking part in Grant Thornton’s panel at the JLA/FD Intelligence Post-election event. Address: JLA, 14 Berners Street, London W1T 3LJ. Hours: 5 hrs. (Registered 07 August 2015)23rd October 2015, received £1,500 for co-presenting BBC’s "Have I Got News for You" TV programme. Address: Hat Trick Productions, 33 Oval Road Camden, London NW1 7EA. Hours: 5 hrs. (Registered 26 October 2015)10 October 2015, received £1,400 for taking part in a talk at the New Wolsey Theatre in Ipswich. Address: Clive Conway Productions, 32 Grove St, Oxford OX2 7JT. Hours: 5 hrs. (Registered 26 October 2015)21 March 2016, received £4,000 via Speakers Corner (London) Ltd, Unit 31, Highbury Studios, 10 Hornsey Street, London N7 8EL, from Thompson Reuters, Canary Wharf, London E14 5EP, for speaking and consulting on a panel. Hours: 10 hrs. (Registered 06 April 2016)
Abbott, Ms Diane (Hackney North and Stoke Newington)
House of Commons
Session 2016-17
Publications on the internet
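If you want to keep the resulting dictionary for your project, a natural follow-up (my addition, reusing ret from the code above; the filename is just an example) is to dump it to JSON:

import json

# Persist the scraped interests, one file per MP.
with open('abbott_diane_interests.json', 'w', encoding='utf-8') as fh:
    json.dump(ret, fh, ensure_ascii=False, indent=2)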
