Scraping PDFs from page containing multiple search results - python

I would like to scrape the PDFs for any of the speakers on this page. How might I go about this? https://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker=Amy+Khor
The website has changed since I last scraped it. Previously I used code such as this:
import requests
from bs4 import BeautifulSoup
url = 'http://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker='
search_term = 'Amy+Khor'
data = {
    'keywords': search_term,
    'search-type': 'basic',
    'keywords-type': 'all',
    'page-num': 1
}
soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
cnt = 1
while True:
    print()
    print('Page no. {}'.format(cnt))
    print('-' * 80)
    for a in soup.select('a[href$=".pdf"]'):
        print(a['href'])
    if soup.select_one('span.next-10'):
        data['page-num'] += 10
        cnt += 1
        soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
    else:
        break
The code above no longer works...

To get all PDF links from the result pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.nas.gov.sg/archivesonline/speeches/search-result"
params = {
    "search-type": "advanced",
    "speaker": "Amy Khor",
    "page-num": "1",
}
for params["page-num"] in range(1, 3):  # <--- increase number of pages here
    soup = BeautifulSoup(
        requests.get(url, params=params).content, "html.parser"
    )
    for a in soup.select('a[href$="pdf"]'):
        print("https:" + a["href"])
    print("-" * 80)
Prints:
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MINDEF_20171123001_2.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20160229002.pdf
...and so on.
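The loop above only prints the links. If the goal is to save the files, a minimal sketch might look like the following. It assumes, as the "https:" prefix above suggests, that the hrefs are protocol-relative (they start with //). The helper names absolute_url, pdf_filename and download_pdfs and the pdfs output folder are made up for illustration, and only the standard library is used.

```python
import os
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

PAGE_URL = "https://www.nas.gov.sg/archivesonline/speeches/search-result"


def absolute_url(href, page_url=PAGE_URL):
    """Resolve a relative or protocol-relative href against the page URL."""
    return urljoin(page_url, href)


def pdf_filename(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download_pdfs(hrefs, out_dir="pdfs"):
    """Download every href into out_dir and return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for href in hrefs:
        url = absolute_url(href)
        target = os.path.join(out_dir, pdf_filename(url))
        urlretrieve(url, target)  # network access happens here
        saved.append(target)
    return saved
```

Passing the hrefs collected in the loop to download_pdfs would write one file per link. If the server rejects the default urllib user agent, the same idea can be expressed with requests and custom headers.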

Here's how I'd do it if I were to start from scratch.
Google Search is actually pretty powerful, and I feel like this query gets your pdfs:
"Amy Khor" site:https://www.nas.gov.sg/archivesonline/data/pdfdoc filetype:pdf
Then, I'd use either BeautifulSoup or, even better, something like googlesearch-python to get the results and process them into your desired lxml format.

Related

Python web scraping multiple pages

I am scraping all the words from the Merriam-Webster website.
I want to scrape all pages from a to z, and all the pages within each letter, and save them to a text file. The problem I'm having is that I only get the first result of the table instead of all of them. I know that this is a large amount of text (around 500k), but I'm doing it to educate myself.
CODE:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.merriam-webster.com/browse/dictionary/a/'
page = 1
# for page in range(1, 75):
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
containers = soup.find('div', attrs={'class', 'entries'})
table = containers.find_all('ul')
for entries in table:
    links = entries.find_all('a')
    name = links[0].text
    print(name)
What I want is to get all the entries from this table, but instead I only get the first entry.
I'm stuck here, so any help would be appreciated.
Thanks
https://www.merriam-webster.com/browse/medical/a-z
https://www.merriam-webster.com/browse/legal/a-z
https://www.merriam-webster.com/browse/dictionary/a-z
https://www.merriam-webster.com/browse/thesaurus/a-z
To get all entries, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.merriam-webster.com/browse/dictionary/a/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for a in soup.select('.entries a'):
    print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
Prints:
(a) heaven on earth https://www.merriam-webster.com/dictionary/%28a%29%20heaven%20on%20earth
(a) method in/to one's madness https://www.merriam-webster.com/dictionary/%28a%29%20method%20in%2Fto%20one%27s%20madness
(a) penny for your thoughts https://www.merriam-webster.com/dictionary/%28a%29%20penny%20for%20your%20thoughts
(a) quarter after https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20after
(a) quarter of https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20of
(a) quarter past https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20past
(a) quarter to https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20to
(all) by one's lonesome https://www.merriam-webster.com/dictionary/%28all%29%20by%20one%27s%20lonesome
(all) choked up https://www.merriam-webster.com/dictionary/%28all%29%20choked%20up
(all) for the best https://www.merriam-webster.com/dictionary/%28all%29%20for%20the%20best
(all) in good time https://www.merriam-webster.com/dictionary/%28all%29%20in%20good%20time
...and so on.
To scrape multiple pages:
url = 'https://www.merriam-webster.com/browse/dictionary/a/{}'
for page in range(1, 76):
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
    for a in soup.select('.entries a'):
        print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
EDIT: To get all pages from A to Z:
import requests
from bs4 import BeautifulSoup
url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'
for char in range(ord('a'), ord('z')+1):
    page = 1
    while True:
        soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
        for a in soup.select('.entries a'):
            print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
        last_page = soup.select_one('[aria-label="Last"]')['data-page']
        if last_page == '':
            break
        page += 1
EDIT 2: To save to file:
import requests
from bs4 import BeautifulSoup
url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'
with open('data.txt', 'w') as f_out:
    for char in range(ord('a'), ord('z')+1):
        page = 1
        while True:
            soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
            for a in soup.select('.entries a'):
                print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
                print('{}\t{}'.format(a.text, 'https://www.merriam-webster.com' + a['href']), file=f_out)
            last_page = soup.select_one('[aria-label="Last"]')['data-page']
            if last_page == '':
                break
            page += 1
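The while True loops above keep requesting pages until the page itself signals that nothing follows. That pattern can be separated from the site-specific parsing. The sketch below is an illustration rather than part of the answer: paginate, fetch_page and is_last are hypothetical names, the data is a stand-in for real HTTP responses, and a max_pages cap is added so that a changed page layout cannot cause an endless loop.

```python
def paginate(fetch_page, is_last, start=1, max_pages=500):
    """Yield (page_number, result) until is_last(result) is true.

    fetch_page: callable taking a page number and returning parsed data.
    is_last:    callable taking that data and returning True on the final page.
    max_pages:  safety cap on the number of requests.
    """
    for page in range(start, start + max_pages):
        result = fetch_page(page)
        yield page, result
        if is_last(result):
            break


# Stand-in data instead of real HTTP requests: three pages of entries.
FAKE_SITE = {
    1: {"entries": ["a", "aa"], "next": 2},
    2: {"entries": ["aah"], "next": 3},
    3: {"entries": ["aardvark"], "next": None},
}

entries = []
for page, data in paginate(FAKE_SITE.get, lambda d: d["next"] is None):
    entries.extend(data["entries"])

print(entries)  # ['a', 'aa', 'aah', 'aardvark']
```

With the real site, fetch_page would wrap the requests.get and BeautifulSoup call, and is_last would inspect the [aria-label="Last"] element.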
I think you need another loop:
for entries in table:
    links = entries.find_all('a')
    for name in links:
        print(name.text)

How to fetch href link from a website using BeautifulSoup

I am trying to get all the article links from the website below.
However, my code prints nothing at all, even though I specified the class and the path to it.
Below is my code.
import requests
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://uynaa.wordpress.com/category/%d0%be%d1%80%d1%87%d1%83%d1%83%d0%bb%d0%b3%d1%8b%d0%bd-%d0%bd%d0%b8%d0%b9%d1%82%d0%bb%d1%8d%d0%bb/").read()
soup = BeautifulSoup(html, "lxml")
productDivs = soup.findAll('div', attrs={'class' : 'post type-post status-publish format-standard hentry category-56456384'})
for div in productDivs:
    print(div.find('h2')[a]['href'])
How do I fetch all the links?
The links are loaded dynamically via JavaScript from an external URL. You can use this example to print all links:
import json
import requests
from bs4 import BeautifulSoup
# form payload for the request; kept separate from the JSON response
payload = {'action': 'infinite_scroll', 'page': 1}
api_url = 'https://uynaa.wordpress.com/?infinity=scrolling'
page = 1
while True:
    payload['page'] = page
    data = requests.post(api_url, data=payload).json()
    # uncomment next line to print all data:
    # print(json.dumps(data, indent=4))
    for p in data['postflair']:
        print(p)
    if data['lastbatch']:
        break
    page += 1
Prints:
https://uynaa.wordpress.com/2014/01/02/2013-in-review/
https://uynaa.wordpress.com/2013/10/07/%d0%b0%d1%84%d0%b3%d0%b0%d0%bd%d0%b8%d1%81%d1%82%d0%b0%d0%bd-%d0%b0%d0%bd%d1%85%d0%b4%d0%b0%d0%b3%d1%87-%d1%88%d0%b0%d0%bb%d1%82%d0%b3%d0%b0%d0%b0%d0%bd/
https://uynaa.wordpress.com/2013/10/07/%d0%b5-%d0%ba%d0%b0%d1%81%d0%bf%d0%b5%d1%80%d1%81%d0%ba%d0%b8%d0%b9-%d0%b1%d0%b8-%d0%b4%d0%b0%d1%80%d0%b0%d0%bd%d0%b3%d1%83%d0%b9%d0%bb%d0%b0%d0%bb-%d1%82%d0%be%d0%b3%d1%82%d0%be%d0%be-%d0%b3%d1%8d/
https://uynaa.wordpress.com/2013/10/07/%d1%88%d0%b0%d0%bd%d1%85%d0%b0%d0%b9-%d0%bd%d0%be%d0%b3%d0%be%d0%be%d0%bd/
https://uynaa.wordpress.com/2013/10/07/%d1%8d%d0%bd%d1%8d-%d0%b3%d0%b0%d0%b7%d0%b0%d1%80-%d0%bc%d0%b0%d0%bd%d0%b0%d0%b9%d1%85-%d0%b1%d0%b0%d0%b9%d1%81%d0%b0%d0%bd-%d1%8e%d0%bc/
https://uynaa.wordpress.com/2013/10/07/500-%d0%b6%d0%b8%d0%bb-%d0%b0%d1%80%d1%87%d0%bb%d1%83%d1%83%d0%bb%d0%b0%d0%b0%d0%b3%d2%af%d0%b9-%d0%b4%d1%8d%d0%bb%d1%85%d0%b8%d0%b9%d0%bd-%d1%86%d0%be%d1%80%d1%8b%d0%bd-%d0%b3%d0%b0%d0%bd%d1%86/
https://uynaa.wordpress.com/2013/02/01/%d1%83%d0%bb%d0%b7-%d0%bd%d1%83%d1%82%d0%b3%d0%b8%d0%b9%d0%bd-%d0%bf%d0%b8%d1%84%d0%b0%d0%b3%d0%be%d1%80/
https://uynaa.wordpress.com/2013/01/21/%d1%82%d0%b5%d0%bb%d0%b5%d0%b2%d0%b8%d0%b7%d0%b8%d0%b9%d0%bd-%d1%82%d2%af%d2%af%d1%85%d1%8d%d0%bd-%d0%b4%d1%8d%d1%85-%d1%85%d0%b0%d0%bc%d0%b3%d0%b8%d0%b9%d0%bd-%d0%b3%d0%b0%d0%b6%d0%b8%d0%b3-%d1%88/
https://uynaa.wordpress.com/2013/01/18/%d0%b0%d0%bf%d0%be%d1%84%d0%b8%d1%81-%d0%be%d0%be%d1%81-%d2%af%d2%af%d0%b4%d1%8d%d0%bd-%d3%a9%d1%80%d0%bd%d3%a9%d1%85-%d0%b6%d2%af%d0%b6%d0%b8%d0%b3/
https://uynaa.wordpress.com/2013/01/17/%d0%b0%d1%80%d0%b8%d1%83%d0%bd%d1%82%d0%bd%d1%8b-%d0%bd%d1%83%d1%82%d0%b0%d0%b3-%d0%b8%d0%b9%d0%b3-%d1%8d%d0%b7%d1%8d%d0%b3%d0%bd%d1%8d%d1%85-%d1%85%d0%b0%d0%bd/
https://uynaa.wordpress.com/2013/01/15/%d1%81%d0%b0%d1%83%d0%b4%d1%8b%d0%bd-%d1%82%d0%b0%d0%b3%d0%bd%d1%83%d1%83%d0%bb%d1%87%d0%b8%d0%b4-%d0%b0%d1%81%d0%b0%d0%b4%d1%8b%d0%b3-%d0%be%d0%bb%d0%b6%d1%8d%d1%8d/
https://uynaa.wordpress.com/2013/01/15/%d0%bc%d0%b0%d0%bb%d0%b8%d0%b3%d1%8d%d1%8d%d1%81-%d1%81%d0%be%d0%bc%d0%b0%d0%bb%d0%b8-%d1%85%d2%af%d1%80%d1%82%d1%8d%d0%bb/
https://uynaa.wordpress.com/2013/01/10/%d1%85%d0%be%d1%80%d0%b2%d0%be%d0%be-%d0%b5%d1%80%d1%82%d3%a9%d0%bd%d1%86-%d1%85%d0%b0%d0%bb%d0%b0%d0%b0%d1%81%d0%b0%d0%bd%d0%b4-%d0%b1%d0%b0%d0%b3%d1%82%d0%b0%d0%bd%d0%b0/
https://uynaa.wordpress.com/2013/01/10/%d1%82%d0%b0%d0%bd%d0%b3%d0%b0%d1%80%d0%b0%d0%b3-%d3%a9%d1%80%d0%b3%d3%a9%d1%85-%d1%91%d1%81%d0%bb%d0%be%d0%bb-%d1%85%d2%af%d0%bb%d1%8d%d1%8d%d0%b6-%d0%b1%d0%b0%d0%b9%d0%b3-%d1%8d%d1%8d/
https://uynaa.wordpress.com/2013/01/09/%d0%b1%d0%be%d0%bb%d0%bb%d0%b8%d0%b2%d1%83%d0%b4%d1%8b%d0%bd-%d0%ba%d0%b8%d0%bd%d0%be%d0%bd%d0%be%d0%be%d1%81-%d1%87-%d0%b0%d0%b9%d0%bc%d0%b0%d0%b0%d1%80/
https://uynaa.wordpress.com/2013/01/08/%d0%bf%d0%b5%d0%bd%d1%82%d0%b0%d0%b3%d0%be%d0%bd-%d0%b1%d0%be%d0%bb%d0%be%d0%bd-%d1%82%d1%82%d0%b3-%d1%8b%d0%b3-%d1%83%d0%b4%d0%b8%d1%80%d0%b4%d0%b0%d1%85-%d0%bc%d0%b0%d0%b3%d0%b0%d0%b4%d0%bb%d0%b0/
https://uynaa.wordpress.com/2013/01/07/%d0%b7%d0%b8%d0%b0%d0%b4-%d1%82%d0%b0%d0%ba%d0%b8%d0%b5%d0%b4%d0%b4%d0%b8%d0%bd/
...and so on.
EDIT: To filter the links to the specified category only, you can use this script:
import json
import requests
from bs4 import BeautifulSoup
# form payload for the request; kept separate from the JSON response
payload = {'action': 'infinite_scroll', 'page': 1}
api_url = 'https://uynaa.wordpress.com/?infinity=scrolling'
all_links = []
page = 1
while True:
    payload['page'] = page
    data = requests.post(api_url, data=payload).json()
    # uncomment next line to print all data:
    # print(json.dumps(data, indent=4))
    soup = BeautifulSoup(data['html'], 'html.parser')
    for p in soup.select('.post'):
        if any('%d0%be%d1%80%d1%87%d1%83%d1%83%d0%bb%d0%b3%d1%8b%d0%bd-%d0%bd%d0%b8%d0%b9%d1%82%d0%bb%d1%8d%d0%bb' in cat['href'] for cat in p.select('[rel="category tag"]')):
            if p.h2.a['href'] not in all_links:
                print(p.h2.a['href'])
                all_links.append(p.h2.a['href'])
    if data['lastbatch']:
        break
    page += 1
print(len(all_links))
Prints 135 links:
...
https://uynaa.wordpress.com/2011/05/13/%e2%80%9c%d1%83%d1%85%d0%b0%d0%b0%d0%bd-%d0%bc%d1%83%d1%83%d1%82%d0%bd%d1%83%d1%83%d0%b4%d1%8b%d0%bd-%d2%af%d0%b5%e2%80%9d/
https://uynaa.wordpress.com/2011/05/04/%d2%af%d1%85%d0%bb%d0%b8%d0%b9%d0%bd-%d1%82%d0%be%d0%b3%d0%bb%d0%be%d0%be%d0%bc/
https://uynaa.wordpress.com/2011/05/04/%d0%be%d1%81%d0%b0%d0%bc%d0%b0-%d0%b1%d0%b8%d0%bd-%d0%bb%d0%b0%d0%b4%d0%b5%d0%bd%d0%b8%d0%b9%d0%b3-%d1%8f%d0%b0%d0%b6-%d0%b8%d0%bb%d1%80%d2%af%d2%af%d0%bb%d1%81%d1%8d%d0%bd-%d0%b1%d1%8d/
135
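The filter above checks p.h2.a['href'] not in all_links against a list, which rescans the list for every post. For a few hundred links that is fine. For larger crawls, a set makes the membership check constant-time while a list keeps the original order. A small sketch with made-up URLs (unique_in_order is a hypothetical helper, not part of the answer):

```python
def unique_in_order(links):
    """Return links with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for link in links:
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Made-up example: the same post may show up in more than one batch.
batch = [
    "https://example.com/2013/10/07/post-a/",
    "https://example.com/2013/10/07/post-b/",
    "https://example.com/2013/10/07/post-a/",
]
print(unique_in_order(batch))
```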
I'm not sure why your code doesn't work. I used the code below to get all the links first.
list_href = []
a_tags = soup.find_all('a')
for tag in a_tags:
    list_href.append(tag.get('href'))
The links to the articles are in list_href[5:26].

Visible and search URLs for webscraping

When I apply filters on the website before web scraping, I end up at the following URL: https://www.marktplaats.nl/l/auto-s/p/2/#f:10898,10882
However, when I use that URL in my script to retrieve the href of every advertisement, I get the results of https://www.marktplaats.nl/l/auto-s/p/2 instead, and two of my filters (namely #f:10898,10882) are ignored completely.
Can you please tell me what the problem is?
import requests
import bs4
import pandas as pd
frames = []
for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    add_url = '/#f:10898,10882'
    txt = requests.get(url + str(pagenumber) + add_url)
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
    for car in soup_table.findAll('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
I would suggest that you use their API instead, which seems to be open.
If you open the link below you will see the same listings you are searching for, with the appropriate filters applied and no need to parse HTML (use something to format the JSON, since otherwise it will look like just a bunch of text). You can also modify the search easily in the request just by changing the query parameters.
https://www.marktplaats.nl/lrp/api/search?attributesById[]=10898&attributesById[]=10882&l1CategoryId=91&limit=30&offset=0
In code it would look something like this:
import requests
def getcars():
    url = 'https://www.marktplaats.nl/lrp/api/search'
    querystring = {
        # a list value makes requests repeat the key once per item;
        # writing the same dict key twice would keep only the last value
        'attributesById[]': [10898, 10882],
        'l1CategoryId': 91,
        'limit': 30,
        'offset': 0
    }
    headers = {
    }
    response = requests.get(url, headers=headers, params=querystring)
    x = response.json()
    return x
cars = getcars()
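As for why the filters were ignored in the original script: everything after # in a URL is a fragment, which is handled on the client side (here, presumably by the site's JavaScript) and is not sent to the server in the HTTP request. The API approach moves the filters into the query string instead. The standard-library sketch below illustrates both points, using the parameter values from the URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The filter part sits in the fragment, which is not part of the HTTP request.
parts = urlsplit("https://www.marktplaats.nl/l/auto-s/p/2/#f:10898,10882")
print(parts.path)      # /l/auto-s/p/2/
print(parts.fragment)  # f:10898,10882

# A key that must appear twice is passed as a list; doseq=True repeats the key.
params = {
    "attributesById[]": [10898, 10882],
    "l1CategoryId": 91,
    "limit": 30,
    "offset": 0,
}
query = urlencode(params, doseq=True)
print(query)

# Round trip: both filter ids survive in the encoded query string.
print(parse_qs(query)["attributesById[]"])  # ['10898', '10882']
```

requests encodes list values passed through params in the same way, repeating the key once per item.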

How to scrape data from website having "View More" option using BeautifulSoup library in python

I am trying to parse comments from this website link:
I need to get 1000 comments, but by default the page shows only 10. I am unable to figure out a way to get the content that appears on the page after clicking 'View More'.
I have the following code so far:
import urllib.request
from bs4 import BeautifulSoup
import sys
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
response = urllib.request.urlopen("https://www.mygov.in/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/")
srcode = response.read()
soup = BeautifulSoup(srcode, "html.parser")
all_comments_div = soup.find_all('div', class_="comment_body")
all_comments = []
for div in all_comments_div:
    all_comments.append(div.find('p').text.translate(non_bmp_map))
print(all_comments)
print(len(all_comments))
You can use a while loop to get the next pages (i.e. loop while there is a next page and fewer than 1000 comments have been collected):
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import sys
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []
max_comments = 1000
base_url = 'https://www.mygov.in/'
# urljoin avoids the doubled slash that plain string concatenation would produce
next_page = urljoin(base_url, '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/')
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = urljoin(base_url, next_page.find('a').get('href'))
    print('comments: {}'.format(len(all_comments)))
print(all_comments)
print(len(all_comments))
The new comments are loaded via AJAX, so we need to request that endpoint, parse the JSON it returns, and then run BeautifulSoup on the HTML inside it, i.e.:
import json
import requests
import sys
from bs4 import BeautifulSoup
how_many_pages = 5  # how many comment pages do you want to parse?
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
all_comments = []
for x in range(how_many_pages):
    # note: mygov.in seems very slow...
    ajax_url = (
        "https://www.mygov.in/views/ajax/?view_name=view_comments&view_display_id=block_2&view_args=267721&view_path=node%2F267721"
        "&view_base_path=comment_pdf_export&view_dom_id=f3a7ae636cabc2c47a14cebc954a2ff0&pager_element=1&sort_by=created&sort_order=DESC&page=0,{}"
    ).format(x)
    json_data = requests.get(ajax_url).content
    d = json.loads(json_data.decode())  # Remove .decode() for python < 3
    print(len(d))
    if len(d) == 3:  # sometimes the JSON length is 3
        comments = d[2]['data']  # 'data' is the key that contains the comments HTML
    elif len(d) == 2:  # other times just 2...
        comments = d[1]['data']
    else:
        continue  # unexpected response shape; skip this page
    # From here, we can use your BeautifulSoup code.
    soup = BeautifulSoup(comments, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        all_comments.append(div.find('p').text.translate(non_bmp_map))
print(all_comments)
Output:
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession,...']
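In both branches of the code above, the comments HTML comes from the last element of the decoded list. A slightly more defensive variant, sketched below, looks for the last element that actually carries a data string instead of relying on the list length. The response shape (a list of dicts, one of which has a data key holding HTML) is assumed from the code above; extract_comments_html is a hypothetical helper, and the sample payloads, including their command keys, are invented.

```python
def extract_comments_html(commands):
    """Return the 'data' string of the last command that has one, else ''."""
    for command in reversed(commands):
        if isinstance(command, dict) and isinstance(command.get("data"), str):
            return command["data"]
    return ""


# Invented payloads mimicking the two shapes mentioned above (length 2 and 3).
two_items = [
    {"command": "settings"},
    {"command": "insert", "data": "<p>first</p>"},
]
three_items = [
    {"command": "settings"},
    {"command": "other"},
    {"command": "insert", "data": "<p>second</p>"},
]

print(extract_comments_html(two_items))    # <p>first</p>
print(extract_comments_html(three_items))  # <p>second</p>
print(repr(extract_comments_html([])))     # ''
```

The returned string can then be passed to BeautifulSoup exactly as in the loop above.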

Get the lists of things to do from tripadvisor

How do I get the 'things to do' list? I am new to web scraping and I don't know how to loop through each page to get the href of every item under 'things to do'. Can you tell me what I am doing wrong? Any help would be highly appreciated. Thanks in advance.
import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
offset = 0
url = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
urls = []
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)
for offset in range(0, last_offset, 30):
    print('--- page offset:', offset, '---')
    url = 'https://www.tripadvisor.com/Attractions-g255057-oa' + str(offset) + '-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    for link in soup.find_all('a', {'property_title'}):
        iurl = 'https://www.tripadvisor.com/Attraction_Review-g255057' + link.get('href')
        print(iurl)
Basically, I want the href of each item under 'things to do'.
My desired output for 'things to do' is:
https://www.tripadvisor.com/Attraction_Review-g255057-d3377852-Reviews-Weston_Park-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Attraction_Review-g255057-d591972-Reviews-Canberra_Museum_and_Gallery-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Attraction_Review-g255057-d312426-Reviews-Lanyon_Homestead-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Attraction_Review-g255057-d296666-Reviews-Australian_National_University-Canberra_Australian_Capital_Territory.html
For comparison, I used the code below to get the href of each restaurant in Canberra.
My restaurant code, which works perfectly, is:
import requests
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
with requests.Session() as session:
    for offset in range(0, 1050, 30):
        url = 'https://www.tripadvisor.com/Restaurants-g255057-oa{0}-Canberra_Australian_Capital_Territory.html#EATERY_LIST_CONTENTS'.format(offset)
        soup = BeautifulSoup(session.get(url).content, "html.parser")
        for link in soup.select('a.property_title'):
            iurl = 'https://www.tripadvisor.com/' + link.get('href')
            print(iurl)
The output of the restaurant code is:
https://www.tripadvisor.com/Restaurant_Review-g255057-d1054676-Reviews-Lanterne_Rooms-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Restaurant_Review-g255057-d755055-Reviews-Courgette_Restaurant-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Restaurant_Review-g255057-d6893178-Reviews-Pomegranate-Canberra_Australian_Capital_Territory.html
https://www.tripadvisor.com/Restaurant_Review-g255057-d7262443-Reviews-Les_Bistronomes-Canberra_Australian_Capital_Territory.html
...
OK, it's not that hard; you just have to know which tags to use.
Let me explain with this example:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.tripadvisor.com/' ## we need this to join the links later ##
main_page = 'https://www.tripadvisor.com/Attractions-g255057-Activities-oa{}-Canberra_Australian_Capital_Territory-Hotels.html#ATTRACTION_LIST_CONTENTS'
links = []
## get the initial page to find the number of pages ##
r = requests.get(main_page.format(0))
soup = BeautifulSoup(r.text, "html.parser")
## select the last page from the list of pages ('a', {'class':'pageNum taLnk'}) ##
last_page = max([ int(page.get('data-offset')) for page in soup.find_all('a', {'class':'pageNum taLnk'}) ])
## now iterate over that range (first page, last page, number of links), and extract the links from each page ##
for i in range(0, last_page + 30, 30):
    page = main_page.format(i)
    soup = BeautifulSoup(requests.get(page).text, "html.parser") ## get the next page and parse it with BeautifulSoup ##
    ## get the hrefs from ('div', {'class':'listing_title'}), and join them with base_url to make the links ##
    links += [ base_url + link.find('a').get('href') for link in soup.find_all('div', {'class':'listing_title'}) ]
for link in links:
    print(link)
That gives us 8 pages and 212 links in total (30 on each page, 2 on the last).
I hope this clears things up a bit.
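The page arithmetic can be checked without touching the network. Assuming, as in the answer, 30 listings per page and a largest data-offset of 210 (the value implied by 8 pages), the offsets and the total work out as follows. page_offsets is a hypothetical helper that mirrors the range(0, last_page + 30, 30) call.

```python
def page_offsets(last_offset, page_size=30):
    """Offsets of every results page, from 0 up to and including last_offset."""
    return list(range(0, last_offset + page_size, page_size))


offsets = page_offsets(210)
print(offsets)       # [0, 30, 60, 90, 120, 150, 180, 210]
print(len(offsets))  # 8

# 7 full pages of 30 links plus 2 links on the last page.
print((len(offsets) - 1) * 30 + 2)  # 212
```

Adding page_size to the stop value matters because range excludes its end point; without it the last page would be skipped.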
