I'm working on a project where I'm scraping data from Trulia.com, and I want to get the maximum page number (the last number in the pagination, shown in the photo below) for a specific location so I can loop through the pages and collect all the hrefs.
To get that last number, I have code that runs as planned and should return an integer, but it doesn't always return the same number. I added the print of the list comprehension to understand what's wrong. Here are the code and the output. The return is commented out, but it should return the last number of the output list as an int.
import requests as r
from bs4 import BeautifulSoup as bs

# req_headers is the dict of request headers defined elsewhere in my script
city_link = "https://www.trulia.com/for_rent/San_Francisco,CA/"

def bsoup(url):
    resp = r.get(url, headers=req_headers)
    soup = bs(resp.content, 'html.parser')
    return soup

def max_page(link):
    soup = bsoup(link)
    page_num = soup.find_all(attrs={"data-testid": "pagination-page-link"})
    print([x.get_text() for x in page_num])
    # return int(page_num[-1].get_text())

for x in range(10):
    max_page(city_link)
I have no clue why it sometimes returns something wrong. The photo above shows the corresponding page.
Okay, if I understand what you want, you are trying to see how many pages of links there are for a given rental location. Assuming the given link is the only one required, this code:
import requests
import bs4

url = "https://www.trulia.com/for_rent/San_Francisco,CA/"
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, features='lxml')

def get_number_of_pages(soup):
    caption_tag = soup.find('div', class_="Text__TextBase-sc-1cait9d-0-div Text__TextContainerBase-sc-1cait9d-1 RBSGf")
    pagination = caption_tag.text
    words = pagination.split(" ")
    values = []
    for word in words:
        if not word.isalpha():
            values.append(word)
    links_per_page = values[0].split('-')[1]
    total_links = values[1].replace(',', '')
    no_of_pages = round(int(total_links)/int(links_per_page) + 0.5)
    return no_of_pages

for i in range(10):
    print(get_number_of_pages(soup))
achieves what you're looking for, and it is repeatable because it doesn't rely on JavaScript; it only reads the pagination caption at the bottom of the page.
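For reference, here is a rough sketch of how that page count could then be used to walk the results pages and collect hrefs, which is what the original question was after. The {page}_p/ URL suffix is my assumption about Trulia's pagination scheme, so treat it as a guess rather than a verified pattern:

# Hypothetical loop over the computed page count; the "{n}_p/" suffix is an
# assumption about Trulia's pagination URLs and may need adjusting.
all_hrefs = []
for page in range(1, get_number_of_pages(soup) + 1):
    page_url = url if page == 1 else url + "{}_p/".format(page)
    page_soup = bs4.BeautifulSoup(requests.get(page_url).content, features='lxml')
    # Collect every href on the page; filter further to keep only listing links.
    all_hrefs += [a['href'] for a in page_soup.find_all('a', href=True)]
print(len(all_hrefs))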
I am trying to scrape news from Reuters, but there is a "click to view more" button at the bottom of the website. I can't figure out how to load the hidden results using Beautiful Soup.
from bs4 import BeautifulSoup
import urllib.request

def scrape_reuters_news(ticker):
    url = "https://www.reuters.com/search/news?sortBy=relevance&dateRange=pastWeek&blob="+ticker
    scraped_data = urllib.request.urlopen(url)
    scraped_data = scraped_data.read()
    parsed_articles = BeautifulSoup(scraped_data, 'lxml')
    links = parsed_articles.find_all("h3")
    articles = []
    titles = []
    title_class = "Text__text___3eVx1j Text__dark-grey___AS2I_p Text__medium___1ocDap Text__heading_2___sUlNJP Heading__base___1dDlXY Heading__heading_2___3f_bIW ArticleHeader__heading___3ibi0Q"
    for link in links:
        paragraphs = ""
        url = "https://www.reuters.com/"+str(link)[41:63]
        scraped_data = urllib.request.urlopen(url)
        scraped_data = scraped_data.read()
        parsed_article = BeautifulSoup(scraped_data, 'lxml')
        article = parsed_article.find_all("p")
        title = parsed_article.select("h1", {"class": title_class})
        titles.append(title[0].text.strip())
        for paragraph in article:
            paragraphs += paragraph.text + " "
        articles.append(paragraphs)
    return titles, articles

# edit
ticker = "apple"
news = scrape_reuters_news(ticker)
When you click "load more", a callback is issued which you can find in the network tab. If you grab the number of results from the search page, you can add it to the callback URL to get all the results in one go. I then use a regex to extract the ids, which I use to reconstruct each detail page URL, and another to extract the titles (headlines).
You would then visit each link to get the paragraph info.
Please note:
There is some de-duplication work to do: different ids can lead to the same content, so perhaps exclude based on title?
You may need to consider whether any pre-processing of ticker needs to happen, e.g. converting to lowercase or replacing spaces with "-" (a small sketch follows below). I don't know all your use cases.
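As a purely illustrative sketch of that second note (the exact normalisation rules depend on your tickers, so this is an assumption, not part of the solution below):

def normalise_ticker(raw_ticker):
    # Hypothetical pre-processing: lowercase and hyphenate spaces, as suggested above.
    return raw_ticker.strip().lower().replace(' ', '-')

print(normalise_ticker('Apple Inc'))  # -> 'apple-inc'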
from bs4 import BeautifulSoup as bs
import requests, re

ticker = 'apple'

with requests.Session() as s:
    r = s.get(f'https://www.reuters.com/search/news?sortBy=relevance&dateRange=pastWeek&blob={ticker}')
    soup = bs(r.content, 'lxml')
    num_results = soup.select_one('.search-result-count-num').text
    r = s.get(f'https://www.reuters.com/assets/searchArticleLoadMoreJson?blob={ticker}&bigOrSmall=big&articleWithBlog=true&sortBy=relevance&dateRange=pastWeek&numResultsToShow={num_results}&pn=&callback=addMoreNewsResults')
    p = re.compile(r'id: "(.*?)"')
    p2 = re.compile(r'headline: "(.*?)"')
    links = [f'https://www.reuters.com/article/id{i}' for i in p.findall(r.text)]
    headlines = [bs(i, 'lxml').get_text() for i in p2.findall(r.text)]
    print(len(links), len(headlines))
From the detail pages you can get the paragraphs with
paras = ' '.join([i.get_text() for i in soup.select('[data-testid*=paragraph-]')])
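Putting the pieces together, here is a rough sketch (my own wiring of the snippets above, reusing links, bs and requests from that block, and not tested end to end) of visiting each reconstructed link and collecting the paragraph text:

articles = []
for link in links:
    # Fetch each reconstructed detail page and join its paragraph text.
    detail_soup = bs(requests.get(link).content, 'lxml')
    paras = ' '.join([i.get_text() for i in detail_soup.select('[data-testid*=paragraph-]')])
    articles.append(paras)
print(len(articles))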
Hi, I wrote a web scraping program and it gets the ASN numbers correctly, but after all the data is scraped, it returns an "Array Out of Bounds" error.
I am using PyCharm and the latest Python version. Below is my code.
There is already a similar issue on Stack Overflow (Web Scraping List Index Out Of Range), but I am not able to put the pieces together and make it work. It's the exact same error, but I am not sure how to fix it for my list.
The error seems to be at current_country = link.split('/')[2].
Any help is appreciated. Thank you.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries')):
        pages.append(link.get('href'))
    return pages

def scrape_pages(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        print(current_country)
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(current_asn)
                """
                name = columns[1].string
                routes_v4 = columns[3].string
                routes_v6 = columns[5].string
                mappings[current_asn] = {'Country': current_country,
                                         'Name': name,
                                         'Routes v4': routes_v4,
                                         'Routes v6': routes_v6}
                return mappings """

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = scrape_pages(country_links)
print(asn_mappings)
The last href on https://ipinfo.io/countries that contains the string "/countries" is actually just "/countries" itself:
<li><a href="/countries">Global ASNs</a></li>
Splitting this link produces the list ["", "countries"], which has no third element. To fix the problem, simply check the list length before retrieving the third element:
...
current_country = link.split('/')
if len(current_country) < 3:
    continue
current_country = current_country[2]
...
Another solution is to exclude the last href by changing the regexp to:
...
for link in page.find_all(href=re.compile('/countries/')):
...
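A quick way to convince yourself that the stricter pattern drops the problematic link while keeping the per-country pages:

import re

pattern = re.compile('/countries/')
print(bool(pattern.search('/countries')))     # False - the bare "Global ASNs" link no longer matches
print(bool(pattern.search('/countries/us')))  # True  - country pages still match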
I'm using BeautifulSoup to pull data out of Reddit sidebars on a selection of subreddits, but my results are changing pretty much every time I run my script.
Specifically, the results in sidebar_urls change from run to run; sometimes it will return [XYZ.com/abc, XYZ.com/def], other times just [XYZ.com/def], and sometimes it will return [].
Any ideas why this might be happening using the code below?
sidebar_urls = []

for i in range(0, len(reddit_urls)):
    req = urllib.request.Request(reddit_urls[i], headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, 'html.parser')
    links = soup.find_all(href=True)
    for link in links:
        if "XYZ.com" in str(link['href']):
            sidebar_urls.append(link['href'])
It seems you sometimes get a page that does not have a sidebar. That could be because Reddit recognizes you as a robot and returns a default page instead of the one you expect. Consider identifying yourself when requesting the pages, using the User-Agent field:
import requests
from bs4 import BeautifulSoup

reddit_urls = [
    "https://www.reddit.com/r/leagueoflegends/",
    "https://www.reddit.com/r/pokemon/"
]

# Update this to identify yourself
user_agent = "me@example.com"

sidebar_urls = []
for reddit_url in reddit_urls:
    response = requests.get(reddit_url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the sidebar tag
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        print("Could not find a sidebar in page: {}".format(reddit_url))
        continue

    # Find all links in the sidebar tag
    link_tags = side_tag.find_all("a")
    for link in link_tags:
        link_text = str(link["href"])
        sidebar_urls.append(link_text)

print(sidebar_urls)
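If, as in the question, you only want links pointing at a particular domain, you could filter before appending; this is a small variant of the inner loop above, with XYZ.com standing in for whatever domain you care about:

for link in link_tags:
    link_text = str(link["href"])
    # Keep only links to the domain of interest, as in the original script.
    if "XYZ.com" in link_text:
        sidebar_urls.append(link_text)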
I'm trying to scrape a list of URLs from the European Parliament's Legislative Observatory. I do not type in any search keyword, so that I get all links to documents (currently 13172). I can easily scrape the first 10 results, which are displayed on the website, using the code below. However, I want to get all the links, so that I do not need to somehow press the next-page button. Please let me know if you know of a way to achieve this.
import requests, bs4, re

# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'

# function gets a list of links to the procedures
def links_to_procedures(url_main):
    # requesting html code from the main search site of the Legislative Observatory
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text)  # loading text into Beautiful Soup
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')]  # getting a list of links of the procedure title
    return links

print(links_to_procedures(url_main))
You can follow the pagination by specifying the page GET parameter.
First, get the results count, then calculate the number of pages to process by dividing that count by the number of results per page. Then iterate over the pages one by one and collect the links:
import re
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content)

# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search(r'(\d+)', num_results).group(1))
print "Results found: " + str(num_results)

results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page

# ceiling division, so a final partial page is not skipped
num_pages = -(-num_results // results_per_page)

links = []
for page in xrange(1, num_pages + 1):
    print "Current page: " + str(page)

    url = base_url.format(page=page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content)

    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]

print links
I'm trying to scrape multiple pages off of a single website for BeautifulSoup to parse. So far, I've tried using urllib2 to do this, but have been encountering some problems. What I've attempted is:
import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]
first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
This only gives me results for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I also made some attempts at using mechanize, but had no success. Ideally, I would like to be able to take a page with a list of links, automatically select a link, pass the HTML off to BeautifulSoup, and then move on to the next link in the list.
You need to put the rest of the code inside the loop. Right now you're iterating over both items in the tuple, but at the end of the iteration only the last item remains assigned to address which subsequently gets parsed outside the loop.
I think you just missed the indentation in the loop:
import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    title = soup.find("span", {"class":"paperstitle"})
    date = soup.find("span", {"class":"docdate"})
    span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
    paras = [x for x in span.findAllNext("p")]
    first = title.string
    second = date.string
    start = span.string
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
    last = paras[-1].contents[0]
    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
I think this should solve the problem.
Here's a tidier solution (using lxml):
import lxml.html as lh

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ['85753', '87433']

def scrape_page(page_id):
    url = root_url + page_id
    tree = lh.parse(url)
    title = tree.xpath("//span[@class='paperstitle']")[0].text
    date = tree.xpath("//span[@class='docdate']")[0].text
    text = tree.xpath("//span[@class='displaytext']")[0].text_content()
    return title, date, text

if __name__ == '__main__':
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)
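As a small follow-up of my own (not part of the original answer), you could collect or print each result inside that loop, for example:

results = []
for page_id in page_ids:
    title, date, text = scrape_page(page_id)
    # Keep the scraped fields together for later processing.
    results.append({'title': title, 'date': date, 'text': text})
    print(title)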