Beautiful Soup: \u003c appears and messes up find_all? - python

I've been working on a web scraper for top news sites. Beautiful Soup in Python has been a great tool, letting me get full articles with very simple code. BUT:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = session.get(article_url, headers=request_header).text
soup = BeautifulSoup(source, 'lxml')

# get all <p> paragraphs from the article
paragraphs = soup.find_all('p')

# print each paragraph as a line
for paragraph in paragraphs:
    print(paragraph)
This works great on most news sites I've tried, BUT for some reason the AP site gives me no output at all. Which is strange, because the exact same code works on maybe 10 other sites like the NYT, WaPo, and The Hill. And I can't figure out why.
Where every other site prints out all the paragraphs, AP prints nothing. But when I look at the soup variable, here is the kind of thing I see:
address the pandemic.\u003c/p>\u003cdiv class=\"ad-placeholder\">\u003c/div>\u003cp>Instead, public schools
Clearly what's happening is that the < symbol is arriving as the escape sequence \u003c, and because of that find_all('p') can't find the HTML tags. But for some reason only the AP site is doing it. When I inspect the AP website in the browser, its HTML has the same symbols as all the other sites.
Does anyone have any idea why this is happening? Or what I can do to fix it? Because I'm seriously confused.

For me, at least, I had to extract a JavaScript object containing the data with a regex, parse it with the json module, grab the value holding the page HTML as you see it in the browser, soup that, and then extract the paragraphs. I removed the retries stuff; you can easily re-insert it.
import requests
#from requests.adapters import HTTPAdapter
#from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import re, json

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}
source = requests.get(article_url, headers=request_header).text

# the article data lives in a JavaScript assignment; capture everything after the '='
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
content = content[list(content.keys())[0]]

# 'storyHTML' holds the article markup as it appears in the browser
soup = BeautifulSoup(content['storyHTML'], 'lxml')
for p in soup.select('p'):
    print(p.text.strip())
Regex: the pattern captures everything after window['titanium-state'] = on that line. What it captures is one big JSON blob, and inside a JSON string the < character is written as the escape \u003c, which is exactly what was showing up in your soup.
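To see the mechanism in isolation, here is a minimal sketch (the sample string is made up, not taken from the AP page); json.loads turns the \u003c escapes back into real tags:
import json

# a raw string keeps the backslash escapes exactly as they appear in the page source
escaped = r'"\u003cp\u003eSome text\u003c/p\u003e"'
print(json.loads(escaped))  # -> <p>Some text</p>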

Related

How can I get URLs from Oddsportal?

How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages e.g. the first link of the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code this in Python, as I am pretty comfortable with it (more so than with other languages, though not at what I would call a comfortable level).
After clicking on a link, I can see via inspect element that the links are there to be scraped; however, I am very new to this.
Please help
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs

url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
base_url = 'https://www.oddsportal.com'

# sport links on the results page are marked with a foo="f" attribute
a = soup.find_all('a', attrs={'foo': 'f'})

# This set will have all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
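To check what was collected, you can print the set:
for link in sorted(s):
    print(link)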
Since you are new to web scraping, I suggest you go through these:
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers; see the sketch after this list.
Docs: https://selenium-python.readthedocs.io/
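One thing worth knowing about the paginated results URLs in the question: everything after the # is a fragment, which is never sent to the server, so requests cannot fetch those pages at all; the site's JavaScript reads the fragment and renders the page in the browser. Here is a minimal Selenium sketch, assuming selenium 4 and a Chrome driver are installed (the URL is one of the example pages from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get('https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/')
driver.implicitly_wait(10)   # element lookups below wait up to 10 s for the JavaScript-rendered content

# collect every link the rendered page exposes; filter as needed
links = {a.get_attribute('href')
         for a in driver.find_elements(By.TAG_NAME, 'a')
         if a.get_attribute('href')}
driver.quit()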

How to extract beer names with XPATH in Python from this hungarian webshop?

I am trying to extract product information for beers from the Hungarian webshop Spar. My code only seems to extract the product names when I don't filter to beers: with page_url = ('https://www.spar.hu/onlineshop/') the code works fine, but with the beer-category link below it extracts nothing. I have also tried the link for the whole alcohol category, with the same result as for the beers.
The code itself:
import requests
from scrapy import Selector

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}
page_url = 'https://www.spar.hu/onlineshop/alkoholos-italok/sor-cider/c/H8-6/?page=1'
html = requests.get(url=page_url, headers=headers).text
sel = Selector(text=html)
for card in sel.xpath('//div[@class="productBox"]'):
    name = card.xpath('//label[@class="productTitle"]/a/text()').extract()
    print(name)
When you send a GET request, the URL you mentioned above only returns its source code.
The source code doesn't include the products, only JavaScript which then loads the products from a different URL.
If you open your URL in a web browser and press CTRL+U, you will see the source code.
If you drill into the page by opening the developer tools in your browser (F12) and reloading, you can see all the content that the page actually loads.
Using the network inspection in the developer tools you will come across this URL: https://sp1004e537.guided.lon5.atomz.com/?page=1&category=H8-6&callback=parseResponse&sp_cs=UTF-8&sort=product-ecr-sortlev&m_sortProdResults_egisp=a&pos=6108207&callback=parseResponse&_=1606416848605
This is the actual URL which returns the list of products as JSON.
Here is the complete code for this page:
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}
page_url = 'https://sp1004e537.guided.lon5.atomz.com/?page=1&category=H8-6&sp_cs=UTF-8&sort=product-ecr-sortlev&m_sortProdResults_egisp=a&pos=6108207&_=1606416848605'
r = requests.get(url=page_url, headers=headers)
if r.status_code == 200:
    # strip the non-JSON JSONP wrapper parseResponse( ... ) from the beginning and end
    j = json.loads(r.text[14:-1])
    for beer in j['results']:
        print(beer['title-hu'])
I highly recommend exploring the dictionary "beer" in a debugger, since it includes a lot more information.
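For a quick look, you could also pretty-print one product as JSON (a sketch; the exact field names depend on the response you actually get):
import json

# dump the first product with every field; ensure_ascii=False keeps the Hungarian accents readable
print(json.dumps(j['results'][0], indent=2, ensure_ascii=False))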
You could just add a . before your second XPath, use get() instead of extract(), and do a little bit of trimming on the result string. Without the leading dot, the inner XPath searches the whole document instead of the current card, so it doesn't select relative to each productBox.
With the following code:
for card in sel.xpath('//div[@class="productBox"]'):
    name = card.xpath('.//label[@class="productTitle"]/a/text()').get()
    print(name)
I get this result:
Tchibo Cafissimo Special őrölt, pörkölt kávékapszula 6 x 10 db 471 g

my python app doesn't work and give a None as answer

Hi, I would like to know why my app is giving me this error. I've already tried everything I found on Google and still have no idea why this is happening.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/ref=sr_1_1_sspa?dchild=1&keywords=type+c&qid=1598485860&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE3TDNSNUlITUNKTUMmZW5jcnlwdGVkSWQ9QTAwNDg4MTMyUlFQN0Y4RllGQzE2JmVuY3J5cHRlZEFkSWQ9QTAxNDk0NzMyMFNLSUdPU0taVUpRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
When called:
C:\Proyecto nuevo>Python main.py
None
So if anyone would like to help me, that would be amazing!!
If you look at the code of the webpage you are trying to scrape, you will find that it is pretty much all JavaScript that populates the page when it loads. The requests library fetches that code but doesn't run it. Your find on productTitle returns None because the fetched code doesn't contain that element.
To scrape this page you will have to run the JavaScript on it. You can check out the Selenium WebDriver in Python to do this.
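A minimal sketch of that approach, assuming selenium 4 and a Chrome driver are installed; the /dp/ URL is a shortened form of the one in the question and should redirect to the same product page:
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://www.amazon.co.uk/dp/B0828RYQ7W'  # shortened form of the question's URL

driver = webdriver.Chrome()
driver.get(URL)
driver.implicitly_wait(10)  # the lookup below waits up to 10 s for the JavaScript to populate the page
title = driver.find_element(By.ID, 'productTitle')
print(title.text.strip())
driver.quit()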

No results in scraping bing search

I use the code below to scrape results from Bing, and when I look at the scraped web page it says "There are no results for python".
But when I search in the browser there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and didn't find any similar problem.
You need to pass a user-agent with the request to get results.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning JavaScript generates the code, you won't be able to scrape it using only BeautifulSoup. Instead, I recommend Selenium, which opens a browser that you can control; you can then parse the rendered code with BeautifulSoup.
The same goes for any other dynamically coded website, including Google and many others.
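A minimal sketch of that combination, assuming selenium 4 and a Chrome driver are installed: Selenium renders the page, then BeautifulSoup parses the rendered HTML.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get('https://www.bing.com/search?q=python&setlang=en-us')

# hand the rendered HTML over to BeautifulSoup for the usual parsing
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())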
It's probably because there's no user-agent being passed into the request headers (as already mentioned by KunduK). When no user-agent is specified, the requests library defaults to python-requests, so Bing and other search engines recognize that it's a bot/script and block the request. Check what your user-agent is.
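You can check requests' default from Python itself (the version number in the output will vary):
import requests

print(requests.utils.default_user_agent())  # e.g. 'python-requests/2.25.0'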
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
See also: how to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer: I work for SerpApi.

How to get the paragraph having strong tag and the three paragraphs below it using BeautifulSoup and Python requests?

Screenshot of the paragraphs I want to get using BeautifulSoup:
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://www.bbc.co.uk/learningenglish/english/features/the-english-we-speak/ep-200601'
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find('p', text=re.compile('^Examples'))
print(data)
This script will get all paragraphs under the header "Examples":
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/learningenglish/english/features/the-english-we-speak/ep-200420'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def get_paragraphs(first_paragraph):
    # yield paragraphs from first_paragraph onward, stopping at the next
    # section header (a <p> whose <strong> has non-empty text) or at the end
    current_paragraph = first_paragraph
    while True:
        yield current_paragraph
        current_paragraph = current_paragraph.find_next('p')
        if not current_paragraph or (current_paragraph.strong and current_paragraph.strong.text.strip() != ''):
            break

for p in get_paragraphs(soup.select_one('p:contains("Examples")')):
    print(p)
Prints:
<p><strong>Examples</strong></p>
<p><br/>My friend keeps banging on about where he’s going to go when he buys his new car. It’s really frustrating.</p>
<p>That person on the bus was really annoying. She kept banging on about how the prices had gone up.</p>
<p>Will you please stop banging on about my project!? If you think you could do a better job, you can do my work for me.</p>
For url = 'https://www.bbc.co.uk/learningenglish/english/features/the-english-we-speak/ep-200601' it prints:
<p><strong>Examples<br/></strong>He’s always flexing on social media, - showing these glamorous pictures of his holidays! </p>
<p>People who flex are so annoying! It’s just showing off. </p>
<p>She can’t stop flexing about her new house, but it’s actually not that nice.<strong> </strong></p>
