BeautifulSoup cannot read hyphens in text? - python

I'm trying to scrape imdb.com with BeautifulSoup in Python, but some HTML tags contain hyphens ('-') in their text, and BeautifulSoup does not seem to read them.
The page I was trying to scrape: click here
I'm trying to extract "TV-MA" from tag below
<span class="certificate">TV-MA</span>
So I extract it with code like this:
item.find("span",{"class": "certificate"}).text
But the code above fails with a NoneType object error, which means find did not locate the span tag. The span was not in the raw HTML either (I checked by printing the HTML). Yet when I inspect the element in the browser, the span tag is there.
I've tried the same span with other text that contains "-", such as:
<span class="certificate">PG-13</span>
There, the code above works (it returns "PG-13"). So I don't think the problem is with the code.

This worked for me:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.imdb.com/search/title/?genres=documentary')
soup = BeautifulSoup(r.text, 'html.parser')
text = soup.find("span",{"class": "certificate"}).text
print(text)
Prints:
TV-MA
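If the span is missing from the HTML that requests receives, find returns None and reading .text fails, which is the error described in the question. A minimal sketch of a guard follows. It uses an inline HTML string instead of the live page; the sample markup and the certificate_of helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one item has a certificate span, the other does not.
html = """
<div class="item"><span class="certificate">TV-MA</span></div>
<div class="item"><span class="runtime">90 min</span></div>
"""


def certificate_of(item):
    """Return the certificate text, or None if the span is absent."""
    span = item.find("span", {"class": "certificate"})
    return span.text if span is not None else None


soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("div", {"class": "item"})
certificates = [certificate_of(item) for item in items]
print(certificates)
```

The hyphen plays no role here: the first item yields "TV-MA" and the second yields None only because its span is not in the markup.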

Related

Website scraping with Python, how do I know where to reference in the html?

I'm a complete beginner who has only built basic Python projects. Right now I'm building a scraper in Python with bs4 to help me read success stories off of a website. These success stories are all in a table, so I thought I would find an html tag that said table and would encompass the entire table.
However, it is all just <div> and <span> tags, and when I use soup.find("div") or soup.find("span") my script prints only the single word "div" or "span". This is what I have so far. I know it isn't right or set up correctly, but I'm too inexperienced to know why yet.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
req = Request('https://www.calix.com/about-calix/success-stories.html', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "lxml")
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
print('div')
I have watched several tutorials on how to use bs4 and I have successfully scraped basic websites, but all I can do for this one is get ALL of the html, not the chunks I need (just the success stories).
You are printing the string 'div'. Print the result of soup.find() instead: find() returns the matching tag, it does not change soup.
You should have a look at the bs4 documentation.
soup.find("div", {"id": "content-calix-en-site-prod-home-about-calix-success-stories-jcr-content"})
Here you're calling soup.find() but you're not saving the results into a variable, so the results are lost.
print('div')
And here you're printing the literal string div. I don't think that's what you intended.
Try something like this:
div = soup.find("div", {"id": "..."})
print(div)
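Once the container is stored in a variable, the individual entries can be pulled out of it with find_all. The following is only a sketch on an inline HTML string; the id "stories" and the class "title" are hypothetical stand-ins, not the real markup of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """
<div id="stories">
  <span class="title">Story one</span>
  <span class="title">Story two</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep the result of find(); it is None when nothing matches.
container = soup.find("div", {"id": "stories"})

# find_all() returns every matching tag inside the container.
titles = [
    span.get_text(strip=True)
    for span in container.find_all("span", {"class": "title"})
]
print(titles)
```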

Extract data from span tag within a div tag

I have the html code below and I'm trying to extract the 3:40 as text to use in my python script. How would I go about grabbing that info?
example element
I would use the BeautifulSoup library. Here is how I would grab this info, assuming you already have the HTML file:
from bs4 import BeautifulSoup
# html_path is the path to your saved HTML file
with open(html_path) as html_file:
    html_page = BeautifulSoup(html_file, 'html.parser')
div = html_page.find('div', class_='playbackTimeline__duration')
span = div.find('span', {'aria-hidden': 'true'})
text = span.get_text()
I'm not sure if it works, but it gives you an idea on how to do this kind of stuff. Check for "web scraping" if you want more information about that. :)
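To check the approach without the original file, the same lookups can be run on an inline string. The markup below is an assumption reconstructed from the class and attribute names used in the answer, since the original HTML is not shown.

```python
from bs4 import BeautifulSoup

# Assumed markup: only the class name and the aria-hidden attribute
# come from the answer above, the rest is invented for the test.
html = """
<div class="playbackTimeline__duration">
  <span class="hidden-label">Duration: 3 minutes 40 seconds</span>
  <span aria-hidden="true">3:40</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="playbackTimeline__duration")
span = div.find("span", {"aria-hidden": "true"})
duration = span.get_text(strip=True)
print(duration)
```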

Using BeautifulSoup4 with Google Translate

I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.
I inspected the html content of a page where 'Explanation' is the translated word:
<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>
Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:
soup.select('span[id="result_box"] > span')
soup.select('span span')
I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.
Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4 but I think I am using BeautifulSoup more or less correctly because the selector
soup.select('span[id="result_box"]')
gets me the outer span element**
[<span class="short_text" id="result_box"></span>]
**Not sure why the 'lang="en"' part is missing, but I am fairly certain I have located the correct element regardless.
Here is the complete code:
import bs4, requests
url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)
EDIT: If I save the Google Translate page as an offline html file and then make a soup object out of that html file, there would be no problem locating the element.
import bs4
file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)
The result_box span is the correct element, but your code only works on the saved page because saving from the browser includes the dynamically generated content. With requests you get only the page source, without any dynamically generated content. The translation is generated by an ajax call to the url below:
"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
For your requests it returns:
[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
So you will either have to mimic the request, passing all the necessary parameters, or use something that supports dynamic content, like selenium.
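The difference between the fetched source and the saved page can be reproduced offline. In this sketch the two HTML strings are simplified from the snippets in the question, and the translated_words helper is a made-up name.

```python
from bs4 import BeautifulSoup

# What requests receives: the result box is still empty.
static_html = '<span id="result_box" class="short_text" lang="en"></span>'

# What the browser holds after JavaScript has filled in the translation.
rendered_html = (
    '<span id="result_box" class="short_text" lang="en">'
    "<span>Explanation</span></span>"
)


def translated_words(html):
    """Return the text of every span nested inside #result_box."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text() for span in soup.select("#result_box span")]


static_result = translated_words(static_html)
rendered_result = translated_words(rendered_html)
print(static_result)
print(rendered_result)
```

The selector is the same in both calls; only the input differs, which is why the question's code succeeds on the saved file and returns nothing for the live request.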
Simply try this (note that it raises an IndexError if the selector matches nothing):
translation = soup.select('#result_box span')[0].text
print(translation)
You can try this different approach:
if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.findAll('title'):
            recursively_translate(title)
For the complete code, please see here:
https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html
or here:
https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the BeautifulSoup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source":
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full HTML of the last element I've isolated, whereas I only want the link itself. Of note, this is the only link on the page I'm trying to get.
You're looking for the link which is the href html attribute. source_url is a bs4.element.Tag which has the get method like:
source_url.get('href')
You almost got it!!
SOLUTION 1:
You just have to read the .text attribute of the tag you've assigned to source_url. This works here because the link's visible text is the URL itself.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get the href attribute of the tag returned by findAll.
print(source_url.get('href'))
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
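The two solutions give the same output only when the link text equals the URL. A small sketch on hypothetical markup, where the text and the href differ on purpose, shows which one returns the actual link target; the example.org URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical sidebar fragment; link text and href differ on purpose.
html = """
<div class="widget-content">
  <h4>Original Source</h4>
  <a href="http://example.org/release">Read the release</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("div", class_="widget-content").find_all("a")[-1]

link_text = link.text         # visible text between <a> and </a>
link_href = link.get("href")  # value of the href attribute
link_title = link.get("title")  # a missing attribute gives None, not an error

print(link_text)
print(link_href)
```

For extracting URLs, get('href') is the safer choice of the two, since it does not depend on how the link is labelled.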

Python: Print Specific line of text out of TD tag

This is an easy one, I am sure. I am parsing a website with Python and trying to get specific text between tags. The text will be one of [revoked, Active, Default]. I have been able to print all the inner text, but I have not found a good solution on the web for selecting specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)
for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'active', 'default'])
This will select only those <a …> tags whose text exactly matches one of the given strings. The comparison is case-sensitive.
I've used the snippet below in a similar occasion. See if this works with your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
    print(i)
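Because the question lists the values with mixed capitalisation (revoked, Active, Default), a case-insensitive match may be needed. A sketch of one way to do that with a compiled regular expression passed to find_all follows. The table markup is invented for the example, and string= is the bs4 keyword that newer releases use in place of text=.

```python
import re

from bs4 import BeautifulSoup

# Invented table; only the three status words come from the question.
html = """
<table id="Table3">
  <tr><td><a href="#1">Active</a></td></tr>
  <tr><td><a href="#2">revoked</a></td></tr>
  <tr><td><a href="#3">Details</a></td></tr>
  <tr><td><a href="#4">Default</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the whole link text against the three status words, ignoring case.
status = re.compile(r"^(revoked|active|default)$", re.IGNORECASE)
matches = [a.get_text() for a in soup.find_all("a", string=status)]
print(matches)
```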
