I want to scrape the Google Translate website and extract the translated text from it using Python 3.
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen  # avoid aliasing urlopen to the built-in open
my_url = "https://translate.google.com/#en/es/I%20am%20Animikh%20Aich"
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
uClient = urlopen(req)
page_html = uClient.read()
uClient.close()
html = soup(page_html, 'html5lib')
print(html)
Unfortunately, I am unable to find the required information in the parsed webpage.
In Chrome's "Inspect" view, the translated text shows up inside:
<span id="result_box" class="short_text" lang="es"><span class="">Yo soy Animikh Aich</span></span>
However, when I search for the same information in the parsed HTML, this is all I find:
<span class="short_text" id="result_box"></span>
I have tried parsing with html5lib, lxml, and html.parser, but none of them gives me the translated text.
Please help me with this issue.
You could use a dedicated Python API instead:
import goslate
gs = goslate.Goslate()
print(gs.translate('I am Animikh Aich', 'es'))
Yo soy Animikh Aich
Try the script below to get the desired content:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://translate.google.com/#en/es/I%20am%20Animikh%20Aich")
soup = BeautifulSoup(driver.page_source, 'html5lib')
item = soup.select_one("#result_box span").text
print(item)
driver.quit()
Output:
Yo soy Animikh Aich
JavaScript modifies the HTML after the page loads. urllib can't execute JavaScript, so you'll have to use Selenium to get the data you want.
For installation and a demo, refer to this link.
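If you go the Selenium route, an explicit wait is usually more reliable than reading page_source right away; here is a minimal sketch, assuming the "#result_box span" selector from the question still matches the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://translate.google.com/#en/es/I%20am%20Animikh%20Aich")
# Wait up to 10 seconds for the translated span to appear instead of sleeping.
result = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#result_box span"))
)
print(result.text)
driver.quit()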
For context, I am pretty new to Python. I am trying to use bs4 to parse some data out of https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles
To be exact, I want to obtain the 57% number in the "paying" section of the webpage.
My problem is that bs4 only returns the first layer of the HTML, while the data I want is deeply nested in the code. I think it's under 17 divs.
Here is my Python code:
import requests
import bs4
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "html.parser")
print(soup.find_all("div", {"id": "gwtDiv"}))
(This returns [<div class="clearfix margin60 marginBottomOnly" id="gwtDiv" style="min-height: 300px;height: 300px;height: auto;"></div>]; none of the elements inside it are shown.)
If the page uses JavaScript to render content inside the element, requests will not be able to get that content, since it is rendered client-side in the browser. I'd recommend using ChromeDriver and Selenium along with BeautifulSoup.
You can download ChromeDriver from here: https://chromedriver.chromium.org/
Put it in the same folder in which you're running your program.
Try something like this:
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(sel_soup.find_all("div", {"id": "gwtDiv"}))  # use sel_soup, not the undefined soup
driver.quit()
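As an aside, driver.execute_script("return document.documentElement.outerHTML") and driver.page_source return essentially the same serialized DOM here, so either works; page_source is just the shorter spelling.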
I am trying to make a simple web scraper that takes information off an HTML page. It's simple, but there is a problem I can't seem to solve:
When I download the HTML page myself and parse it with BeautifulSoup, it parses everything and gives me all the data; that works, but it isn't what I need. Instead I am trying to use a link, which doesn't seem to be working. Whenever I open the URL with urlopen and parse the page with BeautifulSoup, it always seems to completely ignore/exclude some lists and tables from the HTML. Those tables appear when I look at the page with "Inspect Element", and they also appear when I download the HTML page myself, but they never appear when I use urlopen. I even tried encoding POST data and sending it as an argument to the function, but that doesn't work either.
import bs4
from urllib.request import urlopen as uReq
from urllib.parse import urlencode as uEnc
from bs4 import BeautifulSoup as soup
my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
#data = {'tid':'RB961555017SG'}
#sdata = uEnc(data)
#sdata = bytearray(sdata, 'utf-8')
uClient = uReq(my_url, timeout=2) # opening url or downloading the webpage
page_html = uClient.read() # saving html file in page_html
uClient.close() # closing the connection
page_soup = soup(page_html, "html.parser") # parsing the downloaded HTML
updates = page_soup.findAll("div",{"class":"col-sm-12"})
#updates = page_soup.findAll("ol", {})
print(updates)
These tables contain the information I need; is there any way I can fix this?
requests works a bit differently than a browser; for example, it does not actually run JavaScript.
In this case the table with the info is generated by a script rather than hard-coded in the HTML. You can see the actual source code by prefixing the URL with "view-source:":
view-source:https://sp.com.sa/en/tracktrace/?tid=RB961555017SG
So we need to run that script somehow. The easiest way is to use Selenium, which drives your own browser. Then simply take the loaded HTML and run it through BS4. I also noticed that there is a more specific tag you can use than "col-sm-12". Hope it helps :)
from bs4 import BeautifulSoup as soup
from selenium import webdriver
import time

my_url = 'https://sp.com.sa/en/tracktrace/?tid=RB961555017SG'
chrome_path = "path of chrome driver that fits your current browser version"
driver = webdriver.Chrome(chrome_path)
driver.get(my_url)
time.sleep(5)  # to make sure the page is fully loaded
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser")  # parsing the rendered HTML
updates = page_soup.findAll("table", {"class": "table-shipment table-striped"})  # this table is more specific to the data you want than col-sm-12
print(updates)
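As a side note, if you'd rather not have a browser window pop up while this runs, Chrome can run headless. A minimal sketch (the scraping part stays the same; on older Selenium versions the keyword is chrome_options rather than options):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # render the page without opening a window
driver = webdriver.Chrome(chrome_path, options=options)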
I tried using Beautiful Soup to parse a website, but when I printed page_soup I only got a portion of the HTML: the beginning portion, which has the info I need, was omitted. No one answered my question. After doing some research I tried using Selenium to access the full HTML, but I got the same result. Below are both of my attempts, with Selenium and with Beautiful Soup. When I print the HTML it starts in the middle of the source code, skipping the doctype, lang, etc. at the top.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
browser.get('https://coronavirusbellcurve.com/')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')  # specify a parser explicitly
print(soup)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

# pageRequest was not defined in the original snippet; presumably something
# like this was intended:
pageRequest = Request('https://coronavirusbellcurve.com/')
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
print(page_soup)
The requests module seems to be returning the numbers in the first table on the page, assuming you are referring to the US totals.
import requests
r = requests.get('https://coronavirusbellcurve.com/').content
print(r)
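To actually pull the numbers out rather than printing raw bytes, you could feed the response into BeautifulSoup. A rough sketch, assuming the first <table> on the page is the US totals table you're after:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://coronavirusbellcurve.com/')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table')  # assumption: the first table holds the US totals
if table is not None:
    for row in table.find_all('tr'):
        print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])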
From the URL in the code, I am ultimately trying to gather all of the players' names from the page. However, when I use .findAll to get the list elements, I have not been successful. Please advise.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
players_url = 'https://stats.nba.com/players/list/?Historic=Y'
# Opening up the Connection and grabbing the page
uClient = uReq(players_url)
page_html = uClient.read()
players_soup = soup(page_html, "html.parser")
# Taking all of the elements from the unordered lists that contains all of the players.
list_elements = players_soup.findAll('li', {'class': 'players-list__name'})
As @Oluwafemi Sule suggested, it is better to use Selenium together with BS:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://stats.nba.com/players/list/?Historic=Y')
soup = BeautifulSoup(driver.page_source, 'lxml')
for div in soup.findAll('li', {'class': 'players-list__name'}):
    print(div.find('a').contents[0])
Output:
Abdelnaby, Alaa
Abdul-Aziz, Zaid
Abdul-Jabbar, Kareem
Abdul-Rauf, Mahmoud
Abdul-Wahad, Tariq
etc.
You can do this with requests alone by pulling directly from the JS file that provides the names.
import requests
import json
r = requests.get('https://stats.nba.com/js/data/ptsd/stats_ptsd.js')
s = r.text.replace('var stats_ptsd = ','').replace('};','}')
data = json.loads(s)['data']['players']
players = [item[1] for item in data]
print(players)
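This works because stats_ptsd.js is just a JavaScript file assigning a JSON literal to a variable (var stats_ptsd = {...};), so stripping the var stats_ptsd = prefix and the trailing semicolon leaves plain JSON that json.loads can parse.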
As @Oluwafemi Sule mentioned in the comment:
The list of players generated in the page is done with JavaScript.
Instead of using Selenium, I recommend the requests-html package, created by the author of the very popular requests. It uses Chromium under the hood to render JavaScript content.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://stats.nba.com/players/list/?Historic=Y')
r.html.render()
for anchor in r.html.find('.players-list__name > a'):
    print(anchor.text)
Output:
Abdelnaby, Alaa
Abdul-Aziz, Zaid
Abdul-Jabbar, Kareem
Abdul-Rauf, Mahmoud
Abdul-Wahad, Tariq
...
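Note that the first call to r.html.render() downloads a Chromium build in the background (requests-html renders pages through pyppeteer), so the first run can take a while.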
When I use the Selenium library to find the number of related channels on a YouTube channel page, it gives me 12, but when I use the requests library it gives me 0.
I want to use requests; please help me if it's possible.
My code
Requests
import requests
from bs4 import BeautifulSoup
import time
r = requests.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about")
soup = BeautifulSoup(r.content, 'html.parser')
bb = soup.find_all("ytd-mini-channel-renderer",class_="style-scope ytd-vertical-channel-section-renderer")
print(len(bb))
Selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about")
soup = BeautifulSoup(driver.page_source, 'html.parser')
bb = soup.find_all("ytd-mini-channel-renderer",class_="style-scope ytd-vertical-channel-section-renderer")
print(len(bb))
Every time I've run into an issue like this, it was because JS was creating the data I was after. If this is the case, you likely won't be able to use requests as it can't handle the JS.
If you navigate to that YouTube page in a browser, "ytd-mini-channel-renderer" exists when you inspect the page, but "view source" gives 0 results. The code you see in "view source" is what requests is getting.
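A quick way to confirm that is to check whether the tag ever appears in the raw HTML that requests receives, e.g.:
import requests

raw = requests.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about").text
# False here means the tag is injected later by JavaScript,
# so requests will never see it.
print("ytd-mini-channel-renderer" in raw)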
Sometimes the issue is caused by the soup object containing different tags from the ones you see in dev tools, which is what is happening in your case. On analysing the soup object you'll notice the information you need is actually in <h3 class="yt-lockup-title ">.
This code will pull the results you want:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.youtube.com/channel/UCoykjkkJxsz7JukJR7mGrwg/about")
soup = BeautifulSoup(r.content, 'html.parser')
bb = soup.find_all('h3', class_='yt-lockup-title')
print(len(bb))
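Keep in mind that YouTube serves different markup to different clients (a script-free fallback for plain HTTP clients, the full JavaScript app for real browsers) and changes it regularly, so a selector like yt-lockup-title can stop matching without warning.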