python scrape a class href - python

I want to scrape the href link using python3
existing code:
import lxml.html
import requests
dom = lxml.html.fromstring(requests.get('https://www.tripadvisor.co.uk/Search?singleSearchBox=true&geo=191&pid=3825&redirect=&startTime=1576072392277&uiOrigin=MASTHEAD&q=the%20grilled%20cheese%20truck&supportedSearchTypes=find_near_stand_alone_query&enableNearPage=true&returnTo=https%253A__2F____2F__www__2E__tripadvisor__2E__co__2E__uk__2F__&searchSessionId=AF4BFA0308CF336B90FD9602FA122CD11576072382852ssid&social_typeahead_2018_feature=true&sid=AF4BFA0308CF336B90FD9602FA122CD11576072410521&blockRedirect=true&ssrc=a&rf=1').content)
result = dom.xpath("//a[#class='review_count']/#href")
print (result)
from this code:
<a class="review_count" href="/Restaurant_Review-g54774-d10073153-Reviews-The_Grilled_Cheese_Truck-Rapid_City_South_Dakota.html#REVIEWS" onclick="return false;" data-clicksource="ReviewCount">3 reviews</a>
with my existing code I'm getting empty print
i have located the link here yet:
widgetEvCall('handlers.openResult', event, this, '/Restaurant_Review-g54774-d10073153-Reviews-The_Grilled_Cheese_Truck-Rapid_City_South_Dakota.html', {type: 'EATERY',element: this,index: 0,section: 1,locationId: '10073153',parentId: '54774',elementType: 'title',selectedId: '10073153'});
so will need help on this , in this case will be even better to get locationId and selectedId to print
any ideas ?

The problem you're having is because the data is loaded over javascript - try viewing the page with javascript disabled
You could try using a tool that will function with javascript eg. selenium - https://selenium-python.readthedocs.io/
Or try to track down where the JavaScript is loading the data from and then request that directly using python

Related

Webscraping with BeautifulSoup, can't find table within html

I am trying to webscrape the main table from this site: https://www.atptour.com/en/stats/leaderboard?boardType=serve&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false
Here is my code:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.atptour.com/en/stats/leaderboard?boardType=serve&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false"
request = requests.get(url).text
soup = BeautifulSoup(request, 'lxml')
divs = soup.findAll('tbody', id = 'leaderboardTable')
print(divs)
However, this is the only output of this:
How do I access the rest of the html? It appears to not be there when I search through the soup. I have also attached an image of the html I am seeking to access. Any help is appreciated. Thank you!
There is an ajax request that fetches that data, however it's blocked by cloudscraper. There is a package that can bypass that, however doesn't seem to work for this site.
What you'd need to do now, is use something like Selenium to allow the page to render first, then pull the data.
from selenium import webdriver
import pandas as pd
browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get("https://www.atptour.com/en/stats/leaderboard?boardType=serve&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false")
df= pd.read_html(browser.page_source, header=0)[0]
browser.close()
Output:
Your code is working as expected. The HTML you are parsing does not have any data under the table.
$ wget https://www.atptour.com/en/stats/leaderboard\?boardType\=serve\&timeFrame\=52Week\&surface\=all\&versusRank\=all\&formerNo1\=false -O page.html
$ grep -C 3 'leaderboardTable' page.html
class="stat-listing-table-content no-pagination">
<table class="stats-listing-table">
<!-- TODO: This table head will only appear on DESKTOP-->
<thead id="leaderboardTableHeader" class="leaderboard-table-header">
</thead>
<tbody id="leaderboardTable"></tbody>
</table>
</div>
You have shown a screenshot of the developer view that does contain the data. I would guess that there is a Javascript that modifies the HTML after it is loaded and puts in the rows. Your browser is able to run this Javascript, and hence you see the rows. requests of course doesn't run any scripts, it only downloads the HTML.
You can do "save as" in your browser to get the reuslting HTML, or you will have to use a more advanced web module such as Selenium that can run scripts.

Scraping webpage using Python and Requests Package

I am looking to scrape a certain number from a website.
When inspecting in chrome I see the following div I want to pull:
<div class="sc-18nh1jk-0 bTfoun css-1p6fq9y">2472.38</div>
This class name looks weird to me. Here is the code simple code I use to try and pull the '2472.38' number:
from lxml import html
import requests
r = requests.get('MYWEBSITE')
tree = html.fromstring(r.content)
CurrentPrice = tree.xpath('//div[#class="sc-18nh1jk-0 bTfoun css-1p6fq9y"]')
print(CurrentPrice)
output is: []
Any suggestions? Thanks ahead of time!
If you provided the websites url it would have been nice, but I think that this webpage you are trying to scrape is using generated class names which means that the class will be dynamic.

Web scraping google flight prices

I am trying to learn to use the python library BeautifulSoup, I would like to, for example, scrape a price of a flight on Google Flights.
So I connected to Google Flights, for example at this link, and I want to get the cheapest flight price.
So I would get the value inside the div with this class "gws-flights-results__itinerary-price" (as in the figure).
Here is the simple code I wrote:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
But the resulting div has class NoneType.
I also try with
find_all('div')
but within all the div I found in this way, there was not the div I was interested in.
Can someone help me?
Looks like javascript needs to run so use a method like selenium
from selenium import webdriver
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
print(driver.find_element_by_css_selector('.gws-flights-results__cheapest-price').text)
driver.quit()
Its great that you are learning web scraping! The reason you are getting NoneType as a result is because the website that you are scraping loads content dynamically. When requests library fetches the url it only contains javascript. and the div with this class "gws-flights-results__itinerary-price" isn't rendered yet! So it won't be possible by the scraping approach you are using to scrape this website.
However you can use other methods such as fetching the page using tools such as selenium or splash to render the javascript and then parse the content.
BeautifulSoup is a great tool for extracting part of HTML or XML, but here it looks like you only need to get the url to another GET-request for a JSON object.
(I am not by a computer now, can update with an example tomorrow.)

Using BeautifulSoup4 with Google Translate

I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.
I inspected the html content of a page where 'Explanation' is the translated word:
<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>
Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:
soup.select('span[id="result_box"] > span')
soup.select('span span')
I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.
Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4 but I think I am using BeautifulSoup more or less correctly because the selector
soup.select('span[id="result_box"]')
gets me the outer span element**
[<span class="short_text" id="result_box"></span>]
**Not sure why the 'leng="en"' part is missing but I am fairly certain I have located the correct element regardless.
Here is the complete code:
import bs4, requests
url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)
EDIT: If I save the Google Translate page as an offline html file and then make a soup object out of that html file, there would be no problem locating the element.
import bs4
file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)
The result_box div is the correct element but your code only works when you save what you see in your browser as that includes the dynamically generated content, using requests you get only the source itself bar any dynamically generated content. The translation is generated by an ajax call to the url below:
"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
For your requests it returns:
[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
So you will either have to mimic the request, passing all the necessary parameters or use something that supports dynamic content like selenium
Simply try this :
translation = soup.select('#result_box span')[0].text
print(translation)
You can try this diferent aproach:
if filename.endswith(extension_file):
with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
for title in soup.findAll('title'):
recursively_translate(title)
FOR THE COMPLETE CODE, PLEASE SEE HERE:
https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html
or HERE:
https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html

source code of web page not available using urllib.urlopen()

I am trying to get video links from 'https://www.youtube.com/trendsdashboard#loc0=ind'. When I do inspect elements, it displays me the source html code for each videos. In source code retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
It does not display html source for videos. Is there any otherway to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This simple code appears when we inspect elements from browser, but not in source code retrived by urllib
To view the source code you need use read method
If you just use open it gives you something like this.
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source use read
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()
Whenever you compare the source code between Python code and Web browser, dont do it through Insect Element, right click on the webpage and click view source, then you will find the actual source. Inspect Element displays the aggregated source code returned by as many network requests created as well as javascript code being executed.
Keep Developer Console open before opening the webpage, stay on Network tab and make sure that 'Preserve Log' is open for Chrome or 'Persist' for Firebug in Firefox, then you will see all the network requests made.
works for me...
import urllib2
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks as per your edit. I use the library BeautifulSoup to parse the html:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
we also need to decode the data to utf-8.
here is the code:
just use
response.decode('utf-8')
print(response)

Categories