How to locate graphic <canvas> html element for scraping? - python

I'm currently scraping a website dedicated to money conversion using python and library selenium, for that case I need to extract the information received by a graphic which shows the money value hourly, but I've been unable to find where is the function on the component or part of the html scripts that can help me extracting that information.
This is the link to the website: https://dolar.set-icap.com/

Related

How to approach web-scraping in python

I am new to python just started on python web-scraping. I have to scrape data from this realtor site
I need to scrape all the details op read-state agents according to their real-state agency;
For this on the web-browser I have to follow the following instructions
Go to this site
click on agency offices button, enter 4000 pin in search box and then submit.
then we get list of the agencies.
go to our team tab and then we get agents their.
then we have to go to each agents page and record their information.
Can anyone tell me how to approach this. Whats the best way to make this type of scrapers.
Do i have to use selenium for the interaction with the pages.
I have worked on request, BeautifulSoup and simple form submit using mechanize
I would recommend on a searching site that you either use Selenium or Requests with sessions, the advantage of Selenium it it will probably work however it will be slow. For Selenium you should just use the Selenium IDE (Firefox add on) to record what you do then get the HTML from the webpage and use beautifulsoup to parse the data.
If you want to scrape the data quickly and without using much resources I usually use Requests with Sessions. To scrape a website like this you should open up a modern web browser (Firefox, Chrome) and use the network tools for that browser (usually located in developer tools or via right click inspect element). Once you are recording the network you can interact with the webpage to see the connections made to the server. In an example search they may use suggestions e.g
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response then will probably be a JSON of the suggested results. Once you select a suggestion you can just submit a request with that search parameters e.g
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will the be in that HTML page, you just then need to separately send a request to each page to get the information using BeautifulSoup.
You might wanna give Node and Jquery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using node, you can turn the page HTML into a DOM object and then scrape all the data very easily using Jquery. I have done this for imdb here: “Using JQuery & NodeJS to scrape the web” #asimmittal https://medium.com/#asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape yelp

HTML Parsing with Python (HTML vs. complete website)

I'm trying to parse html from a website that contains information about train tickets and there prices (source below), however I'm having an issue getting back all the html from the website when I use urllib to request the html.
What I need is the price per ticket which doesn't seem to appear when I used urllib to request the html. After doing some investigative work, I determined that if I save the webpage with chrome and select "HTML only", I don't get the price, however if I select "Complete WebPage," I do. Is there anyway to view the HTML that I get when I download the "Complete Webpage" and use that in python. Or is there a way to automate the downloading of the complete webpage and use the downloaded files to parse in python.
Thanks,
George
https://www.raileurope.com/en/us/point_to_point/ptp_results.htm?execution=e3s1&resultId=147840746&cobrand=public&saleCountry=us&resultId=147840746&cobrand=public&saleCountry=us&itemId=-1&fn=fsRequest&cobrand=public&c=USD&roundtrip=0&isAtocRequest=0&georequest=1&lang=en&route-type=0&from0=paris&to0=amsterdam&deptDate0=06%2F07%2F2017&time0=8&pass-question-radio=1&nCountries=&selCountry1=&selCountry2=&selCountry3=&selCountry4=&selCountry5=&familyId=&p=0&additionalTraveler0=adult&additionalTravelerAge0=&paxIds=&nA=1&nY=0&nC=0&nS=0
Take a look at selenium
Since the website is rendered by JS, you will have to use a webdriver to simulate the "Click".
You will need a crawler instead of a simple scraper

Scrape PDFs inside viewer frame

(Complete begginer in web scraping here)
I'm trying to scrape the PDF from this webpage using python:
http://pesquisa.in.gov.br/imprensa/jsp/visualiza/index.jsp?jornal=3&pagina=1&data=31/03/1993
The problem is that the above URL points to the viewer (with date-page parameters), not the PDF file. I tried to inspect the html code to see the URL to the PDF directly, but could not.
any help on how to find the correct URL and implement a way to download them in python?
Edit:
I will later generalize this to other days and pages, the full list of day-page links can be found by searching for the relevant period here: http://portal.imprensanacional.gov.br/

How to extract request url using python selenium

I am new to webscraping and need some help to extract a request-url from the online movie-stream website YIFY. I am familiar with how selenium works and I am trying to find the download url of the movie Revenant.
Using python-selenium I can click on the play icon and if you open your inspect element and go the network tab then you can see the request-url but you can't do inspect element and find it.
Download link - http://download1282.mediafire.com/3pvv1jx9z23g/crdad7bg0ghjh7r/vid.pdf
I am trying to extract this particular download link using python-selenium. Could anyone tell me if it is possible? Well I am not trying to download the movie but checking if it is possible to download the links from it. Here the links are not embedded in the html page, and I will highly appreciate any help.

Python scraping using inspect element or firebug

As I am going through this youtube scraping tutorial https://www.youtube.com/watch?v=qbEN3boz7_M, I was introduced that instead of scraping from the "public" page loaded heavily with all other stuff, there is a way to find a "private" page to scrape the necessary information much more efficiently using inspect element/firebug.
google chrome > inspect element > network > XHR
The person in the youtube video uses stock price as an example and be able to locate a "private" page to scrape much quickly and less intensive to the server. Though when I tried to look at sites I wanted to scrape, for example, http://www.rottentomatoes.com/m/grigris/, going through the inspect element (chrome) > Network > XHR > checking the headers' request URL and preview, I didn't seem to find anything useful.
Am I missing something? How can I ensure if a raw or condensed information is hidden somewhere? Using the Rottentomatoes.com page as an example, how can I tell if there is 1) a "private page" that gives the title and year of the movie and 2) a summary page (in csv-like format) that "stores" all the movies' titles and year in one page?
You can only find XHR requests, if the page is dynamically loading data. In your example, the only thing of note is this URL:
http://www.rottentomatoes.com/api/private/v1.0/users/current/ratings/771355871
Which contains some information about the movie in JSON.
{"media":{"type":"movie","id":771355871,"title":"Grigris","url":"http://www.rottentomatoes.com/m/grigris/","year":2014,"mpaa":"Unrated","runtime":"1 hr. 40 min.","synopsis":"Despite a bum leg, 25-year-old Grigris has hopes of becoming a professional dancer, making some extra cash putting his killer moves to good use on the...","thumbnail":"http://content6.flixster.com/movie/11/17/21/11172196_mob.jpg","cast":[{"name":"Souleymane Démé","id":"771446344"},{"name":"Anaïs Monory","id":"771446153"}]}}
Make sure you have the chrome developer tools open when you load the site. If not, the developer tools don't capture any requests. You can open them and refresh the page, then you should see them under the XHR filter.

Categories