I'd like to scrape content from a site that apparently uses JavaScript to generate its tables (the site is oddsportal.com).
I see that Scrapy can't load dynamic content; I've read that Selenium could handle it, but I'm planning to run this on a web server.
Is there a way I can parse this site, or capture the dynamic request and parse it with Scrapy?
For example, I'd like to import the full table from this page, with the headers, match names and odds:
http://www.oddsportal.com/matches/handball/
From what I understand, you have the constraint that you don't have a real display. You can still go with Selenium: there is the headless PhantomJS browser that can be automated, there is the option of working inside a virtual display, and you can use a remote Selenium server or docker-selenium.
There are multiple examples on how to combine selenium and scrapy, for instance:
selenium with scrapy for dynamic page
Scrapy with Selenium crawling but not scraping
Also, check whether the scrapy-splash middleware would be enough for your use case.
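For a rough idea of the scrapy-splash route, here is a minimal sketch. It assumes a Splash instance is running and configured in settings.py, and the CSS selectors are placeholders, not the real oddsportal.com markup:

import scrapy
from scrapy_splash import SplashRequest

class HandballSpider(scrapy.Spider):
    name = "handball_odds"

    # Assumes settings.py contains SPLASH_URL plus the scrapy-splash
    # middlewares described in its README.
    def start_requests(self):
        yield SplashRequest(
            "http://www.oddsportal.com/matches/handball/",
            self.parse,
            args={"wait": 2},  # let the scripts build the table first
        )

    def parse(self, response):
        # Selectors are placeholders -- inspect the rendered page for the real ones
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}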
For sites that load dynamic content through AJAX and JavaScript, I have used PhantomJS. It doesn't require opening a browser window, because it is itself a fully scriptable, headless web browser. PhantomJS is fast and includes native support for various web standards, such as DOM handling, CSS selectors, JSON and Canvas.
If you aren't a JavaScript ninja, you should look at CasperJS, which is built on top of PhantomJS. It eases the process of defining a full navigation scenario and provides useful high-level functions.
Here is an example of how CasperJS works:
CasperJs and Jquery with chained Selects
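Since this thread is Python-centric, here is a rough sketch of driving PhantomJS through Selenium's Python bindings instead (note that newer Selenium releases have deprecated this driver in favour of headless Chrome/Firefox):

from selenium import webdriver

# PhantomJS must be installed and on the PATH
driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/matches/handball/")

# page_source holds the DOM *after* the page's JavaScript has run
html = driver.page_source
driver.quit()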
I'm trying to scrape data from a site in Python. The payload is right and everything works, but when I get the site's response, which would normally be the HTML source of the page, I instead get just a script tag with some error written in it. See the response I get below:
b'<script language="JavaScript">\nerr = "";\nlargeur = 1024;\nif (screen.width>largeur) { document.location.href="accueil.php?" +err;\t}\nelse { document.location.href="m.accueil.php?largeur=" +screen.width +\'&\' +err;\t}\n</script>'
Information:
After looking at the site, it seems that it uses Google Analytics. I don't really know much about it, but maybe because of that the page can't load, since I'm not accessing it through a browser.
What tool are you using to web-scrape? Tools like Beautiful Soup parse pre-loaded HTML content. If a website uses client-side rendering and JavaScript to load content, HTML parsers alone often won't work.
You can instead use an automated browser that interacts with a website just as a regular user would. These automated browsers can operate with or without a GUI. When run without a GUI (also known as headless browsers), they take less time and fewer resources than when run with one. Here's a fairly exhaustive list of headless browsers you can use. Note that not all are compatible with Python.
As Buran mentioned in the comments, Selenium is an option. Selenium is very well documented and has a large community following, so it's easy to find helpful articles or tutorials. It's multi-driver, so it can run different browsers (Firefox, Chrome, etc.), both headless and with a GUI.
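As a minimal sketch of the headless approach (the URL is a placeholder; the same pattern works for Chrome with its own Options class):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")         # run without a GUI
driver = webdriver.Firefox(options=options)

driver.get("https://example.com")          # placeholder URL
html = driver.page_source                  # HTML *after* client-side rendering
driver.quit()

# Now an HTML parser can work on the fully rendered page
soup = BeautifulSoup(html, "html.parser")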
I am new to Python and have just started with web scraping. I have to scrape data from this realtor site.
I need to scrape all the details of real-estate agents, grouped by their real-estate agency.
To do this in the web browser, I follow these steps:
Go to this site.
Click on the agency offices button, enter the 4000 pin in the search box, and submit.
Then we get a list of the agencies.
Go to the "our team" tab, where the agents are listed.
Then go to each agent's page and record their information.
Can anyone tell me how to approach this? What's the best way to build this type of scraper?
Do I have to use Selenium for the interaction with the pages?
I have worked with Requests, BeautifulSoup, and simple form submission using mechanize.
For a search-driven site like this, I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will almost certainly work; the drawback is that it will be slow. For Selenium, you can just use the Selenium IDE (a Firefox add-on) to record what you do, then get the HTML from the webpage and use BeautifulSoup to parse the data.
If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open a modern web browser (Firefox, Chrome) and use its network tools (usually located in the developer tools, or via right-click > Inspect Element). Once you are recording the network traffic, you can interact with the webpage to see the connections made to the server. In an example search, the site may use suggestions, e.g.:
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response will probably be a JSON document of the suggested results. Once you select a suggestion, you can just submit a request with those search parameters, e.g.:
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will then be in that HTML page; you just need to send a separate request to each page and extract the information with BeautifulSoup.
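Putting that together, a rough sketch with Requests and BeautifulSoup might look like this. The endpoints are the placeholder ones above, and the JSON layout and CSS selector are assumptions you would adapt to the real site:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Endpoint and parameters copied from the browser's network tab (placeholders)
suggestions = session.get(
    "https://suggest.example.com.au/smart-suggest",
    params={"query": "4000", "n": "7", "regions": "false"},
).json()

# Fetch the listing page that a suggestion points to
listing = session.get(
    "https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000"
)
soup = BeautifulSoup(listing.text, "html.parser")

# Follow each agent link and parse the details separately
for link in soup.select("a.agent-profile"):  # selector is a placeholder
    agent_page = session.get(link["href"])
    # ... parse the agent's details with BeautifulSoup here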
You might want to give Node and jQuery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using Node, you can turn the page HTML into a DOM object and then scrape all the data very easily using jQuery. I have done this for IMDb here: "Using jQuery & NodeJS to scrape the web" by @asimmittal, https://medium.com/@asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this approach to scrape Yelp.
Can we use Scrapy to get content from a web page that is loaded by JavaScript?
I'm trying to scrape usage examples from this page,
but since they are loaded via JavaScript as a JSON object, I'm not able to get them with Scrapy.
Could you suggest what is the best way to deal with such issues?
Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough times, it'll send out a new request to a JSONP endpoint. After removing the JSONP parameter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
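A minimal Scrapy sketch along those lines, paging through the API by bumping startOffset (the JSON key names are assumptions; inspect the actual payload):

import json
import scrapy

class ExamplesSpider(scrapy.Spider):
    name = "examples"
    url = ("https://corpus.vocabulary.com/api/1.0/examples.json"
           "?query=unalienable&maxResults=24&startOffset={offset}&filter=0")

    def start_requests(self):
        yield scrapy.Request(self.url.format(offset=0), self.parse,
                             meta={"offset": 0})

    def parse(self, response):
        data = json.loads(response.text)
        # Key names below are assumptions -- check the real JSON structure
        sentences = data.get("result", {}).get("sentences", [])
        for s in sentences:
            yield {"sentence": s}

        if sentences:  # keep paging while results come back
            offset = response.meta["offset"] + 24
            yield scrapy.Request(self.url.format(offset=offset), self.parse,
                                 meta={"offset": offset})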
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).
I'm trying to put together a little collection of libraries that I need in order to interact with HTML pages. What I need ranges from simple browsing and interacting with buttons or links on a web page (as in "write some text in this textbox and press this button") to parsing an HTML page and sending custom GET/POST messages to the server.
I am using Python 3, and so far I have Requests for simple webpage loading and custom GET and POST messages,
and BeautifulSoup for parsing the HTML tree; I'm thinking of trying out mechanize for simple web-page interactions.
Are there any other libraries out there similar to the three I am using so far? Is there some sort of gathering place where all Python libraries hang out? I ask because I sometimes find it difficult to find what I am looking for.
The set of tools and libraries for web scraping really depends on multiple factors: purpose, complexity of the page(s) you want to crawl, speed, limitations, etc.
Here's a list of tools that are popular in the Python web-scraping world nowadays:
selenium
Scrapy
splinter
ghost.py
requests (and grequests)
mechanize
There are also HTML parsers out there; these are the most popular:
BeautifulSoup
lxml
Scrapy is probably the best thing created for web scraping in Python. It's a full web-scraping framework that makes crawling easy and straightforward, and it provides just about everything you could want for web crawling.
Note: if there is a lot of AJAX and JS involved in loading and building the page, you will need a real browser to deal with it. This is where Selenium helps: it drives a real browser, allowing you to interact with it through a WebDriver.
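As a small illustration of combining the two layers, you can let Selenium render the page and hand the result to one of the parsers above (the URL and XPath are placeholders):

from lxml import html
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL
tree = html.fromstring(driver.page_source)
driver.quit()

# XPath is a placeholder -- adapt it to the page you are crawling
titles = tree.xpath("//h2/a/text()")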
Also see:
Web scraping with Python
Headless Selenium Testing with Python and PhantomJS
HTML Scraping
Python web scraping resource
Parsing HTML using Python
Hope that helps.
Hello, how can I make changes in my web browser with Python, like filling in forms and pressing Submit?
What libraries should I use? And does anyone have some examples?
Using urllib does not make any changes in the opened browser for me.
urllib is not intended to do anything with your browser; it just fetches content from URLs.
To fill in forms and that kind of thing, have a look at mechanize; to scrape the resulting pages, consider using pyquery.
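Roughly, with mechanize and pyquery (the URL and form-field names here are placeholders):

import mechanize
from pyquery import PyQuery as pq

br = mechanize.Browser()
br.open("https://example.com/login")  # placeholder URL
br.select_form(nr=0)                  # pick the first form on the page
br["username"] = "me"                 # field names are placeholders
br["password"] = "secret"
response = br.submit()

# Scrape the page that comes back with pyquery
doc = pq(response.read())
print(doc("title").text())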
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser, or a 'headless' one), navigate to a URL, and interact with the page.
It's used primarily for testing web code against multiple browsers, but it is also very useful for scraping pages and automating mundane tasks.
Here are the Python docs: http://selenium-python.readthedocs.org/en/latest/index.html
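For example, a minimal form-filling sketch (the URL and field name are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/search")  # placeholder URL

box = driver.find_element(By.NAME, "q")   # field name is a placeholder
box.send_keys("web scraping")
box.submit()                              # submits the enclosing form

print(driver.title)                       # title of the results page
driver.quit()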