How to approach web-scraping in python

How to approach web-scraping in python - python

I am new to python just started on python web-scraping. I have to scrape data from this realtor site
I need to scrape all the details op read-state agents according to their real-state agency;
For this on the web-browser I have to follow the following instructions
Go to this site
click on agency offices button, enter 4000 pin in search box and then submit.
then we get list of the agencies.
go to our team tab and then we get agents their.
then we have to go to each agents page and record their information.
Can anyone tell me how to approach this. Whats the best way to make this type of scrapers.
Do i have to use selenium for the interaction with the pages.
I have worked on request, BeautifulSoup and simple form submit using mechanize

I would recommend on a searching site that you either use Selenium or Requests with sessions, the advantage of Selenium it it will probably work however it will be slow. For Selenium you should just use the Selenium IDE (Firefox add on) to record what you do then get the HTML from the webpage and use beautifulsoup to parse the data.
If you want to scrape the data quickly and without using much resources I usually use Requests with Sessions. To scrape a website like this you should open up a modern web browser (Firefox, Chrome) and use the network tools for that browser (usually located in developer tools or via right click inspect element). Once you are recording the network you can interact with the webpage to see the connections made to the server. In an example search they may use suggestions e.g
https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
The response then will probably be a JSON of the suggested results. Once you select a suggestion you can just submit a request with that search parameters e.g
https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
The URLs for the agents will the be in that HTML page, you just then need to separately send a request to each page to get the information using BeautifulSoup.

You might wanna give Node and Jquery a try. I used to use Python all the time, but it gets messy and hard to maintain after a while.
Using node, you can turn the page HTML into a DOM object and then scrape all the data very easily using Jquery. I have done this for imdb here: “Using JQuery & NodeJS to scrape the web” #asimmittal https://medium.com/#asimmittal/using-jquery-nodejs-to-scrape-the-web-9bb5d439413b
You can modify this to scrape yelp

Related

Python Webscraping Solution Reccomendations required

I would like to know what is the best/preferred PYTHON 3.x solution (fast to execute, easy to implement, option to specify user agent, send browser & version etc to webserver to avoid my IP being blacklisted) which can scrape data on all of below options (mentioned based on complexity as per my understanding).
Any Static webpage with data in tables / Div
Dynamic webpage which completes loading in one go
Dynamic webpage which requires signin using username password & completes loading in one go after we login.
Sample URL for username password: https://dashboard.janrain.com/signin?dest=http://janrain.com
Dynamic web-page which requires sign-in using oauth from popular service like LinkedIn, google etc & completes loading in one go after we login. I understand this involves some page redirects, token handling etc.
Sample URL for oauth based logins: https://dashboard.janrain.com/signin?dest=http://janrain.com
All of bullet point 4 above combined with option of selecting some drop-down (lets say like "sort by date") or can involve selecting some check-boxes, based on which the dynamic data displayed would change.
I need to scrape the data after the action of check-boxes/drop-downs has been performed as any user would do it to change the display of the dynamic data
Sample URL - https://careers.microsoft.com/us/en/search-results?rk=l-seattlearea
You have option of drop-down as well as some checkbox in the page
Dynamic webpage with Ajax loading in which data can keep loading as
=> 6.1 we keep scrolling down like facebook, twitter or linkedin main page to get data
Sample URL - facebook, twitter, linked etc
=> 6.2 or we keep clicking some button/div at the end of the ajax container to get next set of data;
Sample URL - https://www.linkedin.com/pulse/cost-climate-change-indian-railways-punctuality-more-editors-india-/
Here you have to click "Show Previous Comments" at the bottom of the page if you need to look & scrape all the comments
I want to learn & build one exhausted scraping solution which can be tweaked to cater to all options from the easy task of bullet point 1 to the complex task of bullet point 6 above as and when required.

I would recommend to use BeautifulSoup for your problems 1 and 2.
For 3 and 5 you can use Selenium WebDriver (available as python library).
Using Selenium you can perform all the possible operations you wish (e.g. login, changing drop down values, navigating, etc.) and then you can access the web content by driver.page_source (you may need to use sleep function to wait until the content is fully loaded)
For 6 you can use their own API to get list of news feeds and their links (mostly the returned object comes with link to a particular news feed), once you get the links you can use BeautifulSoup for get the web content.
Note: Pleas do read each web site terms and conditions before scraping because some of them have mentioned Automated Data Collection as Unethical behavior which we we should not do as professional.

Scrapy is for you if you looking for the real scaleable bulletproof solution. In fact scrapy framework is an industry standard for python crawling tasks.
By the way: I'd suggest you avoid JS rendering: all that stuff(chromedriver, selenium, phantomjs) is a last option to crawl sites.
Most of ajax data you can parse simply by forging needed requests.
Just spend more time in Chrome's "network" tab.

Scrapy for dynamic content

Can we use Scrapy for getting content from a web page which is loaded by Javascript?
I'm trying to scrape usage examples from this page,
but since they are loaded using Javascript as a JSON object I'm not able to get them with Scrapy.
Could you suggest what is the best way to deal with such issues?

Open your browser's developer tools and look at the Network tab. If you hit the "next" button on that page enough, it'll send out a new request:
After removing the JSONP paramter, the URL is pretty straightforward:
https://corpus.vocabulary.com/api/1.0/examples.json?query=unalienable&maxResults=24&startOffset=24&filter=0
By making the minimal number of requests, your spider will be fast.
If you want to just emulate a full browser and execute the JavaScript, you can use something like Selenium or Scrapinghub's Splash (and its corresponding Scrapy plugin).

Submit form on internet website

I want to build a python script to submit some form on internet website. Such as a form to publish automaticaly some item on site like ebay.
Is it possible to do it with BeautifulSoup or this is only to parse some website?
Is it possible to do it with selenium but quickly without open really the browser?
Are there any other ways to do it?

Look at the requests library.. Also, check out the chrome debugger toolbar to see the requests fly by. There is also a utility called postman, where you can "design", queries, then generate code in many different flavors (including pythons requests library).
BeautifulSoup is for parsing HTML.

You can use selenium with PhantomJS to do this without the browser opening. You have to use the Keys portion of selenium to send data to the form to be submitted. It is also worth noting that this method will not work if there are captcha's on the form.

The mechanize library can fill and submit forms.

How can I input data into a webpage to scrape the resulting output using Python?

I am familiar with BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result that I want to scrape is returned?
I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm
I want to be able to go to the page, enter two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.
Is there any way to input data into a webpage using Python?

Take a look at tools like mechanize or scrape:
http://pypi.python.org/pypi/mechanize
http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/
http://zesty.ca/scrape/
Packt Publishing has an article on that matter, too:
http://www.packtpub.com/article/web-scraping-with-python

Yes! Try mechanize for this kind of Web screen-scraping task.

I think you can also use PySide/PyQt, because they have a browser core of qtwebkit, you can control the browser to open pages, simulate human actions(fill, click...), then scrape data from pages. FMiner is work on this way, it's a web scraping software I developed with PySide.
Or you can try phantomjs, it's an easy library to control browser, but not it's javascript not python lanuage.

In addition with the answers already given, you could simply do a request on that page. Using your browser you could always inspect the Network (under Tools/Web Developer tools) behaviors and actions when you interact with the page. E.g. http://www.freemaptools.com/ajax/getaandb.php?a=Florida_Usa&b=New%20York_Usa&c=6052 -> request query for getting the results page you are expecting. Request that page and scrape the field you wanted to. IMHO, page requests are way faster than screen scraping (case-to-case basis).
But of course, you could always do screen scraping/browser simulation also (Mechanize, Splinter) and use headless browsers (PhantomJS, etc.) or the browser driver of the browser you want to use.

The query may have been resolved.
You can use Selenium WebDriver for this purpose. A web page can be interacted using programming language. All the operations can be performed as if a human user is accessing the web page.

web scraping a problem site

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.

The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.

The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:fals e,type:"once"});
</script>

The reason why is because it's performing AJAX calls after it loads. You will need to account for searching out those URLs to scrape it's content as well.

As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to approach web-scraping in python - python

Related

Python Webscraping Solution Reccomendations required

Scrapy for dynamic content

Submit form on internet website

How can I input data into a webpage to scrape the resulting output using Python?

web scraping a problem site

Categories

Resources