I was hoping someone could offer some insight into the feasibility of using the Scrapy Python framework to create a real-time wrapper.
To clarify what I mean by "wrapper" in this context, let me describe my situation. I would like to use Scrapy to script a solution in which a user executes a search query on a website, which in turn calls a Scrapy spider in real time. That spider is told to:
1) log in to a 3rd-party site
2) execute the user's search query
3) retrieve only the HTML results for the returned query, by extracting the resulting HTML content via the unique result-set container class and/or XPath
4) modify the extracted HTML results (by reforming the HTML and/or injecting a new header/footer or CSS elements)
5) and finally return the modified HTML results in real time, so the HTML can be injected directly into the original domain, all while being transparent to the user.
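Setting the login and search steps aside (they are site-specific), the extraction and re-wrapping steps could be sketched as below. This is a minimal standard-library sketch, not a Scrapy spider: the names ContainerExtractor and wrap_results, the "results" class and the sample page are all hypothetical. It assumes well-nested markup, drops HTML comments, and would mangle inline scripts, so treat it as an illustration only.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerExtractor(HTMLParser):
    """Collect the outer HTML of the first element that has a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # open-element depth inside the target container
        self.done = False   # True once the container has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class not in classes:
                return
        self.parts.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            # A void element that is itself the target ends the capture.
            self.done = self.depth == 0
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied only inside the container.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth and not self.done and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        # The parser hands over unescaped text, so escape it again on output.
        if self.depth and not self.done:
            self.parts.append(escape(data, quote=False))


def wrap_results(page_html, css_class, header="", footer=""):
    """Extract the result container and surround it with our own markup."""
    parser = ContainerExtractor(css_class)
    parser.feed(page_html)
    parser.close()
    return header + "".join(parser.parts) + footer


# Hypothetical page returned by the 3rd-party search.
page = ('<html><body><div id="nav">menu</div>'
        '<div class="results list"><p>Hit &amp; one</p><br>'
        '<a href="/r/1">first</a></div>'
        '<footer>site footer</footer></body></html>')

wrapped = wrap_results(page, "results",
                       header="<header>My site</header>",
                       footer="<footer>My footer</footer>")
print(wrapped)
```

In a real deployment the page would come from the logged-in search request, and the returned string would be what the original domain injects into its own page.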
I should point out that I am familiar with writing Scrapy spiders for large-scale bulk crawling, but I am less familiar with the feasibility of using Scrapy to build a real-time "wrapper" of this kind.
If anyone has any insight, advice or examples which illustrate a similar situation I would greatly appreciate it. CH
You may try the HTQL browser interface for Python at http://htql.net/. An example of a Bing search in real time:
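import htql

a = htql.Browser()
b = a.goUrl("http://www.bing.com/")
# Fill in and submit the first form on the page with q=test
c = a.goForm("<form>1", {"q": "test"})
# Print every link whose text contains "test"
for d in htql.HTQL(c[0], "<a (tx like '%test%')>"):
    print(d)
# Click the first matching link that does not point back to /search
e = a.click("<a (tx like '%test%' and not (href like '/search%'))>1")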
It can be coupled with IRobotSoft scraper to do everything visually, by changing the browser to:
a = htql.Browser(2)
More details can be found in the manual at http://htql.net/htql-python-manual.pdf, or you can ask at http://irobotsoft.org/bb/
I'm using Scrapy (on PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in a list format, and save the results in a CSV file.
I tried the following code, but it returns empty brackets []. After inspecting the HTML I discovered that the content is hidden in AngularJS format.
If someone has a solution for that, it would be great.
Thank you
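import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        # Extract the text of each product link (this is what comes back as [])
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        # A callback has to yield an item such as a dict, not a bare string
        yield {'productnames': product}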
You won't be able to get the desired products by parsing the HTML. The page is heavily JavaScript-driven, so the HTML that Scrapy downloads does not contain the products.
The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website loads the product details via an API. If we can mimic the request the browser makes to obtain that product information, we can get the information in a neat, structured format.
First you have to set ROBOTSTXT_OBEY = False in settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
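As a rough sketch, the relevant part of settings.py might look like the fragment below. Only ROBOTSTXT_OBEY comes from this answer; the delay and AutoThrottle settings are standard Scrapy settings added here as an assumption about how to keep a long scrape gentle, and the values are purely illustrative.

```python
# settings.py (fragment)

# From the answer: do not let robots.txt block the API request.
ROBOTSTXT_OBEY = False

# Assumed politeness settings; tune the numbers for your own use.
DOWNLOAD_DELAY = 2                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 30
```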
Code Example
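import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # Must match the domain of the API URL below (.com.au, not .com)
    allowed_domains = ['woolworths.com.au']
    # Parameters seen on the browser's request. Note: meta only attaches them
    # to the request object for the callback; it does not send them to the server.
    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers'}

    def start_requests(self):
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        # response.json() needs Scrapy 2.2+; the API returns a list of product dicts
        data = response.json()
        for a in data:
            yield {
                'name': a['Name'],
            }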
Explanation
We start off with our defined URL in start_requests. This is the specific URL of the API that Woolworths uses to obtain the product information for iced tea. For any other page on Woolworths, the part of the URL after /products/ will be specific to that part of the website.
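To make the URL structure concrete, here is a small sketch that rebuilds such a URL from a list of product IDs. The helper name build_products_url is hypothetical; the base URL and the IDs are taken from the spider above.

```python
API_BASE = "https://www.woolworths.com.au/apis/ui/products/"


def build_products_url(product_ids):
    """Join the product IDs with commas and append them to the API base URL."""
    return API_BASE + ",".join(str(pid) for pid in product_ids)


url = build_products_url([58520, 341057, 305224])
print(url)
```

For another category you would collect the IDs that the browser requests on that page and pass them in the same way.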
The reason we're using this is that automating a browser is slow and brittle. Calling the API is fast, and the information we get back is usually highly structured, which is much better for scraping.
So how do we get the URL, you may be asking? You need to inspect the page and find the correct request. Open the Network tab of your browser's developer tools and reload the website; you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking it and then clicking Preview gives you a box on the right-hand side with all the details of the products.
In this next image, you can see a preview of the product data.
We can then get the request URL and anything else we need from this request.
I will often copy this request as cURL (bash command), as seen here,
and enter it into curl.trillworks.com. This can convert cURL to Python, giving you nicely formatted headers and any other data needed to mimic the request.
Putting this into Jupyter and playing about, it turns out you only need the params, NOT the headers, which is much better.
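One way to send those params is to encode them into the URL's query string. The sketch below uses only the standard library and makes no request; the helper name with_params is hypothetical, and the shortened ID list is just for readability.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {"excludeUnavailable": "true", "source": "RR-Best Sellers"}
base_url = "https://www.woolworths.com.au/apis/ui/products/58520,341057"


def with_params(url, params):
    """Append params to url as a percent-encoded query string."""
    return f"{url}?{urlencode(params)}"


full_url = with_params(base_url, params)
print(full_url)

# Round-trip check: the query string decodes back to the original values.
print(parse_qs(urlsplit(full_url).query))
```

The resulting string could then be passed as the url argument of scrapy.Request.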
So, back to the code. We make a request and attach the data with the meta argument; because data is defined on the class, outside the function, we have to refer to it as self.data. We then specify parse as the callback. Note that meta only carries the values along to the callback; to actually send them as query parameters they would need to be encoded into the URL.
We can use the response.json() method to convert the JSON body into a list of Python dictionaries, one per product. You MUST have Scrapy 2.2 or later to use this method. Otherwise you could use data = json.loads(response.text), but you'll have to import json at the top of the script.
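If the same spider has to run on both older and newer Scrapy versions, the two options can be combined in a small helper. This is a sketch: response_to_data and OldStyleResponse are hypothetical names, and the stand-in class only imitates a response that lacks .json().

```python
import json


def response_to_data(response):
    """Decode a JSON body, preferring response.json() when it exists."""
    if hasattr(response, "json"):
        return response.json()
    # Fallback for Scrapy versions older than 2.2
    return json.loads(response.text)


class OldStyleResponse:
    """Stand-in for a response object that only exposes .text."""

    def __init__(self, text):
        self.text = text


data = response_to_data(OldStyleResponse('[{"Name": "Example product"}]'))
print(data)
```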
From the preview, and from playing about with the JSON in requests, we can see these Python dictionaries are actually within a list, so we can use a for loop to go through each product, which is what we are doing here.
We then yield a dictionary with the extracted data. Here a refers to each product, which is its own dictionary, and a['Name'] looks up the key 'Name' in that dictionary and gives us the value we want. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get the data I want.
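The loop can be tried outside Scrapy on a hand-written payload. In the sketch below only the 'Name' key comes from this answer; the sample products, the 'Stockcode' key and the helper name iter_product_names are invented for illustration. Using .get() instead of a['Name'] is a design choice so that a product without a name is skipped rather than raising KeyError.

```python
import json

# Invented sample body shaped like the list of product dicts described above.
sample_body = """[
    {"Name": "Iced Tea Peach 1.5L", "Stockcode": 1},
    {"Name": "Iced Tea Lemon 500mL", "Stockcode": 2},
    {"Stockcode": 3}
]"""


def iter_product_names(products):
    """Yield one item per product that has a 'Name' key."""
    for product in products:
        name = product.get("Name")
        if name is not None:
            yield {"name": name}


items = list(iter_product_names(json.loads(sample_body)))
print(items)
```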
The only thing left to do is to run scrapy crawl test -o products.csv to output this to a CSV file.
I can't really help you more than this until you specify what other data you want from this page. Please remember that you're scraping against the site's wishes, and that for any other pages on that website you will need to find the specific API URL to get those products. I have given you the way to do this; if you want to automate it, it would be worth your while to struggle with it yourself. We are here to help, but an attempt on your part is how you're going to learn coding.
Additional Information on the Approach of Dynamic Content
There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-driven websites. By default you should try to re-engineer the requests the browser makes to load the page's information. This is what the JavaScript on this site and many other sites is doing: it makes an HTTP request so that new information can be displayed dynamically without reloading the page. If we can mimic that request, we can get the information we want. This is the most efficient way to get dynamic content.
In order of preference:
1. Re-engineering the HTTP requests
2. scrapy-splash
3. scrapy-selenium
4. Importing the selenium package into your scripts
scrapy-splash is slightly better than the selenium package, as it pre-renders the page, giving you access to the selectors with the data. Selenium is slow and prone to errors, but it will allow you to mimic browser activity.
There are multiple ways to include Selenium in your scripts; see the overview below.
Recommended Reading/Research
Look at the scrapy documentation with regard to dynamic content here
This will give you an overview of the steps for handling dynamic content. Generally speaking, Selenium should be thought of as a last resort. It's pretty inefficient for larger-scale scraping.
If you are considering adding the selenium package to your script: this might be the lowest barrier to entry for getting your script working, but it is not necessarily efficient. At the end of the day Scrapy is a framework, but there is a lot of flexibility in adding 3rd-party packages. The spider scripts are just a Python class that uses the Scrapy architecture in the background. As long as you're mindful of the response and translate some of the Selenium output to work with Scrapy, you should be able to use Selenium in your scripts. I would say this solution is probably the least efficient, though.
Consider using scrapy-splash. Splash pre-renders the page and allows you to add JavaScript execution. Docs are here and a good article from Scrapinghub here
scrapy-selenium is a package with a custom Scrapy downloader middleware that allows you to perform Selenium actions and execute JavaScript. Docs here. You'll need to have a play around to work out the login procedure with this; it doesn't have the same level of detail as the selenium package itself.
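For scrapy-splash, the project settings usually need a few extra entries. The fragment below follows the scrapy-splash README as I remember it, so treat it as a sketch and check the README for the version you install; the SPLASH_URL value assumes a Splash instance running locally on port 8050.

```python
# settings.py (fragment) for scrapy-splash

# Assumption: Splash is running locally, e.g. in a Docker container.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes duplicate filtering aware of the Splash arguments of a request.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With this in place, a spider would yield scrapy_splash.SplashRequest objects instead of plain scrapy.Request objects for pages that need rendering.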
I have a problem getting JavaScript-generated content into HTML so that I can use it in a script. I have tried multiple methods, such as PhantomJS and the Python Qt library, and they all get most of the content in nicely, but the problem is that there are JavaScript buttons inside the page like this:
Please see the screenshot here
When I load this page from a script, these buttons don't default to any value, so I am getting back 0 for all the SELL/NEUTRAL/BUY values below. Is there a way to set these values when you load the page from a script?
Example page with all the values is: https://www.tradingview.com/symbols/NEBLBTC/technicals/
Any help would be greatly appreciated.
If you are trying to achieve this with Scrapy, or with something derived from cURL or urllib, I am afraid you can't, because they don't run the page's JavaScript. Python has other external packages, such as Selenium, that allow you to interact with the JavaScript on the page, but the problem with Selenium is that it is slow. If you want something closer to Scrapy, you could check how the site works (as far as I can see it works through AJAX or WebSockets) and fetch the info you want through urllib, as you would with an API.
Please let me know whether this makes sense or whether I misunderstood your question.
I used Selenium, which was perfect for this job; it is indeed slow but fits my purpose. I also used the Selenium Firefox plugin to generate the Python script, as it was very challenging to find where exactly in the code the button I had to press was.
I am trying to scrape a website using Python and Beautiful Soup. I have found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser that also builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some formerly major headless browsers are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself and all this JavaScript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
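A minimal standard-library sketch of re-creating such a request is shown below. The URL, the User-Agent string and the helper names are placeholders, and the X-Requested-With header is an assumption (many, but not all, AJAX endpoints look at it). The opener argument is injectable so the example runs here against a canned response instead of the network.

```python
import io
import json
import urllib.request


def build_xhr_request(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Build a request that looks like a browser's background (XHR) call."""
    return urllib.request.Request(url, headers={
        "User-Agent": user_agent,
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    })


def fetch_json(url, opener=urllib.request.urlopen):
    """Fetch url and decode the JSON body; opener can be swapped for testing."""
    with opener(build_xhr_request(url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fake_opener(request, timeout=None):
    """Stand-in for urlopen that returns a canned JSON body."""
    return io.BytesIO(b'{"items": [1, 2]}')


data = fetch_json("https://example.com/api/data", opener=fake_opener)
print(data["items"])
```

Dropping the opener argument makes fetch_json use urllib.request.urlopen and hit the real endpoint.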
Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or tumblr Api).
Some extensive Googling did not produce any items relating to notes' extraction in Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately, it looks like the Tumblr API has some limitations (it lacks meta information about reblogs, and notes are limited to 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Following pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in a’s onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
As Fabio implies, it is better to use the API.
If for whatever reason you cannot, then the tools you use will depend on what you want to do with the data in the posts:
for a data dump: urllib will return a string of the page you want
looking for a specific section in the html: lxml is pretty good
looking for something in unruly html: definitely beautifulsoup
looking for a specific item in a section: beautifulsoup, lxml, text parsing is what you need.
need to put the data in a database/file: use scrapy
Tumblr's URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you get to the end of the posts and the server simply does not return any more data.
So if you are going to brute-force your way through the scraping, you can easily tell your script to dump all the data to your hard drive until, say, the contents tag is empty.
One last word of advice: please remember to put a small pause in your script (for example time.sleep(1) in Python, which waits one second), because you could otherwise put some stress on Tumblr's servers.
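The brute-force loop described above could be sketched like this. The function crawl_pages and its parameters are hypothetical; fetch_page stands for whatever downloads page N and returns its content, and the one-second default pause is an assumption. The demo uses an in-memory fake and a no-op sleep so it runs instantly.

```python
import time


def crawl_pages(fetch_page, delay_seconds=1.0, max_pages=1000, sleep=time.sleep):
    """Fetch pages 1, 2, 3, ... until one comes back empty, pausing in between."""
    pages = []
    for page_number in range(1, max_pages + 1):
        content = fetch_page(page_number)
        if not content:
            break  # the server has run out of posts
        pages.append(content)
        sleep(delay_seconds)  # be gentle with the server
    return pages


# Fake site with two pages; page 3 returns None, which ends the crawl.
fake_site = {1: "first page", 2: "second page"}
pages = crawl_pages(fake_site.get, sleep=lambda seconds: None)
print(pages)
```

The max_pages cap is a safety net in case the server never returns an empty page.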
The question "how to load all notes on tumblr?" also covers the topic, but unor's response (above) does it very well.
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
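As a hedged sketch of what treading lightly might look like with the standard library: an opener that keeps cookies and sends browser-like headers, plus a small throttle that enforces a minimum gap between requests. All names, header values and the two-second interval are assumptions for illustration, and nothing here contacts a server.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def build_browser_like_opener(user_agent, referer):
    """Return an opener that stores cookies and sends the given headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent), ("Referer", referer)]
    return opener


class Throttle:
    """Make sure at least min_interval seconds pass between two calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()


opener = build_browser_like_opener(
    "Mozilla/5.0 (X11; Linux x86_64)", "http://news.yahoo.com/")
throttle = Throttle(min_interval=2.0)
print(opener.addheaders)
```

In use, each download would call throttle.wait() and then opener.open(url). The throttle only keeps your own request rate down; it does not get around a server-side limit.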