I'm trying to use scrapy to scrape a site that uses javascript extensively to manipulate the document, cookies, etc (but nothing simple like JSON responses). For some reason I can't determine from the network traffic, the page I need comes up as an error when I scrape but not when viewed in the browser. So what I want to do is use webkit to render the page as it appears in the browser, and then scrape this. The scrapyjs project was made for this purpose.
To access the page I need, I had to have logged in previously, and saved some session cookies. My problem is that I cannot successfully provide the session cookie to webkit when it renders the page. There are two ways I could think to do this:
1. Use scrapy page requests exclusively until I get to the page that needs webkit, and then pass along the requisite cookies.
2. Use webkit within scrapy (via a modified version of scrapyjs) for the entire session, from login until I get to the page I need, and let it preserve cookies as needed.
Unfortunately neither approach seems to be working.
Along the lines of approach 1, I tried the following:
In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.middleware.WebkitDownloader': 701,  # to run after CookiesMiddleware
}
I modified scrapyjs to send cookies. In scrapyjs/middleware.py:

import gtk
import webkit
import jswebkit
# from gi.repository import Soup  # conflicting static and dynamic includes!?
import ctypes
from urlparse import urlparse

libsoup = ctypes.CDLL('/usr/lib/i386-linux-gnu/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')

def process_request(self, request, spider):
    if 'renderjs' in request.meta:
        cookies = request.headers.getlist('Cookie')
        if len(cookies) > 0:
            cookies = cookies[0].split('; ')
            cookiejar = libsoup.soup_cookie_jar_new()
            libsoup.soup_cookie_jar_set_accept_policy(cookiejar, 0)  # 0 == ALWAYS ACCEPT
            up = urlparse(request.url)
            for c in cookies:
                sp = c.find('=')  # find FIRST '=' as the split position
                cookiename, cookieval = c[0:sp], c[sp + 1:]
                libsoup.soup_cookie_jar_add_cookie(
                    cookiejar,
                    libsoup.soup_cookie_new(cookiename, cookieval, up.hostname, 'None', -1))
            session = libwebkit.webkit_get_default_session()
            libsoup.soup_session_add_feature(session, cookiejar)
        webview = self._get_webview()
        webview.connect('load-finished', self.stop_gtk)
        webview.load_uri(request.url)
        ...
The code for setting the cookiejar is adapted from this response. The problem may be with how imports work; perhaps this is not the right webkit that I'm modifying -- I'm not too familiar with webkit and the python documentation is poor. (I can't use the second answer's approach with from gi.repository import Soup because it mixes static and dynamic libraries. I also can't find any get_default_session() in webkit as imported above).
The second approach fails because sessions aren't preserved across requests, and again I don't know enough about webkit to know how to make it persist in this framework.
Any help appreciated!
Actually, the first approach does work, but with one modification. The path for the cookies needs to be '/' (at least in my application), and not 'None' as in the code above. That is, the line should be:
libsoup.soup_cookie_jar_add_cookie(cookiejar, libsoup.soup_cookie_new(cookiename,cookieval,up.hostname,'/',-1))
Unfortunately this only pushes the question back a bit. Now the cookies are saved properly, but the full page (including the frames) is still not being loaded and rendered with webkit as I had expected, and so the DOM is not complete as I see it in within the browser. If I simply request the frame that I want, then I get the error page instead of the content that is shown in a real browser. I'd love to see how to use webkit to render the whole page, including frames. Or how to achieve the second approach, completing the entire session in webkit.
Without knowing the complete workflow of your application: you need to make sure that setting the cookie jar happens before webkit performs any other network activity (see http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-Global-functions.html#webkit-get-default-session). In my experience, this practically means doing it even before instantiating the web view (a sketch of this ordering follows below).
Another thing to check is whether your frames come from the same domain. Cookie policies will not allow cookies across different domains.
Lastly, you can probably inject the cookies yourself. See http://webkitgtk.org/reference/webkitgtk/unstable/webkitgtk-webkitwebview.html#WebKitWebView-navigation-policy-decision-requested or the resource-request-starting signal, and then set the cookies on the actual soup message.
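On the first point, here is a minimal sketch of that ordering, reusing the ctypes handles from the question's middleware. session_cookies is a hypothetical list of (name, value) pairs already split out of the Cookie header; the helper name is made up, everything else comes from the code above.

def _render_with_cookies(self, request, session_cookies):
    # session_cookies: hypothetical (name, value) tuples, as split out of the
    # 'Cookie' header in the question's process_request()
    host = urlparse(request.url).hostname

    # build and populate the cookie jar first...
    cookiejar = libsoup.soup_cookie_jar_new()
    libsoup.soup_cookie_jar_set_accept_policy(cookiejar, 0)  # 0 == always accept
    for name, value in session_cookies:
        libsoup.soup_cookie_jar_add_cookie(
            cookiejar,
            libsoup.soup_cookie_new(name, value, host, '/', -1))

    # ...attach it to the default SoupSession before any webview exists...
    session = libwebkit.webkit_get_default_session()
    libsoup.soup_session_add_feature(session, cookiejar)

    # ...and only then instantiate the view and start loading
    webview = self._get_webview()
    webview.connect('load-finished', self.stop_gtk)
    webview.load_uri(request.url)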
I'm using scrapy (on PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in list format, and save the results in a CSV file.
I tried the following code but it returns empty brackets []. After inspecting the HTML I discovered that the content is rendered by AngularJS.
If someone has a solution for that it would be great.
Thank you
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        yield {'productnames': product}
You won't be able to get the desired products by parsing the HTML. The page is heavily JavaScript-driven, so scrapy won't see the rendered content.
The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website generates the product details via an API. If we can mimic the request the browser makes to obtain that product information, we can get the data in a nice, neat format.
First you have to set ROBOTSTXT_OBEY = False in settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
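For reference, that toggle is a single line in settings.py:

# settings.py
ROBOTSTXT_OBEY = False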
Code Example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['woolworths.com.au']

    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers'}

    def start_requests(self):
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        data = response.json()
        for a in data:
            yield {
                'name': a['Name'],
            }
Explanation
We start off with our defined URL in start_requests. This URL is the specific API URL Woolworths uses to obtain the information for iced teas. For any other link on Woolworths, the part of the URL after /products/ will be specific to that part of the website.
The reason we're using this is that driving a browser is slow and brittle. This approach is fast, and the information we get back is usually highly structured and much better suited to scraping.
So how do we get the URL, you may be asking? You need to inspect the page and find the correct request. Open the network tools and reload the website; you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking that, and then clicking Preview, gives you a box on the right-hand side with all the details of the products.
In this next image, you can see a preview of the product data
We can then get the request URL and anything else from this request.
I will often copy this request as cURL (a Bash command), as seen here,
and enter it into curl.trillworks.com, which can convert cURL to Python, giving you nicely formatted headers and any other data needed to mimic the request.
Now, putting this into Jupyter and playing about, you actually only need the params, NOT the headers, which is much better.
So back to the code. We make a request; using the meta argument we can pass on the data (remember, because it's defined outside the function we have to use self.data), and then we specify the callback, parse.
We can use the response.json() method to convert the JSON response into a list of Python dictionaries, one per product. You MUST have scrapy v2.2 or later to use this method. Otherwise you could use data = json.loads(response.text), but then you'll have to put import json at the top of the script.
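If you're stuck on an older scrapy, here is a sketch of that json.loads fallback, written as a drop-in replacement for the parse method above:

import json  # at the top of the script

def parse(self, response):
    data = json.loads(response.text)  # same result as response.json() on scrapy < 2.2
    for a in data:
        yield {'name': a['Name']}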
From the preview, and from playing about with the JSON in requests, we can see that these dictionaries sit inside a list, so we can use a for loop to go round each product, which is what we are doing here.
We then yield a dictionary to extract the data: a refers to each product, which is its own dictionary, and a['Name'] refers to that dictionary's 'Name' key, giving us the correct value. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get the data I want.
The only thing left to do is to run scrapy crawl test -o products.csv to output this to a CSV file.
I can't really help you more than this until you specify what other data you want from this page. Please remember that you're going against what the site wants you to scrape. Also, for any other pages on that website you will need to find the specific API URL that returns those products. I have shown you the way to do this; if you want to automate it, it's worth your while struggling with it a bit. We are here to help, but an attempt on your part is how you're going to learn coding.
Additional Information on Handling Dynamic Content
There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-oriented websites. The default should be to try to re-engineer the requests the browser makes to load the page's information. This is what the JavaScript on this site, and on many others, is doing: it provides a dynamic way to display new information, without reloading the page, by making an HTTP request. If we can mimic that request, we can get the information we want. This is the most efficient way to get dynamic content.
In order of preference:
1. Re-engineering the HTTP requests
2. Scrapy-splash
3. Scrapy-selenium
4. Importing the selenium package into your scripts
Scrapy-splash is slightly better than the selenium package, as it pre-renders the page, giving you access to the selectors with the data already in place. Selenium is slow and prone to errors, but it will allow you to mimic browser activity.
There are multiple ways to include selenium in your scripts; see below for an overview.
Recommended Reading/Research
Look at the scrapy documentation with regard to dynamic content here
This will give you an overview of the steps for handling dynamic content. Generally speaking, selenium should be thought of as a last resort; it's pretty inefficient when doing larger-scale scraping.
If you are considering adding the selenium package to your script: this might be the lowest barrier of entry to getting your script working, but it is not necessarily efficient. At the end of the day scrapy is a framework, but there is a lot of flexibility in adding third-party packages. The spider scripts are just Python classes with the scrapy architecture imported in the background. As long as you're mindful of the response, and translate some of the selenium calls to work with scrapy, you should be able to drop selenium into your scripts. I would say this is probably the least efficient solution, though.
Consider using scrapy-splash: Splash pre-renders the page and allows you to add JavaScript execution. Docs are here, and there's a good article from Scrapinghub here.
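To give a feel for it, here is a minimal, untested sketch of the scrapy-splash route for the iced-teas page above. It assumes a Splash instance is running (for example via Docker on port 8050) and that SPLASH_URL plus the scrapy-splash middlewares are configured in settings.py as its docs describe; the spider name is made up.

import scrapy
from scrapy_splash import SplashRequest

class IcedTeaSplashSpider(scrapy.Spider):
    name = 'icedtea_splash'

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # 'wait' gives the page's javascript a moment to populate the product tiles
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # the rendered HTML now contains the product tiles, so CSS selectors work
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name.strip()}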
Scrapy-selenium is a package with a custom scrapy downloader middleware that allows you to perform selenium actions and execute JavaScript. Docs here. You'll need to play around to get the login procedure working with it; it doesn't have the same level of detail as the selenium package itself.
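And an equivalent sketch with scrapy-selenium, again untested and assuming settings.py declares SELENIUM_DRIVER_NAME, SELENIUM_DRIVER_EXECUTABLE_PATH and SELENIUM_DRIVER_ARGUMENTS and adds scrapy_selenium.SeleniumMiddleware to DOWNLOADER_MIDDLEWARES as its docs describe:

import scrapy
from scrapy_selenium import SeleniumRequest

class IcedTeaSeleniumSpider(scrapy.Spider):
    name = 'icedtea_selenium'

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # wait_time gives the page's javascript time to render before parsing
        yield SeleniumRequest(url=url, callback=self.parse, wait_time=5)

    def parse(self, response):
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name.strip()}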
I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract the most recent messages from its board. It is bot-protected with Cloudflare. I am using Python and the relevant libraries, and this is what I have so far:
from bs4 import BeautifulSoup as soup  # parses/cuts the html
import cfscrape
import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'

r = requests.get(url)
html = soup(r.text, "html.parser")
containers = html.find("div", {"id": "bbPosts"})
print(containers.text.strip())
I am not able to use the HTML parser because the site detects my script and blocks it.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header. The client (in your case the requests library) informs the server about its identity.
Generally speaking, a browser will say "I am a browser" and a library will say "I am a library". The server can then decide to allow browsers, but not libraries, to access its content.
However, for this particular case, you can simply lie to the server by sending your own User Agent header.
You can see an example here. Try to use your browser's user agent.
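For example, here is a minimal sketch with requests (the user-agent string below is only a stand-in; copy the one your own browser sends, which you can read from the Network tab of the developer tools):

import requests

headers = {
    # stand-in browser user agent; replace it with the one your browser sends
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}
r = requests.get('https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price',
                 headers=headers)
print(r.status_code)  # 200 suggests the User-Agent check was the blocker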
Other blocking techniques include IP ranges. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.
Something else that could be happening: you might be trying to access a single-page application that is not rendered server side. In that case, what you receive from that GET request is a very small HTML file that essentially just references a JavaScript file. If this is the case, what you need is an actual browser that you control programmatically. I would suggest you look at Google Chrome Headless, although there are others. You can also use Selenium.
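Here is a minimal sketch of that browser-driven route with Selenium and headless Chrome (this assumes chromedriver is installed and on your PATH; adjust the driver setup for your environment):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price')
html = driver.page_source  # the DOM after the javascript has run
driver.quit()

From there you can hand html to BeautifulSoup exactly as in your original snippet.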
Web crawling is a beautiful but very deep subject. I think these pointers should set you in the right direction.
Also, as a quick mention, my advice is to avoid from bs4 import BeautifulSoup as soup; I would recommend html2text instead.
I'm scraping data from some Amazon URLs, but of course I sometimes get a captcha. I was wondering whether the enable/disable cookies option has anything to do with this. I rotate around 15 proxies while crawling. I guess the question is: should I enable or disable cookies in settings.py to get clean pages, or is it irrelevant?
I thought that if I enable them, the website would know the history of what the IP does, and at some point notice the pattern and block it (this is my guess), so I should disable them? Or is that not even how cookies work and what they are for?
How are you accessing these URLs? Do you use the urllib library? If so, you might not have noticed, but urllib has a default user agent. The user agent is part of the HTTP request (stored in the header) and identifies the type of software you used to access a page. This allows websites to display their content correctly on different browsers, but it can also be used to determine whether you are using an automated program (they don't like bots).
Now, the default urllib user agent tells the website you are using Python to access the page (usually a big no-no). You can spoof your user agent quite easily to stop any nasty captcha codes from appearing!
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.example.com', None, headers)
html = urllib2.urlopen(req).read()
Because you're using scrapy to crawl webpages, you may need to make changes to your settings.py file so that you can change the user agent there.
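A sketch of what that looks like (the string below is just an example browser user agent, not anything scrapy mandates):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'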
EDIT
Another reason why captchas might be appearing all over the place is that you are moving too fast through the website. If you add a sleep call in between URL requests, this might solve your captcha issue!
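In scrapy you don't need a literal sleep call; the built-in throttling settings do the same job (the numbers below are only illustrative):

# settings.py
DOWNLOAD_DELAY = 5           # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adapts the delay to how the server responds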
Other reasons for captchas appearing:
You are clicking on honeypot links (links that are within the html code but not displayed on the webpage) designed to catch crawlers.
You may need to change the pattern of crawling as it may be flagged as "non-human".
Check the website's robots.txt file, which shows what is and isn't allowed to be crawled.
I'm a developer for a big GUI app, and we have a web site for bug tracking. Anybody can submit a new bug to the bug tracking site. We can detect certain failures from our desktop app (i.e. an unhandled exception), and in such cases we would like to open the submit-new-bug form in the user's preferred browser, adding whatever information we can gather about the failure to some form fields. We can retrieve the submit-new-bug form using either the GET or POST HTTP method, and we can provide default field values to that form. So from the HTTP server side everything is pretty much OK.
So far we can successfully open a URL, passing the default values as GET parameters, using the webbrowser module from the Python Standard Library. There are, however, some limitations to this method, such as the maximum allowed URL length in some browsers (especially MS IE). The webbrowser module doesn't seem to have a way to request a URL using POST. OTOH, there's the urllib2 module, which provides the type of control we want, but AFAIK it lacks the possibility of opening the retrieved page in the user's preferred browser.
Is there a way to get the mixed behavior we want (the fine control of urllib2 with the higher-level functionality of webbrowser)?
PS: We have thought about the possibility of retrieving the URL with urllib2, saving its content to a temp file and opening that file with webbrowser. This is a bit of a nasty solution, and in this case we would have to deal with other issues such as relative URLs. Is there a better solution?
This is not a proper answer, but it also works:
import os
import requests
import webbrowser

url = "https://www.facebook.com/login/device-based/regular/login/?login_attempt=1&lwv=110"
myInput = {'email': 'mymail#gmail.com', 'pass': 'mypaass'}  # placeholder credentials

x = requests.post(url, data=myInput)

with open("home.html", "w") as f:  # 'w' so repeated runs don't append to old output
    f.write(x.text)

webbrowser.open('file://' + os.path.abspath("home.html"))
I don't know of any way you can open the result of a POST request in a web browser without saving the result to a file and opening that.
What about taking an alternative approach and temporarily storing the data on the server? Then the page can be opened in the browser with a simple id parameter, and the saved, partially filled form would be shown.
You could use tempfile.NamedTemporaryFile():
import tempfile
import webbrowser

import jinja2

t = jinja2.Template('hello {{ name }}!')  # you could load the template from a file
f = tempfile.NamedTemporaryFile(mode='w', suffix='.html')  # deleted when it goes out of scope (closed)
f.write(t.render(name='abc'))
f.flush()
webbrowser.open_new_tab(f.name)  # returns immediately
A better approach, if the server can be easily modified, is to make the POST request with partial parameters using urllib2 and open the URL generated by the server using webbrowser, as suggested by @Acorn.
I'm trying to write a server process that will allow you to enter a URL, then every 30 min ping that URL and capture it as an image. Is this possible with a combination of something like CURL, urllib2 and PIL?
Curl, urllib2, etc., grab the HTML code for a web page. But a page doesn't look like anything on its own. Instead, a browser uses that code and renders a web page according to its own internal rules of how that code should be used. And, of course, each browser renders the page slightly differently.
In other words, you can't take a snapshot of a page without having a web browser to generate the page to take the snapshot of.
If you're feeling very ambitious, you can create your own custom, scriptable page renderer by using the rendering engine from the browser of your choice -- they all make the rendering engine a separate component that you can work with separately. IE's is called "Trident", Firefox's is called "Gecko", Chrome's is "WebKit", etc.
Otherwise you'll want to just do some sort of UI scripting, like you might do with iOpus or Selenium. Selenium is scriptable with python, so that's one for you right there.
EDIT
Here you go. That looks pretty simple.
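If you go down the Selenium route mentioned above, a rough sketch of the 30-minute loop might look like this (assuming chromedriver is installed and on your PATH; the URL and filename pattern are placeholders):

import time
from selenium import webdriver

URL = 'http://example.com'  # placeholder for the URL the user enters

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

while True:
    driver.get(URL)
    driver.save_screenshot('snapshot-%d.png' % int(time.time()))
    time.sleep(30 * 60)  # wait 30 minutes before the next capture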
The PIL ImageGrab module can be used to take a screenshot on Windows. However, you can't do this purely with cURL, urllib2 and PIL, because you have to render the web site. The easiest approach would probably be to open the website in a browser and grab a screenshot.