How can I get the parsed HTML in Scrapy from a hardcoded URL - Python

In my Scrapy spider I just want the HTML response from a custom URL inside a variable.
Suppose I have the url
url = "http://www.example.com"
Now I want to get the html of that page for parsing
pageHtml = scrapy.get(url)
I want something like this
page = urllib2.urlopen('http://yahoo.com').read()
The only reason I can't use the above line in my crawler is that my session is already authenticated by Scrapy, so I can't use any other function to get the HTML of that page.
I don't want the response in a callback, but simply straight inside a variable.

Basically, you just need to add the relevant imports for the code in that question to work. You'll also need to add a link variable, which is used but not defined in that example code.
import httplib
from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
bs = BaseSpider('some')
# etc
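If the goal is simply to get parseable HTML into a variable, here is a minimal sketch of one way to do it (assuming Python 2 with urllib2, and assuming the page in question does not actually need the spider's authenticated session):
import urllib2
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
url = "http://www.example.com"
# Fetch the raw HTML straight into a variable.
pageHtml = urllib2.urlopen(url).read()
# Wrap it in a response object so Scrapy selectors can be used on it.
response = HtmlResponse(url=url, body=pageHtml)
print Selector(response=response).xpath('//title/text()').extract()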

Related

Parsing output from scrapy splash

I'm testing out a splash instance with scrapy 1.6 following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser
class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5},)
    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None
The output opens up in notepad rather than a browser. How can I open this in a browser?
If you are using the Splash middleware and the rest of the setup, the Splash output goes into the regular response object, which you can access via response.css and response.xpath. Depending on which endpoint you use, you can execute JavaScript and more.
If you need to move around a page or otherwise interact with it, you will need to write a Lua script and run it with the proper endpoint (a rough sketch follows below). As far as parsing goes, the output automatically ends up in the response object.
Get rid of open_in_browser. I'm not exactly sure what you are doing, but if all you want is to parse the page, you can do it like so:
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could clarify your question, that would help; most people don't want to follow links and guess what you're having trouble with.
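As a rough sketch of the Lua-script approach mentioned above (the spider name, wait time, and script body are illustrative assumptions, not something from the question):
import scrapy
from scrapy_splash import SplashRequest
# Illustrative Lua script: load the page, wait for JavaScript, return the rendered HTML.
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2.0)
    return splash:html()
end
"""
class LuaExampleSpider(scrapy.Spider):
    name = 'lua_example'
    start_urls = ["http://yahoo.com"]
    def start_requests(self):
        for url in self.start_urls:
            # The 'execute' endpoint runs the Lua script; the HTML it returns
            # becomes response.body, so response.css/xpath work as usual.
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'lua_source': LUA_SCRIPT})
    def parse(self, response):
        links = response.css('a::attr(href)').extract()
        self.logger.info('Found %d links', len(links))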
Update for clarified question:
It sounds like you may want scrapy shell with Splash; this will let you experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
To access Splash in a browser instance, simply go to http://0.0.0.0:8050/ and input the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.

When I fetch HTML from a website using urllib2, the inner HTML is empty. Anyone know why?

I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
Use a library capable of executing JavaScript and giving you the resulting HTML
Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, then we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended in the standard library documentation here) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
So the way the site works is that it sends you the page with no word in the span box, then fills it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're trying to get the word, I'd suggest a different method: rather than scraping the word off the page, you can simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body and receive the word in the response.
You're using Python 2, but in Python 3 (used here just to show that this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
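For completeness, a rough Python 2 sketch using urllib2 (untested here; passing a non-None data argument is what makes urlopen send a POST):
import urllib2
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
# An empty (but non-None) data argument turns this into a POST with no body.
print urllib2.urlopen(url, data='').read()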

Scraping the core text of webpages in Python

Is there a module that can be used to get the core text of a webpage?
Something that will remove the headers / menus / social links?
Thank you
I think that varies from site to site. Since every website has a different structure, you cannot come up with a standard extractor.
For extracting a particular part of a web page, you can approach it as follows:
from urllib2 import urlopen
from scrapy.http import HtmlResponse
url = "some_website_you_want_to_crawl"
url_response = urlopen(url)
resp = HtmlResponse(url=url, body=url_response.read())
core_text = resp.xpath('xpath_to_core_text').extract()[0]
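As a purely hypothetical illustration, if the main text of a page lived inside an <article> element, the last line might become:
# Hypothetical XPath; the real expression depends on the site's structure.
core_text = resp.xpath('string(//article)').extract()[0]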

How to test standalone code outside of a scrapy spider?

I can test my XPaths on an HTML body string by running the two lines under (1) below.
But what if I have a local file myfile.html whose content is exactly the body string. How would I run some standalone code, outside of a spider? I am seeking something that is roughly similar to the lines under (2) below.
I am aware that scrapy shell myfile.html tests xpaths. My intention is to run some Python code on the response, and hence scrapy shell is lacking (or else tedious).
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
# (1)
body = '<html><body><span>Hello</span></body></html>'
print Selector(text=body).xpath('//span/text()').extract()
# (2)
response = HtmlResponse(url='file:///tmp/myfile.html')
print Selector(response=response).xpath('//span/text()').extract()
You can have a look at how scrapy tests for link extractors are implemented.
They use a get_testdata() helper to read a file as bytes, and then instantiate a Response with a fake URL and the body set to the bytes read with get_testdata():
...
body = get_testdata('link_extractor', 'sgml_linkextractor.html')
self.response = HtmlResponse(url='http://example.com/index', body=body)
...
Depending on the encoding used for the local HTML files, you may need to pass a non-UTF-8 encoding argument to Response.
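Putting that together for the local file from the question, a minimal standalone sketch (the path and XPath are taken from the question; the fake URL and the rest of the wiring are assumptions) could be:
from scrapy.http import HtmlResponse
from scrapy.selector import Selector
# Read the local HTML file as bytes and wrap it in a response with a fake URL.
with open('/tmp/myfile.html', 'rb') as f:
    body = f.read()
response = HtmlResponse(url='http://example.com/index', body=body)
print Selector(response=response).xpath('//span/text()').extract()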

I cannot get the body element of an HTML page when web scraping with Python

I would like to parse a website with the Python urllib library. I wrote this:
from bs4 import BeautifulSoup
from urllib.request import HTTPCookieProcessor, build_opener
from http.cookiejar import FileCookieJar
def makeSoup(url):
    jar = FileCookieJar("cookies")
    opener = build_opener(HTTPCookieProcessor(jar))
    html = opener.open(url).read()
    return BeautifulSoup(html, "lxml")
def articlePage(url):
    return makeSoup(url)
Links = "http://collegeprozheh.ir/%d9%85%d9%82%d8%a7%d9%84%d9%87- %d9%85%d8%af%d9%84-%d8%b1%d9%82%d8%a7%d8%a8%d8%aa%db%8c-%d8%af%d8%b1-%d8%b5%d9%86%d8%b9%d8%aa-%d9%be%d9%86%d9%84-%d9%87%d8%a7%db%8c-%d8%ae%d9%88%d8%b1%d8%b4%db%8c%d8%af/"
print(articlePage(Links))
but the website does not return the content of the body tag.
This is the result of my program:
cURL = window.location.href;
var p = new Date();
second = p.getTime();
GetVars = getUrlVars();
setCookie("Human" , "15421469358743" , 10);
check_coockie = getCookie("Human");
if (check_coockie != "15421469358743")
document.write("Could not Set cookie!");
else
window.location.reload(true);
</script>
</head><body></body>
</html>
I think the cookie has caused this problem.
The page is using JavaScript to check the cookie and to generate the content. However, urllib does not process JavaScript and thus the page shows nothing.
You'll either need to use something like Selenium, which acts as a browser and executes JavaScript, or you'll need to set the cookie yourself before you request the page (from what I can see, that's all the JavaScript code does). You seem to be loading a file containing cookie definitions (using FileCookieJar), but you haven't included its contents.
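A rough sketch of the set-the-cookie-yourself approach, based on the cookie name and value visible in the JavaScript output above (the site may well generate a different value on each visit, so treat this only as an illustration):
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
def makeSoup(url):
    # Send the cookie that the page's JavaScript would normally set.
    req = Request(url, headers={"Cookie": "Human=15421469358743"})
    html = urlopen(req).read()
    return BeautifulSoup(html, "lxml")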
