SCRAPY - XPATH select an object inside a node - python

I need to extract an object that is assigned to a variable inside a <script> node.
(I'm using Scrapy 1.8.0, haven't updated yet.)
Maybe I'm not explaining myself clearly, but as soon as you see it... you will understand.
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<script id='myscript'>
oneVariable = {...}
theVariable = {"Data": "blahblah", "More-Data": {...}}
</script>
</head>
<body>
</body>
</html>
OK, I got the whole node with its content manually using the Scrapy shell and this selector:
response.xpath('//*[@id="myscript"]').get()
Can I get just "theVariable" with XPath selectors or selector methods (like get(), getall(), etc.)?
Thanks in advance!

Try changing your XPath expression to something like:
substring-after(//script[@id="myscript"], "theVariable = ")
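In Scrapy you can evaluate that expression directly and parse the result as JSON. A minimal sketch (names are taken from the question; it assumes nothing follows the object literal inside the script, otherwise cut the literal out with a regex via re_first()):
import json

raw = response.xpath(
    'substring-after(//script[@id="myscript"]/text(), "theVariable = ")').get()
data = json.loads(raw)   # works because nothing follows the object literal
print(data["Data"])      # -> "blahblah"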

Related

Emulating CORS issues with pytest

I need to test requests that can be sent through an iframe. For example: I have a page on domain_01:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport"
content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<style>
body {
margin: 0 auto;
}
</style>
<body>
<iframe id="inlineFrameExample"
title="Inline Frame Example"
width="1600"
height="900"
src="http://domain_02:8000/app/dashboard">
</iframe>
</body>
</html>
And as you can see, this page contains an iframe with a link to a page on domain_02. I'm trying to understand: is it possible to emulate the request that goes to domain_02 through this iframe on domain_01 with pytest?
The main task I need to solve is to create tests with different requests and check that there are no CORS issues with them.
How I check it now: manually only. I run a second web server with Python's built-in server (python -m http.server 8090) and set a DNS record on a local DNS server to emulate domain_01. It would be so cool to run these tests with pytest.
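For what it's worth, the server-side half of that manual check (does domain_02 accept cross-origin requests coming from domain_01?) can be written as an ordinary pytest test with requests. This is only a sketch using the question's hostnames; it verifies the CORS response headers rather than reproducing the browser's iframe behaviour:
import requests

TARGET = "http://domain_02:8000/app/dashboard"   # page embedded in the iframe
ORIGIN = "http://domain_01:8090"                 # page hosting the iframe

def test_domain_02_allows_requests_from_domain_01():
    # Send the kind of cross-origin request the iframe would trigger and
    # check that the server answers with a permissive CORS header.
    response = requests.get(TARGET, headers={"Origin": ORIGIN})
    allowed = response.headers.get("Access-Control-Allow-Origin")
    assert allowed in (ORIGIN, "*")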

Python & Selenium: How to get Elements in DevTools with CDP (Chrome DevTools Protocol)

I'd like to get all of the source code shown in the Elements panel of Chrome DevTools.
I tried the following code, but its output does not match what the Elements panel shows:
body = driver.execute_cdp_cmd("DOM.getOuterHTML", {"backendNodeId": 1})
print(body)
Is it possible to get all of the source code with CDP? How can I do it?
I know there are other ways to scrape the source code, but I'd like to know how to get the source code shown in Elements in DevTools (F12).
EDIT: See CDP solution at the end
Assuming by "f12 source code" you mean "the current DOM, after it has been manipulated by JS or anything else, as opposed to the original source code".
so, consider the following html page:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Hi</title>
<script>
document.addEventListener("DOMContentLoaded", function(){
setTimeout(function(){
document.getElementById("test").innerHTML+=" World!"
}, 3000)
});
</script>
</head>
<body>
<h1 id="test">Hello</h1>
</body>
</html>
3 seconds after page load, the h1 will contain "Hello World!"
And that is exactly what we see when running the following code:
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome()
driver.get("http://localhost:8000/") # replace with your page
sleep(6) # probably replace with smarter logic
html = driver.execute_script("return document.documentElement.outerHTML")
print (html)
That outputs:
<html lang="en"><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Hi</title>
<script>
document.addEventListener("DOMContentLoaded", function(){
setTimeout(function(){
document.getElementById("test").innerHTML+=" World!"
}, 3000)
});
</script>
</head>
<body>
<h1 id="test">Hello World!</h1>
</body></html>
EDIT, using CDP instead:
The behavior you're describing is odd, but okay, let's find a different solution.
It seems there's limited support for CDP in Selenium 4 (so far) in Python.
As of now (May 2022) there is no driver.getDevTools() in Python, only in Java and JS (Node), as far as I can tell.
Anyway, I'm not even sure that would have helped us.
Raw CDP commands will suffice for now:
from selenium import webdriver
from time import sleep
# webdriver.remote.webdriver.import_cdp()
driver = webdriver.Chrome()
driver.get("http://localhost:8000/")
sleep(6)
doc = driver.execute_cdp_cmd(cmd="DOM.getDocument",cmd_args={})
doc_root_node_id = doc["root"]["nodeId"]
result = driver.execute_cdp_cmd(cmd="DOM.getOuterHTML",cmd_args={"nodeId":doc_root_node_id})
print (result['outerHTML'])
prints:
<!DOCTYPE html><html lang="en"><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Hi</title>
<script>
document.addEventListener("DOMContentLoaded", function(){
setTimeout(function(){
document.getElementById("test").innerHTML+=" World!"
}, 3000)
});
</script>
</head>
<body>
<h1 id="test">Hello World!</h1>
</body></html>

XPath for html elements

I'd like to use Scrapy to crawl a few hundred websites and just scrape the basic (title, meta* and body) HTML elements. I know that I should use CrawlSpider for this and adjust some of the settings for broad crawls. The part that I'm having trouble figuring out is how to use XPath to create the rules for scraping just those basic HTML elements. Lots of tutorials I see involve inspecting the element and finding the CSS class for that element. That is fine for the body element, but what about the title and meta tags?
There are XPath and CSS selectors you can use to select nodes in HTML.
An element is a node, but a node is not always an element.
So head, meta and body are all elements, and a class attribute on a div is the same kind of node as the charset attribute on a meta element: they are all attribute nodes.
For example:
<!DOCTYPE html>
<html lang='zh-cn'>
<head>
<meta charset='utf-8'>
<meta http-equiv='X-UA-Compatible' content='IE=edge'>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="renderer" content="webkit">
<title>title</title>
</head>
<body>
<div>website content</div>
</body>
</html>
If you want to select
<meta http-equiv='X-UA-Compatible' content='IE=edge'>
you can use an XPath expression like this:
//head/meta[@http-equiv="X-UA-Compatible"]
You can search for elements in <head> the same way you search for them in <body>, for example:
//html/head/title
or
//html/head/meta
Well, for the title node you can write a simple XPath expression: //title, which is the abbreviated syntax of /descendant-or-self::node()/child::title, and that's it.
For the meta nodes, guess what, you can just write //meta too, or if you want you can use the absolute path /html/head/meta.
P.S. You can do the same thing for the body node.
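Putting those selectors together in a Scrapy callback could look like the sketch below (the spider name, start URL and field names are just placeholders):
import scrapy

class BasicElementsSpider(scrapy.Spider):
    name = "basic_elements"
    start_urls = ["https://example.com"]   # replace with your list of sites

    def parse(self, response):
        yield {
            "title": response.xpath("//title/text()").get(),
            "meta": response.xpath("//meta/@content").getall(),
            "body": " ".join(response.xpath("//body//text()").getall()),
        }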

lxml: Element is not a child of this node

I'm trying to change the value of title within the following html document:
<html lang="en">
<head>
<meta charset="utf-8">
<title id="title"></title>
<base href="/">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<app-root></app-root>
</body>
</html>
I wrote the following python script which uses lxml, in order to accomplish the task:
from lxml.html import fromstring, tostring
from lxml.html import builder as E
html = fromstring(open('./index.html').read())
html.replace(html.get_element_by_id('title'), E.TITLE('TEST'))
But after running the script, I get the following error:
ValueError: Element is not a child of this node.
What could be causing this error? Thank you.
The 'title' tag is a child of the 'head' node. In your code you call replace on the 'html' node, which does not have the 'title' element as a direct child, hence the ValueError.
You can get the desired result if you call replace on the 'head' node instead.
html.find('head').replace(html.get_element_by_id('title'), E.TITLE('TEST'))
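Put together, the corrected script could look like this (a sketch; tostring() is only used here to print the modified document):
from lxml.html import fromstring, tostring
from lxml.html import builder as E

html = fromstring(open('./index.html').read())
# replace() must be called on the direct parent of the element, i.e. <head>
html.find('head').replace(html.get_element_by_id('title'), E.TITLE('TEST'))
print(tostring(html, pretty_print=True).decode())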

Download an HTML page locally (+CSS, +images)

I'd like to be able to download an HTML page (let's say this actual question!):
import urllib2  # Python 2; on Python 3 use urllib.request instead

f = urllib2.urlopen('https://stackoverflow.com/questions/33914277')
content = f.read()  # soup = BeautifulSoup(content) could be useful?
g = open("mypage.html", 'w')
g.write(content)
g.close()
such that it is displayed the same way locally as it is online. Currently the result is bad: the page renders without its styles (original screenshot, hosted on gget.it, omitted).
Thus, one needs to download the CSS and modify the HTML itself so that it points to the local CSS file... and the same for images, etc.
How do I do this? (I think there should be something simpler than this answer, which doesn't handle CSS, but how? Is there a library for it?)
Since CSS and image files can be loaded cross-origin, your local HTML can still refer to them while they remain on the remote server. The problem is unresolved URIs. In the HTML head section you have something like this:
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="stylesheet" type="text/css" href="/assets/8943fcf6/select.css" />
<link href="/css/media.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="/assets/jquery.yii.js"></script>
<script type="text/javascript" src="/assets/select.js"></script>
</head>
Obviously /css/media.css implies a base address, e.g. http://example.com. To resolve it for a local file you need to make http://example.com/css/media.css the href value in your local copy of the HTML. So you should parse the HTML and add the base into the local code:
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="stylesheet" type="text/css" href="http://example.com/assets/select.css" />
<link href="http://example.com/css/media.css" rel="stylesheet" type="text/css">
<script type="text/javascript" src="http://example.com/assets/jquery.yii.js"></script>
<script type="text/javascript" src="http://example.com/assets/select.js"></script>
</head>
Use any means for that (JS, PHP...).
Update
Since the local file also contains image references throughout the body section, you'll need to resolve them too.
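One way to do that rewriting in Python is to parse the downloaded page and join every relative href/src against the page URL, for example with BeautifulSoup (which the question already mentions). This is only a sketch: it rewrites the links so the local copy keeps pointing at the live assets, it does not download the assets themselves:
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/33914277"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

# Make every relative reference absolute so the saved page can still resolve it.
for tag, attr in (("link", "href"), ("script", "src"), ("img", "src"), ("a", "href")):
    for node in soup.find_all(tag):
        if node.get(attr):
            node[attr] = urljoin(url, node[attr])

with open("mypage.html", "w", encoding="utf-8") as f:
    f.write(str(soup))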
