Get image URL from HTML5 canvas using Selenium Python

<canvas class="word-cloud-canvas" id="word-cloud-canvas-1892" height="270" width="320"></canvas>
How to get the URL source of an image from HTML5 canvas using Selenium Python?
I tried to use
driver.execute_script("return arguments[0].toDataURL('image/png');", canvasElement)
but it only returns the binary data (base64?) of the image.
I don't want to save the image; I want to get the URL of the image. Is that possible?

I faced a similar issue, and the only alternative I could find was to use subprocess and PhantomJS.
Here is the Python code:
import json
from subprocess import check_output

# main_url is the page whose resource URLs you want to collect
output = check_output(['phantomjs', 'getResources.js', main_url])
urls = json.loads(output)
for url in urls:
    # filter and process URLs
    ...
and the JavaScript file content:
// getResources.js
// Usage:
//   phantomjs getResources.js your_url
var page = require('webpage').create();
var system = require('system');
var urls = Array();

page.onResourceRequested = function(request, networkRequest) {
    urls.push(request.url);
};
page.onLoadFinished = function(status) {
    setTimeout(function() {
        console.log(JSON.stringify(urls));
        phantom.exit();
    }, 16000);
};
page.onResourceError = function() {
    return false;
};
page.onError = function() {
    return false;
};

page.open(system.args[1]);
PhantomJS supports various options as well; for example, to change the user agent you can use something like this:
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) ...';
This is a simplified version of this answer which I used for my issue.
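For reference, the string that toDataURL() returns is itself a URL, a base64-encoded data: URI, so it can be used directly as an image source or decoded in Python without saving anything to disk. A minimal Selenium sketch, assuming the canvas id from the question's markup and a placeholder page URL:
import base64
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/page-with-canvas")  # placeholder URL

canvas = driver.find_element(By.ID, "word-cloud-canvas-1892")
data_url = driver.execute_script(
    "return arguments[0].toDataURL('image/png');", canvas)

# data_url looks like "data:image/png;base64,iVBORw0KGgo..."; the part after
# the comma is the PNG itself, base64-encoded, if the raw bytes are ever needed.
png_bytes = base64.b64decode(data_url.split(",", 1)[1])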

Related

Python request.get() response different to response in browser or when proxied over burp suite

I am trying to send a GET request with Python like this:
import requests
url = "internal_url" # I replaced all internal urls
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0", "Accept": "*/*", "Accept-Language": "en-GB,en;q=0.5", "Accept-Encoding": "gzip, deflate", "X-Requested-With": "XMLHttpRequest", "Connection": "close", "Referer": "internal url"}
r = requests.get(url, headers=headers)
print(r.text)
As a response I am expecting JSON data, but instead I get this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">
function getCookie(c_name) { // Local function for getting a cookie value
if (document.cookie.length > 0) {
c_start = document.cookie.indexOf(c_name + "=");
if (c_start!=-1) {
c_start=c_start + c_name.length + 1;
c_end=document.cookie.indexOf(";", c_start);
if (c_end==-1)
c_end = document.cookie.length;
return unescape(document.cookie.substring(c_start,c_end));
}
}
return "";
}
function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie
var exdate = new Date();
exdate.setDate(exdate.getDate()+expiredays);
document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
}
function getHostUri() {
var loc = document.location;
return loc.toString();
}
setCookie('STRING redacted', IP-ADDRESS redacted, 10);
try {
location.reload(false);
} catch (err1) {
try {
location.reload();
} catch (err2) {
location.href = getHostUri();
}
}
</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>
</body>
</html>
When I changed the request to use the burp suite proxy so I can see the request, it suddenly works and I get the correct response:
proxies = {"http": "127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
r = requests.get(url, headers=headers, verify=False, proxies=proxies)
My browser displays the correct results as text when I visit the link itself; the Burp Suite proxy is not needed there.
I think it's possible that it has to do with the company proxy.
But even when I run the request with the company proxies supplied, it still does not work.
Is there something I am missing?
EDIT:
After some more searching, it seems like I get redirected when I don't use any proxies in Python. That doesn't happen when I go over the Burp Suite proxy.
After a few days and some outside help I finally found the solution. Posting it here for the future.
My problem was that I was using a partially qualified domain name instead of a fully qualified domain name.
So, for example: myhost instead of myhost.example.com.
Burp Suite or the browser handled the translation for me, but in Python I had to do it myself.
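For illustration, a small sketch of what "doing it myself" can look like, resolving the short host name to its fully qualified form before building the URL (host name and path here are hypothetical):
import socket
import requests

short_host = "myhost"                   # partially qualified name from the example
fqdn = socket.getfqdn(short_host)       # e.g. "myhost.example.com", via the local resolver
url = f"https://{fqdn}/api/data"        # hypothetical endpoint

r = requests.get(url, headers={"Accept": "application/json"})
print(r.json())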

How to send HTML message to Python Server using websockets

A program is written to communicate between an HTML page and a Python server. The connection is established, but the message (a string) that I send from the HTML side is not displayed on the server side.
I have tried Python server programs, WebSockets, and HTML programming.
'''Server side program using python'''
import ast
import json

def RcvDatafromUI(self):
    while True:
        json_data = self.conn.recv(1024)
        try:
            if json_data in (b'\r\n', b''):  # nothing useful received
                print("Looping Again")
                continue
            else:
                print("RcvdDAta : ", json_data)
                json_string = json_data.decode('utf-8')
                print("DecodedDAta : ", json_string)
                # print('type decode=', type(json_string))
                json_dict = json.loads(ast.literal_eval(json_string))
                print('json_dict=', json_dict)
                y = parsejsondata.ParseJson(json_dict)
                if y == "Success":
                    PloverMain.SendDAtatoUI('{"Command":"Start","Status":"Success/Fail"}')
        except Exception as exc:  # malformed or partial data
            print("Could not parse:", exc)

PloverMain.RcvDatafromUI()
'''html code'''
<!DOCTYPE HTML>
<html>
   <head>
      <script type="text/javascript">
         function WebSocketTest() {
            if ("WebSocket" in window) {
               alert("WebSocket is supported by your Browser!");
               // Let us open a web socket
               var exampleSocket = new WebSocket("ws://localhost:12345/echo");
               alert("website opened");
               exampleSocket.onopen = function(event) {
                  // Web Socket is connected, send data using send()
                  exampleSocket.send('{"Command":"Start","Status":"Check"}');
                  alert("Message is sent...");
               };
               exampleSocket.onmessage = function(event) {
                  var received_msg = event.data;
                  alert("Message is received...");
               };
               exampleSocket.onclose = function() {
                  // websocket is closed.
                  alert("Connection is closed...");
               };
            } else {
               // The browser doesn't support WebSocket
               alert("WebSocket NOT supported by your Browser!");
            }
         }
      </script>
   </head>
   <body>
      <div id="sse">
         <!-- assuming the test is triggered from this link, as in the standard example -->
         <a href="javascript:WebSocketTest()">Run WebSocket</a>
      </div>
   </body>
</html>
Expected (on server side):
RcvdDAta : '{"Command":"Start","Status":"Check"}'
Actual:
RcvdDAta :
b'GET /echo HTTP/1.1\r\nHost: localhost:12345\r\nConnection: Upgrade\r\nPragma: no-cache\r\nCache-Control: no-cache\r\nUpgrade: websocket\r\nOrigin: file://\r\nSec-WebSocket-Version: 13\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36\r\nAccept-Encoding: gzip, deflate, br\r\nAccept-Language: en-US,en;q=0.9\r\nSec-WebSocket-Key: s1qAa96xxm9IZqU21C0TQA==\r\nSec-WebSocket-Extensions: permessage-deflate; client_max_window_bits\r\n\r\n'
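Those bytes are the WebSocket opening handshake, not the message: a plain TCP recv() only ever sees the HTTP Upgrade request, because the browser waits for the server to complete the handshake and then exchanges framed messages. One way to avoid hand-rolling the protocol is the third-party websockets library; a rough sketch of a server for the page above (port taken from the question; the handler signature may also take a path argument in older library versions):
import asyncio
import json
import websockets

async def handler(websocket):
    # Each incoming frame arrives here already decoded to a str.
    async for message in websocket:
        print("RcvdDAta :", message)
        data = json.loads(message)
        await websocket.send(json.dumps({"Command": "Start", "Status": "Success"}))

async def main():
    async with websockets.serve(handler, "localhost", 12345):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())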

How can I get the original URL from an archive.is short link using python?

I would like to write a function which takes an archive.is (or archive.fo, archive.li, or archive.today) link as input and gives the URL of the original site as output.
For example, if the input was 'http://archive.is/9mIro', then I would want the output to be 'http://www.dailytelegraph.com.au/news/nsw/australian-army-bans-male-recruits-to-get-female-numbers-up/news-story/69ee9dc1d4f8836e9cca7ca2e3e5680a'.
How can I do this in python?
Yes, your approach could work for another site, but archive.is seems to protect its data from automatic queries: when I try curl or Python (urllib2) I get an "Empty reply from server" error. You need something like PhantomJS that mimics a real browser. And I believe it will only work for a few queries and then show a captcha or give errors. They also seem to log IP addresses; even PhantomJS gets errors from the same machine where curl or Python was tried.
Here's PhantomJS code that works:
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

function getOriginalUrl(shortUrl, cb) {
    page.open(shortUrl, function(status) {
        //console.log(status);
        var url = page.evaluate(function() {
            return document.querySelector('form input').value;
        });
        cb(url);
    });
}

if (args.length > 1) {
    getOriginalUrl(args[1], function(url) {
        console.log(url);
        phantom.exit();
    });
} else {
    console.log('Pass url');
    phantom.exit();
}
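If you want to drive this from Python rather than the command line, a thin subprocess wrapper in the spirit of the first answer could look like this (the script file name is assumed; phantomjs must be on PATH):
from subprocess import check_output

def get_original_url(short_url):
    # Runs the PhantomJS script above, which prints the original URL.
    output = check_output(['phantomjs', 'getOriginalUrl.js', short_url])
    return output.decode().strip()

print(get_original_url('http://archive.is/9mIro'))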

Using python to scrape ASP.NET site with id in url

I'm trying to scrape the search results of this ASP.NET website using Python requests to send a POST request. Even though I use a GET request to get the __RequestVerificationToken and include it in my header, I just get this reply:
{"Token":"Y2VgsmEAAwA","Link":"/search/Y2VgsmEAAwA/"}
which is not the valid link; it's the total search results without the arrival date or area defined in my POST request. What am I missing? How do I scrape a site like this that generates a (session?) ID for the URL?
Thank you so much in advance to all of you!
My Python script:
import json
import requests
from bs4 import BeautifulSoup

r = requests.Session()

# GET request
gr = r.get("http://www.feline.dk")
bsObj = BeautifulSoup(gr.text, "html.parser")
auth_string = bsObj.find("input", {"name": "__RequestVerificationToken"})['value']
#print(auth_string)
#print(gr.url)

# POST request
search_request = {
    "Geography.Geography": "Danmark",
    "Geography.GeographyLong=": "Danmark (Ferieområde)",
    "Geography.Id": "da509992-0830-44bd-869d-0270ba74ff62",
    "Geography.SuggestionId": "",
    "Period.Arrival": "16-1-2016",
    "Period.Duration": 7,
    "Period.ArrivalCorrection": "false",
    "Price.MinPrice": None,
    "Price.MaxPrice": None,
    "Price.MinDiscountPercentage": None,
    "Accommodation.MinPersonNumber": None,
    "Accommodation.MinBedrooms": None,
    "Accommodation.NumberOfPets": None,
    "Accommodation.MaxDistanceWater": None,
    "Accommodation.MaxDistanceShopping": None,
    "Facilities.SwimmingPool": "false",
    "Facilities.Whirlpool": "false",
    "Facilities.Sauna": "false",
    "Facilities.InternetAccess": "false",
    "Facilities.SatelliteCableTV": "false",
    "Facilities.FireplaceStove": "false",
    "Facilities.Dishwasher": "false",
    "Facilities.WashingMachine": "false",
    "Facilities.TumblerDryer": "false",
    "update": "true"
}
payload = {
    "searchRequestJson": json.dumps(search_request),
}
header = {
    "Accept": "application/json, text/html, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "da-DK,da;q=0.8,en-US;q=0.6,en;q=0.4",
    "Connection": "keep-alive",
    "Content-Length": "720",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "ASP.NET_SessionId=ebkmy3bzorzm2145iwj3bxnq; __RequestVerificationToken=" + auth_string + "; aid=382a95aab250435192664e80f4d44e0f; cid=google-dk; popout=hidden; __utmt=1; __utma=1.637664197.1451565630.1451638089.1451643956.3; __utmb=1.7.10.1451643956; __utmc=1; __utmz=1.1451565630.1.1.utmgclid=CMWOra2PhsoCFQkMcwod4KALDQ|utmccn=(not%20set)|utmcmd=(not%20set)|utmctr=(not%20provided); BNI_Feline.Web.FelineHolidays=0000000000000000000000009b84f30a00000000",
    "Host": "www.feline.dk",
    "Origin": "http://www.feline.dk",
    #"Referer": "http://www.feline.dk/search/Y2WZNDPglgHHXpe2uUwFu0r-JzExMYi6yif5KNswMDBwMDAAAA/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
gr = r.post(
    url='http://www.feline.dk/search',
    data=payload,
    headers=header
)
#print(gr.url)
bsObj = BeautifulSoup(gr.text, "html.parser")
print(bsObj)
After multiple tries, I found that your search request is malformed (it needs to be URL-encoded, not JSON), and the cookie information is overwritten in the headers (just let the session do the work).
I simplified the code like this and got the desired result:
r = requests.Session()

# GET request
gr = r.get("http://www.feline.dk")
bsObj = BeautifulSoup(gr.text, "html.parser")
auth_string = bsObj.find("input", {"name": "__RequestVerificationToken"})['value']

# POST request
search_request = "Geography.Geography=Hou&Geography.GeographyLong=Hou%2C+Danmark+(Ferieomr%C3%A5de)&Geography.Id=847fcbc5-0795-4396-9318-01e638f3b0f6&Geography.SuggestionId=&Period.Arrival=&Period.Duration=7&Period.ArrivalCorrection=False&Price.MinPrice=&Price.MaxPrice=&Price.MinDiscountPercentage=&Accommodation.MinPersonNumber=&Accommodation.MinBedrooms=&Accommodation.NumberOfPets=&Accommodation.MaxDistanceWater=&Accommodation.MaxDistanceShopping=&Facilities.SwimmingPool=false&Facilities.Whirlpool=false&Facilities.Sauna=false&Facilities.InternetAccess=false&Facilities.SatelliteCableTV=false&Facilities.FireplaceStove=false&Facilities.Dishwasher=false&Facilities.WashingMachine=false&Facilities.TumblerDryer=false"
gr = r.post(
    url='http://www.feline.dk/search/',
    data=search_request,
    headers={'Content-Type': 'application/x-www-form-urlencoded'}
)
print(gr.url)
Result :
http://www.feline.dk/search/Y2U5erq-ZSr7NOfJEozPLD5v-MZkw8DAwMHAAAA/
Thank you Kantium for your answer. In my case, I found that the RequestVerificationToken was actually generated in a JS script inside the page.
1 - Call the first page that generates the code; in my case it returned something like this inside the HTML:
<script>
    Sys.Net.WebRequestManager.add_invokingRequest(function (sender, networkRequestEventArgs) {
        var request = networkRequestEventArgs.get_webRequest();
        var headers = request.get_headers();
        headers['RequestVerificationToken'] = '546bd932b91b4cdba97335574a263e47';
    });

    $.ajaxSetup({
        beforeSend: function (xhr) {
            xhr.setRequestHeader("RequestVerificationToken", '546bd932b91b4cdba97335574a263e47');
        },
        complete: function (result) {
            console.log(result);
        },
    });
</script>
2 - Grab the RequestVerificationToken code and then add it to your request along with the cookie from set-cookie.
let resp_setcookie = response.headers["set-cookie"];
let rege = new RegExp(/(?:RequestVerificationToken", ')(\S*)'/);
let token = rege.exec(response.body)[1];
I actually store them in a global variable, and later in my Node.js request I would add this to the request object:
headers.Cookie = gCookies.cookie;
headers.RequestVerificationToken = gCookies.token;
So the end request goes out with both the cookie and the RequestVerificationToken header.
Remember that you can monitor requests sent using:
require("request-debug")(requestpromise);
Good luck !
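For a Python-based scraper, a rough equivalent of that token extraction could look like this (the regex simply mirrors the one above; URLs and the request body are placeholders):
import re
import requests

session = requests.Session()                     # keeps the Set-Cookie value for later requests
resp = session.get("http://example.com/page-with-token")   # page that embeds the script

match = re.search(r"""RequestVerificationToken", '(\S*)'""", resp.text)
token = match.group(1) if match else None

result = session.post(
    "http://example.com/search",                 # placeholder endpoint
    data="...your URL-encoded search request...",
    headers={
        "RequestVerificationToken": token,
        "Content-Type": "application/x-www-form-urlencoded",
    },
)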

Why does cURL give correct response but scrapy does not?

The site I want to scrape uses JavaScript to fill in a form, then POSTs it and verifies it before serving the content.
I've replicated this JS in Python, after scraping the parameters from the JavaScript in the initial GET request. My value of "TS644333_75" matches the JS value (as tested by doing a document.write(..) of it instead of letting it submit like normal), and if you copy and paste the result into cURL, that works too. For example:
curl --http1.0 'http://www.betvictor.com/sports/en/football' \
  -H 'Connection: keep-alive' \
  -H 'Accept-Encoding: gzip,deflate' \
  -H 'Accept-Language: en' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Referer: http://www.betvictor.com/sports/en/football' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0' \
  --data 'TS644333_id=3&TS644333_75=286800b2a80cd3334cd2895e42e67031%3Ayxxy%3A3N6QfX3q%3A1685704694&TS644333_md=1&TS644333_rf=0&TS644333_ct=0&TS644333_pd=0' \
  --compressed
For TS644333_75 I've simply copied and pasted the result my Python code calculated when simulating the JS.
Monitoring packets in Wireshark shows this POST as shown here (I've added some line breaks to make the POST data more readable, but otherwise it's as seen in Wireshark).
However if I start a scrapy shell:
1) scrapy shell "http://www.betvictor.com/sports/en/football"
and construct a form request:
2) from scrapy.http import FormRequest
req = FormRequest(
    url='http://www.betvictor.com/sports/en/football',
    formdata={
        'TS644333_id': '3',
        'TS644333_75': '286800b2a80cd3334cd2895e42e67031:yxxy:3N6QfX3q:1685704694',
        'TS644333_md': '1',
        'TS644333_rf': '0',
        'TS644333_ct': '0',
        'TS644333_pd': '0'
    },
    headers={
        'Referer': 'http://www.betvictor.com/sports/en/football',
        'Connection': 'keep-alive'
    }
)
Then fetch it
3) fetch(req)
The response body I get back is just another JavaScript challenge, not the desired content.
Yet the packet seen in Wireshark (again with some newlines added for readability in the POST params) is shown here, and to my eyes it looks identical.
What is going wrong? How can packets that appear identical lead to different server responses? Why is this not working with Scrapy?
It could be the encoding of the ':' in the computed parameter that I POST, but it looks to have been encoded correctly, and both match in Wireshark, so I can't see that being the issue.
It seems to work if you append a slash to your URL - so same scrapy request, but with URL changed to:
http://www.betvictor.com/sports/en/football/
Additional Example:
I had the same problem when testing another website where the page worked nicely with curl but did not work with requests. After fighting with it for some time, this answer with the extra slash solved the problem.
import requests
import json
r = requests.get(r'https://bet.hkjc.com/marksix/getJSON.aspx/?sd=20190101&ed=20190331&sb=0')
pretty_json = json.loads(r.text)
print (json.dumps(pretty_json, indent=2))
returns this:
[
{
"id": "19/037",
"date": "30/03/2019",
"no": "15+17+18+37+39+49",
"sno": "31",
"sbcode": "",
...
...
The slash after .aspx is important. It doesn't work without it. Without the slash, the page returns an empty javascript challenge.
import requests
import json
#no slash
r = requests.get(r'https://bet.hkjc.com/marksix/getJSON.aspx?sd=20190101&ed=20190331&sb=0')
print(r.text)
returns this:
<HTML>
<head>
<script>
Challenge=341316;
ChallengeId=49424326;
GenericErrorMessageCookies="Cookies must be enabled in order to view this page.";
</script>
<script>
function test(var1)
{
var var_str=""+Challenge;
var var_arr=var_str.split("");
var LastDig=var_arr.reverse()[0];
var minDig=var_arr.sort()[0];
var subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);
var subvar2 = (2 * var_arr[2])+var_arr[1];
var my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);
var x=(var1*3+subvar1)*1;
var y=Math.cos(Math.PI*subvar2);
var answer=x*y;
answer-=my_pow*1;
answer+=(minDig*1)-(LastDig*1);
answer=answer+subvar2;
return answer;
}
</script>
<script>
client = null;
if (window.XMLHttpRequest)
{
var client=new XMLHttpRequest();
}
else
{
if (window.ActiveXObject)
{
client = new ActiveXObject('MSXML2.XMLHTTP.3.0');
};
}
if (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))
{
document.write("Not all needed JavaScript methods are supported.<BR>");
}
else
{
client.onreadystatechange = function()
{
if(client.readyState == 4)
{
var MyCookie=client.getResponseHeader("X-AA-Cookie-Value");
if ((MyCookie == null) || (MyCookie==""))
{
document.write(client.responseText);
return;
}
var cookieName = MyCookie.split('=')[0];
if (document.cookie.indexOf(cookieName)==-1)
{
document.write(GenericErrorMessageCookies);
return;
}
window.location.reload(true);
}
};
y=test(Challenge);
client.open("POST",window.location,true);
client.setRequestHeader('X-AA-Challenge-ID', ChallengeId);
client.setRequestHeader('X-AA-Challenge-Result',y);
client.setRequestHeader('X-AA-Challenge',Challenge);
client.setRequestHeader('Content-Type' , 'text/plain');
client.send();
}
</script>
</head>
<body>
<noscript>JavaScript must be enabled in order to view this page.</noscript>
</body>
</HTML>
It turned out that the order of the parameters really mattered for this server (I guess because it was simulating a hidden form with ordered inputs, and this was an extra validation check). In Python requests, building the POST body as a string and URL-encoding it by hand (i.e. ':' --> '%3A') makes things work. So although the Wireshark packets are near enough identical, the only way they differ is the parameter string order, and indeed this is the key.
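For example, with requests the ordered, hand-encoded body described above might be built roughly like this (the field values are the ones from the question and should be treated as placeholders):
from urllib.parse import quote
import requests

fields = [
    ('TS644333_id', '3'),
    ('TS644333_75', '286800b2a80cd3334cd2895e42e67031:yxxy:3N6QfX3q:1685704694'),
    ('TS644333_md', '1'),
    ('TS644333_rf', '0'),
    ('TS644333_ct', '0'),
    ('TS644333_pd', '0'),
]
# Encoding by hand keeps the exact order and turns ':' into '%3A'.
body = '&'.join(k + '=' + quote(v, safe='') for k, v in fields)

r = requests.post(
    'http://www.betvictor.com/sports/en/football/',  # note the trailing slash
    data=body,  # a pre-built string is sent verbatim, order preserved
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
)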
In Scrapy, passing a tuple like:
ot = (
    ('TS644333_id', '3'),
    ('TS644333_75', value),
    ('TS644333_md', '1'),
    ('TS644333_rf', '0'),
    ('TS644333_ct', '0'),
    ('TS644333_pd', '0')
)
to formdata= rather than a dictionary, so that order is preserved, works too.
Also the header {'Content-Type': 'application/x-www-form-urlencoded'} is required.
As anana noted in his answer, appending a trailing '/' to all request URLs also fixes things; in fact, if you do this you can get away with GET requests alone, with no JS simulation and no form POSTing!
