BeautifulSoup cannot scrape data from HKJC web

I am trying to scrape data from
https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/01/27&Racecourse=ST&RaceNo=1
using BeautifulSoup in Python with the simple code below:
import requests
from bs4 import BeautifulSoup
url = "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/01/27&Racecourse=ST&RaceNo=2"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
It works occasionally, but most of the time it returns the result below:
<html>
<head>
<script>
Challenge=579033;
ChallengeId=232487458;
GenericErrorMessageCookies="Cookies must be enabled in order to view this page.";
</script>
<script>
function test(var1)
{
var var_str=""+Challenge;
var var_arr=var_str.split("");
var LastDig=var_arr.reverse()[0];
var minDig=var_arr.sort()[0];
var subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);
var subvar2 = (2 * var_arr[2])+var_arr[1];
var my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);
var x=(var1*3+subvar1)*1;
var y=Math.cos(Math.PI*subvar2);
var answer=x*y;
answer-=my_pow*1;
answer+=(minDig*1)-(LastDig*1);
answer=answer+subvar2;
return answer;
}
</script>
<script>
client = null;
if (window.XMLHttpRequest)
{
var client=new XMLHttpRequest();
}
else
{
if (window.ActiveXObject)
{
client = new ActiveXObject('MSXML2.XMLHTTP.3.0');
};
}
if (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))
{
document.write("Not all needed JavaScript methods are supported.<BR>");
}
else
{
client.onreadystatechange = function()
{
if(client.readyState == 4)
{
var MyCookie=client.getResponseHeader("X-AA-Cookie-Value");
if ((MyCookie == null) || (MyCookie==""))
{
document.write(client.responseText);
return;
}
var cookieName = MyCookie.split('=')[0];
if (document.cookie.indexOf(cookieName)==-1)
{
document.write(GenericErrorMessageCookies);
return;
}
window.location.reload(true);
}
};
y=test(Challenge);
client.open("POST",window.location,true);
client.setRequestHeader('X-AA-Challenge-ID', ChallengeId);
client.setRequestHeader('X-AA-Challenge-Result',y);
client.setRequestHeader('X-AA-Challenge',Challenge);
client.setRequestHeader('Content-Type' , 'text/plain');
client.send();
}
</script>
</head>
<body>
<noscript>
JavaScript must be enabled in order to view this page.
</noscript>
</body>
</html>
Could anyone tell me why it sometimes works and sometimes fails?
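For what it's worth, the page returned above looks like an anti-bot JavaScript challenge rather than the results page: the script computes an answer from Challenge, POSTs it back, and reloads once a cookie is set. requests does not execute JavaScript, so the challenge is never answered whenever the server decides to issue it. A minimal sketch for at least detecting that case before parsing, assuming the challenge page always defines Challenge and ChallengeId the way the output above does (the helper name find_challenge is hypothetical):

```python
import re

# Match the two variables the challenge page defines in its first <script>.
CHALLENGE_RE = re.compile(r"\bChallenge\s*=\s*(\d+)\s*;")
CHALLENGE_ID_RE = re.compile(r"\bChallengeId\s*=\s*(\d+)\s*;")


def find_challenge(html):
    """Return (challenge, challenge_id) if html looks like the challenge page, else None."""
    challenge = CHALLENGE_RE.search(html)
    challenge_id = CHALLENGE_ID_RE.search(html)
    if challenge and challenge_id:
        return int(challenge.group(1)), int(challenge_id.group(1))
    return None


# Cut-down copy of the output shown in the question.
sample = """<script>
Challenge=579033;
ChallengeId=232487458;
</script>"""

print(find_challenge(sample))
print(find_challenge("<html><body><table></table></body></html>"))
```

If the helper returns a tuple for page.text, the request received the challenge instead of the data, and parsing it with BeautifulSoup will not find any results.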

Related

Send file path from flask to Ajax

I'm trying to send a file path to my Ajax script, which reads the file's contents and displays them on the page.
@app.route('/main', methods=['GET'])
def main():
    filename = '/static/js/'+current_user.username+'log.txt'
    return render_template('main.html',name=current_user.username,data=filename)
js script
var checkInterval = 1; //seconds
var fileServer = '{{ data }}';
var lastData;
function checkFile() {
$.get(fileServer, function (data) {
if (lastData !== data) {
$( "#target" ).val( data );
$( "#target" ).animate({
scrollTop: $( "#target" )[0].scrollHeight - $( "#target" ).height()
}, 'slow');
lastData = data;
}
});
}
$(document).ready(function () {
setInterval(checkFile, 1000 * checkInterval);
});
I tried different ways to do this, such as changing fileServer to 'data.filename' or {{ data| json }}, but had no luck.
How can I do this?

If you pass the entire url generated with url_for as a parameter it should work.
@app.route('/main', methods=['GET'])
def main():
    filename = url_for('static', filename=f'js/{current_user.username}log.txt')
    return render_template('main.html', name=current_user.username, data=filename)
As a supplement, I also specify that the request should not be stored in the cache.
const checkInterval = 1;
const fileServer = "{{ data }}";
let lastData;
function checkFile() {
$.get({ url: fileServer, cache: false }, function(data) {
if (lastData !== data) {
$("#target").val(data);
$("#target").animate({
scrollTop: $("#target")[0].scrollHeight - $("#target").height()
}, "slow");
lastData = data;
}
});
}
$(document).ready(function() {
setInterval(checkFile, 1000 * checkInterval);
});
I used jquery version 3 for testing.

How to deploy Pytorch in Python via a REST API with Flask?

I am working on AWS Sagemaker and my goal is to follow this tutorial from Pytorch's official documentation.
The original predict function from the tutorial above is the following:
@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        file = request.files['file']
        img_bytes = file.read()
        class_id, class_name = get_prediction(image_bytes=img_bytes)
        return jsonify({'class_id': class_id, 'class_name': class_name})
I was getting this error, so I added 'GET' as a method as mentioned in here. I also simplified my example to its minimal expression:
from flask import Flask, jsonify, request
app = Flask(__name__)
@app.route('/predict', methods=['GET','POST'])
def predict():
    if request.method == 'POST':
        return jsonify({'class_name': 'cat'})
    return 'OK'
if __name__ == '__main__':
    app.run()
I perform requests with the following code:
import requests
resp = requests.post("https://catdogclassifier.notebook.eu-west-1.sagemaker.aws/proxy/5000/predict",
                     files={"file": open('/home/ec2-user/SageMaker/cat.jpg', 'rb')})
resp is <Response [200]>, but resp.json() raises JSONDecodeError: Expecting value: line 1 column 1 (char 0). Finally, resp.url points me to a page saying 'OK'.
Moreover, this is the output of resp.content:
<!DOCTYPE HTML>
<html>
<head>
<style type="text/css">
#loadingImage {
margin: 10em auto;
width: 234px;
height: 238px;
background-repeat: no-repeat;
background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAOoAAADuBAMAAADFK4ZrAAAAMFBMVEUAAAAaGhoXFxfo6Og3NzdPT0/4+PgmJibFxcWhoaF8fHzY2NiysrJmZmaNjY1DQ0OLk+cnAAAAAXRSTlMAQObYZgAABlpJREFUeNrtnGnITFEYx5+a3Ox1TjRFPpxbGmSZsmfLliKy73RKZImaSVmy7xFJtmSLLFlHdrIklH3fhSyJLOEDQpak4865951znmfGUuf34f3w1u13/+c+d+a555w74HA4HA6Hw+FwOBwOh+P/Z8DBfec3be3WbdO2U3P2SPgTeA1WXmuT9n2fse9/Eq1vvUhCoYntWzfqu1DBfdZ+Yz0oKA0WpblgWYhE6+UFzBub+8ZnYfBE7z1QIIY9ZoJFICYuLUxZDbkhWBEkboVqPdrJNGwuWJGIz0nQiD3pPYMibcFy0j+jHTZN8LEZglTktvpfstOWFozxd4SkJvAOWdpj7DvVJE46uJNgRvh9ZVbU78RxQxzbwIzpo0Vl/AFG6l0Q5tbE2qyo3+mIsR7wmQWVJwWiYq3FOzEb+Jikioq2ekuYHfyOioq2NkkxS8qdVFGR1lJvmS18rFRRcVZVv+bEr6ioKOuQtwxB5STAAoa3LmEY+AcoIfDWEYKhqDVpAUNbvTUMSVuBsKqoaPDWh+zPW1VF6Pzsibmff6u3INI56su9Vz+4Pyqdb2vpFAuDi693Z9Ub9POJZ9+8rkzk1XokXFpuyyAJikHzbvh5tMbehg5u26laf3MiLfJmHSiYTuLuINDwDjQXOayUTwie2CghjP09clgptZTYLKNa1875sdYPk0Ikh9P5sHqX9fH9LCESb246D9bSQm8RkkWe5sIU3bpLs1aekaOXvEy26gMcfwQ5aJoiWvUB5h8l5GKVIFqbiezu4KjBqXYmWp9mR+0JBuwWJGusud71GVA8RbIOFVlRr4ORlTbCe1FRYTLBqt83vI9ZVFoNl0plt7eYqMqKu6yfUVdVWVGXNT4fFVVZUXdrBYmKqqyou/WTWVSitUTWAE8yi0q0DlfFZD43Nplq3cECdDeLSrR6CxADPJlsDRZTDYmMqqz23xx8NCIqwlpSWH/dFGdka51gCc83qT+aVR+tchLRBCCswRKuKk2uCdnqBZ4geUtEb4ewxoIlvNqkEliRXLSeA4lnEI+6CuMZ+DIi0DFJ2xPVyVh/9lcBA2Kdi5JWk7ZZeRcwYaYW1rIeoVQLphBHzVqmzpFav1IGDPAOtVL0lWBE4/utIug9FYzwaiuk8Qp/7QgkOBwOh8PhcDgcDofJ00M9MMSzP6bB9lYRtD9jKL2gjmk9w2zS/5JgEfDKZg+WJfzfjqluuxqpMw5MqCuYolYSsZ6OWdiZos2n0KYE+XyTE+9sPZ9SVhRp7W496VoVDCjGiqSl9XJQFbqVvzOpx8ARXXBW6wm+BdoRRGs8aTuDz+fTrSa3fGntPIlWPt96LrOyRFmt74Jj5qOjrOTPiEuBQ9qRrVURCzPrMVb7cmymrd3SrFURu7HKSaKVz0fsmagGRGtVzG6sikQrn4/Y9MDXE61VMbtE45MMraSocJwFKC8NraSoxd+wAKMBZ1VREXuNeEuatSpqn378KMnKj1q1pOqy0rJOQq1hjQaa9TMYMF3bjEW0xhcbNBEtsg/KEK2sZtJ+R/cEoFr5HchBU6GtC5KtLHEyxythb7SrMoluZWMzlq+EVZN5sPKPSbtXwlTXT+pgbkaf/IG0XvZHKVZFIkrrzU6HVL2kWZX2rjTeKB2/AjSrQvSqBxoDzimponKGalX4+jzMvqs+C2E0kK0KXm7b7++oewOWaa8YaLVEtzIuJt7atwd+MGjAvpfvuQi/uyXNquMnWt/e9urZpmtd0z4LJ74WyFbd6/uM//gTRaUkwkqFd4e/YK0sCdar+Kh4a/WGb3HW8hJv5UfVy5lWxK8D
3lodoDgq7IQkyqp64LopRNTFgLKqdj92Q1iX0jiJtKonm4Zp61LKANKqnmy8I7bWzYC2qoe4UhvsxvezJFirIt8yrjwJMFZ96+Ihi0tbbr29Us3A1wx0nebWm4CiVErfpRlbxTRytcyYd5lrQoDiZhVF+LWOZiJkQ+pgkw8LPn4GYIltEL5aolJpU7mlkwDPsCe9kiH/fc1zNDUqKQpPhp9MukhpgX50J3a+jR/pjN9MQmHwGl/1RcQTwSMJBSN2+n1Ku7yCj9ySgYJSe25X5ovgr0bd2imh0AxbueJ96tcvZI2aeO9FPfgjeI32nX2+qVu/TZuWHtw5CBwOh8PhcDgcDofD4XA4HP8i3wDmy/sFKv4WfAAAAABJRU5ErkJggg==);
-webkit-animation:spin 4s linear infinite;
-moz-animation:spin 4s linear infinite;
animation:spin 4s linear infinite;
}
@-moz-keyframes spin { 100% { -moz-transform: rotate(360deg); } }
@-webkit-keyframes spin { 100% { -webkit-transform: rotate(360deg); } }
@keyframes spin { 100% { -webkit-transform: rotate(360deg); transform:rotate(360deg); } }
</style>
</head>
<body>
<div id="loadingImage"></div>
<script type="text/javascript">
var RegionFinder = (function()
{
function RegionFinder( location ) {
this.location = location;
}
RegionFinder.prototype = {
getURLWithRegion: function() {
var isDynamicDefaultRegion = ifPathContains(this.location.pathname, "region/dynamic-default-region");
var queryArgs = removeURLParameter(this.location.search, "region");
var hashArgs = this.location.href.split("#")[1] || "";
if (hashArgs) {
hashArgs = "#" + hashArgs;
}
var region = this._getCurrentRegion();
var newArgs = "region=" + region;
if (_shouldAuth()) {
newArgs = "needs_auth=true";
region = "nil";
}
if (queryArgs &&
queryArgs != "?") {
queryArgs += "&" + newArgs;
} else {
queryArgs = "?" + newArgs;
}
if (!region) {
var contactUs = "https://portal.aws.amazon.com/gp/aws/html-forms-controller/contactus/aws-report-issue1";
alert("How embarrassing! There is something wrong with this URL, please contact AWS at " + contactUs);
}
var pathname = isDynamicDefaultRegion ? "/console/home" : this.location.pathname;
return this.location.protocol + "//" + _getRedirectHostFromAttributes() +
pathname + queryArgs + hashArgs;
},
_getCurrentRegion: function() {
return _getRegionFromHash( this.location ) ||
_getRegionFromAttributes();
}
};
function ifPathContains(url, parameter) {
return (url.indexOf(parameter) != -1);
}
function removeURLParameter(url, parameter) {
var urlparts= url.split('?');
if (urlparts.length>=2) {
var prefix= encodeURIComponent(parameter);
var pars= urlparts[1].split(/[&;]/g);
//reverse iteration as may be destructive
for (var i= pars.length; i-- > 0;) {
if (pars[i].lastIndexOf(prefix, 0) !== -1) {
pars.splice(i, 1);
}
}
url= urlparts[0]+'?'+pars.join('&');
return url;
} else {
return url;
}
}
function _getRegionFromAttributes() {
return "eu-west-1";
};
function _shouldAuth() {
return "";
};
function _getRedirectHostFromAttributes() {
return "eu-west-1.console.aws.amazon.com";
}
function _getRegionFromHash( location ) {
var hashArgs = "#" + (location.href.split("#")[1] || "");
var hashRegionArg = "";
var match = hashArgs.match("region=([a-zA-Z0-9-]+)");
if (match && match.length > 1 && match[1]) {
hashRegionArg = match[1];
}
return hashRegionArg;
}
return RegionFinder;
})();
var regionFinder = new RegionFinder( window.location );
window.location.href = regionFinder.getURLWithRegion();
</script>
</body>
</html>
What am I missing?

Looks like the content of your resp is HTML as opposed to JSON; this is likely a consequence of how the Jupyter server proxy endpoint you're attempting to POST to (https://catdogclassifier.notebook.eu-west-1.sagemaker.aws/proxy/5000/predict) is configured.
It looks like you're using a SageMaker notebook instance, so you might not have much control over this configuration. A workaround could be to instead deploy your Flask server as a SageMaker endpoint running outside JupyterLab, instead of directly on a notebook instance.
If you want to prototype using only a notebook instance, you can alternatively bypass the proxy entirely and call your Flask route on localhost from another notebook tab while the Flask server runs in your main notebook tab. Note that app.run() serves plain HTTP by default, so the URL uses http:
import requests
resp = requests.post("http://localhost:5000/predict",
                     files={"file": open('/home/ec2-user/SageMaker/cat.jpg', 'rb')})
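A small guard makes this kind of failure easier to spot. The sketch below checks the Content-Type before decoding, so an HTML redirect page produces a readable error instead of a JSONDecodeError. It takes the header value and body as plain arguments (with requests these would presumably be resp.headers.get("Content-Type", "") and resp.text); the function name decode_json_body is made up for illustration.

```python
import json


def decode_json_body(content_type, body):
    """Decode body as JSON, but only if the server says it sent JSON."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json":
        preview = body[:60].replace("\n", " ")
        raise ValueError(
            f"expected JSON, got {media_type or 'no content type'}: {preview!r}"
        )
    return json.loads(body)


# A JSON response decodes as usual.
print(decode_json_body("application/json", '{"class_name": "cat"}'))

# An HTML page like the one in the question is rejected with a clear message.
try:
    decode_json_body("text/html; charset=utf-8", "<!DOCTYPE HTML>\n<html>")
except ValueError as exc:
    print(exc)
```

This does not fix the proxy behaviour; it only makes the mismatch between what was expected and what came back visible at the call site.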

How do I check if a file has been uploaded from an HTML front end to a python backend?

I have HTML and AngularJS code in the front end that uploads a file to the backend. How do I check in Python whether the file has been uploaded?
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.3.14/angular.min.js"></script>
<div ng-controller = "myCtrl">
<input type = "file" file-model = "myFile"/>
<button ng-click = "uploadFile()">upload me</button>
</div>
<script>
var myApp = angular.module('myApp', []);
myApp.directive('fileModel', ['$parse', function ($parse) {
return {
restrict: 'A',
link: function(scope, element, attrs) {
var model = $parse(attrs.fileModel);
var modelSetter = model.assign;
element.bind('change', function(){
scope.$apply(function(){
modelSetter(scope, element[0].files[0]);
});
});
}
};
}]);
myApp.service('fileUpload', ['$http', function ($http) {
this.uploadFileToUrl = function(file, uploadUrl){
var fd = new FormData();
fd.append('file', file);
$http.post(uploadUrl, fd, {
transformRequest: angular.identity,
headers: {'Content-Type': undefined}
})
.success(function(){
})
.error(function(){
});
}
}]);
myApp.controller('myCtrl', ['$scope', 'fileUpload', function($scope, fileUpload){
$scope.uploadFile = function(){
var file = $scope.myFile;
console.log('file is ' );
console.dir(file);
var uploadUrl = "/fileUpload";
fileUpload.uploadFileToUrl(file, uploadUrl);
};
}]);
</script>

Requests does not return html anymore - Python

I am trying to get a name from a public LinkedIn URL using requests on Python 2.7.
The code used to work fine.
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/in/linustorvalds"
html = requests.get(url).content
link = BeautifulSoup(html).title.text.split("|")[0].replace(" ","")
print link
The desired output is:
linustorvalds
I am getting the following error message:
AttributeError: 'NoneType' object has no attribute 'text'
The issue seems to be that html is not returning the real content of the page. So there is no 'title' found. This is the result of printing html:
<html><head>
<script type="text/javascript">
window.onload = function() {
var newLocation = "";
if (window.location.protocol == "http:") {
var cookies = document.cookie.split("; ");
for (var i = 0; i < cookies.length; ++i) {
if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
newLocation = "https:" + window.location.href.substring(window.location.protocol.length);
}
}
}
if (newLocation.length == 0) {
var domain = location.host;
var newDomainIndex = 0;
if (domain.substr(0, 6) == "touch.") {
newDomainIndex = 6;
}
else if (domain.substr(0, 7) == "tablet.") {
newDomainIndex = 7;
}
if (newDomainIndex) {
domain = domain.substr(newDomainIndex);
}
newLocation = "https://" + domain + "/uas/login?trk=sentinel_org_block&session_redirect=" + encodeURIComponent(window.location)
}
window.location.href = newLocation;
}
</script>
</head></html>
Am I being blocked? What are the possible suggestions to make this code work as before?
Thanks a lot!

Try setting a User-Agent header:
html = requests.get(url, headers={"User-Agent": "Requests"}).content

Get a specific data from specific <script type="text/javascript">

I have an HTML page with multiple script tags. The problem is that I want to extract data from one specific tag:
<head>
...
</head>
<body>
...
<script type="text/javascript">
$j(document).ready(function() {
if (!($j.cookie("ios"))) {
new $c.free.widgets.FreeAdvDialog().open();
$j.cookie("ios", "seen", { path: '/', expires: 10000});
};
ajax_keys = ["d24349f205e3deb7f1015f42d3a14da7205b62e4", "0ae78c4797d47745ebd44e2754367da10c6f56a4", "567b2bfb6fd1aee784115da54e5e116a280ee225", "fc5cd251be46ff101c471553d52c07bf08c9aa65"];
var is_dm = false;
/* async chart loader */
var chart = new $c.free.widgets.Chart({
target: $j('#graph'),
width: 990,
height: 275,
site: "911.com",
source_panel: 'us'
});
var chart_view = new $c.free.widgets.ChartView({
chart: chart,
csv_button: 'csv-export',
save_button: 'graph-image',
embed_button: 'embed-graph',
key: ajax_keys[1]
});
chart_view.render();
/* zoom info initialization */
var zoom_info = new $c.free.widgets.ZoomInfo({
site: "911.com",
el: '#zoominfo',
key: ajax_keys[3]
});
zoom_info.load();
/* compete numbers initialization */
var compete_numbers = new $c.free.widgets.CompeteNumbers({
site: "911.com",
key: ajax_keys[0],
el: '#compete_numbers'
});
compete_numbers.load();
/* DM Marketing widget init */
new $c.free.widgets.DMSignupMessage({
is_dm: is_dm,
compete_numbers: compete_numbers
});
/* personalization initialization */
var logged_in_as = null;
var d = {
site_name: "911.com",
logged_in_as: logged_in_as,
current_source_panel: {"display_abbreviation": "us", "panel_name": "us", "image_url": "http://media.compete.com/site_media/images/icons/flag_us.gif", "id": 1, "display_name": "United States"}
};
var auth_model = new $c.free.widgets.FreeLoginModel(d);
var links_opts = { model: auth_model };
var links_view = new $c.free.widgets.FreeAccountLinksView(links_opts);
var sites_view = new $c.free.widgets.FollowSiteButtonView(links_opts);
var manage_view = new $c.free.widgets.ManageSitesListButtonView(links_opts);
var sites = new $c.free.widgets.SimilarSitesCollection([], {
site: "911.com",
source_panel: 'us',
key: ajax_keys[2],
auth: auth_model
});
var graph = new $c.free.widgets.BarGraph({
el: $j('#similar-sites'),
collection: sites
});
// tell KISSMetrics where we are
// also identify user so KM console can refer to them by email
if(logged_in_as != null) {
_kmq.push(['identify', logged_in_as]);
}
_kmq.push(['record', 'Viewed Free Site Analytics Report (M)']);
});
...
How can I get ajax_keys (i.e. "d24349f205e3deb7f1015f42d3a14da7205b62e4") from that specific tag of the page?
P.S. I tried to use regular expressions in a Python script, but I can't retrieve the necessary element from the tag.
Thanks for the help.

If you use a library like BeautifulSoup you can fetch the specific script tag, and then use a regex on the contents of the tag instead of the entire document.
That said, it looks like a regex will work assuming there is only the one ajax_keys:
import re
ajaxre = re.compile(r"^\s+ajax_keys = ([^;]+)", re.MULTILINE)
ajax_string = ajaxre.search(source).group(1)
# to get it as a python list
import json
ajax_keys = json.loads(ajax_string)
Edit: thanks @Karl Knechtel for json.loads
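Run against a cut-down copy of the script from the question, the regex-plus-json approach looks roughly like this. It uses search rather than match so the pattern can hit a line in the middle of the text; the sample string is shortened for illustration and stands in for the contents of the script tag.

```python
import json
import re

# Shortened stand-in for the contents of the page's <script> tag.
source = '''
$j(document).ready(function() {
    ajax_keys = ["d24349f205e3deb7f1015f42d3a14da7205b62e4", "0ae78c4797d47745ebd44e2754367da10c6f56a4"];
    var is_dm = false;
});
'''

# Capture everything between "ajax_keys =" and the closing semicolon.
ajax_re = re.compile(r"^\s*ajax_keys\s*=\s*([^;]+);", re.MULTILINE)
found = ajax_re.search(source)

# The captured text is a JSON array literal, so json.loads turns it into a list.
ajax_keys = json.loads(found.group(1)) if found else []
print(ajax_keys[0])
```

The fallback to an empty list is a design choice for the sketch; raising an error when the pattern is not found may be preferable if the key is required.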
