Properly watch websites for updates - Python

I wrote a script that I'm using to push updates to Pushbullet channels whenever a new Nexus factory image is released. A separate channel exists for each of the first 11 devices on that page, and I'm using a rather convoluted script to watch for updates. The full setup is here (specifically this script), but I'll briefly summarize the script below. My question is this: This is clearly not the correct way to be doing this, as it's very susceptible to multiple points of failure. What would be a better method of doing this? I would prefer to stick with Python, but I'm open to other languages if they would be simpler/better.
(This question is prompted by the fact that I updated my Apache 2.4 config tonight and it apparently triggered a slight change in the output of the local files that are watched by urlwatch, so ALL 11 channels got an erroneous update pushed to them.)
Basic script functionality (some nonessential parts are not included):
Create dictionary of each device codename associated with its full model name
Get existing Nexus Factory Images page using Requests
Make bs4 object from source code
For each of the 11 devices in the dictionary (loop), do the following:
Open/create page in public web directory for the device
Write source to that page, filtered using bs4: str(soup.select("h2#" + dev + " ~ table")[0])
Call urlwatch on the page to check for updates, save output to temp file
If temp file size is > 0 then the page has changed, so push update to the appropriate channel
Remove webpage and temp file
A thought that I had while typing this question: would a possible solution be to save each current version string (for example, 5.1.0 (LMY47I)) as a pickled variable, and then, if urlwatch detects a difference, compare the new version string to the pickled one and only push if they're different? I would throw in regex matching as well, to ensure that the new string matches the old format and just has updated data. Could this at least be a good temporary measure to prevent future false alarms?
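For reference, here's roughly what I have in mind (the state-file name and regex are just examples, one state file per device):

import os
import pickle
import re

VERSION_RE = re.compile(r"^\d+(\.\d+)+ \([A-Z0-9]+\)$")   # matches e.g. "5.1.0 (LMY47I)"
STATE_FILE = "last_version_mantaray.pkl"                  # one state file per device (example name)

def version_changed(new_version):
    """Push only when the scraped version is well-formed and differs from the stored one."""
    if not VERSION_RE.match(new_version):
        return False  # malformed scrape; don't send a false alarm
    old_version = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            old_version = pickle.load(f)
    if new_version == old_version:
        return False
    with open(STATE_FILE, "wb") as f:
        pickle.dump(new_version, f)
    return True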

Scraping is inherently fragile, but as long as they don't change the source format, it should be pretty straightforward in this case. You should parse the webpage into a data structure. Using bs4 is fine for this. The end result should be a Python dictionary:
{
    'mantaray': {
        '4.2.2 (JDQ39)': {'link': 'https://...'},
        '4.3 (JWR66Y)': {'link': 'https://...'},
    },
    ...
}
Save this structure with json.dumps. Now every time you parse the page you can generate a similar data structure and compare it to the one you have on disk (update the saved one each time after you are done).
Then the only part left is comparing the data structures. You can iterate over all models and check that each version present in the current version of the page also exists in the previous version. If it does not, you have a new version.
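A minimal sketch of that flow (the page URL, state-file name, device list and table column layout are assumptions; adapt them to your script, which already has the codename dictionary and the selector shown in the question):

import json

import requests
from bs4 import BeautifulSoup

URL = "https://developers.google.com/android/nexus/images"   # the factory images page
STATE_FILE = "factory_images.json"                            # where the parsed structure is kept
DEVICES = ["mantaray", "hammerhead"]                          # your codename dictionary's keys

def scrape():
    soup = BeautifulSoup(requests.get(URL).text, "html.parser")
    data = {}
    for dev in DEVICES:
        table = soup.select("h2#" + dev + " ~ table")[0]      # same selector as in the question
        data[dev] = {}
        for row in table.select("tr"):
            cells = row.select("td")
            links = row.select("a")
            if not cells or not links:
                continue                                      # skip header rows
            version = cells[0].get_text(strip=True)           # e.g. "4.3 (JWR66Y)"; column layout assumed
            data[dev][version] = {"link": links[0]["href"]}
    return data

current = scrape()
try:
    with open(STATE_FILE) as f:
        previous = json.load(f)
except FileNotFoundError:
    previous = {}

for dev, versions in current.items():
    for version in versions:
        if version not in previous.get(dev, {}):
            print("new image for", dev, ":", version)         # push to the device's Pushbullet channel here

with open(STATE_FILE, "w") as f:
    json.dump(current, f)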
You can also potentially generate an easy to use API for this using https://www.kimonolabs.com/ instead of doing the parsing yourself.

Related

Script for automating online tool query

So I had a number of amino acid sequence strings that I wanted to use as input to a tool that studies their interactions with certain components of the human immune system (http://www.cbs.dtu.dk/services/NetMHCcons/).
I wanted to ask what, if any, would be a way of accessing the tool, inputting data, and getting the output via a script (R or Python preferably). My main issue is that I have a lot of sequences that need to be queried separately, so I wanted to automate the whole thing. The website has one field labelled "Submission" which takes the string input. There is another field, "select species/loci", which gives a drop-down menu from which an option needs to be selected. Lastly, there's a "submit" button. The output simply loads on the page after hitting submit.
I've tentatively poked around with RSelenium and Rcurl but wanted to ask if there was a more efficient method.
I took a look at what it'd take to send a POST request to this service from Python, and it looks possible:
This form takes "multipart/form-data" (see: How to send a "multipart/form-data" with requests in python?), so you'll need to send your data in that format. You could inspect a request from the browser (using the dev tools) and copy the fields from there as a starting point.
Once the form is submitted, it doesn't give you the result right away. You'd need to get your job ID from the response, and then poll the URL http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid={your_job_id}&wait=20 until it gives you the result.
The result will then need to be downloaded and parsed.
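A rough sketch of those three steps with requests; the submit URL, form field names and the job-ID pattern are assumptions that should be confirmed against a request captured in the browser's dev tools:

import re
import time

import requests

SUBMIT_URL = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi"   # assumption: take the real action URL from dev tools

# Field names/values are placeholders; copy the real ones from a captured request.
form = {
    "SEQPASTE": (None, ">seq1\nMKTAYIAKQR"),   # the "Submission" field
    "allele":   (None, "HLA-A02:01"),          # the "select species/loci" drop-down
}

resp = requests.post(SUBMIT_URL, files=form)                  # files= makes requests send multipart/form-data
job_id = re.search(r"jobid=([A-F0-9]+)", resp.text).group(1)  # assumption: the job ID appears in the response

while True:
    result = requests.get(
        "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid={}&wait=20".format(job_id))
    if "is being processed" not in result.text:               # assumption about the waiting-page text
        break
    time.sleep(20)

print(result.text)  # the finished page; download/parse the result from here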
This tool is, however, available as a portable version for Linux/Mac: https://services.healthtech.dtu.dk/software.php
Perhaps downloading this version would make it easier?
Try this:
Submitting to a web form using python
That link is an answer on how to send web forms in Python using urllib. Check the source code of the page you linked, extract the necessary data from it with the re module, and send the request.
Save the HTML source code of http://www.cbs.dtu.dk/services/NetMHCcons/ in the Python file as
source_code = '''...'''
The HTML source can be found by pressing CTRL+U in Firefox.
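A minimal sketch of that idea; the field name and value are placeholders, and note that urllib sends URL-encoded data by default, whereas the previous answer points out this service expects multipart/form-data, so requests may end up being the easier route:

import re
import urllib.parse
import urllib.request

source_code = '''...'''  # paste the saved page source here (CTRL+U in Firefox)

# Pull the form's action URL and input names out of the saved source with re.
action_match = re.search(r'<form[^>]+action="([^"]+)"', source_code)
field_names = re.findall(r'<input[^>]+name="([^"]+)"', source_code)
print(field_names)  # decide which fields you need to fill in

if action_match:
    # Field name/value below are placeholders for your own sequence data.
    data = urllib.parse.urlencode({"SEQPASTE": ">seq1\nMKTAYIAKQR"}).encode()
    with urllib.request.urlopen(action_match.group(1), data=data) as resp:
        print(resp.read().decode())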

Accessing hover text with HTML

I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site HTML, but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hover text, but this defeats the purpose of looking at the HTML. Additionally, the mapping changes with each match history, so it is not feasible to do this for a large number of games.
Is there any way around this?
Thank you
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to piece the page back together via the IDs of items, masteries, etc., which won't be too hard, because you can request masteries similarly to how we fetch items below.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
If you notice, there is a gameHash and the ID (similar to those in the link you just sent me). This page contains everything you need to rebuild the match page, given that you fetch all of the JSON files it relies on.
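For example, a minimal fetch of that stats JSON with requests; inspect the top-level keys first, since the exact structure is whatever the payload happens to contain:

import requests

# The stats endpoint spotted in the Network tab for this particular match.
url = ("https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043"
       "?gameHash=b98e62c1bcc887e4")

game = requests.get(url).json()
print(game.keys())  # inspect the structure before rebuilding the page from it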
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tooltips about each item in the game. You can access your desired item via theirJson['data']['1001']. The file name of each item image on the page is the item's ID (1001 in this example).
For instance, for 'Boots of Speed':
import requests, json
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
An alternative: Selenium
Selenium could be used for this. You should look it up. It's been ported to several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (described above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).

How to extract files from ScrapingHub?

I have deployed some Scrapy spiders to scrape data which I can download in .csv from ScrapingHub.
Some of these spiders have FilePipeline which I used to download files (pdf) to a specific folder. Is there any way I can retrieve these files from ScrapingHub via the platform or API?
Though I still have to go over ScrapingHub's documentation, I'm quite certain that, despite the platform having a file explorer, no actual file is being generated, or it is being ignored during the crawl. I assume so given the fact that if you try to deploy one of your projects with anything other than the files that correspond to a Scrapy project, you have to do some hacking around with your settings and setup file for ScrapingHub to accept the extra files. For example, if you try to keep a ton of start URLs in a file and then use a helper function to parse all of that into your spider, it works like a charm locally, but ScrapingHub wasn't built with that in mind.
I assume you know that you can download your scraped data in CSV or another desired format straight from the web interface. Personally, I use the ScrapingHub client API in Python. All three of its libraries are, I believe, deprecated at this point, so you kind of have to mix and match to get something fully functional.
I have a side gig doing content aggregation for a fairly well-known adult site. By using the ScrapingHub API client for Python, I'm able to connect to my account with the API key and do whatever I need. There are some limitations; not so much a limitation as something that really bothers me is that the function to get the name of a project was deprecated in the first version of their client library. I'd like to see, when I'm parsing my items, the name of the project under which the spider runs its different jobs (the crawls). When I first started to mess around with the client, it just looked messy.
What's even nicer is that when you create a project, run your spider, and all your items are collected, you can directly download those items from the web interface as I mentioned, but you can also target your output to get exactly the effect you want.
For example, I'm crawling a site and getting media items like videos. There are three things you always need: the name or title of the video, the source URL where the video can be reached or where it is embedded (which you can then request for every instance you need), and of course the metadata, i.e. the tags and categories associated with the video.
The largest crawl so far, the one that output the most items, was around 150,000 items; it was a broad crawl with something like 15 or 17% duplicates. I then access each video through the API client by its field name (key value). Of course, in my case I always use all three of the key values, but I can target categories or tags by matching on the corresponding key value and output only the items (still outputting all three fields) that match a particular string or expression, which lets me sort through my content quite effectively. In this particular Scrapy project, I'm simply printing out, or rather creating, a .m3u playlist from all of it.
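For reference, the item-fetching part of that workflow looks roughly like this with the python-scrapinghub client; the API key, project ID and item field names below are placeholders:

from scrapinghub import ScrapinghubClient  # pip install scrapinghub

client = ScrapinghubClient("YOUR_API_KEY")      # placeholder API key
project = client.get_project(123456)            # placeholder project ID

# Walk the finished jobs and pull their items straight through the API,
# instead of downloading a CSV from the web interface.
for job_summary in project.jobs.iter(state="finished"):
    job = client.get_job(job_summary["key"])
    for item in job.items.iter():
        # Field names depend entirely on your spider's items.
        print(item.get("title"), item.get("url"), item.get("tags"))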

Send PDF file from Django website to LogicalDOC

I've been developing my Django website for about two months and I'm beginning to get a good overall result with my own functions.
But now I have to start a very hard part (to my mind) and I need some advice and ideas before doing it.
My Django website creates some PDF files from HTML templates with Django variables. Up to now, I've been saving the PDF files directly on my desktop (in a specific folder), but that is completely insecure.
So I installed another web application named LogicalDoc in order to store the PDF files directly in this application. PDF files are created and then sent to LogicalDoc.
LogicalDoc has two APIs, SOAP and REST (http://wiki.logicaldoc.com/rest/#/), and I know that Django can communicate with the REST one.
I'm also reading this part of the Django documentation in order to understand how to proceed: https://docs.djangoproject.com/en/dev/topics/http/file-uploads/
I made a diagram in order to illustrate what I'm describing:
Then I'll write a script which does a few things:
When the PDF file is created, I create a folder inside LogicalDoc which takes, for example, the following name: lastname_firstname_birthday
Two possibilities: if the folder already exists, I don't create a new one; otherwise I create it.
Once that's done, I send the PDF file directly into the folder, comparing the PDF name with the folder name to do that.
I have some questions about this process:
Firstly, is it possible to do this kind of thing?
Is it hard to do?
What kind of advice could you give me?
Thank you so much!
PS: If you need some part of my script, mainly the PDF-creating part, I can post it just after my question ;)
The idea is pretty simple; however, it always requires some practice.
I strongly advise you to use the REST API and forget about SOAP, as the only thing SOAP can bring you is 'pain' :)
If we check the documentation for document/create, it gives the following information:
The endpoint we have to communicate with:
[protocol]://[server]:[port]/document/create
The HTTP method to use: POST
The list of parameters to provide with your request: body, document, content
What's more, you can test the API by clicking on the "Try it out" button and checking the requests in the "Network" tab of your browser (if you open the Developer Tools).
I am not sure what kind of metadata you have to provide in the 'document' parameter, but you can easily get an idea of what should be done by testing it and putting XML or JSON data into the 'document' parameter.
'content' is an array of bytes transferred to the server (which would be your file).
To sum up, a request to the 'document/create' URI will be simple:
import requests, json

body = {'headers': {}, 'object': {}}
document = "<note>data</note>"
content = open('report.xls', 'rb')  # r - reading, b - binary
# requests has no body/document/content keywords; send all three as multipart form fields instead
r = requests.post('http://logicaldoc/document/create',
                  files={'body': (None, json.dumps(body)), 'document': (None, document), 'content': content})
Please keep in mind that file-transfer requests take time, and you may sometimes get a timeout exception. Your code will stop and wait for the response, so it may be a good idea to get some practice with asyncio or Celery. Just keep those kinds of possible issues in mind.
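For the timeout point, a small illustration (the timeout value is arbitrary and the request mirrors the sketch above):

import requests

files = {'content': open('report.xls', 'rb')}   # as in the request above

try:
    r = requests.post('http://logicaldoc/document/create', files=files, timeout=120)
    r.raise_for_status()
except requests.exceptions.Timeout:
    # The upload took too long; log it and retry later, or hand the work to a Celery task.
    pass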

Scrape with Python, commanded by Excel VBA

I already had a previous question, but that was posted with VBA tags, etc. So I'll try again with a proper title and tags, since I've hopefully gained a bit of knowledge by now.
The problem:
I need to find ~1000 dates from a database with plant variety data, which is probably behind a login, so here is a screenshot. Now, I could of course fill out this form ~1000 times, but there must be a smarter way to do this. If it were a plain HTML site I would know what to do and have VBA just pull in the results. I have been reading all morning about these JavaScript pages and AJAX libraries, but it is above my level, so hopefully someone can help me out a bit. I also used Firebug to see what is going on when I press search:
These parameters are the same as in the picture posted above, just easier to read. They are left here for copying.
f.cc.facet.limit = -1
f.cc.facet.mincount = 1
f.end_date.facet.date.end = 2030-01-01T00:00:00Z
f.end_date.facet.date.gap = +5YEARS
f.end_date.facet.date.other = all
f.end_date.facet.date.start = 1945-01-01T00:00:00Z
f.end_type.facet.limit = 20
f.end_type.facet.mincount = 1
f.grant_start_date.facet.date.end = NOW/YEAR
f.grant_start_date.facet.date.gap = +5YEARS
f.grant_start_date.facet.date.other = all
f.grant_start_date.facet.date.start = 1900-01-01T00:00:00Z
f.status.facet.limit = 20
f.status.facet.mincount = 1
f.type.facet.limit = 20
f.type.facet.mincount = 1
facet = true
facet.date = grant_start_date
facet.date = end_date
facet.field = cc
facet.field = type
facet.field = status
facet.field = end_type
fl = uc,cc,type,latin_name,common_name,common_name_en,common_name_others,app_num,app_date,grant_start_date,den_info,den_final,id
hl = true
hl.fl = cc,latin_name,den_info,den_final
hl.fragsize = 5000
hl.requireFieldMatch = false
json.nl = map
q = cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles
qi = 3-9BgbCWwYBd7aIWPU1/onjQ==
rows = 25
sort = uc asc,score desc
start = 0
type = upov
wt = json
Source
fl=uc%2Ccc%2Ctype%2Clatin_name%2Ccommon_name%2Ccommon_name_en%2Ccommon_name_others%2Capp_num%2Capp_date%2Cgrant_start_date%2Cden_info%2Cden_final%2Cid&hl=true&hl.fragsize=5000&hl.requireFieldMatch=false&json.nl=map&wt=json&type=upov&sort=uc%20asc%2Cscore%20desc&rows=25&start=0&qi=3-9BgbCWwYBd7aIWPU1%2FonjQ%3D%3D&hl.fl=cc%2Clatin_name%2Cden_info%2Cden_final&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND%20den_info%3AAntilles&facet=true&f.cc.facet.limit=-1&f.cc.facet.mincount=1&f.type.facet.limit=20&f.type.facet.mincount=1&f.status.facet.limit=20&f.status.facet.mincount=1&f.end_type.facet.limit=20&f.end_type.facet.mincount=1&f.grant_start_date.facet.date.start=1900-01-01T00%3A00%3A00Z&f.grant_start_date.facet.date.end=NOW%2FYEAR&f.grant_start_date.facet.date.gap=%2B5YEARS&f.grant_start_date.facet.date.other=all&f.end_date.facet.date.start=1945-01-01T00%3A00%3A00Z&f.end_date.facet.date.end=2030-01-01T00%3A00%3A00Z&f.end_date.facet.date.gap=%2B5YEARS&f.end_date.facet.date.other=all&facet.field=cc&facet.field=type&facet.field=status&facet.field=end_type&facet.date=grant_start_date&facet.date=end_date
And this is the response it returns, at least according to Firebug:
{"response":{"start":0,"docs":[{"id":"6751513","grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles","app_num":"005642_A 005642","latin_name":"Zea mays L.","common_name_others":["MAIS"],"uc":"ZEAAA_MAY","type":"NLI","app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1},"qi":"3-9BgbCWwYBd7aIWPU1/onjQ==","facet_counts":{"facet_queries":{},"facet_ranges":{},"facet_dates":{"end_date":{"after":0,"start":"1945-01-01T00:00:00Z","before":0,"2010-01-01T00:00:00Z":1,"between":1,"end":"2030-01-01T00:00:00Z","gap":"+5YEARS"},"grant_start_date":{"after":0,"1995-01-01T00:00:00Z":1,"start":"1900-01-01T00:00:00Z","before":0,"between":1,"end":"2015-01-01T00:00:00Z","gap":"+5YEARS"}},"facet_intervals":{},"facet_fields":{"status":{"approved":1},"end_type":{"ter":1},"type":{"nli":1},"cc":{"it":1}}},"sv":"bswa1.wipo.int","lastUpdated":1435987857572,"highlighting":{"6751513":{"den_final":["Antilles<\/em>"],"latin_name":["Zea<\/em> mays<\/em> L."],"cc":["IT<\/em>"]}}}
Edit:
It uses the GET method and XMLHttpRequest, as can be seen from this screenshot:
I already found how to run Python from Excel VBA here in this topic.
I also downloaded Beautiful Soup, but Python is not my kind of language, so any help would be greatly appreciated.
Image referred to in a comment on Will's answer
1) Use Excel to store your search parameters.
2) Run a few manual searches to find out what parameters you need to change on each request.
3) Invoke an HTTP GET request to the URL that you found in Firebug/Fiddler (the URL it calls when you click "search" manually). See urllib3: https://urllib3.readthedocs.org/en/latest/
4) Look at jsonpickle to help you deal with the JSON response, saving (serializing) it to a file.
5) Reading and writing data involves IO libraries. Google is your friend. (It's possibly easier to save your Excel file as a CSV and then just read the CSV file for your search parameters.)
6) Download PyCharm for your Python development - it's really good.
Hope this helps. A rough sketch of steps 1-4 follows below.
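The sketch uses urllib3 for the GET and plain json for saving (in place of jsonpickle, since the response is already JSON), and assumes the Excel sheet has been exported to a hypothetical search_params.csv with cc and latin_name columns; the real request may also need the other parameters captured in Firebug (facets, qi, etc.), and the endpoint below is the one identified further down in this thread:

import csv
import json

import urllib3

http = urllib3.PoolManager()
url = "https://www3.wipo.int/pluto/user/jsp/select.jsp"   # endpoint found via Firebug/Fiddler

# search_params.csv is a hypothetical CSV export of the Excel sheet: one row per
# query with the columns you actually vary, e.g. cc and latin_name.
with open("search_params.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        query = "cc:{} AND latin_name:({})".format(row["cc"], row["latin_name"])
        resp = http.request("GET", url,
                            fields={"q": query, "wt": "json", "rows": "25", "start": "0"})
        data = json.loads(resp.data.decode("utf-8"))
        with open("result_{}.json".format(i), "w") as out:
            json.dump(data, out)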
I finally figured it out. I don't need to use Python; I can just use a URL and then import the content into Excel. I found out with Fiddler that the URL should become https://www3.wipo.int/pluto/user/jsp/select.jsp? and the piece of code from my question goes after that.
The rest of my solution can be found in another question I had. It uses no Python, only VBA, which commands IE to open the website and copy its content.
