I'm putting together a quick script in Python 2.x to obtain the MD5 hash of a number of images and movies on specific websites. According to the w3.org website, HTTP/1.1 defines a Content-MD5 header that carries the MD5 value, but I'm wondering whether it has to be set by the website admin. My script is below:
import httplib
c = httplib.HTTPConnection("www.samplesite.com")
c.request("HEAD", "/sampleimage.jpg")
r = c.getresponse()
res = r.getheaders()
print res
I have a feeling I need to edit 'HEAD' or possibly r.getheaders but I'm just not sure what to replace them with.
Any suggestions? As said, I'm just looking to point at an image or movie and capture its MD5 hash value. Ideally I don't want to download the file, to save bandwidth, which is why I'm trying to do it this way.
Thanks in advance
Yes, the server has to be set up to send it, and it's rare that servers actually include a Content-MD5 header in their responses. You can check for it, but in most cases you'll need to download the video or image, unfortunately.
(At least hashlib is simple!)
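A minimal sketch of both steps, assuming Python 3 (on Python 2.x, urllib2 plays the role of urllib.request). The function names are illustrative, not part of any library. Note that when a server does send Content-MD5, the value is the base64-encoded digest rather than a hex string.

```python
import hashlib
from urllib.request import Request, urlopen


def md5_of_chunks(chunks):
    """Feed an iterable of byte strings into MD5 and return the hex digest."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def content_md5_header(url):
    """Send a HEAD request and return the Content-MD5 header, or None if absent."""
    with urlopen(Request(url, method="HEAD")) as resp:
        return resp.headers.get("Content-MD5")


def md5_of_url(url, chunk_size=64 * 1024):
    """Download the resource in chunks so the whole file never sits in memory."""
    with urlopen(url) as resp:
        # iter() with a sentinel keeps calling read() until it returns b""
        return md5_of_chunks(iter(lambda: resp.read(chunk_size), b""))
```

Reading in chunks does not save bandwidth, but it keeps memory use flat for large movies.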
Related
Seeking a bit of guidance on a general approach as to how one would automate the retrieval of data from a My Google Map. While I could easily export any given layer to KML/KMZ, I'm looking for a way to do this within a larger script, that will automate the process. Preferably, where I wouldn't even have to log in to the map itself to complete the data pull.
So, what do you think the best approach is? Two possible options I'm considering are 1) using Selenium/Beautiful Soup to simulate page clicks on Google Maps and export the KMZ, or 2) making use of the Python Google Maps API. However, I'm not sure whether this API makes it possible to download a Google Maps layer via a script.
To be clear, the data is already in the map - I'm just looking for a way to export it. It could either be a KMZ export, or better yet, GeoJSON.
Any thoughts or advice welcome! Thank you in advance.
I used my browser’s inspection feature to figure out what was going on under the hood with the website I was interested in grabbing data from, which led me to this solution.
I use Selenium to log in and navigate said website, then transfer my cookies to Python’s Requests package. I have Requests send a specific query to the server, whose response is in the form of JSON. I was able to figure out what query to send and what form the response would take through the inspection feature mentioned above. Once I have the response in JSON, I use Python’s json package to convert it into a Python dict to use however I need.
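The cookie hand-off described above can be sketched roughly as follows. This assumes the cookies come from Selenium's driver.get_cookies(), which returns a list of dicts with "name" and "value" keys; the helper names and the query URL are hypothetical.

```python
def cookies_to_dict(selenium_cookies):
    """Flatten Selenium's list of cookie dicts into a plain name -> value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def fetch_json(url, selenium_cookies):
    """Replay the browser session's cookies in a Requests call and decode the JSON."""
    # Imported here so cookies_to_dict works even without Requests installed.
    import requests

    resp = requests.get(url, cookies=cookies_to_dict(selenium_cookies), timeout=30)
    resp.raise_for_status()
    return resp.json()  # parsed into Python dicts and lists
```

Usage would look like fetch_json(query_url, driver.get_cookies()), where query_url is whatever request you found in the browser's network tab.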
Sounds like you might not necessarily need Selenium but it does sound like the Requests package would be useful to your use case. I think your first step is figuring out what form the server response is when you interact with the website naturally to get what you want.
Hopefully this helps to some degree!
I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site html but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hovertext but this defeats the purpose of looking at the html. Additionally, the mapping changes with each match history so it is not feasible to do this for a large number of games.
Is there any way around this?
thank you
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to piece the page back together via the ids of items, masteries, etc., which won't be too hard because you can request masteries similarly to how we fetch items.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
If you notice, there is a gameHash and the id (similar to that of the link you just sent me). This page contains everything you need to rebuild it, given that you fetch all reliant JSON files.
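Since the two URLs share the same pieces, the JSON URL can be derived from the match history link. A rough sketch, assuming every match link follows the same layout as the one in the question; the function and variable names are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def stats_url_from_match_url(match_url):
    """Build the stats JSON URL from a match history page URL."""
    # Everything after '#' is the fragment, e.g.
    # "match-details/TRLH1/1002200043?gameHash=...&tab=overview"
    fragment = urlsplit(match_url).fragment
    path, _, query = fragment.partition("?")
    _, platform, game_id = path.split("/")
    game_hash = parse_qs(query)["gameHash"][0]
    return "https://acs.leagueoflegends.com/v1/stats/game/{}/{}?gameHash={}".format(
        platform, game_id, game_hash
    )
```

This makes it practical to loop over many match links without opening each page in a browser.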
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
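If you would rather stay in Python, json.dumps with an indent gives a similar view. A small sketch; the sample string is a made-up fragment that only mimics the nesting of the item data, not real output from the server.

```python
import json

# Hypothetical sample shaped like the item data (data -> item id -> fields)
raw = '{"data": {"1001": {"name": "Boots of Speed"}}}'

parsed = json.loads(raw)  # JSON text -> Python dict
pretty = json.dumps(parsed, indent=2, sort_keys=True)  # dict -> indented text
print(pretty)
```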
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tooltips about each item in the game. You can access the item you want via theirJson['data']['1001']. The file name of each item image on the page is the item's id (1001 in this example).
For instance, for 'Boots of Speed':
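import json
import requests
# Download the item data file and parse the JSON text into a dict
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
# '1001' is the item id for Boots of Speed
print(itemJson['data']['1001'])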
An alternative: Selenium
Selenium could be used for this. You should look it up. It's been ported to several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (described above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).
I am a python newbie. I am currently doing basic web-scraping. On browsing through several GitHub projects, I found one that lets the user download an srt file.
Here's the doubt. Suppose the url is like this:
http://www.opensubtitles.org/en/subtitles/6528547/silicon-valley-the-lady-bs
How do I get the random hash value 6528547? On a side note, I'd appreciate tips on how to get started working with APIs.
Assuming that you have the url and just want to get the "hash", the easiest way is to split the url using '/' as the separator and then take the element at index 5 (the sixth element) of the returned list.
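# the url from the question; substitute your own
url = "http://www.opensubtitles.org/en/subtitles/6528547/silicon-valley-the-lady-bs"
# index 5 is the segment right after 'subtitles'
subtitle_id = url.split('/')[5]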
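Splitting by position breaks if the url gains or loses a segment (for example a different language prefix). A slightly more defensive sketch, assuming the id is always the run of digits right after /subtitles/; the function name is illustrative.

```python
import re
from urllib.parse import urlsplit


def subtitle_id_from_url(url):
    """Return the numeric id that follows /subtitles/ in the path, or None."""
    match = re.search(r"/subtitles/(\d+)", urlsplit(url).path)
    return match.group(1) if match else None
```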
I wrote a script that I'm using to push updates to Pushbullet channels whenever a new Nexus factory image is released. A separate channel exists for each of the first 11 devices on that page, and I'm using a rather convoluted script to watch for updates. The full setup is here (specifically this script), but I'll briefly summarize the script below. My question is this: This is clearly not the correct way to be doing this, as it's very susceptible to multiple points of failure. What would be a better method of doing this? I would prefer to stick with Python, but I'm open to other languages if they would be simpler/better.
(This question is prompted by the fact that I updated my Apache 2.4 config tonight and it apparently triggered a slight change in the output of the local files that are watched by urlwatch, so ALL 11 channels got an erroneous update pushed to them.)
Basic script functionality (some nonessential parts are not included):
1. Create dictionary of each device codename associated with its full model name
2. Get existing Nexus Factory Images page using Requests
3. Make bs4 object from source code
4. For each of the 11 devices in the dictionary (loop), do the following:
   a. Open/create page in public web directory for the device
   b. Write source to that page, filtered using bs4: str(soup.select("h2#" + dev + " ~ table")[0])
   c. Call urlwatch on the page to check for updates, save output to temp file
   d. If temp file size is > 0 then the page has changed, so push update to the appropriate channel
   e. Remove webpage and temp file
A thought that I had while typing this question: Would a possible solution be to save each current version string (for example: 5.1.0 (LMY47I)) as a pickled variable, then if urlwatch detects a difference it would compare the new version string to the pickled one and only push if they're different? I would throw regex matching in as well to ensure that the new format matches the old format and just has updated data, but could this at least be a good temporary measure to try to prevent future false alarms?
Scraping is inherently fragile, but if they don't change the source format it should be pretty straightforward in this case. You should parse the webpage into a data structure. Using bs4 is fine for this. The end result should be a python dictionary:
{
    'mantaray': {
        '4.2.2 (JDQ39)': {'link': 'https://...'},
        '4.3 (JWR66Y)': {'link': 'https://...'},
    },
    ...
}
Save this structure with json.dumps. Now every time you parse the page you can generate a similar data structure and compare it to the one you have on disk (update the saved one each time after you are done).
Then the only part left is comparing the two data structures. Iterate over all models and check that each version in the current parse of the page also exists in the saved one. If it does not, you have a new version.
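The save-and-compare step could look roughly like this. It is a sketch under the assumption that both snapshots have the device -> version -> details shape shown above; the function names and the snapshot path are hypothetical.

```python
import json
import os


def load_snapshot(path):
    """Return the previously saved structure, or an empty dict on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_snapshot(path, data):
    """Overwrite the saved structure with the latest parse of the page."""
    with open(path, "w") as f:
        json.dump(data, f, indent=2, sort_keys=True)


def find_new_versions(previous, current):
    """Map each device to the versions present now but missing from the snapshot."""
    new = {}
    for device, versions in current.items():
        known = previous.get(device, {})
        added = [version for version in versions if version not in known]
        if added:
            new[device] = added
    return new
```

Because only version keys are compared, cosmetic changes in the page's HTML no longer trigger a push. On the very first run every version counts as new, so you may want to save a snapshot once before enabling pushes.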
You can also potentially generate an easy to use API for this using https://www.kimonolabs.com/ instead of doing the parsing yourself.
I'm trying to connect to a torrent tracker to receive a list of peers to play bit torrent with, however I am having trouble forming the proper GET request.
As far as I understand, I must obtain the 20 byte SHA1 hash of the bencoded 'info' section from the .torrent file. I use the following code:
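import hashlib
import bencode
# SHA-1 of the bencoded 'info' dictionary; digest() returns the raw 20 bytes
h = hashlib.new('sha1')
h.update(bencode.bencode(meta_dict['info']))
info_hash = h.digest()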
This is where I am stuck. I can not figure out how to create the proper url-encoded info_hash to stick into a URL string as a parameter.
I believe it involves some combination of urllib.urlencode and urllib.quote, however my attempts have not worked so far.
Well, a bit late, but this might help someone.
The requests module URL-encodes the parameters by itself. First create a dictionary with the parameters (info_hash, peer_id, etc.), then make a GET request:
response = requests.get(tracker_url, params=params)
I think that urllib.quote_plus() is all you need.
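To make the encoding step concrete, here is a sketch using only the standard library, assuming Python 3 (where urllib.quote lives in urllib.parse). The function name, tracker URL, and default values are hypothetical; the parameter names follow the usual tracker announce request.

```python
from urllib.parse import quote, urlencode


def build_announce_url(tracker_url, info_hash, peer_id, port=6881, left=0):
    """Percent-encode the raw 20-byte values and append them as query parameters."""
    params = {
        "info_hash": info_hash,  # raw digest bytes, not the hex string
        "peer_id": peer_id,      # 20 bytes identifying this client
        "port": port,
        "uploaded": 0,
        "downloaded": 0,
        "left": left,
    }
    # quote_via=quote encodes a space byte as %20 instead of '+'
    return tracker_url + "?" + urlencode(params, quote_via=quote)
```

The key point is to pass the output of digest(), not hexdigest(), so that each raw byte is percent-encoded individually.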