In my Python code I'm downloading tiles from OpenStreetMap (OSM). For performance and traffic reasons, they are stored in temporary storage. However, before reusing this data, I want to check whether it is still up-to-date.
This is how a simple download is done:
import urllib2
# Simple download without version control:
url = r"http://a.tile.openstreetmap.org/1/1/1.png"
imgstr = urllib2.urlopen(url).read()
I'm looking for something like this (pseudo code):
imgstr = ... # Value from database
local_version = ... # Value from database
online_version = getonlineversionnumber(url)
if online_version != local_version:
    imgstr = urllib2.urlopen(url).read()
    local_version = online_version
Is there such a function as getonlineversionnumber?
**Question answered thanks to a hint from scai. No more answers required.**
It is good practice to post answers to your own questions for other readers. Here is what I have learned.
The property I was looking for is called an ETag (https://en.wikipedia.org/wiki/HTTP_ETag).
It is accessed like this:
import urllib2
url = r"http://a.tile.openstreetmap.org/1/1/1.png"
request = urllib2.Request(url)
opener = urllib2.build_opener()
firstdatastream = opener.open(request)
online_version=firstdatastream.headers.dict['etag']
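With the ETag in hand, the update check from the pseudocode above can be written roughly like this. This is only a sketch: the If-None-Match conditional request is my own addition, and the placeholder values for the cached data are made up.
import urllib2

url = r"http://a.tile.openstreetmap.org/1/1/1.png"
local_version = '"0123456789abcdef"'  # ETag previously stored in the database (placeholder)
imgstr = ''                           # tile data previously stored in the database (placeholder)

# Ask the server to send the body only if our ETag no longer matches;
# an unchanged tile comes back as HTTP 304 Not Modified.
request = urllib2.Request(url)
request.add_header('If-None-Match', local_version)
try:
    response = urllib2.urlopen(request)
    imgstr = response.read()                       # tile changed: take the new data
    local_version = response.headers.dict['etag']  # and remember its new ETag
except urllib2.HTTPError as e:
    if e.code != 304:
        raise  # a real error, not "not modified"
    # 304: the cached tile is still current, keep imgstr and local_version as they are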
I have a list of urls which look more or less like this:
'https://myurl.com/images/avatars/cb55-f14b-455d1-9ac4w20190416075520341'
I'm trying to validate the image behind each URL, check what image type it is (png, jpeg or other), and write the image type into a new dataframe column imgType.
My code so far:
import pandas as pd
import requests
df = pd.read_csv('/path/to/allLogo.csv')
urls = df.T.values.tolist()[4]
for x in urls:
    # I'm stuck here... the content doesn't seem to give me the image type.
    s = requests.get(x, verify=False).content

df["imgType"] =
df.to_csv('mypath/output.csv')
Could someone help me with this? Thanks in advance.
One possibility is to check the response headers for 'Content-Type' - but which headers are sent back to the client depends on the server (without knowing the real URL it is hard to tell):
import requests
url = 'https://s3.amazonaws.com/github/ribbons/forkme_right_darkblue_121621.png'
response = requests.get(url)
# uncomment this to print all response headers:
# print(response.headers)
print(response.headers['Content-Type'])
Prints:
image/png
check what image type (png, jpeg or other)
If you manage to download it, either to disk (a file) or into memory (as bytes, via .content of the requests response), then you can use Python's built-in imghdr module, like this:
import imghdr
imgtype = imghdr.what("path/to/image.png") # testing file on disk
or
import requests
r = requests.get("url_of_image")
imgtype = imghdr.what(None, h=r.content)  # testing bytes in memory
Keep in mind that imghdr recognizes only a limited set of image file formats (see the linked docs); however, it should suffice if you are only interested in detecting png vs jpeg vs other.
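Applied to the question's dataframe, a rough sketch combining both ideas could look like the following. The CSV paths and the imgType column name come from the question; the helper function and the column index are assumptions, untested against the real URLs.
import imghdr
import pandas as pd
import requests

df = pd.read_csv('/path/to/allLogo.csv')

def detect_image_type(url):
    # Try the Content-Type header first, then fall back to sniffing the bytes.
    try:
        response = requests.get(url, verify=False)
    except requests.RequestException:
        return 'error'
    content_type = response.headers.get('Content-Type', '')
    if content_type.startswith('image/'):
        return content_type.split('/', 1)[1]  # e.g. 'png', 'jpeg'
    return imghdr.what(None, h=response.content) or 'other'

# The URLs sit in the fifth column, as in the question's df.T.values.tolist()[4].
df['imgType'] = df.iloc[:, 4].apply(detect_image_type)
df.to_csv('mypath/output.csv', index=False)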
For example, I need to get every image displayed on a page while knowing the address it is stored at. The method I'm looking for would allow using https://scontent-arn2-1.cdninstagram.com/* where * is a random link that I can make a list of or just download.
import requests as req
phot = req.get('https://scontent-arn2-1.cdninstagram.com/vp/7987513770b3cfa32372114cde65ff7d/5D53D9CA/t51.2885-15/e35/57306185_137865170666087_417139389797653381_n.jpg?_nc_ht=scontent-arn2-1.cdninstagram.com')
with open('sea.png', 'wb') as f:
f.write(phot.content)
You could accomplish what you are describing by creating a function that takes the base_url and a list of paths to resources to download:
import requests as req

def get_resources(base_url, resource_paths):
    responses = [req.get(base_url + path) for path in resource_paths]
    return responses

res = get_resources('http://example.com/', ['img1', 'img2', 'etc'])
# This would fetch the following resources:
# - 'http://example.com/img1'
# - 'http://example.com/img2'
# - 'http://example.com/etc'
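If the goal is to save the downloaded images to disk as in the snippet above, something along these lines could follow; the output directory and file-naming scheme here are made up for illustration.
import os
import requests as req

def save_resources(base_url, resource_paths, out_dir='downloads'):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for path in resource_paths:
        response = req.get(base_url + path)
        # Use the last path segment as a crude file name.
        filename = os.path.join(out_dir, path.split('/')[-1] or 'index')
        with open(filename, 'wb') as f:
            f.write(response.content)

save_resources('http://example.com/', ['img1.jpg', 'img2.jpg'])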
I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)
The example above only gets information about the revisions, not the actual contents themselves. Here's a short Python script that downloads the full content and metadata revision history of a page into individual JSON files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
    info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
    print(i, info['timestamp'])
    with open("%s.json" % info['timestamp'], "w") as f:
        f.write(json.dumps({'info': info, 'content': content}, indent=4))
Wandering around aimlessly looking for clues to another question I have myself — my way of saying I know nothing substantial about this topic! — I just came upon this a moment after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look for the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, mwclient, pickle, os
from mwclient import Site
from mwclient.page import Page
logging.root.setLevel(logging.DEBUG)
logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
page_name = env_page if env_page is not None else 'Stack Overflow'
page_name = Page.normalize_title(page_name)
site = Site('en.wikipedia.org') # https by default. change w/`scheme=`
page = site.pages[page_name]
logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for i, revision in enumerate(page.revisions()):
revisions.append(revision)
logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
pickle.dump(revisions, f, protocol=0) # protocol allows backwards compatibility between machines
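To reuse the dump later, the matching load step is simply the reverse. The file name below is a hypothetical example following the '{}Revisions.mediawiki.pkl' pattern from the script above; substitute whatever the script actually produced.
import pickle

# Hypothetical file name; adjust to the page you exported.
with open('Stack_OverflowRevisions.mediawiki.pkl', 'rb') as f:
    revisions = pickle.load(f)

# Each entry is a dict of revision metadata (revid, user, timestamp, comment, ...).
print(len(revisions), 'revisions loaded')
print(revisions[0])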
I am trying to write a class in Python that opens a given URL and returns the data of that URL...
class Openurl:
def download(self, url):
req = urllib2.Request( url )
content = urllib2.urlopen( req )
data = content.read()
content.close()
return data
url = 'www.somesite.com'
dl = openurl()
data = dl.download(url)
Could someone correct my approach? I know one might ask why not just directly open it, but I want to show a message while it is being downloaded. The class will only have one instance.
You have a few problems.
One that I'm sure is not in your original code is the failure to import urllib2.
The second problem is that dl = openurl() should be dl = Openurl(). This is because Python is case sensitive.
The third problem is that your URL needs http:// before it. This gets rid of an unknown url type error. After that, you should be good to go!
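Putting those three fixes together, a corrected version could look like this (still urllib2 / Python 2, as in the question):
import urllib2

class Openurl:
    def download(self, url):
        req = urllib2.Request(url)
        content = urllib2.urlopen(req)
        data = content.read()
        content.close()
        return data

url = 'http://www.somesite.com'  # note the http:// scheme
dl = Openurl()                   # capital O: Python is case sensitive
data = dl.download(url)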
It should be dl = Openurl(); Python is case sensitive.
The JSON syntax definition says that HTML/XML tags (like the <script>...</script> part) are not part of valid JSON; see the description at http://json.org. A number of browsers and tools ignore these things silently, but Python does not.
I'd like to insert the JavaScript code (Google Analytics) to get info about the users of this service (location, browser, OS, ...).
What do you suggest I do?
Should I solve the problem in the [browser output][^1] or in the [python script][^2]?
thanks,
Antonio
[^1]: Browser output
<script>...</script>
[{"key": "value"}]
[^2]: python script
#!/usr/bin/env python
import urllib2, urllib, json
url="http://.........."
params = {}
url = url + '?' + urllib.urlencode(params, doseq=True)
req = urllib2.Request(url)
headers = {'Accept':'application/json;text/json'}
for key, val in headers.items():
req.add_header(key, val)
data = urllib2.urlopen(req)
print json.load(data)
These sound like two different kinds of services: one is a user-oriented web view of some data, with visualizations, formatting, etc., and the other is a machine-oriented data service. I would keep these separate, and maybe build the user view as an extension to the data service.
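As a rough illustration of that split - a hypothetical sketch using Flask, not taken from the question's code - one route serves plain JSON for machine clients, while a separate HTML view carries the analytics <script> and fetches that JSON in the browser.
from flask import Flask, jsonify, render_template_string

app = Flask(__name__)

DATA = [{"key": "value"}]

@app.route('/api/data')
def data_service():
    # Machine-oriented endpoint: pure JSON, no <script> tags mixed in.
    return jsonify(items=DATA)

@app.route('/view')
def user_view():
    # User-oriented page: HTML carrying the analytics snippet and fetching the JSON.
    return render_template_string("""
        <html>
          <head><script>/* google analytics snippet goes here */</script></head>
          <body>
            <script>
              fetch('/api/data').then(function (r) { return r.json(); })
                                .then(function (d) { console.log(d.items); });
            </script>
          </body>
        </html>
    """)

if __name__ == '__main__':
    app.run()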