I am currently modifying a simple monitoring script I made some time ago that basically:
- Build a list of dictionaries containing, amongst other things:
  - A website URL
  - The time it took to respond (set to None by default)
  - The data it sent back (set to None by default)
- Query (GET) each URL from the list and fill the 'time' and 'data' fields with the relevant data.
- Store the results in a database.
The script used to work fine, but as the list of URLs to monitor grew, the time it takes to complete all the queries became far too long for me.
My solution is to modify the script to fetch the URLs concurrently. To do that I chose to use eventlet, since this example from the documentation does almost exactly what I want.
The catch is that since my list of URLs contains dictionaries, I can't use pool.imap() to iterate through my list (as far as I know).
The Eventlet documentation has another similar example* that uses a GreenPile object to spawn jobs. It seems I can use that to launch my URL-fetching function, but I can't figure out how to retrieve the result of the spawned green thread.
Here is my test code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import eventlet
from eventlet.green import urllib2

urls = [{'url': 'http://www.google.com/intl/en_ALL/images/logo.gif', 'data': None},
        {'url': 'http://www.google.com', 'data': None}]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
pile = eventlet.GreenPile(pool)

for url in urls:
    pile.spawn(fetch, url['url'])  # can I get the return of the function here?

# or
for url in urls:
    url['data'] = ???  # How do I get my data back?

# Eventlet's documentation way
data = "\n".join(pile)
As far as I understand, pile is an iterable, so I can iterate through it, but I can't access its contents via an index. Is this correct?
So, how can I directly fill my urls list (is that even possible)? Another solution could be to build one "flat" list of URLs and a second list containing the URL, response time and data, then use pool.imap() on the first list and fill the second with the results, but I'd rather keep my list of dictionaries.
*I can't post more than 3 links with this account; please see the "Design patterns - Dispatch patterns" page from the eventlet documentation.
You can iterate over the GreenPile, but you need to return something from your fetch function so you have more than just the response body. I modified the example so that fetch returns a tuple of the URL and the response body.
The urls variable is now a dict mapping URLs (strings) to data (None or a string). Iterating over a GreenPile continues until there are no more tasks. Iterating should be done in the same thread that calls spawn.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import eventlet
from eventlet.green import urllib2

# Changed to map urls to the data found at them
urls = {'http://www.google.com/intl/en_ALL/images/logo.gif': None,
        'http://www.google.com': None}

def fetch(url):
    # return the url together with the response body
    return (url, urllib2.urlopen(url).read())

pool = eventlet.GreenPool()
pile = eventlet.GreenPile(pool)

for url in urls.iterkeys():
    pile.spawn(fetch, url)  # can I get the return of the function here? - No

for url, response in pile:
    # stick it back into the dict
    urls[url] = response

for k, v in urls.iteritems():
    print '%s - %d bytes' % (k, len(v))
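As an aside, pool.imap() does work with a list of dictionaries too, since imap accepts any iterable. A minimal sketch (untested, reusing the question's original urls list of dicts, the imports above, and a hypothetical fetch_item helper):
def fetch_item(item):
    # item is one of the question's dicts; fill its 'data' field in place
    item['data'] = urllib2.urlopen(item['url']).read()
    return item

pool = eventlet.GreenPool()
for item in pool.imap(fetch_item, urls):
    print '%s - %d bytes' % (item['url'], len(item['data']))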
Related
I am working on a project that is divided into two parts:
- Retrieve a specific page
- Once the ID of this page is extracted, send requests to an API to obtain additional information on this page
For the second point, and to follow Scrapy's asynchronous philosophy, where should such code be placed? (I hesitate between the spider and a pipeline.)
Do we have to use different libraries like asyncio & aiohttp to achieve this goal asynchronously? (I love aiohttp, so using it is not a problem.)
Thank you
Since you're doing this to fetch additional information about an item, I'd just yield a request from the parsing method, passing the already scraped information in the meta attribute.
You can see an example of this at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
This can also be done in a pipeline (either using scrapy's engine API, or a different library, e.g. treq).
I do however think that doing it "the normal way" from the spider makes more sense in this instance.
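For illustration, a minimal sketch of that meta pattern (the spider name, URLs and selectors here are all hypothetical placeholders, not the exact code from the linked docs):
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/page']

    def parse(self, response):
        # extract the page ID however your page exposes it (placeholder xpath)
        page_id = response.xpath('//div/@data-id').extract_first()
        item = {'page_id': page_id}
        # yield a request to the API, carrying the partially built item along
        yield scrapy.Request(
            'https://api.example.com/info/%s' % page_id,
            callback=self.parse_api,
            meta={'item': item},
        )

    def parse_api(self, response):
        # pick the item back up and enrich it with the API response
        item = response.meta['item']
        item['extra'] = response.text
        yield item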
I recently had the same problem (again) and found an elegant solution using the Twisted decorator twisted.internet.defer.inlineCallbacks.
# -*- coding: utf-8 -*-
import scrapy
import re
from twisted.internet.defer import inlineCallbacks
from sherlock import utils, items, regex

class PagesSpider(scrapy.spiders.SitemapSpider):
    name = 'pages'
    allowed_domains = ['thing.com']
    sitemap_follow = [r'sitemap_page']

    def __init__(self, site=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)

    @inlineCallbacks
    def parse(self, response):
        # things
        request = scrapy.Request("https://google.com")
        response = yield self.crawler.engine.download(request, self)
        # Twisted executes the request and resumes the generator here with the response
        print(response.text)
I am looking to find various statistics about players in games such as CS:GO from the Steam Web API, but cannot work out how to search through the JSON returned from the query (e.g. here) in Python.
I just need to be able to get a specific part of the list that is provided, e.g. finding total_kills from the link above. If I had a way to sort through all of the information provided and filter it down to just that specific thing (in this case total_kills), that would help a load!
The code I have at the moment to turn it into something Python can read is:
url = "http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=FE3C600EB76959F47F80C707467108F2&steamid=76561198185148697&include_appinfo=1"
data = requests.get(url).text
data = json.loads(data)
If you are looking for a way to search through the stats list, then try this:
import requests
import json

def findstat(data, stat_name):
    for stat in data['playerstats']['stats']:
        if stat['name'] == stat_name:
            return stat['value']

url = "http://api.steampowered.com/ISteamUserStats/GetUserStatsForGame/v0002/?appid=730&key=FE3C600EB76959F47F80C707467108F2&steamid=76561198185148697"
data = requests.get(url).text
data = json.loads(data)

total_kills = findstat(data, 'total_kills')  # change 'total_kills' to your desired stat name
print(total_kills)
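As an aside (assuming a reasonably recent version of requests), the text/json.loads two-step can be collapsed, since a Response can decode JSON directly:
data = requests.get(url).json()  # equivalent to json.loads(requests.get(url).text)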
This is the first time I've asked a question here; if I've got something wrong, please forgive me.
I've only been using Python for a month, and I'm trying to use Scrapy to learn more about spiders.
My question is here:
def get_chapterurl(self, response):
    item = DingdianItem()
    item['name'] = str(response.meta['name']).replace('\xa0', '')
    yield item
    yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id})

def get_chapter(self, response):
    urls = re.findall(r'<td class="L">(.*?)</td>', response.text)
As you can see, I yield an item and a Request at the same time, but the get_chapter function never runs its first line (I set a breakpoint there), so where did I go wrong?
Sorry for disturbing you.
I have googled for a while, but got nothing...
Your request gets filtered out.
Scrapy has an in-built request filter that prevents you from downloading the same page twice (an intended feature).
Let's say you are on http://example.com; this request you yield:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})
tries to download http://example.com again. And if you look at the crawling log it should say something along the lines of "ignoring duplicate url http://example.com".
You can always bypass this feature by setting the dont_filter=True parameter on your Request object, like so:
yield Request(url=response.url, callback=self.get_chapter, meta={'name': name_id},
              dont_filter=True)
However! I'm having trouble understanding the intention of your code, but it seems that you don't really want to download the same url twice.
You don't have to schedule a new request either; you can just call your callback with the response you already have:
response = response.replace(meta={'name': name_id})  # update meta

# why crawl it again, if we can just call the callback directly!
# for python2:
for result in self.get_chapter(response):
    yield result

# or if you are running python3:
yield from self.get_chapter(response)
I created a web crawler that, given a base_url, will spider out and find all possible endpoints. While I am able to get all the endpoints, I need a way to figure out how I got there in the first place: a 'URL stack trace', per se, or breadcrumbs of URLs leading to each endpoint.
I first start by finding all URLs given a base URL. Since the sublinks I'm looking for are within a JSON, I thought the best way to do this would be a variation of a recursive dictionary example I found here: http://www.saltycrane.com/blog/2011/10/some-more-python-recursion-examples/:
import requests
import pytest
import time

BASE_URL = "https://www.my-website.com/"

def get_leaf_nodes_list(base_url):
    """
    :base_url: The starting point to crawl
    :return: List of all possible endpoints
    """
    class Namespace(object):
        # A wrapper class used to create a Namespace instance to hold the ns.results variable
        pass
    ns = Namespace()
    ns.results = []

    r = requests.get(base_url)
    time.sleep(0.5)  # so we don't cause a DDOS?
    data = r.json()

    def dict_crawler(data):
        # Retrieve all nodes from the nested dict
        if isinstance(data, dict):
            for item in data.values():
                dict_crawler(item)
        elif isinstance(data, list) or isinstance(data, tuple):
            for item in data:
                dict_crawler(item)
        else:
            if type(data) is unicode:
                if "http" in data:  # If http in value, keep going
                    # Only append data if it isn't already in ns.results
                    if str(data) not in ns.results:
                        ns.results.append(data)
                        sub_r = requests.get(data)
                        time.sleep(0.5)  # so we don't cause a DDOS?
                        sub_r_data = sub_r.json()
                        dict_crawler(sub_r_data)

    dict_crawler(data)
    return ns.results
To reiterate, get_leaf_nodes_list does a GET request and looks through the JSON values for URLs (checking whether the string 'http' is in each value), recursively doing more GET requests until there are no URLs left.
So, to reiterate, here are the questions I have:
How do I get a linear history of all the URLs I hit to get to each endpoint?
As a corollary, how would I store this history? As the leaf nodes list grows, my process gets exponentially slower, and I am wondering if there's a better data type out there to store this information, or a more efficient approach than the code above.
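For what it's worth, a minimal sketch of one possible approach (hypothetical, adapted from the code above and untested): thread a breadcrumb trail through the recursion and store the results as a dict mapping each endpoint to the trail that reached it. A dict also makes the membership test O(1) instead of scanning a growing list, which may help with the slowdown.
def dict_crawler(data, trail):
    # trail is the list of URLs followed to reach this point
    if isinstance(data, dict):
        for item in data.values():
            dict_crawler(item, trail)
    elif isinstance(data, (list, tuple)):
        for item in data:
            dict_crawler(item, trail)
    elif type(data) is unicode and "http" in data:
        if str(data) not in ns.results:
            ns.results[str(data)] = trail  # ns.results is now a dict: url -> breadcrumbs
            sub_r = requests.get(data)
            time.sleep(0.5)
            dict_crawler(sub_r.json(), trail + [str(data)])

# initial call, with ns.results = {} instead of []:
# dict_crawler(data, [base_url])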
I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm
where modifying the inputs, limit and query, will generate the wanted data. limit is the maximum number of terms possible and query is the seed term.
The URL provides a text result formatted in this way:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});
I'd like to write a Python script that takes two arguments (the limit number and the query seed), fetches the data online, parses the result, and returns a list with the new terms, ['newterm1', 'newterm2'] in this case.
I'd love some help, especially with the URL fetching since I have never done this before.
It sounds like you can break this problem up into several subproblems.
Subproblems
There are a handful of problems that need to be solved before composing the completed script:
Forming the request URL: Creating a configured request URL from a template
Retrieving data: Actually making the request
Unwrapping JSONP: The returned data appears to be JSON wrapped in a JavaScript function call
Traversing the object graph: Navigating through the result to find the desired bits of information
Forming the request URL
This is just simple string formatting.
url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')
Python 2 Note
On Python 2 versions before 2.6 (where str.format was added), you will need to use the string formatting operator (%) instead.
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
Retrieving data
You can use the built-in urllib.request module for this.
import urllib.request
data = urllib.request.urlopen(url) # url from previous section
This returns a file-like object called data. You can also use a with-statement here:
with urllib.request.urlopen(url) as data:
    # do processing here
Python 2 Note
Import urllib2 instead of urllib.request. Note that in Python 2 the response object does not directly support the with-statement; wrap it in contextlib.closing if you want one (as in the 2.7 script below).
Unwrapping JSONP
The result you pasted looks like JSONP. Given that the wrapping function that is called (oo.visualization.Query.setResponse) doesn't change, we can simply strip this method call out.
result = data.read().decode('utf-8')  # urlopen returns bytes in Python 3
prefix = 'oo.visualization.Query.setResponse('
suffix = ');'
if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
Parsing JSON
The resulting string is just JSON data. Parse it with the built-in json module.
import json
result_object = json.loads(result)
Traversing the object graph
Now, you have a result_object that represents the JSON response. The object itself will be a dict with keys like version, reqId, and so on. Based on your question, here is what you would need to do to create your list.
# Get the rows in the table, then get the second column's value
# (index 1, the query term) for each row
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
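As a quick sanity check against the sample row from the question's response:
row = {'c': [{'v': 0.9894380670262618, 'f': '0.99'}, {'v': 'newterm1'}]}
print(row['c'][1]['v'])  # -> 'newterm1'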
Putting it all together
#!/usr/bin/env python3
"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""
import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'
    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)
    try:
        with urllib.request.urlopen(url) as data:
            result = data.read().decode('utf-8')  # urlopen returns bytes
    except Exception:
        print('Could not request data from server', file=sys.stderr)
        exit(E_OPERATION_ERROR)
    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm
limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        exit(E_INVALID_PARAMS)
    exit(main(limit, seedterm))
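For example, to use it as a library (assuming the file above is saved as scriptname.py; the module name is a placeholder):
from scriptname import retrieve_terms

terms = retrieve_terms(2, 'seedterm')
for term in terms:
    print(term)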
Python 2.7 version
#!/usr/bin/env python2.7
"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""
from contextlib import closing
import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'
    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)
    try:
        # urllib2's response does not support the with-statement directly,
        # so wrap it in contextlib.closing
        with closing(urllib2.urlopen(url)) as data:
            result = data.read()
    except Exception:
        sys.stderr.write('Could not request data from server\n')
        exit(E_OPERATION_ERROR)
    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''%s limit seedterm
limit must be an integer''' % sys.argv[0]
        sys.stderr.write('%s\n' % error_message)
        exit(E_INVALID_PARAMS)
    exit(main(limit, seedterm))
I didn't understand your problem well, because from your code it seems to me that you are using the Visualization API (which, by the way, is the first time I have heard of it).
But if you are just looking for a way to fetch data from a web page, you could use urllib2; that is just for getting the data. If you want to parse the retrieved data, you will have to use a more appropriate library like BeautifulSoup.
If you are dealing with another web service (RSS, Atom, RPC) rather than web pages, you can find a bunch of Python libraries that deal with each of those services perfectly.
import urllib2
from BeautifulSoup import BeautifulSoup

result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmltext = result.read()
result.close()

soup = BeautifulSoup(htmltext, convertEntities="html")
# you can parse your data now; check the BeautifulSoup API.