I have a link, e.g. www.someurl.com/api/getdata?password=..., and when I open it in a web browser it sends a constantly updating document of text. I'd like to make an identical connection in Python, and dump this data to a file live as it's received. I've tried using requests.Session(), but since the stream of data never ends (and dropping it would lose data), the get request also never ends.
import requests
s = requests.Session()
x = s.get("https://www.someurl.com/api/getdata?password=...") # never terminates
What's the proper way to do this?
I found the answer I was looking for here: Python Requests Stream Data from API
Full implementation:
import requests

url = "https://www.someurl.com/api/getdata?password=..."
s = requests.Session()

with open('file.txt', 'a') as fp:
    with s.get(url, stream=True) as resp:
        for line in resp.iter_lines(chunk_size=1):
            fp.write(line.decode() + '\n')  # iter_lines yields bytes; decode and re-add the newline
Note that chunk_size=1 is necessary for lines to be written as soon as a complete message arrives, rather than waiting for an internal buffer to fill before iterating over the lines. chunk_size=None is supposed to behave this way, but it didn't work for me.
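If you also want each line to land on disk as soon as it arrives (not just be handed to Python promptly), a minimal variation of the code above is to flush the file after every write. This is only a sketch building on the same placeholder URL:

import requests

url = "https://www.someurl.com/api/getdata?password=..."

with open('file.txt', 'a') as fp, requests.get(url, stream=True) as resp:
    for line in resp.iter_lines(chunk_size=1):
        fp.write(line.decode() + '\n')
        fp.flush()  # push each line to disk immediately instead of waiting for the file buffer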
You can keep making GET requests to the URL:

import requests
import time

url = "https://www.someurl.com/api/getdata?password=..."
sess = requests.Session()

while True:
    req = sess.get(url)
    time.sleep(10)
The following will terminate the request after 1 second:
import multiprocessing
import time
import requests

def get_from_url(url, queue):
    # Runs in a child process; push whatever we get back to the parent via the queue,
    # since a plain global is not shared between processes.
    s = requests.Session()
    resp = s.get(url)
    queue.put(resp.text)

if __name__ == '__main__':
    url = "https://www.someurl.com/api/getdata?password=..."
    queue = multiprocessing.Queue()
    while True:
        p = multiprocessing.Process(target=get_from_url, name="get_from_url", args=(url, queue))
        p.start()
        # Wait 1 second for the get request
        time.sleep(1)
        p.terminate()
        p.join()
        # do something with the data
        while not queue.empty():
            print(queue.get())  # or something else
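As an aside, if the goal is only to read from the never-ending stream for a bounded amount of time without losing what has already arrived, a hedged alternative to killing a process is to keep the streaming response in the same process and stop iterating once a deadline passes (note that the deadline is only checked when a new line arrives). A sketch, using the same placeholder URL:

import time
import requests

url = "https://www.someurl.com/api/getdata?password=..."
deadline = time.time() + 1  # read for roughly one second

lines = []
with requests.get(url, stream=True) as resp:
    for line in resp.iter_lines(chunk_size=1):
        lines.append(line.decode())
        if time.time() > deadline:
            break  # stop reading; everything received so far is kept in `lines`

print(lines)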
I have been using the following script to grab the PIN from our router. The PIN is changed often, so I decided to use a script, which is easier than having to access the router from the browser.
The script is as follows:
from requests.auth import HTTPBasicAuth
import requests
import re
while True:
    try:
        response = requests.get('http://192.168.2.1/settings.html',
                                auth=HTTPBasicAuth('username', 'password'))
        html = response.content
        m = re.findall(rb'var routerpin\s+=\s+(.*)', html)
        break
    except:
        m = None
print(m)
The trouble I am having is that the first time the script is run, the variable 'm' is an empty list. No exception is raised, so I thought wrapping the request in a try/except loop and treating None or an empty list as the failure case would make it work, but it doesn't.
On the first run the script returns m = [].
After that, the script returns the correct data. I know this is down to the first run not authenticating with the router, but I'm not sure how to handle it so the script effectively runs twice and grabs the data.
Probably a really simple answer but any help much appreciated.
Try using a session object to manage your authenticated session:
s = requests.Session()
s.auth = ('username', 'password')
auth = s.post('http://192.168.2.1')
response = s.get('http://192.168.2.1/settings.html')
html = response.content
# etc
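A minimal sketch of how this could slot into the original script, assuming the credentials, URL, and regex from the question (the extra POST to the root URL is kept from the answer above, though whether it is actually needed depends on the router):

import re
import requests

s = requests.Session()
s.auth = ('username', 'password')
s.post('http://192.168.2.1')  # prime the authenticated session, as suggested above

m = []
for _ in range(3):  # retry a couple of times in case the first response is not authenticated yet
    response = s.get('http://192.168.2.1/settings.html')
    m = re.findall(rb'var routerpin\s+=\s+(.*)', response.content)
    if m:
        break

print(m)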
I'm new to Python multiprocessing. I don't quite understand the difference between Pool and Process. Can someone suggest which one I should use for my needs?
I have thousands of HTTP GET requests to send. After sending each one and getting the response, I want to store the response (a simple int) in a (shared) dict. My final goal is to write all the data in the dict to a file.
This is not CPU intensive at all. My only goal is to speed up sending the HTTP GET requests because there are so many. The requests are all isolated and do not depend on each other.
Shall I use Pool or Process in this case?
Thanks!
----The code below is added on 8/28---
I programmed with multiprocessing. The key challenges I'm facing are:
1) A GET request can fail sometimes, so I set 3 retries to minimize the need to rerun my code / all requests. I only want to retry the ones that failed. Can I achieve this with async HTTP requests without using Pool?
2) I want to check the response value of every request, and have exception handling.
The code below is simplified from my actual code. It is working fine, but I wonder if it's the most efficient way of doing things. Can anyone give any suggestions? Thanks a lot!
def get_data(endpoint, get_params):
    response = requests.get(endpoint, params=get_params)
    if response.status_code != 200:
        raise Exception("bad response for " + str(get_params))
    return response.json()

def get_currency_data(endpoint, currency, date):
    get_params = {'currency': currency,
                  'date': date}
    for attempt in range(3):
        try:
            output = get_data(endpoint, get_params)
            # additional return value check
            # ......
            return output['value']
        except:
            time.sleep(1)  # I found that sleeping for 1s almost always makes the retry succeed
    return 'error'

def get_all_data(currencies, dates):
    # I have many dates, but not too many currencies
    for currency in currencies:
        results = []
        pool = Pool(processes=20)
        for date in dates:
            results.append(pool.apply_async(get_currency_data, args=(endpoint, currency, date)))
        output = [p.get() for p in results]
        pool.close()
        pool.join()
        time.sleep(10)  # Unfortunately I have to give the server some time to rest. I found it helps
                        # to reduce failures. I didn't write the server; this is not something I can control.
Neither. Use asynchronous programming. Consider the code below, pulled directly from that article (credit goes to Paweł Miech):
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def run(r):
    url = "http://localhost:8080/{}"
    tasks = []

    # Fetch all responses within one Client session,
    # keep connection alive for all requests.
    async with ClientSession() as session:
        for i in range(r):
            task = asyncio.ensure_future(fetch(url.format(i), session))
            tasks.append(task)

        responses = await asyncio.gather(*tasks)
        # you now have all response bodies in this variable
        print(responses)

def print_responses(result):
    print(result)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(4))
loop.run_until_complete(future)
Just create an array of URLs, and instead of the code above, loop over that array and issue a fetch for each one.
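Since the question also asks about retrying only the failed requests (point 1 in the edit above), here is a hedged sketch of how that could look with aiohttp. The URL list, retry count, and 1-second back-off are placeholders based on the question, not a tested drop-in:

import asyncio
from aiohttp import ClientSession

async def fetch_with_retries(url, session, retries=3):
    # Retry just this URL a few times; other URLs are unaffected.
    for attempt in range(retries):
        try:
            async with session.get(url) as response:
                if response.status != 200:
                    raise Exception("bad response for " + url)
                return await response.json()
        except Exception:
            await asyncio.sleep(1)  # brief back-off before retrying, as in the question
    return 'error'

async def run(urls):
    async with ClientSession() as session:
        tasks = [fetch_with_retries(url, session) for url in urls]
        return await asyncio.gather(*tasks)

urls = ["http://localhost:8080/{}".format(i) for i in range(4)]
results = asyncio.get_event_loop().run_until_complete(run(urls))
print(results)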
EDIT: Use requests_futures
As per @roganjosh's comment below, requests_futures is a super-easy way to accomplish this.
from requests_futures.sessions import FuturesSession
sess = FuturesSession()
urls = ['http://google.com', 'https://stackoverflow.com']
responses = {url: sess.get(url) for url in urls}
contents = {url: future.result().content
for url, future in responses.items()
if future.result().status_code == 200}
EDIT: Use grequests to support Python 2.7
You can also use grequests, which supports Python 2.7 for performing asynchronous URL calling.
import grequests
urls = ['http://google.com', 'http://stackoverflow.com']
responses = grequests.map(grequests.get(u) for u in urls)
print([len(r.content) for r in responses])
# [10475, 250785]
EDIT: Using multiprocessing
If you want to do this using multiprocessing, you can. Disclaimer: You're going to have a ton of overhead by doing this, and it won't be anywhere near as efficient as async programming... but it is possible.
It's actually pretty straightforward: you're mapping the URLs through the HTTP GET function:
import requests
urls = ['http://google.com', 'http://stackoverflow.com']
from multiprocessing import Pool
pool = Pool(8)
responses = pool.map(requests.get, urls)
The size of the pool will be the number of simultaneously issued GET requests. Sizing it up should increase your network efficiency, but it will add overhead on the local machine for communication and forking.
Again, I don't recommend this, but it certainly is possible, and if you have enough cores it's probably faster than doing the calls synchronously.
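Since the workload here is I/O-bound rather than CPU-bound, another hedged option is multiprocessing.dummy, which exposes the same Pool API backed by threads and so avoids the forking overhead. A minimal sketch, assuming the same two example URLs:

import requests
from multiprocessing.dummy import Pool  # same Pool API, but backed by threads

urls = ['http://google.com', 'http://stackoverflow.com']

pool = Pool(8)
responses = pool.map(requests.get, urls)  # threads are fine here because the work is network-bound
pool.close()
pool.join()

print([r.status_code for r in responses])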
I need to scrape roughly 30GB of JSON data from a website API as quickly as possible. I don't need to parse it -- I just need to save everything that shows up on each API URL.
I can request quite a bit of data at a time -- say 1MB or even 50MB 'chunks' (API parameters are encoded in the URL and allow me to select how much data I want per request)
The API places a limit of 1 request per second.
I would like to accomplish this on a laptop with a 100MB/sec internet connection.
Currently, I am accomplishing this (synchronously & too slowly) by:
- pre-computing all of the (encoded) URLs I want to scrape
- using Python 3's requests library to request each URL and save the resulting JSON one-by-one in separate .txt files
Basically, my synchronous, too-slow solution looks like this (simplified slightly):
# for each pre-computed encoded URL do:
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'w') as outfile:
        json.dump(curr_url_request.json(), outfile)
What would be a better/faster way to do this? Is there a straightforward way to accomplish this asynchronously while respecting the 1-request-per-second threshold? I have read about grequests (no longer maintained?), Twisted, asyncio, etc., but do not have enough experience to know whether one of these is the right way to go.
EDIT
Based on Kardaj's reply below, I decided to give async Tornado a try. Here's my current Tornado version (which is heavily based on one of the examples in their docs). It successfully limits concurrency.
The hangup is: how can I enforce an overall rate limit of 1 request per second globally across all workers? (Kardaj, the async sleep makes a worker sleep before working, but does not check whether other workers 'wake up' and request at the same time. When I tested it, all workers grabbed a page, breaking the rate limit, and then went to sleep simultaneously.)
from datetime import datetime
from datetime import timedelta
from tornado import httpclient, gen, ioloop, queues

URLS = ["https://baconipsum.com/api/?type=meat",
        "https://baconipsum.com/api/?type=filler",
        "https://baconipsum.com/api/?type=meat-and-filler",
        "https://baconipsum.com/api/?type=all-meat&paras=2&start-with-lorem=1"]
concurrency = 2

def handle_request(response):
    if response.code == 200:
        with open("FOO" + '.txt', "wb") as thisfile:  # fix filenames to avoid overwrite
            thisfile.write(response.body)

@gen.coroutine
def request_and_save_url(url):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(url, handle_request)
        print('fetched {0}'.format(url))
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])

@gen.coroutine
def main():
    q = queues.Queue()
    tstart = datetime.now()
    fetching, fetched = set(), set()

    @gen.coroutine
    def fetch_url(worker_id):
        current_url = yield q.get()
        try:
            if current_url in fetching:
                return
            # print('fetching {0}'.format(current_url))
            print("Worker {0} starting, elapsed is {1}".format(worker_id, (datetime.now() - tstart).seconds))
            fetching.add(current_url)
            yield request_and_save_url(current_url)
            fetched.add(current_url)
        finally:
            q.task_done()

    @gen.coroutine
    def worker(worker_id):
        while True:
            yield fetch_url(worker_id)

    # Fill a queue of URLs to scrape
    list = [q.put(url) for url in URLS]  # this does not make a list... it just puts all the URLs into the Queue

    # Start workers, then wait for the work Queue to be empty.
    for ii in range(concurrency):
        worker(ii)
    yield q.join(timeout=timedelta(seconds=300))
    assert fetching == fetched
    print('Done in {0} seconds, fetched {1} URLs.'.format(
        datetime.now() - tstart, len(fetched)))

if __name__ == '__main__':
    import logging
    logging.basicConfig()
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)
You are parsing the content and then serializing it again. You can just write the content directly to a file.
curr_url_request = requests.get(encoded_URL_i, timeout=timeout_secs)
if curr_url_request.ok:
    with open('json_output.txt', 'wb') as outfile:  # 'wb' because .content is bytes
        outfile.write(curr_url_request.content)
That probably removes most of the processing overhead.
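For the 50MB chunks mentioned in the question, it may also be worth streaming the body to disk instead of holding it all in memory. A hedged sketch (encoded_URL_i and timeout_secs are the placeholders from the question):

import requests

with requests.get(encoded_URL_i, timeout=timeout_secs, stream=True) as resp:
    if resp.ok:
        with open('json_output.txt', 'wb') as outfile:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                outfile.write(chunk)  # write each 1MB chunk as it arrives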
Tornado has a very powerful asynchronous client. Here's some basic code that may do the trick:
from tornado.httpclient import AsyncHTTPClient
import tornado.gen
import tornado.ioloop

URLS = []
http_client = AsyncHTTPClient()
loop = tornado.ioloop.IOLoop.current()

def handle_request(response):
    if response.code == 200:
        with open('json_output.txt', 'ab') as outfile:  # 'ab' because response.body is bytes
            outfile.write(response.body)

@tornado.gen.coroutine
def queue_requests():
    results = []
    for url in URLS:
        nxt = tornado.gen.sleep(1)  # 1 request per second
        res = http_client.fetch(url, handle_request)
        results.append(res)
        yield nxt
    yield results  # wait for all requests to finish
    loop.add_callback(loop.stop)

loop.add_callback(queue_requests)
loop.start()
This is a straightforward approach that may lead to too many simultaneous connections to the remote server. You may have to resolve that problem using a sliding window while queuing the requests.
If you need request timeouts or specific headers, feel free to read the docs.
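To address the follow-up question about enforcing the 1-request-per-second limit globally across all workers, one hedged option (a sketch, not a tested drop-in for the code above) is to serialize only the start of each request behind a shared tornado.locks.Lock, so requests can still run concurrently but never start less than a second apart:

from tornado import gen, httpclient, locks

rate_lock = locks.Lock()  # shared by all workers

@gen.coroutine
def rate_limited_fetch(url):
    # Only one worker at a time may pass this block, and it holds the lock
    # for a full second, so request starts are spaced at least 1 second apart.
    with (yield rate_lock.acquire()):
        yield gen.sleep(1)
    # The request itself runs outside the lock, so fetches can still overlap.
    response = yield httpclient.AsyncHTTPClient().fetch(url)
    raise gen.Return(response)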
I wrote a simple server and it runs well.
Now I want to write some code that makes many POST requests to my server simultaneously, to simulate a pressure test. I'm using Python.
Suppose the URL of my server is http://myserver.com.
file1.jpg and file2.jpg are the files that need to be uploaded to the server.
Here is my testing code. I use threading and urllib2.
async_posts.py
from Queue import Queue
from threading import Thread
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2, sys, time

num_thread = 4
queue = Queue(2 * num_thread)

def make_post(url):
    register_openers()
    data = {"file1": open("path/to/file1.jpg", "rb"), "file2": open("path/to/file2.jpg", "rb")}
    datagen, headers = multipart_encode(data)
    request = urllib2.Request(url, datagen, headers)

    start = time.time()
    res = urllib2.urlopen(request)
    end = time.time()

    return res.code, end - start  # Return the status code and duration of this request.

def daemon():
    while True:
        url = queue.get()
        status, duration = make_post(url)
        print status, duration
        queue.task_done()

for _ in range(num_thread):
    thd = Thread(target=daemon)
    thd.daemon = True
    thd.start()

try:
    urls = ["http://myserver.com"] * num_thread
    for url in urls:
        queue.put(url)
    queue.join()
except KeyboardInterrupt:
    sys.exit(1)
When num_thread is small (e.g. 4), my code runs smoothly. But when I switch num_thread to a slightly larger number, say 10, the threads break down and keep throwing httplib.BadStatusLine errors.
I don't know why my code goes wrong; is there perhaps a better way to do this?
As a reference, my server is written in Python using Flask and Gunicorn.
Thanks in advance.
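If moving to Python 3 and the requests library is an option, here is a hedged sketch of the same pressure test using a thread pool, with requests handling the multipart encoding that poster was used for above (the URL and file paths are the question's placeholders):

import time
import requests
from concurrent.futures import ThreadPoolExecutor

def make_post(url):
    # requests builds the multipart/form-data body from the files dict.
    with open("path/to/file1.jpg", "rb") as f1, open("path/to/file2.jpg", "rb") as f2:
        start = time.time()
        res = requests.post(url, files={"file1": f1, "file2": f2})
        return res.status_code, time.time() - start

urls = ["http://myserver.com"] * 10

with ThreadPoolExecutor(max_workers=10) as pool:
    for status, duration in pool.map(make_post, urls):
        print(status, duration)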
What is the best way to repeatedly request data from a server in Python? I've tried urllib3, but for some reason the Python script stops after a while. I am also trying urllib2 (see the code below), but I notice there's sometimes a huge delay (which did not happen as frequently with urllib3) and the responses don't arrive every 0.5 seconds (sometimes it's every 6 seconds). What can I do to solve this?
import socket
import urllib2
import time

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

while True:
    try:
        # this call to urllib2.urlopen now uses the default timeout
        # we have set in the socket module
        req = urllib2.Request('https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week')
        response = urllib2.urlopen(req)
        r = response.read()
        req2 = urllib2.Request('http://market.bitvc.com/futures/ticker_btc_week.js')
        response2 = urllib2.urlopen(req2)
        r2 = response2.read()
    except:
        continue
    print r + str(time.time())
    print r2 + str(time.time())
    time.sleep(0.5)
I think I found the problem: I needed to keep an open HTTP session, so that I get the data more consistently. What's the best way of doing this? I did http = requests.Session() and I'm using requests now.
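A minimal sketch of that approach, reusing one keep-alive requests.Session for both endpoints (the URLs and the 0.5-second interval come from the question; the 10-second timeout is carried over from the urllib2 version, and the Python 3 print syntax is an assumption):

import time
import requests

session = requests.Session()  # keeps the underlying TCP connections open between requests
urls = [
    'https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week',
    'http://market.bitvc.com/futures/ticker_btc_week.js',
]

while True:
    for url in urls:
        try:
            resp = session.get(url, timeout=10)
            print(resp.text, time.time())
        except requests.RequestException:
            continue  # skip this endpoint on a network error and move on
    time.sleep(0.5)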