python3.6 aiohttp response times and sizes

With Python 3.6 and aiohttp, is there a way to get all of the following stats, which I used to get from pycurl when making a request?
name_lookup_time = curl_handle.getinfo(pycurl.NAMELOOKUP_TIME)
connect_time = curl_handle.getinfo(pycurl.CONNECT_TIME)
app_connect_time = curl_handle.getinfo(pycurl.APPCONNECT_TIME)
pre_transfer_time = curl_handle.getinfo(pycurl.PRETRANSFER_TIME)
start_transfer_time = curl_handle.getinfo(pycurl.STARTTRANSFER_TIME)
total_time = curl_handle.getinfo(pycurl.TOTAL_TIME)
redirect_time = curl_handle.getinfo(pycurl.REDIRECT_TIME)
redirect_cnt = curl_handle.getinfo(pycurl.REDIRECT_COUNT)
total_size = curl_handle.getinfo(pycurl.SIZE_DOWNLOAD)
Currently my crawler uses a pycurl multi implementation to make the requests, and I have code which gathers detailed data for each request. To replace pycurl and learn something new, I am interested in replacing the pycurl multi code with an aiohttp client implementation.
Reading the aiohttp documentation (https://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientResponse), I found that one can get the redirect_cnt by counting the history sequence that is part of the ClientResponse object:
history
A Sequence of ClientResponse objects of preceding requests (earliest request first) if there were redirects, an empty sequence otherwise.
I could do without most of the detailed time data, but at a minimum I would like to gather total_time and total_size.
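A minimal sketch of how those two could be gathered, assuming it is acceptable to measure wall-clock time around the whole request and take the size of the downloaded body (the function name and URL below are placeholders, not part of the original code):
import asyncio
import time

import aiohttp

async def fetch_with_stats(session, url):
    started = time.monotonic()
    async with session.get(url) as resp:
        body = await resp.read()
    total_time = time.monotonic() - started   # rough stand-in for pycurl.TOTAL_TIME
    total_size = len(body)                    # rough stand-in for pycurl.SIZE_DOWNLOAD
    redirect_cnt = len(resp.history)          # redirects followed for this request
    return total_time, total_size, redirect_cnt

async def main():
    async with aiohttp.ClientSession() as session:
        print(await fetch_with_stats(session, 'https://example.com/'))

# Python 3.6 has no asyncio.run(), so drive the loop directly
asyncio.get_event_loop().run_until_complete(main())
Recent aiohttp releases also expose a TraceConfig hook system (for example on_dns_resolvehost_start/end and on_connection_create_start/end) that can recover finer-grained timings such as name lookup and connect time, if the rough wall-clock approach is not enough.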

Related

Optimal sending of tens of thousands of GET requests using Python requests library

I have a script that uses a loop to append a numerical ID as a token to an API endpoint, as in the following example.
The API response is then appended to a list and transformed into a dataframe upon which I can perform transformations as required.
import pandas as pd
import json
import requests

tokens_list = ['6819100647', '4444297007', '3136364694', '631018055', '6275646907', '2835497899',
               '5584897496', '1271294249', '5292810357', '9101862265', '220608001', '1546117949', '3672368984',
               '8893786688', '8681209578', '134973384', '2947578755', '6271362195', '1070349738', '747443183', ...]

outputs_list = []
for i in tokens_list:
    url = "https://my_api_endpoint/" + str(i)
    response = requests.get(url, params=None, headers=None)
    response_json = json.loads(response.text)
    response_json_data = pd.json_normalize(response_json['data'])
    outputs_list.append(response_json_data)

outputs_df = pd.concat(outputs_list)
This works as intended for many hundreds of tokens/loop iterations; however, there may potentially be 50k+ tokens, and hence 50k+ loop iterations/calls.
How might I improve the script so that I can make tens of thousands of sequential requests without running into timeouts or other such problems?
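One common approach (a sketch, not a drop-in fix for this particular API): reuse a single requests.Session so connections are pooled, set an explicit timeout, and let urllib3's Retry policy handle transient failures and rate limiting. The endpoint and tokens_list below are the hypothetical ones from the question.
import pandas as pd
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

outputs_list = []
for token in tokens_list:
    # Explicit timeout so a hung connection cannot stall the whole run
    response = session.get("https://my_api_endpoint/" + str(token), timeout=30)
    response.raise_for_status()
    outputs_list.append(pd.json_normalize(response.json()['data']))

outputs_df = pd.concat(outputs_list)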

Python requests.post code limits results to 200. How to adjust code to get all 1000+ results?

I can't seem to figure out where to adjust any data limit in the requests.post statement, but there appears to be a 200-result limit in the output. I'm not very comfortable writing advanced requests.post scripts and could use some help with this issue.
import requests

r = requests.post('https://cares.myflfamilies.com/PublicSearch/Search?dcfSearchBox=Orange in Counties')
with open('flData', 'wb') as openFile:
    for chunk in r.iter_content(100000):
        openFile.write(chunk)

Killing a Sub Process while using Multiprocessing

I need to shorten multiple URLs using the Google Shortener API. Since each shortening call is independent of the others, I decided to use the multiprocessing library in Python.
raw_data is a data frame which contains my long URLs. Api_Key_List1 is a list of API keys from multiple Google accounts (as I don't want to hit the daily API usage limit for any one of them).
Here is the underlying code.
import json
import math
import time
from multiprocessing import Pool

import pandas as pd
import requests

# raw_data is an existing data frame that holds the long URLs
raw_data["Shortened"] = "xx"
Api_Key_List1 = ['Key1',
                 'Key2',
                 'Key3']
raw_data = raw_data[0:3]
LongUrl = raw_data['SMS Click traker'].tolist()
args = [[LongUrl[index], Api_Key_List1[index]] for index, value in enumerate(LongUrl)]
print("xxxx")

def goo_shorten_url(url, key):
    post_url = 'https://www.googleapis.com/urlshortener/v1/url?key=%s' % key
    payload = {'longUrl': url}
    headers = {'content-type': 'application/json'}
    time.sleep(2)
    r = requests.post(post_url, data=json.dumps(payload), headers=headers)
    return json.loads(r.text)["id"]

def helper(args):
    return goo_shorten_url(*args)

if __name__ == '__main__':
    p = Pool(processes=3)
    data = p.map(helper, args)
    print("main")
    raw_data["Shortened"] = data
    p.close()
    print(raw_data)
Using this code I can successfully shorten multiple URLs in one go, saving a lot of time. Here are some questions I am having difficulty finding answers to (see the sketch after this list for one way to handle the timeout and identification issues):
What is the optimal number of sub-processes I should be running? (I am using a simple quad-core, 8 GB RAM laptop running Windows 7.)
Sometimes, due to some issue, the API does not return a shortened URL and the code gets stuck forever. This was my experience when I was not using multiprocessing. What happens to the process in such a case?
Assuming a subprocess gets stuck on a long list of URLs (70,000), how do I kill it while still ensuring the code safely returns all the remaining shortened URLs, properly matched in the data frame?
How can I identify the URLs which were not shortened?
I am a newbie to Python programming, so bear with me. This will help me better understand the inner workings of multiprocessing.
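A minimal sketch of one way to handle the last three points, assuming a hard per-request timeout plus map_async with a batch timeout is acceptable (goo_shorten_url_safe and helper_safe are hypothetical names; args and raw_data come from the code above). Failed or timed-out calls return None instead of hanging, and the rows still holding None afterwards are the URLs that were not shortened.
import json
from multiprocessing import Pool, TimeoutError

import requests

def goo_shorten_url_safe(url, key):
    """Return the shortened URL, or None if the API call fails or times out."""
    post_url = 'https://www.googleapis.com/urlshortener/v1/url?key=%s' % key
    try:
        r = requests.post(post_url,
                          data=json.dumps({'longUrl': url}),
                          headers={'content-type': 'application/json'},
                          timeout=10)  # hard per-request timeout instead of blocking forever
        return json.loads(r.text)["id"]
    except (requests.RequestException, KeyError, ValueError):
        return None

def helper_safe(pair):
    return goo_shorten_url_safe(*pair)

if __name__ == '__main__':
    p = Pool(processes=3)
    try:
        # map_async + get(timeout=...) lets the parent give up on a stuck batch
        data = p.map_async(helper_safe, args).get(timeout=600)
    except TimeoutError:
        p.terminate()  # kill the workers if the whole batch overruns
        raise
    finally:
        p.close()
        p.join()
    raw_data["Shortened"] = data
    not_shortened = raw_data[raw_data["Shortened"].isnull()]  # URLs that were not shortened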

How to configure and run Solr full dataimport from MySQL using Python?

I need to perform a full import or delta import programmatically using Python and MySQL. I am aware of how to do this in Java; it can be done in the following way:
CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "full-import");
QueryRequest request = new QueryRequest(params);
request.setPath("/dataimport");
server.request(request);
I am trying to implement it in Python. Can you suggest the equivalent Python code, or any Solr Python API that supports this?
You trigger the DataImportHandler by making a single HTTP request, and the Java example is just a way to do that using the SolrJ package.
In native Python 3 you can do this by using urllib.request:
import urllib.request
urllib.request.urlopen('http://localhost:8983/solr/collection/dataimport?command=full-import')
In Python 2 the same function is available under urllib2:
import urllib2
urllib2.urlopen('http://localhost:8983/solr/collection/dataimport?command=full-import')
Or if you're using the requests library (which can be installed through pip install requests):
import requests
requests.get('http://localhost:8983/solr/collection/dataimport?command=full-import')
There are a few Python APIs, but I use mysolr (http://mysolr.readthedocs.io/en/latest/user/userguide.html) because it lets you index with JSON, which makes it faster.
from mysolr import Solr

solr = Solr("http://localhost:8983/solr/collection", version=4)

## For a full reindex, delete all existing data first (the delete takes effect on the final commit):
solr.delete_by_query('*:*', commit=False)

documents = [
    {'id': 1,
     'field1': 'foo'
    },
    {'id': 2,
     'field1': 'bar'
    }
]
solr.update(documents, 'json', commit=False)
solr.commit()
You can query, say, 1000 records at a time from MySQL, build a list of them ("documents" above), and send them to the Solr index. Then, when finished, do the commit. If it's a full reindex, you can clear all the existing data without committing, and the old data will be deleted once you do the final commit.
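As a small follow-up sketch (the base URL and collection name are assumptions, as above): the same /dataimport handler also accepts the delta-import command, and its progress can be polled with command=status, so a Python script can trigger the import and wait for it to finish.
import time

import requests

base = 'http://localhost:8983/solr/collection/dataimport'

# Trigger an incremental import instead of a full one
requests.get(base, params={'command': 'delta-import'})

# Poll the handler until it reports it is idle again
while True:
    status = requests.get(base, params={'command': 'status', 'wt': 'json'}).json()
    if status.get('status') == 'idle':
        break
    time.sleep(5)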

Retrieve image from method called by URL in python

I'm trying to retrieve an image that is returned from a given URL using Python, for example this one:
http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108
I am trying to do this using urllib's urlretrieve method:
import urllib
urlStr = "http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108"
filename = "image.png"
urllib.urlretrieve(urlStr,filename)
I have already used this for other URLs (such as http://chart.finance.yahoo.com/z?s=CMIG4.SA&t=9m), but for the first one it's not working.
Does anyone have an idea about how to make this for the given URL?
Note: I'm using Python 2.7
You need to use a session, which you can do with requests:
import requests

with requests.Session() as s:
    s.get("http://fundamentus.com.br/graficos.php?papel=CMIG4&tipo=2")
    with open("out.png", "wb") as f:
        f.write(s.get("http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108").content)
It works in your browser because you had already visited the initial page where the image appears, so any necessary cookies were set.
While more verbose than @PadraicCunningham's response, this should also do the trick. I'd run into a similar problem (the host would only support certain browsers), so I had to start using urllib2 instead of just urllib. It's pretty powerful and is a module which comes with Python.
Basically, you capture all the information you need during your initial request and add it to your next request and subsequent requests. The requests module seems to do pretty much all of this for you behind the scenes. If only I'd known about that all these years...
import urllib2

urlForCookie = 'http://fundamentus.com.br/graficos.php?papel=CMIG4&tipo=2'
urlForImage = 'http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108'

# First request: capture the cookie the site sets
initialRequest = urllib2.Request(urlForCookie)
siteCookie = urllib2.urlopen(initialRequest).headers.get('Set-Cookie')

# Second request: send that cookie back when asking for the image
imageReq = urllib2.Request(urlForImage)
imageReq.add_header('cookie', siteCookie)
with open("image2.png", 'wb') as f:
    f.write(urllib2.urlopen(imageReq).read())
