Python - Improve urllib performance

I have a huge dataframe with 1M+ rows, and I want to collect some information from Nominatim via its URL API. I have used the geopy library before, but I ran into some problems, so I decided to call the API directly instead.
However, my code runs far too slowly when making the requests.
Sample URLs:
urls = ['https://nominatim.openstreetmap.org/?addressdetails=1&q=Suape&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Kwangyang&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Luanda&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Ancona&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Xiamen&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Jebel%20Dhanna/Ruwais&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Nemrut%20Bay%20&format=json&limit=1&accept-language=en']
Sample code below, for a single test URL:
import pandas as pd
import urllib.request
import json
url = urls[0]  # testing with a single URL from the sample list above
req = urllib.request.Request(url, None)
f = urllib.request.urlopen(req)
page = f.read()
nominatim = json.loads(page.decode())
result = nominatim[0]['address']['country']
So, I created a function to apply to my dataframe column that holds the location addresses (e.g. Suape, Kwangyang, Luanda...):
def country(address):
    try:
        # URL-encode any spaces in the address (a no-op when there are none),
        # then query Nominatim and pull the country out of the first result.
        address = address.replace(' ', '%20')
        url = f'https://nominatim.openstreetmap.org/?addressdetails=1&q={address}&format=json&limit=1&accept-language=en'
        req = urllib.request.Request(url, None)
        f = urllib.request.urlopen(req)
        page = f.read()
        nominatim = json.loads(page.decode())
        result = nominatim[0]['address']['country']
        return result
    except:
        pass
It takes too long to run. I have tried to optimize my function and tried different ways of applying it to the column, and now I'm trying to improve the requests themselves, because that is the part that takes the most time. Any suggestions? I have tried threading, but it did not work as expected (maybe I did not do it properly). I also tested asyncio, but that code didn't work either.
Thank you!
Edit: I'll only make requests for the unique values of this column, which corresponds to approx. 4K rows. But even with 4K rows, the code takes too much time to run.
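For what it's worth, here is a minimal sketch of the threading approach mentioned above, using concurrent.futures with the country() function as the worker. df and 'address_column' are placeholder names for the dataframe and its address column, and max_workers is a guess; note also that the public Nominatim server has a usage policy of roughly one request per second, so aggressive parallelism may get you blocked rather than sped up:
from concurrent.futures import ThreadPoolExecutor

# Placeholder: the unique addresses taken from the dataframe column.
unique_addresses = df['address_column'].dropna().unique()

# Run the existing country() function in a small thread pool.
# max_workers is arbitrary; keep it low to respect Nominatim's usage policy.
with ThreadPoolExecutor(max_workers=4) as executor:
    countries = list(executor.map(country, unique_addresses))

# Map the results back onto the original column.
lookup = dict(zip(unique_addresses, countries))
df['country'] = df['address_column'].map(lookup)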

Related

Retrieving items in a loop

I'm trying to retrieve JSON data from https://www.fruityvice.com/api/fruit/.
So, I'm creating a function to do that, but it returns only one fruit.
import requests
import json

def scrape_all_fruits():
    for ID in range(1, 10):
        url = f'https://www.fruityvice.com/api/fruit/{ID}'
        response = requests.get(url)
        data = response.json()
        return data

print(scrape_all_fruits())
Can anyone explain to me why?
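If the return really does sit inside the loop, the function exits on the first iteration, which would explain getting a single fruit back. Here is a sketch of a variant that accumulates the results before returning (same endpoint and ID range as above):
import requests

def scrape_all_fruits():
    fruits = []
    for ID in range(1, 10):
        url = f'https://www.fruityvice.com/api/fruit/{ID}'
        response = requests.get(url)
        fruits.append(response.json())  # collect every fruit instead of returning the first one
    return fruits

print(scrape_all_fruits())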

How to get all UniProt results as tsv for query

I'm looking for a programmatic way to get all the UniProt ids and sequences (Swiss-Prot + TrEMBL) for a given protein length, but when I run my query I only get the first 25 results. Is there any way to run a loop to get them all?
My code:
import requests, sys

WEBSITE_API = "https://rest.uniprot.org"

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

r = get_url(f"{WEBSITE_API}/uniprotkb/search?query=length%3A%5B100%20TO%20109%5D&fields=id,accession,length,sequence", headers={"Accept": "text/plain; format=tsv"})

with open("request.txt", "w") as file:
    file.write(r.text)
I ended up with a different URL which seems to work and gives back more/all results. It was taking a long time to return all ~5M, so I've restricted it to just 10,000 for now.
import requests
import pandas as pd
import io
import sys

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

# Create the url
base_url = 'https://www.uniprot.org/uniprot/?query=length:[100%20TO%20109]'
additional_fields = [
    '&format=tab',
    '&limit=10000',  # currently limiting the number of returned results; comment out this line to get all output
    '&columns=id,accession,length,sequence',
]
url = base_url + ''.join(additional_fields)
print(url)

# Get the results as a large pandas table
r = get_url(url)
df = pd.read_csv(io.StringIO(r.text), sep='\t')
df
Output: a dataframe with the requested id, accession, length, and sequence columns.
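Alternatively, if you want to stay on the newer rest.uniprot.org endpoint from the original snippet, that API can also be paged through in a loop. Below is a rough sketch; the size parameter and the rel="next" URL in the Link response header are assumptions about how its pagination works, so verify them against the current documentation:
import requests

WEBSITE_API = "https://rest.uniprot.org"

def get_all_pages(url, **kwargs):
    # Yield one response per page, following the 'next' link the API is assumed to return.
    while url:
        response = requests.get(url, **kwargs)
        response.raise_for_status()
        yield response
        # requests parses the Link header into response.links; stop when there is no next page.
        url = response.links.get("next", {}).get("url")

query_url = (f"{WEBSITE_API}/uniprotkb/search"
             "?query=length%3A%5B100%20TO%20109%5D"
             "&fields=id,accession,length,sequence"
             "&size=500")  # assumed page-size parameter

with open("request.txt", "w") as f:
    for i, page in enumerate(get_all_pages(query_url, headers={"Accept": "text/plain; format=tsv"})):
        text = page.text if page.text.endswith("\n") else page.text + "\n"
        # Each TSV page is assumed to repeat the header row, so keep it only for the first page.
        f.write(text if i == 0 else text.split("\n", 1)[1])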

API Python, many pages as parameters

Good afternoon, everyone.
I have an API that I want to consume, but it has more than one page of information to fetch.
In the example below I'm querying page 1, but I have 50+ pages to query. How can I adapt this to fetch all pages automatically?
I will then save the result in a variable to build a dataframe.
import json, requests

link = "https://some.com.br/api/v1/integration/customers.json"
headers = {'iliot-company-token': '3r5s$ddfdassss'}
Parameters = {"page": 1}

clients2 = requests.get(link, headers=headers, json=Parameters)
lista_clients2 = clients2.json()
print(lista_clients2)
Assuming you have exactly 50 pages, numbered 1-50, this approach might work: define your request as a function, then use map to go through all the pages. This assumes all of the pages use the same header, API token, etc.
import json, requests

link = "https://some.com.br/api/v1/integration/customers.json"
headers = {'iliot-company-token': '3r5s$ddfdassss'}

def get_data(page):
    Parameters = {"page": page}
    clients2 = requests.get(link, headers=headers, json=Parameters)
    return clients2.json()  # return the parsed JSON so it can be collected below

# Get it all at once
all_page_data = list(map(get_data, range(1, 51)))

# If you want to make a dataframe
import pandas as pd
df = pd.DataFrame(all_page_data)

# You can also split out json-formatted data, if it's in a single column
full_df = pd.json_normalize(df[0])
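If the total number of pages is not actually known in advance, a small variant that keeps calling get_data() until the API returns an empty payload may be safer. This is only a sketch; it assumes an empty response marks the last page, which depends on this particular API:
def get_all_pages():
    all_page_data = []
    page = 1
    while True:
        data = get_data(page)  # reuses the get_data() defined above
        if not data:           # assumption: an empty payload means there are no more pages
            break
        all_page_data.append(data)
        page += 1
    return all_page_data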

Increase speed of the code which makes many api calls and stores all data into one dataframe

I wrote some code which takes an identifier number, makes a request to a specific API and returns the data related to that identifier number. The code runs through a dataframe, takes the identifier numbers (approximately 600), retrieves the corresponding information and converts it into a pandas dataframe. Finally, all dataframes are concatenated into one dataframe. The code runs super slowly. Is there any way to make it faster? I am not confident in Python and would appreciate it if you could share a solution in code.
Code:
import json
import requests
import pandas as pd
from pandas import json_normalize

file = dataframe  # the dataframe holding the identifier numbers
list_df = []

for index, row in file.iterrows():
    url = "https://some_api/?eni_number=" + str(row['id']) + "&apikey=some_apikey"
    payload = {}
    headers = {
        'Accept': 'application/json'
    }
    response = requests.request("GET", url, headers=headers, data=payload)
    a = json.loads(response.text)
    df_index = json_normalize(a, 'vessels')
    df_index['eni_number'] = row['id']
    list_df.append(df_index)
    #print(df_index)

total = pd.concat(list_df)
It seems the bottleneck here is that the HTTP requests are executed synchronously, one after the other, so most of the time is wasted waiting for responses from the server.
You may get better results with an asynchronous approach, for example using grequests to execute all the HTTP requests in parallel:
import grequests
import json
import pandas as pd
from pandas import json_normalize

ids = dataframe["id"].to_list()
urls = [f"https://some_api/?eni_number={id}&apikey=some_apikey" for id in ids]
payload = {}
headers = {'Accept': 'application/json'}

requests = (grequests.get(url, headers=headers, data=payload) for url in urls)
responses = grequests.map(requests)  # execute all requests in parallel

list_df = []
for id, response in zip(ids, responses):
    a = json.loads(response.text)
    df_index = json_normalize(a, 'vessels')
    df_index['eni_number'] = id
    list_df.append(df_index)

total = pd.concat(list_df)
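If the API imposes rate limits, grequests.map() also accepts a size argument that caps how many requests run concurrently, e.g. responses = grequests.map(requests, size=20); the right value is something to tune against the API's limits.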

How to parse Etherscan?

How can I parse every single page of ETH addresses from https://etherscan.io/token/generic-tokenholders2?a=0x6425c6be902d692ae2db752b3c268afadb099d3b&s=0&p=1 and then add them to a .txt file?
Okay, possibly off-topic, but I had a play around with this. (Mainly because I thought I might need something similar to grab data in future that Etherscan's APIs don't return...)
The following Python 2 code will grab what you're after. There's a hacky sleep in there to get around what I think is either something to do with how quickly the pages load, or some rate limiting imposed by Etherscan; I'm not sure.
Data gets written to a .csv file - a text file wouldn't be much fun.
#!/usr/bin/env python
from __future__ import print_function

import os
import requests
from bs4 import BeautifulSoup
import csv
import time

RESULTS = "results.csv"
URL = "https://etherscan.io/token/generic-tokenholders2?a=0x6425c6be902d692ae2db752b3c268afadb099d3b&s=0&p="

def getData(sess, page):
    url = URL + page
    print("Retrieving page", page)
    return BeautifulSoup(sess.get(url).text, 'html.parser')

def getPage(sess, page):
    table = getData(sess, str(int(page))).find('table')
    return [[X.text.strip() for X in row.find_all('td')] for row in table.find_all('tr')]

def main():
    resp = requests.get(URL)
    sess = requests.Session()
    with open(RESULTS, 'wb') as f:
        wr = csv.writer(f, quoting=csv.QUOTE_ALL)
        wr.writerow(map(str, "Rank Address Quantity Percentage".split()))
        page = 0
        while True:
            page += 1
            data = getPage(sess, page)
            # Even pages that don't contain the data we're
            # after still contain a table.
            if len(data) < 4:
                break
            else:
                for row in data:
                    wr.writerow(row)
            time.sleep(1)

if __name__ == "__main__":
    main()
I'm sure it's not the best Python in the world.
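(Small aside: to run this under Python 3, the CSV file would need to be opened in text mode, e.g. open(RESULTS, 'w', newline='') instead of open(RESULTS, 'wb'); the rest of the script should work largely unchanged.)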
