How to get all UniProt results as tsv for query - python

I'm looking for a programmatic way to get all the UniProt IDs and sequences (Swiss-Prot + TrEMBL) for a given protein length, but when I run my query I only get the first 25 results. Is there a way to run a loop to get them all?
My code:
import requests, sys

WEBSITE_API = "https://rest.uniprot.org"

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

r = get_url(f"{WEBSITE_API}/uniprotkb/search?query=length%3A%5B100%20TO%20109%5D&fields=id,accession,length,sequence", headers={"Accept": "text/plain; format=tsv"})

with open("request.txt", "w") as file:
    file.write(r.text)

I ended up with a different URL (note that this is the legacy uniprot.org API rather than rest.uniprot.org) which seems to return more/all results. It was taking a long time to return all ~5M, so I've restricted it to just 10,000 for now:
import requests
import sys
import pandas as pd
import io

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

# Create the url
base_url = 'https://www.uniprot.org/uniprot/?query=length:[100%20TO%20109]'
additional_fields = [
    '&format=tab',
    '&limit=10000',  # currently limiting the number of returned results; comment out this line to get all output
    '&columns=id,accession,length,sequence',
]
url = base_url + ''.join(additional_fields)
print(url)

# Get the results as a large pandas table
r = get_url(url)
df = pd.read_csv(io.StringIO(r.text), sep='\t')
df
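For completeness: the rest.uniprot.org endpoint from the first snippet can also return everything, via cursor-based pagination — each response carries a Link header whose rel="next" URL points at the next page, and the size parameter raises the page size. A minimal sketch, assuming the current UniProt REST behaviour; the helper functions below are my own, not part of any library:

```python
import re
import requests

def get_next_link(link_header):
    # Extract the rel="next" URL from a Link header string, or None.
    if link_header:
        match = re.match(r'<(.+)>; rel="next"', link_header)
        if match:
            return match.group(1)
    return None

def fetch_all_pages(url):
    # Yield the TSV body of each page until there is no next link.
    while url:
        response = requests.get(url, headers={"Accept": "text/plain; format=tsv"})
        response.raise_for_status()
        yield response.text
        url = get_next_link(response.headers.get("Link"))

# Example usage (network call, so commented out):
# first = ("https://rest.uniprot.org/uniprotkb/search"
#          "?query=length%3A%5B100%20TO%20109%5D"
#          "&fields=id,accession,length,sequence&size=500")
# with open("request.tsv", "w") as f:
#     for page in fetch_all_pages(first):
#         f.write(page)
```

requests also pre-parses the Link header for you, so `response.links.get("next", {}).get("url")` is an equivalent alternative to the manual regex.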

Retrieving items in a loop

I'm trying to retrieve JSON data from https://www.fruityvice.com/api/fruit/.
I created a function to do that, but as a return I get only one fruit.
import requests
import json

def scrape_all_fruits():
    for ID in range(1, 10):
        url = f'https://www.fruityvice.com/api/fruit/{ID}'
        response = requests.get(url)
        data = response.json()
        return data
print(scrape_all_fruits())
Can anyone explain to me why?
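The reason is that `return` exits the function on the very first pass through the loop, so only the fruit with ID 1 ever comes back. A fixed sketch — the `fetch` parameter is only there so the function can be exercised without hitting the network; by default it performs the real requests.get:

```python
import requests

def scrape_all_fruits(fetch=lambda url: requests.get(url).json()):
    # Accumulate every result; returning inside the loop would stop after ID 1.
    all_fruits = []
    for fruit_id in range(1, 10):
        all_fruits.append(fetch(f'https://www.fruityvice.com/api/fruit/{fruit_id}'))
    return all_fruits  # return once, after the loop has finished
```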

Python - Improve urllib performance

I have a huge dataframe with 1M+ rows and I want to collect some information from Nominatim via its URL API. I used the geopy library before, but I had some problems, so I decided to use the API directly.
However, my code runs too slowly to get through the requests.
URLs sample:
urls = ['https://nominatim.openstreetmap.org/?addressdetails=1&q=Suape&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Kwangyang&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Luanda&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Ancona&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Xiamen&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Jebel%20Dhanna/Ruwais&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Nemrut%20Bay%20&format=json&limit=1&accept-language=en']
Sample code below
For a single url for test:
import pandas as pd
import urllib.request
import json
req = urllib.request.Request(url, None)
f = urllib.request.urlopen(req)
page = f.read()
nominatim = json.loads(page.decode())
result = nominatim[0]['address']['country']
So, I created a function to apply in my dataframe column that have location address (eg. Suape, Kwangyang, Luanda...):
def country(address):
    try:
        if ' ' in address:
            address = address.replace(' ', '%20')
        url = f'https://nominatim.openstreetmap.org/?addressdetails=1&q={address}&format=json&limit=1&accept-language=en'
        req = urllib.request.Request(url, None)
        f = urllib.request.urlopen(req)
        page = f.read()
        nominatim = json.loads(page.decode())
        result = nominatim[0]['address']['country']
        return result
    except:
        pass
It takes too long to run. I have tried to optimize my function and tried different approaches to applying it to the column, and now I'm trying to improve the requests, because that is the part that takes the most time. Any suggestions? I have tried threading, but it didn't work as expected (maybe I didn't do it properly). I tested asyncio as well, but that code didn't work either.
Thank you!
Edit: I'll only run the requests for the unique values of this column, which corresponds to approx. 4K rows. But even with 4K rows, the code takes too much time to run.
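One stdlib-only option is to run the lookups on a small thread pool, and to use urllib.parse.quote instead of the manual space replacement (which misses other unsafe characters). Be aware that Nominatim's usage policy limits request rates on the public endpoint, so keep the pool small or add throttling. A sketch — the `lookup` parameter of `countries_for` exists only so it can be tested without the network:

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def country(address):
    # Look up the country of one address via Nominatim; None on any failure.
    query = urllib.parse.quote(address)  # escapes spaces and other unsafe chars
    url = ('https://nominatim.openstreetmap.org/?addressdetails=1'
           f'&q={query}&format=json&limit=1&accept-language=en')
    try:
        with urllib.request.urlopen(url) as f:
            data = json.loads(f.read().decode())
        return data[0]['address']['country']
    except (IndexError, KeyError, OSError):
        return None

def countries_for(addresses, lookup=country, max_workers=2):
    # Run the lookups concurrently, preserving input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup, addresses))
```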

How to create multiple data frames from a list of html data?

I have the following code to pull data from a website that has multiple pages and then turn the HTML tables into a dataframe. However, the code below keeps only the last table of the HTML data, so I don't get the full result.
I have some code at the bottom, s = Scraper(urlsList[0]), that accesses the list urlsList defined above it. What I can't figure out is how to loop so that a dataframe is created for each item in the list (0, 1, 2, 3, etc.). So, say the following code is run:
s = Scraper(urlsList[n]) #n being the url page number
df
A separate dataframe is produced for each n.
At the moment I have a for loop which goes through the page numbers in the url 1 by 1.
Unfortunately I can't share the real URL as it requires authentication but I've made one up to show how the code functions.
import io
import urllib3
import pandas as pd
from requests_kerberos import OPTIONAL, HTTPKerberosAuth

a = get_mwinit_cookie()
pageCount = 6
urlsList = []
urls = "https://example-url.com/ABCD/customer.currentPage={}&end"
for x in range(pageCount)[1:]:
    urlsList.append(urls.format(x))

def Scraper(url):
    urllib3.disable_warnings()
    with requests_retry_session() as req:
        resp = req.get(url,
                       timeout=30,
                       verify=False,
                       allow_redirects=True,
                       auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
                       cookies=a)
        global df
        data = pd.read_html(resp.text, flavor=None, header=0, index_col=0)
        df = pd.concat(data, sort=False)
        print(df)

s = Scraper(urlsList[0])
df
You need to return something from your scraper function. Then you can collect the data from the pages in a list of DataFrames and use pd.concat() on that list.
def Scraper(url):
    urllib3.disable_warnings()
    with requests_retry_session() as req:
        resp = req.get(url,
                       timeout=30,
                       verify=False,
                       allow_redirects=True,
                       auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
                       cookies=a)
        return pd.read_html(resp.text, flavor=None, header=0, index_col=0)[0]

pages_data = []
for url in urlsList:
    pages_data.append(Scraper(url))
df = pd.concat(pages_data, sort=False)
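If you really do want a separate DataFrame for each page number n rather than one concatenated frame, collect them into a dict keyed by page number. A small sketch — `scrape` stands in for the Scraper function above:

```python
import pandas as pd

def collect_pages(urls, scrape):
    # One DataFrame per page, keyed by page number starting at 1.
    return {n: scrape(url) for n, url in enumerate(urls, start=1)}
```

Usage would then be page_frames = collect_pages(urlsList, Scraper), after which page_frames[1] is the DataFrame for the first page.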

Increase speed of the code which makes many api calls and stores all data into one dataframe

I wrote code which takes an identifier number, makes a request to a specific API, and returns the data related to that identifier. The code runs through a dataframe, takes the identifier numbers (approximately 600), retrieves the corresponding information, and converts it into a pandas dataframe for each one. Finally, all dataframes are concatenated into one. The code runs super slowly. Is there any way to make it faster? I am not confident in Python and would appreciate it if you can share solution code.
Code:
import json
import requests
import pandas as pd
from pandas import json_normalize

file = dataframe
list_df = []

for index, row in file.iterrows():
    url = "https://some_api/?eni_number=" + str(row['id']) + "&apikey=some_apikey"
    payload = {}
    headers = {'Accept': 'application/json'}
    response = requests.request("GET", url, headers=headers, data=payload)
    a = json.loads(response.text)
    df_index = json_normalize(a, 'vessels')
    df_index['eni_number'] = row['id']
    list_df.append(df_index)

total = pd.concat(list_df)
It seems the bottleneck here is that HTTP requests are executed synchronously, one after the other. So most of the time is wasted waiting for responses from the server.
You may have better results by using an asynchronous approach, for example using grequests to execute all HTTP requests in parallel:
import json
import grequests
import pandas as pd
from pandas import json_normalize

ids = dataframe["id"].to_list()
urls = [f"https://some_api/?eni_number={id}&apikey=some_apikey" for id in ids]
payload = {}
headers = {'Accept': 'application/json'}

# Note: naming this variable "requests" would shadow the requests module.
reqs = (grequests.get(url, headers=headers, data=payload) for url in urls)
responses = grequests.map(reqs)  # execute all requests in parallel

list_df = []
for id, response in zip(ids, responses):
    a = json.loads(response.text)
    df_index = json_normalize(a, 'vessels')
    df_index['eni_number'] = id
    list_df.append(df_index)
total = pd.concat(list_df)
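If pulling in grequests (and its gevent dependency) is undesirable, the stdlib concurrent.futures gives a similar speedup. A sketch against the same hypothetical API from the question; the `get` parameter exists only so the functions can be tested without a live endpoint:

```python
import json
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
from pandas import json_normalize
import requests

def default_get(url):
    # The real request; swapped out in tests.
    return requests.get(url, headers={'Accept': 'application/json'}).text

def fetch_vessels(eni_number, get=default_get):
    # One request -> one small DataFrame, as in the original loop.
    body = get(f"https://some_api/?eni_number={eni_number}&apikey=some_apikey")
    df = json_normalize(json.loads(body), 'vessels')
    df['eni_number'] = eni_number
    return df

def fetch_all(ids, get=default_get, max_workers=8):
    # Issue the requests on a thread pool and concatenate the results in order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return pd.concat(pool.map(lambda i: fetch_vessels(i, get=get), ids))
```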

Python: how to extract data from an OData API that paginates via #odata.nextLink

I need to pull data from an OData API. With the code below I do receive data, but only 250 rows.
The JSON contains a key called #odata.nextLink whose value is the BASE_URL + endpoint + ?$skip=250.
How can I loop through the next pages?
import requests
import pandas as pd
from pandas import json_normalize
import json

BASE_URL = "base_url"

def session_token():
    url = BASE_URL + '/api/oauth/token'
    headers = {"Accept": "application/json",
               "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8"}
    body = {"username": "user",
            "password": "pwd",
            "grant_type": "password"}
    return "Bearer " + requests.post(url, headers=headers, data=body).json()["access_token"]

def make_request(endpoint, token=session_token()):
    headers = {"Authorization": token}
    response = requests.get(BASE_URL + endpoint, headers=headers)
    if response.status_code == 200:
        json_data = json.loads(response.text)
        return json_data

make_request("/odata/endpoint")
Following #Marek Piotrowski's advice I modified my code and came to a solution:
def main():
    url = "endpoint"
    while url:
        json_data = make_request(url)  # make_request returns the parsed JSON, or None on error
        if json_data is None:
            break
        url = json_data.get("#odata.nextLink")  # fetch next link; absent on the last page
        yield json_data['value']

result = pd.concat(json_normalize(row) for row in main())
print(result)  # final dataframe, works like a charm :)
Something like this would retrieve all records, I believe (assuming #odata.nextLink is indeed present in json_data):
def retrieve_all_records(endpoint, token=session_token()):
    all_records = []
    headers = {"Authorization": token}
    url = BASE_URL + endpoint
    while url:
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            break
        json_data = json.loads(response.text)
        all_records = all_records + json_data['records']
        url = json_data.get('#odata.nextLink')  # None when there is no next page
    return all_records
The code is untested, though. Let me know if it works. Alternatively, you could make a recursive call to make_request, I believe, but then you'd have to store the results somewhere outside the function itself.
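The pagination pattern in both snippets can be factored into a small generator that is independent of how pages are fetched — `get_json` is whatever function returns the parsed JSON of one page (make_request above, in this thread's case). A sketch:

```python
def paginate(get_json, first_url):
    # Yield every record, following #odata.nextLink until it disappears.
    url = first_url
    while url:
        page = get_json(url)
        yield from page.get('value', [])
        url = page.get('#odata.nextLink')
```

With the code above, usage would look like records = list(paginate(make_request, "/odata/endpoint")).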
I know that this is late, but you could look at this article from Towards Data Science by Ephram Mwai.
He pretty much solved the problem with a good script.
