Retrieving items in a loop - Python

I'm trying to retrieve JSON data from https://www.fruityvice.com/api/fruit/.
So I'm creating a function to do that, but as a result I only get 1 fruit back.
import requests
import json

def scrape_all_fruits():
    for ID in range(1, 10):
        url = f'https://www.fruityvice.com/api/fruit/{ID}'
        response = requests.get(url)
        data = response.json()
        return data

print(scrape_all_fruits())
Can anyone explain to me why?
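The reason is that return data sits inside the for loop, so the function exits on the very first iteration, after fetching only the fruit with ID 1. A minimal sketch of one way to fix it, collecting every response in a list and returning only after the loop has finished (the range bounds are kept from the question):

import requests

def scrape_all_fruits():
    fruits = []
    for fruit_id in range(1, 10):
        url = f'https://www.fruityvice.com/api/fruit/{fruit_id}'
        response = requests.get(url)
        # collect each fruit instead of returning inside the loop
        fruits.append(response.json())
    return fruits  # return only once all IDs have been fetched

print(scrape_all_fruits())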

Related

How to get all UniProt results as tsv for query

I'm looking for a programmatic way to get all the UniProt IDs and sequences (Swiss-Prot + TrEMBL) for a given protein length, but when I run my query I only get the first 25 results. Is there any way to run a loop to get them all?
My code:
import requests, sys

WEBSITE_API = "https://rest.uniprot.org"

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

r = get_url(f"{WEBSITE_API}/uniprotkb/search?query=length%3A%5B100%20TO%20109%5D&fields=id,accession,length,sequence", headers={"Accept": "text/plain; format=tsv"})

with open("request.txt", "w") as file:
    file.write(r.text)
I ended up with a different URL which seems to work and gives back more/all results. It was taking a long time to return all ~5M, so I've restricted it to just 10,000 for now.
import requests
import pandas as pd
import io
import sys

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
        sys.exit()
    return response

# create the url
base_url = 'https://www.uniprot.org/uniprot/?query=length:[100%20TO%20109]'
additional_fields = [
    '&format=tab',
    '&limit=10000',  # currently limiting the number of returned results; comment out this line to get all output
    '&columns=id,accession,length,sequence',
]
url = base_url + ''.join(additional_fields)
print(url)

# get the results as a large pandas table
r = get_url(url)
df = pd.read_csv(io.StringIO(r.text), sep='\t')
df
Output:
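For the rest.uniprot.org endpoint in the first snippet, the 25 results are just the default page size. As far as I know, the current REST API accepts a size parameter (up to 500) and advertises the next page in a Link response header, so a cursor-style loop roughly like the sketch below should collect everything; the query and fields are taken from the question, while the pagination details are my assumption about the API:

import re
import requests

def get_next_link(headers):
    # the API points to the next page via: Link: <url>; rel="next"
    match = re.search(r'<(.+)>; rel="next"', headers.get("Link", ""))
    return match.group(1) if match else None

url = ("https://rest.uniprot.org/uniprotkb/search"
       "?query=length%3A%5B100%20TO%20109%5D"
       "&fields=id,accession,length,sequence"
       "&format=tsv&size=500")

first_page = True
with open("request.txt", "w") as out:
    while url:
        response = requests.get(url)
        response.raise_for_status()
        text = response.text
        if not first_page:
            # every page repeats the TSV header row; keep it only once
            text = text.split("\n", 1)[1]
        out.write(text)
        first_page = False
        url = get_next_link(response.headers)  # None when there are no more pages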

Python - Improve urllib performance

I have a huge dataframe with 1M+ rows and I want to collect some information from Nominatim via its URL API. I have used the geopy library before, but I had some problems, so I decided to use the API directly instead.
But my code is too slow at making the requests.
URLs sample:
urls = ['https://nominatim.openstreetmap.org/?addressdetails=1&q=Suape&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Kwangyang&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Luanda&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Ancona&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Xiamen&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Jebel%20Dhanna/Ruwais&format=json&limit=1&accept-language=en',
'https://nominatim.openstreetmap.org/?addressdetails=1&q=Nemrut%20Bay%20&format=json&limit=1&accept-language=en']
Sample code below.
For a single URL, as a test:
import pandas as pd
import urllib.request
import json

url = urls[0]  # one of the sample URLs above
req = urllib.request.Request(url, None)
f = urllib.request.urlopen(req)
page = f.read()
nominatim = json.loads(page.decode())
result = nominatim[0]['address']['country']
So, I created a function to apply to my dataframe column that holds the location addresses (e.g. Suape, Kwangyang, Luanda...):
def country(address):
    try:
        if ' ' in address:
            address = address.replace(' ', '%20')
            url = f'https://nominatim.openstreetmap.org/?addressdetails=1&q={address}&format=json&limit=1&accept-language=en'
            req = urllib.request.Request(url, None)
            f = urllib.request.urlopen(req)
            page = f.read()
            nominatim = json.loads(page.decode())
            result = nominatim[0]['address']['country']
        else:
            url = f'https://nominatim.openstreetmap.org/?addressdetails=1&q={address}&format=json&limit=1&accept-language=en'
            req = urllib.request.Request(url, None)
            f = urllib.request.urlopen(req)
            page = f.read()
            nominatim = json.loads(page.decode())
            result = nominatim[0]['address']['country']
        return result
    except:
        pass
It takes too long to run. I have tried to optimize my function and different approaches to applying it to the column, and now I'm trying to improve the requests, because that is the part that takes the most time. Any suggestions? I have tried threading, but it doesn't work as expected (maybe I haven't done it properly). I tested asyncio as well, but that code didn't work either.
Thank you!
Edit: I'll only run the requests on the unique values of this column, which correspond to approx. 4K rows. But even with 4K rows, the code takes too much time to run.
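No answer is shown here, so below is a minimal sketch of the threading approach the question mentions, using concurrent.futures with a shared requests.Session (switching from urllib is my assumption) and urllib.parse.quote instead of the manual '%20' replacement. Note that the public nominatim.openstreetmap.org instance's usage policy allows roughly one request per second, so a thread pool like this is really only appropriate against your own Nominatim server:

import urllib.parse
from concurrent.futures import ThreadPoolExecutor

import requests

session = requests.Session()  # reuse one connection instead of reopening per request

def country(address):
    # quote() handles spaces and other special characters in one go
    url = ('https://nominatim.openstreetmap.org/?addressdetails=1'
           f'&q={urllib.parse.quote(address)}&format=json&limit=1&accept-language=en')
    try:
        data = session.get(url, headers={'User-Agent': 'my-geocoder'}, timeout=10).json()
        return data[0]['address']['country']
    except Exception:
        return None

addresses = ['Suape', 'Kwangyang', 'Luanda', 'Ancona', 'Xiamen']  # the unique values of the column
with ThreadPoolExecutor(max_workers=8) as pool:
    countries = dict(zip(addresses, pool.map(country, addresses)))
print(countries)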

How can we loop through URLs, all of which consist of JSON data, convert each to a dataframe, and save each one to a separate CSV file?

I am trying to read data from multiple URLs, convert each JSON dataset to a dataframe, and save each dataframe in tabular format, like CSV. I am testing this code.
import requests
url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.mskcc.org/teaser/standard-charges-nyc.json'
r = requests.get(url)
data = r.json()
That code seems to read each URL fine. I guess I'm stuck on how to standardize the process of converting multiple JSON data sources into dataframes and saving each dataframe as a CSV. I tested this code.
import pandas as pd
import requests
import json

url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\ryans\Desktop\northwell.csv')

url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\ryans\Desktop\chsli.csv')
That seems to save data in two CSVs, and each one has many, many columns and just a few rows of data. I'm not sure why this happens. Somehow, it seems like pd.json_normalize is NOT converting the JSON into a tabular shape. Any thoughts?
Also, I'd like to parse the URL to include it in the name of the CSV that is saved. So, this 'https://www.northwell.edu/' becomes this 'C:\Users\ryans\Desktop\northwell.csv' and this 'https://www.chsli.org/' becomes this 'C:\Users\ryans\Desktop\chsli.csv'.
For the JSON decoding:
The problem is that each URL has its own data format.
For example, with "https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json" → the JSON data is inside the data field.
import requests
import json
import pandas as pd
url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
data = pd.DataFrame(data['data'], columns=data['columns'], index=data['index'])
For the URL parsing:
urls = ['https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json',
'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json',
'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json',
'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json',
'https://www.mskcc.org/teaser/standard-charges-nyc.json']
for u in urls:
    print('C:\\Users\\ryans\\Desktop\\' + u.split('.')[1] + '.csv')
Output:
C:\Users\ryans\Desktop\chsli.csv
C:\Users\ryans\Desktop\northwell.csv
C:\Users\ryans\Desktop\montefiorehealthsystem.csv
C:\Users\ryans\Desktop\kaleidahealth.csv
C:\Users\ryans\Desktop\mskcc.csv
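Putting the two parts together, here is a sketch of how the whole loop could look, assuming each feed either flattens directly with pd.json_normalize or uses the split data/columns/index layout shown above for the Montefiore file; the format check and the naming rule are illustrative, not taken from the question:

import pandas as pd
import requests

urls = [
    'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json',
    'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json',
    'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json',
    'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json',
    'https://www.mskcc.org/teaser/standard-charges-nyc.json',
]

for u in urls:
    data = requests.get(u).json()
    if isinstance(data, dict) and {'data', 'columns', 'index'} <= data.keys():
        # split "orient" layout, as in the Montefiore file
        df = pd.DataFrame(data['data'], columns=data['columns'], index=data['index'])
    else:
        # fall back to flattening whatever structure the feed uses
        df = pd.json_normalize(data)
    name = u.split('.')[1]  # e.g. 'northwell' from 'https://www.northwell.edu/...'
    df.to_csv('C:\\Users\\ryans\\Desktop\\' + name + '.csv')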

Scraping using BeautifulSoup only gets me 33 responses off of an infinite-scrolling page. How do I increase the number of responses?

The website link:
https://collegedunia.com/management/human-resources-management-colleges
The code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://collegedunia.com/management/human-resources-management-colleges")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "jsx-765939686 col-4 mb-4 automate_client_img_snippet"})
l = []
for divParent in all:
    item = divParent.find("div", {"class": "jsx-765939686 listing-block text-uppercase bg-white position-relative"})
    d = {}
    d["Name"] = item.find("div", {"class": "jsx-765939686 top-block position-relative overflow-hidden"}).find("div", {"class": "jsx-765939686 clg-name-address"}).find("h3").text
    d["Rating"] = item.find("div", {"class": "jsx-765939686 bottom-block w-100 position-relative"}).find("ul").find_all("li")[-1].find("a").find("span").text
    d["Location"] = item.find("div", {"class": "jsx-765939686 clg-head d-flex"}).find("span").find("span", {"class": "mr-1"}).text
    l.append(d)

import pandas
df = pandas.DataFrame(l)
df.to_excel("Output.xlsx")
The page keeps adding colleges as you scroll down. I don't know if I can get all the data, but is there a way to at least increase the number of responses I get? There are a total of 2506 entries, as can be seen on the website.
Looking at your question, we can see in the network requests that the data is being fetched from an AJAX request, and they are using base64-encoded params to fetch it. You can follow the code below to get the data and parse it into your desired format.
Code:
import json
import pandas
import requests
import base64

collegedata = []
count = 0
while True:
    datadict = {"url": "management/human-resources-management-colleges", "stream": "13", "sub_stream_id": "607",
                "page": count}
    data = base64.urlsafe_b64encode(json.dumps(datadict).encode()).decode()
    params = {
        "data": data
    }
    response = requests.get('https://collegedunia.com/web-api/listing', params=params).json()
    if response["hasNext"]:
        for i in response["colleges"]:
            d = {}
            d["Name"] = i["college_name"]
            d["Rating"] = i["rating"]
            d["Location"] = i["college_city"] + ", " + i["state"]
            collegedata.append(d)
            print(d)
    else:
        break
    count += 1

df = pandas.DataFrame(collegedata)
df.to_excel("Output.xlsx", index=False)
Output:
Let me know if you have any questions :)
When you analyse the website via the network tab in Chrome, you can see that the website makes XHR calls in the background.
The endpoint to which it sends an HTTP GET request is as follows:
https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30=
When you send a GET via the requests module, you get a JSON response back.
import requests
url = "https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
res = requests.get(url)
print(res.json())
But you need all the data, not only page 1. The data sent in the request is base64-encoded, i.e. if you decode the data parameter of the GET request, you can see the following:
{"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}
Now, change the page number, sub_stream_id, stream, etc. accordingly to get the complete data from the website.
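As a small illustration of that last step, this decodes the data parameter shown above and re-encodes it with a different page number (the field values are exactly the ones from the request above):

import base64
import json

encoded = "eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="

# decode the existing parameter to see the underlying JSON
payload = json.loads(base64.urlsafe_b64decode(encoded))
print(payload)  # {'url': 'management/human-resources-management-colleges', 'stream': '13', 'sub_stream_id': '607', 'page': 3}

# change the page and re-encode it for the next request
payload["page"] = 4
new_data = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()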

Nested while loop for API json collection

I'm requesting 590 pages from the Meetup API. I've iterated with a while loop to get the page numbers. Now that I have the pages, I need to request them and format them correctly as Python objects in order to place them into a pandas dataframe.
This is how it looks when you do it for one URL:
url = ('https://api.meetup.com/2/groups?offset=1&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3')
r = requests.get(url).json()
data = pd.io.json.json_normalize(r['results'])
But because I have so many pages, I want to do this automatically and iterate through them all.
That's how nested while loops came to mind, and this is what I tried:
urls = 0
offset = 0
url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'
r = requests.get(urls%d = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3').json()
while urlx < 591:
    new_url = r % urls % offset
    print(new_url)
    offset += 1
However, it isn't working and I'm receiving many errors including this one:
SyntaxError: keyword can't be an expression
Not sure what you're trying to do, and the code has lots of issues.
But if you just want to loop through 0 to 591 and fetch URLs, then here's the code:
import requests
import pandas as pd

dfs = []
base_url = 'https://api.meetup.com/2/groups?offset=%d&format=json&category_id=34&photo-host=public&page=100&radius=200.0&fields=&order=id&desc=false&sig_id=243750775&sig=768bcf78d9c73937fcf2f5d41fe6070424f8d0e3'

for i in range(0, 592):
    url = base_url % i
    r = requests.get(url).json()
    print("Fetching URL: %s\n" % url)
    # do something with r here
    # here I'll append it to a list of dfs
    dfs.append(pd.io.json.json_normalize(r['results']))
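Once the loop finishes, the per-page frames in dfs can be stacked into a single table; this concat step is an addition, not part of the original answer (and in newer pandas, pd.io.json.json_normalize is available as pd.json_normalize):

import pandas as pd

# combine all per-page results into one dataframe
all_groups = pd.concat(dfs, ignore_index=True)
print(all_groups.shape)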
