Problem with for loop while querying PubAg API with Python

Here is my code; I am trying to get 100 unique publications for each search term:
import requests
from time import sleep
import pandas as pd

search_terms = ['grapes', 'cotton', 'apple', 'onion', 'cucumber']
res_data = []
pag_data = []
for i in search_terms[0:5]:
    try:
        api_key = "DEMO_KEY"
        endpoint = 'https://api.nal.usda.gov/pubag/rest/search?query=abstract:' + str(i)
        print(endpoint)
        q_params = {"api_key": api_key, "per_page": 100}
        response = requests.get(endpoint, params=q_params)
        print(response)
        r = response.json()
        res_data.append(r)
        print(len(res_data))
        a = res_data[0]['response']
        pag_data.append(a)
        print(len(pag_data))
        sleep(5)
    except:
        print(i)
        continue
Then, to create the DataFrames:
pag_dfs = [pd.DataFrame(i['docs']) for i in pag_data]
df = pd.concat(pag_dfs, axis=0, ignore_index=True)
I am getting the 500 publications, but only 100 are unique. If I do it one by one it works. How can I improve my for loop to get unique records for each search term?
Docs: https://pubag.nal.usda.gov/apidocs
Edit:
I got the for loop to work by adding another for loop over the second list. This adds a lot of duplicates, but I then processed it with pandas to get the unique publications for each term.
res_data = []
pag_data = []
for i in search_terms:
    try:
        api_key = "DEMO_KEY"
        endpoint = 'https://api.nal.usda.gov/pubag/rest/search?query=abstract:' + str(i)
        print(endpoint)
        q_params = {"api_key": api_key, "per_page": 100}
        response = requests.get(endpoint, params=q_params)
        print(response)
        r = response.json()
        res_data.append(r)
        print(len(res_data))
        for x in res_data:
            a = x['response']
            pag_data.append(a)
            print(len(pag_data))
        #sleep(2)
    except:
        print(i)
        continue
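A sketch of that pandas step (deduplicating on the "id" field that each returned doc carries):

import pandas as pd

# Flatten the duplicated responses and keep one row per unique publication "id"
pag_dfs = [pd.DataFrame(page['docs']) for page in pag_data]
df = pd.concat(pag_dfs, axis=0, ignore_index=True).drop_duplicates(subset='id')
print(len(df))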

As I mentioned in the comments, there are things that can be done to improve your code quality:
for i in search_terms[0:5] is equivalent to for i in search_terms
The loop variable name i is uninformative; it does not say what it is. Use something more descriptive, for example for term in search_terms
You don't have to call str(i), since it is already a string
The continue in your except block is useless; the for loop will continue without it too.
You are also doing unnecessary things such as creating the res_data list.
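The duplicates themselves most likely come from a = res_data[0]['response'], which re-reads the first term's response on every iteration instead of the response you just fetched. A minimal sketch of that fix:

import requests
from time import sleep

api_key = "DEMO_KEY"
q_params = {"api_key": api_key, "per_page": 100}
search_terms = ['grapes', 'cotton', 'apple', 'onion', 'cucumber']

pag_data = []
for term in search_terms:
    endpoint = 'https://api.nal.usda.gov/pubag/rest/search?query=abstract:' + term
    r = requests.get(endpoint, params=q_params).json()
    pag_data.append(r['response'])   # the response for *this* term, not res_data[0]
    sleep(5)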
Now to a potential solution. Given the lack of detail regarding exactly what output you want, I have made an assumption or two. With that said, here is a solution that yields the following results, which might be what you need:
nalt_all ... pmcid_url
0 [adipose tissue, anthocyanins, aorta, body wei... ... NaN
2 [aquaculture, data collection, eco-efficiency,... ... NaN
7 [ribosomal DNA, host plants, Xylella fastidios... ... NaN
8 [bioactive properties, canola oil, droplet siz... ... NaN
10 [Aphidoidea, Bactrocera, Citrus, adults, birds... ... NaN
.. ... ... ...
472 [Cucumis sativus, Escherichia coli, biosynthes... ... NaN
480 [biochemistry, cucumbers, fungicides, hydropho... ... NaN
491 [Cucumber mosaic virus, Eisenia fetida, Tobacc... ... NaN
494 [Cucumber mosaic virus, RNA libraries, correla... ... NaN
499 [Aphidoidea, Capsicum annuum, Cucumber mosaic ... ... NaN
[141 rows x 28 columns]
With columns:
nalt_all, subject, author_lastname, last_modified_date, language, title, startpage, usda_authored_publication, id, text_availability, doi_url, issue, author, format, endpage, journal_name, abstract, pmid_url, url, volume, publication_year, usda_funded_publication, issn, page, author_primary, doi, handle_url, pmcid_url
import requests
from time import sleep
import pandas as pd

api_key = "DEMO_KEY"
q_params = {"api_key": api_key, "per_page": 100}

# Fetch data from the API
search_terms = ['grapes', 'cotton', 'apple', 'onion', 'cucumber']
publications = []
for i, search_term in enumerate(search_terms):
    try:
        endpoint = 'https://api.nal.usda.gov/pubag/rest/search?query=abstract:' + search_term
        response = requests.get(endpoint, params=q_params)
        response.raise_for_status()  # raise requests.HTTPError on a bad status code
        r = response.json()
        # Here I used the iterable unpacking operator *
        publications = [*publications, *r["response"]["docs"]]
        # `publications += r["response"]["docs"]` is an equivalent way of merging the lists
        sleep(5)  # Important, don't spam the API
    except requests.HTTPError as e:
        print("=" * 50)
        print(f"Failed: {i}, {search_term}, {e}")
        print("=" * 50)

# Merge all the publications (each publication is a dictionary)
pubs = {}
for i, pub in enumerate(publications):
    pubs.update({i: pub})

"""
The creation of the Pandas dataframe
1) Create the dataframe and specify that we want the keys of the `pubs` dictionary to be the rows of the dataframe by specifying `orient="index"`
2) We drop all duplicates with respect to the column "issn" from the dataframe. The ISSN is a unique identifier for each publication.
3) We drop all columns that have all "NaN" because they are useless
4) We drop the first "date" column as it too seems useless, it only contains "0000".
"""
df = pd.DataFrame.from_dict(pubs, orient="index").drop_duplicates(subset="issn").dropna(axis=1, how='all').drop("date", axis=1)
print(df)
EDIT: You can perform the merging of the publication dictionaries in the first loop if you want to; this would make the script faster as it avoids unnecessary work. I skipped it in the above solution so as not to complicate it too much. Here is how the code would look if you do that.
import requests
from time import sleep
import pandas as pd

api_key = "DEMO_KEY"
q_params = {"api_key": api_key, "per_page": 100}

# Fetch data from the API
search_terms = ['grapes', 'cotton', 'apple', 'onion', 'cucumber']
publications = dict()
j = 0
for i, search_term in enumerate(search_terms):
    try:
        endpoint = 'https://api.nal.usda.gov/pubag/rest/search?query=abstract:' + search_term
        response = requests.get(endpoint, params=q_params)
        response.raise_for_status()  # raise requests.HTTPError on a bad status code
        r = response.json()
        for publication in r["response"]["docs"]:
            publications.update({j: publication})
            j += 1
        sleep(5)  # Important, don't spam the API
    except requests.HTTPError as e:
        print("=" * 50)
        print(f"Failed: {i}, {search_term}, {e}")
        print("=" * 50)

"""
The creation of the Pandas dataframe
1) Create the dataframe and specify that we want the keys of the `publications` dictionary to be the rows of the dataframe by specifying `orient="index"`
2) We drop all duplicates with respect to the column "issn" from the dataframe. The ISSN is a unique identifier for each publication.
3) We drop all columns that have all "NaN" because they are useless
4) We drop the first "date" column as it too seems useless, it only contains "0000".
"""
df = pd.DataFrame.from_dict(publications, orient="index").drop_duplicates(subset="issn").dropna(axis=1, how='all').drop("date", axis=1)
print(df, end="\n\n")

for col in df.columns:
    print(col, end=", ")

Related

Adding Column to data frame based on list content in a loop? - Python

I'm pulling data from the NHL API for player stats based on individual games. I'm trying to make a loop that calls the data, parses the JSON, creates a dict which I then can create a data frame from for an entire team. The code before my looping looks like this:
API_URL = "https://statsapi.web.nhl.com/api/v1"

response = requests.get(API_URL + "/people/8477956/stats?stats=gameLog", params={"Content-Type": "application/json"})
data = json.loads(response.text)

df_list_dict = []
for game in data['stats'][0]['splits']:
    curr_dict = game['stat']
    curr_dict['date'] = game['date']
    curr_dict['isHome'] = game['isHome']
    curr_dict['isWin'] = game['isWin']
    curr_dict['isOT'] = game['isOT']
    curr_dict['team'] = game['team']['name']
    curr_dict['opponent'] = game['opponent']['name']
    df_list_dict.append(curr_dict)

df = pd.DataFrame.from_dict(df_list_dict)
print(df)
This gives me a digestible data frame for a single player. (/people/{player}/....
I want to iterate through a list (the list being an NHL team), while adding a column that identifies the player and concatenates the created data frames. My attempt thus far looks like this:
import requests
import json
import pandas as pd

Rangers = ['8478550', '8476459', '8479323', '8476389', '8475184', '8480817', '8480078', '8476624', '8481554', '8482109', '8476918', '8476885', '8479324',
           '8482073', '8479328', '8480833', '8478104', '8477846', '8477380', '8477380', '8477433', '8479333', '8479991']

def callapi(player):
    response = requests.get(f'https://statsapi.web.nhl.com/api/v1/people/{player}/stats?stats=gameLog', params={"Content-Type": "application/json"})
    data = json.loads(response.text)
    df_list_dict = []
    for game in data['stats'][0]['splits']:
        curr_dict = game['stat']
        curr_dict['date'] = game['date']
        curr_dict['isHome'] = game['isHome']
        curr_dict['isWin'] = game['isWin']
        curr_dict['isOT'] = game['isOT']
        curr_dict['team'] = game['team']['name']
        curr_dict['opponent'] = game['opponent']['name']
        df_list_dict.append(curr_dict)
    df = pd.DataFrame.from_dict(df_list_dict)
    print(df)

for player in Rangers:
    callapi(player)
print(callapi)
When this is printed I can see all the data frames that were created. I cannot use curr_dict[] to add a column based on the list position (the player ID) because the index must be a slice or integer, not a string.
What I'm hoping to do is make this one data frame in which the stats are identified by a player id column.
My Python knowledge is very scattered; I feel as if, with the progress I've made, I should know how to complete this, but I've simply hit a wall. Any help would be appreciated.
You can use concurrent.futures to parallelize the requests before concatenating them all together, and json_normalize to parse the JSON.
import concurrent.futures
import json
import os
import pandas as pd
import requests

class Scrape:

    def main(self) -> pd.DataFrame:
        rangers = ["8478550", "8476459", "8479323", "8476389", "8475184", "8480817", "8480078",
                   "8476624", "8481554", "8482109", "8476918", "8476885", "8479324", "8482073",
                   "8479328", "8480833", "8478104", "8477846", "8477380", "8477380", "8477433",
                   "8479333", "8479991"]
        with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
            return pd.concat(executor.map(self.get_stats, rangers)).reset_index(drop=True).fillna(0)

    @staticmethod
    def get_stats(player: str) -> pd.DataFrame:
        url = f"https://statsapi.web.nhl.com/api/v1/people/{player}/stats?stats=gameLog"
        with requests.Session() as request:
            response = request.get(url, timeout=30)
        if response.status_code != 200:
            print(response.raise_for_status())
        data = json.loads(response.text)
        df = (pd.
              json_normalize(data=data, record_path=["stats", "splits"])
              .rename(columns={"team.id": "team_id", "team.name": "team_name",
                               "opponent.id": "opponent_id", "opponent.name": "opponent_name"})
              ).assign(player_id=player)
        df = df[df.columns.drop(list(df.filter(regex="link|gamePk")))]
        df.columns = df.columns.str.split(".").str[-1]
        if "faceOffPct" not in df.columns:
            df["faceOffPct"] = 0
        return df

if __name__ == "__main__":
    stats = Scrape().main()
    print(stats)

How to iterate over dataframe rows for individual API calls

I'm trying to set up a loop to pull in weather data for about 500 weather stations for an entire year which I have in my dataframe. The base URL stays the same, and the only part that changes is the weather station ID.
I'd like to create a dataframe with the results. I believe I'd use requests.get to pull in data for all the weather stations in my list; the IDs to use in the URL are in a column called "API ID" in my dataframe. I am a python beginner - so any help would be appreciated! My code is below but doesn't work and returns an error:
"InvalidSchema: No connection adapters were found for '0 " http://www.ncei.noaa.gov/access/services/data/...\nName: API ID, Length: 497, dtype: object'
def callAPI(API_id):
    for IDs in range(len(API_id)):
        url = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=PRCP,SNOW,TMAX,TMIN&stations=' + distances['API ID'] + '&startDate=2020-01-01&endDate=2020-12-31&includeAttributes=0&includeStationName=true&units=standard&format=json')
        r = requests.request('GET', url)
        d = r.json()

ll = []
for index1, rows1 in distances.iterrows():
    station = rows1['Closest Station']
    API_id = rows1['API ID']
    data = callAPI(API_id)
    ll.append([(data)])
I am not sure about your whole code base, but this is the function that will return the data from the API. If you have multiple station IDs in a single df column then you can use a for loop; otherwise there is no need to do that.
Also, you are not returning the result from the function. Check the return keyword at the end of the function.
Working code:
import requests

def callAPI(API_id):
    url = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=PRCP,SNOW,TMAX,TMIN&stations=' + API_id + '&startDate=2020-01-01&endDate=2020-12-31&includeAttributes=0&includeStationName=true&units=standard&format=json')
    r = requests.request('GET', url)
    d = r.json()
    return d

print(callAPI('USC00457180'))
So your full code will be something like this:
def callAPI(API_id):
    url = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries&dataTypes=PRCP,SNOW,TMAX,TMIN&stations=' + API_id + '&startDate=2020-01-01&endDate=2020-12-31&includeAttributes=0&includeStationName=true&units=standard&format=json')
    r = requests.request('GET', url)
    d = r.json()
    return d

ll = []
for index1, rows1 in distances.iterrows():
    station = rows1['Closest Station']
    API_id = rows1['API ID']
    data = callAPI(API_id)
    ll.append([(data)])
Note: Even better, use asynchronous calls to the API to make the process faster. Something like this: https://stackoverflow.com/a/56926297/1138192
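For illustration, a minimal sketch of the asynchronous approach using asyncio and aiohttp (the URL and the distances['API ID'] column are taken from the question; error handling is omitted):

import asyncio
import aiohttp

BASE = ('http://www.ncei.noaa.gov/access/services/data/v1?dataset=daily-summaries'
        '&dataTypes=PRCP,SNOW,TMAX,TMIN&stations={}&startDate=2020-01-01'
        '&endDate=2020-12-31&includeAttributes=0&includeStationName=true'
        '&units=standard&format=json')

async def fetch(session, api_id):
    # One request per station ID, parsed straight to JSON
    async with session.get(BASE.format(api_id)) as resp:
        return await resp.json()

async def fetch_all(api_ids):
    # Run all station requests concurrently on one shared session
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, i) for i in api_ids))

# ll = asyncio.run(fetch_all(distances['API ID']))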

Processing API data (json) into a singular data frame (list of list of dictionaries)?

So this is somewhat of a continuation of a previous post of mine, except now I have API data to work with. I am trying to get the keys Type and Email as columns in a data frame to come up with a final number. My code:
jsp_full = []
for p in payloads:
    payload = {"payload": {"segmentId": p}}
    r = requests.post(url, headers=header, json=payload)
    #print(r, r.reason)
    time.sleep(r.elapsed.total_seconds())
    json_data = r.json() if r and r.status_code == 200 else None
    json_keys = json_data['payload']['supporters']
    json_package = []
    jsp_full.append(json_package)
    for row in json_keys:
        SID = row['supporterId']
        Handle = row['contacts']
        a_key = 'value'
        list_values = [a_list[a_key] for a_list in Handle]
        string = str(list_values).split(",")
        data = {
            'SupporterID': SID,
            'Email': strip_characters(string[-1]),
            'Type': labels(p)
        }
        json_package.append(data)
    t2 = round(time.perf_counter(), 2)
    b_key = "Email"
    e = len([b_list[b_key] for b_list in json_package])
    t = str(labels(p))
    #print(json_package)
    print(f'There are {e} emails in the {t} segment')
    print(f'Finished in {t2 - t1} seconds')
    excel = pd.DataFrame(json_package)
    excel.to_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(t, str(today)), sheet_name=t)
This part works all well and good. Each payload in the API represents a different segment of people, so I split them out into different files. However, I am at a point where I need to combine all records into a single data frame, hence why I append out to jsp_full. This is a list of lists of dictionaries.
Once I have that I would run the balance of my code which is like this:
S= pd.DataFrame(jsp_full[0], index = {0})
Advocacy_Supporters = S.sort_values("Type").groupby("Type", as_index=False)["Email"].first()
print(Advocacy_Supporters['Email'].count())
print("The number of Unique Advocacy Supporters is :")
Advocacy_Supporters_Group = Advocacy_Supporters.groupby("Type")["Email"].nunique()
print(Advocacy_Supporters_Group)
Some sample data:
[{'SupporterID': '565f6a2f-c7fd-4f1b-bac2-e33976ef4306', 'Email': 'somebody#somewhere.edu', 'Type': 'd_Student Ambassadors'}, {'SupporterID': '7508dc12-7647-4e95-a8b8-bcb067861faf', 'Email': 'someoneelse#email.somewhere.edu', 'Type': 'd_Student Ambassadors'},...
My desired output is a dataframe that looks like so:
SupporterID Email Type
565f6a2f-c7fd-4f1b-bac2-e33976ef4306 somebody#somewhere.edu d_Student Ambassadors
7508dc12-7647-4e95-a8b8-bcb067861faf someoneelse#email.somewhere.edu d_Student Ambassadors
Any help is greatly appreciated!!
So because this code creates an excel file for each segment, all I did was read back in the excels via a for loop like so:
filesnames = ['e_S Donors', 'b_Contributors', 'c_Activists', 'd_Student Ambassadors', 'a_Volunteers', 'f_Offline Action Takers']
S = pd.DataFrame()
for i in filesnames:
    data = pd.read_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(i, str(today)), sheet_name=i, engine='openpyxl')
    S = S.append(data)
This did the trick since it was in a format I already wanted.
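For reference, a sketch of an alternative that skips the intermediate Excel files: since jsp_full is a list of lists of dictionaries, it can be flattened and loaded into one data frame directly (the SupporterID, Email and Type keys are those built in the loop above).

import itertools
import pandas as pd

# Flatten the list of lists of dicts into a single list of records
flat_records = list(itertools.chain.from_iterable(jsp_full))
combined = pd.DataFrame(flat_records)
print(combined[['SupporterID', 'Email', 'Type']].head())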

Loop and add function component as index

I would like to change the index in the following code. Instead of having 'close' as the index, I want to have the corresponding x from the function, since sometimes, as in this example, even if I provide 4 currencies only 3 are available. That means I cannot add the list as the index after looping, as the size changes. Thank you for your help. I should add that even with set_index(x) the index remains 'close'.
The function daily_price_historical retrieves prices from a public API. There are exactly 7 columns, from which I select the first one (close).
The function:
def daily_price_historical(symbol, comparison_symbol, all_data=False, limit=1, aggregate=1, exchange=''):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate)
    if exchange:
        url += '&e={}'.format(exchange)
    if all_data:
        url += '&allData=true'
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df.drop(df.index[-1], inplace=True)
    return df
The code:
curr = ['1WO', 'ABX', 'ADH', 'ALX']
d_price = []
for x in curr:
    try:
        close = daily_price_historical(x, 'JPY', exchange='CCCAGG').close
        d_price.append(close).set_index(x)
    except:
        pass

d_price = pd.concat(d_price, axis=1)
d_price = d_price.transpose()
print(d_price)
The output:
0
close 2.6100
close 0.3360
close 0.4843
The function daily_price_historical returns a dataframe, so daily_price_historical(x, 'JPY', exchange='CCCAGG').close is a pandas Series. The title of a Series is its name, but you can change it with rename. So you want:
...
close = daily_price_historical(x, 'JPY', exchange='CCCAGG').close
d_price.append(close.rename(x))
...
In your original code, d_price.append(close).set_index(x) raised an AttributeError: 'NoneType' object has no attribute 'set_index' exception, because append on a list returns None. The exception was raised after the append and was silently swallowed by the catch-all except: pass.
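A quick illustration of that point:

# list.append mutates the list in place and returns None,
# so nothing can be chained after it.
prices = []
result = prices.append(42)
print(result)   # None
print(prices)   # [42]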
What to remember from that: never use the very dangerous:
try:
    ...
except:
    pass
which hides any error.
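A sketch of a safer pattern for the loop in the question (reusing daily_price_historical, curr and d_price from above; the exact exceptions to catch depend on what the function can raise, with requests errors and a missing 'Data' key being likely candidates):

import requests

for x in curr:
    try:
        close = daily_price_historical(x, 'JPY', exchange='CCCAGG').close
        d_price.append(close.rename(x))
    except (requests.RequestException, KeyError) as exc:
        # Skip this currency but keep the error visible instead of hiding it
        print(f"Skipping {x}: {exc}")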
Try this small piece of code:
import pandas as pd
import requests

curr = ['1WO', 'ABX', 'ADH', 'ALX']

def daily_price_historical(symbol, comparison_symbol, all_data=False, limit=1, aggregate=1, exchange=''):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate)
    if exchange:
        url += '&e={}'.format(exchange)
    if all_data:
        url += '&allData=true'
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df.drop(df.index[-1], inplace=True)
    return df

d_price = []
lables_ind = []
for idx, x in enumerate(curr):
    try:
        close = daily_price_historical(x, 'JPY', exchange='CCCAGG').close
        d_price.append(close[0])
        lables_ind.append(x)
    except:
        pass

d_price = pd.DataFrame(d_price, columns=["0"])
d_price.index = lables_ind
print(d_price)
Output
0
1WO 2.6100
ADH 0.3360
ALX 0.4843

Parsing a JSON using specific key words using Python

I'm trying to parse a JSON of a site's stock.
The JSON: https://www.ssense.com/en-us/men/sneakers.json
So I want to take some keywords from the user. Then I want to parse the JSON using these keywords to find the name of the item and (in this specific case) return the ID, SKU and the URL.
So for example:
If I input "Black Fennec", I want to parse the JSON and find the ID, SKU, and URL of the Black Fennec Sneakers (which have an ID of 3297299, a SKU of 191422M237006, and a URL of /men/product/ps-paul-smith/black-fennec-sneakers/3297299).
I have never attempted doing anything like this. Based on some guides that show how to parse a JSON I started out with this:
r = requests.Session()
stock = r.get("https://www.ssense.com/en-us/men/sneakers.json", headers=headers)
obj_json_data = json.loads(stock.text)
However, I am now confused. How do I find the product based on the keywords, and how do I get the ID, URL, and SKU of it?
There are a number of ways to handle the output; I'm not sure what you want to do with it, but this should get you going.
EDIT 1:
import requests

r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']

keyword = input('Enter a keyword: ')

for product in products:
    if keyword.upper() in product['name'].upper():
        name = product['name']
        id_var = product['id']
        sku = product['sku']
        url = product['url']
        print('Product: %s\nID: %s\nSKU: %s\nURL: %s' % (name, id_var, sku, url))
        # if you only want to return the first match, uncomment next line
        #break
I also have it set up to store the results in a dataframe and/or a list, just to give some options of where to go with it.
import requests
import pandas as pd

r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']

keyword = input('Enter a keyword: ')

products_found = []
results = pd.DataFrame()
for product in products:
    if keyword.upper() in product['name'].upper():
        name = product['name']
        id_var = product['id']
        sku = product['sku']
        url = product['url']
        temp_df = pd.DataFrame([[name, id_var, sku, url]], columns=['name', 'id', 'sku', 'url'])
        results = results.append(temp_df)
        products_found.append(name)  # list.append works in place; reassigning its return value would set this to None
        print('Product: %s\nID: %s\nSKU: %s\nURL: %s' % (name, id_var, sku, url))

if products_found == []:
    print('Nothing found')
EDIT 2: Here is another way to do it, by converting the JSON to a dataframe and then filtering to those rows that have the keyword in the name (this is actually a better solution in my opinion):
import requests
import pandas as pd
from pandas.io.json import json_normalize

r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
products_df = json_normalize(products)

keyword = input('Enter a keyword: ')

products_found = []
results = pd.DataFrame()
results = products_df[products_df['name'].str.contains(keyword, case=False)]
#print (results[['name', 'id', 'sku', 'url']])
products_found = list(results['name'])

if products_found == []:
    print('Nothing found')
else:
    print('Found: ' + str(products_found))
