I have brainstormed a lot of possibilities and created a bulk API call so I can import some products into my store. It works fine and the products are imported correctly; however, I have trouble saving the bad responses to a CSV file.
Maybe I am doing something wrong or my indentation is not correct. Please point me in the right direction, or offer any advice on how to avoid similar mistakes in the future.
This is the code:
import json

import pandas as pd
import requests

df = pd.read_csv('edited_csv.csv')
bad_responses_list = []

for i in range(len(df)):
    endpoint = f"{base_url}/products"
    data = {
        "product_id": int(df['product_id'][i]),
        "title": df['title'][i],
        "discount_type": "percentage",
        "discount": 5,
    }
    response = requests.post(endpoint, json.dumps(data), headers=headers)
    status_code = response.status_code
    if status_code != 200 or status_code != 201:
        bad_responses_list.append([df['product_id'][i], response.status_code])

df_bad_responses = pd.DataFrame(bad_responses_list, columns=['product_id', 'status_code'])
df_bad_responses.to_csv('products_with_bad_responses.csv')
Now, when I run this, it creates a CSV with both good and bad responses, something like this:
product_id, status_code
7262783, 201
9458389, 201
0493788, 422
7273628, 422
7263728, 201
Thank you in advance!
The or in this line:
if status_code != 200 or status_code != 201:
needs to be an and: every status code is unequal to at least one of 200 and 201, so with or the condition is always true and every response gets appended.
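For example, the corrected check can be written either of these ways; both append only the genuinely bad responses:

if status_code != 200 and status_code != 201:
    bad_responses_list.append([df['product_id'][i], response.status_code])

# or, more compactly
if status_code not in (200, 201):
    bad_responses_list.append([df['product_id'][i], response.status_code])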
I'm brand new to Python and APIs as well.
I'm trying to use an endpoint we have at work.
We have an API we use a lot, and we also have a UI, but through the UI we can only extract 10,000 records at a time.
There is no such limit on the API.
I have found a small piece of code, but I need to add a nextPageToken.
My code looks like this:
import requests

login_url = 'https://api.ubsend.io/v1/auth/login'
username = 'xxxxx'
password = 'xxxxx'
omitClaims = "true"

session = requests.Session()
session.headers['Accept'] = "application/json; charset=UTF-8"

response = session.post(
    login_url,
    json={'username': username, 'password': password},
    headers={'VERSION': '3'},
)
response.raise_for_status()

response_data = response.json()
print(response_data)
This gives me the AccessToken.
Then I call:
getevents = 'https://api.ubsend.io/v1/reporting/shipments?'
data = {'client_id': 13490, 'created_after': '2020-05-01T00:00', 'created_before': '2021-05-02T00:00'}
req = requests.models.PreparedRequest()  # used only to build the query string
req.prepare_url(getevents, data)
events = requests.get(req.url, headers={'Authorization': 'Bearer ' + response_data['accessToken'], 'Content-Type': 'application/json'})
events.json()
Which returns:
'nextPageToken': 'NjA4ZDc3YzNkMjBjODgyYjBhMWVkMTVkLDE2MTk4ODM5NzA3MDE='}
So I want to loop my script until nextPageToken is blank.
Any thoughts?
Edit: thanks for the update. I think this might be the solution we're looking for. You might have to do some poking around to figure out exactly what the name of the page_token URL parameter should be.
has_next = True
getevents = 'https://api.ubsend.io/v1/reporting/shipments?'
token = None
req = requests.models.PreparedRequest()  # used only to build the query string

while has_next:
    data = {'client_id': 13490, 'created_after': '2020-05-01T00:00', 'created_before': '2021-05-02T00:00'}
    if token:
        # I don't know the proper name for this URL parameter.
        data['page_token'] = token
    req.prepare_url(getevents, data)
    events = requests.get(req.url, headers={'Authorization': 'Bearer ' + response_data['accessToken'], 'Content-Type': 'application/json'})
    token = events.json().get('nextPageToken')
    if not token:
        has_next = False
I made a slight typo. It should be events.json().get('nextPageToken') I believe.
Let me know if this works.
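If the goal is to pull every record rather than just walk the pages, the same loop can also accumulate each page's results. A minimal sketch, assuming each page returns its records under a key such as 'data' (the actual key name is a guess, check the real response body):

all_records = []
token = None

while True:
    data = {'client_id': 13490, 'created_after': '2020-05-01T00:00', 'created_before': '2021-05-02T00:00'}
    if token:
        data['page_token'] = token  # the parameter name is still an assumption
    req.prepare_url(getevents, data)
    events = requests.get(req.url, headers={'Authorization': 'Bearer ' + response_data['accessToken'], 'Content-Type': 'application/json'})
    page = events.json()
    all_records.extend(page.get('data', []))  # 'data' is a guess at the records key
    token = page.get('nextPageToken')
    if not token:
        break

print(len(all_records), 'records collected')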
I need to get a filtered sample of the Twitter stream.
I'm using tweepy.
I checked the functions of the Stream class for getting a sample stream and for filtering,
but I didn't catch how I should call them. Should it be
stream.filter(track=['']).sample()
stream.sample().filter(track=[''])
or each one on its own line, or what?
And if you have another idea of how to get a sample stream based on keyword filters, please help.
Thanks in advance.
Twitter v2 APIs include an endpoint for random sampling and an endpoint for filtered tweets.
The latter allows specifying a random sample percentage in a query (for example, sample:10 will return a random 10% sample).
Note that the v2 APIs are still new and at the moment have a cap of 500k tweets per month.
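In rule form, that might look like the following (a sketch only; whether the sample: operator is available depends on your API access level, and the tag text is illustrative):

# hypothetical filtered-stream rule combining keyword, media and sampling operators
sample_rules = [
    {"value": "(cat OR dog) has:images sample:10", "tag": "pet pictures, 10% sample"},
]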
As an example for the latter, the following code (modified version of this, see this doc) will collect streaming data with cat or dog tags and store it in a json file for every 100 tweets. (Note: this does not include the random sampling query.)
import requests
import os
import json
import pandas as pd

# To set your environment variables in your terminal run the following line:
# export 'BEARER_TOKEN'='<your_bearer_token>'

data = []
counter = 0

def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

def get_rules(headers, bearer_token):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", headers=headers
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

def delete_all_rules(headers, bearer_token, rules):
    if rules is None or "data" not in rules:
        return None
    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

def set_rules(headers, delete, bearer_token):
    # You can adjust the rules if needed
    sample_rules = [
        {"value": "dog has:images", "tag": "dog pictures"},
        {"value": "cat has:images -grumpy", "tag": "cat pictures"},
    ]
    payload = {"add": sample_rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))

def get_stream(headers, set, bearer_token):
    global data, counter
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream", headers=headers, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))
            data.append(json_response['data'])
            if len(data) % 100 == 0:
                print('storing data')
                pd.read_json(json.dumps(data), orient='records').to_json(f'tw_example_{counter}.json', orient='records')
                data = []
                counter += 1

def main():
    bearer_token = os.environ.get("BEARER_TOKEN")
    headers = create_headers(bearer_token)
    rules = get_rules(headers, bearer_token)
    delete = delete_all_rules(headers, bearer_token, rules)
    set = set_rules(headers, delete, bearer_token)
    get_stream(headers, set, bearer_token)

if __name__ == "__main__":
    main()
Then, load the data into a pandas DataFrame as
df = pd.read_json('tw_example_0.json', orient='records')
(the code above writes one tw_example_{counter}.json file per 100 tweets).
I'd suggest reading the API documentation for tweepy. Here you can see how to filter the stream the way you want to.
From reading other code snippets, I believe it should be done like this:
stream.filter(track=['Keyword'])
print(stream.sample())
As I understand it, tweepy uses the Twitter v1.1 APIs, which have separate endpoints for sampling and filtering tweets in real time.
Twitter API references:
v1 sample-realtime
v1 filter-realtime
Approach 1: one can get filtered stream data using stream.filter(track=['Keyword1', 'keyword2']) etc. and then sample records from the collected data.
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # do data processing and storing here
        pass
See examples like https://www.storybench.org/how-to-collect-tweets-from-the-twitter-streaming-api-using-python/ and "Ignoring Retweets When Streaming Twitter Tweets".
Approach 2: one can write a program that starts and stops streaming at random time intervals (for example, randomly sampling a 3-minute interval out of every 15 minutes).
Approach 3: one can instead use the sampling API to collect data and then filter by keyword to store only the relevant data (a minimal sketch of this follows).
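Here is a minimal sketch of Approach 3, assuming tweepy 3.x (which still provides StreamListener) and placeholder credentials; the keyword list and class name are purely illustrative:

import tweepy

KEYWORDS = ['cat', 'dog']  # illustrative keywords to filter the sampled stream by

class SampleFilterListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only sampled tweets whose text mentions one of the keywords.
        text = status.text.lower()
        if any(kw in text for kw in KEYWORDS):
            print(status.id_str, text)  # store to a file/DB here instead of printing

    def on_error(self, status_code):
        # Returning False disconnects the stream (e.g. on a 420 rate-limit error).
        return False

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

stream = tweepy.Stream(auth=auth, listener=SampleFilterListener())
stream.sample(languages=['en'])  # unfiltered sample; the keyword filter happens in on_status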
With help from Stack Overflow, I was able to come up with the scraper below. The code returns a list of part numbers and their corresponding prices.
part1 price1
part2 price2
...
...
partn pricen
However, the website seems to only allow 200 results per request; when I raise the limit above 200 I get the error: "raise JSONDecodeError("Expecting value", s, err.value) from None JSONDecodeError: Expecting value".
I just want to know if there's a way to avoid this error. If not, I can increase start by 200 each time, but since I would easily have 100k+ items that won't be very efficient. Is there a way I can loop over the limit and start parameters?
Please see the code below, any help appreciated!
import requests
# import pprint  # to format data on screen `pprint.pprint()`
import pandas as pd

# --- functions ---

def get_data(query):
    """Get data from server"""
    payload = {
        # "facets": [{
        #     "name": "OEM",
        #     "value": "GE%20Healthcare"
        # }],
        "facets": [],
        "facilityId": 38451,
        "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
        "limit": 200,
        "query": query,
        "referer": "/catalog/Service",
        "start": 0,
        # "urlParams": [{
        #     "name": "OEM",
        #     "value": "GE Healthcare"
        # }],
        "urlParams": []
    }
    r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
    data = r.json()
    return data

all_queries = ['GE Healthcare']

for query in all_queries:
    #print('\n--- QUERY:', query, '---\n')
    data = get_data(query)

    Part_Num = []
    Vendor_Item_Num = []
    price = []

    for item in data['products']:
        if not item['options']:
            Part_Num.append([])
            Vendor_Item_Num.append([])
            price.append([])
        else:
            all_prices = [option['price'] for option in item['options']]
            all_vendor = [option['price'] for option in item['options']]
            all_part_num = item['partNumber']

            Part_Num.append(all_part_num)
            Vendor_Item_Num.append(all_vendor)
            price.append(all_prices)

    list_of_dataframes = [pd.DataFrame(Part_Num), pd.DataFrame(price)]
    pd.concat(list_of_dataframes, axis=1).to_csv(r'C:\Users\212677036\Documents\output7.csv')
You should always check the status_code to confirm your request was successful; the API returns HTTP 500 when limit is greater than 200. You need to study the documentation of the API: many APIs limit requests per second and maximum request size so they can maintain a reliable service.
The json() method will fail if the HTTP request was not successful.
You can get data in batches. Sample code below; I stop at 2,000 records because I don't want to stay in the loop for 500+ iterations. You could consider using threading so it's not so sequential (see the sketch after the code).
All of this is covered in other Stack Overflow questions about this API (prodasf-vip).
import requests
import pandas as pd

query = 'GE Healthcare'
payload = {
    "facets": [],
    "facilityId": 38451,
    "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
    "limit": 200,
    "query": query,
    "referer": "/catalog/Service",
    "start": 0,
    "urlParams": []
}

r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
if r.status_code == 200:
    js = r.json()
    df = pd.json_normalize(js["products"])
    while len(df) < js["totalResults"] and len(df) < 2000:
        payload["start"] += 200
        r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
        if r.status_code == 200:
            df = pd.concat([df, pd.json_normalize(r.json()["products"])])
        else:
            break

    print(f"want: {js['totalResults']} got: {len(df)}")
    df
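If you do want to parallelise the batches as suggested above, here is a minimal sketch using concurrent.futures; it assumes the API tolerates a few concurrent requests, reuses the payload dict built above, and keeps the same 2,000-record cap:

import copy
import concurrent.futures

URL = 'https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search'

def fetch_batch(start):
    """Fetch one 200-record page; return an empty frame on a bad status."""
    batch_payload = copy.deepcopy(payload)  # payload as defined above
    batch_payload["start"] = start
    resp = requests.post(URL, json=batch_payload)
    if resp.status_code == 200:
        return pd.json_normalize(resp.json()["products"])
    return pd.DataFrame()

# The first request tells us the total; the remaining offsets are fetched in parallel.
payload["start"] = 0
first = requests.post(URL, json=payload)
first.raise_for_status()
total = first.json()["totalResults"]
frames = [pd.json_normalize(first.json()["products"])]

starts = range(200, min(total, 2000), 200)
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    frames.extend(pool.map(fetch_batch, starts))

df_all = pd.concat(frames, ignore_index=True)
print(f"want: {total} got: {len(df_all)}")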
My main function:
import requests as rq

def get_data():
    try:
        response = send_request_to_get_data()
        # will get one dict output that looks like:
        # {
        #     "data": ['some datas.....'],
        #     "next": "api/data?top=100&skip=200",
        # }
        if response.status_code == 200:
            if response.json().get("next"):
                first_paginated_response = get_paginated_data(response.json().get("next"))
                if response.status_code == 200:
                    if first_paginated_response.json().get("next"):
                        second_paginated_response = get_paginated_data(response.json().get("next"))
                        if response.status_code == 200:
                            if second_paginated_response.json().get("next"):
                                print('again...again....again....again...again')
    except Exception:
        pass  # error handling omitted here

def send_request_to_get_data():
    return rq.get('https://example.com')

def get_paginated_data(paginated):
    url = "https://example.com/{next}".format(next=paginated)
    return rq.get(url)
If "next" key is in response, i need to send another request for pagination api, but my if statement looks weird.
What is the good approach for this?
You could use a while loop and save the data like this:
response = send_request_to_get_data()
payload = response.json()
data = payload['data']

while response.status_code == 200 and payload.get("next"):
    response = get_paginated_data(payload.get("next"))
    payload = response.json()
    data.extend(payload['data'])
A Python Requests API client has a function that needs to be re-executed if it runs unsuccessfully.
class Kitten(BaseClient):
    def create(self, **params):
        uri = self.BASE_URL
        data = dict(**(params or {}))
        r = self.client.post(uri, data=json.dumps(data))
        return r
If run with:
api = Kitten()
data = {"email": "bill#dow.com", "currency": "USD", "country": "US" }
r = api.create(**data)
The issue is that whenever you run it, the first request always comes back as a GET, even when it is a POST: the first time the POST is sent, it returns a GET list of entries.
The later requests (second and onward) to api.create(**data) return the newly created entries as they should.
There is a status_code for the GET and the POST:
# GET
r.status_code == 200
# POST
r.status_code == 201
What would be a better, more Pythonic way to re-execute the request while the status_code is 200, until a valid 201 is returned?
If you know for sure that the second POST will always return your expected value, you can use a ternary expression to perform the check a second time:
class Kitten(BaseClient):
    def create(self, **params):
        uri = self.BASE_URL
        data = dict(**(params or {}))
        r = self._get_response(uri, data)
        return r if r.status_code == 201 else self._get_response(uri, data)

    def _get_response(self, uri, data):
        return self.client.post(uri, data=json.dumps(data))
Otherwise you can put it in a while loop whose condition is that the status code is not yet 201, for example:
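A minimal sketch of that variant, with a retry cap so a stubborn 200 can never loop forever (the helper name create_until_created and the max_retries limit are illustrative):

def create_until_created(api, data, max_retries=5):
    """Re-send the POST until a 201 comes back, or give up after max_retries attempts."""
    r = api.create(**data)
    attempts = 1
    while r.status_code != 201 and attempts < max_retries:
        r = api.create(**data)
        attempts += 1
    return r

# usage
r = create_until_created(Kitten(), {"email": "bill#dow.com", "currency": "USD", "country": "US"})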