Limiting the length of a response with python requests - python

I'm setting up a view that saves data from an API into my database on a button click, and I'm having a hard time figuring out how to limit the size of the response for product descriptions in the following way:
If the length of a description is greater than 2000, trim characters from the end of it until it fits within the 2000-character limit, but don't remove the product from the request entirely.
As of now, what I've been able to achieve is to skip a product's info completely whenever the length is greater than 2000, as you can see below.
My Django view function:
def api_data(request):
    if request.GET.get('mybtn'):  # to improve, == 'something':
        resp_1 = requests.get(
            "https://www.test-headout.com/api/public/v1/product/listing/list-by/city?language=fr&cityCode=PARIS&limit=5000&currencyCode=CAD",
            headers={
                "Headout-Auth": HEADOUT_TEST_API_KEY
            })
        resp_1_data = resp_1.json()

        base_url_2 = "https://www.test-headout.com/api/public/v1/product/get/"

        for item in resp_1_data['items']:
            # concat ID to the URL string
            url = '{}{}'.format(base_url_2, item['id'] + '?language=fr')

            # make the HTTP request
            resp_2 = requests.get(
                url,
                headers={
                    "Headout-Auth": HEADOUT_TEST_API_KEY
                })
            resp_2_data = resp_2.json()

            if len(resp_2_data['contentListHtml'][0]['html']) < 2000:  # represent the description of a product
                Product.objects.get_or_create(
                    title=item['name'],
                    destination=item['city']['name'],
                    description=resp_2_data['contentListHtml'][0]['html'],
                    link=item['canonicalUrl'],
                    image=item['image']['url']
                )

    return render(request, "form.html")
But I drop far too many rows by doing this, so I would like to know how I can fix it.
Please help.

Instead of using a conditional statement, you can use the slice operator to enforce the character limit on the description. Using this approach, you can refactor the relevant part of your code:
resp_2_data = resp_2.json()

Product.objects.get_or_create(
    title=item['name'],
    destination=item['city']['name'],
    description=resp_2_data['contentListHtml'][0]['html'][0:2000],
    ....
)
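Slicing never raises an error when the string is already shorter than the limit, so every product gets saved and only over-long descriptions get trimmed. A quick self-contained illustration with made-up strings:

short_html = "<p>Short description</p>"
long_html = "<p>" + "x" * 5000 + "</p>"

print(len(short_html[0:2000]))  # 24 - shorter strings are returned unchanged
print(len(long_html[0:2000]))   # 2000 - longer strings are cut off at the limit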


Reading JSON data in Python using Pagination, max records 100

I am trying to extract data from a REST API using Python and put it into one neat JSON file, and I'm having difficulty. The data is rather lengthy, with a total of nearly 4,000 records, but the maximum number of records the API allows per request is 100.
I've tried using some other examples to get through the code, and so far this is what I'm using (censoring the API URL and auth key, for the sake of confidentiality):
import requests
import json
from requests.structures import CaseInsensitiveDict

url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"

headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"

resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")

vendors = []
new_results = True
page = 1

while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("results", [])
    vendors.extend(centiblock)
    page += 1

full_directory = json.dumps(vendors, indent=4)
print(full_directory)
For the life of me, I cannot figure out why it isn't working. The output keeps coming out as just:
[
    "records"
]
If I play around with the print statement at the end, I can get it to print centiblock (so named for being a block of 100 records at a time) just fine - it gives me 100 records as unformatted text. However, if I try printing vendors at the end, the output is:
['records']
...which leads me to guess that somehow, the vendors array is not getting filled with the data. I suspect that I need to modify the get request where I define new_results, but I'm not sure how.
For reference, this is a censored look at how the json data begins, when I format and print out one centiblock:
{
    "records": [
        {
            "id": "XXX",
            "createdTime": "2018-10-15T19:23:59.000Z",
            "fields": {
                "Vendor Name": "XXX",
                "Main Phone": "XXX",
                "Street": "XXX",
Can anyone see where I'm going wrong?
Thanks in advance!
When you extend vendors with centiblock, you are giving a dict to the extend function. extend expects an Iterable, so that works, but when you iterate over a Python dict you only iterate over its keys - in this case, ['records'].
Note as well that your loop condition becomes falsy after the first iteration, because centiblock.get("results", []) returns [] (since "results" is not a key in the API's output), and [] is falsy.
Hence, to correct those errors, you need to pull the correct field from the API response into new_results, and extend vendors with new_results, which is itself a list. Note that on the last iteration new_results will be the empty list, so vendors won't be extended with any empty value and will contain exactly what you need.
This should look like:
import requests
import json
from requests.structures import CaseInsensitiveDict

url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"

headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"

resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")

vendors = []
new_results = [None]  # non-empty placeholder so the loop runs at least once
page = 1

while len(new_results) > 0:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("records", [])
    vendors.extend(new_results)
    page += 1

full_directory = json.dumps(vendors, indent=4)
print(full_directory)
Note that I replaced while new_results with while len(new_results) > 0, initialising new_results to a non-empty placeholder list so that len() works on the first pass; the two conditions are otherwise equivalent here, but the explicit length check reads more clearly.
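To see why the original loop only ever collected the string 'records', here is a self-contained illustration with made-up data:

centiblock = {"records": [{"id": "XXX"}, {"id": "YYY"}]}

vendors = []
vendors.extend(centiblock)  # iterating a dict yields its keys
print(vendors)              # ['records']

vendors = []
vendors.extend(centiblock.get("records", []))  # iterate the list of records instead
print(vendors)              # [{'id': 'XXX'}, {'id': 'YYY'}]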

Scraping text from a header and Class tag

The code below errors out when trying to execute this line
"RptTime = TimeTable[0].xpath('//text()')"
Not sure why: I can see that TimeTable has a value in my variable window, but the HtmlElement TimeTable[0] has no value, even though content.cssselect returns a value at the time of assignment. Why then would I get the error "list index out of range"? This tells me that the element is empty. I am trying to get the year/month value from that field.
import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})
    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslinks = [
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

df = pd.DataFrame()
df2 = pd.DataFrame()

for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)

    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    TimeTable = content.cssselect('td[headers="view-dlf-2-report-period-table-column"]')[0]

    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    RptTime = TimeTable.xpath('//text()')

    dfl = pd.DataFrame(headers, columns=['links'])
    dft = pd.DataFrame(RptTime, columns=['ReportTime'])
    df = df.append(dfl)
    df2 = df.append(dft)
Error
src\lxml\etree.pyx in lxml.etree._Element.__getitem__()
IndexError: list index out of range
Look carefully at your last two lines: df = df.append(dfl) and df2 = df.append(dft). The second one appends dft to df rather than to df2, which is presumably not what you intended (it should probably be df2 = df2.append(dft)). Also note that pandas' DataFrame.append returns a new DataFrame rather than modifying the caller in place, so assigning the result back, as you do, is the correct pattern. However, none of this is the cause of your traceback; I just thought it worth pointing out.
The actual error comes from indexing [0] one time too many. It would help to see exactly what content.cssselect(...) returns, but note that it returns a list of matching elements, and you already take [0] when you assign linkTable and TimeTable, so each of those is a single td element. Writing linkTable[0] then asks lxml for that element's first child element, and because the td has no child elements, lxml raises "IndexError: list index out of range".
It is quite likely that you indexed [0] twice when you only meant to do it once.
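To illustrate the double indexing with a minimal, self-contained sketch (made-up HTML, and it assumes lxml's cssselect support is installed):

from lxml import html

doc = html.fromstring('<table><tr><td headers="x">Jan 2023</td></tr></table>')
cells = doc.cssselect('td[headers="x"]')  # list of matching elements

TimeTable = cells[0]      # the single matching td element
print(TimeTable.text)     # 'Jan 2023'

try:
    print(TimeTable[0])   # asks for the td's first *child element*
except IndexError as exc:
    print(exc)            # list index out of range - the td has text but no child elements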

IndexError: list index out of range (on Reddit data crawler)

The code below is expected to run without issues.
Solution to Reddit data:
import requests
import re
import praw
from datetime import date
import csv
import pandas as pd
import time
import sys

class Crawler(object):
    '''
    basic_url is the reddit site.
    headers is for requests.get method
    REX is to find submission ids.
    '''

    def __init__(self, subreddit="apple"):
        '''
        Initialize a Crawler object.
        subreddit is the topic you want to parse. default is r"apple"
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
        submission_ids save all the ids of submission you will parse.
        reddit is an object created using praw API. Please check it before you use.
        '''
        self.basic_url = "https://www.reddit.com"
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
        self.REX = re.compile(r"<div class=\" thing id-t3_[\w]+")
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(client_id="your_id", client_secret="your_secret", user_agent="subreddit_comments_crawler")

    def get_submission_ids(self, pages=2):
        '''
        Collect all ids of submissions..
        One page has 25 submissions.
        page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
        id(after) is the last submission from last page.
        '''
        # This is page url.
        url = self.basic_url + "/r/" + self.subreddit

        if pages <= 0:
            return []

        text = requests.get(url, headers=self.headers).text
        ids = self.REX.findall(text)
        ids = list(map(lambda x: x[-6:], ids))
        if pages == 1:
            self.submission_ids = ids
            return ids

        count = 0
        after = ids[-1]
        for i in range(1, pages):
            count += 25
            temp_url = self.basic_url + "/r/" + self.subreddit + "?count=" + str(count) + "&after=t3_" + ids[-1]
            text = requests.get(temp_url, headers=self.headers).text
            temp_list = self.REX.findall(text)
            temp_list = list(map(lambda x: x[-6:], temp_list))
            ids += temp_list
            if count % 100 == 0:
                time.sleep(60)
        self.submission_ids = ids
        return ids

    def get_comments(self, submission):
        '''
        Submission is an object created using praw API.
        '''
        # Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append((each.id, each.link_id[3:], each.author.name, date.fromtimestamp(each.created_utc).isoformat(), each.score, each.body))
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                # print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        '''
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv files.
        Note: You can link them with submission_id.
        Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the default time span in this crawler.
        '''
        print("Start to collect all submission ids...")
        self.get_submission_ids(pages)
        print("Start to collect comments...This may cost a long time depending on # of pages.")
        submission_url = self.basic_url + "/r/" + self.subreddit + "/comments/"
        comments = []
        submissions = []
        count = 0
        for idx in self.submission_ids:
            temp_url = submission_url + idx
            submission = self.reddit.submission(url=temp_url)
            submissions.append((submission.name[3:], submission.num_comments, submission.score, submission.subreddit_name_prefixed, date.fromtimestamp(submission.created_utc).isoformat(), submission.title, submission.selftext))
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            count += 1
            print(str(count) + " submissions have got...")
            if count % 50 == 0:
                time.sleep(60)

        comments_fieldnames = ["comment_id", "submission_id", "author_name", "post_time", "comment_score", "text"]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")

        submissions_fieldnames = ["submission_id", "num_of_comments", "submission_score", "submission_subreddit", "post_date", "submission_title", "text"]
        df_submission = pd.DataFrame(submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")
        return df_comments

if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()
    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))
but I got:
(base) UserAir:scrape_reddit user$ python reddit_crawler.py apple 2
Start to collect all submission ids...
Traceback (most recent call last):
  File "reddit_crawler.py", line 127, in <module>
    c.save_comments_submissions(int(pages))
  File "reddit_crawler.py", line 94, in save_comments_submissions
    self.get_submission_ids(pages)
  File "reddit_crawler.py", line 54, in get_submission_ids
    after = ids[-1]
IndexError: list index out of range
Erik's answer diagnoses the specific cause of this error, but more broadly I think this is caused by you not using PRAW to its fullest potential. Your script imports requests and performs a lot of manual requests that PRAW has methods for already. The whole point of PRAW is to prevent you from having to write these requests that do things such as paginate a listing, so I recommend you take advantage of that.
As an example, your get_submission_ids function (which scrapes the web version of Reddit and handles paginating) could be replaced by just
def get_submission_ids(self, pages=2):
    return [
        submission.id
        for submission in self.reddit.subreddit(self.subreddit).hot(
            limit=25 * pages
        )
    ]
because the .hot() function does everything you tried to do by hand.
I'm going to go one step further here and have the function just return a list of Submission objects, because the rest of your code ends up doing things that would be better done by interacting with the PRAW Submission object. Here's that code (I renamed the function to reflect its updated purpose):
def get_submissions(self, pages=2):
    return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))
(I've updated this function to just return its result, as your version both returns the value and sets it as self.submission_ids, unless pages is 0. That felt quite inconsistent, so I made it just return the value.)
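For instance, a hypothetical usage of that method (assuming the Crawler was constructed with valid Reddit API credentials):

crawler = Crawler("apple")
submissions = crawler.get_submissions(pages=2)  # up to 50 Submission objects from r/apple's hot listing
print(len(submissions))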
Your get_comments function looks good.
The save_comments_submissions function, like get_submission_ids, does a lot of manual work that PRAW can handle. You construct a temp_url that has the full URL of a post, and then use that to make a PRAW Submission object, but we can replace that with directly using the one returned by get_submissions. You also have some calls to time.sleep() which I removed because PRAW will automatically sleep the appropriate amount for you. Lastly, I removed the return value of this function because the point of the function is to save data to disk, not to return it to anywhere else, and the rest of your script doesn't use the return value. Here's the updated version of that function:
def save_comments_submissions(self, pages):
    """
    1. Save all the ids of submissions.
    2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
    3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
    4. Separately, save them to two csv files.
    Note: You can link them with submission_id.
    Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the default time span in this crawler.
    """
    print("Start to collect all submission ids...")
    submissions = self.get_submissions(pages)
    print(
        "Start to collect comments...This may cost a long time depending on # of pages."
    )
    comments = []
    pandas_submissions = []
    for count, submission in enumerate(submissions):
        pandas_submissions.append(
            (
                submission.name[3:],
                submission.num_comments,
                submission.score,
                submission.subreddit_name_prefixed,
                date.fromtimestamp(submission.created_utc).isoformat(),
                submission.title,
                submission.selftext,
            )
        )
        temp_comments = self.get_comments(submission)
        comments += temp_comments
        print(str(count) + " submissions have got...")

    comments_fieldnames = [
        "comment_id",
        "submission_id",
        "author_name",
        "post_time",
        "comment_score",
        "text",
    ]
    df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
    df_comments.to_csv("comments.csv")

    submissions_fieldnames = [
        "submission_id",
        "num_of_comments",
        "submission_score",
        "submission_subreddit",
        "post_date",
        "submission_title",
        "text",
    ]
    df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
    df_submission.to_csv("submissions.csv")
Here's an updated version of the whole script that uses PRAW fully:
from datetime import date
import sys

import pandas as pd
import praw


class Crawler:
    """
    basic_url is the reddit site.
    headers is for requests.get method
    REX is to find submission ids.
    """

    def __init__(self, subreddit="apple"):
        """
        Initialize a Crawler object.
        subreddit is the topic you want to parse. default is r"apple"
        basic_url is the reddit site.
        headers is for requests.get method
        REX is to find submission ids.
        submission_ids save all the ids of submission you will parse.
        reddit is an object created using praw API. Please check it before you use.
        """
        self.subreddit = subreddit
        self.submission_ids = []
        self.reddit = praw.Reddit(
            client_id="your_id",
            client_secret="your_secret",
            user_agent="subreddit_comments_crawler",
        )

    def get_submissions(self, pages=2):
        """
        Collect all submissions..
        One page has 25 submissions.
        page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
        id(after) is the last submission from last page.
        """
        return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))

    def get_comments(self, submission):
        """
        Submission is an object created using praw API.
        """
        # Remove all "more comments".
        submission.comments.replace_more(limit=None)
        comments = []
        for each in submission.comments.list():
            try:
                comments.append(
                    (
                        each.id,
                        each.link_id[3:],
                        each.author.name,
                        date.fromtimestamp(each.created_utc).isoformat(),
                        each.score,
                        each.body,
                    )
                )
            except AttributeError as e:  # Some comments are deleted, we cannot access them.
                # print(each.link_id, e)
                continue
        return comments

    def save_comments_submissions(self, pages):
        """
        1. Save all the ids of submissions.
        2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
        3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
        4. Separately, save them to two csv files.
        Note: You can link them with submission_id.
        Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the default time span in this crawler.
        """
        print("Start to collect all submission ids...")
        submissions = self.get_submissions(pages)
        print(
            "Start to collect comments...This may cost a long time depending on # of pages."
        )
        comments = []
        pandas_submissions = []
        for count, submission in enumerate(submissions):
            pandas_submissions.append(
                (
                    submission.name[3:],
                    submission.num_comments,
                    submission.score,
                    submission.subreddit_name_prefixed,
                    date.fromtimestamp(submission.created_utc).isoformat(),
                    submission.title,
                    submission.selftext,
                )
            )
            temp_comments = self.get_comments(submission)
            comments += temp_comments
            print(str(count) + " submissions have got...")

        comments_fieldnames = [
            "comment_id",
            "submission_id",
            "author_name",
            "post_time",
            "comment_score",
            "text",
        ]
        df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
        df_comments.to_csv("comments.csv")

        submissions_fieldnames = [
            "submission_id",
            "num_of_comments",
            "submission_score",
            "submission_subreddit",
            "post_date",
            "submission_title",
            "text",
        ]
        df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
        df_submission.to_csv("submissions.csv")


if __name__ == "__main__":
    args = sys.argv[1:]
    if len(args) != 2:
        print("Wrong number of args...")
        exit()
    subreddit, pages = args
    c = Crawler(subreddit)
    c.save_comments_submissions(int(pages))
I realize that my answer here gets into Code Review territory, but I hope that this answer is helpful for understanding some of the things PRAW can do. Your "list index out of range" error would have been avoided by using the pre-existing library code, so I do consider this to be a solution to your problem.
When my_list[-1] throws an IndexError, it means that my_list is empty:
>>> ids = []
>>> ids[-1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> ids = ['1']
>>> ids[-1]
'1'
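If you also want the original requests-based crawler to fail more gracefully when the regex matches nothing (for example because Reddit served different markup or blocked the request), you can guard against the empty list before touching ids[-1]. A minimal sketch of the idea, reusing the question's regex; fetch_submission_ids is a hypothetical helper, not part of either answer:

import re
import requests

REX = re.compile(r"<div class=\" thing id-t3_[\w]+")

def fetch_submission_ids(subreddit, headers):
    """Return the ids found on the subreddit front page, or [] if nothing matched."""
    text = requests.get("https://www.reddit.com/r/" + subreddit, headers=headers).text
    ids = [match[-6:] for match in REX.findall(text)]
    if not ids:
        # Nothing matched: bail out early instead of letting ids[-1] raise IndexError later.
        print("No submission ids found for r/" + subreddit)
    return ids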

Determine the rate limit for requests

I have a question about rate limits.
I take data from a CSV, feed it into the query, and store the output in a list.
I get an error because I make too many requests at once
(I can only make 20 requests per second). How can I respect the rate limit?
import requests
import pandas as pd

df = pd.read_csv("Data_1000.csv")

list = []

def requestSummonerData(summonerName, APIKey):
    URL = "https://euw1.api.riotgames.com/lol/summoner/v3/summoners/by-name/" + summonerName + "?api_key=" + APIKey
    response = requests.get(URL)
    return response.json()

def main():
    APIKey = (str)(input('Copy and paste your API Key here: '))
    for index, row in df.iterrows():
        summonerName = row['Player_Name']
        responseJSON = requestSummonerData(summonerName, APIKey)
        ID = responseJSON['accountId']
        ID = int(ID)
        list.insert(index, ID)
    df["accountId"] = list
If you already know you can only make 20 requests per second, you just need to work out how long to wait between requests:
Divide 1 second by 20, which gives 0.05. So you just need to sleep for 0.05 seconds between each request and you shouldn't hit the limit (maybe increase it a bit if you want to be safe).
import time at the top of your file and then call time.sleep(0.05) inside your for loop (you could also just do time.sleep(1/20)).
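A minimal sketch of that change applied to the question's loop (the names follow the question's code, and the 0.05 s delay assumes the 20 requests/second limit mentioned above):

import time

def main():
    APIKey = input('Copy and paste your API Key here: ')
    ids = []
    for index, row in df.iterrows():
        responseJSON = requestSummonerData(row['Player_Name'], APIKey)
        ids.append(int(responseJSON['accountId']))
        time.sleep(0.05)  # pause 1/20 of a second so at most ~20 requests go out per second
    df["accountId"] = ids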

Best method to obtain a list of the most recent N photos across all interesting Facebook albums?

I've put together this code to fetch data for the most recent MAX_PHOTOS photos from a given Facebook account (fb_id), excluding those in albums whose type is in EXCLUDED_ALBUM_TYPES. I think I'm done with this Graph API for now... Unless anyone can come up with a better way?
# itertools.chain, with parse_datetime from Django
def get_facebook_photos(self, fb_id):
    EXCLUDED_ALBUM_TYPES = ['profile', 'cover', 'wall', ]
    MAX_PHOTOS = 20

    albums = fb_client.get_object('{}/albums?fields=id,type'.format(fb_id)).get('data', [])
    photos = []
    album_ids = [a['id'] for a in albums if a['type'] not in EXCLUDED_ALBUM_TYPES]

    # No prefix for this path.
    photos_path = '/photos?ids={}&fields=source,link,created_time.order(reverse_chronological)'.format(
        ','.join(album_ids))
    photos_data = chain.from_iterable(photos['data'] for photos in fb_client.get_object(photos_path).values())

    for i, photo in enumerate(photos_data):
        if i == MAX_PHOTOS:
            break
        photos.append({
            'src': photo['source'],
            'link': photo['link'],
            'time': parse_datetime(photo['created_time'])
        })
    return photos
I suppose I could try limiting the number of albums retrieved. limit in the photos_path querystring sets a limit on the number of photo items per album, which isn't what I want. For now I'm relying on the default response pagination for performance, with a Django template fragment cache for the gathered images (they're displayed in a carousel). Admittedly, saying all that leads me to wonder why it hasn't simply been done on the frontend with JS...
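If limiting the number of albums does turn out to be worthwhile, here is a minimal sketch of that change (the cap of 25 is arbitrary, and this assumes the Graph API's standard limit parameter on the albums edge rather than anything specific to this client):

ALBUM_LIMIT = 25  # arbitrary cap on how many albums to consider

albums = fb_client.get_object(
    '{}/albums?fields=id,type&limit={}'.format(fb_id, ALBUM_LIMIT)
).get('data', [])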
