I'm using the code below, but I only get 50 results. It does not show the next 50 results; that is, it does not seem to use the nextPageToken. Am I doing something wrong, or does search().list_next() not work?
def youtube_search(options):
    youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                    developerKey=DEVELOPER_KEY)
    req = youtube.search().list(
        q=options.q,
        part="id,snippet",
        maxResults=50,
        channelId="my_channel_id",
        order="date",
        type="video"
    )
    while req:
        res = req.execute()
        for item in res["items"]:
            if item["id"]["kind"] == "youtube#video":
                video_id = item["id"]["videoId"]
                video_title = item["snippet"]["title"]
                video_date = item["snippet"]["publishedAt"]
                print("%s # %s # %s" % (video_id, video_title, video_date))
        req = youtube.search().list_next(req, res)
I'm searching in my channel using channelId, and I know I have more than 50 results.
maxResults is the number of results per page (I set it to 50, which is the maximum).
q=options.q is the query, a word to base the search results on ('dog', for example).
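For reference, this is how I understand the paging should work if I handle pageToken by hand (just a sketch to show my expectation, not the code I'm actually running):

params = dict(q=options.q, part="id,snippet", maxResults=50,
              channelId="my_channel_id", order="date", type="video")
page_token = None
while True:
    if page_token:
        params["pageToken"] = page_token          # request the next page
    res = youtube.search().list(**params).execute()
    for item in res["items"]:
        print(item["id"]["videoId"], item["snippet"]["title"])
    page_token = res.get("nextPageToken")          # absent on the last page
    if not page_token:
        break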
My task is to write a function to get the number of jobs for a given technology.
Note: The API gives a maximum of 50 jobs per page.
If you get 50 jobs on a page, it means there could be more job listings available.
So if you get 50 jobs on a page, you should make another API call for the next page to check for more jobs.
If you get fewer than 50 jobs on a page, you can take that as the final count.
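In other words, I understand the rule as a loop roughly like the following (just a sketch of the idea with a hypothetical helper name, not my working code):

import requests

baseurl = "https://jobs.github.com/positions.json"

def count_jobs(technology):                              # hypothetical helper, sketch only
    total, page = 0, 0
    while True:
        resp = requests.get(baseurl, params={'technology': technology, 'page': page})
        listings = resp.json() if resp.ok else []
        total += len(listings)
        if len(listings) < 50:                           # less than a full page: last page
            return total
        page += 1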
Following is my code
baseurl = "https://jobs.github.com/positions.json"
def get_number_of_jobs(technology):
number_of_jobs = 0
tech = technology
page= 0
PARAMS = {'technology':tech , 'page': page}
jobs=requests.get(url=baseurl,params = PARAMS )
if jobs.ok:
listings = jobs.json()
number_of_jobs=len(listings)
if number_of_jobs==50:
page= page+1
PARAMS = {'technology':tech , 'page': page}
jobs=requests.get(url=baseurl,params = PARAMS )
if jobs.ok:
listings2 = jobs.json()
number_of_jobs= number_of_jobs + len(listings2)
return technology,number_of_jobs
Now I cannot figure out how to do the pagination in this function. That is, how do I check whether there are more than 50 job postings for a specific technology, and if there are, run the request again to include those postings as well?
I print the output like this:
print(get_number_of_jobs('python'))
('python', 100)
Can someone please help?
Many thanks in advance!
Please let me know if this works:
import requests

baseurl = 'https://jobs.github.com/positions.json'
total_job = 0

def get_number_of_jobs(technology, page):
    global total_job
    PARAMS = {'technology': technology, 'page': page}
    jobs = requests.get(url=baseurl, params=PARAMS)
    total_job += len(jobs.json()) if jobs.ok else 0
    return len(jobs.json()) if jobs.ok else 0

def get_jobs(technology):
    page = 0
    while get_number_of_jobs(technology, page) >= 50:
        page += 1
    return total_job

print(get_jobs('python'))
import requests

baseurl = 'https://jobs.github.com/positions.json'

def get_number_of_jobs(technology):
    number_of_jobs = 0
    page = 0
    while True:
        payload = {"description": technology, "page": page}
        r = requests.get(baseurl, params=payload)
        if r.ok:
            data = r.json()
            number_of_jobs += len(data)   # accumulate the count across pages
            if len(data) >= 50:           # a full page: there may be more results
                page += 1
                continue
        break
    return technology, number_of_jobs
I'm using the code shown below to retrieve papers from arXiv. I want to retrieve papers that have the words "machine" and "learning" in the title. The number of papers is large, so I want to slice the retrieval by year (the published date).
How can I request records from 2020 and 2019 in search_query? Please note that I'm not interested in post-filtering.
import urllib.request
import urllib.parse
import time
import feedparser

# Base API query url
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
start = 0
total_results = 5000
results_per_iteration = 1000
wait_time = 3
papers = []

print('Searching arXiv for %s' % search_query)

for i in range(start, total_results, results_per_iteration):
    print("Results %i - %i" % (i, i + results_per_iteration))
    query = 'search_query=%s&start=%i&max_results=%i' % (search_query,
                                                         i,
                                                         results_per_iteration)
    # Perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url + query).read()
    # Parse the response using feedparser
    feed = feedparser.parse(response)
    # Run through each entry and collect its information
    for entry in feed.entries:
        # print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
        # print('Title: %s' % entry.title)
        # feedparser v4.1 only grabs the first author
        # print('First Author: %s' % entry.author)
        paper = {}
        paper["date"] = entry.published
        paper["title"] = entry.title
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)
    # Sleep a bit before calling the API again
    print('Bulk: %i' % 1)
    time.sleep(wait_time)
According to the arXiv API documentation, there is no published or date field you can query on in search_query.
What you can do is sort the results by date (by adding &sortBy=submittedDate&sortOrder=descending to your query parameters) and stop making requests when you reach 2018.
Basically, your code should be modified like this:
import urllib.request
import urllib.parse
import time
import feedparser

# Base API query url
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
i = 0
results_per_iteration = 1000
wait_time = 3
papers = []
year = ""

print('Searching arXiv for %s' % search_query)

while year != "2018":  # stop requesting once the papers' dates reach 2018
    print("Results %i - %i" % (i, i + results_per_iteration))
    query = ('search_query=%s&start=%i&max_results=%i'
             '&sortBy=submittedDate&sortOrder=descending') % (search_query,
                                                              i,
                                                              results_per_iteration)
    # Perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url + query).read()
    # Parse the response using feedparser
    feed = feedparser.parse(response)
    # Run through each entry and collect its information
    for entry in feed.entries:
        # print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
        # print('Title: %s' % entry.title)
        # feedparser v4.1 only grabs the first author
        # print('First Author: %s' % entry.author)
        paper = {}
        paper["date"] = entry.published
        year = paper["date"][0:4]
        paper["title"] = entry.title
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)
    # Sleep a bit before calling the API again
    print('Bulk: %i' % 1)
    i += results_per_iteration
    time.sleep(wait_time)
for the "post-filtering" approach, once enough results are collected, I'd do something like this:
papers2019 = [item for item in papers if item["date"][0:4] == "2019"]
I need to get a filtered sample of the Twitter stream.
I'm using tweepy.
I checked the methods of the Stream class for getting a sample stream and for filtering, but I didn't catch how I should set it up.
Should it be
stream.filter(track=['']).sample()
stream.sample().filter(track=[''])
or each one on its own line, or something else?
And if you have another idea for how to get a sample stream based on keyword filters, please help.
Thanks in advance.
The Twitter v2 APIs include an endpoint for random sampling and an endpoint for filtered tweets.
The latter allows specifying a random sample percentage in a query (for example, sample:10 will return a random 10% sample).
Note that the v2 APIs are still new and at the moment have a cap of 500k tweets per month.
As an example of the latter, the following code (a modified version of this, see this doc) will collect streaming data with cat or dog tags and store it in a JSON file for every 100 tweets. (Note: this does not include the random sampling query.)
import requests
import os
import json
import pandas as pd

# To set your environment variables in your terminal run the following line:
# export 'BEARER_TOKEN'='<your_bearer_token>'

data = []
counter = 0

def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

def get_rules(headers, bearer_token):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", headers=headers
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

def delete_all_rules(headers, bearer_token, rules):
    if rules is None or "data" not in rules:
        return None
    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

def set_rules(headers, delete, bearer_token):
    # You can adjust the rules if needed
    sample_rules = [
        {"value": "dog has:images", "tag": "dog pictures"},
        {"value": "cat has:images -grumpy", "tag": "cat pictures"},
    ]
    payload = {"add": sample_rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))

def get_stream(headers, set, bearer_token):
    global data, counter
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream", headers=headers, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))
            data.append(json_response['data'])
            if len(data) % 100 == 0:
                print('storing data')
                pd.read_json(json.dumps(data), orient='records').to_json(
                    f'tw_example_{counter}.json', orient='records')
                data = []
                counter += 1

def main():
    bearer_token = os.environ.get("BEARER_TOKEN")
    headers = create_headers(bearer_token)
    rules = get_rules(headers, bearer_token)
    delete = delete_all_rules(headers, bearer_token, rules)
    set = set_rules(headers, delete, bearer_token)
    get_stream(headers, set, bearer_token)

if __name__ == "__main__":
    main()
Then load the data into a pandas DataFrame with
df = pd.read_json('tw_example.json', orient='records')
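If you also want the random-sampling behaviour mentioned above, my understanding is that the sample operator can be added to a rule value in set_rules, for example (a hypothetical rule, not part of the code above):

sample_rules = [
    # hypothetical rule value: keep only a random 10% of tweets that match the filter
    {"value": "(cat OR dog) has:images sample:10", "tag": "10% sample of pet pictures"},
]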
I'd suggest reading the API documentation for tweepy. There you can see how to filter the stream the way you want to.
From reading other code snippets, I believe it should be done like this:
stream.filter(track=['Keyword'])
print(stream.sample())
As I understand it, tweepy uses the Twitter v1.1 APIs, which have separate endpoints for sampling and filtering tweets in real time.
Twitter API references:
v1 sample-realtime
v1 filter-realtime
Approach 1: one can get filtered stream data using stream.filter(track=['keyword1', 'keyword2']) etc. and then sample records from the collected data.
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # do data processing and storing here
        pass
See examples like https://www.storybench.org/how-to-collect-tweets-from-the-twitter-streaming-api-using-python/ and "Ignoring Retweets When Streaming Twitter Tweets".
Approach 2: one can write a program that starts and stops streaming at random time intervals (for example, sampling a random 3-minute interval out of every 15 minutes).
Approach 3: one can instead use the sampling API to collect data and then filter it by keyword to store the relevant data, as in the sketch below.
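A minimal sketch of Approach 3, assuming tweepy 3.x (the same StreamListener API as above) and hypothetical keywords:

import tweepy

KEYWORDS = ("cat", "dog")                          # hypothetical keywords to filter on

class SampleKeywordListener(tweepy.StreamListener):
    def on_status(self, status):
        # keep only sampled tweets that mention one of the keywords
        if any(kw in status.text.lower() for kw in KEYWORDS):
            print(status.id_str, status.text)      # store instead of print in practice

# `auth` is assumed to be an already-configured tweepy.OAuthHandler
stream = tweepy.Stream(auth=auth, listener=SampleKeywordListener())
stream.sample()                                    # ~1% random sample, filtered locally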
I need to fetch information about likes, comments, etc. from only one post object, and here's the request code I send.
Example of my requests:
class StatsSN:
    def __init__(self, fb_post_id, fb_token):
        self.fb_post_id = fb_post_id
        self.fb_token = fb_token

    def req_stats(self, url_method):
        req = requests.get(url_method)
        if req.status_code != 200:
            # return req.json().get('error')
            # return 'error'
            log.info('FB_Statistics: %s' % req.json())
            return -1
        return req.json().get('summary').get('total_count')

    def fb_likes(self):
        url_method = fb_api_url + '%s/likes?summary=true&access_token=%s' % (self.fb_post_id, self.fb_token)
        return self.req_stats(url_method)

    def fb_reactions(self):
        url_method = fb_api_url + '%s/reactions?summary=total_count&access_token=%s' % (self.fb_post_id, self.fb_token)
        return self.req_stats(url_method)

    def fb_comments(self):
        url_method = fb_api_url + '%s/comments?summary=true&access_token=%s' % (self.fb_post_id, self.fb_token)
        return self.req_stats(url_method)

    def fb_sharedposts(self):
        url_method = fb_api_url + '%s/sharedposts?access_token=%s' % (self.fb_post_id, self.fb_token)
        req = requests.get(url_method)
        if req.status_code != 200:
            log.info('FB_Statistics: %s' % req.json())
            return -1
        return len(req.json().get('data'))

    def fb_stats(self):
        fb_likes, fb_reactions, fb_comments, fb_sharedposts = self.fb_likes(), self.fb_reactions(), \
            self.fb_comments(), self.fb_sharedposts()
        return int(fb_likes), int(fb_reactions), int(fb_comments), int(fb_sharedposts)
Is there a method in the Graph API to get info about a few posts in one request?
You can achieve this by sending a batch request. If you only need public data, a normal page token is good enough; however, if you need private information, you will need the specific page token of the page whose post metrics you want to get.
As the metrics you are referring to are public, you should be able to send a GET request with the following syntax:
https://graph.facebook.com/v2.12/?fields=id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true)&ids=STATUS_ID1,STATUS_ID2,STATUS_ID3,...,STATUS_ID50&access_token=PAGE_TOKEN
You can request up to 50 status id's in one call.
limit(0).summary(true)
You need to add this part to comments and reactions, as it is the best practice for retrieving the total number of comments/reactions (limit(0) returns just the summary count, not the items themselves).
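For example, with the requests library the batch read could look roughly like this (the post IDs and token are placeholders, as in the URL above):

import requests

graph_url = "https://graph.facebook.com/v2.12/"
params = {
    "ids": "STATUS_ID1,STATUS_ID2,STATUS_ID3",   # up to 50 comma-separated status ids
    "fields": "id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true)",
    "access_token": "PAGE_TOKEN",                # placeholder page token
}
resp = requests.get(graph_url, params=params)
stats_by_post = resp.json()                      # one entry per requested status id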
def get_videos(search_keyword):
    youtube = build(YOUTUBE_API_SERVICE_NAME,
                    YOUTUBE_API_VERSION,
                    developerKey=DEVELOPER_KEY)
    try:
        search_response = youtube.search().list(
            q=search_keyword,
            part="id,snippet",
            channelId=os.environ.get("CHANNELID", None),
            maxResults=10,  # max = 50, default = 5, min = 0
        ).execute()

        videos = []
        channels = []
        for search_result in search_response.get("items", []):
            if search_result["id"]["kind"] == "youtube#video":
                title = search_result["snippet"]["title"]
                videoId = search_result["id"]["videoId"]
                channelTitle = search_result["snippet"]["channelTitle"]
                cam_thumbnails = search_result["snippet"]["thumbnails"]["medium"]["url"]
                publishedAt = search_result["snippet"]["publishedAt"]
                channelId = search_result["snippet"]["channelId"]

                data = {'title': title,
                        'videoId': videoId,
                        'channelTitle': channelTitle,
                        'cam_thumbnails': cam_thumbnails,
                        'publishedAt': publishedAt}
                videos.append(data)
            elif search_result["id"]["kind"] == "youtube#channel":
                channels.append("%s (%s)" % (search_result["snippet"]["title"],
                                             search_result["id"]["channelId"]))
    except Exception as e:
        print(e)
Now, I'm using the Python YouTube Data API. I get YouTube video data that is searched by keyword in a specified channel, but I want to get all of the data in the specified channel, not just what matches a keyword.
How do I get YouTube video data for a specified channel? The data I want must be all of the data in that channel.
I'm not 100% sure I know what you're asking, but I think you're asking how you can get all videos in a channel and not just those related to your keyword? If that's correct, you should just be able to remove:
q=search_keyword,
from your request, and the API should then return all videos in the channel. If you're asking something else, please clarify in your question.
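For example, the request above could become something like this (a sketch assuming the same client and environment variable as in your code, paging with list_next() to collect every video in the channel):

request = youtube.search().list(
    part="id,snippet",
    channelId=os.environ.get("CHANNELID", None),
    maxResults=50,            # fetch the largest page size allowed
    type="video",
    order="date",
)
while request is not None:
    response = request.execute()
    for item in response.get("items", []):
        print(item["id"]["videoId"], item["snippet"]["title"])
    request = youtube.search().list_next(request, response)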