I need to get a filtered sample of the Twitter stream.
I'm using tweepy.
I checked the methods of the Stream class for getting a sample stream and for filtering,
but I didn't catch how I should set up the class.
Should it be
stream.filter(track=['']).sample()
stream.sample().filter(track=[''])
or each one on its own line, or something else?
And if you have another idea of how to get a sample stream based on keyword filters, please help.
Thanks in advance
Twitter v2 APIs include an endpoint for random sampling and an endpoint for filtered tweets.
The latter allows specifying a random sample percentage in a query (for example, sample:10 will return a random 10% sample).
Note that the v2 APIs are still new and currently have a cap of 500k tweets per month.
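For instance, a filter rule could combine keywords with the sampling operator; a sketch (the rule value here is hypothetical, and availability of the sample: operator depends on your access level):

# hypothetical rule combining keyword filtering with the sample: operator
sample_rules = [
    {"value": "(cat OR dog) sample:10", "tag": "10% sample of cat/dog tweets"},
]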
As an example of the latter, the following code (a modified version of this, see this doc) will collect streaming data with cat or dog tags and store it in a JSON file for every 100 tweets. (Note: it does not include the random sampling query.)
import requests
import os
import json
import pandas as pd

# To set your environment variable, run the following line in your terminal:
# export 'BEARER_TOKEN'='<your_bearer_token>'

data = []
counter = 0

def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

def get_rules(headers, bearer_token):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", headers=headers
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()

def delete_all_rules(headers, bearer_token, rules):
    if rules is None or "data" not in rules:
        return None
    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))

def set_rules(headers, delete, bearer_token):
    # You can adjust the rules if needed
    sample_rules = [
        {"value": "dog has:images", "tag": "dog pictures"},
        {"value": "cat has:images -grumpy", "tag": "cat pictures"},
    ]
    payload = {"add": sample_rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))

def get_stream(headers, set, bearer_token):
    global data, counter
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream", headers=headers, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))
            data.append(json_response['data'])
            if len(data) % 100 == 0:
                print('storing data')
                pd.read_json(json.dumps(data), orient='records').to_json(f'tw_example_{counter}.json', orient='records')
                data = []
                counter += 1

def main():
    bearer_token = os.environ.get("BEARER_TOKEN")
    headers = create_headers(bearer_token)
    rules = get_rules(headers, bearer_token)
    delete = delete_all_rules(headers, bearer_token, rules)
    set = set_rules(headers, delete, bearer_token)
    get_stream(headers, set, bearer_token)

if __name__ == "__main__":
    main()
Then load the data into a pandas DataFrame, e.g.
df = pd.read_json('tw_example_0.json', orient='records')
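Since the script writes a new file for every 100 tweets, the chunks can be combined into a single DataFrame, for example:

import glob
import pandas as pd

# combine all chunk files written by the stream
files = sorted(glob.glob('tw_example_*.json'))
df = pd.concat((pd.read_json(f, orient='records') for f in files), ignore_index=True)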
I'd suggest reading the API documentation for tweepy. There you can see how to filter the stream the way you want.
From reading other code snippets, I believe it should be done like this:
stream.filter(track=['Keyword'])
print(stream.sample())
As I understand it, tweepy uses the Twitter v1.1 APIs, which have separate endpoints for sampling and for filtering tweets in real time.
Twitter API references:
v1 sample-realtime
v1 filter-realtime
Approach 1: one can get filtered stream data using stream.filter(track=['keyword1', 'keyword2']) etc. and then sample records from the collected data.

import tweepy

class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # do data processing and storing here
        pass

See examples like https://www.storybench.org/how-to-collect-tweets-from-the-twitter-streaming-api-using-python/ and the question "Ignoring Retweets When Streaming Twitter Tweets".
Approach 2: one can write a program that starts and stops streaming at random time intervals (for example, a random 3-minute interval in every 15 minutes).
Approach 3: one can instead use the sampling API to collect data and then filter it by keyword, storing only the relevant records; a minimal sketch follows.
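This is a sketch of Approach 3, assuming tweepy 3.x (which still has StreamListener) and an already-configured auth handler; the keywords and storage are placeholders:

import tweepy

KEYWORDS = {'cat', 'dog'}  # placeholder keyword filters

class SampleFilterListener(tweepy.StreamListener):
    def on_status(self, status):
        # keep only sampled tweets that mention one of the keywords
        text = status.text.lower()
        if any(k in text for k in KEYWORDS):
            print(status.id, status.text)  # store relevant data here

# auth is a previously configured tweepy.OAuthHandler
stream = tweepy.Stream(auth=auth, listener=SampleFilterListener())
stream.sample()  # blocks and consumes the random sample stream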
Related
I am simply trying to extract the followers of a Twitter profile using the following code. However, for some unknown reason, the id query parameter value is not valid. I have tried the user ids of several Twitter accounts to see whether the problem lay with the id rather than my code. The problem is in the code...
def create_url_followers(max_results):
    # params based on the get_users_followers endpoint
    query_params_followers = {'max_results': max_results,  # the maximum number of results to be returned per page
                              'pagination_token': {}}      # used to request the next page of results
    return query_params_followers

def connect_to_endpoint_followers(url, headers, params):
    response = requests.request("GET", url, headers=headers, params=params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
# Inputs for the request
bearer_token = auth()  # retrieves the token from the environment
headers = create_headers(bearer_token)
max_results = 100  # number of results, i.e. followers of pol_id
To run the code in a for loop, I add the id at its position in the URL and loop through the list of ids, appending results to json_response_followers.
pol_ids_list = house["Uid"].astype(str).values.tolist()  # create list of politician ids from column Uid in df house
json_response_followers = []  # empty list to append Twitter data to
for id in pol_ids_list:  # loop over ids in pol_ids_list
    url = (("https://api.twitter.com/2/users/" + id + "/followers"), create_url_followers(max_results))
    print(url)
    json_response_followers.append(connect_to_endpoint_followers(url[0], headers, url[1]))  # append data to json_response_followers
    sleep(60.5)  # sleep for 60.5 seconds to respect the API rate limit
I think the problem here could be that you're specifying pol_id as a parameter, which would be appended to the call. In the case of this API, you want to insert the value of pol_id at the point where :id is in the URL. The max_results and pagination_token values should be appended.
Try checking the value of url before calling connect_to_endpoint_followers.
I think you are currently trying to call
https://api.twitter.com/2/users/:id/followers?id=2891210047&max_results=100&pagination_token=
This is not valid, as there's no value for :id in the URI where it belongs, and the id parameter itself is not valid / is essentially a duplicate of what should be :id.
It should be:
https://api.twitter.com/2/users/2891210047/followers?max_results=100&pagination_token=
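A sketch of the corrected loop (reusing connect_to_endpoint_followers and the variables from the question):

# pol_id belongs in the URL path; only paging params go in the query string
for pol_id in pol_ids_list:
    url = "https://api.twitter.com/2/users/" + pol_id + "/followers"
    params = {"max_results": max_results}  # add "pagination_token" when paging
    json_response_followers.append(connect_to_endpoint_followers(url, headers, params))
    sleep(60.5)  # stay under the rate limit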
I'm trying to find a way to iterate over the roleprivs, and I'm having issues getting to that level of the YAML from Python.
testrole.yaml
info:
  rolename: "testDeveloper"
  desc: "Test Developer Role"
  roletype: "user"
roleprivs:
  admin-appliance:
    name: "Administrate Appliance"         # Informational Only Not used in code
    description: "admin-appliance"         # Informational Only Not used in code
    code: "admin-appliance"
    access: "full"
  admin-backupSettings:
    name: "Administrate Backup Settings"   # Informational Only Not used in code
    description: "admin-appliance"         # Informational Only Not used in code
    code: "admin-backupSettings"
    access: "full"
I have a few different needs / use cases.
Part 1 of the script below: grab all the files in a directory and take the rolename, desc, and roletype to create a role.
Get the role ID of the newly created role from above.
HELP needed: going back to the original YAML file, iterate over it and get only roleprivs.<feature>.code and roleprivs.<feature>.access, where <feature> would be something like admin-appliance. Keep in mind that there are some 50-odd features that need to be updated with the type of access.
The question:
How do I get the code and access values from the YAML file into Python variables?
def genericRoleCreate(baseURL, bearerToken):
    print("initial")
    files = glob.glob(ROLES_DIR)
    logger.debug('Roles Dir ' + ROLES_DIR)
    for file in files:
        yaml_file = file
        logger.debug(yaml_file)
        with open(yaml_file) as f:
            try:
                result = yaml.safe_load(f)
                authority = result['info']['rolename']
                desc = result['info']['desc']
                roletype = result['info']['roletype']
                url = baseURL + "/api/roles"
                payload = json.dumps({"role": {"authority": authority, "description": desc, "roletype": roletype}})
                headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + bearerToken}
                roleResult = requests.request("POST", url, verify=False, headers=headers, data=payload)
                logger.debug(roleResult.text)
            except yaml.YAMLError as exc:
                logger.error(exc)
        # Getting Role ID
        try:
            with open(yaml_file) as f:
                result = yaml.safe_load(f)
            authority = result['info']['rolename']
            url = baseURL + "/api/roles?phrase=" + authority
            headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + bearerToken}
            roleResult = requests.request("GET", url, verify=False, headers=headers)
            #print(roleResult.text)
            roleID = json.loads(roleResult.text)
            role = roleID['roles'][0]['id']
            #logger.debug(role)
            logger.info("Get Role ID")
            print(role)
            #return role
            #logger.debug("Role ID: "+role)
        except Exception as e:
            logger.error('Exception occurred', exc_info=True)
            logger.error('Error getting roleID')
        # Start Updating
        #role = getRoleId(baseURL, bearerToken)
        try:
            with open(yaml_file) as f:
                result = yaml.safe_load(f)
        except Exception as e:
            logger.error("Broken")
        strRoleID = str(role)
        url = baseURL + "/api/roles/" + strRoleID + "/update-permission"
        #logger.debug(result)
        keys = list(result.keys())
        for features in keys:
            #logger.debug(keys)
            code = result[features]['code']
            access = result[features]['access']
            payload = json.dumps({
                "permissionCode": code,
                "access": access
            })
            headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + bearerToken}
            requests.request("PUT", url, verify=False, headers=headers, data=payload)
Let's keep in mind that I do know I should be breaking that big nasty thing into multiple functions; I have it broken down in other areas, but I'm compiling everything into a single function for the time being.
I have been trying multiple iterations of how to get to the feature level. I have looked at many examples and can't seem to figure out how to drop down a level.
update 1
try:
    with open(yaml_file, 'r') as f:
        result = yaml.safe_load(f)
except Exception as e:
    logger.error("Broken")
strRoleID = str(role)
url = baseURL + "/api/roles/" + strRoleID + "/update-permission"
#logger.debug(result)
keys = list(result['roleprivs'].keys())
#code2 = {roleprivs for roleprivs in result['roleprivs'].keys()}
#print(code2)
#return inventory, sites
for features in keys:
    print(features)
The code above produces the output:
admin-appliance
admin-backupSettings
Now the question is: how do I go one step deeper in the chain and get code and access into Python variables?
I ended up solving the problem after a few hours of testing and researching. The main problem I was encountering was how to get to the next level after roleprivs. I could easily print the main elements under roleprivs, but getting the code and access elements was a bit of a challenge.
I also kept running into an error saying the indices needed to be integers. The snippet below cleans up some of the key handling I was doing before and puts it into a one-liner; it should help clean up a few of the for loops I have been working with.
for k, v in result['roleprivs'].items():
    print("Code " + result['roleprivs'][k]['code'])
    print("Access: " + result['roleprivs'][k]['access'])
    access = result['roleprivs'][k]['access']
    code = result['roleprivs'][k]['code']
    payload = json.dumps({
        "permissionCode": code,
        "access": access
    })
    headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + bearerToken}
    requests.request("PUT", url, verify=False, headers=headers, data=payload)
From the original code, I may have multiple roles in the ./config/roles/ directory. I needed to make sure I can read them all and iterate over each one in a for loop. This solved it for me.
Final output:
2021-10-13 01:25:50,212:84:logger:role:genericRoleCreate:DEBUG:./config/roles/testDeveloper.yaml
2021-10-13 01:25:50,487:110:logger:role:genericRoleCreate:INFO:Get Role ID
8
Code admin-appliance
Access: full
Code admin-backupSettings
Access: full
I am trying to retrieve all the devices from the REST API below, but the API JSON response is limited to 10k rows. The code below only returns 40 rows, and I'm not sure why. How do I send multiple requests to retrieve all the devices and get past the 10k row limit?
https://developer.carbonblack.com/reference/carbon-black-cloud/platform/latest/devices-api/
n is the "num_found" from the JSON response, which is retrieved by another Python function; it's the total number of devices registered.
def getdevices(n):
    param = {
        "criteria": {"status": ["ACTIVE"], "target_priority": ["LOW", "MEDIUM", "HIGH", "MISSION_CRITICAL"]}
    }
    all_endpoints = []
    # loop through all and return JSON object
    for start in range(0, n, 10001):
        r = requests.post(hostname + f'/appservices/v6/orgs/abcdefg/devices/_search?start={start}&rows=10000', data=json.dumps(param), headers=headers)
        if r.status_code != 200:
            logging.error('status code ' + str(r.status_code))
            logging.error('unable to get devices! ' + r.text)
            sys.exit(1)
        else:
            response = json.loads(r.text)
            r = response['results']
            for d in r:
                all_endpoints.append({k: d[k] for k in ("id", "registered_time", "last_contact_time",
                                                        "last_external_ip_address", "last_internal_ip_address",
                                                        "last_location", "login_user_name", "name")})
    return all_endpoints
Another example is the curl command on the page below; how would I use a similar approach in Python with the code above to retrieve all the devices?
https://community.carbonblack.com/t5/Knowledge-Base/Carbon-Black-Cloud-How-To-Use-API-Pagination/ta-p/40205
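Based on that article, I assume the loop should step start by the page size (rows) rather than rows + 1, something like this untested sketch:

def getdevices(n, rows=10000):
    param = {
        "criteria": {"status": ["ACTIVE"], "target_priority": ["LOW", "MEDIUM", "HIGH", "MISSION_CRITICAL"]}
    }
    all_endpoints = []
    for start in range(0, n, rows):  # step by the page size
        r = requests.post(hostname + f'/appservices/v6/orgs/abcdefg/devices/_search?start={start}&rows={rows}',
                          data=json.dumps(param), headers=headers)
        if r.status_code != 200:
            logging.error('unable to get devices! ' + r.text)
            sys.exit(1)
        all_endpoints.extend(r.json()['results'])
    return all_endpoints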
Thanks
With help on Stack Overflow, I was able to come up with this scraper. The code returns a list of part numbers and their corresponding prices:
part1 price1
part2 price2
...
...
partn pricen
However, the website seems to allow a limit of only 200 per request; when I raise the limit above 200 I get the error: raise JSONDecodeError("Expecting value", s, err.value) from None JSONDecodeError: Expecting value.
I just want to know if there's a way to avoid this error. If not, I can raise "start": 0 by 200 each time, but since I easily have 100k+ items that won't be very efficient. Is there a way I can loop over the limit and start parameters?
Please see the code below; any help is appreciated!
import requests
# import pprint  # to format data on screen `pprint.pprint()`
import pandas as pd

# --- functions ---

def get_data(query):
    """Get data from server"""
    payload = {
        # "facets":[{
        #     "name":"OEM",
        #     "value":"GE%20Healthcare"
        # }],
        "facets": [],
        "facilityId": 38451,
        "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
        "limit": 200,
        "query": query,
        "referer": "/catalog/Service",
        "start": 0,
        # "urlParams":[{
        #     "name": "OEM",
        #     "value": "GE Healthcare"
        # }],
        "urlParams": []
    }

    r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
    data = r.json()
    return data

all_queries = ['GE Healthcare']

for query in all_queries:
    #print('\n--- QUERY:', query, '---\n')
    data = get_data(query)
    Part_Num = []
    Vendor_Item_Num = []
    price = []
    for item in data['products']:
        if not item['options']:
            Part_Num.append([])
            Vendor_Item_Num.append([])
            price.append([])
        else:
            all_prices = [option['price'] for option in item['options']]
            all_vendor = [option['price'] for option in item['options']]
            all_part_num = item['partNumber']
            Part_Num.append(all_part_num)
            Vendor_Item_Num.append(all_vendor)
            price.append(all_prices)

list_of_dataframes = [pd.DataFrame(Part_Num), pd.DataFrame(price)]
pd.concat(list_of_dataframes, axis=1).to_csv(r'C:\Users\212677036\Documents\output7.csv')
You should always check the status_code to confirm your request was successful; this API returns HTTP 500 when limit is greater than 200. Study the documentation of the API: many APIs limit requests per second and maximum request size so they can maintain a reliable service.
The json() method will fail if the HTTP request was not successful.
You can get data in batches. In the sample code below I stop early because I don't want to stay in the loop for 500+ iterations. You could also consider using threading so the requests are not so sequential.
import requests
import pandas as pd  # needed for pd.json_normalize / pd.concat below

query = 'GE Healthcare'

payload = {
    "facets": [],
    "facilityId": 38451,
    "id_ins": "a2a3d332-73a7-4194-ad87-fe7412388916",
    "limit": 200,
    "query": query,
    "referer": "/catalog/Service",
    "start": 0,
    "urlParams": []
}

r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
if r.status_code == 200:
    js = r.json()
    df = pd.json_normalize(js["products"])

    while len(df) < js["totalResults"] and len(df) < 2000:
        payload["start"] += 200
        r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=payload)
        if r.status_code == 200:
            df = pd.concat([df, pd.json_normalize(r.json()["products"])])
        else:
            break

    print(f"want: {js['totalResults']} got: {len(df)}")
    df
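If you try threading as suggested, here is a rough sketch with concurrent.futures (assuming the server tolerates a few parallel requests; the 2000-result cap is just for the example):

import concurrent.futures

def fetch_page(start):
    # fetch one page of 200 results at the given offset
    page_payload = dict(payload, start=start)
    r = requests.post('https://prodasf-vip.partsfinder.com/Orion/CatalogService/api/v1/search', json=page_payload)
    r.raise_for_status()
    return pd.json_normalize(r.json()["products"])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    pages = pool.map(fetch_page, range(0, 2000, 200))
    df = pd.concat(pages, ignore_index=True)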
import requests
import json

data = '{"userId":"faraz@wittyparrot.com","password":"73-rRWk_"}'
response = requests.post(url, data=data, headers={"Content-Type": "application/json"})
dataa = json.loads(response.content)
a = dataa['accessToken']
print(a)

tiketId = a['tokenType'] + a['tokenValue']
print(tiketId)

wit = '{ "name": "wit along with the attachment", "parentId": "6d140705-c178-4410-bac3-b15507a5415e", "content": "faraz khan wit", "desc": "This is testing of Authorization wit", "note": "Hello auto wit"}'
response = requests.post(URLcreatewit, data=wit, headers={"Content-Type": "application/json", "Authorization": tiketId})
createwit = json.loads(response.content)
print(createwit)

Id = createwit['id']
WitId = Id
print(WitId)
So here witId is 2d81dc7e-fc34-49d4-b4a7-39a8179eaa55, which comes back in the response.
Now I want to use that witId as input in the JSON below:
Sharewit = '{ "contentEntityIds": ["' + WitId + '"], "userEmailIds": ["ediscovery111@gmail.com"], "permission": {"canComment": false, "canRead": true, "canEditFolderAndWits": false, "canFurtherShare": false, "canEditWits": false}, "inherit": true}'
response = requests.post(URLcreatewit, data=Sharewit, headers={"Content-Type": "application/json", "Authorization": tiketId})
print(response.status_code)
In the last JSON it seems the value of witId is not taken, and I get a 400 status error.
I was trying to do a similar thing, and here is how I have done it.
Assuming the REST API responds with a JSON object:
id = response.json()["id"]
However, if the response is a JSON array, loop through the array and append the ids to a list:
ids = []
for item in response.json():
    ids.append(item["id"])
Also, I have used a Python dict (a JSON object), to be able to change values, instead of building the JSON as a string:
sharewit["userEmailIds"] = ["ediscovery111@gmail.com"]
sharewit["contentEntityIds"] = [id]
response = requests.post(URLcreatewit, data=json.dumps(sharewit), headers={"Content-Type": "application/json", "Authorization": tiketId})
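Putting it together for the share call, the whole payload can be built as a dict and serialised once (a sketch; field names are taken from the question):

# sketch: build the payload as a dict so values like WitId drop in naturally
sharewit = {
    "contentEntityIds": [WitId],
    "userEmailIds": ["ediscovery111@gmail.com"],
    "permission": {
        "canComment": False,
        "canRead": True,
        "canEditFolderAndWits": False,
        "canFurtherShare": False,
        "canEditWits": False,
    },
    "inherit": True,
}
response = requests.post(
    URLcreatewit,
    data=json.dumps(sharewit),
    headers={"Content-Type": "application/json", "Authorization": tiketId},
)
print(response.status_code)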