I am a beginner in Python and am trying to use the webhose.io API to collect data from the web. The problem is that this crawler returns 100 objects per JSON response, i.e., to retrieve 500 records it is necessary to make 5 requests. When I use the API, I am not able to collect all the data at once. I was able to collect the first 100 results, but on the next request something goes wrong: the first post is repeated. Here is the code:
import webhoseio

webhoseio.config(token="Xxxxx")

query_params = {
    "q": "trump:english",
    "ts": "1498538579353",
    "sort": "crawled"
}

output = webhoseio.query("filterWebContent", query_params)

x = 0
for var in output['posts']:
    print output['posts'][x]['text']
    print output['posts'][x]['published']

    if output['posts'] is None:
        output = webhoseio.get_next()
        x = 0
Thanks.
Use the following:
while output['posts']:
    for var in output['posts']:
        print var['text']
        print var['published']
    output = webhoseio.get_next()
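If the goal is to collect a fixed number of posts (say the 500 mentioned in the question) rather than just print them, a minimal sketch along the same lines, assuming the same webhoseio client and query_params, would be:

all_posts = []

output = webhoseio.query("filterWebContent", query_params)
while output['posts'] and len(all_posts) < 500:
    all_posts.extend(output['posts'])  # keep the whole batch (up to 100 posts)
    output = webhoseio.get_next()      # fetch the next batch from the API

print len(all_posts)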
I am trying to extract data from a REST API using Python and put it into one neat JSON file, and I'm having difficulty. The data is rather lengthy, with a total of nearly 4,000 records, but the maximum number of records the API returns per request is 100.
I've tried using some other examples to get through the code, and so far this is what I'm using (censoring the API URL and auth key, for the sake of confidentiality):
import requests
import json
from requests.structures import CaseInsensitiveDict

url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"

headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"

resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")

vendors = []
new_results = True
page = 1
while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("results", [])
    vendors.extend(centiblock)
    page += 1

full_directory = json.dumps(vendors, indent=4)
print(full_directory)
For the life of me, I cannot figure out why it isn't working. The output keeps coming out as just:
[
    "records"
]
If I play around with the print statement at the end, I can get it to print centiblock (so named for being a block of 100 records at a time) just fine - it gives me 100 records in unformatted text. However, if I try printing vendors at the end, the output is:
['records']
...which leads me to guess that somehow, the vendors array is not getting filled with the data. I suspect that I need to modify the get request where I define new_results, but I'm not sure how.
For reference, this is a censored look at how the json data begins, when I format and print out one centiblock:
{
    "records": [
        {
            "id": "XXX",
            "createdTime": "2018-10-15T19:23:59.000Z",
            "fields": {
                "Vendor Name": "XXX",
                "Main Phone": "XXX",
                "Street": "XXX",
Can anyone see where I'm going wrong?
Thanks in advance!
When you are extending vendors with centiblock, you are passing a dict to the extend function. extend expects an iterable, so that works, but when you iterate over a Python dict, you only iterate over its keys. In this case, that is ['records'].
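As a quick standalone illustration of that behaviour (not part of the original code):

block = {"records": [{"id": "XXX"}, {"id": "YYY"}]}

filled = []
filled.extend(block)             # iterating a dict yields only its keys
print(filled)                    # ['records']

filled = []
filled.extend(block["records"])  # extending with the list of records instead
print(filled)                    # [{'id': 'XXX'}, {'id': 'YYY'}]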
Note as well that your loop condition becomes False after the first iteration, because centiblock.get("results", []) returns [], since "results" is not a key of the API's output, and [] is falsy.
Hence, to correct those errors you need to read the correct field from the API into new_results, and extend vendors with new_results, which is itself a list. Note that on the last iteration new_results will be the empty list, which means vendors won't be extended with any null value and will contain exactly what you need.
This should look like:
import requests
import json
from requests.structures import CaseInsensitiveDict

url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"

headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"

resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")

vendors = []
new_results = [True]  # non-empty sentinel so the loop body runs at least once
page = 1
while len(new_results) > 0:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("records", [])
    vendors.extend(new_results)
    page += 1

full_directory = json.dumps(vendors, indent=4)
print(full_directory)
Note that I replaced the while new_results with while len(new_results) > 0, and initialized new_results as a non-empty list so the first iteration runs. The two conditions are equivalent in this case, but the explicit length check is more readable here.
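One caveat: I'm assuming the page query parameter actually advances the results here. If it doesn't, Airtable's list-records endpoint paginates with an offset token returned in each response (per its documentation), and a sketch of that pattern would be:

vendors = []
params = {}
while True:
    block = requests.get(url, headers=headers, params=params).json()
    vendors.extend(block.get("records", []))
    offset = block.get("offset")  # token for the next page, absent on the last page
    if not offset:
        break
    params = {"offset": offset}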
I am trying to insert a large amount of data (13 million rows) into Firebase Firestore, but it is taking forever to finish.
Currently, I am inserting the data row by row using Python. I tried to use multi-threading, but it is still very slow and not efficient (I have to stay connected to the Internet).
So, is there another way to insert a file into Firebase (a more efficient way to batch-insert the data)?
This is the data format
[
    {
        '010045006031': {
            'FName': 'Ahmed',
            'LName': 'Aline'
        }
    },
    {
        '010045006031': {
            'FName': 'Ali',
            'LName': 'Adel'
        }
    },
    {
        '010045006031': {
            'FName': 'Osama',
            'LName': 'Luay'
        }
    }
]
This is the code that I am using
import firebase_admin
from firebase_admin import credentials, firestore
from random import random

def Insert2DB(I):
    doc_ref = db.collection('DBCoolect').document(I['M'])
    doc_ref.set({"FirstName": I['FName'], "LastName": I['LName']})

cred = credentials.Certificate("ServiceAccountKey.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

List = []
# List is read from a file; placeholder string IDs here (Firestore document IDs must be strings)
List.append({'M': str(random()), 'FName': 'Ahmed', 'LName': 'Aline'})
List.append({'M': str(random()), 'FName': 'Ali', 'LName': 'Adel'})
List.append({'M': str(random()), 'FName': 'Osama', 'LName': 'Luay'})

for item in List:
    Insert2DB(item)
Thanks a lot ...
Firestore does not offer any way to "bulk update" documents. They have to be added individually. There is a facility to batch write, but that's limited to 500 documents per batch, and that's not likely to speed up your process by a large amount.
If you want to optimize the rate at which documents can be added, I suggest reading the documentation on best practices for read and write operations and designing for scale. All things considered, however, there is really no "fast" way to get 13 million documents into Firestore. You're going to be writing code to add each one individually. Firestore is not optimized for fast writes. It's optimized for fast reads.
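For what it's worth, here is a minimal sketch of that batched-write facility with the firebase_admin client, assuming records is a list of dicts shaped like the ones in the question and that 'M' holds a string document ID (both assumptions on my part):

from firebase_admin import firestore

db = firestore.client()
BATCH_SIZE = 500  # Firestore's documented per-batch limit

for start in range(0, len(records), BATCH_SIZE):
    batch = db.batch()
    for record in records[start:start + BATCH_SIZE]:
        doc_ref = db.collection('DBCoolect').document(record['M'])  # collection name taken from the question
        batch.set(doc_ref, {"FirstName": record['FName'], "LastName": record['LName']})
    batch.commit()  # one network round trip per batch of up to 500 writes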
Yes, there is a way to bulk-add data into Firebase using Python.
def process_streaming(self, response):
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            sentiment = self.sentiment_model(json_response["data"]["text"])
            self.datalist.append(self.post_process_data(json_response, sentiment))
It extracts data from the response and saves it into a list.
If we want the loop to wait a few minutes and then upload the accumulated objects from the list into the database, add the code below after the if block.
if (self.start_time + 300 < time.time()):
    print(f"{len(self.datalist)} data send to database")
    self.batch_upload_data(self.datalist)
    self.datalist = []
    self.start_time = time.time() + 300
The final function looks like this:
def process_streaming(self, response):
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            sentiment = self.sentiment_model(json_response["data"]["text"])
            self.datalist.append(self.post_process_data(json_response, sentiment))

            if (self.start_time + 300 < time.time()):
                print(f"{len(self.datalist)} data send to database")
                self.batch_upload_data(self.datalist)
                self.datalist = []
                self.start_time = time.time() + 300
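The self.batch_upload_data helper is not shown above; a hypothetical implementation, assuming a firebase_admin Firestore client stored on self.db and that each item carries its own document ID under a 'doc_id' key (both assumptions, not from the original), might look like:

def batch_upload_data(self, datalist):
    # Write up to 500 documents per Firestore batch (the documented limit).
    for start in range(0, len(datalist), 500):
        batch = self.db.batch()
        for item in datalist[start:start + 500]:
            doc_ref = self.db.collection('tweets').document(item['doc_id'])  # hypothetical collection/key
            batch.set(doc_ref, item)
        batch.commit()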
I'm new to Python. What I'm trying to do is use the Webhose.io API to crawl web data into JSON format. Each query gives me 5 posts/articles, and I'm trying to get 1,000 articles for the dataset.
Webhose is free to register and gives you 1,000 requests per month for free, so this should be enough for getting the dataset.
The code I currently have looks like this:
import webhoseio, json, io

webhoseio.config(token="YOUR API KEY")

query_params = {
    "q": "organization:Amazon",
    "sort": "crawled"
}

results = webhoseio.query("filterWebContent", query_params)
print(len(results))

with open('dataset.txt', 'w', encoding='utf8') as outfile:
    output = json.dumps(results,
                        indent=4, sort_keys=True,
                        separators=(',', ': '), ensure_ascii=False)
    outfile.write(output)
    #output = json.load(outfile)

    results = webhoseio.get_next()
    output += json.dumps(results,
                         indent=4, sort_keys=True,
                         separators=(',', ': '), ensure_ascii=False)
    outfile.write(output)
Each time I run the code, it looks like I'm only using one request, and I'm not sure how many articles I'm getting. Is there a way to modify the code so that I can get 1,000 articles (at 5 articles per request, that would be 200 requests)? Also, is there a way to change the crawled date in Webhose so that those 200 requests won't give me the same articles?
Thank you
The content (articles, metadata) of the first 100 posts of your query is stored in results['posts']. So when you call len(results['posts']) you should get 100 (assuming your query produces at least 100 results).
To get the next 100 results you should call webhoseio.get_next().
To get all batches you could do something like
output = results['posts']  # start with the first batch returned by webhoseio.query()
while True:
    temp = webhoseio.get_next()
    output = output + temp['posts']
    if temp['moreResultsAvailable'] <= 0:
        break
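To then dump everything collected into the dataset file the question aims for, one possible follow-up (a sketch reusing the json module already imported in the question's code) is:

with open('dataset.txt', 'w', encoding='utf8') as outfile:
    json.dump(output, outfile,
              indent=4, sort_keys=True,
              separators=(',', ': '), ensure_ascii=False)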
I'm new to Django REST Framework and I faced a problem.
I'm building a backend for a social app. The task is to return a paginated JSON response to client. In the docs I only found how to do that for model instances, but what I have is a list:
[368625, 507694, 687854, 765213, 778491, 1004752, 1024781, 1303354, 1311339, 1407238, 1506842, 1530012, 1797981, 2113318, 2179297, 2312363, 2361973, 2610241, 3005224, 3252169, 3291575, 3333882, 3486264, 3860625, 3964299, 3968863, 4299124, 4907284, 4941503, 5120504, 5210060, 5292840, 5460981, 5622576, 5746708, 5757967, 5968243, 6025451, 6040799, 6267952, 6282564, 6603517, 7271663, 7288106, 7486229, 7600623, 7981711, 8106982, 8460028, 10471602]
Is there some nice way to do it? Do I have to serialize it in some special way?
What the client is expecting is:
{"users": [{"id": "368625"}, {"id": "507694"}, ...]}
The question is: How to paginate such response?
Any input is highly appreciated!
Best regards,
Alexey.
TLDR:
import json
data=[368625, 507694, 687854, 765213, 778491, 1004752, 1024781, 1303354, 1311339, 1407238, 1506842, 1530012, 1797981, 2113318, 2179297, 2312363, 2361973, 2610241, 3005224, 3252169, 3291575, 3333882, 3486264, 3860625, 3964299, 3968863, 4299124, 4907284, 4941503, 5120504, 5210060, 5292840, 5460981, 5622576, 5746708, 5757967, 5968243, 6025451, 6040799, 6267952, 6282564, 6603517, 7271663, 7288106, 7486229, 7600623, 7981711, 8106982, 8460028, 10471602]
print(json.dumps({"users":[{"id":value} for value in data]}))
import json imports the json package, which is a JSON serialization/deserialization library
json.dumps(obj) takes obj, a python object, and serializes it to a JSON string
[{"id":value} for value in data] is just a list comprehension which creates a list of python dictionaries with "id" mapped to each value in the data array
EDIT: Pagination
I'm not sure if there's some standard on pagination, but a simple model would be:
"data": {
"prevPage": "id",
"nextPage": "id",
"data": [
...
]
}
Honestly, implementing that in python wouldn't be that hard:
data = [ ... ]
currentPage = {"pageID": 0, "data": []}
prevPage = {"pageID": -1}
pageSize = 5

for value in data:
    currentPage["data"].append({"id": value})
    if len(currentPage["data"]) == pageSize:
        currentPage["prevPage"] = prevPage["pageID"]
        prevPage["nextPage"] = currentPage["pageID"]
        # add currentPage to some database of pages
        prevPage = currentPage
        currentPage = {"pageID": "generate new page id", "data": []}
Obviously, this isn't very polished, but shows the basic concept.
EDIT: Pagination without storing pages
You could of course recreate the page every time it is requested:
def getPage(pageNum):
    # calculate pageStart and pageEnd based on your own requirements
    pageStart = (pageNum // 5) * 5
    pageEnd = (pageNum // 5) * 5 + 5
    return [{"id": data[idx]} for idx in range(pageStart, pageEnd)]
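Putting that together with the response shape the client expects, a usage sketch (assuming the same data list and the json import from the TLDR snippet) might look like:

page = getPage(0)                            # first page of 5 ids
response_body = json.dumps({"users": page})
print(response_body)                         # {"users": [{"id": 368625}, {"id": 507694}, ...]}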
The following link provides data in JSON regarding a BTC address -> https://blockchain.info/address/1GA9RVZHuEE8zm4ooMTiqLicfnvymhzRVm?format=json.
The Bitcoin address can be viewed here --> https://blockchain.info/address/1GA9RVZHuEE8zm4ooMTiqLicfnvymhzRVm
As you can see in the first transaction on 2014-10-20 19:14:22, the TX had 10 inputs from 10 addresses. I want to retrieve these addresses using the API, but I have been struggling to get this to work. The following code only retrieves the first address instead of all 10. I know it has to do with the JSON structure, but I can't figure it out.
import json
import urllib2
import sys

# Random BTC address (user input)
btc_adress = ("1GA9RVZHuEE8zm4ooMTiqLicfnvymhzRVm")

# API call to blockchain
url = "https://blockchain.info/address/" + (btc_adress) + "?format=json"
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)

# Put tx's into a list
txs_list = []
for txs in data["txs"]:
    txs_list.append(txs)

# Cut the list down to 5 recent transactions
listcutter = len(txs_list)
if listcutter >= 5:
    del txs_list[5:listcutter]

# Get number of inputs for tx
recent_tx_1 = txs_list[1]
total_inputs_tx_1 = len(recent_tx_1["inputs"])
The block below needs to put all 10 input addresses into the list output_adress, but it only does so for the first one:
output_adress = []
output_adress.append(recent_tx_1["inputs"][0]["prev_out"]["addr"])
print output_adress
Your help is always appreciated, thanks in advance.
That's because you only append one address to the list. Change it to this:
output_adress = []
for i in xrange(len(recent_tx_1["inputs"])):
    output_adress.append(recent_tx_1["inputs"][i]["prev_out"]["addr"])
print output_adress
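Equivalently, since indexing isn't really needed here, a slightly more idiomatic version of the same loop (still Python 2, to match the question's code) would be:

output_adress = [tx_input["prev_out"]["addr"] for tx_input in recent_tx_1["inputs"]]
print output_adress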