How to write multiple items in dynamo db using s3? - python

I'm trying to insert multiple items into a DynamoDB table. I'm reading my data from a CSV file, and that part works (my logs in AWS CloudWatch show that I'm correctly extracting the data from the CSV file).
I first tried writing each element to the table in a loop, like this:
for item in Items:
    response = dynamodb.put_item(
        TableName = 'some_table_name',
        Item = item)
For this syntax I'm using the dynamodb client like this:
dynamodb = boto3.client('dynamodb', region_name=region)
After this attempt, I tried batch write in a loop, like this:
for item in Items:
    response = dynamodb.batch_write_item(
        RequestItems={
            'some_table': [
                {
                    'PutRequest': {
                        'Item': item
                    }
                }
            ]
        }
    )
print('Successfully uploaded to DynamoDB')
And I'm still using the dynamodb client.
After those two attempts, I tried the same functions with the DynamoDB resource (dynamodb = boto3.resource('dynamodb', region_name=region)).
The problem is that my list Items has 42 items, and each of these attempts writes a different number of items (25, 27, 29) but never 42. What am I doing wrong? Can you help me, please?

Try implementing a retry mechanism for your batch write:
result, batchError := svc.DynamoDBBatchWrite(input)
if result != nil && len(result.UnprocessedItems) > 0 {
    input = &dynamodb.BatchWriteItemInput{
        RequestItems: result.UnprocessedItems,
    }
}
and then retry until len(result.UnprocessedItems) == 0.
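Since the question uses Python, here is a minimal sketch of the same retry idea with the boto3 client. The function name batch_write_all, the max_retries parameter, and the backoff values are assumptions made for illustration; only batch_write_item, RequestItems, PutRequest, and UnprocessedItems come from the DynamoDB API. It sends the items in chunks of 25 (the per-call limit of batch_write_item) and resubmits whatever comes back as unprocessed.

```python
import time


def batch_write_all(client, table_name, items, max_retries=8):
    """Write every item, resubmitting whatever DynamoDB reports as unprocessed."""
    # batch_write_item accepts at most 25 put/delete requests per call.
    for start in range(0, len(items), 25):
        chunk = items[start:start + 25]
        request_items = {
            table_name: [{'PutRequest': {'Item': item}} for item in chunk]
        }
        for attempt in range(max_retries):
            response = client.batch_write_item(RequestItems=request_items)
            # UnprocessedItems has the same shape as RequestItems,
            # so it can be sent again unchanged.
            request_items = response.get('UnprocessedItems', {})
            if not request_items:
                break
            # Back off briefly before resubmitting the leftovers.
            time.sleep(min(0.05 * 2 ** attempt, 2.0))
        else:
            raise RuntimeError('items still unprocessed after retries')
```

With the client from the question this would be called as batch_write_all(dynamodb, 'some_table', Items). If you use the resource API instead, Table.batch_writer() buffers the puts and resends unprocessed items for you.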

Related

S3 Select Query JSON for nested value when keys are dynamic

I have a JSON object in S3 which follows this structure:
<code>: {
    <client>: <value>
}
For example,
{
    "code_abc": {
        "client_1": 1,
        "client_2": 10
    },
    "code_def": {
        "client_2": 40,
        "client_3": 50,
        "client_5": 100
    },
    ...
}
I am trying to retrieve the numerical value with an S3 Select query, where the "code" and the "client" are populated dynamically with each query.
So far I have tried:
sql_exp = f"SELECT * from s3object[*][*] s where s.{proc}.{client_name} IS NOT NULL"
sql_exp = f"SELECT * from s3object s where s.{proc}[*].{client_name}[*] IS NOT NULL"
as well as without the asterisk inside the square brackets, but nothing works. I get ClientError: An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found LITERAL:UNKNOWN at line 1, column X (where X depends on the length of the query string).
Within the function defining the object, I have:
resp = s3.select_object_content(
    Bucket=<bucket>,
    Key=<filename>,
    ExpressionType="SQL",
    Expression=sql_exp,
    InputSerialization={'JSON': {"Type": "Document"}},
    OutputSerialization={"JSON": {}},
)
Is there something off in the way I define the object serialization? How can I fix the query so I can retrieve the desired numerical value on the fly when I provide "code" and "client"?
I did some tinkering based on the documentation, and it works!
I need to access the single event in the EventStream (resp) as follows:
event_stream = resp['Payload']
# unpack successful query response
for event in event_stream:
    if "Records" in event:
        output_str = event["Records"]["Payload"].decode("utf-8")  # bytes to string
        output_dict = json.loads(output_str)  # string to dict
Now the correct SQL expression is:
sql_exp= f"SELECT s['{code}']['{client}'] FROM S3Object s"
where I have gotten (dynamically) my values for code and client beforehand.
For example, based on the dummy JSON structure above, if code = "code_abc" and client = "client_2", I want this S3 Select query to return the value 10.
The f-string resolves to sql_exp = "SELECT s['code_abc']['client_2'] FROM S3Object s", and when we read resp, we retrieve output_dict = {'client_2': 10}. (I am not sure whether there is a clean way to get the value by itself without the client key; this is how it looks in the documentation as well.)
So, the final step is to retrieve value = output_dict['client_2'], which in our case is equal to 10.
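The unpacking steps above can be wrapped in one helper. This is a minimal sketch, not the only way to do it: the name extract_value is hypothetical, and it assumes JSON output serialization with the default newline record delimiter. It joins the payload bytes of all Records events before decoding, because a single record is not guaranteed to arrive in one event.

```python
import json


def extract_value(event_stream, client):
    """Return the value stored under `client` in the first selected record."""
    chunks = []
    for event in event_stream:
        # Only Records events carry query output; Stats and End events are skipped.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    text = b"".join(chunks).decode("utf-8")
    for line in text.splitlines():
        if line.strip():
            # Each output line is one JSON record such as {"client_2": 10}.
            return json.loads(line).get(client)
    return None
```

With the response from the question this would be used as value = extract_value(resp['Payload'], client).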

Python boto3, list contents of specific dir in bucket, limit depth

This is the same as this question, but I also want to limit the depth returned.
Currently, all answers return all the objects after the specified prefix. I want to see just what's in the current hierarchy level.
Current code that returns everything:
self._session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
self._s3 = self._session.resource("s3")
bucket = self._s3.Bucket(bucket_name)
detections_contents = bucket.objects.filter(Prefix=prefix)
for object_summary in detections_contents:
    print(object_summary.key)
How can I see only the files and folders directly under prefix? How can I go n levels deep?
I could parse everything locally, but that is clearly not what I am looking for here.
There is no definitive way to do this with list objects without fetching all the objects in the directory.
But there is a way using S3 Select, which uses an SQL-like query format, to go n levels deep and get the file content as well as the object keys.
If you are fine with writing SQL, use this.
reference doc
import boto3
import json
s3 = boto3.client('s3')
bucket_name = 'my-bucket'
prefix = 'my-directory/subdirectory/'
input_serialization = {
    'CompressionType': 'NONE',
    'JSON': {
        'Type': 'LINES'
    }
}
output_serialization = {
    'JSON': {}
}
# Set the SQL expression to select the key field for all objects in the subdirectory
expression = "SELECT s.key FROM S3Object s WHERE s.key LIKE '" + prefix + "%'"
# Note: select_object_content also requires a Key argument naming the single object to query
response = s3.select_object_content(
    Bucket=bucket_name,
    ExpressionType='SQL',
    Expression=expression,
    InputSerialization=input_serialization,
    OutputSerialization=output_serialization
)
# The response will contain a Payload field with the selected data
payload = response['Payload']
for event in payload:
    if 'Records' in event:
        records = event['Records']['Payload']
        data = json.loads(records.decode('utf-8'))
        # The data will be a list of objects, each with a "key" field representing the file name
        for item in data:
            print(item['key'])
There is no built-in way to do this with the Boto3 or S3 APIs. You'll need some version of processing each level and asking in turn for a list of objects at that level:
import boto3
s3 = boto3.client('s3')
bucket_name = 'my-bucket'  # placeholder bucket name
max_depth = 2
paginator = s3.get_paginator('list_objects_v2')
# Track all prefixes to show with a list
common_prefixes = [(0, "")]
while len(common_prefixes) > 0:
    # Pull out the next prefix to show
    current_depth, current_prefix = common_prefixes.pop(0)
    # Loop through all of the items using a paginator to handle common prefixes with more
    # than a thousand items
    for page in paginator.paginate(Bucket=bucket_name, Prefix=current_prefix, Delimiter='/'):
        for cur in page.get("CommonPrefixes", []):
            # Show each common prefix, here just use a format like AWS CLI does
            print(" " * 27 + f"PRE {cur['Prefix']}")
            if current_depth < max_depth:
                # This is below the max depth we want to show, so
                # add it to the list to be shown
                common_prefixes.append((current_depth + 1, cur['Prefix']))
        for cur in page.get("Contents", []):
            # Show each item sharing this common prefix using a format like the AWS CLI
            print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')}{cur['Size']:11d} {cur['Key']}")
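The same level-by-level walk can be packaged as a function that starts at the prefix from the question instead of the bucket root. This is only a sketch: list_levels and its parameters are hypothetical names, and the only S3 features it relies on are the list_objects_v2 paginator and the Delimiter argument already used above. A max_depth of 0 returns just what sits directly under the prefix.

```python
from collections import deque


def list_levels(client, bucket, prefix="", max_depth=0):
    """Return (prefixes, keys) found at most max_depth levels below prefix."""
    found_prefixes, found_keys = [], []
    paginator = client.get_paginator("list_objects_v2")
    # Breadth-first queue of (depth, prefix) pairs still to be listed.
    queue = deque([(0, prefix)])
    while queue:
        depth, current = queue.popleft()
        pages = paginator.paginate(Bucket=bucket, Prefix=current, Delimiter="/")
        for page in pages:
            for entry in page.get("CommonPrefixes", []):
                found_prefixes.append(entry["Prefix"])
                if depth < max_depth:
                    # Descend one more level into this "folder".
                    queue.append((depth + 1, entry["Prefix"]))
            for entry in page.get("Contents", []):
                found_keys.append(entry["Key"])
    return found_prefixes, found_keys
```

For example, folders, files = list_levels(boto3.client('s3'), bucket_name, prefix) lists one level, and passing max_depth=n goes n levels further down.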

Is there a way to Enter large number of data into Firebase Firestore DataBase

I am trying to insert a large amount of data (13 million rows) into Firebase Firestore, but it is taking forever to finish.
Currently, I am inserting the data row by row using Python. I tried multi-threading, but it is still very slow and inefficient (I have to stay connected to the Internet).
So, is there another way to insert a file into Firebase (a more efficient way to batch insert the data)?
This is the data format
[
    {
        '010045006031': {
            'FName':'Ahmed',
            'LName':'Aline'
        }
    },
    {
        '010045006031': {
            'FName':'Ali',
            'LName':'Adel'
        }
    },
    {
        '010045006031': {
            'FName':'Osama',
            'LName':'Luay'
        }
    }
]
This is the code that I am using
import firebase_admin
from firebase_admin import credentials, firestore
from random import random
def Insert2DB(I):
    # Document IDs must be strings
    doc_ref = db.collection('DBCoolect').document(str(I['M']))
    doc_ref.set({"FirstName": I['FName'], "LastName": I['LName']})
cred = credentials.Certificate("ServiceAccountKey.json")
firebase_admin.initialize_app(cred)
db = firestore.client()
List = []
# List is read from a file
List.append({'M': random(),'FName':'Ahmed','LName':'Aline'})
List.append({'M': random(),'FName':'Ali','LName':'Adel'})
List.append({'M': random(),'FName':'Osama','LName':'Luay'})
for item in List:
    Insert2DB(item)
Thanks a lot ...
Firestore does not offer any way to "bulk update" documents. They have to be added individually. There is a facility to batch write, but that's limited to 500 documents per batch, and that's not likely to speed up your process by a large amount.
If you want to optimize the rate at which documents can be added, I suggest reading the documentation on best practices for read and write operations and designing for scale. All things considered, however, there is really no "fast" way to get 13 million documents into Firestore. You're going to be writing code to add each one individually. Firestore is not optimized for fast writes. It's optimized for fast reads.
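For reference, the batch write facility mentioned above could be used roughly like this. It is a sketch under assumptions: write_in_batches is a hypothetical helper, db is expected to be the firestore.client() object from the question, and the row keys ('M', 'FName', 'LName') follow the question's code. It groups the rows into batches of at most 500 writes and commits each batch in one request; as noted above, this reduces round trips but is unlikely to make 13 million writes fast.

```python
def write_in_batches(db, collection_name, rows, batch_size=500):
    """Commit rows to Firestore in groups of at most batch_size writes."""
    collection = db.collection(collection_name)
    written = 0
    for start in range(0, len(rows), batch_size):
        group = rows[start:start + batch_size]
        batch = db.batch()
        for row in group:
            # Document IDs must be strings.
            doc_ref = collection.document(str(row['M']))
            batch.set(doc_ref, {"FirstName": row['FName'], "LastName": row['LName']})
        # One commit sends the whole group to Firestore.
        batch.commit()
        written += len(group)
    return written
```

With the names from the question this would be called as write_in_batches(db, 'DBCoolect', List).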
Yes, there is a way to add data to Firebase in bulk using Python:
def process_streaming(self, response):
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            sentiment = self.sentiment_model(json_response["data"]["text"])
            self.datalist.append(self.post_process_data(json_response, sentiment))
It extracts data from the response and saves it into a list.
If we want the loop to wait a few minutes and then upload the accumulated objects in the list to the database, add the code below after the if block:
if (self.start_time + 300 < time.time()):
    print(f"{len(self.datalist)} data send to database")
    self.batch_upload_data(self.datalist)
    self.datalist = []
    self.start_time = time.time() + 300
The final function looks like this:
def process_streaming(self, response):
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            sentiment = self.sentiment_model(json_response["data"]["text"])
            self.datalist.append(self.post_process_data(json_response, sentiment))
        if (self.start_time + 300 < time.time()):
            print(f"{len(self.datalist)} data send to database")
            self.batch_upload_data(self.datalist)
            self.datalist = []
            self.start_time = time.time() + 300
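The buffering idea in this answer can be isolated from the streaming code. The class below is a hypothetical sketch, not part of the answer: TimedBuffer, upload, interval, and clock are names chosen for illustration, and upload stands in for something like batch_upload_data. One deliberate difference is that it resets the timer to the current time after each flush, instead of to time.time() + 300 as in the code above, so flushes happen about every interval seconds.

```python
import time


class TimedBuffer:
    """Collect items and hand them to `upload` once `interval` seconds have passed."""

    def __init__(self, upload, interval=300, clock=time.time):
        self.upload = upload
        self.interval = interval
        self.clock = clock
        self.items = []
        self.last_flush = clock()

    def add(self, item):
        self.items.append(item)
        # Flush when at least `interval` seconds have elapsed since the last flush.
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.items:
            self.upload(self.items)
        # Start a fresh list so the uploaded one is not modified afterwards.
        self.items = []
        self.last_flush = self.clock()
```

Calling flush() once more when the stream ends uploads whatever is still buffered.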

Parsing json file to collect data and store in a list/array

I am trying to build an IoT setup. I am thinking of using a JSON file to store the states of the sensors and lights of the setup.
I have created a function to test out my concept. Here is what I wrote so far for the data side of things.
{
    "sensor_data": [
        {
            "sensor_id": "302CEM/lion/light1",
            "sensor_state": "on"
        },
        {
            "sensor_id": "302CEM/lion/light2",
            "sensor_state": "off"
        }
    ]
}
def read_from_db():
    with open('datajson.json') as f:
        data = json.load(f)
    for sensors in data['sensor_data']:
        name = sensors['sensor_id']
read_from_db()
What I want to do is parse the sensor_id values into an array so that I can access them by index, for example sensor_name[0]. I am not sure how to go about it. I tried array.array, but it doesn't save any values; I also tried .append, but did not get the result I expected. Any suggestions?
If I understood correctly, all you have to do is assign all those sensors to names using a for loop and then return the result:
import json
def read_from_db():
    with open('sensor_data.json') as f:
        data = json.load(f)
    names = [sensors['sensor_id'] for sensors in data['sensor_data']]
    return names
sensor_names = read_from_db()
for i in range(len(sensor_names)):
    print(sensor_names[i])
This will print:
302CEM/lion/light1
302CEM/lion/light2

Error retrieving JSON data from Webhose API in Python

I am a beginner in Python and am trying to use the webhose.io API to collect data from the web. The problem is that the API returns 100 objects per JSON response, so retrieving 500 results requires 5 requests. I am not able to collect all the data at once: I can collect the first 100 results, but when moving on to the next request an error occurs and the first post is repeated. Here is the code:
import webhoseio
webhoseio.config(token="Xxxxx")
query_params = {
    "q": "trump:english",
    "ts": "1498538579353",
    "sort": "crawled"
}
output = webhoseio.query("filterWebContent", query_params)
x = 0
for var in output['posts']:
    print output['posts'][x]['text']
    print output['posts'][x]['published']
    if output['posts'] is None:
        output = webhoseio.get_next()
        x = 0
Thanks.
Use the following:
while output['posts']:
    for var in output['posts']:
        # Print the current post, not always the first one
        print(var['text'])
        print(var['published'])
    output = webhoseio.get_next()
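If the goal is to gather the posts rather than print them, the same loop can be turned into a small helper. This is a sketch with hypothetical names (collect_posts, fetch_next, limit); it assumes only what the code above shows, namely that each response has a 'posts' list and that the next page comes from a call such as webhoseio.get_next().

```python
def collect_posts(first_page, fetch_next, limit=500):
    """Gather posts page by page until a page is empty or `limit` is reached."""
    posts = []
    page = first_page
    while page['posts'] and len(posts) < limit:
        posts.extend(page['posts'])
        # Ask for the next page of results.
        page = fetch_next()
    return posts[:limit]
```

With the names from the question this would be called as collect_posts(webhoseio.query("filterWebContent", query_params), webhoseio.get_next).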
