I have the following JSON file:
{"columnwithoutname":"structureet","nofinesset":810001792,"nofinessej":810001784}
{"columnwithoutname":"structureet","nofinesset":670797117,"nofinessej":670010339}
I want to insert it in DynamoDB using Lambda. This is what I did:
def lambda_handler(event, context):
    bucket = event['b']
    file_key = event['c']
    table = event['t']
    recList = []
    s3 = boto3.client('s3')
    dynamodb = boto3.client('dynamodb')
    obj = s3.get_object(Bucket=bucket, Key=file_key)
    recList = obj['Body'].read().split('\n')
    for row in recList:
        response = dynamodb.put_item(TableName='test-abe', Item=row)
But I have this error:
"errorMessage": "Parameter validation failed:\nInvalid type for parameter Item
Apparently I also need to specify the type of each column for it to be accepted. Is there any way to do this automatically? I want all columns to be strings. Thank you.
The DynamoDB client expects the Item parameter to be a dict (per https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.put_item).
When you do recList = obj['Body'].read().split('\n'), what you get is a list of str, so passing a str to a parameter that expects a dict will obviously fail.
One more thing to consider is that the DynamoDB client expects the item in a very specific format, with explicitly specified attribute data types. If you want to read JSON and simply write it, I suggest using the DynamoDB resource, something like this:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(table_name)
table.put_item(Item=item)
Table.put_item() accepts a plain dict with no need to specify a data type for every attribute, so you can simply read from the file, convert each line to a dict and send it off (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Table.put_item).
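For example, a minimal sketch of a handler using the resource API (assuming the same event keys as in the question, a JSON-lines file, and coercing every value to a string as requested):
import json
import boto3

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
    obj = s3.get_object(Bucket=event['b'], Key=event['c'])
    table = dynamodb.Table(event['t'])
    # the file contains one JSON document per line
    for line in obj['Body'].read().decode('utf-8').splitlines():
        if not line.strip():
            continue
        item = {k: str(v) for k, v in json.loads(line).items()}  # force all attributes to strings
        table.put_item(Item=item)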
You need to manipulate each row in order to match the 'S' (string) format:
import json

for row in recList:
    row_dict = json.loads(row)
    # wrap every value as a DynamoDB string attribute
    ddb_row_dict = {k: {"S": str(v)} for (k, v) in row_dict.items()}
    response = dynamodb.put_item(TableName='test-abe', Item=ddb_row_dict)
Related
I have files hosted in an AWS S3 bucket, and I need all of the S3 object URLs in a CSV file. Please suggest an approach.
You can get all S3 object URLs by using the AWS SDK. First, list all the objects in the bucket. The following Java (AWS SDK for Java V2) code shows the logic, which you can port to Python:
ListObjectsRequest listObjects = ListObjectsRequest
.builder()
.bucket(bucketName)
.build();
ListObjectsResponse res = s3.listObjects(listObjects);
List<S3Object> objects = res.contents();
for (ListIterator iterVals = objects.listIterator(); iterVals.hasNext(); ) {
S3Object myValue = (S3Object) iterVals.next();
System.out.print("\n The name of the key is " + myValue.key());
}
Then iterate through the list and get the key as shown above. For each object, you can get the URL with code similar to this:
GetUrlRequest request = GetUrlRequest.builder()
.bucket(bucketName)
.key(keyName)
.build();
URL url = s3.utilities().getUrl(request);
System.out.println("The URL for "+keyName +" is "+url.toString());
Put each URL value into a collection and then write the collection out to a CSV. That is how you achieve your use case.
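If you prefer to stay in Python, a rough boto3 sketch of the same idea is below; the bucket name, URL format, and output handling are assumptions, and for private objects you would need presigned URLs instead:
import boto3

s3 = boto3.client('s3')
bucket_name = 'my-bucket'  # assumed bucket name

# list every object key in the bucket (handles pagination)
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])

# build the object URLs and collect them as CSV lines
region = s3.get_bucket_location(Bucket=bucket_name)['LocationConstraint'] or 'us-east-1'
csv_lines = ['key,url']
for key in keys:
    url = f"https://{bucket_name}.s3.{region}.amazonaws.com/{key}"
    csv_lines.append(f"{key},{url}")
csv_str = '\n'.join(csv_lines)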
I'm querying a relational database and I need the result as a CSV string. I can't save it to disk, as this is running in a serverless environment (I don't have access to disk).
Any ideas?
My solution was using PyGreSQL library and defining this function:
import pg

def get_csv_from_db(query, cols):
    """
    Given the SQL #query and the expected #cols,
    a string formatted CSV (containing headers) is returned
    :param str query:
    :param list of str cols:
    :return str:
    """
    connection = pg.DB(
        dbname=my_db_name,
        host=my_host,
        port=my_port,
        user=my_username,
        passwd=my_password)
    header = ','.join(cols) + '\n'
    records_list = []
    for row in connection.query(query).dictresult():
        record = []
        for c in cols:
            record.append(str(row[c]))
        records_list.append(",".join(record))
    connection.close()
    return header + "\n".join(records_list)
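A usage example (the table and column names here are hypothetical):
csv_str = get_csv_from_db("SELECT id, name FROM users", ["id", "name"])
# csv_str is something like "id,name\n1,Alice\n2,Bob"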
Unfortunately this solution expects the column names as input (which is not too bad IMHO) and iterates over the dictionary result in Python code.
Other solutions (especially out-of-the-box ones) using other packages are more than welcome.
This is another solution, based on psycopg2 and pandas:
import psycopg2
import pandas as pd

def get_csv_from_db(query):
    """
    Given the SQL #query, a string formatted CSV (containing headers) is returned
    :param str query:
    :return str:
    """
    conn = psycopg2.connect(
        dbname=my_db_name,
        host=my_host,
        port=my_port,
        user=my_username,
        password=my_password)
    cur = conn.cursor()
    cur.execute(query)
    df = pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])
    cur.close()
    conn.close()
    return df.to_csv()
I haven't had a chance to test it yet, though.
Here is a different approach from the other answers, using pandas.
I assume you already have a database connection.
I'm using an Oracle database in this example; the same can be done with the respective library for your relational DB.
These two lines do the trick:
df = pd.read_sql(query, con)
df.to_csv("file_name.csv")
Here is a full example using an Oracle database:
import cx_Oracle
import pandas as pd

dsn = cx_Oracle.makedsn(ip, port, service_name=service_name)
con = cx_Oracle.connect("user", "password", dsn)
query = """select * from YOUR_TABLE"""
df = pd.read_sql(query, con)
df.to_csv("file_name.csv")
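Since the question asks for the CSV as a string without touching disk, note that pandas supports that as well: to_csv() returns a string when no path is given.
csv_str = df.to_csv()  # no file path: the CSV is returned as a str instead of written to disk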
PyGreSQL's Cursor has a copy_to method. It accepts a file-like object as the stream (which must have a write() method). io.StringIO meets this condition and does not need access to disk, so it should be possible to do:
import io
csv_io = io.StringIO()
# here connect to your DB and get cursor
cursor.copy_to(csv_io, "SELECT * FROM table", format="csv", decode=True)
csv_io.seek(0)
csv_str = csv_io.read()
Explanation: many Python modules accept a file-like object, meaning you can use io.StringIO() or io.BytesIO() in place of true file handles. These mimic files opened in text and bytes mode respectively. As with files, there is a read position, so I seek back to the beginning after writing. The last line creates csv_str, which is a plain str. Remember to adjust the SQL query to your needs.
Note: I have not tested the above code; please try it yourself and report whether it works as intended.
I'm trying to write a Lambda function that is triggered whenever a JSON file is uploaded to an S3 bucket. The function is supposed to parse the file and store it immediately in DynamoDB. I created a table called 'data' with the primary key set as 'date'. Here's what I have for the function so far:
import boto3
import json

s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    json_file_name = event['Records'][0]['s3']['object']['key']
    json_object = s3.Bucket(bucket).Object(json_file_name)
    jsonFileReader = json_object['Body'].read()
    jsonDict = json.loads(jsonFileReader)
    table = dynamodb.Table('data')
    table.put_item(Item=jsonDict)
Here is an example of a json file I'm trying to use:
{
"date": "2020-06-07 21:00:34.284421",
"ConfirmedCases": 7062067,
"ActiveCases": 3206573,
"RecoveredCases": 3450965,
"Deaths": 404529
}
Unfortunately, whenever I test the code, it throws this error:
[ERROR] TypeError: string indices must be integers
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 7, in lambda_handler
    bucket = event['Records'][0]['s3']['bucket']['name']
Does anyone know how to resolve this issue? I've wasted so much time trying to figure this out and I still am unable to :/
Your error is from either of these lines.
bucket = event['Records'][0]['s3']['bucket']['name']
json_file_name = event['Records'][0]['s3']['object']['key']
However, they are correct. This is the valid way of accessing the bucket name and object key from the event generated by S3 notifications.
It seems to me that something else is triggering your function. Either you are using the Test option in the console and providing an incorrect event object, or some other event source is invoking the Lambda with a non-S3 event.
As a quick fix, you can do the following. The code below returns early if the event object does not contain Records:
def lambda_handler(event, context):
    if 'Records' not in event:
        print(event)  # good to check what event you actually get
        return
    # and the rest of the code
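To test from the console with a realistic payload, you can use a minimal S3 put event along these lines (the bucket and key names are placeholders; a real S3 notification contains more fields, but these are the ones your handler reads):
{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "my-test-bucket" },
        "object": { "key": "data.json" }
      }
    }
  ]
}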
So, let's say I have a JSON object stored in a variable named jsonobject and I want to read a specific property from it, say address.state. If I write jsonobject.address.state I get the expected output, but what if the property I am trying to find, i.e. address.state, is stored in a variable, say key?
So key = "address.state", and when I try jsonobject.key I get an error saying jsonobject has no attribute named key.
How can I implement this?
def main():
    # messagebody = '{"name":"vivke", "age":"26", "comname":"Infracloud", "address":{ "street":44, "state":"NY" } }'
    # I am HTTP POSTing the above message format to the function
    messagebody = request.get_data().decode("utf-8")
    key = "address.state"
    # convert messagebody to JSON
    jsondata = jsonparser.json2obj(messagebody)
    return jsondata.address.state  # this works fine
    return jsondata.key  # isn't working
Here is the code of jsonparser:
import json
from collections import namedtuple
def _json_object_hook(d): return namedtuple('X', d.keys())(*d.values())
def json2obj(data): return json.loads(data, object_hook=_json_object_hook)
The key in jsondata.key is the literal attribute name; under the hood it is something like jsondata.__getattr__("key"). It has nothing to do with your variable key = "address.state".
If you insist on that syntax, you would need to override __getattr__ to split "address.state" and call the superclass's __getattr__ for each part in turn.
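A simpler alternative (a sketch, not part of the jsonparser code above) is to resolve the dotted path with getattr instead of overriding anything:
from functools import reduce

def resolve(obj, dotted_key):
    # walk the attribute chain: "address.state" -> getattr(getattr(obj, "address"), "state")
    return reduce(getattr, dotted_key.split("."), obj)

state = resolve(jsondata, "address.state")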
I'm fetching some data from an API at regular intervals and want to store the JSON data in a database to access and use later.
From the API, I get data in this shape each time:
'{"data": {"cursor": null, "files": {"nodes": [{"code": "BOPhmYQg5Vm", "date": 1482244678, "counts": 2, "id": "1409492981312099686"}, {"code": "g5VmBOPhmYQ", "date": 1482244678, "counts": 5, "id": "1209968614094929813"}]}}}'
I can do json_data = json.loads(above_data) and then fetch the nodes as nodes_data = json_data["data"]["files"]["nodes"], which gives a list of nodes.
I want to store this nodes data in a DB column data = Column(db.Text) of Text type. Each time there will be 10-15 values in the nodes list.
How do I store it? There are multiple nodes, and I need it stored in a way that lets me later append more nodes to the data already in the column.
I would also like to be able to do json.loads(db_data_col) so that I get valid JSON back and can loop over all the nodes to use their data later.
I'm confused about how to store this in the DB and access it later in valid JSON format.
Edit 1: I'm using SQLite for testing and may use PostgreSQL in the future. The Text type of the column is the main point.
If you are using Django 1.8, you can create your own model field that stores JSON. This class will also make sure that you have valid JSON.
import json
import logging

from django.db import models

BAD_DATA = logging.getLogger("bad_data")  # assuming a logger for bad payloads

class JsonField(models.TextField):
    """
    Stores json-able python objects as json.
    """
    def get_db_prep_value(self, value, connection, prepared=False):
        try:
            return json.dumps(value)
        except TypeError:
            BAD_DATA.error(
                "cannot serialize %s to store in a JsonField", str(value)
            )
            return ""

    def from_db_value(self, value, expression, connection, context):
        if value == "":
            return None
        try:
            return json.loads(value)
        except TypeError:
            BAD_DATA.error("cannot load dictionary field -- type error")
            return None
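A usage sketch (the model and field names are hypothetical; JsonField is the class defined above):
from django.db import models

class ApiSnapshot(models.Model):
    created = models.DateTimeField(auto_now_add=True)
    data = JsonField()  # stores any json-able object, e.g. {"nodes": [...]}

# snapshot = ApiSnapshot.objects.create(data={"nodes": nodes})
# snapshot.data["nodes"]  # comes back as a Python list via from_db_value()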
I found a way to store the JSON data in the DB. Since I'm fetching nodes from a remote service that returns a list of nodes on every request, I need to build proper JSON to store in and retrieve from the DB.
Say the API returned JSON text such as: '{"cursor": null, "nodes": [{"name": "Test1", "value": 1}, {"name": "Test2", "value": 2}, ...]}'
So, first we need to access the nodes list:
data = json.loads(api_data)
nodes = data['nodes']
Now, for the first entry into the DB column, we do the following:
str_data = json.dumps({"nodes": nodes})
So str_data is a valid string, which we can store in the DB under a "nodes" key.
For the second and subsequent entries into the DB column, we do the following:
# get the data string from the DB column and load it into json
db_data = json.loads(db_col_data)
# get the new/latest 'nodes' data from the API as explained above
# append this data to the 'db_data' json
latest_data = db_data["nodes"] + new_api_nodes
# dump it back under the same "nodes" key so the format stays consistent
db_col_data = json.dumps({"nodes": latest_data})
# write it back to the DB column and commit
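Putting the two cases together, a small helper along these lines could do the merge (the function and variable names are hypothetical):
import json

def merge_nodes(db_col_data, new_api_nodes):
    """Return an updated JSON string for the DB column with the new nodes appended."""
    if db_col_data:
        existing = json.loads(db_col_data)["nodes"]
    else:
        existing = []  # first write: nothing stored yet
    return json.dumps({"nodes": existing + new_api_nodes})

# db_col_data = merge_nodes(db_col_data, nodes)  # then save the row and commit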
This is a proper way to load/dump the data from the DB while adding/removing JSON and keeping the format consistent.
Thanks!