I'm in the process of creating a table for AWS DynamoDB. All the documentation demonstrates the required JSON format entirely by hand... in my case, I want to create several tables, each with several columns - it seems inefficient to do this manually when I already know my column headers and their data types...
The boto3 website has a guide with the following snippet:
import boto3

# Get the service resource.
dynamodb = boto3.resource('dynamodb')

# Create the DynamoDB table.
table = dynamodb.create_table(
    TableName='users',
    KeySchema=[
        {
            'AttributeName': 'username',
            'KeyType': 'HASH'
        },
        {
            'AttributeName': 'last_name',
            'KeyType': 'RANGE'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'username',
            'AttributeType': 'S'
        },
        {
            'AttributeName': 'last_name',
            'AttributeType': 'S'
        },
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
Now I'm wondering: of course, if you had hundreds of columns/AttributeTypes in your data, you wouldn't want to sit there typing it all in. How can I automate this process with a loop? I have a general idea, but I'm coming from Java and I'm not yet proficient enough in Python to see the solution in this case.
Could anyone help? Thanks!
EDIT:
I worded this question horribly and had gotten too bogged down in the documentation to understand what I was asking about. I wanted a solution for automating the addition of data to a DynamoDB table using loops. I explain in my answer below.
So, unbeknownst to me at the time, the snippet shown in my question is only about defining the key schema - i.e. your primary/composite keys. What I wanted to do was add actual data to the table - and the examples of this were all done manually in the boto3 documentation.
To answer my own question: first, you obviously need to create the table and define the key schema - and that takes no time at all to do manually using the template shown in the question.
Note that boto3 will not accept float values... my solution was to convert them to str. Boto3 recommends using Decimal, but Decimal(str(value)) would not work for me, as it was being passed as the string Decimal(value) for some reason (can anyone explain?):
Passing a Decimal(str(value)) to a dictionary for raw value
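For anyone who wants to keep the values as numbers rather than strings, here is a minimal, hedged sketch of the Decimal approach; the helper name to_dynamo_types and the sample row are my own, not from the original post, and it assumes table is a boto3 Table resource:

from decimal import Decimal

def to_dynamo_types(row):
    # Convert float values to Decimal so DynamoDB will accept them as numbers.
    item = {}
    for key, value in row.items():
        if isinstance(value, float):
            value = Decimal(str(value))  # go through str() to avoid float-precision noise
        item[key] = value
    return item

# table.put_item(Item=to_dynamo_types({'username': 'janedoe', 'score': 99.5}))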
This is how I used pandas to import data from Excel and then automated putting that data into my table:
import math

import boto3
import pandas as pd

# pandas reads the Excel document into a DataFrame
df = pd.read_excel(fileRoute, dtype=object)  # fileRoute holds the path to the Excel file

# the column headers are allocated to the keys variable
keys = df.columns.values

# create a dynamodb resource and link it to an existing table in DynamoDB
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(name)  # name holds the existing table's name

# batch_writer() allows the addition of many items at once to the table
with table.batch_writer() as batch:
    dfvalues = df.values
    # loop through the rows to generate a dictionary for every row in the table
    for sublist in dfvalues:
        dicts = {}
        for ind, value in enumerate(sublist):
            if type(value) is float:
                if math.isnan(value):  # you might want to skip 'NaN' values
                    continue
                value = str(value)  # convert float values to str
            dicts[keys[ind]] = value
        batch.put_item(
            Item=dicts  # add the item to the DynamoDB table
        )
Related
I'm trying to figure out how to work better with JSON in Postgres.
I have a file that stores information about many tables (structure and values). The file is periodically updated, which may mean changes in the data as well as in the table structures. So I end up with something like dynamic tables.
As a result, I have a JSON table structure (the key is the column name, the value is the field type (string or number only)) and a list of JSON records for each table.
Something like this (the actual structure does not matter):
{
'table_name': 'table1',
'columns': {
'id': 'int',
'data1': 'string',
'data2': 'string'
},
'values': [
[1, 'aaa', 'bbb'],
[2, 'ccc', 'ddd']
]
}
At first I wanted to make a real table for each table in the file, truncate it when updating the data, and drop it if the table structure changes. The second option I'm testing now is a single table with JSON data:
CREATE TABLE IF NOT EXISTS public.data_tables
(
id integer NOT NULL,
table_name character varying(50),
row_data jsonb,
CONSTRAINT data_tables_pkey PRIMARY KEY (id)
)
And now there is the question of how to properly work with the JSON:
query row_data directly, e.g. row_data->>'id' = 1, with a hash index on the 'id' key
use jsonb_populate_record with custom types for each table (yes, I would need to recreate them each time the table structure changes)
probably some other way to work with it?
The first option is the easiest and fast because of the indexes, but there is no data type control and you have to spell out the JSON access in every query.
The second option is more difficult to implement, but easier to use in queries. I can even create views for each table with jsonb_populate_record. But as far as I can see, indexes won't work through the JSON function?
Perhaps there is a better way? Or is recreating tables not such a bad option?
Firstly, your JSON string is not in the correct format. Here is a corrected sample JSON string:
{
"table_name": "table1",
"columns": {
"id": "integer",
"data1": "text",
"data2": "text"
},
"values": [
{
"id": 1,
"data1": "aaa",
"data2": "bbb"
},
{
"id": 2,
"data1": "ccc",
"data2": "ddd"
}
]
}
I wrote a sample function for you, but only for creating the tables from the JSON. You can write SQL for the insert process in the same way; a rough sketch follows after the function.
Sample Function:
CREATE OR REPLACE FUNCTION dynamic_create_table()
 RETURNS boolean
 LANGUAGE plpgsql
AS $function$
declare
    rec record;
begin
    FOR rec IN
        select
            t1.table_name,
            string_agg(t2.pkey || ' ' || t2.pval || ' NULL', ', ') as sql_columns
        from data_tables t1
        cross join jsonb_each_text(t1.row_data->'columns') t2(pkey, pval)
        group by t1.table_name
    loop
        execute 'create table ' || rec.table_name || ' (' || rec.sql_columns || ')';
    END loop;
    return true;
END;
$function$;
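For the insert side mentioned above, here is a minimal, hedged Python/psycopg2 sketch (not part of the original answer). It assumes each data_tables.row_data holds a full descriptor in the corrected format above, including the 'values' list of objects, and the connection parameters are placeholders:

import psycopg2
from psycopg2.extras import execute_values

# Hypothetical connection parameters; adjust to your environment.
conn = psycopg2.connect(dbname='<dbname>', user='<user>',
                        password='<password>', host='localhost')

with conn, conn.cursor() as cur:
    cur.execute("SELECT table_name, row_data FROM data_tables")
    for table_name, row_data in cur.fetchall():
        columns = list(row_data['columns'].keys())
        # assumes the corrected format above, where 'values' is a list of objects
        rows = [tuple(rec[col] for col in columns) for rec in row_data['values']]
        if not rows:
            continue
        # identifiers cannot be passed as query parameters, so they are formatted in;
        # they come from your own file, so only do this with trusted input
        insert_sql = 'INSERT INTO {} ({}) VALUES %s'.format(table_name, ', '.join(columns))
        execute_values(cur, insert_sql, rows)

conn.close()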
I'm trying to query all the Food values in the "Categories" attribute together with "review_count" attribute values that are at least 100. This is my first time working with scanning tables in DynamoDB through Python, and I need to use the table.scan function as well. This is what I have tried so far.
resp = table.scan(FilterExpression='(categories = cat1) AND' + '(review_count >= 100)',
                  ExpressionAttributeValues={
                      ':cat1': 'Food',
                  })
Any help would be greatly appreciated. Thanks
Assuming the table name is test:
FilterExpression can't contain constants; it should only reference table attributes like categories, review_count and placeholders like :cat1, :rc. So 100 can be replaced with a placeholder :rc.
All placeholders should start with :, so cat1 should be :cat1.
table = dynamodb.Table('test')
response = table.scan(
    FilterExpression='categories = :cat1 AND review_count >= :rc',
    ExpressionAttributeValues={
        ':cat1': 'Food',
        ':rc': 100,
    }
)
data = response['Items']
An important point to note on scan, from the documentation:
A single Scan operation reads up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then applies any filtering to the results using FilterExpression.
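Because of that 1 MB limit, a scan may need to be paginated. A minimal, hedged sketch that reuses the table and filter from above and keeps calling scan with ExclusiveStartKey until LastEvaluatedKey is no longer returned:

items = []
scan_kwargs = {
    'FilterExpression': 'categories = :cat1 AND review_count >= :rc',
    'ExpressionAttributeValues': {':cat1': 'Food', ':rc': 100},
}
while True:
    response = table.scan(**scan_kwargs)
    items.extend(response['Items'])
    if 'LastEvaluatedKey' not in response:
        break
    # continue the scan where the previous page stopped
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']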
I started using DynamoDB recently and I am having problems fetching data by multiple keys.
I am trying to get multiple items from a table.
My table schema is defined as follows:
{
"AttributeDefinitions": [
{
"AttributeName": "id",
"AttributeType": "S"
},
{
"AttributeName": "date",
"AttributeType": "S"
}
],
"KeySchema": [
{
"AttributeName": "id",
"KeyType": "HASH"
},
{
"AttributeName": "date",
"KeyType": "RANGE"
}
],
...
}
I have a filter list of ids and a date range for each id:
[
{ "id": "abc", "start_date": "24/03/2020", "end_date": "26/03/2020" },
{ "id": "def", "start_date": "10/04/2020", "end_date": "20/04/2020" },
{ "id": "ghi", "start_date": "11/04/2020", "end_date": "11/04/2020" }
]
I need to fetch all items that match the filter list.
The problem is that I cannot use Query, as KeyConditionExpression only accepts a single partition key value (and I need to match the entire filter list):
The condition must perform an equality test on a single partition key value.
I cannot use BatchGetItem either, as it requires the exact key (and I need a date range for my sort key: Key('date').between(start_date, end_date)):
Keys - An array of primary key attribute values that define specific items in the table. For each primary key, you must provide all of the key attributes. For example, with a simple primary key, you only need to provide the partition key value. For a composite key, you must provide both the partition key value and the sort key value.
I am kind of lost...
Is there a way to fetch by multiple keys with a range query (by a single request - not multiple requests from a loop)?
Would you suggest any table changes?
You need to make one query per unique id. Each of these queries should include a key condition expression that has equality on the id partition key and a range of values on the date sort key, like this:
#id = :id AND #date BETWEEN :startdate AND :enddate
Don't use scan for this. As your table grows, performance will decline.
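As a rough, hedged sketch of that loop (assuming the filter list from the question is held in a variable called filters and table is the boto3 Table resource):

from boto3.dynamodb.conditions import Key

results = []
for f in filters:
    # one Query per id, with the date range applied to the sort key
    response = table.query(
        KeyConditionExpression=(
            Key('id').eq(f['id']) &
            Key('date').between(f['start_date'], f['end_date'])
        )
    )
    results.extend(response['Items'])

Note that BETWEEN compares the stored strings lexicographically, so a sortable date format such as YYYY-MM-DD works for range conditions, while DD/MM/YYYY generally does not.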
You can use table.scan to get multiple records. See the documentation here.
Here's some example code:
import boto3
from boto3.dynamodb.conditions import Attr

# Get the service resource.
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('tablename')

response = table.scan(
    FilterExpression=Attr('first_name').begins_with('J') & Attr('account_type').eq('super_user')
)
items = response['Items']
print(items)
Referring to this post https://stackoverflow.com/a/70494101/7706503, you could try PartiQL to get items by multiple partition keys in one request.
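As a rough, hedged sketch of what that call shape looks like with the low-level client (the table name and values here are hypothetical, and it is worth verifying how such a statement is executed against your key schema):

import boto3

client = boto3.client('dynamodb')

# '?' placeholders are bound in order from Parameters (low-level AttributeValue format)
response = client.execute_statement(
    Statement='SELECT * FROM "my-table" WHERE "id" IN [?, ?, ?]',
    Parameters=[{'S': 'abc'}, {'S': 'def'}, {'S': 'ghi'}],
)
items = response['Items']

This only covers the partition keys; the per-id date ranges from the question would still have to be applied separately.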
Right now I am able to connect to the URL API and my database. I am trying to insert data from the URL into the PostgreSQL database using psycopg2. I don't fully understand how to do this, and this is all I could come up with:
import urllib3
import json
import certifi
import psycopg2
from psycopg2.extras import Json

http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())

url = '<API-URL>'
headers = urllib3.util.make_headers(basic_auth='<user>:<password>')
r = http.request('GET', url, headers=headers)
data = json.loads(r.data.decode('utf-8'))

def insert_into_table(data):
    for item in data['issues']:
        item['id'] = Json(item['id'])

    with psycopg2.connect(database='test3', user='<username>', password='<password>', host='localhost') as conn:
        with conn.cursor() as cursor:
            query = """
                INSERT into
                    Countries
                    (revenue)
                VALUES
                    (%(id)s);
                """
            cursor.executemany(query, data)
            conn.commit()

insert_into_table(data)
So this code gives me a TypeError: string indices must be integers on cursor.executemany(query, data).
So I know that json.loads brings back an object and that json.dumps brings back a string. I wasn't sure which one I should be using, and I know I am completely missing something in how I'm targeting the 'id' value and inserting it into the query.
Also, a little about the API: it is very large and complex, and eventually I'll have to go down multiple trees to grab certain values. Here is an example of what I'm pulling from.
I am trying to grab "id" under "issues" and not the one under "issuetype":
{
"expand": "<>",
"startAt": 0,
"maxResults": 50,
"total": 13372,
"issues": [
{
"expand": "<>",
"id": "41508",
"self": "<>",
"key": "<>",
"fields": {
"issuetype": {
"self": "<>",
"id": "1",
"description": "<>",
"iconUrl": "<>",
"name": "<>",
"subtask": <>,
"avatarId": <>
},
First, extract ids into a list of tuples:
ids = list((item['id'],) for item in data['issues'])
# example ids: [('41508',), ('41509',)]
Next use the function extras.execute_values():
from psycopg2 import extras
query = """
INSERT into Countries (revenue)
VALUES %s;
"""
extras.execute_values(cursor, query, ids)
Why was I getting type errors?
The second argument of executemany(query, vars_list) should be a sequence, while data is an object whose elements cannot be accessed by integer indexes.
Why use execute_values() instead of executemany()?
Because of performance: the first function executes a single query with multiple arguments, while the second one executes as many queries as there are arguments.
Note that by default the third argument of execute_values() is a list of tuples, so we extracted the ids in exactly this way.
If you have to insert values into more than one column, each tuple in the list should contain all the values for a single inserted row. For example:
values = list((item['id'], item['key']) for item in data['issues'])
query = """
INSERT into Countries (id, revenue)
VALUES %s;
"""
extras.execute_values(cur, query, values)
If you're trying to get just the id and insert it into your table, you should try:
ids = []
for i in data['issues']:
    ids.append(i['id'])
Then you can pass your ids list to your cursor.executemany function.
The issue you have is not in the way you are parsing your JSON; it occurs when you try to insert it into your table using cursor.executemany().
data is a single object. Are you attempting to insert all of the data your fetch returns into your table at once, or are you trying to insert a specific part of the data (a list of issue IDs)?
You are passing data into your cursor.executemany call. data is an object. I believe you wish to pass data['issues'], which is the list of issues that you modified.
If you only wish to insert the ids into the table, try this:
def insert_into_table(data):
    with psycopg2.connect(database='test3', user='<username>', password='<password>', host='localhost') as conn:
        with conn.cursor() as cursor:
            query = """
                INSERT into
                    Countries
                    (revenue)
                VALUES
                    (%(id)s);
                """
            for item in data['issues']:
                item['id'] = Json(item['id'])
                cursor.execute(query, item)  # the %(id)s placeholder is filled from the item dict
            conn.commit()

insert_into_table(data)
If you wish to keep the efficiency of using cursor.executemany(), you need to create a list of the IDs, as the current object structure doesn't arrange them the way cursor.executemany() requires; a minimal sketch of that follows.
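A minimal, hedged sketch of that executemany() route; it reuses conn and cursor from the code above and uses a positional %s placeholder, since executemany() takes a sequence of parameter tuples:

# one single-element tuple per row to insert
ids = [(item['id'],) for item in data['issues']]

query = "INSERT INTO Countries (revenue) VALUES (%s);"
cursor.executemany(query, ids)
conn.commit()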
I am using Python 2.7, PyMongo and MongoDB. I'm trying to get rid of the default _id values in MongoDB. Instead, I want certain fields (columns) to be used as the _id.
For example:
{
"_id" : ObjectId("568f7df5ccf629de229cf27b"),
"LIFNR" : "10099",
"MANDT" : "100",
"BUKRS" : "2646",
"NODEL" : "",
"LOEVM" : ""
}
I would like to concatenate LIFNR+MANDT+BUKRS as 100991002646 and hash it to achieve uniqueness, then store it as the new _id.
But how much does hashing help with unique ids? And how do I achieve it?
I understand that Python's default hash function gives different results on different machines (32-bit / 64-bit). If that is true, how should I go about generating _ids?
But I need LIFNR+MANDT+BUKRS to be used in any case. Thanks in advance.
First, you can't update the _id field. Instead, you should create a new field and set its value to the concatenated string. To return the concatenated value you need to use the .aggregate() method, which provides access to the aggregation pipeline. The only stage in the pipeline is the $project stage, where you use the $concat operator, which concatenates strings and returns the concatenated string.
From there you then iterate the cursor and update each document using "bulk" operations.
bulk = collection.initialize_ordered_bulk_op()
count = 0

cursor = collection.aggregate([
    {"$project": {"value": {"$concat": ["$LIFNR", "$MANDT", "$BUKRS"]}}}
])

for item in cursor:
    bulk.find({'_id': item['_id']}).update_one({'$set': {'id': item['value']}})
    count = count + 1
    if count % 200 == 0:
        # flush every 200 operations and start a new bulk object
        bulk.execute()
        bulk = collection.initialize_ordered_bulk_op()

if count % 200 != 0:
    bulk.execute()
MongoDB 3.2 deprecates Bulk() and its associated methods so you will need to use the bulk_write() method.
from pymongo import UpdateOne

requests = []
for item in cursor:
    requests.append(UpdateOne({'_id': item['_id']}, {'$set': {'id': item['value']}}))
collection.bulk_write(requests)
Your documents will then look like this:
{'BUKRS': '2646',
'LIFNR': '10099',
'LOEVM': '',
'MANDT': '100',
'NODEL': '',
'_id': ObjectId('568f7df5ccf629de229cf27b'),
'id': '100991002646'}
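If you still want a hash rather than the raw concatenation (the question notes that Python's built-in hash() is not stable across machines), here is a minimal, hedged sketch using hashlib, which is deterministic everywhere; the helper name make_id is my own:

import hashlib

def make_id(doc):
    # deterministic digest of LIFNR + MANDT + BUKRS, stable across machines and runs
    concatenated = doc['LIFNR'] + doc['MANDT'] + doc['BUKRS']
    return hashlib.sha1(concatenated.encode('utf-8')).hexdigest()

# e.g. when inserting new documents:
# doc['_id'] = make_id(doc)
# collection.insert_one(doc)

The concatenated string itself is already unique whenever the field combination is unique, so hashing mainly buys you a fixed-length key.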