How to do a bulk insert without overwriting existing data - Pymongo?

I am trying to bulk insert data into MongoDB without overwriting existing data. I want to insert new data into the database if there is no match on the unique id (sourceID). Looking at the documentation for Pymongo I have written some code but cannot make it work. Any ideas as to what I am doing wrong?
db.bulk_write(UpdateMany({"sourceID"}, test, upsert=True))
db is the name of my database, sourceID is the unique ID of the documents that I don't want to overwrite in the existing data, and test is the array that I am trying to insert.

Either I don't understand your requirement or you misunderstand the UpdateMany operation. As per the documentation, this operation modifies existing documents (those matching the filter) and only inserts a new document if no documents match the filter and upsert=True. Are you sure you don't want to use the insert_many method?
Also, in your example, the first parameter, which should be the filter for the update, is not a valid query; a filter has to be in the form {"key": "value"}.
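One way to get "insert only when the sourceID is not already present" behaviour out of bulk_write is to send one UpdateOne per document with $setOnInsert and upsert=True. The sketch below is only an illustration, assuming test is a list of dicts that each carry a sourceID; the database and collection names are made up, and note that bulk_write is a method of a collection, not of the database object:
from pymongo import MongoClient, UpdateOne

client = MongoClient()
collection = client["mydb"]["mycollection"]  # hypothetical database/collection names

test = [{"sourceID": 1, "value": "a"}, {"sourceID": 2, "value": "b"}]  # example payload

operations = [
    UpdateOne(
        {"sourceID": doc["sourceID"]},  # match on the unique id
        {"$setOnInsert": doc},          # written only when no existing document matches
        upsert=True,
    )
    for doc in test
]
result = collection.bulk_write(operations, ordered=False)
print(result.upserted_count, "new documents inserted")
Documents whose sourceID already exists are left untouched, which matches the "do not overwrite" requirement; plain insert_many would instead need a unique index on sourceID plus error handling for duplicate key errors.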

Related

Relational DB - separate joined tables

Is there any way to join tables from a relational database and then separate them again?
I'm working on a project that involves modifying the data after having joined them. Unfortunately, modifying the data before the join is not an option. I would then want to separate the data according to the original schema.
I'm stuck at the separating part. I have metadata (python dictionary) with the information on the tables (primary keys, foreign keys, fields, etc.).
I'm working with Python, so if you have a solution in Python, it would be greatly appreciated. If an SQL solution works, that also helps.
Edit: Perhaps the question was unclear. To summarize, I would like to create a new database with a schema identical to the old one. I do not want to make any modifications to the original database. The data that makes up the new database must originally be in a single table (the result of a join of the old tables), because the operations that need to be run must be run on a single table; I cannot run these operations on individual tables, as the outcome would not be as desired.
I would like to know if this is possible and, if so, how can I achieve this?
Thanks!
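One possible way to approach the separating part, sketched under the assumption that the joined result fits in a pandas DataFrame and that the metadata dictionary maps each original table to its columns and primary key (all names below are hypothetical):
import pandas as pd

# Hypothetical metadata: original table name -> its columns (as they appear in
# the joined result) and its primary key.
metadata = {
    "customers": {"columns": ["customer_id", "name", "email"], "primary_key": ["customer_id"]},
    "orders": {"columns": ["order_id", "customer_id", "total"], "primary_key": ["order_id"]},
}

def split_joined(joined, metadata):
    """Rebuild one DataFrame per original table from the joined result."""
    tables = {}
    for table_name, info in metadata.items():
        # Project the joined rows down to this table's columns, then drop the
        # duplicates the join introduced, keyed on the table's primary key.
        subset = joined[info["columns"]].drop_duplicates(subset=info["primary_key"])
        tables[table_name] = subset.reset_index(drop=True)
    return tables
Each resulting DataFrame could then be written to the new database, for example with DataFrame.to_sql on a SQLAlchemy engine pointed at the new schema.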

How to get only specified columns from Dynamodb using python?

I have the below function to pull the required columns from DynamoDB, and it is working fine.
The problem is that it is pulling only a few rows from the table.
For example: the table has 26000+ rows but I'm able to get only 3000 rows here.
Did I miss anything?
from boto3.dynamodb.conditions import Key

def get_columns_dynamodb():
    try:
        response = table.query(
            ProjectionExpression="id, name, date",
            # Python's `and` evaluates to the right-hand operand here, so this
            # condition effectively queries only opco_type == 'cwp'.
            KeyConditionExpression=Key('opco_type').eq('cwc') and Key('opco_type').eq('cwp')
        )
        return response['Items']
    except Exception as error:
        logger.error(error)
In DynamoDB, there's no such thing as "select only these columns". Or, there sort of is, but the projection happens after the data is fetched from storage. The entire item is always fetched, and the entire item counts towards the various limits in DynamoDB, such as the 1 MB maximum for each response.
One way to solve this, is to write your data in a way that's more optimized for this query. Generally speaking, in DynamoDB, you optimize "queries" (in quotes, since they're more of a key/value read than a dynamic query with joins and selects etc) by writing optimized data.
So, when you write data to your table, you can either use a transaction to write companion items to the same or a separate table, or you can use DynamoDB streams to write the same data in a similar fashion, except async (i.e. eventually consistent).
Let's say you roll with two tables: one table, my_things, contains the full items; another table, my_things_for_query_x, has only the exact data you need for that query. That lets you read more items in each chunk, since the items in storage contain only the data you actually need in your situation.
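As a rough sketch of the transactional write described above, assuming a boto3 client, the two table names from this answer, and the attribute names from the question (the payload attribute is made up):
import boto3

client = boto3.client("dynamodb")

def put_thing(item_id, opco_type, name, date, payload):
    # Write the full item and its slimmed-down companion in one transaction,
    # so the query-optimized table never drifts out of sync with the main one.
    client.transact_write_items(
        TransactItems=[
            {
                "Put": {
                    "TableName": "my_things",
                    "Item": {
                        "opco_type": {"S": opco_type},
                        "id": {"S": item_id},
                        "name": {"S": name},
                        "date": {"S": date},
                        "payload": {"S": payload},
                    },
                }
            },
            {
                "Put": {
                    "TableName": "my_things_for_query_x",
                    # Only the attributes the query actually needs.
                    "Item": {
                        "opco_type": {"S": opco_type},
                        "id": {"S": item_id},
                        "name": {"S": name},
                        "date": {"S": date},
                    },
                }
            },
        ]
    )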

How to check if a list of primary keys already exist in DB in a single query?

In the DB I have a table called register that has mail-id as the primary key. I used to submit in bulk using session.add_all(), but sometimes some records already exist; in that case I want to separate the already existing records from the non-existing ones.
http://docs.sqlalchemy.org/en/latest/orm/session_api.html#sqlalchemy.orm.session.Session.merge
If all the objects you are adding to the database are complete (e.g. the new object contains at least all the information that existed for the record in the database) you can use Session.merge(). Effectively merge() will either create or update the existing row (by finding the primary key if it exists in the session/database and copying the state across from the object you merge). The crucial thing to take note of is that the attribute values of the object passed to merge will overwrite that which already existed in the database.
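A minimal sketch of the merge() approach, assuming a mapped Register class (a hypothetical name) whose mail_id column is the primary key, where records is a list of dicts and the name column is made up:
# Each object is inserted if its mail_id is new; otherwise the existing row's
# attributes are overwritten with the values from the merged object.
for record in records:
    session.merge(Register(mail_id=record["mail_id"], name=record["name"]))
session.commit()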
I think this is not so great in terms of performance, so if that is important, SQLAlchemy has some bulk operations. You would need to check existence for the set of primary keys that will be added/updated, and do one bulk insert for the objects which didn't exist and one bulk update for the ones that did. The documentation has some info on the bulk operations if it needs to be a high-performance approach.
http://docs.sqlalchemy.org/en/latest/orm/persistence_techniques.html#bulk-operations
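A sketch of that existence check, reusing the hypothetical Register class and assuming the incoming data is a list of dicts keyed by mail_id:
def split_existing(session, records):
    # One query: fetch only the primary keys that are already in the table.
    incoming_keys = [r["mail_id"] for r in records]
    existing_keys = {
        key for (key,) in
        session.query(Register.mail_id).filter(Register.mail_id.in_(incoming_keys))
    }
    new_records = [r for r in records if r["mail_id"] not in existing_keys]
    existing_records = [r for r in records if r["mail_id"] in existing_keys]
    return new_records, existing_records
new_records can then go through session.bulk_insert_mappings(Register, new_records) and existing_records through session.bulk_update_mappings(Register, existing_records).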
Use SQLAlchemy's inspector for this:
from sqlalchemy import inspect

inspector = inspect(engine)
inspector.get_primary_keys(table, schema)
Inspector "reflects" all primary keys and you can check agaist the returned list.

How to delete queried results from Splunk database?

My question is about deleting data from the Splunk DB.
My requirement:
I query Splunk based on a timestamp, with a "from date" and a "to date".
After I get the list of all event results between those timestamps, I want to delete this list of events from the Splunk database.
Each queried result is stored in the destination database, so I want to delete the queried data from the source Splunk DB so that my next query will not return repetitive results; I also want to free up storage space in the source Splunk DB.
Hence, what is an effective way to completely delete the queried result data from the source Splunk DB?
Thanks & Regards,
Dharmendra Setty
I'm not sure you can actually delete them to free up storage space.
As written here, what you can do is simply mask the results from ever showing up again in the next searches.
To do this, simply pipe your search results to the "delete" command.
BE CAREFUL: first make sure these really are the events you want to delete.
Example:
index=<index-name> sourcetype=<sourcetype-name> source=<source-name>
earliest="%m/%d/%Y:%H:%M:%S" latest="%m/%d/%Y:%H:%M:%S" | delete
Where
index=<index-name> sourcetype=<sourcetype-name> source=<source-name>
earliest="%m/%d/%Y:%H:%M:%S" latest="%m/%d/%Y:%H:%M:%S"
is the search query

Bigquery: how to preserve nested data in derived tables?

I have a few large hourly upload tables with RECORD fieldtypes. I want to pull select records out of those tables and put them in daily per-customer tables. The trouble I'm running into is that using QUERY to do this seems to flatten the data out.
Is there some way to preserve the nested RECORDs, or do I need to rethink my approach?
If it helps, I'm using the Python API.
It is now possible to preserve nested field structure in query results.... more here
Use the flatten_results flag in the bq util:
--[no]flatten_results: Whether to flatten nested and repeated fields in the result schema. If
not set, the default behavior is to flatten.
API Documentation
https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.query.flattenResults
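Since the question mentions the Python API: a hedged sketch using the current google-cloud-bigquery client (which is newer than this answer); flatten_results only applies to legacy SQL, and the project, dataset and table names below are made up:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True        # flatten_results only applies to legacy SQL
job_config.flatten_results = False      # keep nested/repeated RECORD fields intact
job_config.allow_large_results = True   # required for unflattened results with a destination table
job_config.destination = bigquery.TableReference.from_string(
    "my-project.daily_tables.customer_20240101"  # hypothetical destination table
)

query = "SELECT * FROM [my-project:hourly_uploads.events] WHERE customer_id = 42"
client.query(query, job_config=job_config).result()  # wait for the job to finish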
Unfortunately, there isn't a way to do this right now, since, as you realized, all results are flattened.
