How to delete queried results from Splunk database? - python

My question is about deleting data from a Splunk DB:
My requirement:
I query Splunk for events between a "from date" and a "to date" timestamp.
After I get the list of all events between those timestamps, I want to delete those events from the Splunk database.
Each queried result is stored in a destination database, so I want to delete the queried data from the source Splunk DB, both so that my next query does not return repetitive results and to free up storage space in the source Splunk DB.
Is there an effective way to completely delete the queried result data from the source Splunk DB?
Thanks & Regards,
Dharmendra Setty

I'm not sure you can actually delete them to free up storage space.
As written here, what you can do is simply mask the results from ever showing up again in the next searches.
To do this, simply pipe the "delete" command to your search query.
BE CAREFUL: First make sure these really are the events you want to delete
Example:
index=<index-name> sourcetype=<sourcetype-name> source=<source-name>
earliest="%m/%d/%Y:%H:%M:%S" latest="%m/%d/%Y:%H:%M:%S" | delete
Where
index=<index-name> sourcetype=<sourcetype-name> source=<source-name>
earliest="%m/%d/%Y:%H:%M:%S" latest="%m/%d/%Y:%H:%M:%S"
is the search query
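If you are driving this from Python (as the question's tag suggests), a rough sketch with the Splunk SDK for Python (splunklib) could look like the following. The host, credentials, index, sourcetype and time values are placeholders, the account running the search needs the can_delete role, and | delete only masks events from future searches - it does not reclaim disk space.
import splunklib.client as client

# Placeholder connection details - adjust for your deployment.
service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

# Placeholder index/sourcetype/time range; verify the search returns only
# the events you intend to delete before piping it to | delete.
query = ('search index=my_index sourcetype=my_sourcetype '
         'earliest="01/01/2024:00:00:00" latest="02/01/2024:00:00:00" '
         '| delete')

# A oneshot search blocks until the search (and the delete) has finished.
service.jobs.oneshot(query)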

Related

Super slow SQLAlchemy query to Google Cloud SQL on a very large table

I have a very large (and growing) table of URLs, and I want to query the table to check whether an item exists and return that item so I can edit it, or else choose to add a new item. The code below works but runs very slowly and, given the volume of queries I need to perform (several thousand per hour), is creating some issues. I haven't been able to find a better solution than the one below. I have a good sense of what is happening - it is loading the entire table every time - but there must be a faster way here.
Session = sessionmaker(bind=engine)
formatted_url = "%{}%".format(url)
matching_url = None
with Session.begin() as session:
    matching_url = session.query(Link.id).filter(Link.URL.like(formatted_url)).yield_per(200).first()
This works great if the URL exists and is recent, but especially if the URL isn't in the database at all, the process takes as long as one minute.
You are effectively running select id from link where url like '%formatted_url%' limit 1;
This needs a full table scan in the database.
If you are lucky, the row is still in memory or in the cache.
If not, or if the row does not exist, the database will need that full table scan.
If you are using PostgreSQL on Cloud SQL, this question will help you remediate the problem: PostgreSQL: Full Text Search - How to search partial words?
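If the backing database is PostgreSQL, one commonly suggested remedy for slow LIKE '%...%' lookups is a trigram index. A minimal sketch, assuming a link table with a url column (names inferred from the question's Link model, not confirmed):
from sqlalchemy import create_engine, text

# Placeholder connection string.
engine = create_engine("postgresql+psycopg2://user:password@host/dbname")

with engine.begin() as conn:
    # pg_trgm lets PostgreSQL serve LIKE '%...%' predicates from a GIN index
    # instead of a full table scan.
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS pg_trgm"))
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS ix_link_url_trgm "
        "ON link USING gin (url gin_trgm_ops)"
    ))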

How to do a bulk insert without overwriting existing data - Pymongo?

I am trying to bulk insert data into MongoDB without overwriting existing data. I want to insert new data into the database if there is no match on the unique id (sourceID). Looking at the documentation for Pymongo I have written some code, but I cannot make it work. Any ideas as to what I am doing wrong?
db.bulk_write(UpdateMany({"sourceID"}, test, upsert=True))
db is the name of my database, sourceID is the unique ID of the documents that I don't want to overwrite in the existing data, and test is the array that I am trying to insert.
Either I don't understand your requirement or you misunderstand the UpdateMany operation. As per the documentation, this operation modifies the existing documents that match the query, and only if no documents match the query and upsert=True does it insert new documents. Are you sure you don't want to use the insert_many method?
Also, in your example, the first parameter, which should be the filter for the update, is not a valid query; it has to be in the form {"key": "value"}.
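If the goal is to insert only the documents whose sourceID is not already present and leave existing ones untouched, one option is a bulk upsert keyed on sourceID using $setOnInsert. A minimal sketch - the database and collection names are placeholders, and test stands in for the array from the question:
from pymongo import MongoClient, UpdateOne

client = MongoClient()
collection = client["mydb"]["mycollection"]    # placeholder names

test = [{"sourceID": 1, "value": "a"},         # stand-in for the question's array
        {"sourceID": 2, "value": "b"}]

operations = [
    # Match on sourceID; $setOnInsert only writes the document when it is
    # inserted, so existing documents are left untouched.
    UpdateOne({"sourceID": doc["sourceID"]}, {"$setOnInsert": doc}, upsert=True)
    for doc in test
]
result = collection.bulk_write(operations, ordered=False)
print(result.upserted_count, "new documents inserted")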

"Not found: Table" for new bigquery table

I use the python sdk to create a new bigquery table:
tableInfo = {
    'tableReference': {
        'datasetId': datasetId,
        'projectId': projectId,
        'tableId': targetTableId
    },
    'schema': schema
}
result = bigquery_service.tables().insert(projectId=projectId,
                                          datasetId=datasetId,
                                          body=tableInfo).execute()
The result variable contains the created table information with etag, id, kind, schema, selfLink, tableReference and type - therefore I assume the table is created correctly.
Afterwards I even get the table, when I call bigquery_service.tables().list(...)
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
    projectId=projectId,
    datasetId=datasetId,
    tableId=targetTableId,
    body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing.
There were a lot of failures between 10:00 and 12:00 (times given in UTC).
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high QPS nature of the API. We also cache certain negative responses in order to protect against buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
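In the meantime, one possible workaround (an assumption, not an official pattern) is to retry the streaming insert with a short delay whenever the freshly created table is reported as missing. A rough sketch against the same client objects as in the question:
import time
from googleapiclient.errors import HttpError  # older installs: apiclient.errors

def insert_all_with_retry(bigquery_service, projectId, datasetId, tableId,
                          body, max_attempts=5, delay_seconds=5):
    for attempt in range(max_attempts):
        try:
            return bigquery_service.tabledata().insertAll(
                projectId=projectId,
                datasetId=datasetId,
                tableId=tableId,
                body=body).execute()
        except HttpError as err:
            # The cached "table not found" response usually clears after a
            # short wait; give up after max_attempts.
            if err.resp.status == 404 and attempt < max_attempts - 1:
                time.sleep(delay_seconds)
            else:
                raise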

AWS DynamoDB retrieve entire table

Folks,
Retrieving all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire results of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (a paginated example is sketched after this list).
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process events using AWS Lambda.
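For option 1, a minimal sketch with boto3 that pages through the scan via LastEvaluatedKey (the table name is a placeholder; the number attribute is taken from the question):
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('drivers')   # placeholder table name

all_drivers = []
scan_kwargs = {
    'ProjectionExpression': '#n',
    'ExpressionAttributeNames': {'#n': 'number'},   # attribute from the question
}
response = table.scan(**scan_kwargs)
all_drivers.extend(item['number'] for item in response['Items'])

# Scan returns at most 1 MB per call; follow LastEvaluatedKey to get the rest.
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **scan_kwargs)
    all_drivers.extend(item['number'] for item in response['Items'])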
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the table's complex hash-range primary key).
One cannot use Query without knowing the hash keys.
EDIT, as a bounty was added to this old question that asks:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't ask, via a single API call, for all hash keys of a table.
Even if you go and add a GSI you still can't get a DISTINCT hash count.
The way I would solve this is with de-normalization: keep another table with no range key and put every hash key there, alongside the main table. This adds housekeeping overhead to your application level (mainly when removing items), but solves the problem you asked about.

Is it possible to get the cursor for the beginning of a query so that a future run of the query can use it as an end_cursor?

I'm trying to render a page that starts with the first n entities of a given model and have the page regularly check for new items. Similar to how, say, twitter renders a page of tweets and then prompts you with "6 newer tweets" when they are available.
I currently do this by:
query = Item.all()
query.order("-created")
query.fetch(100)
cursor = query.cursor()
timestamp = time.time()
# render a page storing cursor and timestamp so it can request for older items
# & regularly check for newer ones.
When I receive a request for updates, with a timestamp, I run a query that filters out items before that timestamp.
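(For concreteness, that timestamp-based update check might look roughly like the following - a reconstruction rather than the original code, assuming created is a db.DateTimeProperty and the stored timestamp comes from time.time().)
import datetime

def fetch_newer_items(since_timestamp, limit=100):
    # Item is the model from the snippet above; `created` is assumed to be
    # a db.DateTimeProperty populated when the entity is written.
    since = datetime.datetime.utcfromtimestamp(since_timestamp)
    return Item.all().filter("created >", since).order("-created").fetch(limit)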
I was wondering if there was a way to do it as follows (and whether this would be faster):
grab a cursor for the beginning, let's call this update_cursor
fetch data, etc. as before.
When page asks for updates (and provides an update_cursor) perform same query but with query.with_cursor(end_cursor=update_cursor) and query.run() instead of fetch() so that it grabs me all the items until the point I started from previously.
I've tried to fetch 0 entities and look at the cursor, but that just gives me an empty string.
Is this possible (and if it is, is measurably faster than the timestamp method)? Any suggestions or advice?
What if, instead of taking the cursor at the beginning of the results, you take it at the end? That could leave you positioned at the newest entity of your model (the only thing I am not sure about is whether you get a valid cursor after the last element of the query). Then, when you are pulling updates, just start your query from that cursor and fetch the newest entities.
I have not tried this myself, but it could work; give it a try and let me know how it goes.
