Correct the sequence generator start number? - python

My PostgreSQL database is allocating ids that already exist. From what I read, this can be a problem with the sequence generator.
It seems I get sequence corruption often, with the sequence's start number falling before the last id in the database.
I know I can change the number in pgAdmin, but how can I auto-correct this behavior in production?
I'm using Python/Django; is it possible to catch the error somehow and reset the sequence?

For sequences it goes something like:
select setval('foo_id_seq', max(id), true) from foo;
for appropriate values of 'foo_id_seq', foo, and id.
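In Django you can automate this. Here is a minimal sketch (assuming a table foo with primary key id and a sequence named foo_id_seq; adjust the names for your own model) that catches the duplicate-key error, realigns the sequence, and retries once:

from django.db import IntegrityError, connection, transaction

def save_with_sequence_reset(obj):
    try:
        with transaction.atomic():
            obj.save()
    except IntegrityError:
        # realign the sequence with the current maximum id, then retry once
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT setval('foo_id_seq', COALESCE(MAX(id), 1), true) FROM foo"
            )
        obj.save()

Django also ships manage.py sqlsequencereset <app_label>, which prints the equivalent setval statements for every model in an app and is handy as a one-off fix.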

Related

How to know the position number of a document in MongoDB

What is the best way to look for a document’s position in a collection?
I'm using the following code, but it performs badly on my big collection of documents.
def get_top_func(user_score):
    # rank = number of documents with a higher score, plus one
    return db.collection.count_documents({'score': {'$gt': user_score}}) + 1
Considering your query and the fact that you mention it is slow, I am guessing that you don't have an index on the score field. Creating an index should make this query faster:
db.collection.createIndex( { score: -1 } )
After that, you should be able to run the query you have with better performance.
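If you are doing this from pymongo rather than the mongo shell, a rough equivalent would be (assuming db is your Database object):

from pymongo import DESCENDING

# one-time setup: a descending index on score lets the $gt count be answered
# from the index instead of scanning every document
db.collection.create_index([('score', DESCENDING)])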
You can have a field like myDocId whose value comes from a variable that functions as a counter.
That way, while inserting each document into the collection, you also store its number along with the document data.
Each document in a collection does have an identifier key, _id, that is added by MongoDB itself, but that won't tell you exactly which document it is (like the nth document), since it is made up of four parts:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
Also refer to https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
and also MongoDB Query, sort then take nth document for group.
So what you can do is use aggregate: apply your filters and then project as needed to get the nth element using $arrayElemAt.
https://docs.mongodb.com/manual/reference/operator/aggregation/arrayElemAt/
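A minimal pymongo sketch of that approach (the empty $match filter and the zero-based index n are placeholders you would adapt):

n = 9  # zero-based, i.e. the 10th document after sorting

pipeline = [
    {'$match': {}},                                    # your filters go here
    {'$sort': {'score': -1}},                          # highest score first
    {'$group': {'_id': None, 'docs': {'$push': '$$ROOT'}}},
    {'$project': {'nth': {'$arrayElemAt': ['$docs', n]}}},
]

result = list(db.collection.aggregate(pipeline))
nth_doc = result[0]['nth'] if result else None

Note that $push gathers the whole filtered, sorted set into a single array, so this is only practical when the matched set is reasonably small (a single document in the pipeline is capped at 16 MB).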

DynamoDB Querying in Python (Count with GroupBy)

This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.
The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.
I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.
So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.
Some issues I had with doing this...
First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.
Second, is the most efficient way to do this to go through each entry and manually check if it contains a mention of either keyword and then have a dictionary for each that maps the date/location/count?
EDIT:
Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?
EDIT 2:
Well, since the table is on my computer locally I should be OK with just using expensive operations like a Scan right?
So if I did something like this:
query = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    ProjectionExpression='id, text, date, geo',
    Limit=100)
And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?
EDIT3:
response = table.scan(
    FilterExpression=Attr('text').contains("Booze"),
    Limit=100)
# do something with this first batch
while 'LastEvaluatedKey' in response:
    response = table.scan(
        FilterExpression=Attr('text').contains("Booze"),
        Limit=100,
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    # do something with each batch of up to 100 entries
So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?
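For what it's worth, the per-state/per-day tally can be a plain dictionary; a hedged sketch follows, where latlon_to_state is a hypothetical helper you would supply to map a [latitude, longitude] pair to a state:

from collections import Counter

booze_counts = Counter()  # keyed by (state, day)

for item in response['Items']:
    state = latlon_to_state(item['geo'])   # e.g. 'TX'
    day = item['date'][:10]                # assumes an ISO-like date string
    booze_counts[(state, day)] += 1

# booze_counts[('TX', '2017-03-01')] -> number of "Booze" mentions that day

Writing the counter out with csv.writer at the end gives you the CSV you mentioned.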
EDIT 4
If I add:
ProjectionExpression='date, location, user, text'
into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?
NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)
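Concretely, a hedged sketch of that fix: declare placeholders for the reserved names in ExpressionAttributeNames and use the placeholders inside the expressions (the #dt/#loc/#usr/#txt aliases are arbitrary):

response = table.scan(
    FilterExpression='contains(#txt, :kw)',
    ProjectionExpression='#dt, #loc, #usr, #txt',
    ExpressionAttributeNames={
        '#dt': 'date',
        '#loc': 'location',
        '#usr': 'user',
        '#txt': 'text',
    },
    ExpressionAttributeValues={':kw': 'Booze'},
    Limit=100,
)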
Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to scan repeatedly until no LastEvaluatedKey is returned.
Refer to ExclusiveStartKey as well.
Scan
EDIT:
Yes, the code looks good. One thing to note: the result set won't always contain 100 items. Please refer to the Limit definition below (it is not the same as LIMIT in a SQL database).
Limit — (Integer) The maximum number of items to evaluate (not
necessarily the number of matching items). If DynamoDB processes the
number of items up to the limit while processing the results, it stops
the operation and returns the matching values up to that point, and a
key in LastEvaluatedKey to apply in a subsequent operation, so that
you can pick up where you left off. Also, if the processed data set
size exceeds 1 MB before DynamoDB reaches this limit, it stops the
operation and returns the matching values up to the limit, and a key
in LastEvaluatedKey to apply in a subsequent operation to continue the
operation. For more information, see Query and Scan in the Amazon
DynamoDB Developer Guide.

SQL Server stored procedure to insert called from Python doesn't always store data, but still increments identity counter

I've hit a strange inconsistency problem with SQL Server inserts using a stored procedure. I'm calling a stored procedure from Python via pyodbc by running a loop to call it multiple times for inserting multiple rows in a table.
It seems to work normally most of the time, but after a while it will just stop working in the middle of the loop. At that point even if I try to call it just once via the code it doesn't insert anything. I don't get any error messages in the Python console and I actually get back the incremented identities for the table as though the data were actually inserted, but when I go look at the data, it isn't there.
If I call the stored procedure from within SQL Server Management Studio and pass in data, it inserts it and shows the incremented identity number as though the other records had been inserted even though they are not in the database.
It seems I reach a certain limit on the number of times I can call the stored procedure from Python and it just stops working.
I'm making sure to disconnect after I finish looping through the inserts, and other stored procedures written in the same way and sent via the same database connection still work as usual.
I've tried restarting the computer with SQL Server and sometimes it will let me call the stored procedure from Python a few more times, but that eventually stops working as well.
I'm wondering if it is something to do with calling the stored procedure in a loop too quickly, but that doesn't explain why after restarting the computer, it doesn't allow any more inserts from the stored procedure.
I've done lots of searching online, but haven't found anything quite like this.
Here is the stored procedure:
USE [Test_Results]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[insertStepData]
    @TestCaseDataId int,
    @StepNumber nchar(10),
    @StepDateTime nvarchar(50)
AS
SET NOCOUNT ON;
BEGIN TRANSACTION

DECLARE @newStepId int

INSERT INTO TestStepData (
    TestCaseDataId,
    StepNumber,
    StepDateTime
)
VALUES (
    @TestCaseDataId,
    @StepNumber,
    @StepDateTime
)

SET @newStepId = SCOPE_IDENTITY();

SELECT @newStepId
FROM TestStepData

COMMIT TRANSACTION
Here is the method I use to call a stored procedure and get back the id number ('conn' is an active database connection via pyodbc):
def CallSqlServerStoredProc(self, conn, procName, *args):
    sql = """DECLARE @ret int
             EXEC @ret = %s %s
             SELECT @ret""" % (procName, ','.join(['?'] * len(args)))
    return int(conn.execute(sql, args).fetchone()[0])
Here is where I'm calling the stored procedure to do the inserts:
....
for testStep in testStepData:
    testStepId = self.CallSqlServerStoredProc(
        conn, "insertStepData", testCaseId,
        testStep["testStepNumber"], testStep["testStepDateTime"])
    conn.commit()
    time.sleep(1)
....
SET @newStepId = SCOPE_IDENTITY();
SELECT @newStepId
FROM StepData
looks mighty suspicious to me:
SCOPE_IDENTITY() returns numeric(38,0) which is larger than int. A conversion error may occur after some time. Update: now that we know the IDENTITY column is int, this is not an issue (SCOPE_IDENTITY() returns the last value inserted into that column in the current scope).
A SELECT into a variable doesn't guarantee its value if more than one record is returned. Besides, I don't get the idea behind overwriting the identity value we already have. In addition, the number of values returned by the last statement equals the number of rows in that table, which is increasing quickly; this is a likely cause of the degradation. In brief, the last statement is not just useless, it's detrimental.
The 2nd statement also makes these statements misbehave:
EXEC @ret = %s %s
SELECT @ret
Since the procedure doesn't RETURN anything but SELECTs instead, this chunk actually returns two result sets: 1) the @newStepId value (from the EXEC, yielded by the SELECT @newStepId <...>); 2) a single NULL (from SELECT @ret). fetchone() reads the first result set by default, so you don't notice this, but it doesn't help performance or correctness either.
Bottom line
Replace the 2nd statement with RETURN @newStepId.
Data not in the database problem
I believe it's caused by RETURN before COMMIT TRANSACTION. Make it the other way round.
In the original form, I believe it was caused by the long-running SELECT and/or possible side effects of that non-variable SELECT sitting inside a transaction.

Committing objects with long integers to MySQL with SQLAlchemy

I am trying to add a large integer to a MySQL table with SQLAlchemy. As this answer explains, you cannot pass Integer a length argument like you can String. So following that answer I've defined my column with mysql.INTEGER like so:
from sqlalchemy.dialects import mysql
uniqueid = Column(mysql.INTEGER(20))
When I try to commit an object with a 14-digit uniqueid, however, I get the following error message: DataError: (DataError) (1264, "Out of range value for column 'uniqueid' at row 1"). When I try a shorter integer that is not a long, it has no problem committing the same object to the SQL database. I am running Python 2.7; other discussions of the long type indicate that it should not behave any differently than int except for printing an L at the end of the number. One final piece of information: if I set the uniqueid to the same short number but make it a long, as in uniqueid = long(32423), I can still commit the object to the SQL database.
I did not solve the mystery of why the mysql.INTEGER class will not work with numbers that have to be long in Python 2.7, but the practical solution is to use SQLAlchemy's BigInteger class, which, as the name suggests, can handle big integers, including long. (Note that the 20 in mysql.INTEGER(20) is only a display width; the underlying column is still a 32-bit INT, so a 14-digit value is out of range no matter how Python represents it.)
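For example, a minimal sketch of the working column definition (BigInteger maps to MySQL's 64-bit BIGINT, which comfortably holds a 14-digit value):

from sqlalchemy import BigInteger, Column

uniqueid = Column(BigInteger)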

Return the column names from an empty MySQL query result

I'm using Python 3.2.3, with the MySQL/Connector 1.0.7 module. Is there a way to return the column names, if the MySQL query returns an empty result?
For example. Say I have this query:
SELECT
    `nickname` AS `team`,
    `w` AS `won`,
    `l` AS `lost`
WHERE `w` > '10'
Yet, if there's nobody over 10, it returns nothing, obviously. Now, I know I can check if the result is None, but, can I get MySQL to return the column name and a NULL value for it?
If you're curious, the reason I'm wondering if this is possible is that I'm dynamically building dicts based on the column names. So the above would end up looking something like this if nobody was over 10...
[{'team':None,'won':None,'lost':None}]
And looks like this, if it found 3 teams over 10...
[{'team':'Tigers','won':14,'lost':6},
{'team':'Cardinals','won':12,'lost':8},
{'team':'Giants','won':15,'lost':4}]
If this kind of thing is possible, then I won't have to write a ton of exception checks all over the code for empty dicts.
You could run a DESC table_name first; you should get the column names in its first column.
Also, you already know the keys of the dict, so you can construct it yourself and then append to it if the result has values.
[{'team':None,'won':None,'lost':None}]
But I fail to see why you need this. If you have a list of dictionaries, I am guessing you will loop over it, and a for loop does nothing with an empty list, so you would not have to bother with exception checks.
If you have to do something like result[0]['team'], then you should definitely check that len(result) > 0 first.
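A hedged sketch of that approach with mysql.connector ("standings" is a hypothetical table name, since the original query omits its FROM clause, and the aliases are re-applied by hand because DESC only reports the raw column names):

import mysql.connector

conn = mysql.connector.connect(user='user', password='secret', database='stats')
cursor = conn.cursor()

# DESC lists the raw column names in its first output column, if you need them
cursor.execute("DESC `standings`")
raw_columns = [row[0] for row in cursor.fetchall()]

cursor.execute(
    "SELECT `nickname` AS `team`, `w` AS `won`, `l` AS `lost` "
    "FROM `standings` WHERE `w` > 10"
)
rows = cursor.fetchall()

if rows:
    result = [{'team': t, 'won': w, 'lost': l} for (t, w, l) in rows]
else:
    # empty result: fall back to the keys you already know
    result = [{'team': None, 'won': None, 'lost': None}]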
