I developed a script on python and sqlalchemy to get and update the last activity of my active users.
But the users are increasing a lot, now i´m getting the following error
psycopg2.ProgrammingError: Statement is too large. Statement Size: 16840277 bytes. Maximum Allowed: 16777216 bytes
I was thinking if I update the file postgres.conf it will work, so with the help of pgtune I updated the file, but it does not work, so I updated my kernel on /etc/syslog.conf, with the following parameters
kern.sysv.shmmax=4194304
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.sysv.shmall=1024
and again it does not work.
After that I divide my query into slices to reduce the size but I got the same error.
How can know what parameter I need to update, to increase the size of my statement?
Workflow
query = "SELECT id FROM {}.{} WHERE status=TRUE".format(schema, customer_table)
ids = ["{}".format(i)for i in pd.read_sql(query, insert_uri).id.tolist()]
read_query = """
SELECT id,
MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) lastactivity
FROM activity WHERE
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE))-{} and
id in ({})
GROUP BY id
""".format(day, ",".join(ids))
last_activity = pd.read_sql(read_query, read_engine, parse_dates=True)
If you are only fetching the IDs from the database and not filtering them by any other way, there is no need to fetch them at all, you can just insert the SQL statement as a subquery into the second:
SELECT id,
MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) lastactivity
FROM activity WHERE
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE))-%s and
id in (
SELECT id FROM customerschema.customer WHERE status=TRUE
)
GROUP BY id
Also, as Antti Haapala said, don't use string formatting for SQL parameters, because it is insecure and if any parameter contains appropriate quotes, postgres will interpret them as commands instead of data.
Related
I am making a script, that should create a schema for each customer. I’m fetching all metadata from a database that defines how each customer’s schema should look like, and then create it. Everything is well defined, the types, names of tables, etc. A customer has many tables (fx, address, customers, contact, item, etc), and each table has the same metadata.
My procedure now:
get everything I need from the metadataDatabase.
In a for loop, create a table, and then Alter Table and add each metadata (This is done for each table).
Right now my script runs in about a minute for each customer, which I think is too slow. It has something to do with me having a loop, and in that loop, I’m altering each table.
I think that instead of me altering (which might be not so clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it reads the table name as a string in a string..):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating tables with a serial type (i.e., autonumber) ID field and then use alter table for all other fields by using a combination of sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular format for data types which are not literals in SQL statement.
from psycopg2 import sql
# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm = sql.Identifier("tester"),
tbl = sql.Identifier("table")))
# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""
for item in items:
# KEEP IDENTIFIER PLACEHOLDERS
final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=i[1])
cur.execute(sql.SQL(final_query).format(shm = sql.Identifier("tester"),
tbl = sql.Identifier("table"),
col = sql.Identifier(item[0]))
Alternatively, use str.join with list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
"id" serial,
{vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
shm = sql.Identifier("tester"),
tbl = sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
"id" serial,
"last_seen" date,
"valid_from" timestamp
)
However, this becomes heavy as there are too many server roundtrips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error. Rather, I get an SQL syntax error. A values list is for conveying data. But ALTER TABLE is not about data, it is about metadata. You can't use a values list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes. And you can't have a comma between name and type. And you can't have parentheses around each pair. And each pair needs to be introduced with "ADD", you can't have it just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to it tell it to use quote_ident.
Not only is execute_values the wrong tool for the job, but I think python in general might be as well. Why not just load from a .sql file?
I am doing inserts via Streaming. In the UI, I can see the following row counts:
Is there a way to get that via the API? Current when I do:
from google.cloud import bigquery
client = bigquery.Client()
dataset = client.dataset("bqtesting")
table = client.get_table(dataset.table('table_streaming'))
table.num_rows
0
Obviously 0 is not the number that I'm looking to get. From the API documentation it says:
numRows unsigned long [Output-only] The number of rows of data in this table, excluding any data in the streaming buffer.
So then, my question is: how do we get the exact number of rows in a table? Currently I'm doing:
count=[item[0] for item in client.query('SELECT COUNT(*) FROM `bqtesting.table_streaming`').result()][0]
But this takes about 5s just to get the count (and I need to execute this query quite frequently to see if all streaming inserts have 'finished').
select count(1) and select count(*) etc have 0 scanned and billed bytes (you can see this in the job metadata after you run it or in a dry run) so you should be able to run those as often as you like
if i'm reading the documentation correctly, the numbers there are not guaranteed to give you rows in the buffer which have not yet been flushed to big-query storage
you can also use the API mentioned here https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability i.e. check the streamingBuffer.oldestEntryTime field from the tables.get result
You can use the __TABLES__ metadata table to get the information that you want. Querying __TABLES__ incurs no charges.
The query that you need is:
SELECT table_id, row_count, size_bytes
FROM `your-project-name.bqtesting.__TABLES__`
WHERE STARTS_WITH(table_id, "table_streaming")
ORDER BY table_id DESC
I am trying to construct an insert statement that is built from the results of a query. I run a query that retrieves results from one database and then creates an insert statement from the results and inserts that into a different database.
The server that is initially queried only returns those fields in the reply which are populated and this can differ from record to record. The destination database table has all of the possible fields available. This is why I need to construct the insert statement on the fly for each record that is retrieved and why I cannot use a default list of fields as I have no control over which ones will be populated in the response.
Here is a sample of the code, I send off a request for the T&C for an isin and the response is a name and value.
fields = []
data = []
getTCQ = ("MDH:T&C|"+isin+"|NAME|VALUE")
mdh.execute(getTCQ)
TC = mdh.fetchall()
for values in TC:
fields.append(values[0])
data.append(values[1])
insertQ = ("INSERT INTO sp_fields ("+fields+") VALUES ('"+data+"')")
The problem is with the fields part, mysql is expecting the following:
INSERT INTO sp_fields (ACCRUAL_COUNT,AMOUNT_OUTSTANDING_CALC_DATE) VALUES ('030/360','2014-11-10')
But I am getting the following for insertQ:
INSERT INTO sp_fields ('ACCRUAL_COUNT','AMOUNT_OUTSTANDING_CALC_DATE') VALUES ('030/360','2014-11-10')
and mysql does not like the ' ' around the fields names.
How do I get rid of these? so that it looks like the 1st insertQ statement that works.
many thanks in advance.
You could use ','.join(fields) to create the desired string (without quotes around each field).
Then use parametrized sql and pass the values as the second argument to cursor.execute:
insertQ = ("INSERT INTO sp_fields ({}) VALUES ({})".format(
','.join(fields), ','.join(['%s']*len(dates)))
cursor.execute(insertQ, dates)
Note that the correct placemarker to use, e.g. %s, depends on the DB adapter you are using. MySQLdb uses %s, but oursql uses ?, for instance.
I am trying something like
select customer_id, order_id from order_table where purchase_id = 10 OR
purchase_id = 25 OR
...
purchase_id = 25432;
Since the query is too big, I am running to variety of problems... if I run the entire query in a single line, I am running into the error:
SP2-0027: Input is too long (> 2499 characters) - line ignored
If split the query to multiple lines, the query gets corrupted, due to the interference with line numbers printed for each line of the entered query. If I disable line numbers, SQL> prompt at each line is troubling me.
Same error if run the query from a text file SQL> #query.sql
(I did not face such issues with mysql in the past but with sqlplus now).
I am not an expert in shell-script nor in python. It would be of great help if I can get pointers on how I can put all the purchase_ids in a text file, one purchase_id per line and supply it to sqlplus query at script-runtime.
I did sufficient research, but I still appreciate pointers as well.
1) Syntax change:
Try to use 'in (10,25,2542, ...)' instead of a series of 'OR'. It can reduce the size of the sql statement
2) Logic change:
Syntax may delay the inevitable, but the exception will still occur if there are a lot of id to exclude.
2a)
A straight-forward fix is to break the query down into batches. You can issue a select query per 50 purchase IDs until all IDs are covered.
2b)
Or you can look into a more generalised way to retrieve the same query result. Let's assume what you actually want to see is a list of 'unconfirmed order'. Then instead of a using a set of purchase IDs in the where clause, you can add a boolean field 'confirmed' to the order_table and select based on this criteria.
another idea:
Create a table "query_ids" (one column) and input all your order_id from the WHERE clause.
New query would be:
select customer_id, order_id from order_table where purchase_id = ( select * from query_ids);
I want to run various select query 100 million times and I have aprox. 1 million rows in a table. Therefore, I am looking for the fastest method to run all these select queries.
So far I have tried three different methods, and the results were similar.
The following three methods are, of course, not doing anything useful, but are purely for comparing performance.
first Method:
for i in range (100000000):
cur.execute("select id from testTable where name = 'aaa';")
second method:
cur.execute("""PREPARE selectPlan AS
SELECT id FROM testTable WHERE name = 'aaa' ;""")
for i in range (10000000):
cur.execute("""EXECUTE selectPlan ;""")
third method:
def _data(n):
cur = conn.cursor()
for i in range (n):
yield (i, 'test')
sql = """SELECT id FROM testTable WHERE name = 'aaa' ;"""
cur.executemany(sql, _data(10000000))
And the table is created like this:
cur.execute("""CREATE TABLE testTable ( id int, name varchar(1000) );""")
cur.execute("""CREATE INDEX indx_testTable ON testTable(name)""")
I thought that using the prepared statement functionality would really speed up the queries, but as it seems like this will not happen, I thought you could give me a hint on other ways of doing this.
This sort of benchmark is unlikely to produce any useful data, but the second method should be fastest, as once the statement is prepared it is stored in memory by the database server. Further calls to repeat the query do not require the text of the query to be transmitted, so saving a small about of time.
This is likely to be moot as the query is very small (likely the same quantity of packets over the wire as repeating sending the query text), and the query cache will serve the same data for every request.
What's the purpose of retrieving such amount of data at once? I don't know your situation, but I'd definitely page the results using limit and offset. Take a look at:
7.6. LIMIT and OFFSET
If you just want to benchmark SQL all on it's own and not mix Python into the equation try pgbench.
http://developer.postgresql.org/pgdocs/postgres/pgbench.html
Also what is your goal here?