syntax error when attempting to insert data into postgresql - python

I am attempting to insert parsed dta data into a PostgreSQL database, with each variable going into a separate table, and it was working until I added in the second column, "recodeid_fk". The error I now get when attempting to run this code is: pg8000.errors.ProgrammingError: ('ERROR', '42601', 'syntax error at or near "imp"').
Eventually, I want to be able to parse multiple files at the same time and insert the data into the database, but if anyone could help me understand what's going on now it would be fantastic. I am using Python 2.7.5, the statareader is from the pandas 0.12 development version, and I have very little experience in Python.
dr = statareader.read_stata('file.dta')
a = 2
t = 1
for t in range(1, 10):
    z = str(t)
    for date, row in dr.iterrows():
        cur.execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES({}, {})".format(z, str(row[a]), 29))
    a += 1
    t += 1
conn.commit()
cur.close()
conn.close()

To your specific error...
The syntax error probably comes from string values inserted via {} that need quotes around them. execute() can take care of this for you automatically. Replace
execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES({}, {})".format(z, str(row[a]), 29))
with
execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES(%s, %s)".format(z), (row[a], 29))
The table name is completed the same way as before, but the values will be filled in by execute, which inserts quotes if they are needed. Maybe execute could fill in the table name too, and we could drop format entirely, but that would be an unusual usage, and I'm guessing execute might (wrongly) put quotes in the middle of the name.
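Applied to the loop from the question, that change would look roughly like this (a sketch only; the table name still goes through format, while the values go through execute()'s placeholders):
for t in range(1, 10):
    z = str(t)
    for date, row in dr.iterrows():
        # the values are passed separately, so execute() quotes them as needed
        cur.execute(
            "INSERT INTO tblv00{} (data, recodeid_fk) VALUES (%s, %s)".format(z),
            (row[a], 29)
        )
    a += 1
conn.commit()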
But there's a nicer approach...
Pandas includes a function for writing DataFrames to SQL tables. PostgreSQL is not yet supported, but in simple cases you should be able to pretend that you are connected to a sqlite or MySQL database and have no trouble.
What do you intend with z here? As it is, you loop z from '1' to '9' before proceeding to the next for loop. Should the loops be nested? That is, did you mean to insert the contents of dr into nine different tables called tblv001 through tblv009?
If you meant that loop to put different parts of dr into different tables, please check the indentation of your code and clarify it.
In either case, the link above should take care of the SQL insertion.
Response to Edit
It seems like t, z, and a are doing redundant things. How about:
import pandas as pd
import string
...
# Loop through columns of dr, and count them as we go.
for i, col in enumerate(dr):
    table_name = 'tblv' + string.zfill(i, 3)  # e.g., tblv001 or tblv010
    df1 = pd.DataFrame(dr[col]).reset_index()
    df1.columns = ['data', 'recodeid_fk']
    pd.io.sql.write_frame(df1, table_name, conn)
I used reset_index to make the index into a column. The new (sequential) index will not be saved by write_frame.

Related

What SQL statement do I need to insert data into multiple rows that are currently 'null'?

I have data in this format that I want to enter into a column in my MySQL database. The column I created is empty, with NULL values.
[(1958000,),
(261250,),
(205480,),
(140580,),
(804320,),
(652653,),
(484990,),]
I am using Python to insert the data using the following code:
sql = "INSERT INTO city(Population) VALUES(%s)"
new_column_data = [(1958000,), (261250,), ...]  # the data listed above
mycursor.executemany(sql, new_column_data)
When I run this, it inserts the data into new rows starting below all the NULL values in Population.
Would something like this work?
sql = "INSERT INTO city(Population) VALUES(%s) WHERE Population = 'null'"
Any help would be greatly appreciated!
The quickest solution for you is to remove the rows with NULL Population values before inserting the new data.
You would have to do this by running something like:
DELETE c FROM city AS c WHERE c.Population IS NULL
Then run your INSERT.
However, if you are trying to UPDATE existing records in city whose Population field is NULL, then you need to write a completely different statement.
Your data set will need to include a key, kind of like updating a Python dictionary.
dataset = [
    (1958000, <key_value>),
    (261250, <key_value>),
    (205480, <key_value>),
    (140580, <key_value>),
    (804320, <key_value>),
    (652653, <key_value>),
    (484990, <key_value>),
]
Using these tuples, you will need to create an UPDATE statement. How exactly this works depends on the Python library you are using, but it would be something along the lines of this:
sql = "UPDATE city SET Population = %s WHERE <key_column> = %s"
mycursor.executemany(sql, dataset)
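A fuller sketch of that flow (not from the original answer), assuming mysql.connector and a hypothetical key column named ID - substitute your real key column:
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user", password="secret", database="mydb")
cur = conn.cursor()

# (population, key) pairs; ID is a hypothetical stand-in for your real key column
dataset = [(1958000, 1), (261250, 2), (205480, 3)]

cur.executemany("UPDATE city SET Population = %s WHERE ID = %s", dataset)
conn.commit()  # the updates are not visible until committed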

Inserting data from a CSV file to postgres using SQL

Struggling with this Python issue as I'm new to it, and I don't have significant experience in the language. I currently have a CSV file containing around 20 headers and the same number of rows, so listing each one out like some of the examples here is what I'm trying to avoid:
https://www.dataquest.io/blog/loading-data-into-postgres/
My code consists of the following so far:
with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        cur.execute('INSERT INTO messages VALUES', (row))
I'm getting a syntax error at the end of the input, so I assumed it is linked to the way my execute method has been written, but I still don't know what to do to address the issue. Any help?
P.S. I understand that people use %s for this, but if that is the case, can it be avoided, since I don't want to have it duplicated in a line 20 times?
Basically, you DO have to specify at least the required placeholders - and preferably the field names too - in your query.
If it's a one-shot affair and you know which fields are in the CSV and in which order, then you simply hardcode them in the query, i.e.
SQL = "insert into tablename(field1, field2, field21) values(%s, %s, %s)"
Ok, for 20 or so fields it gets quite boring, so you can also use a list of field names to generate the fieldnames part and the placeholders:
fields = ["field1", "field2", "field21"]
placeholders = ["%s"] * len(fields) # list multiplication, yes
SQL = "insert into tablename({}) values({})".format(", ".join(fields), ", ".join(placeholders))
If by chance the CSV header row contains the exact field names, you can also just use this row as value for fields - but you have to trust the csv then.
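A minimal sketch of that idea (assuming the header row really does match the table's column names and the CSV can be trusted):
import csv

with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    fields = next(reader)  # the header row becomes the field list
    placeholders = ", ".join(["%s"] * len(fields))
    sql = "insert into messages({}) values({})".format(", ".join(fields), placeholders)
    for row in reader:
        cur.execute(sql, row)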
NB: specifying the fields list in the query is not strictly required but it can protect you from possible issues with a malformed csv. Actually, unless you really trust the source (your csv), you should actively validate the incoming data before sending them to the database.
NB2:
%s is for strings I know but would it work the same for timestamps?
In this case, "%s" is not used as a Python string format specifier but as a plain database query placeholder. The choice of the string format specifier here is really unfortunate as it creates a lot of confusion. Note that this is DB-vendor specific though; some vendors use "?" instead, which is much clearer IMHO (and you want to check your own db-api connector's doc for the correct placeholder to use, BTW).
And since it's not a string format specifier, it will work for any type and doesn't need to be quoted for strings, it's the db-api module's job to do proper formatting (including quoting etc) according to the db column's type.
While we're at it, by all means, NEVER directly use Python string formatting operations when passing values to your queries - unless you want your database to be open-bar for script-kiddies of course.
The problem lies on the insert itself:
cur.execute('INSERT INTO messages VALUES', (row))
The problem is that, since you are not defining any parameters in the query, it is interpreted as if you literally want to execute INSERT INTO messages VALUES with no parameters, which causes a syntax error; passing a single parameter won't work either, since it will be understood as one parameter instead of multiple parameters.
If you want to create parameters in a more dynamic way, you could try to construct the query string dynamically.
Please, take a look the documentation: http://initd.org/psycopg/docs/cursor.html#cursor.execute
You can use string multiplication.
import csv
import psycopg2

conn = psycopg2.connect('postgresql://db_user:db_user_password@server_name:port/db_name')
cur = conn.cursor()
multiple_placeholders = ','.join(['%s'] * 20)

with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        cur.execute('INSERT INTO public.messages VALUES (' + multiple_placeholders + ')', row)

conn.commit()
If you want to have a single placeholder that covers a whole list of values, you can use a different method, located in "extras", which covers that usage:
psycopg2.extras.execute_values(cur, 'INSERT INTO messages VALUES %s', (row,))
This method can take many rows at a time (which is good for performance), which is why you need to wrap your single row in (...,).
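For example, a sketch that hands execute_values the whole file at once (reusing cur, conn and the messages table from above):
import csv
from psycopg2.extras import execute_values

with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = [tuple(r) for r in reader]

execute_values(cur, 'INSERT INTO messages VALUES %s', rows)  # many rows in one round trip
conn.commit()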
The last time I was struggling to insert CSV data into Postgres, I used pgAdmin and it worked. I don't know whether this answer counts as a solution, but it is an easy way to get it done.
You can use the cursor and executemany so that you can skip the iteration, but it is slower than the string-joining parameterized approach.
import pandas as pd

df = pd.read_csv('dummy-data.csv')
df.columns = [<define the headers here>]  # You can skip this line if headers match column names

try:
    cursor.prepare("insert into public.messages(<Column Names>) values(:1, :2, :3, :4, :5)")
    cursor.executemany(None, df.values.tolist())
    conn.commit()
except:
    conn.rollback()

In Python, display output of SQL query as a table, just like it does in SQL

This seems like a basic function, but I'm new to Python, so maybe I'm not googling this correctly.
In Microsoft SQL Server, when you have a statement like
SELECT top 100 * FROM dbo.Patient_eligibility
you get a result like
Patient_ID | Patient_Name | Patient_Eligibility
67456 | Smith, John | Eligible
...
etc.
I am running a connection to SQL through Python as such, and would like the output to look exactly the same as in SQL. Specifically - with column names and all the data rows specified in the SQL query. It doesn't have to appear in the console or the log, I just need a way to access it to see what's in it.
Here is my current code attempts:
import pyodbc

conn = pyodbc.connect(connstr)
cursor = conn.cursor()
sql = "SELECT top 100 * FROM [dbo].[PATIENT_ELIGIBILITY]"
cursor.execute(sql)
data = cursor.fetchall()

# Query1
for row in data:
    print(row[1])

# Query2
print(data)

# Query3
data
My understanding is that somehow the results of PATIENT_ELIGIBILITY are stored in the variable data. Query 1, 2, and 3 represent methods of accessing that data that I've googled for - again seems like basic stuff.
The results of #Query1 give me the list of the first column, without a column name, in the console. In the variable explorer, 'data' appears as type List. When I open it up, it just says 'Row object of pyodbc module' 100 times, one for each row. Not what I'm looking for. Again, I'm looking for the same kind of output I would get if I ran the query in Microsoft SQL Server.
Running #Query2 gets me a little closer to this end. The results appear like a .csv file - unreadable, but it's all there, in the console.
Running #Query3, just the 'data' variable, gets me the closest result but with no column names. How can I bring in the column names?
More directly, how do I get 'data' to appear as a clean table with column names somewhere? Since this seems like a basic SQL function, could you direct me to a SQL-friendly library to use instead?
Also note that neither of the Queries required me to know the column names or widths. My entire method here is attempting to eyeball the results of the Query and quickly check the data - I can't see that the Patient_IDs are loading properly if I don't know which column is patient_ids.
Thanks for your help!
That's more than one question; I'll try to help you and give some advice.
I am running a connection to SQL through Python as such, and would like the output to look exactly the same as in SQL.
You are mixing SQL as language and formatted output of some interactive SQL tool.
SQL itself does not have anything about "look" of data.
Also note that neither of the Queries required me to know the column names or widths. My entire method here is attempting to eyeball the results of the Query and quickly check the data - I can't see that the Patient_IDs are loading properly if I don't know which column is patient_ids.
Correct. cursor.fetchall returns only the data.
Field information can be read from cursor.description.
Read more in PEP 249.
how do i get 'data' to appear as a clean table with column names somewhere?
It depends on how you define "appear".
Do you want text output, an HTML page, or maybe a GUI?
For text output: you can read the column names from cursor.description and print them before the data (a short sketch follows below).
If you want HTML/Excel/PDF/other - find some library/framework suiting your taste.
If you want an interactive experience similar to SQL tools - I recommend looking at jupyter-notebook + pandas.
Something like:
pandas.read_sql_query(sql, conn)
will give you a "clean table" no worse than what SQLDeveloper/SSMS/DBeaver/others give you.
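For the plain-text route mentioned above, a minimal sketch (reusing cursor and data from the question):
# Print the column names from cursor.description, then each row
columns = [col[0] for col in cursor.description]
print(" | ".join(columns))
for row in data:
    print(" | ".join(str(value) for value in row))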
We don't need any external libraries.
Refer to this for more details.
Print results in MySQL format with Python
However, the latest version of MySQL gives an error with that code, so I modified it.
Below is the query for the dataset
stri = "select * from table_name"
cursor.execute(stri)
data = cursor.fetchall()
mycon.commit()
Below, it will print the dataset in tabular form:
def columnnm(name):
    # Length of the longest value in this column, used to size the column width
    v = "SELECT LENGTH(" + name + ") FROM table_name WHERE LENGTH(" + name + ") = (SELECT MAX(LENGTH(" + name + ")) FROM table_name) LIMIT 1;"
    cursor.execute(v)
    data = cursor.fetchall()
    mycon.commit()
    return data[0][0]

widths = []
columns = []
tavnit = '|'      # row format string, built up to something like "| %-10s | %-7s |"
separator = '+'   # horizontal rule, built up to something like "+------------+---------+"

for cd in cursor.description:
    widths.append(max(columnnm(cd[0]), len(cd[0])))
    columns.append(cd[0])

for w in widths:
    tavnit += " %-" + "%ss |" % (w,)
    separator += '-' * w + '--+'

print(separator)
print(tavnit % tuple(columns))
print(separator)
for row in data:
    print(tavnit % row)
print(separator)

Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

cursor = con.cursor(prepared=True)

user_map = {}

cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)
for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid

cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems that if I do not fetch the full result set, MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
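A sketch of that two-connection approach, adapted from the code in the question (same placeholder connection settings):
import mysql.connector
from collections import namedtuple

read_con = mysql.connector.connect(host='127.0.0.1')   # used only for reading
write_con = mysql.connector.connect(host='127.0.0.1')  # used only for writing

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()
selector.close()
insertor.close()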

How to insert several thousand columns into sqlite3?

Similar to my last question, but I ran into a problem. Let's say I have a simple dictionary like the one below, but it's big. When I try inserting a big dictionary using the method below, I get an operational error on c.execute(schema) for having too many columns. So what should my alternative method be for populating an SQL database's columns? Using the ALTER TABLE command and adding each one individually?
import sqlite3

con = sqlite3.connect('simple.db')
c = con.cursor()

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5}
}

# 1. Find the unique column names.
columns = set()
for _, cols in dic.items():
    for key, _ in cols.items():
        columns.add(key)

# 2. Create the schema.
col_defs = [
    # Start with the column for our key name
    '"row_name" VARCHAR(2) NOT NULL PRIMARY KEY'
]
for column in columns:
    col_defs.append('"%s" REAL NULL' % column)
schema = "CREATE TABLE simple (%s);" % ",".join(col_defs)
c.execute(schema)

# 3. Loop through each row
for row_name, cols in dic.items():
    # Compile the data we have for this row.
    col_names = cols.keys()
    col_values = [str(val) for val in cols.values()]

    # Insert it.
    sql = 'INSERT INTO simple ("row_name", "%s") VALUES ("%s", "%s");' % (
        '","'.join(col_names),
        row_name,
        '","'.join(col_values)
    )
If I understand you right, you're not trying to insert thousands of rows, but thousands of columns. SQLite has a limit on the number of columns per table (by default 2000), though this can be adjusted if you recompile SQLite. Never having done this, I do not know if you then need to tweak the Python interface, but I'd suspect not.
You probably want to rethink your design. Any non-data warehouse / OLAP application is highly unlikely to need or be terribly efficient with thousands of columns (rows, yes) and SQLite is not a good solution for a data warehouse / OLAP type situation. You may get a bit further with something like an entity-attribute-value setup (not a normal recommendation for genuine relational databases, but a valid application data model and much more likely to accommodate your needs without pushing the limits of SQLite too far).
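A rough sketch of that entity-attribute-value idea, using the dictionary from the question (the table and file names here are made up):
import sqlite3

con = sqlite3.connect('simple_eav.db')  # hypothetical database file
c = con.cursor()
c.execute("CREATE TABLE simple_eav (row_name TEXT, attribute TEXT, value REAL)")

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5}
}

# One row per (entity, attribute, value) triple instead of one column per attribute,
# so thousands of attributes never hit SQLite's column limit.
rows = [(row_name, attr, val)
        for row_name, cols in dic.items()
        for attr, val in cols.items()]
c.executemany("INSERT INTO simple_eav VALUES (?, ?, ?)", rows)
con.commit()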
If you really are adding a massive number of rows and are running into problems, maybe your single transaction is getting too large.
Do a COMMIT (commit()) after a given number of lines (or even after each insert as a test) if that is acceptable.
Thousands of rows should be easily doable with sqlite. Getting to millions and above, at some point there might be need for more. Depends on a lot of things, of course.
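To illustrate the periodic COMMIT suggestion (reusing c, con and the rows list from the sketch above; the batch size is arbitrary):
BATCH = 1000
for i, triple in enumerate(rows, start=1):
    c.execute("INSERT INTO simple_eav VALUES (?, ?, ?)", triple)
    if i % BATCH == 0:
        con.commit()  # keep any single transaction from growing too large
con.commit()  # commit whatever is left over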
