Inserting data from a CSV file to postgres using SQL - python

Struggling with this Python issue as I'm new to the language and don't have much experience with it. I currently have a CSV file containing around 20 headers and the same number of rows, so listing each one out, like some examples here, is what I'm trying to avoid:
https://www.dataquest.io/blog/loading-data-into-postgres/
My code consists of the following so far:
with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        cur.execute('INSERT INTO messages VALUES', (row))
I'm getting a syntax error at the end of the input, so I assumed it's linked to the way my execute method is written, but I still don't know what I would need to do to address the issue. Any help?
P.S. I understand people use %s for that, but if that's the case, can it be avoided? I don't want to repeat it 20 times on one line.

Basically, you DO have to specify at least the required placeholders - and preferably the field names too - in your query.
If it's a one-shot affair and you know which fields are in the CSV and in which order, then you can simply hardcode them in the query, e.g.
SQL = "insert into tablename(field1, field2, field21) values(%s, %s, %s)"
Ok, for 20 or so fields it gets quite boring, so you can also use a list of field names to generate the field names part and the placeholders:
fields = ["field1", "field2", "field21"]
placeholders = ["%s"] * len(fields) # list multiplication, yes
SQL = "insert into tablename({}) values({})".format(", ".join(fields), ", ".join(placeholders))
If by chance the CSV header row contains the exact field names, you can also just use that row as the value for fields - but you have to trust the csv then.
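For example, here is a minimal sketch of how that could look for the question's messages table, assuming the header row really does match the column names (the connection parameters are made up for the example):
import csv
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")   # hypothetical connection parameters
cur = conn.cursor()

with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    fields = next(reader)                  # header row used as the column names
    placeholders = ["%s"] * len(fields)
    SQL = "insert into messages({}) values({})".format(", ".join(fields), ", ".join(placeholders))
    for row in reader:
        cur.execute(SQL, row)
conn.commit()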
NB: specifying the fields list in the query is not strictly required but it can protect you from possible issues with a malformed csv. Actually, unless you really trust the source (your csv), you should actively validate the incoming data before sending them to the database.
NB2:
%s is for strings I know but would it work the same for timestamps?
In this case, "%s" is not used as a Python string format specifier but as a plain database query placeholder. The choice of the string format specifier here is really unfortunate as it creates a lot of confusion. Note that this is DB vendor specific, though; some vendors use "?" instead, which is much clearer IMHO (and you want to check your own db-api connector's docs for the correct placeholder to use, BTW).
And since it's not a string format specifier, it will work for any type and doesn't need to be quoted for strings, it's the db-api module's job to do proper formatting (including quoting etc) according to the db column's type.
While we're at it, by all means, NEVER directly use Python string formatting operations when passing values to your queries - unless you want your database to be open-bar for script-kiddies of course.
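As a tiny illustration (the body column and the user_input variable are made up for the example):
# unsafe: the value is pasted straight into the SQL text - classic SQL injection
cur.execute("insert into messages(body) values('{}')".format(user_input))
# safe: the value is passed separately and the db-api module handles quoting/escaping
cur.execute("insert into messages(body) values(%s)", (user_input,))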

The problem lies in the insert itself:
cur.execute('INSERT INTO messages VALUES', (row))
The problem is that, since you are not defining any placeholders in the query, it is interpreted as if you literally wanted to execute INSERT INTO messages VALUES with no parameters, which causes a syntax error; using a single placeholder won't work either, since that tells the driver you want exactly one parameter instead of one per column.
If you want to create the parameters in a more dynamic way, you could construct the query string dynamically.
Please take a look at the documentation: http://initd.org/psycopg/docs/cursor.html#cursor.execute
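For example, a hedged sketch of the corrected insert (assuming the messages table has exactly one column per CSV field):
with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    placeholders = ', '.join(['%s'] * len(header))   # one %s per CSV column
    insert = 'INSERT INTO messages VALUES ({})'.format(placeholders)
    for row in reader:
        cur.execute(insert, row)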

You can use string multiplication.
import csv
import psycopg2

conn = psycopg2.connect('postgresql://db_user:db_user_password@server_name:port/db_name')
cur = conn.cursor()

multiple_placeholders = ','.join(['%s'] * 20)
with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        cur.execute('INSERT INTO public.messages VALUES (' + multiple_placeholders + ')', row)
conn.commit()

If you want to have a single placeholder that covers a whole list of values, you can use a different method, located in "extras", which covers that usage:
psycopg2.extras.execute_values(cur, 'INSERT INTO messages VALUES %s', (row,))
This method can take many rows at a time (which is good for performance), which is why you need to wrap your single row in a one-element sequence, (row,).
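As a sketch (assuming the same dummy-data.csv and messages table as above, and the same placeholder DSN), you could collect the rows and send them in one call, which cuts the number of server round trips considerably:
import csv
import psycopg2
import psycopg2.extras

conn = psycopg2.connect('postgresql://db_user:db_user_password@server_name:port/db_name')
cur = conn.cursor()

with open('dummy-data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)            # skip the header
    rows = list(reader)     # all remaining rows
    # execute_values expands the single %s placeholder into a multi-row VALUES list
    psycopg2.extras.execute_values(cur, 'INSERT INTO messages VALUES %s', rows)
conn.commit()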

Last time I was struggling to insert CSV data into Postgres, I used pgAdmin and it worked. I don't know whether this counts as a solution, but it's an easy way to get it done.

You can use the cursor and executemany so that you can skip the iteration, but it's slower than the string-joining parameterized approach.
import pandas as pd

df = pd.read_csv('dummy-data.csv')
df.columns = [<define the headers here>]  # You can skip this line if the headers match the column names
try:
    # assuming psycopg2 here, which has no cursor.prepare(): pass the parameterized
    # statement straight to executemany with %s placeholders
    cursor.executemany("insert into public.messages(<Column Names>) values(%s, %s, %s, %s, %s)",
                       df.values.tolist())
    conn.commit()
except Exception:
    conn.rollback()


Too many server roundtrips w/ psycopg2

I am making a script that should create a schema for each customer. I'm fetching all the metadata from a database that defines how each customer's schema should look, and then creating it. Everything is well defined: the types, the names of tables, etc. A customer has many tables (e.g. address, customers, contact, item, etc.), and each table has the same metadata.
My procedure now:
get everything I need from the metadataDatabase.
In a for loop, create a table, and then ALTER TABLE to add each piece of metadata (this is done for each table).
Right now my script runs in about a minute per customer, which I think is too slow. It has something to do with me having a loop, and in that loop I'm altering each table.
I think that instead of altering (which might not be such a clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
    con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it reads the table name as a string within a string):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating tables with a serial type (i.e., autonumber) ID field and then using ALTER TABLE for all other fields, with a combination of sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular format for data types, which are not literals in the SQL statement.
from psycopg2 import sql

# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm=sql.Identifier("tester"),
                                  tbl=sql.Identifier("table")))

# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""

for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm=sql.Identifier("tester"),
                                            tbl=sql.Identifier("table"),
                                            col=sql.Identifier(item[0])))
Alternatively, use str.join with a list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
    "id" serial,
    {vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n    ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
                                        shm=sql.Identifier("tester"),
                                        tbl=sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
    "id" serial,
    "last_seen" date,
    "valid_from" timestamp
)
However, this becomes heavy as there are too many server roundtrips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error. Rather, I get an SQL syntax error. A VALUES list is for conveying data, but ALTER TABLE is not about data, it is about metadata, so you can't use a VALUES list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes. And you can't have a comma between name and type. And you can't have parentheses around each pair. And each pair needs to be introduced with "ADD"; you can't have it just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to tell it to use quote_ident.
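For what it's worth, here is a hedged sketch of that kind of single statement built with psycopg2's sql module (one ADD COLUMN clause per pair, so one round trip per table; the schema, table and items are the same as in the answer above):
from psycopg2 import sql

items = [("last_seen", "date"), ("valid_from", "timestamp")]
# one ALTER TABLE with one ADD COLUMN clause per pair
adds = sql.SQL(", ").join(
    sql.SQL("ADD COLUMN {} {}").format(sql.Identifier(name), sql.SQL(typ))
    for name, typ in items
)
stmt = sql.SQL("ALTER TABLE {}.{} {}").format(
    sql.Identifier("tester"), sql.Identifier("table"), adds)
cur.execute(stmt)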
Not only is execute_values the wrong tool for the job, but I think Python in general might be as well. Why not just load from a .sql file?

How can I search a record in MySQL using Python

def search(title="", author="", year="", isbn=""):
    con = mysql.connector.connect(host="localhost", user="root", passwd="junai2104", database="book")
    cur = con.cursor()
    sql_statement = "SELECT * FROM book WHERE title={} or author={} or year={} or isbn={} ".format(title, author, year, isbn)
    cur.execute(sql_statement)
    rows = cur.fetchall()
    con.close()
    return rows

print(search(title='test2'))
How can I search for a value in MySQL using a Python argument? How do I get the values from the arguments?
You have a couple of issues with your code:
In your SQL SELECT statement you are looking for values in text columns (TEXT, VARCHAR etc.). To do so you must add single quotes around your search criteria, since you want to indicate a text literal. So WHERE title={} should be WHERE title='{}' (same goes for the other parameters).
When one or more of your arguments are empty, you will search for rows where the respective value is an empty text. So in your example search(title='test2') will trigger a search for an entry where the title column has the value 'test2' or any of the other three columns (author, year and isbn) has an empty text. If you intend to look for the title 'test2', this will only work if none of the other columns ever contains an empty text. And even then, because of the three OR operators in your query, performance will be poor. What you should do instead is evaluate each parameter individually and construct the query only with the parameters that are not empty.
By constructing your query by formatting a string, you create a massive security issue if the values of your search parameters come from user input. Your code is wide open to SQL injection, which is one of the simplest and most effective attacks on a system. You should always parametrize your queries to prevent this attack. As a general principle, never create SQL queries by formatting or concatenating strings with their parameters. Note that with parametrized queries you do not need to add single quotes to your query as written in point 1 (a combined sketch of points 2 and 3 follows this list).
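Here is a minimal sketch of both points (the connection details are the ones from the question; the exact column names are assumed): the query is built only from the non-empty arguments, and the values themselves are passed as parameters rather than formatted into the string:
import mysql.connector

def search(title="", author="", year="", isbn=""):
    con = mysql.connector.connect(host="localhost", user="root", passwd="junai2104", database="book")
    cur = con.cursor()
    # keep only the criteria that were actually supplied
    criteria = {"title": title, "author": author, "year": year, "isbn": isbn}
    filters = {col: val for col, val in criteria.items() if val != ""}
    where = " or ".join("{}=%s".format(col) for col in filters)   # column names are not user input here
    sql_statement = "SELECT * FROM book" + (" WHERE " + where if filters else "")
    cur.execute(sql_statement, tuple(filters.values()))           # values are bound as parameters
    rows = cur.fetchall()
    con.close()
    return rows

print(search(title='test2'))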

Given table and column name, how to test if INSERT needs quotes ('') around the values to be inserted?

I have a dictionary of column names / values to insert into a table, and a function that generates the INSERT statement. I'm stuck because the function always puts quotes around the values, and some are integers.
E.g. if column 1 is of type integer then the statement should be INSERT INTO myTable (col1) VALUES 5; vs
INSERT INTO myTable (col1) VALUES '5'; the second one causes an error saying column 5 does not exist.
EDIT: I found the problem (I think). The value was in double quotes, not single, so it was "5".
In Python, given a table and column name, how can I test whether the INSERT statement needs '' around the VALUES?
This question was tagged with "psycopg2" -- you can prepare the statement using a format string and have psycopg2 infer types for you in many cases.
cur.execute('INSERT INTO myTable (col1, col2) VALUES (%s, %s);', (5, 'abc'))
psycopg2 will deal with it for you, because Python knows that 5 is an integer and 'abc' is a string.
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
You certainly want to use a library function to decide whether or not to quote values you insert. If you are inserting anything input by a user, writing your own quoting function can lead to SQL Injection attacks.
It appears from your tags that you're using psycopg2 - I've found another response that may be able to answer your question, since I'm not familiar with that library. The main gist seems to be that you should use
cursor.execute("query with params %s %s", ("param1", "pa'ram2"))
Which will automatically handle any quoting needed for param1 and param2.
Although I personally don't like the idea, you can use single quotes around integers when you insert in Postgres.
Perhaps your problem is the lack of parentheses:
INSERT INTO myTable(col1)
VALUES('5');
Here is a SQL Fiddle illustrating this code.
As you note in the comments, double quotes do not work in Postgres.
You can always put single quotes around the value (be careful: if the value contains a quote you must double it, e.g. insert into example (value_t) values ('O''Hara');).
You can decide by checking the value that you want to insert, regardless of the type of the destination.
You can decide by checking the type of the target field.
As you can see at http://sqlfiddle.com/#!15/8bfbd/3, there is no problem inserting integers into a text field, or strings that represent an integer into a numeric field.
To check the field type you can use the information_schema:
select data_type from information_schema.columns
where table_schema='public'
and table_name='example'
and column_name='value_i';
http://sqlfiddle.com/#!15/8bfbd/7
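If you want to do that check from Python, a minimal sketch with psycopg2 (connection parameters are made up; the schema, table and column are the ones from the fiddle) could look like this:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")   # hypothetical connection parameters
cur = conn.cursor()
cur.execute("""
    select data_type from information_schema.columns
    where table_schema = %s and table_name = %s and column_name = %s
""", ('public', 'example', 'value_i'))
row = cur.fetchone()
data_type = row[0] if row else None   # e.g. 'integer' or 'text'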

syntax error when attempting to insert data into postgresql

I am attempting to insert parsed dta data into a PostgreSQL database, with each row being a separate variable table, and it was working until I added the second field, "recodeid_fk". The error I now get when attempting to run this code is: pg8000.errors.ProgrammingError: ('ERROR', '42601', 'syntax error at or near "imp"').
Eventually, I want to be able to parse multiple files at the same time and insert the data into the database, but if anyone could help me understand what's going on now, that would be fantastic. I am using Python 2.7.5, the statareader is from the pandas 0.12 development records, and I have very little experience with Python.
dr = statareader.read_stata('file.dta')
a = 2
t = 1
for t in range(1,10):
    z = str(t)
for date, row in dr.iterrows():
    cur.execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES({}, {})".format(z, str(row[a]), 29))
a += 1
t += 1
conn.commit()
cur.close()
conn.close()
To address your specific error...
The syntax error probably comes from string values {} that need quotes around them. execute() can take care of this for you automatically. Replace
execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES({}, {})".format(z, str(row[a]), 29))
with
execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES(%s, %s)".format(z), (row[a], 29))
The table name is completed the same way as before, but the values will be filled in by execute, which inserts quotes if they are needed. Maybe execute could fill in the table name too, and we could drop format entirely, but that would be an unusual usage, and I'm guessing execute might (wrongly) put quotes in the middle of the name.
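Putting that together, a rough sketch of the corrected loop (same variables and connection as in the question, assuming the two loops were actually meant to be nested; see the follow-up question about that below):
dr = statareader.read_stata('file.dta')
a = 2
for t in range(1, 10):
    z = str(t)
    for date, row in dr.iterrows():
        # the table name still goes through format; the data values are bound by execute
        cur.execute("INSERT INTO tblv00{} (data, recodeid_fk) VALUES (%s, %s)".format(z), (row[a], 29))
    a += 1
conn.commit()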
But there's a nicer approach...
Pandas includes a function for writing DataFrames to SQL tables. Postgresql is not yet supported, but in simple cases you should be able to pretend that you are connected to sqlite or MySQL database and have no trouble.
What do you intend with z here? As it is, you loop z from '1' to '9' before proceeding to the next for loop. Should the loops be nested? That is, did you mean to insert the contents of dr into nine different tables called tblv001 through tblv009?
If you meant for that loop to put different parts of dr into different tables, please check the indentation of your code and clarify it.
In either case, the link above should take care of the SQL insertion.
Response to Edit
It seems like t, z, and a are doing redundant things. How about:
import pandas as pd
import string
...
# Loop through columns of dr, and count them as we go.
for i, col in enumerate(dr):
    table_name = 'tblv' + string.zfill(i, 3)  # e.g., tblv001 or tblv010
    df1 = pd.DataFrame(dr[col]).reset_index()
    df1.columns = ['data', 'recodeid_fk']
    pd.io.sql.write_frame(df1, table_name, conn)
I used reset_index to make the index into a column. The new (sequential) index will not be saved by write_frame.

How can I reference columns by their names in python calling SQLite?

I have some code which I've been using to query MySQL, and I'm hoping to use it with SQLite. My real hope is that this will not involve making too many changes to the code. Unfortunately, the following code doesn't work with SQLite:
cursor.execute(query)
rows = cursor.fetchall()
data = []
for row in rows:
    data.append(row["column_name"])
This gives the following error:
TypeError: tuple indices must be integers
Whereas if I change the reference to use a column number, it works fine:
data.append(row[1])
Can I execute the query in such a way that I can reference columns by their names?
In the five years since the question was asked and then answered, a very simple solution has arisen. Any new code can simply wrap the connection object with a row factory. Code example:
import sqlite3

conn = sqlite3.connect('./someFile')
conn.row_factory = sqlite3.Row   # Here's the magic!
cursor = conn.execute("SELECT name, age FROM someTable")
for row in cursor:
    print(row['name'])
Here are some fine docs. Enjoy!
To access columns by name, use the row_factory attribute of the Connection instance. It lets you set a function that takes the arguments cursor and row, and returns whatever you'd like. There are a few built in to pysqlite, namely sqlite3.Row, which does what you've asked.
This can be done by adding a single line after the "connect" statement:
conn.row_factory = sqlite3.Row
Check the documentation here:
http://docs.python.org/library/sqlite3.html#accessing-columns-by-name-instead-of-by-index
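For example, here is a minimal sketch of a hand-rolled row factory that returns plain dicts (sqlite3.Row usually makes this unnecessary; the dict_factory name just mirrors the example in the docs, and the table here is made up):
import sqlite3

def dict_factory(cursor, row):
    # map each column name from cursor.description to its value in the row tuple
    return {desc[0]: value for desc, value in zip(cursor.description, row)}

conn = sqlite3.connect(':memory:')
conn.row_factory = dict_factory
conn.execute("CREATE TABLE someTable (name TEXT, age INTEGER)")
conn.execute("INSERT INTO someTable VALUES ('Alice', 30)")
for row in conn.execute("SELECT name, age FROM someTable"):
    print(row['name'], row['age'])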
I'm not sure if this is the best approach, but here's what I typically do to retrieve a record set using a DB-API 2 compliant module:
cursor.execute("""SELECT foo, bar, baz, quux FROM table WHERE id = %s;""",
               (interesting_record_id,))
for foo, bar, baz, quux in cursor.fetchall():
    frobnicate(foo + bar, baz * quux)
The query parameter style used here is one of the DB-API standards and happens to be the preferred one for Psycopg2; other DB-API adapters might suggest a different convention, which will be fine.
Writing queries like this, where implicit tuple unpacking is used to work with the result set, has typically been more effective for me than trying to worry about matching Python variable names to SQL column names (which I usually only use to drop prefixes, and then only if I'm working with a subset of the column names such that the prefixes no longer help to clarify things), and it is much better than remembering numerical column IDs.
This style also helps you avoid SELECT * FROM table..., which is just a maintenance disaster for anything but the simplest tables and queries.
So, not exactly the answer you were asking for, but possibly enlightening nonetheless.
The SQLite API supports cursor.description, so you can easily do it like this:
headers = {}
data = []
for record in cursor.fetchall():
    if not headers:
        headers = dict((desc[0], idx) for idx, desc in enumerate(cursor.description))
    data.append(record[headers['column_name']])
A little long-winded, but it gets the job done. I noticed they even have it in the factory.py file under dict_factory.
kushal's answer on this forum works fine:
Use a DictCursor:
import MySQLdb.cursors
.
.
.
cursor = db.cursor(MySQLdb.cursors.DictCursor)
cursor.execute(query)
rows = cursor.fetchall()
for row in rows:
    print(row['employee_id'])
Please take note that the column name is case sensitive.
Use the cursor description, like so:
rows = c.fetchall()
for row in rows:
    for col_i, col in enumerate(row):
        print("Attribute: {0:30} Value: {1}".format(c.description[col_i][0], col))
