I am writing a Python script that copies files from one directory to another and records each filename in a doc archive PostgreSQL table. The error I receive is below:
can't call .execute() on named cursors more than once
Below is my code:
cursor = conn.cursor('cur', cursor_factory=psycopg2.extras.DictCursor)
cursor.execute('SELECT * FROM doc_archive.table LIMIT 4821')
row_count = 0
for row in cursor:
    row_count += 1
    print "row: %s %s\r" % (row_count, row),
    pathForListFiles = srcDir
    files = os.listdir(pathForListFiles)
    for file in files:
        print file
        try:
            # Perform an insert with the docid
            cursor.execute("INSERT INTO doc_archive.field_photo_vw VALUES)
Is this the actual code? You've got unmatched quotes in the second execute.
When iterating through results, I normally use
for var in range(int(cursor.rowcount)):
    row = cursor.fetchone()
without trouble.
for var in cursor:
seems wrong to me.
results = cur.fetchall()
for var in enumerate(results):
is basically the same thing, but it would allow you to close your cursor in case you have to do another execute while iterating over the first set of results. Generally I just declare another cursor in those instances.
In either case, your current code doesn't seem to be fetching the results of the execute, which is important if you need to process that data.
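For illustration, here is a minimal sketch of that two-cursor pattern with psycopg2; the connection parameters, source table name, and docid column are placeholders, and the insert target is taken from the question:
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters

# Named (server-side) cursor: iterates the SELECT without loading everything at once.
read_cur = conn.cursor('cur', cursor_factory=psycopg2.extras.DictCursor)
read_cur.execute("SELECT * FROM doc_archive.some_table LIMIT 4821")

# Separate anonymous cursor for the inserts, so execute() is never
# called a second time on the named cursor.
write_cur = conn.cursor()

for row in read_cur:
    write_cur.execute(
        "INSERT INTO doc_archive.field_photo_vw (docid) VALUES (%s)",
        (row['docid'],))  # 'docid' column name is an assumption

read_cur.close()
write_cur.close()
conn.commit()
conn.close()
Keeping the SELECT on the named cursor means the whole result set never has to fit in memory, while the ordinary cursor is free to run as many execute() calls as needed.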
import csv
from cs50 import SQL

db = SQL("sqlite:///roster.db")

with open("students.csv", "r") as file:
    reader = csv.DictReader(file)
    record = {}
    same = []
    for row in reader:
        n = db.execute("INSERT INTO houses(house_id, house) VALUES (?, ?)", row['id'], row['house'])

a = db.execute("SELECT * from houses")
print(a)
The program above keeps giving me error messages that I do not really understand, and I do not know how to fix them. I did try to put the variable row['id'] directly into the VALUES parentheses, but I got an empty table with nothing in it.
That is what I get when I run ".schema" to show the table.
The "name" table was created at the command line with sqlite3 instead of by running Python code; is that why the error above mentions the "name" table?
[screenshots: the error messages and the ".schema" output]
Assuming the second image is the schema (better to post it as text, not an image!), there is a typo in the REFERENCES clauses of the house and head CREATE statements. Read them carefully and critically. It will not fail on the CREATE, only when trying to insert into either of the tables.
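As an illustration only (these table names are invented, not the asker's actual schema), the failure mode described above looks like this: SQLite accepts a CREATE TABLE whose REFERENCES clause points at a misspelled table, and the mistake only surfaces at insert time once foreign keys are enforced:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, house TEXT)")

# Typo: 'housess' does not exist, yet the CREATE succeeds anyway.
conn.execute("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        house_id INTEGER REFERENCES housess(id)
    )
""")

# The problem only shows up here, when writing to the child table.
try:
    conn.execute("INSERT INTO students (id, house_id) VALUES (1, 1)")
except sqlite3.Error as e:
    print(e)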
I'm new to psycopg2 and I have a question for which I cannot really find an answer on the Internet: is there any difference (for example in terms of performance) between the copy_xxx() methods and the execute() + fetchxxx() combination when writing the result of a query to a CSV file?
...
query_str = "SELECT * FROM mytable"
cursor.execute(query_str)
with open("my_file.csv", "w+") as file:
    writer = csv.writer(file)
    while True:
        rows = cursor.fetchmany()
        if not rows:
            break
        writer.writerows(rows)
vs
...
query_str = "SELECT * FROM mytable"
output_query = f"COPY ({query_str}) TO STDOUT WITH CSV HEADER"
with open("my_file.csv", "w+") as file:
    cursor.copy_expert(output_query, file)
And if I have a very complex query (assume for this example that it cannot be simplified any further), which method should I use with psycopg2? Or do you have any other advice?
Many thanks!!!
COPY is faster, but if query execution time is dominant or the file is small, it won't matter much.
You don't show us how the cursor was declared. If it is an anonymous cursor, then execute/fetch will read all the query data into memory upfront, which can lead to out-of-memory conditions for very large result sets. If it is a named cursor, then you will individually request every row from the server, leading to horrible performance (which can be overcome by passing a size argument to fetchmany, as the default is bizarrely set to 1).
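For the fetch variant, here is a minimal sketch using a named (server-side) cursor with an explicit fetchmany size; the connection parameters are placeholders and mytable is the placeholder table from the question:
import csv
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters

# Named cursor: rows stay on the server and are pulled over in batches.
cursor = conn.cursor(name="csv_export")
cursor.execute("SELECT * FROM mytable")

with open("my_file.csv", "w", newline="") as file:
    writer = csv.writer(file)
    while True:
        rows = cursor.fetchmany(10000)  # explicit batch size instead of the default of 1
        if not rows:
            break
        writer.writerows(rows)

cursor.close()
conn.close()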
I have a table that has 10 million plus records (rows) in it. I am trying to do a one-time load into S3 by doing a SELECT * on the table and then writing it to a gzip file on my local file system. Currently, I can run my script to collect 800,000 records into the gzip file, but then I receive an error, and the remaining records are obviously not written.
Since there is no continuation in SQL (for example, if you run ten LIMIT 800,000 queries, the rows won't come back in a consistent order), is there a way to write a Python/Airflow function that can load the 10 million+ row table in batches? Perhaps there's a way in Python where I can do a SELECT * statement and continue the statement after x amount of records into separate gzip files?
Here is my Python/Airflow script so far; when run, it only writes 800,000 records to the path variable:
def gzip_postgres_table(table_name, **kwargs):
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    server_postgres = create_tunnel_postgres()
    server_postgres.start()
    etl_conn = conn_postgres_internal(server_postgres)
    record = get_etl_record(kwargs['master_table'],
                            kwargs['table_name'])
    cur = etl_conn.cursor()
    unload_sql = '''SELECT *
                    FROM schema1.database1.{0} '''.format(record['table_name'])
    cur.execute(unload_sql)
    result = cur.fetchall()
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp, delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_postgres.stop()
The best, I mean THE BEST, approach to insert so many records into PostgreSQL, or to get them out of PostgreSQL, is to use PostgreSQL's COPY. This means you would have to change your approach drastically, but there's no better way that I know of in PostgreSQL: COPY manual
COPY writes the result of the query you are executing to a file, or it can load a table from a file.
COPY moves data between PostgreSQL tables and standard file-system files.
The reason why it is the best solution is that you are using PostgreSQL's native method to handle external data, without intermediaries, so it's fast and secure.
COPY works like a charm with CSV files. You should change your approach to one based on files and COPY.
Since COPY runs with SQL, you can divide your data using LIMIT and OFFSET in the query. For example:
COPY (SELECT * FROM country LIMIT 10 OFFSET 10) TO '/usr1/proj/bray/sql/a_list_of_10_countries.copy';
-- This exports 10 rows from the country table, skipping the first 10
COPY in this server-side form only works with files that are accessible to the PostgreSQL server user on the server itself.
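For the Python/Airflow side of the question, here is a minimal sketch of how the same idea could be driven through psycopg2's copy_expert, which streams to a client-side file and so sidesteps the server file-access restriction. The connection parameters, table name, key column, and batch size are all placeholders; the ORDER BY on a key column is what makes the LIMIT/OFFSET batches deterministic:
import gzip
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters
cur = conn.cursor()

batch_size = 1000000  # placeholder batch size

# The total row count decides how many batch files get written.
cur.execute("SELECT count(*) FROM schema1.mytable")
total_rows = cur.fetchone()[0]

for batch_no, offset in enumerate(range(0, total_rows, batch_size)):
    copy_sql = (
        "COPY (SELECT * FROM schema1.mytable ORDER BY id "  # 'id' is an assumed key column
        "LIMIT {} OFFSET {}) TO STDOUT WITH CSV HEADER"
    ).format(batch_size, offset)
    path = '/usr/local/airflow/mytable_{:03d}.gz'.format(batch_no)
    with gzip.open(path, 'wt') as fp:
        cur.copy_expert(copy_sql, fp)

cur.close()
conn.close()
Each batch file gets its own CSV header here; drop HEADER from the COPY options if you want the files to concatenate cleanly.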
PL Function (edited):
If you want COPY to be dynamic, you can wrap it in a PL/pgSQL function. For example:
CREATE OR REPLACE FUNCTION copy_table(
    table_name text,
    file_name text,
    vlimit text,
    voffset text
) RETURNS VOID AS $$
DECLARE
    query text;
BEGIN
    query := 'COPY (SELECT * FROM '||table_name||' LIMIT '||vlimit||' OFFSET '||voffset||') TO '''||file_name||''' DELIMITER '','' CSV';
    -- NOTE that file_name has to include its directory too.
    EXECUTE query;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
To execute the function you just have to do:
SELECT copy_table('test','/usr/sql/test.csv','10','10')
Notes:
If the PL will be public, you have to check for SQL injection attacks.
You can program the PL to suit your needs, this is just an example.
The function returns VOID, so it just does the COPY; if you need some feedback you should return something else.
The function has to be owned by the postgres user on the server, because it needs file access; that is why it is declared SECURITY DEFINER, so that any database user can run the PL.
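If the export is driven from the Python/Airflow side, the function could then be invoked through psycopg2 roughly like this (the connection parameters are placeholders and the arguments mirror the example above):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters
cur = conn.cursor()

# The output path must be writable by the PostgreSQL server process, as noted above.
cur.execute("SELECT copy_table(%s, %s, %s, %s)",
            ('test', '/usr/sql/test.csv', '10', '10'))

conn.commit()
cur.close()
conn.close()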
I'm attempting to import a .sql file that already has tables into Python. However, it doesn't seem to import what I had hoped. The only things I've seen so far are how to create a new .sql file with a table, but I'm looking to have an already completed .sql file imported into Python. So far, I've written this.
# Python code to demonstrate SQL to fetch data.
# importing the module
import sqlite3
# connect with the myTable database
connection = sqlite3.connect("CEM3_Slice_20180622.sql")
# cursor object
crsr = connection.cursor()
# execute the command to fetch all the data from the table emp
crsr.execute("SELECT * FROM 'Trade Details'")
# store all the fetched data in the ans variable
ans = crsr.fetchall()
# loop to print all the data
for i in ans:
    print(i)
However, it keeps claiming that the Trade Details table, which is a table inside the file I've connected it to, does not exist. Nowhere I've looked shows me how to do this with an already created file and table, so please don't just redirect me to an answer about that
As suggested by Rakesh above, you create a connection to the DB, not to the .sql file. The .sql file contains SQL scripts to rebuild the DB from which it was generated.
After creating the connection, you can implement the following:
cursor = connection.cursor() #cursor object
with open('CEM3_Slice_20180622.sql', 'r') as f:  # Not sure if the 'r' is necessary, but recommended.
    cursor.executescript(f.read())
Documentation on executescript found here
To read query results into a pandas DataFrame:
import pandas as pd
df = pd.read_sql('SELECT * FROM table LIMIT 10', connection)
There are two possibilities:
Your file is not in the correct format and therefore cannot be opened.
The SQLite file can exist anywhere on disk, e.g. /Users/Username/Desktop/my_db.sqlite. This means that you have to tell Python exactly where your file is; otherwise it will look inside the script's directory, see that there is no file with that name, and therefore create a new file with the provided filename.
sqlite3.connect expects the full path to your database file, or ':memory:' to create a database that exists in RAM. You don't pass it a SQL file. E.g.:
connection = sqlite3.connect('example.db')
You can then read the contents of CEM3_Slice_20180622.sql as you would a normal file and execute the SQL commands against the database.
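Putting the two answers together, here is a minimal sketch; the .sql file and table names are taken from the question, and example.db is just a placeholder database path:
import sqlite3

# Connect to (or create) a real SQLite database file, not the .sql script itself.
connection = sqlite3.connect("example.db")
crsr = connection.cursor()

# Rebuild the schema and data from the SQL script.
with open("CEM3_Slice_20180622.sql", "r") as f:
    crsr.executescript(f.read())

# The tables defined in the script now exist and can be queried.
crsr.execute("SELECT * FROM 'Trade Details'")
for row in crsr.fetchall():
    print(row)

connection.close()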
I have data stored in CSV files in multiple folders that I want to load into multiple SQL tables using MySQL on an Ubuntu system. Each table and file follows this schema (the files don't have the id field):
+ ------ + -------- + -------- + --------- + ---------- +
| SPO_Id | SPO_Name | SPO_Date | SPO_Price | SPO_Amount |
+ ------ + -------- + -------- + --------- + ---------- +
Each file contains pricing and sales data for a single day. Unfortunately, the files are not named after their date; they are stored in folders that are named after the date. Here's an example diagram of what the directory looks like
        ------> 20170102 ------> prices.csv
       /
      /
Exmpl ------> 20170213 ------> prices.csv
      \
       \
        ------> 20170308 ------> prices.csv
Here is a query I've written that pulls data from a file and stores it into a table:
use pricing ; # the database I want the tables in
drop table if exists SP_2017_01_02 ;
create table SP_2017_01_02 (
    SPO_Id int not null primary key auto_increment,
    SPO_Name varchar(32),
    SPO_Date date,
    SPO_Price float,
    SPO_Amount int
);
load data local infile '/Exmpl/20170102/prices.csv'
into table SP_2017_01_02
fields terminated by ','
lines terminated by '\n'
ignore 1 lines # First line contains field name information
(SPO_Name, SPO_Date, SPO_Price, SPO_Amount) ;
select * from SP_2017_01_02 ;
show tables ;
This query works fine for loading one table in at a time; however, because I have hundreds of tables, I need to automate this process. I've looked around on SO and here are a few things I've found:
Here is a question similar to mine, only this question references SQL Server. The answer gives a suggestion of what to do without any real substance.
This question is also very similar to mine, only this is specifically using SSIS, to which I don't have access (and the question is left unanswered)
This post suggests using a control file reference, but this is for SQL*Loader and Oracle.
Using python may be the way to go, but I've never used it before and my question seems like too complicated a problem with which to start.
This one and this one also use python, but they're just updating one table with data from one file.
I've worked a lot in SQL Server, but I'm fairly new to MySQL. Any help is greatly appreciated!
Update
I have attempted to do this using dynamic SQL in MySQL. Unfortunately, MySQL requires the use of stored procedures for dynamic SQL, but it doesn't allow LOAD DATA inside a stored procedure. As @RandomSeed pointed out, this cannot be done with only MySQL. I'm going to take his advice and attempt to write a shell/Python script to handle this.
I'll leave this question open until I (or someone else) can come up with a solid answer.
So once you have a SQL query/script that loads a single table, which it looks like you do (or could build an equivalent one in Python fairly simply), using Python to loop through the directory structure and collect the filenames is fairly simple. If you can somehow swap a new CSV path into the infile '/Exmpl/20170102/prices.csv' clause each time and call your SQL script from within Python, you should be good.
I don't have much time right now, but I wanted to show you how you could get those filename strings using python.
import os

prices_csvs = []
for root, dirs, files in os.walk(os.path.join('insert_path_here', 'Exmpl')):
    for f in files:
        if f == 'prices.csv':
            prices_csvs.append(os.path.join(root, f))
            break  # optional, use if there is only one prices.csv in each subfolder

for csv_file in prices_csvs:
    # csv_file is a string of the path for each prices.csv
    # if you can insert it as the `infile` parameter and run the sql, you are done
    # admittedly, i don't know how to do this at the moment
    pass
os.walk goes down through each subdirectory, giving the name root to the path to that folder, listing all the directories as dirs and files as files stored there. From there it's a simple check to see if the filename matches what you're looking for, and storing it in a list if it does. Looping over the list yields strings containing the path to each prices.csv in Exmpl.
Hope that sheds a little light on how Python could help.
I've marked Charlie's answer as the correct answer because, although he does not fully answer the question, he gave me a great start. Below is the code for anyone who might want to see how to load csv files into MySQL. The basic idea is to dynamically construct a string in Python and then execute that string in MySQL.
#!/usr/bin/python
import os
import MySQLdb  # Use this module in order to interact with SQL

# Find all the file names located in this directory
prices_csvs = []
for root, dirs, files in os.walk(os.path.join('insert_path_here', 'Exmpl')):
    for f in files:
        if f == 'prices.csv':
            prices_csvs.append(os.path.join(root, f))
            break

# Connect to the MySQL database
db = MySQLdb.connect(host="<Enter Host Here>", user="<Enter User here>", passwd="<Enter Password Here>", db="<Enter Database name here>")

# must create cursor object
cur = db.cursor()

for csv_file in prices_csvs:
    directory = "'" + csv_file + "'"
    table = csv_file[56:64]  # This extracts the date (used as the table suffix) from the path; the slice indices depend on your path length

    sql_string1 = "drop table if exists SD" + table + " ;\n"
    sql_string2 = "create table SD" + table + " ( \n\
        <Enter your fields here> \n\
        ); \n"
    sql_string3 = "load data local infile " + directory + " \n\
        into table TempPrices \n\
        fields terminated by ',' \n\
        lines terminated by " + repr('\n') + " \n\
        ignore 1 lines ;\n"
    # sql_string4 and sql_string5 are further statements built the same way (their definitions are omitted here)

    # Print out the strings for debugging
    print sql_string1
    print sql_string2
    print sql_string3
    print sql_string4
    print sql_string5

    # Execute your SQL statements
    cur.execute(sql_string1)
    cur.execute(sql_string2)
    cur.execute(sql_string3)
    cur.execute(sql_string4)
    cur.execute(sql_string5)
    db.commit()

db.close()
While debugging, I found it very helpful to copy the printed SQL statement and paste it into MySQL to confirm that the strings were being constructed successfully.