I'm new to psycopg2 and I have a question (which I cannot really find an answer to on the Internet): is there any difference (for example in terms of performance) between using the copy_xxx() methods and the execute() + fetchxxx() combination when writing the result of a query to a CSV file?
...
query_str = "SELECT * FROM mytable"
cursor.execute(query_str)
with open("my_file.csv", "w+") as file:
    writer = csv.writer(file)
    while True:
        rows = cursor.fetchmany()
        if not rows:
            break
        writer.writerows(rows)
vs
...
query_str = "SELECT * FROM mytable"
output_query = f"COPY ({query_str}) TO STDOUT WITH CSV HEADER"  # the query must be parenthesized
with open("my_file.csv", "w+") as file:
    cursor.copy_expert(output_query, file)
And if I run a very complex query (my assumption is that it cannot be simplified any further) with psycopg2, which method should I use? Or do you have any advice, please?
Many thanks!
COPY is faster, but if query execution time is dominant or the file is small, it won't matter much.
You don't show us how the cursor was declared. If it is an anonymous cursor, then execute/fetch will read all the query data into memory up front, leading to out-of-memory conditions for very large queries. If it is a named cursor, then you will individually request every row from the server, leading to horrible performance (which can be overcome by specifying a count argument to fetchmany, as the default is bizarrely set to 1).
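A minimal sketch of the batching pattern described above, using sqlite3 as a stand-in so it runs without a PostgreSQL server (with psycopg2 you would pass a name to conn.cursor() to make it server-side; the table and batch size here are illustrative):

```python
import csv
import sqlite3

# Build a small throwaway table so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(10)])

cur = conn.cursor()
cur.execute("SELECT * FROM mytable")

with open("my_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while True:
        # Pass an explicit batch size; the DB-API default is 1 row at a time.
        rows = cur.fetchmany(5000)
        if not rows:
            break
        writer.writerows(rows)
```

With psycopg2 the same loop keeps memory bounded only if the cursor is named; an anonymous cursor has already pulled the whole result set client-side by the time fetchmany runs.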
Related
I have a table that has 10 million plus records (rows) in it. I am trying to do a one-time load into s3 by select *'ing the table and then writing it to a gzip file in my local file system. Currently, I can run my script to collect 800,000 records into the gzip file, but then I receive an error, and the remaining records are obviously not inserted.
Since there is no continuation in SQL (for example, if you run ten LIMIT 800,000 queries, the rows won't come back in a consistent order).
So, is there a way to write a python/airflow function that can load the 10 million+ row table in batches? Perhaps there's a way in python where I can do a select * statement and continue the statement after x amount of records into separate gzip files?
Here is my python/airflow script so far; when run, it only writes 800,000 records to the path variable:
def gzip_postgres_table(table_name, **kwargs):
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    server_postgres = create_tunnel_postgres()
    server_postgres.start()
    etl_conn = conn_postgres_internal(server_postgres)
    record = get_etl_record(kwargs['master_table'],
                            kwargs['table_name'])
    cur = etl_conn.cursor()
    unload_sql = '''SELECT *
                    FROM schema1.database1.{0} '''.format(record['table_name'])
    cur.execute(unload_sql)
    result = cur.fetchall()
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp, delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_postgres.stop()
The best, I mean THE BEST, approach for inserting that many records into PostgreSQL, or for getting them out of PostgreSQL, is to use COPY. This means you would have to change your approach drastically, but there's no better way that I know of in PostgreSQL. See the COPY manual.
COPY creates a file from the query you are executing, or it can insert into a table from a file.
COPY moves data between PostgreSQL tables and standard file-system files.
The reason it is the best solution is that you are using PostgreSQL's native method for handling external data, with no intermediaries, so it's fast and secure.
COPY works like a charm with CSV files. You should change your approach to a file-handling method and the use of COPY.
Since COPY runs with SQL, you can divide your data using LIMIT and OFFSET in the query. For example:
COPY (SELECT * FROM country LIMIT 10 OFFSET 10) TO '/usr1/proj/bray/sql/a_list_of_10_countries.copy';
-- This exports 10 rows, skipping the first 10 (add an ORDER BY to make the batches deterministic)
COPY only works with files that are accessible to the PostgreSQL user on the server.
PL Function (edited):
If you want COPY to be dynamic, you can wrap the COPY in a PL function. For example:
CREATE OR REPLACE FUNCTION copy_table(
    table_name text,
    file_name text,
    vlimit text,
    voffset text
) RETURNS VOID AS $$
DECLARE
    query text;
BEGIN
    query := 'COPY (SELECT * FROM '||table_name||' LIMIT '||vlimit||' OFFSET '||voffset||') TO '''||file_name||''' DELIMITER '','' CSV';
    -- NOTE that file_name has to include its directory too.
    EXECUTE query;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
To execute the function you just have to do:
SELECT copy_table('test','/usr/sql/test.csv','10','10')
Notes:
If the PL will be public, you have to check for SQL injection attacks.
You can program the PL to suit your needs, this is just an example.
The function returns VOID, so it just performs the COPY; if you need some feedback you should return something else.
The function has to be owned by the postgres user on the server, because it needs file access; that is why it needs SECURITY DEFINER, so that any database user can run the PL.
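Alternatively, the batching can be driven from Python with COPY ... TO STDOUT, which needs no server-side file access. A hedged sketch under stated assumptions: the function names, the `id` ordering column, and the 800,000 batch size are all illustrative, and `dump_table_in_batches` expects a live psycopg2 connection (only the SQL builder runs standalone):

```python
import gzip

def build_copy_sql(table, limit, offset):
    """Build a COPY ... TO STDOUT statement for one batch.

    The ORDER BY on a unique key (assumed here to be 'id') makes the
    LIMIT/OFFSET batches deterministic across successive queries.
    """
    return ("COPY (SELECT * FROM {t} ORDER BY id LIMIT {l} OFFSET {o}) "
            "TO STDOUT WITH CSV HEADER").format(t=table, l=limit, o=offset)

def dump_table_in_batches(conn, table, total_rows, batch_size=800000,
                          path_tmpl="/usr/local/airflow/{}_{:03d}.csv.gz"):
    # Each batch is streamed by the server and gzip-compressed client-side,
    # so memory use stays bounded regardless of table size.
    cur = conn.cursor()
    for n, offset in enumerate(range(0, total_rows, batch_size)):
        with gzip.open(path_tmpl.format(table, n), "wt") as fp:
            cur.copy_expert(build_copy_sql(table, batch_size, offset), fp)
```

Note that for very large tables, keyset pagination (WHERE id > last_seen_id) scales better than OFFSET, which forces the server to scan and discard the skipped rows on every batch.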
I am trying to load data from Oracle into Vertica via a CSV file.
Using Python, I wrote this script to create the CSV from Oracle:
csv_file = open(r"C:\DataBases\csv\%s_%s.csv" % (FILE_NAME, TABLE_NAME), "a", encoding='utf-8')
writer = csv.writer(csv_file, delimiter=';', quoting=csv.QUOTE_ALL)  # create the writer once, outside the loop
for row in cursor:
    count_rows += 1
    result_inside = {}
    row_content = []
    for col, val in zip(col_names, row):
        result_inside[col] = val
        row_content.append(result_inside[col])
    result_select_from_oracle.append(result_inside)
    file.write(json.dumps(result_inside, default=myconverter))  # 'file' is a separate JSON handle opened elsewhere
    writer.writerow(row_content)
wrote this script for COPY CSV to Vertica
connection = vertica_python.connect(**conn_info)
cursor = connection.cursor()
with open(r"C:\DataBases\csv\%s_%s.csv" % (FILE_NAME, TABLE_NAME), "rb") as fs:
    cursor.copy("COPY %s.%s FROM stdin PARSER fcsvparser(type='traditional', delimiter=';', record_terminator='\n')" % (SCHEMA_NAME, TABLE_NAME), fs)
connection.commit()
connection.close()
After the operation finished I had a problem: 40,000 rows were unloaded from Oracle, but only 39,700 arrived in Vertica.
Where can the problem be, and how do I solve it?
The COPY statement has two main stages: parsing and loading (there are other stages, but we'll stick to these two). COPY rejects data only if it encounters problems during the parsing phase; that's when you end up with rejected data.
Potential causes for parsing errors include:
Unsupported parser options
Incorrect data types for the table into which data is being loaded
Malformed context for the parser in use
Missing delimiters
You may want the whole load to fail if even one row is rejected; for that, use the optional parameter ABORT ON ERROR.
You may want to limit the number of rejected rows you’ll permit. Use REJECTMAX to set a threshold after which you want COPY to roll back the load process.
Vertica gives you these options to save rejected data:
Do nothing. Vertica automatically saves a rejected-data file and an accompanying explanation of each rejected row (the exception) to files in a catalog subdirectory called CopyErrorLogs.
Specify file locations of your choice using the REJECTED DATA and EXCEPTIONS parameters (the files will be saved on the machine you run the script on).
Save rejected data to a table. Using a table lets you query what data was rejected, and why. You can then fix any incorrect data and reload it.
Vertica recommends saving rejected data to a table, which will contain both the rejected data and the exception in one location. Saving rejected data to a table is simple, using the REJECTED DATA AS TABLE reject_table clause in the COPY statement.
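A hedged sketch of what that could look like from vertica_python (the schema, table, and reject-table names are illustrative, and only the statement builder runs here, since executing it needs a live Vertica connection):

```python
def build_vertica_copy(schema, table, reject_table):
    """Build a COPY ... FROM STDIN statement that keeps rejected rows
    in a queryable table instead of scattered CopyErrorLogs files."""
    return ("COPY {s}.{t} FROM STDIN "
            "PARSER fcsvparser(type='traditional', delimiter=';', "
            "record_terminator=E'\\n') "
            "REJECTED DATA AS TABLE {r}").format(s=schema, t=table,
                                                 r=reject_table)

# With a live connection you would then stream the file:
#   with open(csv_path, "rb") as fs:
#       cursor.copy(build_vertica_copy("myschema", "mytable",
#                                      "mytable_rejects"), fs)
```

Afterwards, `SELECT rejected_reason, rejected_data FROM mytable_rejects` should show exactly which of the missing 300 rows failed to parse, and why.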
I am doing a bulk import of dbf files to sqlite. I wrote a simple script in python using the dbf module at http://dbfpy.sourceforge.net/. It works fine and as expected except for a few cases. In a very discrete number of cases the module seems to have added a few erroneous records to the table it was reading.
I know this sounds crazy, right, but it really seems to be the case. I have exported the dbf file in question to CSV using OpenOffice and imported it directly to sqlite using .import, and the 3 extra records are not there.
But if I iterate through the file using python and the dbfpy module, the 3 extra records are added.
I am wondering: is it possible that these three records were flagged as deleted in the dbf file and, while invisible to OpenOffice, are being picked up by the dbf module? I could be way off with this possibility, but I am really scratching my head on this one.
Any help is appreciated.
What follows is a sample of my method for reading the dbf file. I have removed the loop and used one single case instead.
conn = lite.connect('../data/my_dbf.db3')
# used to get rid of the 8-byte string error from sqlite3
conn.text_factory = str
cur = conn.cursor()
rows_list = []
db = dbf.Dbf("../data/test.dbf")
for rec in db:
    if not rec.deleted:  # the fix: skip records flagged as deleted
        row_tuple = (rec["name"], rec["address"], rec["age"])
        rows_list.append(row_tuple)
print(file_name + " processed")
db.close()
cur.executemany("INSERT INTO exported_data VALUES(?, ?, ?)", rows_list)
# pprint.pprint(rows_list)
conn.commit()
Solution
Ok, after about another half hour of testing before lunch I discovered that my hypothesis was in fact correct: some files had not been packed, and as such still contained records flagged as deleted. They should not have been in an unpacked state after export, so this caused more confusion.
I manually packed one file, tested it, and it immediately returned the proper results.
A big thanks for the help on this. I added the solution given below to ignore the deleted records. I had searched and searched for this method (deleted) in the module but could not find an API doc for it; I even looked in the code, but in the fog of it all it must have slipped by. Thanks a million for the solution and help, guys.
If you want to discard records marked as deleted, you can write:
for rec in db:
    if not rec.deleted:
        row_tuple = (rec["name"], rec["address"], rec["age"])
        rows_list.append(row_tuple)
I exported some data from my database in the form of JSON, which is essentially just one [list] with a bunch (900K) of {objects} inside it.
Trying to import it on my production server now, but I've got some cheap web server. They don't like it when I eat all their resources for 10 minutes.
How can I split this file into smaller chunks so that I can import it piece by piece?
Edit: Actually, it's a PostgreSQL database. I'm open to other suggestions on how I can export all the data in chunks. I've got phpPgAdmin installed on my server, which supposedly can accept CSV, Tabbed and XML formats.
I had to fix phihag's script:
import json

with open('fixtures/PostalCodes.json', 'r') as infile:
    o = json.load(infile)

chunkSize = 50000
for i in range(0, len(o), chunkSize):
    with open('fixtures/postalcodes_' + ('%02d' % (i // chunkSize)) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
dump:
pg_dump -U username -t table database > filename
restore:
psql -U username < filename
(I don't know what the heck pg_restore does, but it gives me errors)
The tutorials on this conveniently leave this information out, esp. the -U option which is probably necessary in most circumstances. Yes, the man pages explain this, but it's always a pain to sift through 50 options you don't care about.
I ended up going with Kenny's suggestion... although it was still a major pain. I had to dump the table to a file, compress it, upload it, extract it, then I tried to import it, but the data was slightly different on production and there were some missing foreign keys (postalcodes are attached to cities). Of course, I couldn't just import the new cities, because then it throws a duplicate key error instead of silently ignoring it, which would have been nice. So I had to empty that table, repeat the process for cities, only to realize something else was tied to cities, so I had to empty that table too. Got the cities back in, then finally I could import my postal codes. By now I've obliterated half my database because everything is tied to everything and I've had to recreate all the entries. Lovely. Good thing I haven't launched the site yet. Also "emptying" or truncating a table doesn't seem to reset the sequences/autoincrements, which I'd like, because there are a couple magic entries I want to have ID 1. So..I'd have to delete or reset those too (I don't know how), so I manually edited the PKs for those back to 1.
I would have run into similar problems with phihag's solution, plus I would have had to import 17 files one at a time, unless I wrote another import script to match the export script. Although he did answer my question literally, so thanks.
In Python:
import json

with open('file.json') as infile:
    o = json.load(infile)

chunkSize = 1000
for i in range(0, len(o), chunkSize):
    with open('file_' + str(i // chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
I turned phihag's and mark's work into a tiny script (gist)
also copied below:
#!/usr/bin/env python
# based on http://stackoverflow.com/questions/7052947/split-95mb-json-array-into-smaller-chunks
# usage: python json-split filename.json
# produces multiple filename_0.json of 1.49 MB size
import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)

chunkSize = 4550
for i in range(0, len(o), chunkSize):
    with open(sys.argv[1] + '_' + str(i // chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
Assuming you have the option to go back and export the data again...:
pg_dump - extract a PostgreSQL database into a script file or other archive file.
pg_restore - restore a PostgreSQL database from an archive file created by pg_dump.
If that's no use, it might be useful to know what you're going to be doing with the output so that another suggestion can hit the mark.
I know this question is from a while back, but I think this new solution is hassle-free.
You can use pandas (0.21.0 or later), which supports a chunksize parameter as part of read_json. You can load one chunk at a time and save the json:
import pandas as pd

chunks = pd.read_json('file.json', lines=True, chunksize=20)
for i, c in enumerate(chunks):
    c.to_json('chunk_{}.json'.format(i))
I've accessed a database and have the result in a cursor object, but I couldn't save it :(
cur_deviceauth.execute('select * from device_auth')
for row in cur_deviceauth:
    print(row)

writer = csv.writer(open("out.csv", "w"))
writer.writerows(cur_deviceauth)
I don't get an error message, and nothing is written. How do I make it work? Any advice would be of much help, and what is the best place to learn this stuff?
When you print the rows before writing them to a file, you exhaust the cursor object, which works as a generator. Just write to the file without any intermediate steps.
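A minimal sketch of that exhaustion behaviour, using sqlite3 so it runs standalone (the same applies to most DB-API cursors, including the one above):

```python
import csv
import sqlite3

# A throwaway table standing in for device_auth.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_auth (id INTEGER, token TEXT)")
conn.executemany("INSERT INTO device_auth VALUES (?, ?)",
                 [(1, "a"), (2, "b")])

cur = conn.cursor()
cur.execute("SELECT * FROM device_auth")
for row in cur:          # this loop consumes every row...
    print(row)

with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows(cur)   # ...so nothing is left to write here

cur.execute("SELECT * FROM device_auth")   # re-run the query instead
with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows(cur)           # now both rows are written
```

If you need both the printout and the file, call `rows = cur.fetchall()` once and reuse the list, rather than iterating the cursor twice.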