I am trying to load data from Oracle into Vertica via a CSV file, using Python.
I wrote this script to create the CSV from Oracle:
csv_file = open(r"C:\DataBases\csv\%s_%s.csv" % (FILE_NAME, TABLE_NAME), "a", encoding='utf-8')
writer = csv.writer(csv_file, delimiter=';', quoting=csv.QUOTE_ALL)
for row in cursor:
    count_rows += 1
    result_inside = {}
    row_content = []
    for col, val in zip(col_names, row):
        result_inside[col] = val
        row_content.append(result_inside[col])
    result_select_from_oracle.append(result_inside)
    # separate JSON log of the same row (the "file" handle is opened elsewhere, not shown)
    file.write(json.dumps(result_inside, default=myconverter))
    writer.writerow(row_content)
and this script to COPY the CSV into Vertica:
connection = vertica_python.connect(**conn_info)
cursor = connection.cursor()
with open(r"C:\DataBases\csv\%s_%s.csv" % (FILE_NAME, TABLE_NAME), "rb") as my_file:
    cursor.copy("COPY %s.%s from stdin PARSER fcsvparser(type='traditional', delimiter=';', record_terminator='\n')" % (SCHEMA_NAME, TABLE_NAME), my_file)
connection.commit()
connection.close()
After the operation finished I had a problem:
40,000 rows were unloaded from Oracle,
BUT Vertica ended up with only 39,700 rows.
Where can the problem be, and how do I solve it?
The COPY statement has two main stages: parsing and loading (there are other stages, but we’ll stick to these two). COPY rejects data only when it runs into problems during the parsing phase; that is when you end up with rejected rows.
Potential causes for parsing errors include:
Unsupported parser options
Incorrect data types for the table into which data is being loaded
Malformed context for the parser in use
Missing delimiters
You may want the whole load to fail if even one row is rejected; for that, use the optional ABORT ON ERROR parameter.
You may want to limit the number of rejected rows you’ll permit. Use REJECTMAX to set a threshold after which you want COPY to roll back the load process.
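As a hedged sketch against the COPY from the question (reusing its names): ABORT ON ERROR makes the whole load fail on the first rejected row, while adding REJECTMAX 100, for example, would instead roll the load back only once more than 100 rows have been rejected.

cursor.copy(
    "COPY %s.%s FROM STDIN PARSER fcsvparser(type='traditional', delimiter=';', "
    "record_terminator='\n') ABORT ON ERROR" % (SCHEMA_NAME, TABLE_NAME),
    my_file,
)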
Vertica gives you these options to save rejected data:
Do nothing. Vertica automatically saves a rejected-data file, and an accompanying explanation of each rejected row (the exception), to files in a catalog subdirectory called CopyErrorLogs.
Specify file locations of your choice using the REJECTED DATA and EXCEPTIONS parameters (the files will be saved on the machine you run the script on).
Save rejected data to a table. Using a table lets you query what data was rejected, and why. You can then fix any incorrect data and reload it.
Vertica recommends saving rejected data to a table, which will contain both the rejected data and the exception in one location. Saving rejected data to a table is simple: use the REJECTED DATA AS TABLE reject_table clause in the COPY statement, for example:
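A hedged sketch in the context of the script above (the reject table name, TABLE_NAME plus a _rejects suffix, is an assumption):

cursor.copy(
    "COPY %s.%s FROM STDIN PARSER fcsvparser(type='traditional', delimiter=';', "
    "record_terminator='\n') REJECTED DATA AS TABLE %s_rejects" % (SCHEMA_NAME, TABLE_NAME, TABLE_NAME),
    my_file,
)
# then inspect what was rejected and why, e.g.:
# SELECT rejected_data, rejected_reason FROM <TABLE_NAME>_rejects;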
Related
I'm new to psycopg2 and I have a question (for which I cannot really find an answer on the Internet): is there any difference (for example in performance) between using the copy_xxx() methods and the combination of execute() + fetchxxx() when writing the result of a query to a CSV file?
...
query_str = "SELECT * FROM mytable"
cursor.execute(query_str)
with open("my_file.csv", "w+") as file:
writer = csv.writer(file)
while True:
rows = cursor.fetchmany()
if not rows:
break
writer.writerows(rows)
vs
...
query_str = "SELECT * FROM mytable"
output_query = f"COPY {query_str} TO STDOUT WITH CSV HEADER"
with open("my_file.csv", "w+") as file:
cursor.copy_expert(output_query, file)
And if I have a very complex query (assume it cannot be simplified any further) with psycopg2, which method should I use? Or do you have any advice, please?
Many thanks!!!
COPY is faster, but if query execution time is dominant or the file is small, it won't matter much.
You don't show us how the cursor was declared. If it is an anonymous cursor, then execute/fetch will read all query data into memory up front, leading to out-of-memory conditions for very large queries. If it is a named cursor, then you will individually request every row from the server, leading to horrible performance (which can be overcome by specifying a count argument to fetchmany, as the default is bizarrely set to 1).
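For illustration, a hedged sketch of the named-cursor variant with a larger batch size (the connection parameters and the 50,000 batch size are assumptions):

import csv
import psycopg2

conn = psycopg2.connect(dbname="mydb")                     # assumed connection details
with conn, conn.cursor(name="export_cursor") as cursor:    # named = server-side cursor
    cursor.execute("SELECT * FROM mytable")
    with open("my_file.csv", "w", newline="") as file:
        writer = csv.writer(file)
        while True:
            rows = cursor.fetchmany(50000)                 # fetch in batches, not row by row
            if not rows:
                break
            writer.writerows(rows)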
I started creating a database with PostgreSQL and I am currently facing a problem when I want to copy the data from my CSV file into my database.
Here is my code:
connexion = psycopg2.connect(dbname="db_test", user="postgres", password="passepasse")
connexion.autocommit = True
cursor = connexion.cursor()
cursor.execute("""CREATE TABLE vocabulary(
    fname integer PRIMARY KEY,
    label text,
    mids text
)""")
with open(r'C:\mypathtocsvfile.csv', 'r') as f:
    next(f)  # skip the header row
    cursor.copy_from(f, 'vocabulary', sep=',')
connexion.commit()
I allocated 4 columns to store my CSV data; the problem is that the data in my CSV file is stored like this:
fname,labels,mids,split
64760,"Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music","/m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf",train
16399,"Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music","/m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf",train
...
There are commas inside my label and mids columns; that's why I get the following error:
BadCopyFileFormat: ERROR: additional data after the last expected column
Which alternative should I use to copy the data from this CSV file?
ty
If the file is small, the easiest way is to open it in LibreOffice and save it with a new separator.
I usually use ^.
If the file is large, write a script that replaces ," with ^" and "," with "^", respectively.
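A literal, hedged sketch of that replacement for a large file, streaming it line by line (the output path is an assumption, and the longer pattern goes first so the shorter one does not break it):

src_path = r'C:\mypathtocsvfile.csv'
dst_path = r'C:\mypathtocsvfile_caret.csv'   # hypothetical output file
with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
    for line in src:
        # apply the two replacements the answer describes; commas inside the quoted text stay untouched
        dst.write(line.replace('","', '"^"').replace(',"', '^"'))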
COPY supports csv as a format, which already does what you want. But to access it via psycopg2, I think you will need to use copy_expert rather than copy_from.
cursor.copy_expert('copy vocabulary from stdin with csv', f)
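A hedged sketch of how that could slot into the original script, keeping the HEADER option so the manual next(f) call is no longer needed (it assumes the vocabulary table has one column per CSV field):

with open(r'C:\mypathtocsvfile.csv', 'r') as f:
    cursor.copy_expert("COPY vocabulary FROM STDIN WITH (FORMAT csv, HEADER)", f)
connexion.commit()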
I am using Python and psycopg2.
If I run the code below, the user-provided CSV file will be opened and read, and its contents will be transferred to the database.
I want to know if the code is at risk of SQL injection when unexpected words or symbols are contained in the CSV file.
conn_config = dict(port="5432", dbname="test", password="test")
with psycopg2.connect(**conn_config) as conn:
    with conn.cursor() as cur:
        with open("test.csv") as f:
            cur.copy_expert(sql="COPY test FROM STDIN", file=f)
I have read some of the psycopg2 and Postgres documentation, but I did not find an answer.
Please know that English is not my native language, and I may make some confusing mistakes.
The command simply copies the data into the table. No part of the copied data can be interpreted as an SQL command, so SQL injection is out of the question. Additional security comes from the rigid CSV format: if the data contains extra (unexpected) values, the command will simply fail. The only risk of the operation is strange content ending up in the table.
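To illustrate the point, a small hedged sketch (the two-column notes table and its content are assumptions, not part of the question): data that merely looks like SQL is stored verbatim, never executed.

import io
import psycopg2

conn = psycopg2.connect(dbname="test", user="test", password="test")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS notes (id int, body text)")
    # default COPY text format: tab-separated columns, one row per line
    payload = io.StringIO("1\tRobert'); DROP TABLE notes; --\n")
    cur.copy_expert("COPY notes FROM STDIN", payload)
    cur.execute("SELECT body FROM notes")
    print(cur.fetchall())   # the would-be injection comes back as an ordinary string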
I have a table that has 10 million plus records (rows) in it. I am trying to do a one-time load into S3 by selecting * from the table and then writing it to a gzip file in my local file system. Currently, I can run my script to collect 800,000 records into the gzip file, but then I receive an error, and the remaining records are obviously not inserted.
Since there is no continuation in SQL (for example, if you run 10 queries each with LIMIT 800,000, they won't be in a consistent order).
So, is there a way to write a Python/Airflow function that can load the 10-million-plus-row table in batches? Perhaps there is a way in Python where I can do a SELECT * statement and continue it after x amount of records, into separate gzip files?
Here is my Python/Airflow script so far; when run, it only writes 800,000 records to the path variable:
def gzip_postgres_table(table_name, **kwargs):
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    server_postgres = create_tunnel_postgres()
    server_postgres.start()
    etl_conn = conn_postgres_internal(server_postgres)
    record = get_etl_record(kwargs['master_table'], kwargs['table_name'])
    cur = etl_conn.cursor()
    unload_sql = '''SELECT *
                    FROM schema1.database1.{0} '''.format(record['table_name'])
    cur.execute(unload_sql)
    result = cur.fetchall()
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp, delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_postgres.stop()
The best, I mean THE BEST, approach to insert so many records into PostgreSQL, or to get them from PostgreSQL, is to use PostgreSQL's COPY. This means you would have to change your approach drastically, but there's no better way that I know of in PostgreSQL. COPY manual
COPY creates a file from the query you are executing, or it can insert into a table from a file.
COPY moves data between PostgreSQL tables and standard file-system files.
The reason why it is the best solution is that you are using PostgreSQL's default method to handle external data, without intermediaries; so it's fast and secure.
COPY works like a charm with CSV files. You should change your approach to a file handling method and the use of COPY.
Since COPY runs with SQL, you can divide your data using LIMIT and OFFSET in the query (add an ORDER BY if the batches must not overlap). For example:
COPY (SELECT * FROM country LIMIT 10 OFFSET 10) TO '/usr1/proj/bray/sql/a_list_of_10_countries.copy';
-- This copies 10 rows from country, skipping the first 10
COPY only works with files that are accessible by the PostgreSQL user on the server.
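If the script cannot write files on the database server, as in the question above (it runs on an Airflow worker over an SSH tunnel), psycopg2's copy_expert can stream the same COPY output back to the client instead. A hedged sketch that reuses names from the question; the 500,000 batch size and the ORDER BY id column are assumptions:

import gzip
import math

def gzip_postgres_table_batched(table_name, etl_conn, batch_size=500000):
    # hypothetical batched variant of the function from the question:
    # each batch is streamed straight from COPY ... TO STDOUT into the gzip file
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    cur = etl_conn.cursor()
    cur.execute("SELECT count(*) FROM schema1.database1.{0}".format(table_name))
    total = cur.fetchone()[0]
    with gzip.open(path, 'wt') as fp:
        for batch in range(math.ceil(total / batch_size)):
            header = "HEADER" if batch == 0 else ""
            sql = ("COPY (SELECT * FROM schema1.database1.{0} ORDER BY id "
                   "LIMIT {1} OFFSET {2}) TO STDOUT WITH CSV {3}").format(
                       table_name, batch_size, batch * batch_size, header)
            cur.copy_expert(sql, fp)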
PL Function (edited):
If you want COPY to be dynamic, you can wrap it in a PL/pgSQL function. For example:
CREATE OR REPLACE FUNCTION copy_table(
    table_name text,
    file_name text,
    vlimit text,
    voffset text
) RETURNS VOID AS $$
DECLARE
    query text;
BEGIN
    query := 'COPY (SELECT * FROM '||table_name||' LIMIT '||vlimit||' OFFSET '||voffset||') TO '''||file_name||''' DELIMITER '','' CSV';
    -- NOTE that file_name has to include its directory too.
    EXECUTE query;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
To execute the function you just have to do:
SELECT copy_table('test','/usr/sql/test.csv','10','10')
Notes:
If the PL will be public, you have to check for SQL injection attacks.
You can program the PL to suit your needs, this is just an example.
The function returns VOID, so it just does the COPY; if you need some feedback you should return something else.
The function has to be owned by the postgres user on the server, because it needs file access; that is why it uses SECURITY DEFINER, so that any database user can run the PL.
I am trying to read a file from GCS, store it in a variable, and load it into a PostgreSQL table. I can connect to GCS from my code and store the data in the variable with this:
result = blob.download_as_string()
result = result.decode('utf8').strip()
and the printed result is in the correct format. Then I tried to insert the data from this variable into the table, which I did like this:
sql = "COPY tablename FROM STDIN WITH DELIMITER ',' NULL AS '\\N' CSV HEADER;"
cursor.copy_expert(sql, result)
and I got this error:
file must be a readable file-like object for COPY FROM; a writable file-like object for COPY TO
I also tried with another function:
cursor.copy_from(result, 'table' ,sep=',')
but I got this result:
argument 1 must have a .read() method
So, coming back to my question: how can I get the data from the variable into the table I created? Or do I have to download it locally and use this:
sql = "COPY tablename from STDIN WITH DELIMITER ',' CSV HEADER"
with open('/path/to/csv', 'r+') as file:
    cursor.copy_expert(sql, file)
This one works, but I don't want to download the file locally. I just want to read it and insert it into my table.
Thank you
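Both error messages point at the same requirement: copy_expert and copy_from want a readable file-like object, not a plain string. A hedged, untested sketch of one way to satisfy that without downloading anything, by wrapping the decoded string from the question in io.StringIO (it assumes result still holds the decoded CSV text, including its header line):

import io

sql = "COPY tablename FROM STDIN WITH DELIMITER ',' NULL AS '\\N' CSV HEADER;"
cursor.copy_expert(sql, io.StringIO(result))   # StringIO gives the string a .read() method
conn.commit()   # assuming the connection object is called conn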