I'm having an EOF issue when trying to bcp a .csv file I generated with Python's csv.writer. I've done lots of googling with no luck, so I turn to you helpful folks on SO
Here's the error message (which is triggered on the subprocess.call() line):
Starting copy...
Unexpected EOF encountered in BCP data-file.
bcp copy in failed
Here's the code:
sel_str = 'select blahblahblah...'
result = engine.execute(sel_str)  # engine is a SQLAlchemy engine instance
# write to disk temporarily to be able to bcp the results to the db temp table
with open('tempscratch.csv','wb') as temp_bcp_file:
    csvw = csv.writer(temp_bcp_file)
    for r in result:
        csvw.writerow(r)
    temp_bcp_file.flush()
# upload the temp scratch file
bcp_string = 'bcp tempdb..collection in #INFILE -c -U username -P password -S DSN'
bcp_string = string.replace(bcp_string,'#INFILE','tempscratch.csv')
result_code = subprocess.call(bcp_string, shell=True)
I looked at the tempscratch.csv file in a text editor and didn't see any weird EOF or other control characters. Moreover, I looked at other .csv files for comparison, and there doesn't seem to be a standardized EOF that bcp is looking for.
Also, yes this is hacky, pulling down a result set, writing it to disk and then reuploading it to the db with bcp. I have to do this because SQLAlchemy does not support multi-line statements (aka DDL and DML) in the same execute() command. Further, this connection is with a Sybase db, which does not support SQLAlchemy's wonderful ORM :( (which is why I'm using execute() in the first place)
From what I can tell, the bcp default field delimiter is the tab character '\t' while Python's csv writer defaults to the comma. Try this...
# write to disk temporarily to be able to bcp the results to the db temp table
with open('tempscratch.csv','wb') as temp_bcp_file:
    csvw = csv.writer(temp_bcp_file, delimiter='\t')
    for r in result:
        csvw.writerow(r)
    temp_bcp_file.flush()
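Alternatively (a sketch only, not tested against your Sybase setup), you could leave the file comma-delimited as csv.writer writes it by default and instead tell bcp which field terminator to expect with -t, so the two sides agree:

# Sketch: keep csv.writer's default comma delimiter and pass a matching
# field terminator to bcp. Server, table, and credentials are the
# placeholders from the question.
import subprocess

bcp_string = 'bcp tempdb..collection in tempscratch.csv -c -t, -U username -P password -S DSN'
result_code = subprocess.call(bcp_string, shell=True)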
I'm trying to perform an update using flask-sqlalchemy, but when it gets to the update script it does not return anything. It seems the script is hanging or not doing anything.
I tried to wrap a try/except around the code that does not complete, but there are no errors.
I gave it 10 minutes to complete the update statement, which only updates 1 record, and still it will not do anything.
When I cancel the script, it gives the error Communication link failure (0) (SQLEndTran), but I don't think this is the root cause, because in the same script I have other SQL statements that work fine, so the connection to the db is good.
What my script does is get a list of filenames that I need to process (I have no issues with this). Then, using the retrieved list of filenames, I look in the directory to check whether each file exists. If it does not exist, I update the database to tag the file as not found. This is where I get the issue: it does not perform the update, nor does it give an error message of any sort.
I even tried to create a new engine just for the update script, but I still get the same behavior.
I also tried printing out the SQL statement in Python before executing it. I ran the printed SQL command in my SQL browser and it worked fine.
The code is very simple; I'm not really sure why it's having the issue.
#!/usr/bin/env python3
from flask_sqlalchemy import sqlalchemy
import glob

files_directory = "/files_dir/"
sql_string = """
select *
from table
where status is null
"""
# omitted conn_string
engine1 = sqlalchemy.create_engine(conn_string)
result = engine1.execute(sql_string)
for r in result:
    engine2 = sqlalchemy.create_engine(conn_string)
    filename = r[11]
    match = glob.glob(f"{files_directory}/**/{filename}.wav")
    if not match:
        print('no match')
        script = "update table set status = 'not_found' where filename = '" + filename + "' "
        engine2.execute(script)
        engine2.dispose()
        continue
engine1.dispose()
It appears that if I try to loop through 26k records, the script doesn't work, but when I do it in batches of 2k records per run, the script works. So my SQL string becomes (adding top 2000 to the query):
sql_string = """
select top 2000 *
from table
where status is null
"""
It's manual, yeah, but it works for me since I only need to run this script once (I mean 13 times).
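If kicking it off 13 times by hand ever gets old, the same idea can be wrapped in a loop. This is only a rough, untested sketch: it assumes that every row in a batch eventually gets a non-null status (including the files that are found), so each pass of top 2000 returns fresh rows and the loop drains; conn_string and the table/column names are the same placeholders as in the question.

# Sketch: process batches of 2000 unprocessed rows until none are left.
import glob
import sqlalchemy

engine = sqlalchemy.create_engine(conn_string)  # conn_string omitted as in the question
files_directory = "/files_dir/"

while True:
    rows = engine.execute(
        "select top 2000 * from table where status is null"
    ).fetchall()
    if not rows:
        break
    for r in rows:
        filename = r[11]
        if not glob.glob(f"{files_directory}/**/{filename}.wav"):
            # parameter binding would be safer than string concatenation here
            engine.execute(
                "update table set status = 'not_found' where filename = '" + filename + "'"
            )
        else:
            # process the found file and set its status too, so the row
            # drops out of the next "status is null" batch
            pass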
I'm using the bcp tool to import a CSV into a SQL Server table. I'm using Python's subprocess to execute the bcp command. My sample bcp command is like below:
bcp someDatabase.dbo.sometable IN myData.csv -n -t , -r \n -S mysqlserver.com -U myusername -P 'mypassword'
The command executes and says:
0 rows copied.
Even if I remove the -t or -n option, the message is still the same. I read in the SQL Server docs that there is something called a length prefix (if the bcp tool is used in -n (native) mode).
How can I specify that length prefix with the bcp command?
My goal is to import a CSV into a SQL Server table using the bcp tool. I first create my table according to my data in the CSV file, and I don't create a format file for bcp. I want all my data to be inserted correctly (according to the data types I have specified in my table).
If it is a CSV file, then do not use the -n, -t, or -r options. Use -e errorFileName to catch the error(s) you may be encountering. You can then take the appropriate steps.
It is a very common practice with ETL tasks to first load text files into a "load" table that has all varchar/char data types. This avoids any possible implied data conversion errors that are more difficult/time-consuming to troubleshoot via BCP. Just pass the character data in the text file into character datatype columns in SQL Server. Then you can move data from the "load" table into your final destination table. This will allow you to use the MUCH more functional T-SQL commands to handle transformation of data types. Do not force BCP/SQL Server to transform your data-types for you by going from text file directly into your final table via BCP.
Also, I would suggest visually inspecting your incoming data file to confirm it is formatted as specified. I often see mix-ups between \n and \r\n for the line terminator.
Last, when loading the data, you should also use the -e option as Neeraj has stated. This will capture "data" errors (it does not report command/syntax errors; just data/formatting errors). Since your incoming file is an ASCII text file, you DO want to use the -c option for loading into the all-varchar "load" table.
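For illustration only, a character-mode load into an all-varchar "load" table with an error file might look something like the sketch below. The staging table name, server, and credentials are placeholders, and if the file really is comma-separated you will likely also need a matching field terminator; adjust the terminators to whatever your file actually uses, as noted above.

# Sketch: bcp in character mode (-c) with an error file (-e) for data errors.
# someDatabase.dbo.sometable_load is a hypothetical all-varchar staging table.
import subprocess

bcp_cmd = [
    "bcp", "someDatabase.dbo.sometable_load", "in", "myData.csv",
    "-c",                    # character (ASCII text) mode
    "-t", ",",               # field terminator matching the CSV
    "-e", "bcp_errors.txt",  # rejected rows / data errors land here
    "-S", "mysqlserver.com", "-U", "myusername", "-P", "mypassword",
]
subprocess.run(bcp_cmd, check=True)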
This answer suggests using AWS Data Pipeline but I'm wondering if there's a clever way to do it with my own machine and Python.
I've been using psycopg2, boto3 and pandas libraries. Tables have 5 to 50 columns and few million rows. My current method doesn't work that well with large data.
Guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2:
import io
import psycopg2
import boto3

resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # COPY ... TO STDOUT so that copy_expert streams the result into the file object
    query = f"""COPY {select_query} TO STDOUT
                WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
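A call would then look something like this (bucket, query, and key are placeholders); note that the SELECT has to be wrapped in parentheses so that COPY (query) TO STDOUT is valid SQL:

# Hypothetical usage of the helper above.
copyFun(
    bucket='my-example-bucket',
    select_query='(select * from my_schema.example)',
    filename='exports/example',
)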
We do the following in our case. Performance-wise it's pretty fast, and it's a scheduled method rather than continuous streaming. I'm not 100% sure if it's the wisest method, but it's definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into a data warehouse.
Using a shell script, we fire a psql command to copy data to a local file on the EC2 app instance.
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv
Then, using a shell script, we fire an s3cmd command to put example.csv to the designated S3 bucket location.
s3cmd put example.csv s3://your-bucket/path/to/file/
This is an old question, but it comes up when searching for "aws_s3.query_export_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.
This can be done natively with a Postgres extension if you're using AWS Aurora PostgreSQL 11.6 or above, via aws_s3.query_export_to_s3.
Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
See here for the function reference:
https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions
This has been present in Aurora for PostgreSQL since 3.1.0, which was released on February 11, 2020 (I don't know why this URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310
I would not recommend using 3.1.0/11.6, however; there is a bug that causes data corruption issues after 10 MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494
I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so anyone who wants a way to dump data from Postgres to S3... and is on AWS, give this a try!
Here's an example query to create JSONL for you.
JSONL is JSON, with a single JSON object per line: https://jsonlines.org/
So you can dump a whole table to a JSONL file, for example... You could also do json_agg in Postgres and dump a single JSON file with the objects in an array; it's up to you, really. Just change the query and the file extension, and leave it as text format.
select * from aws_s3.query_export_to_s3(
    'select row_to_json(data) from (<YOUR QUERY HERE>) data',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.jsonl',
        'us-east-1'),
    options := 'format text');
For CSV, something like this should do the trick:
select * from aws_s3.query_export_to_s3(
    '<YOUR QUERY HERE>',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.csv',
        'us-east-1'),
    options := 'format csv');
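And since the question mentions psycopg2: the export can be kicked off from Python just by executing that SQL. Here is a rough sketch (connection details, query, bucket path, and region are placeholders; per the AWS docs the function returns rows_uploaded, files_uploaded, and bytes_uploaded):

# Sketch: run the aws_s3 CSV export from Python via psycopg2.
import psycopg2

conn = psycopg2.connect(dbname='mydb', user='user', password='pw', host='my-aurora-host')
with conn:
    with conn.cursor() as cur:
        cur.execute("""
            select * from aws_s3.query_export_to_s3(
                'select * from my_schema.example',
                aws_commons.create_s3_uri('example-bucket/some/path', 'example.csv', 'us-east-1'),
                options := 'format csv');
        """)
        print(cur.fetchone())  # (rows_uploaded, files_uploaded, bytes_uploaded)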
I am writing a Python method to dump the entire contents of a MySQL table. However, this table contains personally-identifiable information (PII). I have a requirement that this data must be GPG-encrypted. Additionally, the requirement is that none of this data may be written to disk in unencrypted form (even as a temporary file that is later removed).
I have temporarily solved this problem by using subprocess.Popen() and piping the output of the mysql executable directly to the gpg executable, and then piping that output to stdout:
import subprocess

p1 = subprocess.Popen(
    'mysql -h127.0.0.1 -Dmydbinstance -umyuser -pmyPassword -e "select * from my_table"',
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    shell=True,  # the commands are plain strings, so run them through the shell
)
p2 = subprocess.Popen(
    "gpg --encrypt -r myemail@gmail.com",
    stdin=p1.stdout,
    stdout=subprocess.PIPE,
    shell=True,
)
p1.stdout.close()
print p2.communicate()[0]
It works, but this seems to me like a terrible hack. It feels very wrong to fork shell processes to do this.
So I want to do this natively in Python (without Popen()). I have a MySQLdb connection to the database, and the python-gnupg module can do the encryption on a file stream. But how can I convert the output of MySQLdb's fetchall() into a file stream? So far, all I have is this:
import MySQLdb
import gpg

DBConn = MySQLdb.Connect(host='127.0.0.1', user='myuser', passwd='myPassword',
                         db='mydbinstance', port=3306, charset='utf8')
DBConn.autocommit(True)
cur = DBConn.cursor(MySQLdb.cursors.DictCursor)
cur.execute("select * from my_table")
if cur.rowcount >= 1:
    rows = cur.fetchall()
else:
    rows = []
for i in rows:
    print i
# WHAT DO I NEED TO DO HERE TO TURN THE DB OUTPUT INTO A FILE STREAM?
encrypted_ascii_data = gpg.encrypt_file(stream, recipient_fingerprint)
How can I turn the output of fetchall() into a file stream so that I can send it to gpg.encrypt_file() without writing an unencrypted temporary file to disk? There could be millions of rows of data, so reading it all into memory at once is not a viable solution.
You can use a file-like object similar to io.StringIO or io.BytesIO from the io module.
Looking at the latest source code, there is no encrypt_file anymore; instead, encrypt wraps the data in a binary stream using StringIO or BytesIO, depending on the Python version.
So nothing actually stops you from using encrypt directly. If you want more control over how the data is encrypted, you can implement a dummy file object or just write your data to an io.BytesIO object.
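To illustrate the dummy file-object route, below is a rough, untested sketch. It assumes python-gnupg is installed and imported as gnupg, that your version still exposes encrypt_file (otherwise use encrypt as noted above), and that the library reads the stream in chunks via read(size) rather than all at once. The row formatting (tab-separated text) and the recipient are placeholders.

# Sketch: a minimal file-like wrapper that serializes rows on demand, so data
# is only pulled from MySQL as gpg consumes it.
import gnupg
import MySQLdb
import MySQLdb.cursors

class CursorStream(object):
    def __init__(self, cursor, batch_size=1000):
        self.cursor = cursor
        self.batch_size = batch_size
        self.buffer = b""

    def read(self, size=-1):
        # Refill the buffer from the cursor until the request can be satisfied.
        while size < 0 or len(self.buffer) < size:
            rows = self.cursor.fetchmany(self.batch_size)
            if not rows:
                break
            lines = ["\t".join(str(col) for col in row) + "\n" for row in rows]
            self.buffer += "".join(lines).encode("utf8")
        if size < 0:
            data, self.buffer = self.buffer, b""
        else:
            data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data

DBConn = MySQLdb.Connect(host='127.0.0.1', user='myuser', passwd='myPassword',
                         db='mydbinstance', port=3306, charset='utf8')
# SSCursor streams rows from the server instead of buffering them all client-side
cur = DBConn.cursor(MySQLdb.cursors.SSCursor)
cur.execute("select * from my_table")

gpg = gnupg.GPG()
encrypted_ascii_data = gpg.encrypt_file(CursorStream(cur), recipients=['myemail@gmail.com'])
print(encrypted_ascii_data.status)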
I need to load data from some source data sources to a Postgres database.
To do this task, I first write the data to a temporary CSV file and then load data from the CSV file into the Postgres database using a COPY FROM query. I do all of this in Python.
The code looks like this:
import psycopg2

table_name = 'products'
temp_file = "'C:\\Users\\username\\tempfile.csv'"
db_conn = psycopg2.connect(host=host, port=port, user=user, password=password, dbname=database)
cursor = db_conn.cursor()
query = """COPY """ + table_name + """ FROM """ + temp_file + """ WITH NULL AS ''; """
cursor.execute(query)
db_conn.commit()  # COPY runs inside a transaction, so commit it
I want to avoid the step of writing to the intermediate file. Instead, I would like to write to a Python object and then load the data into the Postgres database using the COPY FROM method.
I am aware of the technique of using psycopg2's copy_from method, which copies data from a StringIO object to the Postgres database. However, I cannot use psycopg2 for a reason, and hence I don't want my COPY FROM task to depend on that library. I want it to be a Postgres query which can be run by any other Postgres driver as well.
Please advise a better way of doing this without writing to an intermediate file.
You could call the psql command-line tool from your script (e.g. using subprocess.call) and leverage its \copy command, piping the output of one instance to the input of another and avoiding a temp file. For example:
psql -X -h from_host -U user -c "\copy from_table to stdout" | psql -X -h to_host -U user -c "\copy to_table from stdin"
This assumes the table exists in the destination database. If not, a separate command would first need to create it.
Also, note that one caveat of this method is that errors from the first psql call can get swallowed by the piping process.
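If you go that route, a minimal sketch of driving the pipeline from Python might look like the following (hosts, users, databases, and tables are placeholders, and it assumes non-interactive authentication such as ~/.pgpass):

# Sketch: pipe "\copy ... to stdout" on the source directly into
# "\copy ... from stdin" on the destination, with no temp file.
import subprocess

dump = subprocess.Popen(
    ['psql', '-X', '-h', 'from_host', '-U', 'user', '-d', 'from_db',
     '-c', '\\copy from_table to stdout'],
    stdout=subprocess.PIPE,
)
load = subprocess.Popen(
    ['psql', '-X', '-h', 'to_host', '-U', 'user', '-d', 'to_db',
     '-c', '\\copy to_table from stdin'],
    stdin=dump.stdout,
)
dump.stdout.close()  # so the dump side gets SIGPIPE if the load side exits early
if load.wait() != 0 or dump.wait() != 0:
    raise RuntimeError('copy pipeline failed')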
psycopg2 has integrated support for the COPY wire-protocol, allowing you to use COPY ... FROM STDIN / COPY ... TO STDOUT.
See Using COPY TO and COPY FROM in the psycopg2 docs.
Since you say you can't use psycopg2, you're out of luck. Drivers must understand COPY TO STDOUT / COPY FROM STDIN in order to use them, or must provide a way to write raw data to the socket so you can hijack the driver's network socket and implement the COPY protocol yourself. Driver specific code is absolutely required for this, it is not possible to simply use the DB-API.
So khampson's suggestion, while usually a really bad idea, seems to be your only alternative.
(I'm posting this mostly to make sure that other people who find this answer who don't have restrictions against using psycopg2 do the sane thing.)
If you must use psql, please:
Use the subprocess module with the Popen constructor
Pass -qAtX and -v ON_ERROR_STOP=1 to psql to get sane behaviour for batching.
Use the array form of the command, e.g. ['psql', '-v', 'ON_ERROR_STOP=1', '-qAtX', '-c', '\copy mytable from stdin'], rather than using a shell.
Write to psql's stdin, then close it, and wait for psql to finish.
Remember to trap exceptions thrown on command failure. Let subprocess capture stderr and wrap it in the exception object.
It's safer, cleaner, and easier to get right than the old-style os.popen2 etc. A rough sketch of this recipe follows.
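The table name, connection settings, and rows below are placeholders; it assumes non-interactive authentication such as ~/.pgpass, and real data should be properly CSV-escaped (e.g. with the csv module).

# Sketch: feed rows to psql's \copy ... from stdin via its stdin, per the checklist above.
import subprocess

cmd = ['psql', '-v', 'ON_ERROR_STOP=1', '-qAtX',
       '-h', 'to_host', '-U', 'user', '-d', 'mydb',
       '-c', '\\copy mytable from stdin with (format csv)']

proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stderr=subprocess.PIPE)

rows = [('1', 'first row'), ('2', 'second row')]  # placeholder data source
for row in rows:
    proc.stdin.write((','.join(row) + '\n').encode('utf8'))

proc.stdin.close()  # signals end-of-data to \copy
stderr = proc.stderr.read()
if proc.wait() != 0:
    raise RuntimeError('psql \\copy failed: %s' % stderr.decode('utf8', 'replace'))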