I have a multiple-clients-to-single-server bidirectional iperf setup for network monitoring. The iperf server runs well and displays output in CSV format, driven by cron jobs on the client end.
I wish to write a Python script to automate the process of mapping these CSV outputs to a MySQL database, which in turn would be updated and saved at regular intervals without the need for human intervention.
I am using an Ubuntu 13.10 machine as the iperf server. Below is a sample of the CSV output I get. It is not being stored to a file, just displayed on screen.
s1:~$ iperf -s -y C
20140422105054,172.16.10.76,41065,172.16.10.65,5001,6,0.0-20.0,73138176,29215083
20140422105054,172.16.10.76,5001,172.16.10.65,56254,4,0.0-20.0,46350336,18502933
20140422105100,172.16.10.76,54550,172.16.10.50,5001,8,0.0-20.0,67895296,27129408
20140422105100,172.16.10.76,5001,172.16.10.50,58447,5,0.0-20.1,50937856,20292796
20140422105553,172.16.10.76,5001,172.16.10.65,47382,7,0.0-20.1,51118080,20358083
20140422105553,172.16.10.76,41067,172.16.10.65,5001,5,0.0-20.1,76677120,30524007
20140422105600,172.16.10.76,5001,172.16.10.50,40734,4,0.0-20.0,57606144,23001066
20140422105600,172.16.10.76,54552,172.16.10.50,5001,8,0.0-20.0,70123520,28019115
20140422110053,172.16.10.76,41070,172.16.10.65,5001,5,0.0-20.1,63438848,25284066
20140422110053,172.16.10.76,5001,172.16.10.65,46462,6,0.0-20.1,11321344,4497094
The fields I want to map these to are: timestamp, server_ip, server_port, client_ip, client_port, tag_id, interval, transferred, bandwidth
I want to map this CSV output to a MySQL database periodically, for which I understand I would have to write a Python script (run from a cron job) that builds the queries and stores the data in the MySQL database. I am a beginner at Python scripting and database queries.
I went through another discussion on Server Fault at https://serverfault.com/questions/566737/iperf-csv-output-format and would like to build my query based on it.
Generate SQL script, then run it
If you do not want to use a more complex solution like SQLAlchemy, the following approach is possible:
- convert your CSV data into an SQL script
- use the mysql command-line tool to run that script
Before you do this the first time, be sure to create the needed table structure in the database (this I leave to you; one possible layout is sketched below).
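For reference, here is a minimal sketch of creating such a table from Python with the MySQLdb driver; the column types, the table name mytable, and the connection credentials are assumptions, so adjust them to your setup:
import MySQLdb  # from the mysql-python package

# Assumed column types for the nine iperf fields; adjust to taste.
ddl = """
create table if not exists mytable (
    `timestamp` bigint,
    server_ip   varchar(15),
    server_port int,
    client_ip   varchar(15),
    client_port int,
    tag_id      int,
    `interval`  varchar(16),  -- backticks: INTERVAL is a reserved word in MySQL
    transferred bigint,
    bandwidth   bigint
)
"""

conn = MySQLdb.connect(host="localhost", user="username", passwd="password", db="db_name")
conn.cursor().execute(ddl)
conn.commit()
conn.close()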
The following sample uses the docopt package (just for my convenience), so you need to install it:
$ pip install docopt
CSV to SQL script conversion utility
csv2sql.py:
"""
Usage:
csv2sql.py [--table <tablename>] <csvfile>
Options:
--table <tablename> Name of table in database to import into [default: mytable]
Convert csv file with iperf data into sql script for importing
those data into MySQL database.
"""
from csv import DictReader
from docopt import docopt
if __name__ == "__main__":
args = docopt(__doc__)
fname = args["<csvfile>"]
tablename = args["--table"]
headers = ["timestamp",
"server_ip",
"server_port",
"client_ip",
"client_port",
"tag_id",
"interval",
"transferred",
"bandwidth"
]
sql = """insert into {tablename}
values ({timestamp},"{server_ip}",{server_port},"{client_ip}",{client_port},{tag_id},"{interval}",{transferred},{bandwidth});"""
with open(fname) as f:
reader = DictReader(f, headers, delimiter=",")
for rec in reader:
print(sql.format(tablename=tablename, **rec)) # python <= 2.6 will fail here
Convert CSV to SQL script
First, let the conversion utility introduce itself:
$ python csv2sql.py -h
Usage:
    csv2sql.py [--table <tablename>] <csvfile>

Options:
    --table <tablename>  Name of table in database to import into [default: mytable]

Convert csv file with iperf data into sql script for importing
those data into MySQL database.
Having your data in file data.csv:
$ python csv2sql.py data.csv
insert into mytable
values (20140422105054,"172.16.10.76",41065,"172.16.10.65",5001,6,"0.0-20.0",73138176,29215083);
insert into mytable
values (20140422105054,"172.16.10.76",5001,"172.16.10.65",56254,4,"0.0-20.0",46350336,18502933);
insert into mytable
values (20140422105100,"172.16.10.76",54550,"172.16.10.50",5001,8,"0.0-20.0",67895296,27129408);
insert into mytable
values (20140422105100,"172.16.10.76",5001,"172.16.10.50",58447,5,"0.0-20.1",50937856,20292796);
insert into mytable
values (20140422105553,"172.16.10.76",5001,"172.16.10.65",47382,7,"0.0-20.1",51118080,20358083);
insert into mytable
values (20140422105553,"172.16.10.76",41067,"172.16.10.65",5001,5,"0.0-20.1",76677120,30524007);
insert into mytable
values (20140422105600,"172.16.10.76",5001,"172.16.10.50",40734,4,"0.0-20.0",57606144,23001066);
insert into mytable
values (20140422105600,"172.16.10.76",54552,"172.16.10.50",5001,8,"0.0-20.0",70123520,28019115);
insert into mytable
values (20140422110053,"172.16.10.76",41070,"172.16.10.65",5001,5,"0.0-20.1",63438848,25284066);
insert into mytable
values (20140422110053,"172.16.10.76",5001,"172.16.10.65",46462,6,"0.0-20.1",11321344,4497094);
Put it all into file data.sql:
$ python csv2sql.py data.csv > data.sql
Apply data.sql to your MySQL database
And finally, use the mysql command (provided by MySQL) to import the data into the database:
$ mysql --user=username --password=password db_name < data.sql
If you plan on using Python, then I would recommend SQLAlchemy.
The general approach is:
- define a class that has all the attributes you want to store
- map all the properties of the class to database columns and types
- read your data from the CSV (using e.g. the csv module); for each row, create a corresponding instance of the prepared class and let it be stored
The SQLAlchemy documentation will give you more details and instructions; your requirement seems rather easy, and a sketch of these steps follows below.
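A minimal sketch of those steps, assuming a recent SQLAlchemy (1.4+); the table name, column types and connection URL are assumptions, not taken from the question:
from csv import DictReader

from sqlalchemy import BigInteger, Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class IperfResult(Base):
    """One row of iperf CSV output; the table name and column types are assumptions."""
    __tablename__ = 'iperf_results'
    id = Column(Integer, primary_key=True)
    timestamp = Column(String(14))
    server_ip = Column(String(15))
    server_port = Column(Integer)
    client_ip = Column(String(15))
    client_port = Column(Integer)
    tag_id = Column(Integer)
    interval = Column(String(16))
    transferred = Column(BigInteger)
    bandwidth = Column(BigInteger)

engine = create_engine('mysql://username:password@localhost/db_name')  # assumed URL; uses the MySQLdb driver
Base.metadata.create_all(engine)

headers = ["timestamp", "server_ip", "server_port", "client_ip", "client_port",
           "tag_id", "interval", "transferred", "bandwidth"]

session = sessionmaker(bind=engine)()
with open('data.csv') as f:
    for rec in DictReader(f, headers):
        session.add(IperfResult(**rec))  # one mapped object per CSV row
session.commit()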
Another option is to find an existing CSV import tool; some are already available with MySQL, and there are plenty of others too.
This is probably not the kind of answer you are looking for, but if you learn a little sqlite3 (a module in Python's standard library: "import sqlite3") by doing a basic tutorial online, you will realize that your problem is not at all difficult to solve. Then just use a standard timer, such as time.sleep(), to repeat the procedure.
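A minimal sketch of that idea; the file name data.csv and the five-minute interval are assumptions:
import csv
import sqlite3
import time

conn = sqlite3.connect('iperf.db')
conn.execute("""create table if not exists results (
    timestamp text, server_ip text, server_port integer,
    client_ip text, client_port integer, tag_id integer,
    interval text, transferred integer, bandwidth integer)""")

while True:
    # In a real setup you would rotate or truncate data.csv after each import
    # so rows are not inserted twice.
    with open('data.csv') as f:
        rows = list(csv.reader(f))
    conn.executemany("insert into results values (?,?,?,?,?,?,?,?,?)", rows)
    conn.commit()
    time.sleep(300)  # wait five minutes before the next import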
Related
I use code that opens a CSV file in order to store it in a database. I use SQL Server.
Once the file has been read into RAM, after some preprocessing, we want to store it in the database.
Under PostgreSQL we use the following code, but I want an equivalent for SQL Server:
# upload to db
SQL_STATEMENT = """
COPY %s FROM STDIN WITH
CSV
HEADER
DELIMITER AS ','
"""
cursor.copy_expert(sql=SQL_STATEMENT % tbl_name, file=my_file)
I have no idea how to adapt this code block for SQL Server without changing the rest of the code.
psycopg2 is a Postgres-specific DB-API implementation that maintains extended methods such as copy_expert, copy_from and copy_to, which are only supported in Postgres, whereas pyodbc is a generalized DB-API that interfaces with any ODBC driver, including SQL Server, Teradata, MS Access, and even PostgreSQL ODBC drivers. Therefore, it is unlikely that an SQL Server-specific convenience method exists to replace copy_expert.
However, consider submitting an SQL Server-specific command such as BULK INSERT, which can read from flat files, and running it with cursor.execute. The snippet below uses f-strings (introduced in Python 3.6) for string formatting:
# upload to db
SQL_STATEMENT = (
f"BULK INSERT {tbl_name} "
f"FROM '{my_file}' "
"WITH (FORMAT='CSV');"
)
cur.execute(SQL_STATEMENT)
conn.commit()
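For completeness, here is a hedged sketch of the pyodbc connection the snippet above assumes; the driver name, server, database and credentials are placeholders:
import pyodbc

# Hypothetical DSN-less connection string; adjust driver, server and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
cur = conn.cursor()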
I'm using the bcp tool to import a CSV into a SQL Server table. I'm using Python's subprocess module to execute the bcp command. My sample bcp command looks like this:
bcp someDatabase.dbo.sometable IN myData.csv -n -t , -r \n -S mysqlserver.com -U myusername -P 'mypassword'
The command executes and says:
0 rows copied.
Even if I remove the -t or -n option, the message is still the same. I read in the SQL Server docs that there is something called a length prefix (when the bcp tool is used in -n (native) mode).
How can I specify that length prefix with the bcp command?
My goal is to import the CSV into a SQL Server table using the bcp tool. I first create my table according to the data in the CSV file, and I don't create a format file for bcp. I want all my data to be inserted correctly (according to the data types I have specified in my table).
If it is a CSV file, then do not use the -n, -t or -r options. Use -e errorFileName to catch the error(s) you may be encountering; you can then take the appropriate steps.
It is a very common practice with ETL tasks to first load text files into a "load" table whose columns are all varchar/char data types. This avoids any possible implicit data-conversion errors, which are more difficult and time-consuming to troubleshoot via BCP: just pass the character data in the text file into character-datatype columns in SQL Server. Then move the data from the "load" table into your final destination table. This lets you use the much more functional T-SQL commands to handle the data-type transformations. Do not force BCP/SQL Server to transform your data types for you by going from the text file directly into your final table via BCP.
Also, I would suggest visually inspecting your incoming data file to confirm it is formatted as specified. I often see mix-ups between \n and \r\n as the line terminator.
Last, when loading the data, you should also use the -e option as Neeraj has stated. This will capture "data" errors (it does not report command/syntax errors, just data/formatting errors). Since your incoming file is an ASCII text file, you DO want to use the -c option when loading into the all-varchar "load" table. A sketch of such an invocation from Python follows below.
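As a rough sketch, the bcp call could be driven from Python's subprocess like this; the load-table and error-file names are hypothetical, the server and credentials are taken from the question, and the comma/newline terminators are assumptions about the file:
import subprocess

# Character-mode load into an all-varchar "load" table, with an error file.
cmd = [
    'bcp', 'someDatabase.dbo.load_table', 'IN', 'myData.csv',
    '-c',                    # character mode for a plain text/CSV file
    '-t', ',',               # field terminator (assumed comma-separated)
    '-r', '\\n',             # row terminator; check whether your file uses \r\n
    '-e', 'bcp_errors.log',  # rejected rows / data errors are written here
    '-S', 'mysqlserver.com', '-U', 'myusername', '-P', 'mypassword',
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)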
This answer suggests using AWS Data Pipeline, but I'm wondering if there's a clever way to do it with my own machine and Python.
I've been using the psycopg2, boto3 and pandas libraries. The tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.
I guess I can show one of my own versions here as well, based on copy_expert in psycopg2:
import io

import boto3
import psycopg2

resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)  # db, user, pw, host defined elsewhere
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # COPY ... TO STDOUT streams the query result into the in-memory buffer,
    # which is then uploaded to S3 without touching the local disk.
    query = f"""COPY {select_query} TO STDOUT \
    WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
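A hypothetical call; the bucket name is a placeholder, and the query is wrapped in parentheses so that COPY (SELECT ...) TO STDOUT is valid:
copyFun('my-bucket', '(select * from my_schema.example)', 'example')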
We do the following in our case; performance-wise it's pretty fast, and it is a scheduled approach rather than continuous streaming. I'm not 100% sure whether it's the wisest method, but it is definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into a data warehouse.
Using a shell script, we fire a psql command to copy data to a local file on the EC2 app instance:
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv
Then, using the shell script, we fire an s3cmd command to put example.csv into the designated S3 bucket location:
s3cmd put example.csv s3://your-bucket/path/to/file/
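If you would rather drive the same two steps from Python (as the question asks), a minimal subprocess wrapper could look like the sketch below; the connection options, query, file name and bucket path are placeholders:
import subprocess

# Step 1: export the query result to a local CSV file via psql.
with open('example.csv', 'w') as out:
    subprocess.run(
        ['psql', '-h', 'dbhost', '-U', 'user', 'dbname',
         '-F', ',', '-A', '-c', 'select * from my_schema.example'],
        stdout=out, check=True)

# Step 2: push the CSV to S3 with s3cmd.
subprocess.run(['s3cmd', 'put', 'example.csv', 's3://your-bucket/path/to/file/'],
               check=True)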
This is an old question, but it comes up when searching for "aws_s3.export_query_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.
This can be done natively with a Postgres extension if you're using AWS Aurora PostgreSQL 11.6 or above, via aws_s3.query_export_to_s3.
Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
See here for the function reference:
https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions
This has been available in Aurora for PostgreSQL since 3.1.0, which was released on February 11, 2020 (I don't know why this URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310
However, I would not recommend using 3.1.0/11.6; there is a bug that causes data corruption issues after 10 MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494
I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so anyone who wants a way to dump data from Postgres to S3, and is on AWS, should give this a try!
Here's an example query to create JSONL for you.
JSONL is JSON with a single JSON object per line: https://jsonlines.org/
So you can dump a whole table to a JSONL file, for example. You could also use json_agg in Postgres and dump a single JSON file with the objects in an array; it's up to you, really. Just change the query and the file extension, and leave it as text format.
select * from aws_s3.query_export_to_s3(
    'select row_to_json(data) from (<YOUR QUERY HERE>) data',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.jsonl',
        'us-east-1'),
    options := 'format text');
For CSV, something like this should do the trick:
select * from aws_s3.query_export_to_s3(
    '<YOUR QUERY HERE>',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.csv',
        'us-east-1'),
    options := 'format csv');
I'm developing in a Python environment and I want to call an SQL query using psycopg2.
Let's say I have the following UNLOAD command in an .sql file:
UNLOAD
(
'
Some SQL Query
'
)
TO 's3://%PATH%'
...
In the .sql file the %PATH% should be explicit, like 'folder1/folder3/file_name'.
But I want the Python program to set this %PATH% at runtime; that is, the .sql file contains the placeholder %PATH%, and it is filled in only at runtime.
Any idea how to do this?
Implementing it this way will give you a tough time.
The best way is to dump the file to a static location:
UNLOAD
(
'
Some SQL Query
'
)
TO 's3://path/to/static/s3_bucket'
...
and then use (via a shell script, or a suitable command in any other scripting language):
aws s3 mv $source $destination
Here, you may pass any value for $destination, which can easily be populated at run time.
In short, you dump the file to a fixed location in S3 (using UNLOAD) and then move it to a location of your choice, or one populated at run time (using aws s3 mv ...).
You can simply specify a replacement field in your SQL file and then use str.format.
Create your file like this:
UNLOAD ('Some SQL Query')
TO 's3://{bucket}/{key}'
And use this file in Python like this:
template = open('file1.sql', 'r').read()
query = template.format(bucket='mybucket', key='folder/file.csv')
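And a hedged sketch of running the formatted query with psycopg2; the connection parameters are placeholders:
import psycopg2

conn = psycopg2.connect(host='myhost', dbname='mydb', user='myuser', password='mypassword')

with open('file1.sql') as f:
    template = f.read()
query = template.format(bucket='mybucket', key='folder/file.csv')

with conn.cursor() as cur:
    cur.execute(query)  # runs the UNLOAD with the path filled in at runtime
conn.commit()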
You would not be able to set up the UNLOAD path dynamically at runtime; however, you could put your SQL statement in something like a shell/Python script where you can create variables with the path you'd like and then pass them into the query.
This UNLOAD utility by AWS will get you started if you decide to go with a Python script.
I need to load data from several source data sources into a Postgres database.
To do this, I first write the data to a temporary CSV file and then load the data from the CSV file into the Postgres database using a COPY FROM query. I do all of this in Python.
The code looks like this:
import psycopg2

table_name = 'products'
temp_file = "'C:\\Users\\username\\tempfile.csv'"
db_conn = psycopg2.connect(host=host, port=port, user=user, password=password, dbname=database)
cursor = db_conn.cursor()
query = """COPY """ + table_name + """ FROM """ + temp_file + """ WITH NULL AS ''; """
cursor.execute(query)
I want to avoid the step of writing to an intermediate file. Instead, I would like to write to a Python object and then load the data into the Postgres database using the COPY FROM file method.
I am aware of the technique of using psycopg2's copy_from method, which copies data from a StringIO object into the Postgres database. However, I cannot use psycopg2 for a reason, and hence I don't want my COPY FROM task to depend on that library. I want it to be a Postgres query that can be run by any other Postgres driver as well.
Please advise a better way of doing this without writing to an intermediate file.
You could call the psql command-line tool from your script (e.g. using subprocess), leverage its \copy command, and pipe the output of one instance into the input of another, avoiding a temp file, i.e.:
psql -X -h from_host -U user -c "\copy from_table to stdout" | psql -X -h to_host -U user -c "\copy to_table from stdin"
This assumes the table exists in the destination database. If not, a separate command would first need to create it.
Also, note that one caveat of this method is that errors from the first psql call can get swallowed by the piping process. A Python sketch of this pipeline is below.
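A minimal Python sketch of that pipeline using subprocess; the hosts, user and table names are placeholders:
import subprocess

# Export from the source database and stream straight into the destination,
# with no temporary file in between.
src = subprocess.Popen(
    ['psql', '-X', '-h', 'from_host', '-U', 'user', '-c', r'\copy from_table to stdout'],
    stdout=subprocess.PIPE)
dst = subprocess.Popen(
    ['psql', '-X', '-h', 'to_host', '-U', 'user', '-c', r'\copy to_table from stdin'],
    stdin=src.stdout)
src.stdout.close()  # let src receive SIGPIPE if dst exits early
dst.communicate()   # wait for the import side to finish
src.wait()          # then reap the export side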
psycopg2 has integrated support for the COPY wire-protocol, allowing you to use COPY ... FROM STDIN / COPY ... TO STDOUT.
See Using COPY TO and COPY FROM in the psycopg2 docs.
Since you say you can't use psycopg2, you're out of luck. Drivers must understand COPY TO STDOUT / COPY FROM STDIN in order to use them, or must provide a way to write raw data to the socket so you can hijack the driver's network socket and implement the COPY protocol yourself. Driver-specific code is absolutely required for this; it is not possible to simply use the DB-API.
So khampson's suggestion, while usually a really bad idea, seems to be your only alternative.
(I'm posting this mostly to make sure that other people who find this answer who don't have restrictions against using psycopg2 do the sane thing.)
If you must use psql, please:
Use the subprocess module with the Popen constructor
Pass -qAtX and -v ON_ERROR_STOP=1 to psql to get sane behaviour for batching.
Use the argument-list form, e.g. ['psql', '-v', 'ON_ERROR_STOP=1', '-qAtX', '-c', '\copy mytable from stdin'], rather than invoking a shell.
Write to psql's stdin, then close it, and wait for psql to finish.
Remember to trap exceptions thrown on command failure. Let subprocess capture stderr and wrap it in the exception object.
It's safer, cleaner, and easier to get right than the old-style os.popen2 etc.
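Putting those points together, a minimal sketch (the host, database, table and sample rows are placeholders):
import subprocess

csv_data = b"1,foo\n2,bar\n"  # placeholder rows to load

proc = subprocess.Popen(
    ['psql', '-v', 'ON_ERROR_STOP=1', '-qAtX',
     '-h', 'dbhost', '-U', 'user', 'dbname',
     '-c', r'\copy mytable from stdin with (format csv)'],
    stdin=subprocess.PIPE,
    stderr=subprocess.PIPE)
_, err = proc.communicate(csv_data)  # write to stdin, close it, and wait for psql
if proc.returncode != 0:
    raise RuntimeError('psql \\copy failed: ' + err.decode())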