Parse large XML file w/ script or use BioPython API? - python

Hey guys, this is my first question on here. I'm trying to make a local copy of the UniProtKB in SQL.
The UniProtKB is 2.1 GB, and it comes in XML and a special text format used by SwissProt.
Here are my options:
1) Use a SAX parser on the XML - I chose Ruby and Nokogiri. I started writing the parser, but my initial reaction was: how would I map the XML schema to the SAX parser?
2) BioPython - I already have BioSQL/Biopython installed, which literally created my SQL schema for me, and I was able to successfully insert one SwissProt/Uniprot txt file into the database.
I'm running it right now (crosses fingers) on the entire 2.1 GB. Here is the code I'm running:
from Bio import SeqIO
from BioSQL import BioSeqDatabase
from Bio import SwissProt

# Connect to the existing BioSQL schema in MySQL and select the "uniprot" namespace
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd="", host="localhost", db="bioseqdb")
db = server["uniprot"]

# Stream the SwissProt flat file, load every record, then commit once at the end
iterator = SeqIO.parse(open("/path/to/uniprot_sprot.dat", "r"), "swiss")
db.load(iterator)
server.commit()
Edit: it's now crashing because the transactions are getting locked (since the tables are InnoDB). The error is:
Error Number: 1205
Lock wait timeout exceeded; try restarting transaction
I'm using MySQL version 5.1.43.
Should I switch my database to PostgreSQL?

Switched to PostgreSQL for convenience. Some of the issues were resolved by downloading the NCBI taxonomy information (which I did not know was necessary; it could have been made clearer in the documentation), and I ended up using the SwissProt parser from Biopython because it fits so nicely with BioSQL.
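For anyone who wants to stay on MySQL/InnoDB instead: the lock wait timeout is likely related to one very long-running transaction loading the entire 2.1 GB file, so one thing worth trying is to keep transactions short by committing in batches. A minimal sketch, assuming the same BioSQL setup as in the question (the batch size of 1000 is arbitrary):
from itertools import islice
from Bio import SeqIO
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd="",
                                      host="localhost", db="bioseqdb")
db = server["uniprot"]

records = SeqIO.parse("/path/to/uniprot_sprot.dat", "swiss")
while True:
    batch = list(islice(records, 1000))   # pull the next 1000 records off the iterator
    if not batch:
        break
    db.load(batch)                        # load just this batch
    server.commit()                       # commit so no single transaction runs for hours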


A way to export psql table (or query) directly to AWS S3 as file (csv, json)

This answer suggests using AWS Data Pipeline but I'm wondering if there's a clever way to do it with my own machine and Python.
I've been using the psycopg2, boto3 and pandas libraries. Tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.
I guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2:
import io
import psycopg2
import boto3

resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # COPY ... TO STDOUT streams the result back to the client;
    # select_query should be a table name or a parenthesised SELECT statement
    query = f"""COPY {select_query} TO STDOUT \
             WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    # upload the buffered CSV straight to S3
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
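A hypothetical call (the bucket name, query and filename below are placeholders) would then be:
copyFun('my-example-bucket', '(select * from my_schema.example)', 'example')
Note that the whole result set is buffered in the StringIO object, so for tables with millions of rows it may be worth streaming to a temporary file, or using a multipart upload, instead of holding everything in memory.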
We do the following in our case. Performance-wise it's pretty fast, and it's a scheduled method rather than continuous streaming. I'm not 100% sure it's the wisest method, but it's definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into a data warehouse.
Using a shell script, we fire a psql command to copy data to a local file on the EC2 app instance.
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv
Then, using the same shell script, we fire an s3cmd command to put example.csv into the designated S3 bucket location.
s3cmd put example.csv s3://your-bucket/path/to/file/
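If you would rather drive those same two steps from Python instead of a shell script, a rough sketch with subprocess (host, user, database, bucket and paths are all placeholders) could look like this:
import subprocess

# Dump the query result to a local CSV using psql's unaligned, comma-separated output
with open('example.csv', 'w') as out:
    subprocess.run(
        ['psql', '-h', 'myhost', '-U', 'myuser', '-d', 'mydb',
         '-F', ',', '-A', '-c', 'select * from my_schema.example'],
        stdout=out, check=True)

# Push the file to S3 with s3cmd (boto3's upload_file would work just as well)
subprocess.run(['s3cmd', 'put', 'example.csv', 's3://your-bucket/path/to/file/'],
               check=True)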
This is an old question, but it comes up when searching for "aws_s3.query_export_to_s3", even though there is no mention of it here, so I thought I'd throw another answer out there.
This can be done natively with a Postgres extension if you're using AWS Aurora PostgreSQL 11.6 or above, via aws_s3.query_export_to_s3.
Exporting data from an Aurora PostgreSQL DB cluster to Amazon S3
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html
See here for the function reference:
https://docs.amazonaws.cn/en_us/AmazonRDS/latest/AuroraUserGuide/postgresql-s3-export.html#postgresql-s3-export-functions
This has been available since Aurora PostgreSQL 3.1.0, which was released on February 11, 2020 (I don't know why the URL says 2018): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Updates.20180305.html#AuroraPostgreSQL.Updates.20180305.310
I would not recommend using 3.1.0/11.6, however, as there is a bug that causes data corruption after 10 MB of data is exported to S3: https://forums.aws.amazon.com/thread.jspa?messageID=962494
I just tested with 3.3.1, from September 17, 2020, and the issue isn't present, so anyone who wants a way to dump data from Postgres to S3 and is on AWS: give this a try!
Here's an example query to create JSONL for you.
JSONL is JSON, with a single JSON object per line: https://jsonlines.org/
So you can dump a whole table to a JSONL file, for example. You could also use json_agg in Postgres and dump a single JSON file with the objects in an array; it's up to you, really. Just change the query and the file extension, and leave the format as text.
select * from aws_s3.query_export_to_s3(
    'select row_to_json(data) from (<YOUR QUERY HERE>) data',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.jsonl',
        'us-east-1'),
    options := 'format text');
For CSV, something like this should do the trick:
select * from aws_s3.query_export_to_s3(
    '<YOUR QUERY HERE>',
    aws_commons.create_s3_uri(
        'example-bucket/some/path',
        'whatever.csv',
        'us-east-1'),
    options := 'format csv');

Can't connect to Superset using Athena database

I am new to Superset.
I am going to Sources > Databases to add a new connection to my Athena.
I have downloaded the JDBC driver and am writing the following connection line as the SQLAlchemy URI (the first parameter is the access key and the second the secret key, modified a bit for privacy):
awsathena+jdbc://AKIAJ2PKWTZYAPBYKRMQ:xxxxxxxxxxxxxxx#athena.us-east-1.amazonaws.com:443/default?s3_staging_dir='s3://aws-athena-query-results-831083831535-us-east-1/'
I am getting the error:
ERROR: {"error": "Connection failed!\n\nThe error message returned was:\nCan't load plugin: sqlalchemy.dialects:awsathena.jdbc"}
I really wish to explore open-source visualisation of my databases using Superset.
As per Superset documentation, you need to escape/encode at least the s3_staging_dir, i.e.,
s3://... -> s3%3A//...
Have you followed that step?
Make sure you have done pip install "PyAthenaJDBC>1.0.9" in the same Python environment from which you start Superset, then try restarting Superset in that environment.
In my case the problem was with special characters in the aws_secret_access_key and s3_staging_dir. I solved it by putting the output of quote_plus into the URI. No quotes were required.
from urllib.parse import quote_plus
secretkey = quote_plus(aws_secret_access_key)
loc = quote_plus(s3_staging_dir)
Further, make sure the schema_name (i.e. database name) already exists in the s3 path. Hope it helps!
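For reference, a sketch of how those encoded values might be plugged into the SQLAlchemy URI (the awsathena+jdbc scheme is taken from the question; the credentials and staging dir below are placeholders):
from urllib.parse import quote_plus

aws_access_key_id = 'YOUR_ACCESS_KEY_ID'          # placeholder
aws_secret_access_key = 'YOUR_SECRET_ACCESS_KEY'  # placeholder
s3_staging_dir = 's3://aws-athena-query-results-example-us-east-1/'  # placeholder

uri = (
    f"awsathena+jdbc://{aws_access_key_id}:{quote_plus(aws_secret_access_key)}"
    f"@athena.us-east-1.amazonaws.com:443/default"
    f"?s3_staging_dir={quote_plus(s3_staging_dir)}"
)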

passing information from one script to another

I have two python scripts, scriptA and scriptB, which run on Unix systems. scriptA takes 20s to run and generates a number X. scriptB needs X when it is run and takes around 500ms. I need to run scriptB every day but scriptA only once every month. So I don't want to run scriptA from scriptB. I also don't want to manually edit scriptB each time I run scriptA. I thought of updating a file through scriptA, but I'm not sure where such a file can ideally be placed so that scriptB can read it later, independent of the location of these two scripts. What is the best way of storing this value X on a Unix system so that it can be used later by scriptB?
Many programs on Linux/Unix keep their config in /etc/ and use a subfolder in /var/ for other files.
But you would probably need root privileges for that.
If you run the script from your home folder, then you could create the file ~/.scriptB.rc, or a folder ~/.scriptB/ or ~/.config/scriptB/.
See also the Filesystem Hierarchy Standard on Wikipedia.
It sounds like you want to serialize ScriptA's results, save it in a file or database somewhere, then have ScriptB read those results (possibly also modifying the file or updating the database entry to indicate that those results have now been processed).
To make that work you need for ScriptA and ScriptB to agree on the location and format of the data ... and you might want to implement some sort of locking to ensure that ScriptB doesn't end up with corrupted inputs if it happens to be run at the same time that ScriptA is writing or updating the data (and, conversely, that ScriptA doesn't corrupt the data store by writing thereto while ScriptB is accessing it).
Of course ScriptA and ScriptB could each have a filename or other data location hard-coded into their sources. However, that would violate the DRY principle. So you might want them to share a configuration file. (Of course the configuration filename is also repeated in these sources ... or at least the import of the common bit of configuration code ... but the latter still ensures that an installation/configuration detail (the location and, possibly, the format of the data store) is decoupled from the source code. Thus it can be changed, in the shared config, without affecting the rest of the code for either script.)
As for precisely which type of file and serialization to use ... that's a different question.
These days, as strange as it may sound, I'd suggest using SQLite3. It may seem like overkill to use an SQL "database" for simply storing a single value. However, SQLite3 is included in the Python standard library, and it only needs a filename for configuration.
You could also use a pickle or JSON or even YAML (which would require a third party module) ... or even just text or some binary representation using something like struct. However, any of those will require that you parse your results and deal with any parsing or formatting errors. JSON would be the simplest option among these alternatives. Additionally you'd have to do your own file locking and handling if you wanted ScriptA and ScriptB (and, potentially, any other scripts you ever write for manipulating this particular data) to be robust against any chance of concurrent operations.
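For comparison, a minimal sketch of the JSON variant (the file path is made up, and note that there is no locking here):
import json

# ScriptA: write the result
with open('/path/to/scriptA_result.json', 'w') as f:
    json.dump({'value': 42}, f)            # 42 stands in for the real X

# ScriptB: read it back
with open('/path/to/scriptA_result.json') as f:
    x = json.load(f)['value']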
The advantage of SQLite3 is that it handles the parsing and decoding and the locking and concurrency for you. You create the table once (perhaps embedded in ScriptA as a rarely used "--initdb" option for occasions when you need to recreate the data store). Your code to read it might look as simple as:
#!/usr/bin/python
import sqlite3

db = sqlite3.connect('./foo.db')
cur = db.cursor()
# fetch the most recently written value
results = cur.execute('SELECT value, MAX(date) FROM results').fetchone()[0]
... and writing a new value would look a bit like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('INSERT INTO results (value) VALUES (?)', (myvalue,))
All of this assuming you had, at some time, initialized the data store (foo.db in this example) with something like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('CREATE TABLE IF NOT EXISTS results (value INTEGER NOT NULL, date TIMESTAMP DEFAULT current_timestamp)')
(Actually you could just execute that command every time if you wanted your scripts to recover silently from someone cleaning out the old data.)
This might seem like more code than a JSON file-based approach. However, SQLite3 provides ACID (transactional) semantics as well as abstracting away the serialization and deserialization.
Also note that I'm glossing over a few details. My examples above are actually creating a whole table of results, with timestamps for when they were written to your datastore. These would accumulate over time and, if you were using this approach, you'd periodically want to clean up your "results" table with a command like:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('DELETE FROM results WHERE date < ?',
                cur.execute('SELECT MAX(date) FROM results').fetchone())
Alternatively, if you really never want to have access to your prior results, then change from INSERT to UPDATE like so:
#!/usr/bin/python
# (Same import, db= and cur= from above)
with db:
    cur.execute('UPDATE results SET value=(?)', (mynewvalue,))
(Also note that the (mynewvalue,) is a single element tuple. The DBAPI requires that our parameters be wrapped in tuples which is easy to forget when you first start using it with single parameters such as this).
Obviously if you took this UPDATE-only approach you could drop the 'date' column from the 'results' table and all those references to MAX(date) from the queries.
I chose to use the slightly more complex schema in my earlier examples because it allows your scripts to be a bit more robust with very little additional complexity. You could then do other error checking (detecting missing values where ScriptB finds that ScriptA hasn't been run as intended, for example).
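For instance, ScriptB's sanity check could be as simple as this sketch (reusing the date-stamped schema above):
# ScriptB: refuse to run if ScriptA has never populated the table
row = cur.execute('SELECT value, MAX(date) FROM results').fetchone()
if row is None or row[0] is None:
    raise SystemExit('No value found in results -- has ScriptA been run?')
value, written_at = row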
Edit/run crontab -e:
# this will run every month on the 25th at 2am
0 2 25 * * python /path/to/scriptA.py > /dev/null
# this will run every day at 2:10 am
10 2 * * * python /path/to/scriptB.py > /dev/null
Create an external file for both scripts:
In scriptA:
>>> with open('/path/to/test_doc','w+') as f:
... f.write('1')
...
In scriptB:
>>> with open('/path/to/test_doc','r') as f:
... v = f.read()
...
>>> v
'1'
You can take a look at PyPubSub.
It's a Python package which provides a publish-subscribe API that facilitates event-based programming.
It'll give you an OS-independent solution to your problem and only requires a few additional lines of code in both A and B.
Also you don't need to handle messy files!
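A minimal sketch of the PyPubSub API is below; note that messages are delivered within a single Python process, so this fits best when both pieces run inside one long-lived program rather than as two separately scheduled scripts:
from pubsub import pub

# "ScriptB" side: react whenever a new X is published
def on_new_x(value):
    print('received X =', value)

pub.subscribe(on_new_x, 'scriptA.result')

# "ScriptA" side: publish the freshly computed X
pub.sendMessage('scriptA.result', value=42)   # 42 stands in for the real X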
Assuming you are not running the two scripts at the same time, you can (pickle and) save the go-between object anywhere, as long as when you load and save the file you point to the same system path. For example:
import pickle # or import cPickle as pickle
# Create a python object like a dictionary, list, etc.
favorite_color = { "lion": "yellow", "kitty": "red" }
# Write to file ScriptA
f_myfile = open('C:\\My Documents\\My Favorite Folder\\myfile.pickle', 'wb')
pickle.dump(favorite_color, f_myfile)
f_myfile.close()
# Read from file ScriptB
f_myfile = open('C:\\My Documents\\My Favorite Folder\\myfile.pickle', 'rb')
favorite_color = pickle.load(f_myfile) # variables come out in the order you put them in
f_myfile.close()

Export & Map CSV output to MySQL table using Python

I have a multiple-client to single-server bidirectional iperf setup for network monitoring. The iperf server runs well and displays output in CSV format based on the cron jobs written on the client end.
I wish to write a Python script to automate the process of mapping these CSV outputs to a MySQL database, which in turn would be updated and saved at regular intervals without the need for human intervention.
I am using an Ubuntu 13.10 machine as the iperf server. Following is a sample CSV output that I get. This is not being stored to a file, just displayed on screen.
s1:~$ iperf -s -y C
20140422105054,172.16.10.76,41065,172.16.10.65,5001,6,0.0-20.0,73138176,29215083
20140422105054,172.16.10.76,5001,172.16.10.65,56254,4,0.0-20.0,46350336,18502933
20140422105100,172.16.10.76,54550,172.16.10.50,5001,8,0.0-20.0,67895296,27129408
20140422105100,172.16.10.76,5001,172.16.10.50,58447,5,0.0-20.1,50937856,20292796
20140422105553,172.16.10.76,5001,172.16.10.65,47382,7,0.0-20.1,51118080,20358083
20140422105553,172.16.10.76,41067,172.16.10.65,5001,5,0.0-20.1,76677120,30524007
20140422105600,172.16.10.76,5001,172.16.10.50,40734,4,0.0-20.0,57606144,23001066
20140422105600,172.16.10.76,54552,172.16.10.50,5001,8,0.0-20.0,70123520,28019115
20140422110053,172.16.10.76,41070,172.16.10.65,5001,5,0.0-20.1,63438848,25284066
20140422110053,172.16.10.76,5001,172.16.10.65,46462,6,0.0-20.1,11321344,4497094
The fields I want to map them to are: timestamp, server_ip, server_port, client_ip, client_port, tag_id, interval, transferred, bandwidth
I want to map this CSV output periodically to a MySQL database, for which I understand I would have to write a Python script (run from a cron job) that queries the output and stores it in the MySQL database. I am a beginner at Python scripting and database queries.
I went through another discussion on Server Fault at https://serverfault.com/questions/566737/iperf-csv-output-format and would like to build my query based on this.
Generate SQL script, then run it
If you do not want to use complex solutions like SQLAlchemy, the following approach is possible:
having your CSV data, convert it into an SQL script
use the mysql command-line tool to run this script
Before you do it the first time, be sure to create the needed table structure in the database (this I leave to you).
The following sample uses (just for my convenience) the docopt package, so you need to install it:
$ pip install docopt
CSV to SQL script conversion utility
csv2sql.py:
"""
Usage:
csv2sql.py [--table <tablename>] <csvfile>
Options:
--table <tablename> Name of table in database to import into [default: mytable]
Convert csv file with iperf data into sql script for importing
those data into MySQL database.
"""
from csv import DictReader
from docopt import docopt
if __name__ == "__main__":
args = docopt(__doc__)
fname = args["<csvfile>"]
tablename = args["--table"]
headers = ["timestamp",
"server_ip",
"server_port",
"client_ip",
"client_port",
"tag_id",
"interval",
"transferred",
"bandwidth"
]
sql = """insert into {tablename}
values ({timestamp},"{server_ip}",{server_port},"{client_ip}",{client_port},{tag_id},"{interval}",{transferred},{bandwidth});"""
with open(fname) as f:
reader = DictReader(f, headers, delimiter=",")
for rec in reader:
print(sql.format(tablename=tablename, **rec)) # python <= 2.6 will fail here
Convert CSV to SQL script
First, let the conversion utility introduce itself:
$ python csv2sql.py -h
Usage:
    csv2sql.py [--table <tablename>] <csvfile>

Options:
    --table <tablename>  Name of table in database to import into [default: mytable]

Convert csv file with iperf data into sql script for importing
those data into MySQL database.
Having your data in file data.csv:
$ python csv2sql.py data.csv
insert into mytable
values (20140422105054,"172.16.10.76",41065,"172.16.10.65",5001,6,"0.0-20.0",73138176,29215083);
insert into mytable
values (20140422105054,"172.16.10.76",5001,"172.16.10.65",56254,4,"0.0-20.0",46350336,18502933);
insert into mytable
values (20140422105100,"172.16.10.76",54550,"172.16.10.50",5001,8,"0.0-20.0",67895296,27129408);
insert into mytable
values (20140422105100,"172.16.10.76",5001,"172.16.10.50",58447,5,"0.0-20.1",50937856,20292796);
insert into mytable
values (20140422105553,"172.16.10.76",5001,"172.16.10.65",47382,7,"0.0-20.1",51118080,20358083);
insert into mytable
values (20140422105553,"172.16.10.76",41067,"172.16.10.65",5001,5,"0.0-20.1",76677120,30524007);
insert into mytable
values (20140422105600,"172.16.10.76",5001,"172.16.10.50",40734,4,"0.0-20.0",57606144,23001066);
insert into mytable
values (20140422105600,"172.16.10.76",54552,"172.16.10.50",5001,8,"0.0-20.0",70123520,28019115);
insert into mytable
values (20140422110053,"172.16.10.76",41070,"172.16.10.65",5001,5,"0.0-20.1",63438848,25284066);
insert into mytable
values (20140422110053,"172.16.10.76",5001,"172.16.10.65",46462,6,"0.0-20.1",11321344,4497094);
Put it all into file data.sql:
$ python csv2sql.py data.csv > data.sql
Apply data.sql to your MySQL database
And finally, use the mysql command (provided by MySQL) to import the data into the database:
$ mysql --user username --password=password db_name < data.sql
If you plan on using Python, then I would recommend using SQLAlchemy.
The general approach is:
define a class which has all the attributes you want to store
map all the properties of the class to database columns and types
read your data from the CSV (using e.g. the csv module), for each row create a corresponding instance of the class prepared before, and let it be stored
The SQLAlchemy documentation provides more details and instructions; your requirement seems rather easy. A rough sketch of this approach follows below.
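In this sketch, the table name, column types and connection string are all assumptions based on the CSV fields in the question:
import csv
from sqlalchemy import create_engine, Column, Integer, String, BigInteger
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class IperfResult(Base):                       # hypothetical model
    __tablename__ = 'iperf_results'
    id = Column(Integer, primary_key=True)
    timestamp = Column(String(14))
    server_ip = Column(String(15))
    server_port = Column(Integer)
    client_ip = Column(String(15))
    client_port = Column(Integer)
    tag_id = Column(Integer)
    interval = Column(String(16))
    transferred = Column(BigInteger)
    bandwidth = Column(BigInteger)

engine = create_engine('mysql://user:password@localhost/iperfdb')   # placeholder DSN
Base.metadata.create_all(engine)

fields = ['timestamp', 'server_ip', 'server_port', 'client_ip', 'client_port',
          'tag_id', 'interval', 'transferred', 'bandwidth']

with Session(engine) as session, open('data.csv') as f:
    for row in csv.DictReader(f, fieldnames=fields):
        session.add(IperfResult(**row))
    session.commit()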
Another option is to find an existing CSV import tool; some are already available with MySQL, and there are plenty of others too.
This probably is not the kind of answer you are looking for, but if you learn a little sqlite3 (a native Python module - "import sqlite3") by doing a basic tutorial online, you will realize that your problem is not at all difficult to solve. Then just use a standard timer, such as time.sleep() to repeat the procedure.
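In that spirit, a bare-bones sketch (the database file, table name, CSV path and polling interval are all made up, and a real script would need to track which lines have already been loaded):
import csv
import sqlite3
import time

conn = sqlite3.connect('iperf.db')     # hypothetical local database file
conn.execute('''CREATE TABLE IF NOT EXISTS iperf_results (
                    timestamp TEXT, server_ip TEXT, server_port INTEGER,
                    client_ip TEXT, client_port INTEGER, tag_id INTEGER,
                    interval TEXT, transferred INTEGER, bandwidth INTEGER)''')

while True:
    # assumes the iperf CSV output has been redirected to results.csv
    with open('results.csv') as f:
        rows = [r for r in csv.reader(f) if r]
    with conn:
        conn.executemany('INSERT INTO iperf_results VALUES (?,?,?,?,?,?,?,?,?)', rows)
    time.sleep(300)                    # repeat every five minutes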

Python subprocess calling bcp on .csv: 'unexpected eof'

I'm having an EOF issue when trying to bcp a .csv file I generated with Python's csv.writer. I've done lots of googling with no luck, so I turn to you helpful folks on SO.
Here's the error message (which is triggered on the subprocess.call() line):
Starting copy...
Unexpected EOF encountered in BCP data-file.
bcp copy in failed
Here's the code:
sel_str = 'select blahblahblah...'
result = engine.execute(sel_str)  # engine is a SQLAlchemy engine instance

# write to disk temporarily to be able to bcp the results to the db temp table
with open('tempscratch.csv', 'wb') as temp_bcp_file:
    csvw = csv.writer(temp_bcp_file)
    for r in result:
        csvw.writerow(r)
    temp_bcp_file.flush()

# upload the temp scratch file
bcp_string = 'bcp tempdb..collection in #INFILE -c -U username -P password -S DSN'
bcp_string = string.replace(bcp_string, '#INFILE', 'tempscratch.csv')
result_code = subprocess.call(bcp_string, shell=True)
I looked at the tempscratch.csv file in a text editor and didn't see any weird EOF or other control characters. Moreover, I looked at other .csv files for comparison, and there doesn't seem to be a standardized EOF that bcp is looking for.
Also, yes this is hacky, pulling down a result set, writing it to disk and then reuploading it to the db with bcp. I have to do this because SQLAlchemy does not support multi-line statements (aka DDL and DML) in the same execute() command. Further, this connection is with a Sybase db, which does not support SQLAlchemy's wonderful ORM :( (which is why I'm using execute() in the first place)
From what I can tell, the bcp default field delimiter is the tab character '\t' while Python's csv writer defaults to the comma. Try this...
# write to disk temporarily to be able to bcp the results to the db temp table
with open('tempscratch.csv', 'wb') as temp_bcp_file:
    csvw = csv.writer(temp_bcp_file, delimiter='\t')
    for r in result:
        csvw.writerow(r)
    temp_bcp_file.flush()
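Alternatively, rather than rewriting the file, it may be possible to tell bcp to expect commas instead: most bcp implementations take a -t option for the field terminator (check your version's documentation, since the Sybase and Microsoft tools differ slightly), which in the question's code would look roughly like:
# same call as before, but with "-t ," telling bcp the fields are comma-separated
bcp_string = 'bcp tempdb..collection in #INFILE -c -t , -U username -P password -S DSN'
bcp_string = bcp_string.replace('#INFILE', 'tempscratch.csv')
result_code = subprocess.call(bcp_string, shell=True)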
