How to copy data in airflow - python

I am using Apache Airflow in my project. In this project, users can connect their database to our project and copy their tables to our database.
I am able to establish a connection using the following lines:
import json
from airflow.models.connection import Connection
from airflow.providers.mysql.hooks.mysql import MySqlHook

c = Connection(
    conn_id='some_conn',
    conn_type='mysql',
    description='connection description',
    host='myhost.com',
    login='myname',
    schema='myschema',
    password='mypassword',
    extra=json.dumps(dict(this_param='some val', that_param='other val*')),
)
print(f"AIRFLOW_CONN_{c.conn_id.upper()}='{c.get_uri()}'")
hook = MySqlHook(c.conn_id)
result = hook.get_records(f"SELECT table_name FROM information_schema.tables WHERE table_schema = '{c.schema}';")
Now I am able to get the table names from the connected database.
How do I copy data from this connected database to our database? Please give me some hints on this.

This depends on what databases you want to copy data between.
A straightforward approach can be outlined in the following steps.
Grab the records from Database A.
Insert the records into Database B.
You would create a custom operator that performs those steps in order. There might even be existing operators that fulfill these functions, so I would advise you to take a look at the Airflow GitHub repository first.
Please note that this approach is not suitable for large datasets because the data is stored in memory during the task execution. You can also write to disk but that route then depends on the machine that the Airflow worker runs on.
If the databases live on the same cluster/server, then a simple SQL script would work. A HiveOperator, for example, would be sufficient to move data with some INSERT INTO SQL commands.
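For the record-based approach, here is a minimal sketch of the two steps, assuming both databases are MySQL and are registered as Airflow connections under the hypothetical conn_ids 'source_mysql' and 'target_mysql':

from airflow.providers.mysql.hooks.mysql import MySqlHook

def copy_table(table_name):
    # Step 1: grab the records from Database A.
    source_hook = MySqlHook(mysql_conn_id='source_mysql')  # hypothetical conn_id
    rows = source_hook.get_records(f'SELECT * FROM {table_name}')

    # Step 2: insert the records into Database B.
    # Everything is held in memory, so this only suits small tables.
    target_hook = MySqlHook(mysql_conn_id='target_mysql')  # hypothetical conn_id
    target_hook.insert_rows(table=table_name, rows=rows)

Such a function could then be wrapped in a PythonOperator (or a custom operator) inside your DAG.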

Related

How to persist an inmemory monetdbe db to local disk

I am using an embedded MonetDB database in Python via monetdbe.
I can see how to create a new connection with the :memory: setting,
but I can't see a way to persist the created database and tables for use later.
Once an in-memory session ends, all data is lost.
So I have two questions:
Is there a way to persist an in-memory db to local disk?
and
Once an in-memory db has been saved to local disk, is it possible to load the db back into memory at a later point to allow fast data analytics? At the moment it looks like, if I create a connection from a file location, my queries read from local disk rather than from memory.
It is admittedly a little bit hidden away, but you can check out the following code snippet from the movies.py example in the monetdbe-examples repository:
import monetdbe

database = '/tmp/movies.mdbe'
with monetdbe.connect(database) as conn:
    conn.set_autocommit(True)
    conn.execute(
        """CREATE TABLE Movies
        (id SERIAL, title TEXT NOT NULL, "year" INTEGER NOT NULL)""")
So in this example the single argument to connect is just the desired path to your database directory. This is how you can (re)start a database that stores its data in a persistent way on a file system.
Notice that I have intentionally removed the Python lines from the example in the actual repo that start with the comment # Removes the database if it already exists, just to make the example in this answer persistent.
I haven't run the code, but I expect that if you run it twice consecutively, the second run will return a database error on the execute statement, as the Movies table should already be there.
And just to be sure, don't use the /tmp directory if you want your data to persist between restarts of your computer.
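To address the second question, reopening the same directory later should give you back the persisted tables. A minimal sketch, assuming the Movies table from the snippet above was created earlier and that the monetdbe cursor follows the usual DB-API pattern:

import monetdbe

# Reconnect to the directory that already holds the persisted database.
with monetdbe.connect('/tmp/movies.mdbe') as conn:
    cur = conn.cursor()
    cur.execute('SELECT id, title, "year" FROM Movies')
    print(cur.fetchall())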

pg_dump and pg_restore between different servers with a selection criteria on the data to be dumped

I am currently trying to use pg_dump and pg_restore to dump selected rows from a production server to a testing server. The goal is to have a testing server and database that contains the selected subset of data; moreover, through a Python script, I want the ability to restore the database to that original subset after testing and potentially modifying the contents of the database.
From my understanding of pg_dump and pg_restore, the databases that they interact with must have the same dbname. Moreover, a selection criterion should be applied with the COPY command. Hence, my idea is to have two databases on my production server, one with the large set of data and one with the selected set, then name the smaller db 'test' and restore it to the 'test' db on the test server.
Is there a better way to do this, considering I don't want to keep the secondary db on my production server and will potentially need to make changes to the selected subset in the future?
From my understanding of pg_dump and pg_restore, the databases that they interact with must be of the same dbname.
The databases being worked with only have to have the same name if you are using --create. Otherwise, each program operates in whatever database was specified when it was invoked, which can be different.
The rest of your question is too vague to be addressable. Maybe pg_dump/pg_restore are the wrong tools for this, and just using COPY...TO and COPY...FROM would be more suitable.
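A minimal sketch of the COPY...TO / COPY...FROM route from Python, assuming psycopg2, hypothetical connection details, and a hypothetical 'orders' table; the selection criterion goes into the inner SELECT:

import io
import psycopg2

# Placeholder connection strings; adjust to your servers.
src = psycopg2.connect("host=prod dbname=proddb user=me")
dst = psycopg2.connect("host=test dbname=testdb user=me")

buf = io.StringIO()
with src.cursor() as cur:
    # Dump only the rows matching the selection criterion.
    cur.copy_expert(
        "COPY (SELECT * FROM orders WHERE created_at > now() - interval '7 days') TO STDOUT WITH CSV",
        buf)

buf.seek(0)
with dst.cursor() as cur:
    cur.copy_expert("COPY orders FROM STDIN WITH CSV", buf)
dst.commit()

src.close()
dst.close()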

HBase and Integration Testing

I have a Spark project which uses HBase as its key/value store. As a team, we've started implementing better CI/CD practices, and I am writing a Python client to run integration tests against a self-contained AWS environment.
While I am able to easily submit our Spark jobs and run them as EMR steps, I haven't found a good way to interact with HBase from Python. My goal is to be able to run our code against sample HDFS data and then verify in HBase that I am getting the results I expected. Can anyone suggest a good way to do this?
Additionally, my test sets are very small. I'd also be happy if I could simply read the entire HBase table into memory and check it that way. Would appreciate the community's input.
Here's a simple way to read HBase data from Python using the HappyBase API and the HBase Thrift server.
To start the Thrift server on the HBase server:
/YOUR_HBASE_BIN_DIR/hbase-daemon.sh start thrift
Then from Python:
import happybase

HOST = 'Hbase server host name here'
TABLE_NAME = 'MyTable'
ROW_PREFIX = 'MyPrefix'
COL_TXT = 'CI:BO'.encode('utf-8')   # column family CI, column name BO (Text)
COL_LONG = 'CI:BT'.encode('utf-8')  # column family CI, column name BT (Long)
conn = happybase.Connection(HOST)   # uses the default Thrift port 9090; provide a second arg for a non-default port
myTable = conn.table(TABLE_NAME)
for rowID, row in myTable.scan(row_prefix=ROW_PREFIX.encode('utf-8')):  # or leave empty for a full table scan
    colValTxt = row[COL_TXT].decode('utf-8')
    colValLong = int.from_bytes(row[COL_LONG], byteorder='big')
    print('Row ID: {}\tColumn Value: {}'.format(rowID, colValTxt))
print('All Done')
As discussed in the comment, this won't work if you try to pass things into Spark workers, as the above HBase connection is not serializable. So you can only run this type of code from the master program. If you figure out a way -- share!
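Since the test sets are small, a hedged sketch of reading the whole table into memory for assertions might look like this (host, table name, and expected contents are placeholders):

import happybase

conn = happybase.Connection('your-hbase-host')  # placeholder host
table = conn.table('MyTable')

# The test tables are small, so materialise the whole table in memory.
rows = {key: data for key, data in table.scan()}
conn.close()

# Compare against the rows the Spark job is expected to have written.
expected = {b'row1': {b'CI:BO': b'expected value'}}
assert rows == expected, 'unexpected HBase contents: {}'.format(rows)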

Why does writing from Spark to Vertica DB take longer than writing from Spark to MySQL?

Ultimately, I want to grab data from a Vertica DB into Spark, train a machine learning model, make predictions, and store these predictions into another Vertica DB.
Current issue is identifying the bottleneck in the last part of the flow: storing values in Vertica DB from Spark. It takes about 38 minutes to store 63k rows of data in a Vertica DB. In comparison, when I transfer that same data from Spark to MySQL database, it takes 10 seconds.
I don't know why the difference is so huge.
I have classes called VerticaContext and MySQLContext for Vertica and MySQL connections respectively. Both classes use the SQLContext to read entries using the jdbc format.
df = self._sqlContext.read.format('jdbc').options(url=self._jdbcURL, dbtable=subquery).load()
And write using jdbc.
df.write.jdbc(self._jdbcURL, table_name, save_mode)
There's no difference between the two classes aside from writing to a different target database. I'm confused as to why there's a huge difference in the time it takes to save tables. Is it because of the inherent difference in hardware between the two different databases?
I figured out an alternative solution. Before I dive in, I'll explain what I found and why I think saves to Vertica DB are slow.
The Vertica log (search for the file "vertica.log" on your Vertica machine) contains all the recent log entries related to reads from and writes to Vertica DBs. After running the write command, I found out that it essentially issues INSERT statements against the Vertica DB.
INSERT statements (without the "DIRECT" directive) are slow because they are written into the WOS (RAM) instead of the ROS (disk). I don't know the exact details as to why that's the case, but the writes were issuing individual INSERT statements.
Slow inserts are a known issue. I had trouble finding this information, but I finally found a few links that support that. I'm placing them here for posterity: http://www.vertica-forums.com/viewtopic.php?t=267, http://vertica-forums.com/viewtopic.php?t=124
My solution:
There is documentation that says the COPY command (with a "DIRECT" keyword) is the most efficient way of loading large amounts of data to the database. Since I was looking for a python solution, I used Uber's vertica-python package which allowed me to establish a connection with the Vertica DB and send Vertica commands to execute.
I want to exploit the efficiency of the COPY command, but the data lives somewhere outside of the Vertica cluster. I need to send the data from my Spark cluster to Vertica DB. Fortunately, there's a way to do that from HDFS (see here). I settled on converting the dataframe to a csv file and saving that on HDFS. Then I sent the COPY command to the Vertica DB to grab the file from HDFS.
My code is below (assuming I already have a variable that stores the PySpark dataframe; let's call it 'df'):
import vertica_python as VertPy

def create_copy_command(table_name, table_filepath):
    copy_command = ("COPY " + table_name +
                    " SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1" + table_filepath +
                    "', username='root') DELIMITER ',' DIRECT ABORT ON ERROR")
    return copy_command

# Save the dataframe as a CSV file on HDFS
df.toPandas().to_csv(hdfs_table_absolute_filepath, header=False, index=False)

conn_info = {
    'host': 'your-host-here',
    'port': 5433,  # Vertica default; replace if non-default
    'user': 'username',
    'password': 'password',
    'database': 'database'
}
conn = VertPy.connect(**conn_info)
cur = conn.cursor()
copy_command = create_copy_command(table_name, hdfs_table_relative_filepath)
cur.execute(copy_command)

Overwrite a database

I have an online database and connect to it by using MySQLdb.
db = MySQLdb.connect(......)
cur = db.cursor()
cur.execute("SELECT * FROM YOUR_TABLE_NAME")
data = cur.fetchall()
Now, I want to write the whole database to my localhost (overwrite). Is there any way to do this?
Thanks
If I'm reading you correctly, you have two database servers, A and B (where A is a remote server and B is running on your local machine) and you want to copy a database from server A to server B?
In all honesty, if this is a one-off, consider using the mysqldump command-line tool, either directly or by calling it from Python.
If not, the last answer on http://bytes.com/topic/python/answers/24635-dump-table-data-mysqldb details the SQL needed to define a procedure to output tables and data, though this may well miss subtleties that mysqldump does not.
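A minimal sketch of driving mysqldump from Python with subprocess, assuming placeholder hostnames, credentials, and database name; the dump from the remote server is piped into the local server, overwriting the existing tables:

import subprocess

# Dump the remote database (placeholder host, credentials, and db name).
dump = subprocess.run(
    ["mysqldump", "-h", "remote.example.com", "-u", "myname",
     "-pmypassword", "mydatabase"],
    check=True, capture_output=True)

# Load the dump into the local server, overwriting existing tables.
subprocess.run(
    ["mysql", "-h", "127.0.0.1", "-u", "root", "-plocalpassword", "mydatabase"],
    input=dump.stdout, check=True)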
