I have a Spark project which uses HBase as its key/value store. As a team, we've started implementing better CI/CD practices, and I am writing a Python client to run integration tests against a self-contained AWS environment.
While I am able to easily submit our Spark jobs and run them as EMR steps, I haven't found a good way to interact with HBase from Python. My goal is to run our code against sample HDFS data and then verify in HBase that I am getting the results I expect. Can anyone suggest a good way to do this?
Additionally, my test sets are very small. I'd also be happy if I could simply read the entire HBase table into memory and check it that way. I would appreciate the community's input.
Here's a simple way to read HBase data from Python using the HappyBase library and the HBase Thrift server.
To start the Thrift server on the HBase server:
/YOUR_HBASE_BIN_DIR/hbase-daemon.sh start thrift
Then from Python:
import happybase
HOST = 'Hbase server host name here'
TABLE_NAME = 'MyTable'
ROW_PREFIX = 'MyPrefix'
COL_TXT = 'CI:BO'.encode('utf-8')  # column family CI, column name BO (Text)
COL_LONG = 'CI:BT'.encode('utf-8') # column family CI, column name BT (Long)
conn = happybase.Connection(HOST) # uses the default Thrift port 9090; pass a second arg if non-default port
myTable = conn.table(TABLE_NAME)
for rowID, row in myTable.scan(row_prefix=ROW_PREFIX.encode('utf-8')): # or leave row_prefix out for a full table scan
    colValTxt = row[COL_TXT].decode('utf-8')
    colValLong = int.from_bytes(row[COL_LONG], byteorder='big')
    print('Row ID: {}\tColumn Value: {}'.format(rowID, colValTxt))
print('All Done')
As discussed in the comment, this won't work if you try to pass things into Spark workers, as the above HBase connection is not serializable. So you can only run this type of code from the master program. If you figure out a way -- share!
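Since your test sets are small, here is a minimal sketch of the in-memory check you mentioned, reusing the connection and table handle from above (the row key and expected values are hypothetical placeholders):
whole_table = dict(myTable.scan())  # full scan; only sensible for small test tables
assert b'MyPrefix/row-1' in whole_table  # hypothetical row key
assert whole_table[b'MyPrefix/row-1'][COL_TXT] == b'expected text'  # hypothetical expected value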
Related
I am using Apache Airflow in my project. In this project, users can connect their database to our project and copy their tables into our database.
So I am able to establish a connection using the following lines
import json
from airflow.models.connection import Connection
from airflow.providers.mysql.hooks.mysql import MySqlHook  # needed for the hook used below (Airflow 2.x provider path)
c = Connection(
    conn_id='some_conn',
    conn_type='mysql',
    description='connection description',
    host='myhost.com',
    login='myname',
    schema='myschema',
    password='mypassword',
    extra=json.dumps(dict(this_param='some val', that_param='other val*')),
)
print(f"AIRFLOW_CONN_{c.conn_id.upper()}='{c.get_uri()}'")
hook = MySqlHook(c.conn_id)
result = hook.get_records(f"SELECT table_name FROM information_schema.tables WHERE table_schema = '{c.schema}';")
Now I am able to get the table names from the connected database.
How can I copy data from this connected database into our database? Please help me with some hints on this.
This depends on what databases you want to copy data between.
A straightforward approach could be outlined as the following steps.
Grab the records from Database A.
Insert the records into Database B.
You would create a custom operator or task that performs those steps in order (a sketch is shown below). There might even be existing operators that already fulfil these functions; I would advise you to take a look at the Airflow GitHub repository first.
Please note that this approach is not suitable for large datasets because the data is stored in memory during the task execution. You can also write to disk but that route then depends on the machine that the Airflow worker runs on.
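A minimal sketch of that grab-and-insert approach, assuming both ends are MySQL, the target table already exists, and the connection IDs and table name are placeholders (import paths may differ between Airflow versions):
from airflow.decorators import task
from airflow.providers.mysql.hooks.mysql import MySqlHook

@task
def copy_table(source_conn_id: str, target_conn_id: str, table_name: str):
    source_hook = MySqlHook(mysql_conn_id=source_conn_id)
    target_hook = MySqlHook(mysql_conn_id=target_conn_id)
    rows = source_hook.get_records(f"SELECT * FROM {table_name}")  # step 1: grab the records
    target_hook.insert_rows(table=table_name, rows=rows)           # step 2: insert them into the target
Keep in mind this pulls everything through the worker's memory, as noted above.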
If the databases live on the same cluster/server, then a simple SQL script would work. A HiveOperator, for example, would be sufficient to move data with some INSERT INTO SQL commands.
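For that same-server case, a rough sketch with the HiveOperator (the table names and connection ID are placeholders, and both tables are assumed to exist already):
from airflow.providers.apache.hive.operators.hive import HiveOperator

copy_within_warehouse = HiveOperator(
    task_id="copy_within_warehouse",
    hql="INSERT INTO TABLE target_db.my_table SELECT * FROM source_db.my_table",
    hive_cli_conn_id="hive_cli_default",  # default Hive CLI connection ID
)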
I would like to parallelise queries to a MongoDB database, using pymongo. I am using an HPC system, which uses Slurm as the workload manager. I have a setup which works fine on a single node, but fails when the tasks are spread across more than one node.
I know that the problem is that MongoDB is bound to the node I start it on, and therefore the additional nodes can't connect to it.
I specifically would like to know how to start and then connect to the mongodb server when using multiple HPC nodes. Thanks!
Some extra details:
Before starting my Python script, I start MongoDB like this:
numactl --interleave=all mongod --dbpath=database &
And I get the warning message:
** WARNING: This server is bound to localhost.
** Remote systems will be unable to connect to this server.
** Start the server with --bind_ip <address> to specify which IP
** addresses it should serve responses from, or with --bind_ip_all to
** bind to all interfaces. If this behavior is desired, start the
** server with --bind_ip 127.0.0.1 to disable this warning.
In my Python script, I have a worker function which is run by each processor. It is basically structured like this:
import pymongo

def worker(args):
    cl = pymongo.MongoClient()
    db = cl.mydb
    collection = db['mycol']
    query = {}
    result = collection.find_one(query)
    # now do some work...
The warning message mentions --bind_ip <address>. To know the IP address of a compute node, the simplest solution is to use the hostname -i command. So in your submission script, try
numactl --interleave=all mongod --dbpath=database --bind_ip $(hostname -i) &
But then, your Python script must also know the IP address of the node on which MongoDB is running:
def worker(args):
    cl = pymongo.MongoClient(host=<IP of MongoDB Server>)
    db = cl.mydb
    collection = db['mycol']
    query = {}
    result = collection.find_one(query)
    # now do some work...
You will need to adapt the <IP of MongoDB Server> part depending on how you want to pass the information to the Python script. It can be through a command-line parameter, through the environment, through a file, etc.
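For instance, a minimal sketch using an environment variable (the variable name MONGODB_HOST is hypothetical; it would be exported in the submission script, e.g. export MONGODB_HOST=$(hostname -i), before the srun call so the workers inherit it):
import os
import pymongo

def worker(args):
    host = os.environ['MONGODB_HOST']                   # hypothetical variable exported by the job script
    port = int(os.environ.get('MONGODB_PORT', 27017))   # optional; defaults to the standard MongoDB port
    cl = pymongo.MongoClient(host=host, port=port)
    db = cl.mydb
    collection = db['mycol']
    result = collection.find_one({})
    # now do some work...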
Do not forget to use srun to run the Python script on all nodes of the allocation, or you will need to implement that functionality in your Python script itself.
Also, do not hesitate to change the default MongoDB port from job to job to avoid possible interference if you have several of them running.
Hello guys, I have a Cassandra table for ARIMA time series forecasting. Can you share some basic steps on how to implement it?
Let's start with the Cassandra database.
Common practice in Python is to use database clients from pip packages. In this case, cassandra-driver is the package you need.
https://datastax.github.io/python-driver/installation.html
Then the following page provides a working python example with Cassandra database queries:
https://techfossguru.com/apache-cassandra-python-step-step-guide-ubuntu-example/
# simple example without security
from cassandra.cluster import Cluster

server_address = "localhost"  # or wherever Cassandra is hosted
keyspace = "my_keyspace"      # or whatever you select

cluster = Cluster([server_address])  # contact points must be given as a list
session = cluster.connect(keyspace)
rows = session.execute('SELECT * FROM mytable WHERE somecolumn > 10 LIMIT 100 ALLOW FILTERING;')  # get up to 100 matching rows from mytable
cluster.shutdown()
For ARIMA, it's a pretty complex algorithm. See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
I suggest first figuring out what the algo does and then browsing for reference implementations.
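As a starting point, here is a rough sketch using statsmodels on data pulled from Cassandra (the column names day and value, the keyspace, and the ARIMA order are all placeholders you would need to adapt):
import pandas as pd
from cassandra.cluster import Cluster
from statsmodels.tsa.arima.model import ARIMA

cluster = Cluster(["localhost"])
session = cluster.connect("my_keyspace")
rows = session.execute("SELECT day, value FROM mytable;")  # hypothetical time/value columns
cluster.shutdown()

series = (pd.DataFrame(list(rows), columns=["day", "value"])
            .sort_values("day")
            .set_index("day")["value"])

model = ARIMA(series, order=(1, 1, 1))  # (p, d, q) chosen only for illustration
fitted = model.fit()
print(fitted.forecast(steps=10))        # forecast the next 10 points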
I want to connect to the database available inside DynamoDB Local using the boto SDK. I followed the documentation at the link below.
http://boto.readthedocs.org/en/latest/dynamodb2_tut.html#dynamodb-local
This is the official documentation provided by Amazon. But when I execute the snippet from the document, I am unable to connect to the db and I can't get the tables available inside it. The db name is "dummy_us-east-1.db", and my snippet is:
from boto.dynamodb2.layer1 import DynamoDBConnection

con = DynamoDBConnection(
    host='localhost',
    port=8000,
    aws_access_key_id='dummy',
    aws_secret_access_key='dummy',
    is_secure=False,
)
print con.list_tables()
I have 8 tables available inside the db, but I am getting an empty list after executing the list_tables() command.
output:
{u'TableNames':[]}
Instead of accessing the required database, it is creating and accessing a new database.
Old database: dummy_us-east-1.db
New database: dummy_localhost.db
How can I resolve this?
Please give me some suggestions regarding DynamoDB Local access. Thanks in advance.
It sounds like you are connecting to DynamoDB Local with different credential/region combinations, and by default it keeps a separate database file for each one (hence dummy_us-east-1.db vs dummy_localhost.db).
You can start DynamoDB Local with the -sharedDb flag to force it to use a single db file:
-sharedDb When specified, DynamoDB Local will use a
single database instead of separate databases
for each credential and region. As a result,
all clients will interact with the same set of
tables, regardless of their region and
credential configuration.
E.g.
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
Here is the solution. This happens because you didn't start DynamoDB Local with the library path pointing at its jar location and the -sharedDb flag:
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
I have an online database and connect to it by using MySQLdb.
db = MySQLdb.connect(......)
cur = db.cursor()
cur.execute("SELECT * FROM YOUR_TABLE_NAME")
data = cur.fetchall()
Now, I want to write the whole database to my localhost (overwrite). Is there any way to do this?
Thanks
If I'm reading you correctly, you have two database servers, A and B (where A is a remote server and B is running on your local machine) and you want to copy a database from server A to server B?
In all honesty, if this is a one-off, consider using the mysqldump command-line tool, either directly or by calling it from Python (a sketch is shown below).
If not, the last answer on http://bytes.com/topic/python/answers/24635-dump-table-data-mysqldb details the SQL needed to define a procedure to output tables and data, though this may well miss subtleties that mysqldump does not.
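For the one-off route, a minimal sketch of driving mysqldump from Python (the host names, credentials, and database name are placeholders; the local database must already exist):
import subprocess

# dump the remote database to a SQL script held in memory
dump = subprocess.run(
    ["mysqldump", "-h", "remote.example.com", "-u", "remote_user",
     "-premote_password", "mydatabase"],
    check=True, capture_output=True,
)

# replay the dump against the local server (the default dump drops and recreates matching tables)
subprocess.run(
    ["mysql", "-h", "localhost", "-u", "local_user", "-plocal_password", "mydatabase"],
    input=dump.stdout, check=True,
)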