How to access and query MongoDB on HPC - Python

I would like to parallelise queries to a MongoDB database, using pymongo. I am using an HPC system, which uses Slurm as the workload manager. I have a setup which works fine on a single node, but fails when the tasks are spread across more than one node.
I know that the problem is that MongoDB is bound to the node I start it on, and therefore the additional nodes cannot connect to it.
I specifically would like to know how to start and then connect to the mongodb server when using multiple HPC nodes. Thanks!
Some extra details:
Before starting my python script, I start the mongodb like this:
numactl --interleave=all mongod --dbpath=database &
And I get the warning message:
** WARNING: This server is bound to localhost.
** Remote systems will be unable to connect to this server.
** Start the server with --bind_ip <address> to specify which IP
** addresses it should serve responses from, or with --bind_ip_all to
** bind to all interfaces. If this behavior is desired, start the
** server with --bind_ip 127.0.0.1 to disable this warning.
In my python script, I have a worker function which is run by each processor. It is basically structured like this:
import pymongo

def worker(args):
    cl = pymongo.MongoClient()
    db = cl.mydb
    collection = db['mycol']
    query = {}
    result = collection.find_one(query)
    # now do some work...

The warning message mentions --bind_ip <address>. To find out the IP address of a compute node, the simplest solution is to use the hostname -i command. So in your submission script, try
numactl --interleave=all mongod --dbpath=database --bind_ip $(hostname -i) &
But then, your Python script must also know the IP address of the node on which MongoDB is running:
def worker(args):
    cl = pymongo.MongoClient(host=<IP of MongoDB Server>)
    db = cl.mydb
    collection = db['mycol']
    query = {}
    result = collection.find_one(query)
    # now do some work...
You will need to adapt the <IP of MongoDB Server> part depending on how you want to pass the information to the Python script. It can be through a command-line parameter, through the environment, through a file, etc.
Do not forget to use srun to run the Python script on all nodes of the allocation; otherwise you will need to implement that functionality in your Python script itself.
Also consider changing the default MongoDB port from job to job, to avoid possible interference if you have several jobs running at the same time.
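For instance, here is a minimal sketch of a worker that reads the server address from the environment; the variable names MONGO_HOST and MONGO_PORT are made up for illustration and would have to be exported by your submission script before srun launches the workers:

import os
import pymongo

# Hypothetical environment variables set by the submission script
# (e.g. MONGO_HOST=$(hostname -i) on the node that runs mongod).
MONGO_HOST = os.environ['MONGO_HOST']
MONGO_PORT = int(os.environ.get('MONGO_PORT', 27017))  # per-job port, falling back to the default

def worker(args):
    cl = pymongo.MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = cl.mydb
    collection = db['mycol']
    result = collection.find_one({})
    # now do some work...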


Fabric2 CLI: gracefully switch SSH user

I am using Invoke/Fabric with boto3 to create an AWS instance and hand it over to an Ansible script. In order to do that, a few things have to be prepared on the remote machine before Ansible can take over, notably installing Python, creating a user, and copying public SSH keys.
The AWS image comes with a particular user. I would like to use this user only to create my own user, copy public keys, and remove password login afterwards. While using the Fabric CLI, the connection object is not created and cannot be modified within tasks.
What would be a good way to switch users (aka recreate a connection object between tasks) and run the following tasks with the user that I just created?
I might not be going about it the right way (I am migrating from Fabric 1, where switching the env values was sufficient). Here are a few strategies I am aware of; most of them remove some flexibility we have been relying on.
Create a custom AMI on which all preparations have already been done.
Create a local Connection object within a task for the user setup before falling back to the connection object provided by the Fabric CLI.
Integrate AWS more deeply with Ansible (the problem is that we have users who might use Ansible after the instance is alive but do not have AWS privileges).
I guess this list also includes a best-practice question.
The AWS image comes with a particular user. I would like to use this user only to create my own user, copy public keys, and remove password login afterwards. While using the Fabric CLI the connection object is not created and cannot be modified within tasks.
I'm not sure this is accurate. I have switched users during the execution of a task just fine. You just have to make sure that all subsequent calls that need the updated env use the execute operation.
e.g.
from fabric.api import env, execute, run, task

def create_users():
    run('some command')

def some_other_stuff():
    run('whoami')

@task
def new_instance():
    # provision instance using boto3
    env.hosts = [ip_address]  # ip_address comes from the boto3 provisioning step
    env.user = 'ec2-user'
    env.password = 'sesame'
    execute(create_users)
    env.user = 'some-other-user'
    execute(some_other_stuff)
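If you want to stay on Fabric 2 rather than fall back to the Fabric 1 env/execute style, a similar effect can be had by building Connection objects explicitly inside the task. This is only a rough sketch under that assumption; the address, user names, commands and key path are placeholders:

from fabric import Connection, task

@task
def new_instance(c):
    # ... provision the instance with boto3 and obtain its address (placeholder) ...
    ip_address = '203.0.113.10'

    # First connection: the user shipped with the AMI.
    with Connection(ip_address, user='ec2-user',
                    connect_kwargs={'password': 'sesame'}) as bootstrap:
        bootstrap.run('sudo useradd -m deploy')  # create the real user
        # ... copy public keys, disable password login, etc. ...

    # Second connection: the user that was just created.
    with Connection(ip_address, user='deploy',
                    connect_kwargs={'key_filename': '/path/to/key.pem'}) as conn:
        conn.run('whoami')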

HBase and Integration Testing

I have a Spark project which uses HBase as its key/value store. As a whole, we've started implementing better CI/CD practices, and I am writing a Python client to run integration tests against a self-contained AWS environment.
While I am able to easily submit our Spark jobs and run them as EMR steps, I haven't found a good way to interact with HBase from Python. My goal is to be able to run our code against sample HDFS data and then verify in HBase that I am getting the results I expected. Can anyone suggest a good way to do this?
Additionally, my test sets are very small. I'd also be happy if I could simply read the entire HBase table into memory and check it that way. I would appreciate the community's input.
Here's a simple way to read HBase data from Python using the happybase API and the HBase Thrift server.
To start the Thrift server on the HBase server:
/YOUR_HBASE_BIN_DIR/hbase-daemon.sh start thrift
Then from Python:
import happybase

HOST = 'Hbase server host name here'
TABLE_NAME = 'MyTable'
ROW_PREFIX = 'MyPrefix'
COL_TXT = 'CI:BO'.encode('utf-8')   # column family CI, column name BO (text)
COL_LONG = 'CI:BT'.encode('utf-8')  # column family CI, column name BT (long)

conn = happybase.Connection(HOST)  # uses default port 9090; pass a second arg for a non-default port
myTable = conn.table(TABLE_NAME)

for rowID, row in myTable.scan(row_prefix=ROW_PREFIX.encode('utf-8')):  # omit row_prefix for a full table scan
    colValTxt = row[COL_TXT].decode('utf-8')
    colValLong = int.from_bytes(row[COL_LONG], byteorder='big')
    print('Row ID: {}\tColumn Value: {}'.format(rowID, colValTxt))
print('All Done')
As discussed in the comment, this won't work if you try to pass things into Spark workers, as the above HBase connection is not serializable. So you can only run this type of code from the master program. If you figure out a way -- share!
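Since the test sets are small, another option along the lines of the question is to pull the whole table into a plain dict with the same scan call and assert against it. This is just a sketch built on the snippet above; the host, table name, row key and expected value are placeholders:

import happybase

def read_table_as_dict(host, table_name):
    """Scan an entire (small) HBase table into {row_key: {column: value}}."""
    conn = happybase.Connection(host)
    try:
        return {row_key: dict(columns) for row_key, columns in conn.table(table_name).scan()}
    finally:
        conn.close()

# Hypothetical integration-test style check against expected output.
data = read_table_as_dict('Hbase server host name here', 'MyTable')
assert data[b'some-row-key'][b'CI:BO'] == b'expected value'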

Looking for a way to shard a MongoDB collection from within Python code

I'm searching for a way to remotely perform sharding on an existing collection, from within a Python (2.7) program.
I wasn't able to find an API in pymongo that does this, or maybe I just wasn't looking hard enough.
Is such thing possible?
Thanks in advance
Follow the instructions for setting up a sharded cluster, up to the point where you connect the "mongo" shell to the mongos server and say:
sh.enableSharding("<database>")
Instead, view the code for enableSharding by just typing the command without parentheses:
sh.enableSharding
You can see that it executes { enableSharding : dbname } on the "admin" DB, so do that with pymongo:
client = pymongo.MongoClient()
client.admin.command('enableSharding', 'dbname')
Replace 'dbname' with your database name, obviously. Repeat to shard a collection. Get the code from the shell:
sh.shardCollection
And execute the same command in Python:
client.admin.command('shardCollection', 'dbname.collectionname', key={'shardkey': 1})
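Putting the two commands together, a minimal sketch could look like this; the connection string, database name, collection name and shard key are placeholders, and the client has to point at a mongos router rather than a plain mongod:

import pymongo

# Connect to the mongos router of the sharded cluster (address is a placeholder).
client = pymongo.MongoClient('mongodb://mongos-host:27017')

# Enable sharding on the database, then shard the collection on a chosen key.
client.admin.command('enableSharding', 'dbname')
client.admin.command('shardCollection', 'dbname.collectionname', key={'shardkey': 1})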

Nested Fabric Connections

The scenario is our production servers are sitting in a private subnet with a NAT instance in front of them to allow maintenance via SSH. Currently we connect to the NAT instance via SSH then via SSH from there to the respective server.
What I would like to do is run deployment tasks from my machine using the NAT as a proxy without uploading the codebase to the NAT instance. Is this possible with Fabric or am I just going to end up in a world of pain?
EDIT
Just to follow up on this, as @Morgan suggested, the gateway option will indeed fix this issue.
For a bit of completeness, in my fabfile.py:
def setup_connections():
    """
    This should be called as the first task in all calls in order to set up the correct connections,
    e.g. fab setup_connections task1 task2...
    """
    env.roledefs = {}
    env.gateway = 'ec2-user@X.X.X.X'  # where all the magic happens
    tag_mgr = EC2TagManager(...)
    for role in ['web', 'worker']:
        env.roledefs[role] = ['ubuntu@%s' % ins for ins in
                              tag_mgr.get_instances(instance_attr='private_ip_address', role=role)]
    env.key_filename = '/path/to/server.pem'

@roles('web')
def test_uname_web():
    run('uname -a')
I can now run fab setup_connections test_uname_web and get the uname of my web server.
So if you have a newer version of Fabric (1.5+), you can try using the gateway option. I've never used it myself, but it seems like what you'd want.
Documentation here:
env var
cli flag
execution model notes
original ticket
Also, if you run into any issues, all of us tend to idle in IRC.
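If you are on Fabric 2 rather than 1.x, the same idea can be expressed per connection with the gateway argument; here is a rough sketch under that assumption, with all hosts, users and the key path as placeholders:

from fabric import Connection

# Route the connection to the private web server through the NAT instance.
nat = Connection('X.X.X.X', user='ec2-user',
                 connect_kwargs={'key_filename': '/path/to/server.pem'})
web = Connection('10.0.1.5', user='ubuntu', gateway=nat,
                 connect_kwargs={'key_filename': '/path/to/server.pem'})

web.run('uname -a')  # runs on the private web server, proxied via the NAT host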

Testing Memcached connection

I want to run a basic service which shows the status of various other services on the system (i.e. Mongo, Redis, Memcached, etc).
For Memcached, I thought that I might do something like this:
from django.core.cache import cache
host, name = cache._cache._get_server('test')
This seems to return a host and an arbitrary string. Does the host object confirm that I'm connecting to Memcached successfully?
I know that the returned host object has a connect() method. I'm slightly scared to open a new connection in production environment and I don't have an easy Dev setup to test that method. I assume that it's in one of the Python Memcached libraries, but I'm not sure which one is relevant here.
Can I just use the _get_server method to test Memcached connection success, or should I use the connect method?
There are various things you could monitor that don't require network connectivity, like whether the memcached process is up or whether its logs are moving. The next level of testing would be to check that you can open a socket on the memcached port. But the real test is, of course, to set and get a value from memcached. For that kind of testing I would probably just use the python-memcached package directly to make the connection and to set and get values.
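As a rough sketch of that last approach with python-memcached (the server address, key name and timeouts are placeholders):

import memcache

def memcached_ok(server='127.0.0.1:11211'):
    """Return True if a round-trip set/get against memcached succeeds."""
    client = memcache.Client([server], socket_timeout=2)
    try:
        if not client.set('healthcheck', 'ok', time=10):
            return False
        return client.get('healthcheck') == 'ok'
    finally:
        client.disconnect_all()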
