Unable scale up cassandra using latest datastax python driver

Unable scale up cassandra using latest datastax python driver - python

We are using multiple cassandra datastax cluster instances(6) to connect to cassandra using python. We are pooling these multiple connections to do some operations. Each operation is independent of other.
It works fine on a small number of operations, but once I try to scale up I get the following errors :
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.ption('Pool is shutdown',)})
and sometimes the following warning:
WARNING Heartbeat failed for connection (140414695068880) to 127.0.0.1
I tried changing some cluster object parameters but it did not help.
Following is the configuration of key space in cassandra I am using :
'class': 'SimpleStrategy',
'replication_factor': '1'
I am using lastest versions of cassandra and datastax driver for python. There is only one node is cassandra.
EDIT: More details:
The multiple cluster instances are in different processes (processes are created using the Python multiprocessing module) - one cluster instance per process. Lets call the proccesses Cassandra-Processes (CP). There are a bunch of other process that do some computation and need to look up a Cassandra DB, and write to it, occassionally. The current design is that each of these processes is mapped to one CP, and all DB reads/writes to be done by the process is done via this mapped CP. 'what' exactly is to be read/written is passed into a queue (again from the multiprocessing library) which the mapped CP reads.
We observe that this setup runs for quite sometime - and then suddenly Cassandra begins erroring out.

It's unclear why you're using six cluster instances against a single Cassandra node. Generally, you should use one Cluster instance per application (per remote cluster). You can read about general design considerations for Cassandra drivers here
If you're looking to "scale" with regards to throughput, you might consider using multiprocessing. I discuss this in a blog post here.
Follow-on:
Two things can be inferred from the information we have so far:
The application is pushing more concurrent requests than your connection pool is configured to handle. I say this because the "Pool is shutdown" only occurs when a request is waiting for a connection/stream to become available. You can tune connection pooling to make more available initially using cluster settings. However, if your "cluster" (server node) is overwhelmed, you won't gain much there.
Your connection is being shutdown. This exception only happens when the node is suddenly marked down. In a single node setup this is most likely because of a connection error. Look for clues in the server log, or driver debug log if you're capturing that.
We probably need to know more about your execution model to help more. Is it possible you're running unfettered async requests without occasionally waiting for them to complete?

Remote diagnosis is hard to do without knowing anything on your specific topology, setup and system configuration. This however looks much like a configuration problem or even the python driver. If you google your error message you will find multiple topics on Datastax's Jira describing this or similar problems, I would check that the Python Driver is up to date.
What would help in the first place is to see in detail what you try to do, how your cluster is configured aso.

Related

how to create a multiple connection with advantage database server at same time?

I am trying to integrate alongside an existing application that uses ADS as its database.
When i connect my integration app using the code below it connects fine until i try and run the original application at the same time. It seems to only allow one connection, my application seems to hold the connection and block all others. Yet i can have multiple instances of the original application running conncurrently with no issue. Which leads me to believe that its the way in which i am trying to correct from my c# app. The error im getting when the original app is open and i then try to connect with my integration app is "The Advantage Data Dictionary cannot be opened. axServerConnect" .
Error 7077: The Advantage Data Dictionary cannot be opened. axServerConnect
Anyone any suggestions? How to create a multiple connection at same time?
Python code:
conn = adsdb.connect(DataSource=str(dbpath[0]), ServerType='local',
UserID = config.ADS_USERNAME, password=config.ADS_PASS)

According to this page in ADS documentations, you can use connection pooling by providing pooling=True to your client connection arguments.
I think using this approach, you will be able to open multiple connections at the same time.
Edit
After checking adsdb python script, I think it does not support connection pooling. You probably be able to set that connection pooling in your C# application.

Sending Data using Python

I Have a Python application which analyses data from multiple sources in real time. Once the data is analyzed the result of the analysis is stored in a database along with a time-stamp of when it was analyzed.
I would like to access the most recent result of this program remotely from another computer.
I was thinking about using python sockets and having a server script running on the main computer which runs the application and then that way I can access the data using a client script on another computer.
Is there a better way of doing this? Or are there any other solutions out there that can address this need?

Your question is very broad.
Most DB servers will provide a method/API to access the data remotely. You can use Python as a client if there is a DBAPI module for your DB that supports remote access over the network. For example if you are using Postgres you could use the psycopg2 module.
If you are using a simple DB such as SQLite then you might be able to use an ODBC driver. Some alternatives are here.
Edit
mongodb provides an API, pymongo.

In the end Redis was the best solution. Considering the original question The goal was to be able to send data in real time from one computer to another. Solutions such as Redis or RabbitMQ successfully accomplish this.
With Redis a server can be setup and it can publish messages to the network, clients can then subscribe to data channels and receive the messages in a queue
This Python library was used as a python Redis client :
https://pypi.python.org/pypi/redis

SQLite over a network

I am creating a Python application that uses embedded SQLite databases. The programme creates the db files and they are on a shared network drive. At this point there will be no more than 5 computers on the network running the programme.
My initial thought was to ask the user on startup if they are the server or client. If they are the server then they create the database. If they are the client they must find a server instance on the network. The one way I suppose is to send all db commands from client to server and server implements in the database. Will that solve the shared db issue?
Alternatively, is there some way to create a SQLite "server". I presume this would be the quicker option if available?
Note: I can't use a server engine such as MySQL or PostgreSQL at this point but I am implementing using ORM and so when this becomes viable, it should be easy to change over.

Here's a "SQLite Server", http://sqliteserver.xhost.ro/, but it looks like not in maintain for years.
SQLite supports concurrency itself, multiple processes can read data at one time and only one can write data into it. Also, When some process is writing, it'll lock the whole database file for a few seconds and others have to wait in the mean time according official document.
I guess this is sufficient for 5 processes as yor scenario. Just you need to write codes to handle the waiting.

Managing a pool of connections to a hosted Elastic Search provider

I need a way to manage connections to a hosted Elastic Search provider, to speed up search on my website. We are running Django on Heroku, using the Found ElasticSearch add-on, and pyes, which is an ElasticSearch Python library.
The standard way of setting up a connection to ElasticSearch with pyes is by passing the provider URL into an ES object, like so:
(1) connection = ES(my_elasticsearch_url)
Pyes uses the ES object behind the scenes to establish an open HTTP connection to my ElasticSearch provider, so I can run searches like this:
(2) results = connection.search(some_query, index_name)
Previously, I was doing both those steps in my Django view for search -- every time a user did a search, it opened a new HTTP connection then ran the search. Consequentially the search call was slow.
I sped up search by moving (1) into my app's __init__.py file -- now, I am setting up the connection only once, and importing it into the search view. But I'm worried it will choke that HTTP connection if lots of people are trying search at once.
I'm looking for ideas on how to set up a pool of connections, initiate them once on app start up, and then dole them out to my search view as needed. Ideally I'd like to be able to scale the size of the pool up and down easily with minimal changes to my code.
I can think of a few ways to approach it, but it seems like a common computing related problem, so I'm sure that a lot of you have ideas on good design and best practices for such a system. I'd love to hear them.
Thanks a lot!
Clay

If your running in a multi-threaded environment, it's merely a matter of extending Queue.Queue to create an instance that can fetch and instantiate connections on demand, from multiple threads in which your views are handling the request-response flow. You'll probably want to have a certain cap on how many connections your retaining by limiting the maximum size of the queue, although you can instantiate more connections beyond that and simply discard them if you can put them back into the queue.
The downside of using Queue.Queue is that it can create cross-cutting concerns if your views are responsible for retrieving connections from and returning them back into the queue. You can get a healthier design if you only queue the actual object from pyes.ES that holds the connection and create a wrapper for ES that, when performing a query, creates a new ES instance, fetches a connection from the queue, sets it on the instance, performs the query, returns the connection back into the queue, discards the ES instance and returns the query results.

Google AppEngine and Threaded Workers

I am currently trying to develop something using Google AppEngine, I am using Python as my runtime and require some advise on setting up the following.
I am running a webserver that provides JSON data to clients, The data comes from an external service in which I have to pull the data from.
What I need to be able to do is run a background system that will check the memcache to see if there are any required ID's, if there is an ID I need to fetch some data for that ID from the external source and place the data in the memecache.
If there are multiple id's, > 30 I need to be able to pull all 30 request as quickly and efficiently as possible.
I am new to Python Development and AppEngine so any advise you guys could give would be great.
Thanks.

You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.

Note that using memcache as the communication mechanism between front-end and back-end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache of you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of using memcache, or using the datastore as a storage mechanism.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.