concerning the connection of my MySQL server to jupyter notebook [duplicate] - python

I am running a Python 2.7 application that performs inserts into a single MySQL/MariaDB instance on a multi-core 64-bit CentOS (or Ubuntu) machine. As soon as the number of parallel processes/cores exceeds 4 or maybe 6, I see this error (at different points in the execution):
2003: Can't connect to MySQL server on '127.0.0.1:3306' (99 Cannot assign requested address)
I am running the application on CentOS 6.5 with MariaDB 10.1.
I have also tried Ubuntu 14.04 (64-bit) with MySQL, with the same result.
I tried making the following changes:
In my.cnf file:
[mysqld]
interactive_timeout=1
wait-timeout = 1
thread_cache_size = 800
max_connections = 5000
#max_user_connections = 5000
max_connect_errors = 150
In sysctl.conf file:
fs.file-max = 65536
In limits.conf file:
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
I am inclined to think that this is a configuration issue, because the code runs just fine on a 2-core Mac. Can someone suggest some configuration tweaks or an easy way to reuse connections?

You're probably connecting to and disconnecting from mysqld at a pretty high rate?
And when hitting error 99 you're probably seeing a lot of connections in the TIME_WAIT state in netstat -nt output?
The problem most likely is that you are running out of client ports pretty quickly due to frequent reconnects and the TIME_WAIT delay. This would also explain why you are more likely to run into this the higher your number of parallel clients is.
The TL;DR solution may be to set net.ipv4.tcp_tw_reuse to 1, e.g. using
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
or, as you have the clients and the MySQL server on the same machine anyway, you could use UNIX domain socket connections instead of TCP. This may be as simple as connecting to the verbatim "localhost" host name instead of 127.0.0.1, but I don't know how the various Python connectors handle this ...
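If it helps, here is a minimal sketch (not from the question) of a UNIX-socket connection using mysql-connector-python; the socket path and credentials are assumptions and depend on your distribution and my.cnf (Debian/Ubuntu typically uses /var/run/mysqld/mysqld.sock, CentOS/MariaDB often /var/lib/mysql/mysql.sock):
import mysql.connector

conn = mysql.connector.connect(
    user="app_user",                                # hypothetical credentials
    password="app_password",
    database="app_db",
    unix_socket="/var/run/mysqld/mysqld.sock",      # local socket, so no client TCP ports are consumed
)
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())
conn.close()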
For more detailed tips and explanations see
http://www.fromdual.com/huge-amount-of-time-wait-connections
and
http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html

In a more recent environment, I had the same error message when trying to access MariaDB in a Docker container behind a Traefik 2.0-beta reverse proxy.
The error message bears no relation to what actually happened, which is why I'm adding my two cents.
The solution I found was to use the name of the Docker service as the host, instead of the local domain name I had given to my MariaDB server via Traefik.
Some new inputs may be available here https://github.com/jclaveau/docker-standalone-webdev-ci-stack/tree/master/examples/docker-example-lamp

You can add a sleep between connection attempts to work around this issue.
import time
time.sleep(2)  # sleep for two seconds
This reduces the number of connections sitting in the wait state.

Related

Why is there a discrepancy between python sockets and tcp ping for the same IP:port destination?

My setup:
I am using an IP and port provided by portmap.io to allow me to perform port forwarding.
I have OpenVPN installed (as required by portmap.io), and I run a ready-made config file when I want to operate my project.
My main effort involves sending messages between a client and a server using sockets in Python.
I have installed a tool called tcping, which basically allows me to ping an IP:port over a TCP connection.
A figure (omitted here) basically sums up this setup.
Results I'm getting:
When I try to "ping" said IP, the average RTT ends up being around 30ms consistently.
I then use the same IP for socket programming in Python: a server script running on my machine, and a client script on another machine connecting to this IP. When I send a small message like "Hello" over the socket, I find that it takes a significantly longer - and inconsistent - amount of time to travel across. Sometimes it takes 1 second, sometimes 400ms...
What is the reason for this discrepancy?
What is the reason for this discrepancy?
tcping just measures the time needed to establish the TCP connection. Connection establishment is usually handled completely in the OS kernel, so there is not even a switch to user space involved.
Even a small data exchange at the application level is significantly more expensive. First the initial TCP handshake must be done; usually only once it completes does the client start sending the payload. The payload then needs to be delivered to the other side, put into the socket's read buffer, the user-space application must be scheduled to run, read the data from the buffer, process it, and create a response, which is handed to the peer's OS kernel; the kernel delivers the response back to the local system, with plenty more happening there too, until the local application finally gets the response and stops timing how long all of this took.
Given how far the measured time is from the pure RTT, though, I would assume that the server system either has low performance or high load, or that the application is badly written.
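To illustrate the difference, here is a rough sketch (with an assumed host/port, not the setup from the question) that times the bare TCP handshake - roughly what tcping reports - against a full send/receive round trip on the same socket:
import socket
import time

HOST, PORT = "192.0.2.10", 5000          # placeholder address

# 1) Handshake only: comparable to what tcping measures.
t0 = time.time()
s = socket.create_connection((HOST, PORT), timeout=5)
connect_ms = (time.time() - t0) * 1000

# 2) Full round trip: the payload must reach the peer's user-space app,
#    be processed there, and the reply must travel back.
t1 = time.time()
s.sendall(b"Hello")
reply = s.recv(1024)
roundtrip_ms = (time.time() - t1) * 1000

print("connect: %.1f ms, request/response: %.1f ms" % (connect_ms, roundtrip_ms))
s.close()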

Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster

I am running an Amazon Web Services RDS Aurora 5.6 database cluster. There are a couple of Lambdas talking to these database instances, all written in Python. Everything was running well, but suddenly, since a couple of days ago, the Python code sometimes throws the following error:
[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)
This happens in roughly 1 out of every 1000 new connections. What is interesting is that I haven't touched this whole service in the last couple of days (since it started happening). All Lambdas use the official MySQL connector client and connect on every initialization with the following snippet:
import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=os.environ['DATABASE_HOST'],
                           autocommit=True)
To rule out that this is a problem in the Python MySQL client I added the following to resolve the host:
import os
import socket
host = socket.gethostbyname(os.environ['DATABASE_HOST'])
Also here I sometimes get the following error:
[ERROR] gaierror: [Errno -2] Name or service not known
Now I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. What is interesting is that I also recently encountered exactly the same problem in a different region, with the same setup (an Aurora 5.6 cluster with Python Lambdas connecting to it), and the same thing happens there.
I've tried restarting all the machines in the cluster, but the problem still occurs. Is this really a DNS issue? What can I do to stop this from happening?
AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.
According to their documentation on DNS Quotas:
Each Amazon EC2 instance limits the number of packets that can be sent
to the Amazon-provided DNS server to a maximum of 1024 packets per
second per network interface. This quota cannot be increased. The
number of DNS queries per second supported by the Amazon-provided DNS
server varies by the type of query, the size of response, and the
protocol in use. For more information and recommendations for a
scalable DNS architecture, see the Hybrid Cloud DNS Solutions for
Amazon VPC whitepaper.
It's important to note that the metric we're looking at here is packets per second, per ENI. What's important about this? Well, it may not be immediately obvious that although the actual number of packets per query varies, there are typically multiple packets per DNS query.
While these packets cannot be seen in VPC flow logs, upon reviewing my own packet captures, I can see some resolutions consisting of about 4 packets.
Unfortunately, I can't say much about the whitepaper; at this stage, I'm not really considering the implementation of a hybrid DNS service as a "good" solution.
Solutions
I'm looking into ways to reduce the risk of this error occurring, and to limit its impact when it does. As I see it, there are a number of options to achieve this:
Force Lambda functions to resolve the Aurora cluster's DNS before doing anything else, use the private IP address for the connection, and handle failures with an exponential back-off. To minimise the cost of waiting for retries, I've set a total timeout of 5 seconds for DNS resolution, including all back-off wait time (a rough sketch follows this list).
Making many, short-lived connections comes with a potentially costly overhead, even if you're closing the connection. Consider using connection pooling on the client side, as it is a common misconception that Aurora's connection pooling is sufficient to handle the overhead of many short-lived connections.
Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it's important to know that you're always connected to the "right" (or write, in some cases :P) instance. As updates to the Aurora cluster's DNS name can take time to propagate, even with its 5-second TTL, it might be better to make use of the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "in near-real-time" metadata about DB instances. Note that the table "contains cluster-wide metadata". If that's too much effort, have a look at option 4.
Use a smart driver, which:
is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
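As a concrete illustration of option 1, here is a rough sketch of resolving the cluster endpoint with an exponential back-off capped at about 5 seconds, then connecting by IP; the helper name and timings are illustrative, not the exact code running in my Lambdas:
import os
import socket
import time

import mysql.connector

def resolve_with_backoff(hostname, total_timeout=5.0):
    # Retry DNS resolution with exponential back-off until the deadline.
    deadline = time.time() + total_timeout
    delay = 0.1
    while True:
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if time.time() + delay > deadline:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 1.0)

ip = resolve_with_backoff(os.environ['DATABASE_HOST'])
connection = mysql.connector.connect(user=os.environ['DATABASE_USER'],
                                     password=os.environ['DATABASE_PASSWORD'],
                                     database=os.environ['DATABASE_NAME'],
                                     host=ip,
                                     autocommit=True)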
Not solutions
Initially, I thought it might be a good idea to create a CNAME which points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in varying levels of detail in The Aurora Connection Management Handbook:
Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL.
Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster.
I hope this helps!
I had the same error with an instance (and ruled out the DNS lookup limit). After some time I stumbled on an AWS support thread indicating that it could be a hardware problem.
The physical underlying host of your instance (i-3d124c6d) looks to have intermittently been having issues, some of which would definitely have caused service interruption.
Could you try stopping and starting this instance? Doing so will cause it to be brought up on new underlying hardware and then we could utilize your pingdom service to verify if further issues arise.
from: https://forums.aws.amazon.com/thread.jspa?threadID=171805.
Stopping and restarting the instance resolved the issue for me.

Spyne rpc server benchmark (Failed to establish a new connection)

I have a Python RPC server written on top of Spyne (http://spyne.io) and Twisted. I've run some Multi-Mechanize benchmarks on it and, as you can see in the image below, after a minute of testing it starts to have problems establishing connections.
111274, 254.989, 1516806285, user_group-1, 0.017, HTTPConnectionPool(host='0.0.0.0' port=4321): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2c78bf2810>: Failed to establish a new connection: [Errno 99] Cannot assign requested address')), {'increment': 0.0179598331451416}
Since this happens like clockwork (after 60 s), I suspect I've run into some implicit/default rate limits in Twisted (but my search for them wasn't successful).
Is this possible - if so can someone point me to those limits?
Or is this just overload of the server?
Thanks
[Multi-Mechanize benchmark image]
EDIT:
Thanks to Jean-Paul Calderone's answer, I looked at the number of TCP connections (netstat -at | wc -l).
Every time it gets above 28K I get the "Cannot assign requested address" error.
I'm happy this isn't a server issue. :)
The error
Cannot assign requested address
is probably telling you that you've run out of IP/port combinations. A particular IP address can be combined with about 65535 different port numbers to form an address. A TCP connection involves two addresses, one for each end of the connection.
Thus, a particular IP address cannot establish more than 65535 TCP connections. And if both ends of the TCP connection fall on the same IP address, that drops the limit to half of this.
Furthermore, TCP connection cleanup involves the TIME_WAIT state - a time-bounded interval during which the connection still exists even though it has already been closed. During this interval, the address of its endpoint is not available for re-use. Therefore, in addition to a hard maximum on the number of TCP connections you can have open at a given time on a given IP address, there is also a hard maximum on the number of TCP connections you can open within a given window of time.
If you are benchmarking on your local system, it seems likely that you've run into these limits. You can expand your benchmark beyond them by using more IP addresses.
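For illustration, a client can spread its connections over several local source addresses before connecting; the IPs below are placeholders and would need to be configured on the benchmark machine:
import itertools
import socket

local_ips = itertools.cycle(["10.0.0.11", "10.0.0.12", "10.0.0.13"])

def connect(server=("10.0.0.1", 4321)):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((next(local_ips), 0))     # choose the source IP; port 0 = any free port
    s.connect(server)
    return s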
Of course, if your server can deal with enough connections that your benchmark actually encounters these limits, perhaps you've already answered the question of whether the server is fast enough. There's not much to be gained by going from (for example) 1000 connections/second to 10000 connections/second unless your application logic can process requests faster by an order of magnitude or so.
Consider an application which requires 1ms of processing time. Paired with an RPC server which can service 1000 connections/sec, you'll be able to service 500 requests/second (1ms app + 1ms server -> 2ms -> 1/500th of a second -> 500/sec). Now replace the server with one that has one tenth the overhead: 1ms app + 0.1ms server -> 1.1ms -> 909/sec. So your server is 10x faster, but you haven't quite achieved 2x the throughput. Now, almost 2x is not too bad, but it's rapidly diminishing returns from here - a server ten times faster again only gets you to 990 requests/second. And you'll never beat 1000 requests/second because that's your application logic's limit.
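A quick sanity check of that arithmetic, treating the throughput of a serial request loop as 1 / (app time + server overhead):
def throughput(app_ms, server_ms):
    return 1000.0 / (app_ms + server_ms)

print(throughput(1.0, 1.0))     # ~500 requests/second
print(throughput(1.0, 0.1))     # ~909 requests/second: a 10x faster server, < 2x throughput
print(throughput(1.0, 0.01))    # ~990 requests/second: another 10x, barely any gain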

Error 2006: "MySQL server has gone away" using Python, Bottle Microframework and Apache

After accessing my web app using:
- Python 2.7
- the Bottle micro framework v. 0.10.6
- Apache 2.2.22
- mod_wsgi
- on Ubuntu Server 12.04 64bit; I'm receiving this error after several hours:
OperationalError: (2006, 'MySQL server has gone away')
I'm using MySQLdb, the native MySQL driver for Python. The error usually happens when I haven't accessed the server for a while. I've tried closing all the connections, which I do using this:
cursor.close()
db.close()
where db is the standard MySQLdb.Connection() call.
The my.cnf file looks something like this:
key_buffer = 16M
max_allowed_packet = 128M
thread_stack = 192K
thread_cache_size = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam-recover = BACKUP
#max_connections = 100
#table_cache = 64
#thread_concurrency = 10
It is the default configuration file except max_allowed_packet is 128M instead of 16M.
The queries to the database are quite simple, at most they retrieve approximately 100 records.
Can anyone help me fix this? One idea I did have was to use try/except, but I'm not sure if that would actually work.
Thanks in advance,
Jamie
Update: try/except calls didn't work.
This is a MySQL error, not a Python one.
The list of possible causes and possible solutions is here: MySQL 5.5 Reference Manual, C.5.2.9 "MySQL server has gone away".
Possible causes include:
You tried to run a query after closing the connection to the server. This indicates a logic error in the application that should be corrected.
A client application running on a different host does not have the necessary privileges to connect to the MySQL server from that host.
You have encountered a timeout on the server side and the automatic reconnection in the client is disabled (the reconnect flag in the MYSQL structure is equal to 0).
You can also get these errors if you send a query to the server that is incorrect or too large. If mysqld receives a packet that is too large or out of order, it assumes that something has gone wrong with the client and closes the connection. If you need big queries (for example, if you are working with big BLOB columns), you can increase the query limit by setting the server's max_allowed_packet variable, which has a default value of 1MB. You may also need to increase the maximum packet size on the client end. More information on setting the packet size is given in Section C.5.2.10, “Packet too large”.
You also get a lost connection if you are sending a packet 16MB or larger if your client is older than 4.0.8 and your server is 4.0.8 and above, or the other way around.
and so on...
In other words, there are plenty of possible causes. Go through that list and check every possible cause.
Make sure you are not trying to commit to a closed MySQLdb object.
An answer to a (very closely related) question has been posted here: https://stackoverflow.com/a/982873/209532
It relates directly to the MySQLdb driver (MySQL-python, which is unmaintained, and mysqlclient, its maintained fork), but the approach is the same for any other driver that does not support automatic reconnection.
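In the spirit of that answer, here is a hedged sketch of the reconnect-on-failure pattern with MySQLdb; the function and connection parameters are illustrative, not the linked answer's exact code:
import MySQLdb

def get_connection():
    return MySQLdb.connect(host="127.0.0.1", user="root", passwd="", db="db")

db = get_connection()

def execute(query, args=None):
    global db
    try:
        cur = db.cursor()
        cur.execute(query, args)
        return cur
    except MySQLdb.OperationalError as e:
        # 2006 = "MySQL server has gone away", 2013 = "Lost connection to MySQL server"
        if e.args[0] in (2006, 2013):
            db = get_connection()    # reconnect and retry once
            cur = db.cursor()
            cur.execute(query, args)
            return cur
        raise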
For me this was fixed using
MySQLdb.connect("127.0.0.1", "root", "", "db")
instead of
MySQLdb.connect("localhost", "root", "", "db")
and then
df.to_sql('df', sql_cnxn, flavor='mysql', if_exists='replace', chunksize=100)

pymongo is not closing connections

We are trying to build an API server for our project.
We are using MongoDB with PyMongo on Debian boxes; everything is up to date.
But we are having a really weird connection problem. There are generally more than 15k-32k connections to the MongoDB port when I check with
root#webserver1:/# netstat -na | grep mongo_db_ip | wc -l
which gives me 15363.
The connections are in the TIME_WAIT state...
But when I check mongo, I only see 5-6 connections at the moment...
We wrote a MongoDB class that creates an instance and makes the connection. We tried calling conn.disconnect() or conn.end_request() every time a query ends, but that did not stop the high connection count...
Can anybody tell me what my mistake might be, or is there a written Python class for MongoDB I could examine to see how others do this sort of thing?
Thanks for the help and information.
TIME_WAIT is not an open connection. It's an Operating System state for a socket so that it can make sure all data has come through. AFAIK, the default length for this on Linux is a minute. Have a look at http://antmeetspenguin.blogspot.com/2008/10/timewait-in-netstat.html, it has a good explanation. You can tell the kernel to reuse the TIME_WAIT sockets though:
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
reduces it to 30 seconds.
However, you should be checking why you are making so many connections. You're saying you're using the Debian packages for mongod and pymongo, and they tend to be out of date. You really want to be running mongod 2.0.2 and pymongo 2.1.1.
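If it helps, a minimal sketch (assuming a recent PyMongo, where the client class is MongoClient) of reusing a single client and its built-in connection pool per process instead of connecting per query; the host, database, and collection names are made up:
from pymongo import MongoClient

# One client per process: MongoClient keeps an internal connection pool,
# so queries reuse sockets instead of opening (and TIME_WAIT-ing) new ones.
client = MongoClient("mongodb://mongo_db_ip:27017", maxPoolSize=50)
db = client["api_db"]

def get_user(user_id):
    # No connect/disconnect per call; a pooled socket is reused.
    return db.users.find_one({"_id": user_id})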
