We have an HDInsight cluster with some tables in Hive. I want to query these tables from Python 3.6 on a client machine (outside Azure).
I have tried PyHive, pyhs2 and impyla, but I am running into various problems with all of them.
Does anybody have a working example of accessing Hive on HDInsight from Python?
I have very little experience with this and don't know how to configure PyHive (which seems the most promising), especially regarding authorization.
With impyla:
from impala.dbapi import connect
conn = connect(host='redacted.azurehdinsight.net',port=443)
cursor = conn.cursor()
cursor.execute('SELECT * FROM cs_test LIMIT 100')
print(cursor.description) # prints the result set's schema
results = cursor.fetchall()
This gives:
Traceback (most recent call last):
File "C:/git/ml-notebooks/impyla.py", line 3, in <module>
cursor = conn.cursor()
File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 125, in cursor
session = self.service.open_session(user, configuration)
File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 995, in open_session
resp = self._rpc('OpenSession', req)
File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 923, in _rpc
response = self._execute(func_name, request)
File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 954, in _execute
.format(self.retries))
impala.error.HiveServer2Error: Failed after retrying 3 times
With PyHive:
from pyhive import hive
conn = hive.connect(host="redacted.azurehdinsight.net",port=443,auth="NOSASL")
# also tried other auth types, but as I said, I have no clue here
This gives:
Traceback (most recent call last):
File "C:/git/ml-notebooks/PythonToHive.py", line 3, in <module>
conn = hive.connect(host="redacted.azurehdinsight.net",port=443,auth="NOSASL")
File "C:\Users\chris\Anaconda3\lib\site-packages\pyhive\hive.py", line 64, in connect
return Connection(*args, **kwargs)
File "C:\Users\chris\Anaconda3\lib\site-packages\pyhive\hive.py", line 164, in __init__
response = self._client.OpenSession(open_session_req)
File "C:\Users\chris\Anaconda3\lib\site-packages\TCLIService\TCLIService.py", line 187, in OpenSession
return self.recv_OpenSession()
File "C:\Users\chris\Anaconda3\lib\site-packages\TCLIService\TCLIService.py", line 199, in recv_OpenSession
(fname, mtype, rseqid) = iprot.readMessageBegin()
File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 134, in readMessageBegin
sz = self.readI32()
File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 217, in readI32
buff = self.trans.readAll(4)
File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TTransport.py", line 60, in readAll
chunk = self.read(sz - have)
File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TTransport.py", line 161, in read
self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TSocket.py", line 117, in read
buff = self.handle.recv(sz)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
According to the official document "Understand and resolve errors received from WebHCat on HDInsight":
What is WebHCat
WebHCat is a REST API for HCatalog, a table, and storage management layer for Hadoop. WebHCat is enabled by default on HDInsight clusters, and is used by various tools to submit jobs, get job status, etc. without logging in to the cluster.
So a workaround is to run the HiveQL from Python through WebHCat; please refer to the Hive documentation to learn how to use it. For reference, there is a similar MSDN thread that discusses this.
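As a rough illustration of the WebHCat route, here is a minimal sketch that only builds the HTTP request for WebHCat's Hive endpoint (`templeton/v1/hive`) using the standard library. The helper name, the example credentials and the `statusdir` path are assumptions for illustration, not values from the question; sending the request requires a real cluster login.

```python
import base64
import urllib.parse
import urllib.request


def build_webhcat_hive_request(host, user, password, query, statusdir):
    """Build (but do not send) a POST to WebHCat's Hive endpoint.

    `host` is the cluster host name, e.g. 'redacted.azurehdinsight.net'.
    """
    url = "https://{}/templeton/v1/hive".format(host)
    # WebHCat takes form-encoded fields: the HiveQL goes in `execute`, and
    # `statusdir` names the storage folder where the job output is written.
    form = urllib.parse.urlencode(
        {"user.name": user, "execute": query, "statusdir": statusdir}
    ).encode("utf-8")
    request = urllib.request.Request(url, data=form, method="POST")
    # Assumption: the cluster gateway accepts HTTP basic auth with the cluster login.
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


if __name__ == "__main__":
    req = build_webhcat_hive_request(
        "redacted.azurehdinsight.net", "admin", "secret",
        "SELECT * FROM cs_test LIMIT 100;", "/example/rest",
    )
    print(req.get_method(), req.full_url)
    # With valid credentials, urllib.request.urlopen(req) would submit the job.
```

Note that WebHCat runs the query as a batch job: the response identifies the job, and the query output has to be read from the `statusdir` location afterwards rather than from a cursor.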
Hope it helps.
Technically you should be able to use the Thrift connector and PyHive, but I haven't had any success with this. However, I have successfully used the JDBC connector through JayDeBeApi.
First you need to download the JDBC driver.
http://central.maven.org/maven2/org/apache/hive/hive-jdbc/1.2.1/hive-jdbc-1.2.1-standalone.jar
http://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.4/httpclient-4.4.jar
http://central.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.4/httpcore-4.4.4.jar
I put mine in /jdbc and used JayDeBeApi with the following connection string.
Edit: you need to add /jdbc/* to your CLASSPATH environment variable.
import jaydebeapi

conn = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver",
                          "jdbc:hive2://my_ip_or_url:443/;ssl=true;transportMode=http;httpPath=/hive2",
                          [username, password],
                          "/jdbc/hive-jdbc-1.2.1.jar")
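To go from the connection above to actual rows, something like the following sketch might work. The function names and parameters (`hive_jdbc_url`, `run_hive_query`, `jar_path`) are hypothetical helpers, not part of JayDeBeApi; the driver class and URL options are the ones from the connection string in this answer.

```python
def hive_jdbc_url(host, port=443, http_path="/hive2"):
    """Build the HiveServer2-over-HTTPS JDBC URL used in this answer."""
    return "jdbc:hive2://{}:{}/;ssl=true;transportMode=http;httpPath={}".format(
        host, port, http_path
    )


def run_hive_query(host, username, password, jar_path, sql):
    """Run one query through JayDeBeApi and return all rows."""
    # Third-party package; imported lazily so the URL helper works without it.
    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.hive.jdbc.HiveDriver",
        hive_jdbc_url(host),
        [username, password],
        jar_path,  # path to the downloaded driver jar (assumed to be local)
    )
    try:
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            return cursor.fetchall()
        finally:
            cursor.close()
    finally:
        conn.close()
```

Closing the cursor and connection in `finally` blocks keeps the JVM-side resources from leaking if the query fails.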
I am trying to run a Google Cloud Function that collects data from a website and inserts it into a Cloud SQL (MySQL) database, but I am having problems with SQLAlchemy in the Cloud Function that don't appear on my local machine. Any suggestions?
When I run the function locally, against Py3.7 (on a Mac, not using virtualenv), using the Cloud SQL Proxy and SQLAlchemy, I successfully connect to the database.
When running the Cloud Function, I use connection string in this format mysql+pymysql://<username>:<password>/<dbname>?unix_socket=/cloudsql/<PROJECT-NAME>:<INSTANCE-REGION>:<INSTANCE-NAME>.
The Cloud Function keeps throwing the following exception from SQLAlchemy's create_engine. It does not appear to be a connection problem; it fails while the engine is being instantiated.
Everything is in the same project.
I have also tried using the public IP and a connection string in the format mysql+pymysql://<username>:<password>@<public ip address>:3306/<dbname>, which made no difference.
Traceback (most recent call last):
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 449, in run_background_function
_function_handler.invoke_user_function(event_object)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 268, in invoke_user_function
return call_user_function(request_or_event)
File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 265, in call_user_function
event_context.Context(**request_or_event.context))
File "/user_code/main.py", line 14, in retrieve_and_log
engine = create_engine(connection_string,echo=True)
File "/env/local/lib/python3.7/site-packages/sqlalchemy/engine/__init__.py", line 500, in create_engine
return strategy.create(*args, **kwargs)
File "/env/local/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py", line 56, in create
plugins = u._instantiate_plugins(kwargs)
AttributeError: 'Context' object has no attribute '_instantiate_plugins'
Here is a snippet of my code:
import requests
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String
def retrieve_and_log(store_string, connection_string = 'mysql+pymysql://<username>:<password>/<dbname>?unix_socket=/cloudsql/<PROJECT-NAME>:<INSTANCE-REGION>:<INSTANCE-NAME>'):
    engine = create_engine(connection_string,echo=True)
    conn = engine.connect()
    # ....
If retrieve_and_log is the function you are trying to deploy as a background Cloud Function, it needs a signature like:
def retrieve_and_log(data, context):
    ...
It can't take arbitrary parameters.
See https://cloud.google.com/functions/docs/writing/background for more details.
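This also explains the traceback: the runtime calls the function with an event payload and a Context object, so the Context lands in the connection_string parameter and create_engine fails on it. A minimal sketch of one way to restructure the function follows; the environment variable names are hypothetical, and the `@/` before the database name is the usual SQLAlchemy form for socket connections, which is an assumption relative to the string shown in the question.

```python
import os


def build_connection_string():
    """Assemble the SQLAlchemy URL from environment variables (names are hypothetical)."""
    user = os.environ.get("DB_USER", "<username>")
    password = os.environ.get("DB_PASSWORD", "<password>")
    dbname = os.environ.get("DB_NAME", "<dbname>")
    instance = os.environ.get(
        "CLOUD_SQL_INSTANCE", "<PROJECT-NAME>:<INSTANCE-REGION>:<INSTANCE-NAME>"
    )
    return "mysql+pymysql://{}:{}@/{}?unix_socket=/cloudsql/{}".format(
        user, password, dbname, instance
    )


def retrieve_and_log(data, context):
    """Background function entry point: the runtime passes exactly (data, context)."""
    # Third-party package; imported here so the helper above stays importable without it.
    from sqlalchemy import create_engine

    engine = create_engine(build_connection_string(), echo=True)
    conn = engine.connect()
    try:
        # ... scrape and insert here; `data` carries the event payload
        pass
    finally:
        conn.close()
```

Anything the function needs beyond the event (such as the store to scrape) would have to come from `data` or from configuration, since extra parameters are not passed in.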
Fixed: I am just using the mysql.connector package now.
I have been programming in Python for a little while now, and I wanted to create a user login/logout system with a database linked to a self-built web platform for management, logging, etc.
Now I want to run a query to get all users from my database, but for some reason I am not able to get any results. I tried:
# As requested, the connector method.
def initiate_connection(self):
    return MySQLdb.connect("localhost", "root", "", "tester")

# This works!
def get_database_version(self):
    db = self.initiate_connection()   # Instantiate db connection
    curs = db.cursor()                # Cursor - more info: https://mysqlclient.readthedocs.io/user_guide.html#cursor-objects
    curs.execute("SELECT VERSION();") # Query command
    data = curs.fetchone()            # Fetch result
    db.close()                        # Close connection
    return data

# This doesn't :(
def get_users(self):
    db = self.initiate_connection()   # Instantiate db connection
    curs = db.cursor()                # Cursor - more info: https://mysqlclient.readthedocs.io/user_guide.html#cursor-objects
    curs.execute("SELECT name FROM users") # Query command
    data = curs.fetchone()            # Fetch result
    db.close()                        # Close connection
    return data
But I get an unknown column error, so I tried selecting everything to see what I would get: the result was None (NoneType). I am also able to retrieve the version from the database, so I assume I am connected properly.
I'm pretty clueless about what I'm doing wrong here. Any ideas?
Also db structure is:
db->tester
table->users
- id
- name
- password
- salt
- email
Edit:
Actual error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "D:\Programmas\PyCharm Community Edition 2019.2.5\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "D:\Programmas\PyCharm Community Edition 2019.2.5\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/oj/PycharmProjects/RegisteryHandeler/DatabaseHandler.py", line 25, in <module>
print(database_handler().get_users())
File "C:/Users/oj/PycharmProjects/RegisteryHandeler/DatabaseHandler.py", line 18, in get_users
curs.execute("SELECT name FROM users")# Query command
File "C:\Users\oj\PycharmProjects\RegisteryHandeler\venv\lib\site-packages\MySQLdb\cursors.py", line 209, in execute
res = self._query(query)
File "C:\Users\oj\PycharmProjects\RegisteryHandeler\venv\lib\site-packages\MySQLdb\cursors.py", line 315, in _query
db.query(q)
File "C:\Users\oj\PycharmProjects\RegisteryHandeler\venv\lib\site-packages\MySQLdb\connections.py", line 239, in query
_mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (1054, "Unknown column 'name' in 'field list'")
I am trying to connect Python to MySQL with this code:
(I am using the mysql.connector library)
import mysql.connector
cnx = mysql.connector.connect(user='root',
password='password',
host='127.0.0.1',
database='db')
print(cnx)
cnx.close()
But it continues to throw the error--
Traceback (most recent call last):
File "D:/ted/main.py", line 5, in <module>
database='db')
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\__init__.py", line 179, in connect
return MySQLConnection(*args, **kwargs)
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\connection.py", line 94, in __init__
self.connect(**kwargs)
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\abstracts.py", line 722, in connect
self._open_connection()
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\connection.py", line 211, in _open_connection
self._ssl)
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\connection.py", line 141, in _do_auth
auth_plugin=self._auth_plugin)
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\protocol.py", line 102, in make_auth
auth_data, ssl_enabled)
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\protocol.py", line 58, in _auth_response
auth = get_auth_plugin(auth_plugin)(
File "C:\Program Files (x86)\Python36-32\lib\mysql\connector\authentication.py", line 191, in get_auth_plugin
"Authentication plugin '{0}' is not supported".format(plugin_name))
mysql.connector.errors.NotSupportedError: Authentication plugin 'caching_sha2_password' is not supported
Process finished with exit code 1
I have started the MySQL server in Workbench and also checked its status.
A database named "db" exists.
I also checked that the host is correct.
I also removed SSL.
It seems that your MySQL Server version is 8.x, in which case the default authentication plugin is caching_sha2_password. Your error probably occurs because your Python connector does not support this authentication plugin, so you should explicitly switch to the old one (mysql_native_password):
cnx = mysql.connector.connect(user='root', password='password',
host='127.0.0.1', database='db',
auth_plugin='mysql_native_password')
I've recently installed the official MySQL extension for Python. However, when connecting to the server it asks me to select the database, but I have not made a database yet.
I don't really know what to do here. So I tried to connect using my information without the database but received errors with the following code:
import mysql.connector
cnx = mysql.connector.connect(user='ubuntulogin', password='ubuntupassword',
host='localhost')
cursor = cnx.cursor()
query = ("CREATE DATABASE database")
cursor.execute(query)
cursor.close()
cnx.close()
Please let me know of any issues with my code, or how to get MySQL information when I don't know my database name.
Thanks
EDIT: My error message when running the code was:
File "/home/liam/sqltest.py", line 3, in <module>
cnx = mysql.connector.connect(user='ubuntulogin', password='ubuntupassword', host='localhost')
File "/usr/lib/python2.7/dist-packages/mysql/connector/__init__.py", line 162, in connect
return MySQLConnection(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/mysql/connector/connection.py", line 129, in __init__
self.connect(**kwargs)
File "/usr/lib/python2.7/dist-packages/mysql/connector/connection.py", line 454, in connect
self._open_connection()
File "/usr/lib/python2.7/dist-packages/mysql/connector/connection.py", line 417, in _open_connection
self._socket.open_connection()
File "/usr/lib/python2.7/dist-packages/mysql/connector/network.py", line 475, in open_connection
errno=2003, values=(self.get_address(), _strioerror(err)))
mysql.connector.errors.InterfaceError: 2003: Can't connect to MySQL server on 'localhost:3306' (111 Connection refused)
Okay, so it turns out that I am a complete idiot. I was under the impression that MySQL was pre-installed and running, since I am used to PHP. However, I just installed it and used my code above, and everything seems to be working.
Thanks to @DYZ for pointing out that it looked like I hadn't started the MySQL server and that my credentials were incorrect (both of which turned out to be true).
You forgot something:
cursor.execute("CREATE DATABASE database;")
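One caveat, offered as a hedged sketch rather than a definitive fix: DATABASE is a reserved word in MySQL, so a schema literally named `database` generally needs backtick quoting. The helper names below are hypothetical; they only build the SQL text, which can then be passed to cursor.execute.

```python
def quote_identifier(name):
    """Wrap a MySQL identifier in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def create_database_statement(name):
    """Build a CREATE DATABASE statement that tolerates reserved-word names."""
    return "CREATE DATABASE IF NOT EXISTS " + quote_identifier(name)


if __name__ == "__main__":
    print(create_database_statement("database"))
    # With an open connection created without a database argument:
    #   cursor.execute(create_database_statement("database"))
    #   cursor.execute("SHOW DATABASES")  # lists the schemas visible to this login
```

Running SHOW DATABASES on a connection opened without a database is also a simple way to find out which database names exist.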
I am trying to connect to a MongoDB replicaset with PyMongo and to manually balance the reading load with ReadPreference.
The problem is that whatever I try with MongoClient, it always reads from PRIMARY.
I am using Python 2.7.6 and PyMongo 2.6.3 with mongodb-10gen 2.4.14 (all for legacy reasons) on Linux Mint 17.2.
My connection sequence looks like this (without the print, and with massive requests on some collection of some database from the connection):
>>> from pymongo import MongoClient, ReadPreference
>>> HOST = "10.0.0.51"
>>> PORT = 49029
>>> print MongoClient(host=HOST, port=PORT, replicaset="rs02", readPreference=ReadPreference.SECONDARY)
MongoClient([u'XXXX-MNGO03664:49029', u'XXXX-MNGO03663:49029'])
This one would be the right way to go with PyMongo > 3, from what I have read. Unfortunately I am stuck with PyMongo 2.6.3, and I can only assume that this is why it doesn't read from secondary.
After a bit of digging, I found out about ReplicaSetConnection (deprecated since PyMongo 2.4) and MongoReplicaSetClient (see e.g. pymongo replication secondary readreference not work and pymongo: Advantage of using MongoReplicaSetClient?), but these don't seem to work for me either, though for different reasons.
>>> print MongoReplicaSetClient(host=HOST, port=PORT, replicaset="rs02", readPreference=ReadPreference.SECONDARY)
MongoReplicaSetClient([])
The client doesn't seem to be able to see the members of the replicaset…
And of course when I start reading from this connection, it doesn't work.
>>> myConn = MongoReplicaSetClient(host=HOST, port=PORT, replicaset="rs02", readPreference=ReadPreference.SECONDARY)
>>> print myConn.someDB.someCollection.count()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 759, in count
return self.find().count()
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 640, in count
**command)
File "/usr/local/lib/python2.7/dist-packages/pymongo/database.py", line 391, in command
result = self["$cmd"].find_one(command, **extra_opts)
File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 604, in find_one
for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 904, in next
if len(self.__data) or self._refresh():
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 848, in _refresh
self.__uuid_subtype))
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 782, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_replica_set_client.py", line 1631, in _send_message_with_response
raise AutoReconnect(msg, errors)
pymongo.errors.AutoReconnect: No replica set secondary available for query with ReadPreference SECONDARY
Note that the same error appears when I try to read with ReadPreference.PRIMARY.
The weird thing here is that if I change the name of the replica set to connect to, the client spots that it doesn't exist:
>>> print MongoReplicaSetClient(host=HOST, port=PORT, replicaset="rs42", readPreference=ReadPreference.SECONDARY)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_replica_set_client.py", line 742, in __init__
self.refresh()
File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_replica_set_client.py", line 1135, in refresh
% (host, port, self.__name))
pymongo.errors.ConfigurationError: 10.0.0.51:49029 is not a member of replica set rs42
So I assume that in normal cases, it has a way to see that there is a replicaset here, and who are its members.