HIVE not connecting with Python?

HIVE not connecting with Python? - python

I have installed Hadoop and HIVE on windows 10 by following tutorials,
https://exitcondition.com/install-hadoop-windows/ & https://www.youtube.com/watch?v=npyRXkMhrgk respectively.
Both Hadoop and HIVE are running on my machine, I have been able to put files in HDFS and run queries in HIVE, but when I try to connect HIVE with python it gives different errors. Such as
from pyhive import hive
hive.Connection(host='localhost',port=10000,auth='NOSASL')
it gives following error:
TTransportException: TSocket read 0 bytes
I have tried impala as well but it did not work.
How can I connect python with hive, is it possible on windows 10 or should I shift to linux?

Pyhive had issues with auth = NOSASL in past.. not sure whether it got fixed .
Try hdfs3 python lib
conda install hdfs3
from hdfs3 import HDFileSystem
hdfs=HDFileSystem(host='localhost',port=9000)
More info available here..
https://medium.com/#arush.xtremelife/connecting-hadoop-hdfs-with-python-267234bb68a2

Related

unable to connect to Hive through Python

I have installed Hive on a Debian system using the steps provided in the link:
https://phoenixnap.com/kb/install-hive-on-ubuntu
I have followed all the steps and able to create a database and table in Hive. However the property hive.metastore.uris is not set in hive-site.xml. When I try to connect to Hive using pyhive module in python I get this error:
thrift.transport.TTransport.TTransportException: Could not connect to any of [('172.16.0.125', 10000)]

How to locate Oracle Instant Client in Python in Docker CI/CD pipeline

I am trying to connect to oracle DB to execute some SQL queries and fetch data through a python script . I have imported cx_Oracle and tried connecting.I got the error as - Exception - DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory". See https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html for help was raised.
I downloaded instaclient and used that in my script and it worked using the below commands :
LOCATION = r"C:\instantclient_19_5"
os.environ["PATH"] = LOCATION + ";" + os.environ["PATH"]
But now I need to use this in CI CD pipeline. I have created a docker image for instaclient and python I am trying use this into my script. But I am not sure how to use add instaclient location in script (like the above code snippet) Could you please help me with this.

If you were deploying to Windows (or macOS), you could have used the new cx_Oracle 8 init_oracle_client() function, which is preferred to fiddling with PATH.
However it seems you are deploying to Linux, meaning the system library search path needs to contain all library directories before the process start. So you need to use ldconfig or set LD_LIBRARY_PATH as traditionally used on Linux. The cx_Oracle doc Installing cx_Oracle on Linux and Locating the Oracle Client Libraries covers this.
Also see sample docker images for Linux developers and for basic Oracle Instant Client. If you are not on an RPM based system, check out the sample Docker files in Docker for Oracle Database Applications in Node.js and Python.
In summary, download Oracle Instant Client Basic or Basic Light packages from here. If you got the ZIP files then run something like:
RUN echo /opt/oracle/instantclient_19_8 > /etc/ld.so.conf.d/oic.conf && \
ldconfig
The details vary with what base Docker image you are using, and whether you want to install Instant Client from ZIP files or RPMs.

error trying to access IBM DB2 using python

I tried connecting my python server with IBM DB2 and got this error. I searched online, tried many things and nothing could fix it. I couldn't find the db2dsdriver.config anywhere in the IBM folder. For context, I'm trying to access an online IBM db using a python flask server that I'm running locally
File "server.py", line 8, in <module>
conn = ibm_db.connect("BLUDB","MyDB2loginhere","psswd")
Exception: [IBM][CLI Driver] SQL1531N The connection failed because the name specified with the DSN connection string keyword could not be found in either the db2dsdriver.cfg configuration file or the db2cli.ini configuration file. Data s SQLCODE=-1531cified in the connection string: "BLUDB"

Answered by comments.
By default when you install python module ibm_db it delivers a small ODBC driver from IBM called clidriver for connecting to local or remote Db2 databases. Other ODBC drivers are available, including from other suppliers different from IBM. With an IBM supplied driver (including clidriver), an additional license may be required only for accessing Db2-for-Z/OS, or Db2 for i series. Specifically for i-series, it is cheaper to use the ODBC driver option that is available via IBM product called IBM i access (cheaper than buying a license for Db2-connect to use with clidriver).
To connect to a database, you either use an externally configured DSN (data source name), or use a connection string in your code.
Externally configured DSNs mean that your code does not hard-code any connectivity details, so it is easier to move between different environments without changing code, because the configuration is all external data. You can also arrange the connection-strings to be external.
To configure an external DSN, either use local tools or edit the db2dsdriver.cfg file, which is a small XML file you can create and edit manually or via db2cli command lines. Local tools can be odbcad32 on MS-Windows or unixODBC on Linux/Unix/MacOS. The IBM published documentation in the Db2 Knowledge Center gives all the details of db2dsdriver.cfg.
To use a connection-string, you need the target database hostname (server name, or cloud service name), the port-number, the database-name, and a userid+password (or personal-certificate for Db2-for-Z/OS). The format of the connection string is a semi-colon separated/terminated collection of X=Y pairs, like this:
conn_string="Server=hostname_or_ip_address_of_the_Db2_server_Or_Cloud_Service;Port=50000;Database=BLUDB;UID=***;PWD=***;"
Many other additional and optional settings can be set in that connection string to control other aspects of the connection, or to specify encryption. These are connection attributes or connection properties and are all documented in the Db2 knowledge centre.
The python code to use the connection string looks like:
conn = ibm_db.connect( conn_string , "", "" )
or
conn = ibm_db.connect( conn_string, user, password)
if you exclude the UID and PWD from the connection string and supply then as the second and third parameters to the ibm_db.connect() .

How I managed to do in 2021. What you will need:
Python 3.7
PipEnv
Ibm-db
Ibm-db version is not important but this lib only works with Python 3.7 (current python version is 3.9).
Install Python 3.7.6 in your machine (this is the version that worked).
In your IDE create a new python file.
Let's create a Virtual Enviroment to make sure we will use Python 3.7
pip install pipenv
After installing
pipenv install --python 3.7
Activate the Virtual Environment
pipenv shell
You can use pip list to verify if you are in the new Virtual Enviroment - if list only shows 3 or 4 libs, it's because you are
Now you can download Ibm_db
pip install ibm-db
You may add this to your code to confirm what is the version you are using
from platform import python_version
print(python_version())
Now accessing the DB2
import ibm_db_dbi as db
# Connect to DB2B1 (keep Protocol as TCPIP)
conn = db.connect("DATABASE=DBNAME;HOSTNAME=hostname;PORT=port;PROTOCOL=TCPIP;UID=Your User;PWD=Your Password;", "", "")
Checking all tables available
for t in conn.tables():
print(t)
Your SQL code:
sql_for_df = """SELECT *
FROM TABLE
WHERE ..."""
Visualizing as DataFrame
First install pandas as it will not be present in your Virtual Environment
pip install pandas
After that import to your code and play around
import pandas as pd
df = pd.read_sql(sql_for_df, conn)
df.head()
To exit the VIrtual Enviroment just write exit in your terminal. If you want to remove the Virtual Enviroment write in the terminal pipenv --rm
That's pretty much all I could learn so far. I hope it helps you all.

How to query a remote snappydata server from Python

I am trying to query snappydata from Python and some of the answers say in StackOverflow that Python cant connect to remote spark clusters. Could anyone help me how can I connect to snappydata cluster and get a simple query working?
Code I am trying -
from pyspark.sql.snappy import SnappySession
snappy = SnappySession.builder.appName("test") \
.master("local[*]") \
.config("spark.snappydata.connection", "<remote server>:1527") \
.getOrCreate()
I am getting FileNotFoundError: [WinError 2] The system cannot find the file specified. In running above code. Unfortunately, there is not much information in setting up the environment. However, I have configured my environment to run PySpark locally and it works.

SnappyData's Python API is not distributed as a Python module that you can use from any spark cluster. However, you can use PySpark that is bundled as part of the SnappyData distribution.

How to run Hive commands on hadoop using python

all, I am trying to use Python to run Hive commands on the Hadoop Edge node. I went through all related questions on this website. But I still cannot solve it.
Currently, I can run hive after I ssh the server using terminal:
ssh user-admin#abc-hadoop-edge01.endor.lan
Then type hive, I am able to run hive commands.
However, I cannot run hive using python. I use pyhs2. My codes are as follows:
import pyhs2
conn = pyhs2.connect('abc-hadoop-edge01.endor.lan', port = 10000, user = 'user-admin', password = '125438a', database='default')
cursor = conn.cursor()
conn.close()
The error is: TTransportException: TTranspo...:10000
Is there anyone who knows how to solve this ?
BTW: I use Mac.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.