Pandas DataFrame to Impala Table SSL Error - python

I am trying to connect to the Impala shell from Cloudera Data Science Workbench so that I can create tables from Pandas DataFrames, following this blog post:
https://netlify--tdhopper.netlify.app/blog/creating-impala-tables-from-pandas-dataframes/
I get an SSL error. Can anyone help me figure out what is missing?
import os
import ibis
hdfs_host = 'xxxxx.xxxxx.com'
hdfs_port = xxxxx
impala_host = 'xxxxxx.xxxxxx.com'
impala_port = xxxxxxx
hdfs = ibis.impala.hdfs_connect(host=hdfs_host, port=hdfs_port)
client = ibis.impala.connect(host=impala_host, port=impala_port, hdfs_client=hdfs, auth_mechanism='GSSAPI', use_ssl=True)
Error output
failed to initialize SSL
Traceback (most recent call last):
File "/home/cdsw/.local/lib/python3.9/site-packages/ibis/backends/impala/client.py", line 113, in _get_cursor
cursor = self.connection_pool.popleft()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/thrift/transport/TSSLSocket.py", line 281, in _do_open
return self._wrap_socket(plain_sock)
File "/usr/local/lib/python3.9/site-packages/thrift/transport/TSSLSocket.py", line 181, in _wrap_socket
self.ssl_context.load_verify_locations(self.ca_certs)
FileNotFoundError: [Errno 2] No such file or directory
....
TTransportException: failed to initialize SSL
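The FileNotFoundError is raised when Thrift calls ssl_context.load_verify_locations(self.ca_certs) with a CA-bundle path that does not exist, so the client has no certificate to verify the Impala endpoint with. Below is a hedged sketch of the same connection with the CA certificate passed explicitly via ca_cert, which should end up as the CA bundle the Thrift SSL socket loads; the PEM path is a placeholder for wherever your cluster's CA bundle actually lives:

import ibis

# Placeholder path: point this at the CA bundle that signed the Impala
# daemon's TLS certificate (your cluster admin or CDSW environment should
# provide it).
ca_cert_path = '/path/to/ca_certificate.pem'

hdfs = ibis.impala.hdfs_connect(host=hdfs_host, port=hdfs_port)
client = ibis.impala.connect(
    host=impala_host,
    port=impala_port,
    hdfs_client=hdfs,
    auth_mechanism='GSSAPI',
    use_ssl=True,
    ca_cert=ca_cert_path,  # CA bundle used to verify the Impala TLS endpoint
)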

Related

How to integrate and mock Redshift and S3 locally using redshift-fake-driver

I would like to run Redshift and S3 locally and use them for tasks run from Airflow and other tools, to reduce CI/CD work when deploying to dev and to avoid conflicts over shared resources and files.
I can currently use LocalStack's S3, but for Redshift the only solution I have found is combining redshift-fake-driver with the JayDeBeApi package in Python, and it does not seem to work properly.
import jpype  # JPype1==1.4.1
import jaydebeapi  # JayDeBeApi==1.2.3

jars = "/Users/trancongminh/Downloads/jars/*"
jpype.startJVM(classpath=jars)

driverName = "jp.ne.opt.redshiftfake.postgres.FakePostgresqlDriver"
print(jpype.JClass(driverName))

# I spin up a Docker container for PostgreSQL
connectionString = "jdbc:postgresqlredshift://localhost:5432/docker"
uid = "docker"
pwd = "docker"
driverFileName = "/Users/trancongminh/Downloads/jars/redshift-fake-driver_2.12-1.0.15.jar"

conn = jaydebeapi.connect(
    jclassname=driverName,
    url=connectionString,
    driver_args={'user': uid, 'password': pwd},
    jars=driverFileName,
)
curs = conn.cursor()
curs.execute("SELECT * FROM pg_catalog.pg_tables limit 10;")
curs.fetchall()
curs.execute("copy db_table_name_v2 from 'http://localhost:4566/events-streaming/traveller/v2/ym_202210/d_04/hm_131901.parquet' CREDENTIALS 'aws_access_key_id=test;aws_secret_access_key=test' ")
But I get errors like "No such file or directory", or something like this:
Traceback (most recent call last):
File "FakeConnection.scala", line 31, in jp.ne.opt.redshiftfake.FakeConnection.prepareStatement
Exception: Java Exception
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/trancongminh/Pelago/pelago-ds-env/lib/python3.9/site-packages/jaydebeapi/__init__.py", line 531, in execute
self._prep = self._connection.jconn.prepareStatement(operation)
java.lang.NoSuchMethodError: java.lang.NoSuchMethodError: 'void scala.util.parsing.combinator.Parsers.$init$(scala.util.parsing.combinator.Parsers)'
or sometimes like this:
Traceback (most recent call last):
File "FakePreparedStatement.scala", line 138, in jp.ne.opt.redshiftfake.FakePreparedStatement$FakeAsIsPreparedStatement.execute
Exception: Java Exception
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/trancongminh/Pelago/pelago-ds-env/lib/python3.9/site-packages/jaydebeapi/__init__.py", line 534, in execute
is_rs = self._prep.execute()
org.postgresql.util.PSQLException: org.postgresql.util.PSQLException: ERROR: could not open file "s3://events-streaming/traveller/v2/ym_202210/d_04/hm_131901.parquet" for reading: No such file or directory
Hint: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/trancongminh/Pelago/pelago-ds-env/lib/python3.9/site-packages/jaydebeapi/__init__.py", line 536, in execute
_handle_sql_exception()
File "/Users/trancongminh/Pelago/pelago-ds-env/lib/python3.9/site-packages/jaydebeapi/__init__.py", line 165, in _handle_sql_exception_jpype
reraise(exc_type, exc_info[1], exc_info[2])
File "/Users/trancongminh/Pelago/pelago-ds-env/lib/python3.9/site-packages/jaydebeapi/__init__.py", line 57, in reraise
raise value.with_traceback(tb)
File "/Users/trancongminh/Pelago/pelago-ds-env/lib/python3.9/site-packages/jaydebeapi/__init__.py", line 534, in execute
is_rs = self._prep.execute()
jaydebeapi.DatabaseError: org.postgresql.util.PSQLException: ERROR: could not open file "s3://events-streaming/traveller/v2/ym_202210/d_04/hm_131901.parquet" for reading: No such file or directory
Hint: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy
If anybody has experience with this pattern, please help; thanks.
Any solutions or keywords that would help further investigation are appreciated.
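For what it's worth, the NoSuchMethodError on scala.util.parsing.combinator.Parsers.$init$ usually points at a Scala runtime mismatch: redshift-fake-driver_2.12 needs the Scala 2.12 library plus scala-parser-combinators on the same classpath. Here is a sketch of starting the JVM with those jars listed explicitly; the exact file names and versions below are assumptions, so use whatever matches your driver build:

import jpype  # JPype1==1.4.1

# Jar names/versions here are assumptions -- the point is that the Scala 2.12
# runtime, scala-parser-combinators, the plain PostgreSQL JDBC driver and the
# AWS S3 SDK all need to sit next to the fake driver on the classpath.
jars = [
    "/Users/trancongminh/Downloads/jars/redshift-fake-driver_2.12-1.0.15.jar",
    "/Users/trancongminh/Downloads/jars/scala-library-2.12.17.jar",
    "/Users/trancongminh/Downloads/jars/scala-parser-combinators_2.12-1.1.2.jar",
    "/Users/trancongminh/Downloads/jars/postgresql-42.5.1.jar",
    "/Users/trancongminh/Downloads/jars/aws-java-sdk-s3-1.12.408.jar",
]
jpype.startJVM(classpath=jars)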

How to fix the error 'TypeError: can't pickle time objects'?

I am using the OpenOPC library to read data from an OPC server ('Matrikon OPC Simulation Server'). When I try to read the data, it gives me the following error:
TypeError: can't pickle time objects
The code I use is the following, I run it from the python console.
CODE:
import OpenOPC
opc = OpenOPC.client()
opc.connect('Matrikon.OPC.Simulation')
opc.read('Random.Int4')
The error appears when I run the line opc.read('Random.Int4').
This is how the variable appears in my MatrikonOPC Explorer:
This is the complete error:
Traceback (most recent call last):
File "C:\Python27\Lib\multiprocessing\queues.py", line 264, in _feed
send(obj)
TypeError: can't pickle time objects
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\User\PycharmProjects\OPC2\venv\lib\site-packages\OpenOPC.py", line 625, in read
return list(results)
File "C:\Users\User\PycharmProjects\OPC2\venv\lib\site-packages\OpenOPC.py", line 543, in iread
raise TimeoutError('Callback: Timeout waiting for data')
TimeoutError: Callback: Timeout waiting for data
I solved this issue by adding sync=True when calling opc.read(). With the default asynchronous read, the values are handed back through a multiprocessing queue, and the pywin32 time objects in them apparently cannot be pickled, so the callback times out; sync=True reads the items directly and avoids that path.
CODE:
import OpenOPC
opc = OpenOPC.client()
opc.connect('Matrikon.OPC.Simulation')
opc.read('Random.Int4', sync=True)
Reference: mkwiatkowski/openopc
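For what it's worth, the same flag should also work when reading a list of tags in one call (untested sketch; the tag names are just the simulator's standard examples):

import OpenOPC

opc = OpenOPC.client()
opc.connect('Matrikon.OPC.Simulation')
# sync=True keeps the read on the synchronous path, avoiding the async
# callback/queue mechanism that hit the pickling error above.
for name, value, quality, timestamp in opc.read(['Random.Int4', 'Random.Real8'], sync=True):
    print(name, value, quality, timestamp)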

pySpark:ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

I am fairly new to pyspark.
While running the piece of code below in PyCharm I get the expected output I want.
But I also get the error below:
Traceback (most recent call last):
File "C:\Study\Spark\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1067, in start
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:49748)
Traceback (most recent call last):
File "C:\Study\Spark\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 929, in _get_connection
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Study\Spark\spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1067, in start
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
**Process finished with exit code 0**
As you can see from the last line, the process finished with exit code 0, and I also get my expected output.
Here is my code sample:
Python-3.7
Spark-2.4.5
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F
from pyspark.sql.functions import md5


def func(row):
    # Convert the Row to a dict, add a pipe-separated concatenation of all
    # columns, and rebuild the Row.
    temp = row.asDict()
    temp["concat_val"] = "|".join([str(x) for x in row])
    put = Row(**temp)
    return put


if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("PythonWordCount")\
        .getOrCreate()

    data1 = spark.createDataFrame(
        [
            ("1", 'foo'),
            ("2", 'bar'),
        ],
        ['id', 'txt'],
    )

    row_rdd = data1.rdd.map(func)
    print(row_rdd.collect())

    concat_df = row_rdd.toDF()
    hash_df = concat_df.withColumn("hash_id", md5(F.col("concat_val")))
    hash_df.show()
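As far as I can tell, this ConnectionRefusedError appears while the interpreter is shutting down and the py4j gateway has already gone away, which is why the run still ends with exit code 0 and the correct output. A small addition that is commonly suggested for this symptom, placed at the end of the main block above:

    # Stop the session explicitly after the last action so the py4j gateway
    # shuts down in an orderly way instead of during interpreter teardown.
    spark.stop()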

Why does a standalone application use a socket connection in pyspark?

If I am using Spark in a standalone application, I don't think I need a connection to a server (of course). So why am I getting this network error message?
[ERROR] Error while sending or receiving.
Traceback (most recent call last):
File "/Users/chlee021690/anaconda/lib/python2.7/site-packages/py4j/java_gateway.py",
line 473, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/Users/chlee021690/anaconda/python.app/Contents/lib/python2.7/socket.py", line 430, in readline
data = recv(1)
timeout: timed out
....
Py4JNetworkError: An error occurred while trying to connect to the Java server
My code is as follows:
import numpy as np

from pyspark import SparkContext
import pyspark.mllib.recommendation as spark_rec

filename = "./yahoo music/train_0.txt"
sc = SparkContext('local', 'spark_rec')

# This part was successful, but the following lines were failures
aData = sc.textFile(filename).cache()

ratings = aData.map(lambda line: np.array([float(x) for x in line.split('\t')]))
rank = 10
numIterations = 20
aModel = spark_rec.ALS.train(ratings, rank, numIterations)
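When py4j reports a timeout or network error partway through a job on a local master, it often means the backing JVM died mid-computation, and with ALS on a sizeable ratings file memory is the usual suspect. Here is a sketch of giving the local driver more memory up front; the 4g figure is an arbitrary assumption to size against your machine:

from pyspark import SparkConf, SparkContext

# The memory setting is an assumption -- adjust it to the machine and dataset.
conf = (SparkConf()
        .setMaster('local[*]')
        .setAppName('spark_rec')
        .set('spark.driver.memory', '4g'))
sc = SparkContext(conf=conf)

Note that depending on the Spark version, driver memory may need to be set before the JVM starts (for example via spark-submit --driver-memory) rather than in SparkConf.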

How to use a MongoHQ database in an OpenShift app

At first I tried:
import pymongo
MONGOHQ_URL = "mongodb://username:password@kahana.mongohq.com:10025/dbname"
conn = pymongo.MongoClient(MONGOHQ_URL)
but apparently it failed and it throws the following error:
Traceback (most recent call last):
File "bot.py", line 95, in <module>
conn = pymongo.Connection(MONGOHQ_URL)
File "/var/lib/openshift/53abb500028e/python/virtenv/lib/python2.7/site-packages/pymongo/connection.py", line 236, in __init__
max_pool_size, document_class, tz_aware, _connect, **kwargs)
File "/var/lib/openshift/53abb500028e/python/virtenv/lib/python2.7/site-packages/pymongo/mongo_client.py", line 369, in __init__
raise ConnectionFailure(str(e))
pymongo.errors.ConnectionFailure: [Errno 13] Permission denied
They have an example, but it's in Ruby and I can't follow it. If I am not wrong, the connection happens in that script.
Can anyone make sense of the Ruby code and help me do the same in my Python script with pymongo? I have already set the environment variable MONGO_URL.
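Since MONGO_URL is already exported on the gear, one way to avoid hard-coding credentials while testing the connection is to read it from the environment. A minimal sketch, assuming the variable holds the full mongodb:// URI including the database name:

import os
import pymongo

# MONGO_URL is the environment variable mentioned above; its exact name and
# contents are whatever was configured on the OpenShift gear.
mongo_url = os.environ['MONGO_URL']
conn = pymongo.MongoClient(mongo_url)
db = conn.get_default_database()  # picks up the dbname embedded in the URI (pymongo >= 2.6)
print(db.collection_names())

Note that this does not by itself explain the Errno 13, which on OpenShift gears often comes from outbound connections to non-standard ports being blocked.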
