I'm trying to connect to a Hive server with PyHive.
So far, I have this:
from pyhive import hive
import pandas as pd
# Create Hive connection
conn = hive.Connection(host="*********", port=10000, auth='NONE')
df = pd.read_sql("select max_temperature_f from `201402_weather_data` LIMIT 10", conn)
print(df.head())
In my hive-site.xml configuration I have this:
<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
  <final>false</final>
  <source>programmatically</source>
  <source>org.apache.hadoop.hive.conf.LoopingByteArrayInputStream#56f71edb</source>
</property>
and
<property>
  <name>hive.server2.transport.mode</name>
  <value>binary</value>
  <final>false</final>
  <source>programmatically</source>
  <source>org.apache.hadoop.hive.conf.LoopingByteArrayInputStream#56f71edb</source>
</property>
From what I read, these are the correct settings with which I could connect to the hive server (assuming that I want NONE for authentication).
Executing the script I get an error:
Traceback (most recent call last):
File "D:/****/hive_connector.py", line 8, in <module>
auth='NONE')
File "C:\Users\***\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyhive\hive.py", line 192, in __init__
self._transport.open()
File "C:\Users\***\AppData\Local\Programs\Python\Python37-32\lib\site-packages\thrift_sasl\__init__.py", line 85, in open
message=("Could not start SASL: %s" % self.sasl.getError()))
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'
I'm using Windows, so I had to manually download and install SASL (sasl-0.2.1-cp37-cp37m-win32.whl).
I'm using:
PyHive - 0.6.3,
sasl - 0.2.1,
thrift - 0.13.0,
thrift-sasl - 0.4.2,
thriftpy2 - 0.4.11 (not sure where that came from)
I've seen a lot of similar questions and tried several things, but I'm not able to run the script successfully. Can you point me to the correct solution? Is it the sasl package that is causing the problems?
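In case it helps narrow things down, here is the NOSASL variant I plan to try next. This is only a sketch and assumes hive.server2.authentication is switched to NOSASL on the server side; with auth='NOSASL', PyHive should bypass the SASL transport entirely:
from pyhive import hive
import pandas as pd

# Sketch only: assumes hive.server2.authentication=NOSASL in hive-site.xml
conn = hive.Connection(host="*********", port=10000, auth='NOSASL')
df = pd.read_sql("select max_temperature_f from `201402_weather_data` LIMIT 10", conn)
print(df.head())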
I have a small Python 3 script like this:
import speedtest
# Speedtest
test = speedtest.Speedtest() # <--- line 4
test.get_servers()
best = test.get_best_server()
print(f"Found: {best['host']} located in {best['country']}")
The first time I run it, it works and everything is fine; it outputs:
Found: speedtest.witcom.cloud:8080 located in Germany
Happy days.
The second time (and subsequent times) that I run the script, I get this error:
Traceback (most recent call last):
File "/Users/zeth/Code/pinger/pinger.py", line 4, in <module>
test = speedtest.Speedtest()
File "/usr/local/lib/python3.9/site-packages/speedtest.py", line 1095, in __init__
self.get_config()
File "/usr/local/lib/python3.9/site-packages/speedtest.py", line 1127, in get_config
raise ConfigRetrievalError(e)
speedtest.ConfigRetrievalError: HTTP Error 403: Forbidden
When Googling around, I saw that I could also call this module straight from the command line, but just running this:
$ speedtest-cli
That gives me the same kind of error:
Retrieving speedtest.net configuration...
Cannot retrieve speedtest configuration
ERROR: HTTP Error 403: Forbidden
But if I run the CLI command directly, speedtest-cli --secure (docs for the --secure flag), then it goes through and outputs this:
Retrieving speedtest.net configuration...
Testing from Deutsche Telekom AG (212.185.228.168)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by hotspot.koeln (Cologne) [3.44 km]: 28.805 ms
Testing download speed................................................................................
Download: 30.01 Mbit/s
Testing upload speed......................................................................................................
Upload: 8.68 Mbit/s
The question
I can't figure out how to change this Python line, test = speedtest.Speedtest(), to use the --secure flag (i.e. to go via HTTPS).
The documentation for speedtest-cli is scarce.
Other attempts
I found this solution here: Python Speedtest facing problems with certification _ssl.c:1056, which suggests manually approving the certificates.
But in that directory, /Volumes/Macintosh HD/Applications/, I don't have anything called Python 3.9. I have python3.9 installed via Homebrew, and I'm on a Mac.
I could do this:
test = speedtest.Speedtest(secure=True)
I looked into the source code myself, opening this file:
vim /usr/local/lib/python3.9/site-packages/speedtest.py
where I could see the class is defined like this:
class Speedtest(object):
    """Class for performing standard speedtest.net testing operations"""

    def __init__(self, config=None, source_address=None, timeout=10,
                 secure=False, shutdown_event=None):
        self.config = {}
        self._source_address = source_address
        self._timeout = timeout
        self._opener = build_opener(source_address, timeout)
        self._secure = secure
        ...
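Putting it together, this is a minimal sketch of my original script with only the secure=True argument added; everything else is unchanged:
import speedtest

# Same script as above, but asking the module to use the HTTPS endpoints
test = speedtest.Speedtest(secure=True)
test.get_servers()
best = test.get_best_server()
print(f"Found: {best['host']} located in {best['country']}")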
I am running the following Dataflow config:
test_dataflow = BeamRunPythonPipelineOperator(
    task_id="xxxx",
    runner="DataflowRunner",
    py_file=xxxxx,
    pipeline_options=dataflow_options,
    py_requirements=['apache-beam[gcp]==2.39.0'],
    py_interpreter='python3',
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}",
        location=LOCATION,
        project_id=PROJECT,
        wait_until_finished=False,
        gcp_conn_id="google_cloud_default",
    ),
    # dataflow_config={"job_name": "{{task.task_id}}", "location": LOCATION, "project_id": PROJECT, "wait_until_finished": True, "gcp_conn_id": "google_cloud_default"}
)
It keeps throwing an error. I'm on Airflow 2.2.5.
Error - Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 287, in execute
) = self._init_pipeline_options(format_pipeline_options=True, job_name_variable_key="job_name")
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 183, in _init_pipeline_options
dataflow_job_name, pipeline_options, process_line_callback = self._set_dataflow(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 63, in _set_dataflow
pipeline_options = self.__get_dataflow_pipeline_options(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 92, in __get_dataflow_pipeline_options
if self.dataflow_config.service_account:
AttributeError: 'DataflowConfiguration' object has no attribute 'service_account'
If I pass service_account, it errors out saying the parameter is invalid.
I ran into the same issue.
This is because of an inconsistency between the DataflowConfiguration shipped with your provider version and the one the Beam operator expects: that version of DataflowConfiguration doesn't accept service_account.
I resolved my issue by upgrading Composer in place, so it picks up the latest Dataflow-related provider package, where this has been fixed.
The service_account attribute has been added in this commit https://github.com/apache/airflow/commit/de65a5cc5acaa1fc87ae8f65d367e101034294a6
If you can't upgrade Composer, try updating the Google providers package (apache-airflow-providers-google) to the latest version, or at least version 7.0.
You can check the commit in the commit log and identify the minimum version here - https://airflow.apache.org/docs/apache-airflow-providers-google/stable/commits.html#id6
Even though Composer uses its own fork, the OSS package should work. You can see the list of packages in the Composer version list (https://cloud.google.com/composer/docs/concepts/versioning/composer-versions); it says apache-airflow-providers-google==2022.5.18+composer instead of apache-airflow-providers-google==7.0.
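For illustration, once the provider contains that commit, DataflowConfiguration also accepts a service_account argument. The sketch below just mirrors the configuration from the question; the service account email is a placeholder and the parameter is optional:
dataflow_config=DataflowConfiguration(
    job_name="{{task.task_id}}",
    location=LOCATION,
    project_id=PROJECT,
    wait_until_finished=False,
    gcp_conn_id="google_cloud_default",
    # hypothetical value, only accepted once the provider includes the fix above
    service_account="my-dataflow-sa@my-project.iam.gserviceaccount.com",
)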
We created a Python shell job which connects to Redshift and fetches data; the program below works fine on my local system.
Below are the program and the steps.
Program:-
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<<
DATABASE = "#####"
USER = "#####"
PASSWORD = "#####"
HOST = "#####.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "test" #default is "public"
####### connection and session creation ##############
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
###### All Set Session created using provided schema #######
################ write queries from here ######################
query = "SELECT * FROM test1 limit 2;"
rr = s.execute(query)
all_results = rr.fetchall()
def pretty(all_results):
    for row in all_results:
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row:
            print(" ----", r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")
pretty(all_results)
########## close session in the end ###############
s.close()
Steps:-
sudo pip install psycopg2
sudo pip install sqlalchemy
sudo pip install sqlalchemy-redshift
I have uploaded the files psycopg2-2.8.4-cp27-cp27m-win32.whl, Flask_SQLAlchemy-2.4.1-py2.py3-none-any.whl and sqlalchemy_redshift-0.7.5-py2.py3-none-any.whl to S3 (s3://####/lib/), and mapped that folder in the Python library path of the AWS Glue job.
When I run the program, the following error occurs.
Traceback (most recent call last):
File "/tmp/runscript.py", line 113, in <module>
download_and_install(args.extra_py_files)
File "/tmp/runscript.py", line 56, in download_and_install
download_from_s3(s3_file_path, local_file_path)
File "/tmp/runscript.py", line 81, in download_from_s3
s3.download_file(bucket_name, s3_key, new_file_path)
File "/usr/local/lib/python2.7/site-packages/boto3/s3/inject.py", line 172, in download_file
extra_args=ExtraArgs, callback=Callback)
File "/usr/local/lib/python2.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
future.result()
File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 106, in result
return self._coordinator.result()
File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 265, in result
raise self._exception
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
PS:- The Glue Job Role has full access to S3.
Please suggest how to make those libraries available to the program.
You can specify your own Python libraries packaged as an .egg or a .whl file under the "--extra-py-files" flag, as shown in the example below.
Command line example:
aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" : "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}'
--connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'
Reference: Create a Glue job with an extra Python library
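If you would rather do the same from Python instead of the AWS CLI, a rough boto3 equivalent of the command above looks like this (the bucket, role, connection and file names are the placeholders from the example):
import boto3

glue = boto3.client("glue")

# Create a python shell job and point --extra-py-files at the packaged library in S3
glue.create_job(
    Name="python-redshift-test-cli",
    Role="role",
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://MyBucket/python/library/redshift_test.py",
    },
    Connections={"Connections": ["connection-name"]},
    DefaultArguments={
        "--extra-py-files": "s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg"
    },
)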
There is a simple way to import Python dependencies using .whl files, which can be found on PyPI for the particular module.
You can also add multiple wheel files from S3, separated by commas.
For example:
"s3://xxxxxxxxx/common/glue/glue_whl/fastparquet-0.4.1-cp37-cp37m-macosx_10_9_x86_64.whl,s3://xxxxxx/common/glue/glue_whl/packaging-20.4-py2.py3-none-any.whl,s3://xxxxxx/common/glue/glue_whl/s3fs-0.5.0-py3-none-any.whl"
I want to convert a >1mn record MySQL database into a graph database, because it is heavily linked network-type data. The free version of Neo4J had some restrictions I thought I might bump up against, so I've installed OrientDB (Community 2.2.0) (on Ubuntu Server 16.04) and got it working. Now I need to access it from Python (3.5.1+), so I'm trying pyorient (1.5.2). (I tried TinkerPop since I eventually want to use Gremlin, and couldn't get the gremlin console to talk to the OrientDB.)
The following simple Python code, to connect to one of the test graphs in OrientDB:
import pyorient
username="user"
password="password"
client = pyorient.OrientDB("localhost", 2424)
session_id = client.connect( username, password )
print("SessionID=",session_id)
db_name="GratefulDeadConcerts"
if client.db_exists( db_name, pyorient.STORAGE_TYPE_MEMORY ):
    print("Database",db_name,"exists")
    client.db_open( db_name, username, password )
else:
    print("Database",db_name,"doesn't exist")
gives a weird error:
SessionID= 27
Database GratefulDeadConcerts exists
Traceback (most recent call last):
File "FirstTest.py", line 18, in <module>
client.db_open( db_name, username, password )
File "/home/tom/MyProgs/TestingPyOrient/env/lib/python3.5/site-packages/pyorient/orient.py", line 379, in db_open
.prepare((db_name, user, password, db_type, client_id)).send().fetch_response()
File "/home/tom/MyProgs/TestingPyOrient/env/lib/python3.5/site-packages/pyorient/messages/database.py", line 141, in fetch_response
info = OrientVersion(release)
File "/home/tom/MyProgs/TestingPyOrient/env/lib/python3.5/site-packages/pyorient/otypes.py", line 202, in __init__
self._parse_version(release)
File "/home/tom/MyProgs/TestingPyOrient/env/lib/python3.5/site-packages/pyorient/otypes.py", line 235, in _parse_version
self.build = int( self.build )
ValueError: invalid literal for int() with base 10: '0 (build develop#r79d281140b01c0bc3b566a46a64f1573cb359783; 2016'
Does anyone know what that is or how I can fix it? Should I really be using TinkerPop instead? If so, I'll post a separate question about my struggles with that.
I got the error at first too, but after upgrading pyorient to the latest version, 1.5.4, I get no errors.
$ python test.py
('SessionID=', 6)
('Database', 'GratefulDeadConcerts', 'exists')
$ python --version
Python 2.7.11
I'm trying to write an AWS Lambda Python Package that will connect to a FileMaker database over JDBC. To test, I've launched an EC2 instance with the Lambda Linux AMI, and created a virtualenv (/venv) that I'm testing in. I've uploaded the fmjdbc.jar to the instance using WinSCP to /venv/lib/fmjdbc.jar. The code uses JayDeBeApi, following the usage example here: https://pypi.python.org/pypi/JayDeBeApi/#usage
My code so far is the following:
import jaydebeapi as jdb
driverclass = 'com.filemaker.jdbc.Driver'
jdbcURL = 'jdbc:filemaker://url:port;database'
jar = '/home/ec2-user/lambda-test-project/venv/lib/fmjdbc.jar'
print jar
conn = jdb.connect(driverclass,[jdbcURL,'username','password'],jar)
Which gives me the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/lambda-test-project/venv/local/lib/python2.7/site-package s/jaydebeapi/__init__.py", line 359, in connect
jconn = _jdbc_connect(jclassname, jars, libs, *driver_args)
File "/home/ec2-user/lambda-test-project/venv/local/lib/python2.7/site-package s/jaydebeapi/__init__.py", line 183, in _jdbc_connect_jpype
return jpype.java.sql.DriverManager.getConnection(*driver_args)
jpype._jexception.SQLExceptionPyRaisable: java.sql.SQLException: No suitable driver found for jdbc:filemaker://<MY URL STUFF IS HERE>
How can I get the jdbc driver to be read by Python's virtual environment? I'd like to have this code work in a Lambda package eventually, so I'm hoping there's a solution that can be integrated to the Python code that will work repeatedly on newly created servers.
You can use the jpype package to set up the driver for Python. I used it for connecting to an Oracle DB before. Here is my sample code, which may be useful for you.
import jaydebeapi, jpype

classpath = "your jdbc jar driver path"
jvm_path = "/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.36.x86_64/jre/lib/amd64/server/libjvm.so"  # your Java VM path
jpype.startJVM(jvm_path, "-Djava.class.path=%s" % classpath)  # start the JVM with the driver on the classpath
conn = jaydebeapi.connect(xxxxxx)
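After the connection is established, the usual Python DB-API pattern applies. A minimal sketch, with a placeholder query:
# Hypothetical usage once conn is established (DB-API 2.0 style)
curs = conn.cursor()
curs.execute("SELECT * FROM some_table")  # placeholder query
rows = curs.fetchall()
print(rows)
curs.close()
conn.close()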