Why does a standalone application use a socket connection in PySpark?

If I am using Spark in a standalone application, I would not expect to need a connection to a server. So why am I getting this network error message?
[ERROR] Error while sending or receiving.
Traceback (most recent call last):
File "/Users/chlee021690/anaconda/lib/python2.7/site-packages/py4j/java_gateway.py", line 473, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/Users/chlee021690/anaconda/python.app/Contents/lib/python2.7/socket.py", line 430, in readline
data = recv(1)
timeout: timed out
....
Py4JNetworkError: An error occurred while trying to connect to the Java server
My code is as follows:
from pyspark import SparkContext
import numpy as np  # needed by the map() below; missing from the original snippet
import pyspark.mllib.recommendation as spark_rec
filename = "./yahoo music/train_0.txt"
sc = SparkContext('local', 'spark_rec')
# This part succeeded; the lines after it failed.
aData = sc.textFile(filename).cache()
# Parse each tab-separated line into an array of floats.
ratings = aData.map(lambda line: np.array([float(x) for x in line.split('\t')]))
rank = 10
numIterations = 20
aModel = spark_rec.ALS.train(ratings, rank, numIterations)
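
One likely explanation, offered as a sketch and not as a confirmed diagnosis: even with the 'local' master, PySpark's Python driver talks to a JVM through Py4J, and that bridge is a TCP socket on the loopback interface. That would explain why py4j/java_gateway.py and socket.py show up in the traceback although no remote server is involved. The stdlib-only example below imitates the pattern. The names serve_once, send_command and demo are hypothetical, and the one-line text protocol is made up for illustration; it is not Py4J's actual wire format.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one connection, read one line, answer it (stand-in for the JVM side)."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        command = reader.readline().strip()
        conn.sendall(("ok:" + command + "\n").encode("utf-8"))


def send_command(port, command, timeout=5.0):
    """Send one line to the loopback server and return its one-line answer."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        client.sendall((command + "\n").encode("utf-8"))
        with client.makefile("r", encoding="utf-8") as reader:
            # If the other side never answers, this read fails with a timeout.
            return reader.readline().strip()


def demo():
    """Start a one-shot loopback server and exchange a single command with it."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()
    try:
        return send_command(port, "ping")
    finally:
        worker.join()
        server_sock.close()
```

If the side that is supposed to answer stalls or dies, the client's read would raise a timeout, which is the same kind of failure as the "timeout: timed out" line in the traceback above.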

Related

Pandas DataFrame to Impala Table SSL Error

I am trying to connect to Impala so I can create tables from pandas DataFrames in Cloudera Data Science Workbench, based on this blog post:
https://netlify--tdhopper.netlify.app/blog/creating-impala-tables-from-pandas-dataframes/
I get an SSL error. Can anyone help me figure out what is missing?
import os
import ibis
hdfs_host = 'xxxxx.xxxxx.com'
hdfs_port = xxxxx
impala_host = 'xxxxxx.xxxxxx.com'
impala_port = xxxxxxx
hdfs = ibis.impala.hdfs_connect(host=hdfs_host, port=hdfs_port)
client = ibis.impala.connect(host=impala_host, port=impala_port, hdfs_client=hdfs, auth_mechanism='GSSAPI', use_ssl=True)
Error output
failed to initialize SSL
Traceback (most recent call last):
File "/home/cdsw/.local/lib/python3.9/site-packages/ibis/backends/impala/client.py", line 113, in _get_cursor
cursor = self.connection_pool.popleft()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/thrift/transport/TSSLSocket.py", line 281, in _do_open
return self._wrap_socket(plain_sock)
File "/usr/local/lib/python3.9/site-packages/thrift/transport/TSSLSocket.py", line 181, in _wrap_socket
self.ssl_context.load_verify_locations(self.ca_certs)
FileNotFoundError: [Errno 2] No such file or directory
....
TTransportException: failed to initialize SSL
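
Reading the traceback, the last frame before the failure is load_verify_locations(self.ca_certs) raising FileNotFoundError. That suggests (a guess from the traceback, not a confirmed diagnosis) that the CA certificate path handed to the SSL layer does not exist on this machine. A small stdlib-only sketch for checking which candidate CA files actually exist is below. The function names are hypothetical, and how the chosen path is then passed to ibis depends on the installed version (some versions of ibis.impala.connect accept a ca_cert argument, but that is an assumption to verify against your version).

```python
import os
import ssl


def first_existing_ca_file(candidates):
    """Return the first candidate path that is an existing file, else None."""
    for path in candidates:
        if path and os.path.isfile(path):
            return path
    return None


def default_ca_candidates():
    """Collect CA file locations that Python's ssl module knows about."""
    paths = ssl.get_default_verify_paths()
    return [
        os.environ.get("SSL_CERT_FILE"),  # explicit override, if set
        paths.cafile,                     # resolved default CA file (may be None)
        paths.openssl_cafile,             # OpenSSL's compiled-in default location
    ]


if __name__ == "__main__":
    found = first_existing_ca_file(default_ca_candidates())
    print("usable CA file:", found)
```

If this prints None, no default CA bundle is available and the path to the cluster's CA certificate would have to be supplied explicitly.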

How to fix the error 'TypeError: can't pickle time objects'?

I am using the OpenOPC library to read data from an OPC server (the Matrikon OPC Simulation Server). When I try to read the data, I get the following error:
TypeError: can't pickle time objects
The code I use is the following; I run it from the Python console.
CODE:
import OpenOPC
opc = OpenOPC.client()
opc.connect('Matrikon.OPC.Simulation')
opc.read('Random.Int4')
The error appears when I run the line opc.read('Random.Int4').
This is the complete error:
Traceback (most recent call last):
File "C:\Python27\Lib\multiprocessing\queues.py", line 264, in _feed
send(obj)
TypeError: can't pickle time objects
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\User\PycharmProjects\OPC2\venv\lib\site-packages\OpenOPC.py", line 625, in read
return list(results)
File "C:\Users\User\PycharmProjects\OPC2\venv\lib\site-packages\OpenOPC.py", line 543, in iread
raise TimeoutError('Callback: Timeout waiting for data')
TimeoutError: Callback: Timeout waiting for data

I solved this issue by adding sync=True when calling opc.read().
CODE:
import OpenOPC
opc = OpenOPC.client()
opc.connect('Matrikon.OPC.Simulation')
opc.read('Random.Int4', sync=True)
Reference: mkwiatkowski/openopc

Python Hive: thrift.transport.TTransport.TTransportException: None

Suppose Hive is installed on, say, the "g" (Gold) cluster, which I do not have access to. I'm doing my Python development work on the "s" cluster. I can access Hive from the "s" cluster and run queries.
I have the below code to connect to Hive from a Python script running in "s" cluster.
some_table is a table that already exists in Hive. I would like to execute a simple select * from some_table command to get some results.
import sys
sys.path.append("/usr/lib/hive/lib/py")
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
transport = TSocket.TSocket('what-ever-server', what-ever-port)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHive.Client(protocol)
transport.open()
print "connect success"
client.execute("SELECT * FROM some_table")
print client.fetchAll()
print "executed"
But I get the below error after "connect success" is printed. I am assuming that the connection was successful.
Traceback (most recent call last):
File "hiveConnect.py", line 30, in <module>
row = client.execute("SELECT * FROM some_table")
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 68, in execute
self.recv_execute()
File "/usr/lib/hive/lib/py/hive_service/ThriftHive.py", line 79, in recv_execute
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/usr/lib/hive/lib/py/thrift/protocol/TBinaryProtocol.py", line 137, in readMessageBegin
name = self.trans.readAll(sz)
File "/usr/lib/hive/lib/py/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz-have)
File "/usr/lib/hive/lib/py/thrift/transport/TTransport.py", line 155, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
File "/usr/lib/hive/lib/py/thrift/transport/TSocket.py", line 94, in read
raise TTransportException('TSocket read 0 bytes')
thrift.transport.TTransport.TTransportException: None
What am I doing wrong in this code? I am not experiencing any error while connecting to Hive using the server-name and port, so I'm assuming everything is fine there and that the connection to Hive is not the issue.

After a bit more research I found that the server was actually HiveServer2, listening on port 10000. I then had to install pyhs2 for the connection to work properly.
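
For anyone landing here, a minimal sketch of what the pyhs2 route might look like, following the usage shown in the pyhs2 README. The host, user, password and database values are placeholders, the PLAIN auth mechanism is an assumption that depends on how the server is configured, and fetch_rows is a hypothetical helper name.

```python
def fetch_rows(host, user, password, query, port=10000, database="default"):
    """Run one query against HiveServer2 through pyhs2 and return all rows."""
    # Imported lazily so this module still loads where pyhs2 is not installed.
    import pyhs2

    conn = pyhs2.connect(host=host, port=port, authMechanism="PLAIN",
                         user=user, password=password, database=database)
    try:
        cur = conn.cursor()
        try:
            cur.execute(query)
            return cur.fetch()  # list of rows
        finally:
            cur.close()
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder connection details; replace them with your own.
    for row in fetch_rows("what-ever-server", "user", "password",
                          "SELECT * FROM some_table"):
        print(row)
```

The old ThriftHive client speaks the HiveServer1 protocol, so pointing it at a HiveServer2 port would plausibly produce the "TSocket read 0 bytes" failure from the question.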

SUDS Exception Imported Schema Failed

I'm getting the error:
Exception: imported schema (http://www.w3.org/2001/XMLSchema) at (http://www.w3.org/2001/XMLSchema.xsd), failed
when passing a Doctor (constructed with ImportDoctor) to the suds Client constructor.
I'm working on two Windows machines, both with the same version of suds installed, but only one of them raises the error above.
Could someone explain why this error is raised, so I can figure out what's missing on the machine where it happens?
Thanks in advance!
UPDATE: I don't know if this is important, but the Windows machine that raises the error is an Amazon Web Services instance. On my local machine everything works fine.
UPDATE: Here's some code I ran in the Python interpreter on that machine. It shows in detail how the error is raised:
>>> from suds.client import Client
>>> from suds.xsd.doctor import ImportDoctor, Import
>>> missing_import = Import("http://www.w3.org/2001/XMLSchema")
>>> missing_import.filter.add("http://tempuri.org/")
>>> doctor = ImportDoctor(missing_import)
>>> client = Client("http://etcfulfill.ebooks.com/Fulfillment.asmx?wsdl")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "suds\client.py", line 112, in __init__
self.wsdl = reader.open(url)
File "suds\reader.py", line 152, in open
d = self.fn(url, self.options)
File "suds\wsdl.py", line 159, in __init__
self.build_schema()
File "suds\wsdl.py", line 220, in build_schema
self.schema = container.load(self.options)
File "suds\xsd\schema.py", line 95, in load
child.dereference()
File "suds\xsd\schema.py", line 323, in dereference
midx, deps = x.dependencies()
File "suds\xsd\sxbasic.py", line 422, in dependencies
raise TypeNotFound(self.ref)
suds.TypeNotFound: Type not found: '(schema, http://www.w3.org/2001/XMLSchema, )'
>>> client = Client("http://etcfulfill.ebooks.com/Fulfillment.asmx?wsdl", doctor=doctor)
No handlers could be found for logger "suds.xsd.sxbasic"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "suds\client.py", line 112, in __init__
self.wsdl = reader.open(url)
File "suds\reader.py", line 152, in open
d = self.fn(url, self.options)
File "suds\wsdl.py", line 159, in __init__
self.build_schema()
File "suds\wsdl.py", line 220, in build_schema
self.schema = container.load(self.options)
File "suds\xsd\schema.py", line 93, in load
child.open_imports(options)
File "suds\xsd\schema.py", line 305, in open_imports
imported = imp.open(options)
File "suds\xsd\sxbasic.py", line 542, in open
result = self.download(options)
File "suds\xsd\sxbasic.py", line 567, in download
raise Exception(msg)
Exception: imported schema (http://www.w3.org/2001/XMLSchema) at (http://www.w3.org/2001/XMLSchema.xsd), failed
UPDATE:
I realized that suds always opens its connections on increasing TCP ports, and when it reaches the maximum TCP port (65535) it starts again from the lowest available port, so that by itself is not a problem.
The problem shows up when using the suds ImportDoctor, because it first has to open a connection to the location the import is retrieved from. For some reason, if the system has reached the maximum TCP port number, suds seems to assume that no TCP port is available for that connection, and consequently it throws the exception:
Exception: imported schema (http://www.w3.org/2001/XMLSchema) at (http://www.w3.org/2001/XMLSchema.xsd), failed
Again, this only happens when suds has to open that earlier connection to obtain the import. If ImportDoctor is not used, suds has no problem when the TCP port number reaches its maximum; it just restarts at the lowest available port.
Does anyone have any clue how to resolve this issue? I'd really appreciate the help!

I've figured out what the problem was. The schema that was missing from the WSDL I was trying to use with suds was:
http://www.w3.org/2001/XMLSchema
And the XSD file for this schema is at:
http://www.w3.org/2001/XMLSchema.xsd
So when I used the suds ImportDoctor to add this schema import, the w3.org domain was sometimes denying me access (I don't really know why), and that is why this error was raised:
Exception: imported schema (http://www.w3.org/2001/XMLSchema) at (http://www.w3.org/2001/XMLSchema.xsd), failed
To solve the problem, I downloaded the schema to my machine and used the suds ImportDoctor to retrieve the import locally.
And that was it! A confusing bug, but solved.
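
A sketch of how the local import might be wired up, assuming the schema was saved to a local file. The file path, the local_schema_url and build_client helper names, and the reuse of the http://tempuri.org/ filter from the question are illustrative assumptions; Import's location argument is used here to point suds at a file:// URL instead of w3.org.

```python
from pathlib import Path


def local_schema_url(path):
    """Turn a local file path into an absolute file:// URL."""
    return Path(path).resolve().as_uri()


def build_client(wsdl_url, xsd_path):
    """Create a suds Client whose XMLSchema import is read from a local file."""
    # Imported lazily so the helper above works even without suds installed.
    from suds.client import Client
    from suds.xsd.doctor import Import, ImportDoctor

    schema_import = Import("http://www.w3.org/2001/XMLSchema",
                           location=local_schema_url(xsd_path))
    schema_import.filter.add("http://tempuri.org/")
    return Client(wsdl_url, doctor=ImportDoctor(schema_import))


if __name__ == "__main__":
    # Assumes XMLSchema.xsd was downloaded next to this script beforehand.
    client = build_client("http://etcfulfill.ebooks.com/Fulfillment.asmx?wsdl",
                          "XMLSchema.xsd")
    print(client)
```

Reading the schema from disk removes the dependency on w3.org being reachable each time the client is constructed.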

OpenShift Python mongoDB environment variables not set / can't connect

This is in my application file head:
import os
import sys
from cgi import parse_qs, escape
import pymongo
from pymongo import MongoClient
I have the MongoDB 2.4 gear installed, and am trying to connect via
client = MongoClient('mongodb://$OPENSHIFT_MONGODB_DB_HOST:$OPENSHIFT_MONGODB_DB_PORT/')
I get the errors:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/var/lib/openshift/531b77fd500446980900010d/python/virtenv/lib/python2.7/site-packages/pymongo/mongo_client.py", line 239, in __init__
res = uri_parser.parse_uri(entity, port)
File "/var/lib/openshift/531b77fd500446980900010d/python/virtenv/lib/python2.7/site-packages/pymongo/uri_parser.py", line 269, in parse_uri
nodes = split_hosts(hosts, default_port=default_port)
File "/var/lib/openshift/531b77fd500446980900010d/python/virtenv/lib/python2.7/site-packages/pymongo/uri_parser.py", line 209, in split_hosts
nodes.append(parse_host(entity, port))
File "/var/lib/openshift/531b77fd500446980900010d/python/virtenv/lib/python2.7/site-packages/pymongo/uri_parser.py", line 137, in parse_host
raise ConfigurationError("Port number must be an integer.")
pymongo.errors.ConfigurationError: Port number must be an integer.
It looks like OPENSHIFT_MONGODB_DB_PORT isn't set:
print OPENSHIFT_MONGODB_DB_PORT --> NameError: name 'OPENSHIFT_MONGODB_DB_PORT' is not defined
The same happens with OPENSHIFT_MONGODB_DB_HOST.
What do I need to do to set up a connection?
Update:
I was able to connect directly by hardcoding the connection info from RockMongo:
client = MongoClient('mongodb://admin:password@[ip addr]:[port]/')
but when I do
client = MongoClient('mongodb://admin:password@%s:%s/' % (os.environ['OPENSHIFT_MONGODB_DB_HOST'], os.environ['OPENSHIFT_MONGODB_DB_PORT']))
I get
[error] (<type 'exceptions.KeyError'>, KeyError('OPENSHIFT_MONGODB_DB_HOST',), <traceback object at 0x7f7bc8367248>)

The OpenShift connection variables are defined as environment variables, so they cannot be accessed as normal Python variables. That is why the print statement you gave does not work. The following should:
import os
print os.environ['OPENSHIFT_MONGODB_DB_PORT']
You should change your code to:
client = MongoClient('mongodb://%s:%s/' % (os.environ['OPENSHIFT_MONGODB_DB_HOST'], os.environ['OPENSHIFT_MONGODB_DB_PORT']))
You can refer to an example here.
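
A slightly more defensive sketch of the same idea: build the URI from the environment and fail with a readable message if a variable is missing. This would also help with the KeyError from the update, which suggests the variables were not present in the process that ran the code. The helper name mongo_uri_from_env is hypothetical, and the env parameter exists only so the function can be tried without touching the real environment.

```python
import os


def mongo_uri_from_env(env=None):
    """Build a mongodb:// URI from the OpenShift host/port environment variables."""
    env = os.environ if env is None else env
    try:
        host = env["OPENSHIFT_MONGODB_DB_HOST"]
        port = int(env["OPENSHIFT_MONGODB_DB_PORT"])  # pymongo needs an integer port
    except KeyError as exc:
        raise RuntimeError("environment variable %s is not set" % exc.args[0])
    return "mongodb://%s:%d/" % (host, port)


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(mongo_uri_from_env({"OPENSHIFT_MONGODB_DB_HOST": "127.0.0.1",
                              "OPENSHIFT_MONGODB_DB_PORT": "27017"}))
```

The result can then be passed to MongoClient(...) as in the line above.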
