Error when using streaming_python in Pig - python

When I run the following:
REGISTER /home/hduser/Documents/ccc/Research/phd/code/ECentre/scripts/bags.py USING streaming_python
AS bp;
raw = LOAD 'hdfs:///user/hduser/smsCorpus_en_2012.04.30_all.xml' AS (line:chararray);
b = foreach raw generate bp.enumerate_bag(line);
I get
Failed to parse: Pig script failed to parse:
<file /home/hduser/Documents/ccc/Research/phd/code/ECentre/scripts/nltk.pig, line 13, column
25> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException:
ERROR 1070: Could not resolve bp.enumerate_bag using imports: [, java.lang., org.apache.pig.builtin.,
org.apache.pig.impl.builtin.]
bags.py:
#!/usr/bin/env python
def enumerate_bag(input):
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output
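For reference, run as plain Python outside Pig, the UDF logic behaves like this (the sample bag below is invented purely for illustration):
# hypothetical local test of the same logic, outside Pig
sample_bag = [("one",), ("two",), ("three",)]  # a bag of single-field tuples, as Pig would pass it
print(enumerate_bag(sample_bag))
# prints [(0, 'one'), (1, 'two'), (2, 'three')]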
Can anyone tell me why?
My version is:
Apache Pig version 0.12.2-SNAPSHOT (r: unknown)
compiled Apr 29 2014, 13:40:45

Related

ReadFromKafka with python in apache-beam Unsupported signal: 2

I've been struggling to make this work. I know this is a cross-language transform and all of that, and I installed the Java JDK on my PC (when I run java -version in cmd I get the correct information and all of that), but when I try to make a simple pipeline work:
import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentialsOld.json'

def main():
    print('======================================================')
    beam_options = PipelineOptions(runner='DataflowRunner', temp_location=temp_location, staging_location=staging_location, project=project, experiments=['use_runner_v2'], streaming=True)
    with beam.Pipeline(options=beam_options) as p:
        msgs = p | 'ReadKafka' >> ReadFromKafka(consumer_config={'bootstrap.servers': 'xxxxx-xxxxx...', 'group_id': 'testAB'}, topics=['users'])
        msgs | beam.FlatMap(print)

if __name__ == '__main__':
    main()
I get this error: ValueError: Unsupported signal: 2
I have tried adding the parameter expansion_service='beam:external:java:kafka:read:v1' to the ReadFromKafka, but then I get:
status = StatusCode.UNAVAILABLE
details = "DNS resolution failed for
beam:external:java:kafka:read:v1: UNKNOWN: OS Error"
I'm working in a Python venv environment, if this info is useful, and my Kafka cluster is on Confluent Cloud.
I'm also getting this runtime error:
RuntimeError: java.lang.RuntimeException: Failed to get dependencies of beam:transform:org.apache.beam:kafka_read_without_metadata:v1 from spec urn: "beam:transform:org.apache.beam:kafka_read_without_metadata:v1"
EDIT: I'm getting the bootstrap server option from here.
My mistake was that I was skipping the step where I have to start an expansion service. I did that with this command:
java -jar beam-sdks-java-io-expansion-service-2.37.0.jar 8088 --javaClassLookupAllowlistFile='*'
after downloading the beam-sdks-java-io-expansion-service-2.37.0.jar from https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-expansion-service/2.36.0
and then specifying the port in expansion_service='localhost:8088'.
Then I had two minor mistakes: one was that I was using JDK 18, which I think wasn't compatible (https://beam.apache.org/get-started/quickstart-java/), so I switched to JDK 17; I also used Python 3.8 instead of Python 3.10.
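Putting the pieces together, the read in the pipeline then looks roughly like this; a sketch only, where the bootstrap server and topic are the placeholders from the question and the expansion service started above is assumed to be listening on localhost:8088:
msgs = p | 'ReadKafka' >> ReadFromKafka(
    consumer_config={
        'bootstrap.servers': 'xxxxx-xxxxx...',  # Confluent Cloud bootstrap server (redacted placeholder)
        'group.id': 'testAB',                   # the standard Kafka property name is 'group.id'
    },
    topics=['users'],
    expansion_service='localhost:8088',         # the locally started Java expansion service
)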

Renpy, an exception when trying to build a build on Android (renpy.loader.transfn)

Let's say there is a json file in ./resources called "string.json". Then the parsing of this file can be implemented as follows:
label start:
    $ import json
    $ f = open(renpy.loader.transfn("resources/string.json"))
    $ text = json.load(f)
On a PC and in an Android emulator this script works fine, but when I build the Android package and run it on my phone, an exception is thrown (screenshot: exception on an Android phone).
How can I fix it?
Thanks a lot for any answer, and sorry, my English is not good enough.
The renpy.file method can be used to resolve the exception:
label start:
    $ import json
    $ f = renpy.file("resources/string.json")
    $ text = json.load(f)
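If you also want to close the file handle when you are done, a small variant (assuming the object returned by renpy.file behaves like a regular Python file object) could be:
label start:
    python:
        import json
        f = renpy.file("resources/string.json")
        try:
            text = json.load(f)
        finally:
            f.close()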

python27.def "Symbol table not found" when trying weave

I'm trying to use weave with 64-bit Anaconda Python. As weave requires Python 2.7, I created a new env to be able to import it; during code execution it turned out that libpython27.a is missing. So I created this library, i.e. first created a .def file and then the library with dlltool:
C:\ProgramData\Anaconda3\envs\Python27>gendef python27.dll
C:\ProgramData\Anaconda3\envs\Python27>C:\MinGW64\bin\dlltool -v --dllname python27.dll --def python27.def --output-lib libpython27.a
Library creation went OK; however, during compilation by weave I'm getting "Symbol table not found". After a bit of debugging, here is the code that raises the error, complaining that there is no symbol table for the new python27.def:
File "C:\ProgramData\Anaconda3\envs\Python27\lib\site-packages\numpy\distutils\mingw32ccompiler.py", line 302, in generate_def
raise ValueError("Symbol table not found")
ValueError: Symbol table not found
def dump_table(dll):
    st = subprocess.Popen(["objdump.exe", "-p", dll], stdout=subprocess.PIPE)
    return st.stdout.readlines()

def generate_def(dll, dfile):
    """Given a dll file location, get all its exported symbols and dump them
    into the given def file.

    The .def file will be overwritten"""
    dump = dump_table(dll)
    for i in range(len(dump)):
        if _START.match(dump[i].decode()):
            break
    else:
        raise ValueError("Symbol table not found")
Any idea what it can be?
After more investigation, it looks like the Anaconda distribution delivers msvcr90.dll without a
symbol table. So when generate_def(dll, dfile) is invoked for msvcr90.dll, it generates an empty .def file.
The fix for it was to add return False at line 352 of mingw32ccompiler.py:
def build_msvcr_library(debug=False):
    return False  # added workaround: skip building the msvcr import library entirely
    if os.name != 'nt':
        return False
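If you would rather not edit the installed numpy file, an alternative (an untested sketch, assuming weave builds through numpy.distutils' MinGW compiler) is to monkeypatch the same function from your own script before weave compiles anything:
# Sketch: disable msvcr import-library generation before weave is used.
from numpy.distutils import mingw32ccompiler

def _skip_msvcr_library(debug=False):
    # mirror the manual edit above: never try to rebuild the msvcr import library
    return False

mingw32ccompiler.build_msvcr_library = _skip_msvcr_library

# ... then import weave and run your code as usual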

import fails for murmur2 package in Redshift UDF

I am trying to import the murmur2 package as a library in a Redshift database. I followed these steps:
Run the module packer
$ ./installPipModuleAsRedshiftLibrary.sh -m murmur2 -s s3://path/to/murmur2/lib
Create library on redshift
CREATE OR REPLACE LIBRARY murmur2 LANGUAGE plpythonu from 's3://path/to/murmur2/lib/murmur2.zip' WITH CREDENTIALS AS 'aws_access_key_id=AAAAAAAAAAAAAAAAAAAA;aws_secret_access_key=SSSSSSSSSSSSSSSSS' region 'us-east-1';
Create function and query
create OR REPLACE function f_py_kafka_partitioner (s varchar, ps int)
returns int stable as $$ import murmur2
m2 = murmur2.murmur64a(s, len(s), 0x9747b28c)
return m2 % ps
$$ language plpythonu;
SELECT f_py_kafka_partitioner('jiimit', 100);
This gives the following error:
[Amazon](500310) Invalid operation: ImportError: No module named murmur2. Please look at svl_udf_log for more information
Details:
-----------------------------------------------
error: ImportError: No module named murmur2. Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 0
location: udf_client.cpp:366
process: padbmaster [pid=31381]
-----------------------------------------------;
And here are the contents of svl_udf_log:
0 ImportError: No module named murmur2 2018-10-14 07:05:43.431561 line 2, in f_py_kafka_partitioner\n f_py_kafka_partitioner 1000 20000 0
The folder structure looks like this:

Pig streaming through python script with import modules

Working with:
pigtmp$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported)
compiled Jul 18 2011, 08:29:40
I have a Python script (CPython) which imports another script, both very simple in my example:
DATA
example$ hadoop fs -cat /user/pavel/trivial.log
1 one
2 two
3 three
EXAMPLE WITHOUT INCLUDE - works fine
example$ pig -f trivial_stream.pig
(1,1,one)
()
(1,2,two)
()
(1,3,three)
()
where
1) trivial_stream.pig:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
2) test_stream.py
#! /usr/bin/env python
import sys
import string
for line in sys.stdin:
    if len(line) == 0: continue
    new_line = line
    print "%d\t%s" % (1, new_line)
So essentially I just aggregate lines with one key, nothing special.
EXAMPLE WITH INCLUDE - bombs!
Now I'd like to append a string from a python import module which sits in the same directory as test_stream.py. I've tried to ship the import module in many different ways but get the same error (see below)
1) trivial_stream.pig:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py', 'test_import.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
2) test_stream.py
#! /usr/bin/env python
import sys
import string
import test_import
for line in sys.stdin:
    if len(line) == 0: continue
    new_line = ("%s-%s") % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line)
3) test_import.py
def getTestLine():
    return "test line"
Now
example$ pig -f trivial_stream.pig
Backend error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.cleanup(PigMapBase.java:103)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Pig Stack Trace
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.PigServer.openIterator(PigServer.java:753)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:396)
at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
at org.apache.pig.PigServer.storeEx(PigServer.java:885)
at org.apache.pig.PigServer.store(PigServer.java:827)
at org.apache.pig.PigServer.openIterator(PigServer.java:739)
... 7 more
Thank you very much for your help!
-Pavel
Correct answer from comment above:
The dependencies aren't shipped. If you want your Python app to work with Pig, you need to tar it (don't forget the __init__.py's!), then include the .tar file in Pig's SHIP statement. The first thing you do is untar the app. There might be issues with paths, so I'd suggest the following even before tar extraction: sys.path.insert(0, os.getcwd()).
You need to append the current directory to sys.path in your test_stream.py:
#! /usr/bin/env python
import sys
sys.path.append(".")  # make files shipped into the task's working directory importable
import test_import    # this import now succeeds because "." is on sys.path
Thus the SHIP command you had there does ship the Python script; you just need to tell Python where to look.
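For the more general case described in the comment (shipping a whole package as a tar file), a rough sketch of the streaming wrapper could look like the following; the archive name myapp.tar is an assumption for illustration and would also need to be listed in the SHIP(...) clause:
#! /usr/bin/env python
# Hypothetical streaming wrapper that unpacks a shipped tarball before importing from it.
import os
import sys
import tarfile

# look in the task's current working directory first
sys.path.insert(0, os.getcwd())

# 'myapp.tar' is an assumed name; it must be shipped alongside this script
if os.path.exists("myapp.tar"):
    tar = tarfile.open("myapp.tar")
    tar.extractall(".")
    tar.close()

import test_import

for line in sys.stdin:
    if len(line) == 0: continue
    new_line = ("%s-%s") % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line)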
