Pig streaming through python script with import modules - python

Working with:
pigtmp$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported)
compiled Jul 18 2011, 08:29:40
I have a Python script (CPython) which imports another script; both are very simple in my example:
DATA
example$ hadoop fs -cat /user/pavel/trivial.log
1 one
2 two
3 three
EXAMPLE WITHOUT INCLUDE - works fine
example$ pig -f trivial_stream.pig
(1,1,one)
()
(1,2,two)
()
(1,3,three)
()
where
1) trivial_stream.pig:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
2) test_stream.py
#! /usr/bin/env python
import sys
import string
for line in sys.stdin:
    if len(line) == 0: continue
    new_line = line
    print "%d\t%s" % (1, new_line)
So essentially I just aggregate lines with one key, nothing special.
EXAMPLE WITH INCLUDE - bombs!
Now I'd like to append a string from a Python module which sits in the same directory as test_stream.py. I've tried to ship the imported module in many different ways, but I get the same error (see below).
1) trivial_stream.pig:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py', 'test_import.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
2) test_stream.py
#! /usr/bin/env python
import sys
import string
import test_import
for line in sys.stdin:
    if len(line) == 0: continue
    new_line = ("%s-%s") % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line)
3) test_import.py
def getTestLine():
    return "test line"
Now
example$ pig -f trivial_stream.pig
Backend error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.cleanup(PigMapBase.java:103)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Pig Stack Trace
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.PigServer.openIterator(PigServer.java:753)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:396)
at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
at org.apache.pig.PigServer.storeEx(PigServer.java:885)
at org.apache.pig.PigServer.store(PigServer.java:827)
at org.apache.pig.PigServer.openIterator(PigServer.java:739)
... 7 more
Thank you very much for your help!
-Pavel

Correct answer from comment above:
The dependencies aren't shipped. If you want your Python app to work with Pig, you need to tar it (don't forget the __init__.py's!), then include the .tar file in Pig's SHIP statement. The first thing your script should do is untar the app. There might be issues with paths, so I'd suggest the following even before tar extraction: sys.path.insert(0, os.getcwd()).
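For illustration, a minimal sketch of that approach. The archive name deps.tar is an assumption; match it to whatever you actually SHIP:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py', 'deps.tar');
and at the top of test_stream.py, before importing anything from the archive:
#! /usr/bin/env python
import sys, os, tarfile
sys.path.insert(0, os.getcwd())          # search the task's working directory first
tarfile.open('deps.tar').extractall()    # unpack the shipped archive ('deps.tar' is hypothetical)
import test_import                       # now resolvable from the extracted files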

You need to append the current directory to sys.path in your test_stream.py:
#! /usr/bin/env python
import sys
sys.path.append(".")
Thus the SHIP clause you had does ship the Python script; you just need to tell Python where to look for it.
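Putting it together, the working test_stream.py from the example would look like this (a sketch combining the fix above with the original script):
#! /usr/bin/env python
import sys
sys.path.append(".")   # the task's working directory, where SHIP places the files
import test_import

for line in sys.stdin:
    if len(line) == 0: continue
    new_line = ("%s-%s") % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line)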

Related

Not Able to Run GEM5 with RISC-V: "!seWorkload occurred: Couldn't find appropriate workload object"

I am trying to run gem5 with RISC-V. I have the Linux 64-bits cross compiler ready and I have also installed and compiled gem5. I then tried to use the following tutorial to run gem5: https://canvas.kth.se/courses/24933/pages/tutorial-simulating-a-cpu-with-gem5
I wrote a simple Hello World C program and compiled it using the following command:
riscv64-unknown-linux-gnu-gcc -c hello.c -static -Wall -O0 -o hello
But when I try to run gem5, I get the following error:
build/RISCV/sim/process.cc:137: fatal: fatal condition !seWorkload occurred: Couldn't find appropriate workload object.
I tried to get around this problem but could not. I added print statements to the configuration file and realized that the error occurs at the line m5.instantiate() in the configuration file attached below. Does anyone know how to solve this issue? What is an seWorkload, and why does gem5 consider the object not appropriate?
I am using Ubuntu 22.04. For reference, this is the configuration python file I use for gem5:
import m5
from m5.objects import *
import sys
system = System()
system.clk_domain = SrcClockDomain()
system.clk_domain.clock = '1GHz'
system.clk_domain.voltage_domain = VoltageDomain()
system.mem_mode = 'timing'
system.mem_ranges = [AddrRange('512MB')]
system.cpu = TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports
# start a process
process = Process()
# read command line arguments for the path to the executable
process.cmd = [str(sys.argv[1])]
system.cpu.workload = process
system.cpu.createThreads()
root = Root(full_system = False, system = system)
m5.instantiate() # the error occurs from this line
print("Beginning simulation!")
exit_event = m5.simulate()
print('Exiting @ tick %i because %s' % (m5.curTick(), exit_event.getCause()))
m5.util.addToPath('../../') is missing. This adds the common configuration scripts to the path, relative to the directory from which you are instantiating the simulation.
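For example, near the top of the configuration file (a sketch; '../../' assumes the script sits two levels below the gem5 root, so adjust the relative path to your layout):
import m5
from m5.util import addToPath
addToPath('../../')   # make gem5's bundled common configuration scripts importable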

Unable to import module 'lambda_function': No module named 'error'

I have a simple Python Code that uses Elasticsearch module "curator" to make snapshots.
I've tested my code locally and it works.
Now I want to run it in an AWS Lambda, but I get this error:
Unable to import module 'lambda_function': No module named 'error'
Here is how I proceeded:
I created a Lambda manually and gave it the "AISA-BasicLambdaExecutionRole" role. Then I created my package with my function and the dependencies, which I installed with this command:
pip install elasticsearch-curator -t /<path>/myRepository
I zipped the content (not the folder) and uploaded it to my Lambda.
I changed the handler name to "lambda_function.lambda_handler" (my function's file is "lambda_function.py").
Did I miss something? This is my first time working with Lambda and Python.
I've seen the other questions about this error:
"errorMessage": "Unable to import module 'lambda_function'"
But nothing works for me.
EDIT:
Here is my lambda_function :
from __future__ import print_function
import curator
import time
from curator.exceptions import NoIndices
from elasticsearch import Elasticsearch
def lambda_handler(event, context):
    es = Elasticsearch()
    index_list = curator.IndexList(es)
    index_list.filter_by_regex(kind='prefix', value="logstash-")
    Number = 1
    try:
        while Number <= 3:
            Name = "snapshotLmbd_n_" + str(Number)
            curator.Snapshot(index_list, repository="s3-backup", name=Name, wait_for_completion=True).do_action()
            Number += 1
            print('Just taking a nap ! will be back soon')
            time.sleep(30)
    except KeyboardInterrupt:
        print('My bad ! I interrupted this')
        return
Thank you for your time.
OK, since you have everything else correct, check the permissions of the Python script: it must have executable permissions (755).
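For example, from inside the package directory before zipping (a sketch; the zip file name is arbitrary):
$ chmod 755 lambda_function.py
$ zip -r lambda_package.zip .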

ansible: local test new module with Error: Module unable to decode valid JSON on stdin. Unable to figure out what parameters were passed

I'm new to Python. This is my first Ansible module; it is meant to delete a SimpleDB domain as part of ChaosMonkey deletion.
When tested in my local venv on Mac OS X, it keeps saying:
Module unable to decode valid JSON on stdin. Unable to figure out
what parameters were passed.
Here is the code:
#!/usr/bin/python
# Delete SimpleDB Domain
from ansible.module_utils.basic import *
import boto3
def delete_sdb_domain():
    fields = dict(
        sdb_domain_name=dict(required=True, type='str')
    )
    module = AnsibleModule(argument_spec=fields)
    client = boto3.client('sdb')
    response = client.delete_domain(DomainName=module.params['sdb_domain_name'])
    module.exit_json(changed=False, meta=response)

def main():
    delete_sdb_domain()

if __name__ == '__main__':
    main()
I'm trying to pass in parameters from the file /tmp/args.json, and I run the following command to make the local test:
$ python ./delete_sdb_domain.py /tmp/args.json
Please note I'm using a venv test environment on my Mac.
If you find any syntax error in my module, please also point it out.
This is not how you should test your modules.
AnsibleModule expects to have specific JSON as stdin data.
So the closest thing you can try is:
python ./delete_sdb_domain.py < /tmp/args.json
But I bet your JSON file is in the wrong format (no ANSIBLE_MODULE_ARGS, etc.).
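For reference, a minimal sketch of the envelope AnsibleModule expects on stdin (the exact shape can vary by Ansible version):
{
    "ANSIBLE_MODULE_ARGS": {
        "sdb_domain_name": "my-test-domain"
    }
}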
To debug your modules you can use test-module script from Ansible hacking pack:
./hacking/test-module -m delete_sdb_domain.py -a "sdb_domain_name=zzz"

Getting "ImportError: No module named" with parallel python and methods in a package

I'm trying to use Parallel Python in order to do some distributed benchmarking (essentially, coordinating and running some code on a set of machines from a central server). The code I had was working perfectly fine until I moved the functionality to a separate package. Since then, I keep getting ImportError: No module named some.module.pp_test.
My question is actually two-fold: has anyone ever come across this problem with pp, and if so, how do I solve it? I tried using dill (import dill), but it didn't help. Also, is there a good replacement for Parallel Python that doesn't require any additional infrastructure?
The exact error I get is:
RUNNING TEST
Waiting for hosts to finish booting....A fatal error has occured during the function execution
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/ppworker.py", line 86, in run
__args = pickle.loads(__sargs)
ImportError: No module named some.module.pp_test
Caught exception in the run phase 'NoneType' object is not iterable
Traceback (most recent call last):
File "test.py", line 5, in <module>
p.ping_pong()
File "/home/ubuntu/workspace/pp-test/some/module/pp_test.py", line 5, in ping_pong
a_test.run()
File "/home/ubuntu/workspace/pp-test/some/module/pp_test.py", line 27, in run
pong, hostname = ping()
TypeError: 'NoneType' object is not iterable
The code is structured this way:
pp-test/
    test.py
    some/
        __init__.py
        module/
            __init__.py
            pp_test.py
The test.py is implemented as:
from some.module.pp_test import MWE
p = MWE()
p.ping_pong()
While pp_test.py is:
class MWE():
    def ping_pong(self):
        print "RUNNING TEST "
        a_test = PPTester()
        a_test.run()

import pp
import time
from sys import stdout, exit

class PPTester(object):
    def run(self):
        try:
            ppservers = ('10.10.10.10', )
            time.sleep(5)
            job_server = pp.Server(0, ppservers=ppservers)
            stdout.write("Waiting for hosts to finish booting...")
            while len(job_server.get_active_nodes()) - 1 < len(ppservers):
                stdout.write(".")
                stdout.flush()
                time.sleep(1)
            ppmodules = ()
            pings = [(server, job_server.submit(self.run_pong, modules=ppmodules)) for server in ppservers]
            for server, ping in pings:
                pong, hostname = ping()
                print "Host ", hostname, " is alive!"
            print "All servers booted up, starting benchmarks..."
            job_server.print_stats()
        except Exception as e:
            print "Caught exception in the run phase", e
            raise

    def run_pong(self):
        import subprocess
        p = subprocess.Popen("hostname", stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        (output, err) = p.communicate()
        p_status = p.wait()
        return "pong ", output
dill won't work with pp out of the box, because pp doesn't serialize the python objects -- pp extracts the object's source code (like the inspect module in the standard python library).
To enable pp to use dill (actually dill.source, which is inspect augmented by dill), you have to use a fork of pp called ppft. ppft installs as pp (i.e. imports with import pp), but it has much stronger source inspection, so you can automatically "serialize" most python objects and have ppft track down their dependencies automatically.
Get ppft here: https://github.com/uqfoundation
ppft is also pip installable and python 3.x compatible.
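A quick sketch of using it as a drop-in replacement (assuming pip install ppft; the add function is just an illustration):
import pp   # provided by ppft; imports under the same name as Parallel Python

def add(x, y):
    return x + y

job_server = pp.Server()              # autodetect local workers
job = job_server.submit(add, (2, 3))  # ppft ships the source of 'add' to the worker
print job()                           # -> 5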

Error when using streaming_python in Pig

When I run the following:
REGISTER /home/hduser/Documents/ccc/Research/phd/code/ECentre/scripts/bags.py USING streaming_python AS bp;
raw = LOAD 'hdfs:///user/hduser/smsCorpus_en_2012.04.30_all.xml' AS (line:chararray);
b = foreach raw generate bp.enumerate_bag(line);
I get
Failed to parse: Pig script failed to parse:
<file /home/hduser/Documents/ccc/Research/phd/code/ECentre/scripts/nltk.pig, line 13, column
25> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException:
ERROR 1070: Could not resolve bp.enumerate_bag using imports: [, java.lang., org.apache.pig.builtin.,
org.apache.pig.impl.builtin.]
bags.py:
#!/usr/bin/env python
def enumerate_bag(input):
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output
Can anyone tell me why?
My version is:
Apache Pig version 0.12.2-SNAPSHOT (r: unknown)
compiled Apr 29 2014, 13:40:45
