I am trying to run a Jupyter notebook on AWS Lambda and have created a layer with all the dependencies. The notebook is simple: it pulls a CSV file from Amazon S3 and displays the data as a bar graph. Below is the Lambda function written to download the .ipynb file and execute the notebook with papermill. I am not sure why it is failing with a "No module named 'boto3'" error.
import json
import sys
import os
import boto3
# papermill to execute notebook
import papermill as pm
import pandas as pd
import logging
import matplotlib.pyplot as plt
sys.path.append("/opt/bin")
sys.path.append("/opt/python")
os.environ["PYTHONPATH"]='/var/task'
os.environ["PYTHONPATH"]='/opt/python/'
os.environ["MPLCONFIGDIR"] = '/tmp/'
# ipython needs a writeable directory
os.environ["IPYTHONDIR"]='/tmp/ipythondir'
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    s3.meta.client.download_file('test-boto', 'testing.ipynb', '/tmp/test.ipynb')
    pm.execute_notebook('/tmp/test.ipynb', '/tmp/juptest_output.ipynb', kernel_name='python3')
    s3.meta.client.upload_file('/tmp/juptest_output.ipynb', 'test-boto', 'temp/juptest_output.ipynb')
    logger.info(event)
Error output:
START RequestId: c4da3406-c829-4f99-9fbf-b231a0d3dc06 Version: $LATEST
[INFO] 2020-08-07T17:55:16.602Z c4da3406-c829-4f99-9fbf-b231a0d3dc06 Input Notebook: /tmp/test.ipynb
[INFO] 2020-08-07T17:55:16.603Z c4da3406-c829-4f99-9fbf-b231a0d3dc06 Output Notebook: /tmp/juptest_output.ipynb
Executing: 0%| | 0/15 [00:00<?, ?cell/s][INFO] 2020-08-07T17:55:17.311Z c4da3406-c829-4f99-9fbf-b231a0d3dc06 Executing notebook with kernel: python3
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Executing: 7%|▋ | 1/15 [00:01<00:14, 1.06s/cell]
Executing: 7%|▋ | 1/15 [00:01<00:20, 1.46s/cell]
[ERROR] PapermillExecutionError:
---------------------------------------------------------------------------
Exception encountered at "In [1]":
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-9c332490c231> in <module>
1 import pandas as pd
2 import os
----> 3 import boto3
4 import matplotlib.pyplot as plt
5 client = boto3.client('s3')
ModuleNotFoundError: No module named 'boto3'
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 28, in lambda_handler
pm.execute_notebook('/tmp/test.ipynb', '/tmp/juptest_output.ipynb', kernel_name='python3')
File "/opt/python/papermill/execute.py", line 110, in execute_notebook
raise_for_execution_errors(nb, output_path)
File "/opt/python/papermill/execute.py", line 222, in raise_for_execution_errors
raise error
END RequestId: c4da3406-c829-4f99-9fbf-b231a0d3dc06
REPORT RequestId:c4da3406-c829-4f99-9fbf-b231a0d3dc06
Duration: 1624.78 ms Billed Duration: 1700 ms Memory Size: 3008 MB Max Memory Used: 293 MB
Jupyter Notebook:
import pandas as pd
import os
import boto3
import matplotlib.pyplot as plt
client = boto3.client('s3')
path = 's3://test-boto/aws-costs-Owner-Month-08.csv'
monthly_owner = pd.read_csv(path)
plt.bar(monthly_owner.Owner.head(6),monthly_owner.Amount.head(6))
plt.xlabel('Owner', fontsize=15)
plt.ylabel('Amount', fontsize=15)
plt.title('AWS Monthly Cost by Owner')
plt.show()
It looks like the papermill kernel cannot find the boto3 package even though your Lambda handler can. You are overriding (not appending to) PYTHONPATH in your Lambda function, which removes the other directories from the package search path. The kernel that papermill starts is a child process, so it inherits this reduced PYTHONPATH.
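A minimal sketch of appending rather than overriding, assuming the directories the question already uses (/var/task and /opt/python). The helper name `extend_pythonpath` and the example starting value /var/runtime are illustrative assumptions, not something taken from the question; check the real location with `boto3.__file__` inside the handler.

```python
import os


def extend_pythonpath(extra_dirs, env=None):
    """Prepend directories to PYTHONPATH while keeping the entries already there."""
    env = os.environ if env is None else env
    existing = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    merged = []
    for path in list(extra_dirs) + existing:
        if path not in merged:  # drop duplicates, keep the first occurrence
            merged.append(path)
    env["PYTHONPATH"] = os.pathsep.join(merged)
    return env["PYTHONPATH"]


# Hypothetical starting value; the real one is whatever the runtime has set.
example_env = {"PYTHONPATH": "/var/runtime"}
extend_pythonpath(["/var/task", "/opt/python"], env=example_env)
```

In the Lambda module, calling `extend_pythonpath(["/var/task", "/opt/python"])` in place of the two `os.environ["PYTHONPATH"] = ...` assignments would leave the existing entries visible to the kernel process.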
You might also find this useful. It allows you to directly deploy Jupyter Notebooks as serverless functions. It uses papermill behind the scene.
Disclaimer: I work for Clouderizer.
Related
I am starting with an ML model and importing libraries. Every library works fine except MGLEARN, which throws the error:
ModuleNotFoundError: No module named 'MGLEARN'.
I didn't pip install anything.
import sys
print("Python version: {}".format(sys.version))
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np
print("NumPy version: {}".format(np.__version__))
import scipy as sp
print("SciPy version: {}".format(sp.__version__))
import IPython
print("IPython version: {}".format(IPython.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
import mglearn
The output I get:
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 17
15 import sklearn
16 print("scikit-learn version: {}".format(sklearn.__version__))
---> 17 import MGLEARN
ModuleNotFoundError: No module named 'MGLEARN'
Trying to pip install anything gives an error:
!pip install mglearn
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 get_ipython().system('pip install mglearn')
File /lib/python3.10/site-packages/IPython/core/interactiveshell.py:2542, in InteractiveShell.system_piped(self, cmd)
2537 raise OSError("Background processes not supported.")
2539 # we explicitly do NOT return the subprocess status code, because
2540 # a non-None value would trigger :func:`sys.displayhook` calls.
2541 # Instead, we store the exit_code in user_ns.
-> 2542 self.user_ns['_exit_code'] = system(self.var_expand(cmd, depth=1))
File /lib/python3.10/site-packages/IPython/utils/_process_posix.py:129, in ProcessHandler.system(self, cmd)
125 enc = DEFAULT_ENCODING
127 # Patterns to match on the output, for pexpect. We read input and
128 # allow either a short timeout or EOF
--> 129 patterns = [pexpect.TIMEOUT, pexpect.EOF]
130 # the index of the EOF pattern in the list.
131 # even though we know it's 1, this call means we don't have to worry if
132 # we change the above list, and forget to change this value:
133 EOF_index = patterns.index(pexpect.EOF)
AttributeError: module 'pexpect' has no attribute 'TIMEOUT'
When I run the code:
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple
import os
import scipy.io, scipy.signal
import colorednoise as cn
def generate_pink_noise(signal_length):
    beta = 1
    samples = signal_length
    noise = cn.powerlaw_psd_gaussian(beta, samples)
    return noise
I get:
Traceback (most recent call last):
File "...MEA_foward_model.py", line 11, in <module>
import colorednoise as cn
ModuleNotFoundError: No module named 'colorednoise'
Note that I am using the 'test' environment when I get the error, as shown here: environment.
However, the same code runs without error in a Jupyter notebook:
import colorednoise as cn
signal_length =10
beta = 1 # the exponent: 0=white noise; 1=pink noise; 2=red noise (also "brownian noise")
samples = signal_length # number of samples to generate (time series extension)
noise = cn.powerlaw_psd_gaussian(beta, samples)
This image (see top right corner) shows that the same environment is used: jupyter notebook output.
What is causing this discrepancy between the two behaviours?
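One way to narrow this down is to run the same diagnostic in both places, the script and a notebook cell, and compare the results. The sketch below is only a diagnostic under the assumption that the two contexts may be using different interpreters despite the matching environment label; the helper name `describe_interpreter` is made up for illustration.

```python
import importlib.util
import sys


def describe_interpreter(module_name):
    """Report which interpreter is running and where (if anywhere) it finds a module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "module_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter("colorednoise").items():
        print("{}: {}".format(key, value))
```

If `executable` differs between the two runs, the package was installed for one interpreter only; a `module_location` of None means the running interpreter cannot see the package at all.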
I want to simulate an OpenModelica model in Python with the help of OMPython. The following is my code:
import matplotlib.pyplot as plt
import OMPython
from OMPython import OMCSessionZMQ
from OMPython import ModelicaSystem
omc = OMCSessionZMQ()
mod = ModelicaSystem("Li_ionBattery.mo", "Li_ionBattery.TestBench.VaryingCurrent")
Li_simulation = mod.getSimulationOptions()
mod.setSimulationOptions(["stopTime=2000", "stepSize=50"])
variables_vary = mod.getQuantities()
Parameters_vary = mod.getParameters()
continous_vary = mod.getContinuous()
mod.setParameters(["nMC_Data.Q_nom=11", "nMC_Data.Rs=0.0003"])
mod.simulate()
And I am getting the following error:
Notification: Li_ionBattery requested package Modelica of version 3.2.2. Modelica 3.2.3 is used instead which states that it is fully compatible without conversion script needed.
Error: Class Li_ionBattery.TestBench.VaryingCurrent not found in scope <top>.
Error: Class Li_ionBattery.TestBench.VaryingCurrent not found in scope <TOP>.
stopTime !is not a simulation-option variable
Traceback (most recent call last):
raise Exception("Error: application file not generated yet")
Exception: Error: application file not generated yet
The error is at the line
mod = ModelicaSystem("Li_ionBattery.mo", "Li_ionBattery.TestBench.VaryingCurrent")
which reports:
Error: Class Li_ionBattery.TestBench.VaryingCurrent not found in scope <top>.
Make sure the class Li_ionBattery.TestBench.VaryingCurrent exists in Li_ionBattery.mo.
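As a rough first check, the file can be scanned for each part of the dotted class name. This is a heuristic text search written for illustration, not an OMPython or OpenModelica API: it ignores nesting and comments, and it assumes the whole package is stored in one .mo file. The sample source and its misspelled model name are hypothetical.

```python
import re

# Modelica keywords that can introduce a class definition.
_CLASS_KEYWORDS = r"(?:package|model|class|block|connector|record|function|type)"


def missing_class_parts(modelica_source, qualified_name):
    """Return the parts of a dotted class name that are never declared in the text."""
    pattern = r"\b" + _CLASS_KEYWORDS + r"\s+([A-Za-z_]\w*)"
    declared = set(re.findall(pattern, modelica_source))
    return [part for part in qualified_name.split(".") if part not in declared]


# Hypothetical file content with a misspelled model name, for illustration only.
sample_source = """
package Li_ionBattery
  package TestBench
    model VaryingCurent
    end VaryingCurent;
  end TestBench;
end Li_ionBattery;
"""
missing = missing_class_parts(sample_source, "Li_ionBattery.TestBench.VaryingCurrent")
```

For the real file, pass `open("Li_ionBattery.mo").read()` as the first argument. An empty result only means every name is declared somewhere in the file, so the nesting still has to be confirmed by hand.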
I am working on a simple Python script to stream messages from Kafka using PySpark, and I am running it in Jupyter.
I get an error message saying that Spark Streaming's Kafka libraries were not found in the class path (more details below). I included the solution suggested by #tshilidzi-mudau in a previous post (and confirmed in the docs) to avoid this problem. What should I do to fix the bug?
Following the suggestion in the error message, I downloaded the JAR of the artifact, stored it in $SPARK_HOME/jars, and referenced it in the code.
Here is the code:
from __future__ import print_function
import os
import sys
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-streaming-kafka-0-10-assembly_2.10-2.2.2.jar pyspark-shell' #note that the "pyspark-shell" part is very important!!.
    #conf = SparkConf().setAppName("Kafka-Spark").setMaster("spark://127.0.0.1:7077")
    conf = SparkConf().setAppName("Kafka-Spark")
    #sc = SparkContext(appName="KafkaSpark")
    try:
        sc.stop()
    except:
        pass
    sc = SparkContext(conf=conf)
    stream=StreamingContext(sc,1)
    map1={'spark-kafka':1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1) #tried with localhost:2181 too
    print("kafkastream=",kafkaStream)
    sc.stop()
And this is the error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.2.2 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.2.2.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
TypeError Traceback (most recent call last)
<ipython-input-9-34de7dbdfc7c> in <module>()
13 ssc = StreamingContext(sc,1)
14 broker = "<my_broker_ip>"
---> 15 directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
16 directKafkaStream.pprint()
17 ssc.start()
/opt/spark/python/pyspark/streaming/kafka.pyc in createDirectStream(ssc, topics, kafkaParams, fromOffsets, keyDecoder, valueDecoder, messageHandler)
120 return messageHandler(m)
121
--> 122 helper = KafkaUtils._get_helper(ssc._sc)
123
124 jfromOffsets = dict([(k._jTopicAndPartition(helper),
/opt/spark/python/pyspark/streaming/kafka.pyc in _get_helper(sc)
193 def _get_helper(sc):
194 try:
--> 195 return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
196 except TypeError as e:
197 if str(e) == "'JavaPackage' object is not callable":
TypeError: 'JavaPackage' object is not callable
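Note that the error message names the spark-streaming-kafka-0-8 artifact, whereas the code above passes a 0-10 assembly JAR. A minimal sketch of building the submit arguments for the artifact the message asks for follows. The helper name `kafka_submit_args` is made up for illustration, and the Scala suffix (2.11 here) is an assumption that has to match the Scala version of the local Spark build.

```python
import os


def kafka_submit_args(spark_version, scala_version="2.11", kafka_api="0-8"):
    """Build a PYSPARK_SUBMIT_ARGS value that requests the Kafka connector via --packages."""
    package = "org.apache.spark:spark-streaming-kafka-{}_{}:{}".format(
        kafka_api, scala_version, spark_version
    )
    # The trailing "pyspark-shell" token is required, as the question's own comment notes.
    return "--packages {} pyspark-shell".format(package)


# Set this before the first SparkContext is created in the process.
os.environ["PYSPARK_SUBMIT_ARGS"] = kafka_submit_args("2.2.2")
```

The variable is read when PySpark launches its JVM, so if the notebook already has a running SparkContext, stopping it may not be enough and a kernel restart is the safer option.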
I have working Python code for a document similarity server.
The code runs fine from the command line, but when I try to run it from a Jupyter notebook I get the following error (the code is below):
AttributeError Traceback (most recent call last)
in ()
----> 1 simServer.queryIndex('National Intergroup Inc said it plans to file a registration statement')
<ipython-input-2-81df834abc60> in queryIndex(self, queryText)
58 print "Querying the INDEX"
59 doc = {'tokens': utils.simple_preprocess(queryText)}
---> 60 print(self.service.find_similar(doc, min_score=0.4, max_results=50))
At first I got a different error message, which I resolved by installing the simserver library from within the Jupyter notebook using the command !pip install --upgrade simserver. Now, however, I do not think a missing library is the cause.
Relevant code from the Jupyter notebook:
Line where the issue occurs:
simServer.queryIndex('National Intergroup Inc said it plans to file a registration statement')
#!/usr/bin/env python
import pickle
import os
import re
import glob
import pprint
import json
from gensim import utils
from simserver import SessionServer
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
class SimilarityServer(object):
    def __init__(self):
        print "Openning sesesion and setting it to true"
        self.service = SessionServer('tmp/my_server/')
        self.service.open_session()
        self.service.set_autosession(True)
    def indexDocs(self):
        print "Docs indexing and training"
        #train and index
        print "Training"
        self.service.session.train(None,method='lsi',clear_buffer=False)
        print "Indexing"
        self.service.session.index(None)
    def queryIndex(self,queryText):
        print "Querying the INDEX"
        doc = {'tokens': utils.simple_preprocess(queryText)}
        print(self.service.find_similar(doc, min_score=0.4, max_results=50))
simServer = SimilarityServer()
simServer.queryIndex('National Intergroup Inc said it plans to file a registration statement')