Install pandas when calling the PythonRunner object

I have a python script:
# -*- coding: utf-8 -*-
import pandas as pd
print('Hello World')
I'm trying to run it in my Scala project using a PythonRunner object:
import org.apache.spark.deploy.PythonRunner
import java.io.File
import java.nio.file.Paths
object PythonRunnerApp extends App {
  val pyFilePath = this.getClass.getResource("").getPath + "/hello.py"
  PythonRunner.main(Array(pyFilePath, "hello.py"))
}
As a result, I get an import error: ImportError: No module named pandas
Traceback (most recent call last):
  File "/Users/a19562665/IdeaProjects/PythonRunner/target/scala-2.12/classes//hello.py", line 3, in <module>
    import pandas as pd
ImportError: No module named pandas
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:103)
    at PythonRunnerApp$.runUsingSpark(PythonRunnerApp.scala:15)
    at PythonRunnerApp$.delayedEndpoint$PythonRunnerApp$1(PythonRunnerApp.scala:27)
    at PythonRunnerApp$delayedInit$body.apply(PythonRunnerApp.scala:8)
    at scala.Function0.apply$mcV$sp(Function0.scala:39)
    at scala.Function0.apply$mcV$sp$(Function0.scala:39)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
    at scala.App.$anonfun$main$1$adapted(App.scala:80)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.App.main(App.scala:80)
    at scala.App.main$(App.scala:78)
    at PythonRunnerApp$.main(PythonRunnerApp.scala:8)
    at PythonRunnerApp.main(PythonRunnerApp.scala)
Is there any way I can ask PythonRunner to install pandas?
Update: here is another way of running Python scripts.
#!/usr/bin/python
# -*- coding: utf-8 -*-
#import pandas as pd
import sys
for line in sys.stdin:
    print('Hello, ' + line)
# this is hello.py
And the Scala application:
spark.sparkContext.addFile(getClass.getResource("hello.py").getPath, true)
val test = spark.sparkContext.parallelize(List("Body!")).repartition(1)
val piped = test.pipe(SparkFiles.get("./hello.py"))
val c = piped.collect()
c.foreach(println)
Output: Hello, Body!
But my question remains open: can I, as a cluster user, install pandas on the workers?

You need to install all necessary Python dependencies on every executor before submitting Spark applications to your cluster.
Keep in mind that you're not actually using pandas here, and Spark SQL should probably be used instead.
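Before installing anything, it can help to check what the interpreter launched by Spark can actually import. The sketch below is only a diagnostic to run with that same Python; missing_modules is a hypothetical helper name, not part of Spark or pandas.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    # Hypothetical helper: find_spec returns None when a top-level module
    # cannot be located on this interpreter's import path.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Show which interpreter is actually running the script.
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["pandas"]))
```

If pandas is reported as missing, it has to be installed for that interpreter on every node, or Spark has to be pointed at an interpreter that already has it, for example through the PYSPARK_PYTHON environment variable.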

Related

Selenium Helper No Module

Traceback (most recent call last):
  File "gen.py", line 9, in <module>
    from SeleniumHelper import SeleniumHelper
ImportError: No module named 'SeleniumHelper'
# python accounts.py -i ../../data/twitter-creator.json -d regular -f 1
# python accounts.py -i ../../data/twitter-creator.json -d proxy -f 1
import sys
import time
import getopt
import simplejson
from selenium import webdriver
from seleniumHelper import seleniumHelper
class TwitterCreator(SeleniumHelper):
    MOBILE_URL_CREATE = 'https://mobile.twitter.com/signup?type=email'
    MOBILE_FIELD_SIGN_UP_NAME = '#oauth_signup_client_fullname'
    MOBILE_FIELD_SIGN_UP_EMAIL = '#oauth_signup_client_phone_number'
    MOBILE_FIELD_SIGN_UP_PASSWORD = '#password'
    MOBILE_FIELD_SIGN_UP_USERNAME = '#custom_name'
    MOBILE_BUTTON_SKIP_PHONE = '.signup-skip input'
    MOBILE_BUTTON_INTERESTS = 'input[data-testid="Button"]'
    DESKTOP_URL_CREATE = 'https://twitter.com/signup'
    DESKTOP_URL_SKIP = 'https://twitter.com/account/add_username'
    DESKTOP_URL_MAIN = 'https://twitter.com'
import mechanize
import cookielib
import subprocess
Dear Python experts, I am getting the SeleniumHelper error shown at the top. I have tried hard to work through the code, but I could not find where the problem is. I would be very grateful for an answer.
Note: there is also another file called selenium.
Note 2: I installed selenium with pip, but Python still does not find the module.

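from SeleniumHelper import SeleniumHelper
This is valid syntax, so the syntax is not the problem. The ImportError means Python cannot find a module named SeleniumHelper on its import path. Installing selenium with pip does not provide one; SeleniumHelper looks like a separate helper file (SeleniumHelper.py) that has to sit next to gen.py or elsewhere on the path. Also note that module names are case-sensitive: the posted code imports seleniumHelper, while the traceback and the class definition use SeleniumHelper.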
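The ImportError in the question says that Python could not find a module with that exact name. A minimal sketch of how the lookup behaves, using a temporary stand-in file rather than the real helper (the class body is a placeholder and the directory is created only for the demonstration):

```python
import importlib
import importlib.util
import os
import sys
import tempfile

# Create a throwaway directory holding a stand-in SeleniumHelper.py.
# The class body is a placeholder, not the real helper from the question.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "SeleniumHelper.py"), "w", encoding="utf-8") as f:
    f.write("class SeleniumHelper:\n    pass\n")

# Make the directory importable, as the script's own directory would be.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

# The exact file name resolves as a module.
from SeleniumHelper import SeleniumHelper

# A differently cased name is looked up as a different module.
wrong_case = importlib.util.find_spec("seleniumHelper")
print(SeleniumHelper.__name__, wrong_case)
```

The point of the sketch is that the file has to exist under the exact module name somewhere on the import path; installing the selenium package does not create it.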

Import own python modules in nextflow script block?

I created a python script called utilities.py in bin/ directory:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
from datetime import datetime
import io
def print_info(in_df, fname_base):
    buffer = io.StringIO()
    df = in_df.copy()
    df.info(buf=buffer)
    s = buffer.getvalue()
    with open(fname_base+"_info.txt", "w", encoding="utf-8") as f:
        f.write(s)
def print_desc(in_df, fname_base):
    df = in_df.copy()
    desc = df.describe()
    desc.to_csv(fname_base+"_desc.tsv", sep = '\t')
def print_data(in_df, fname_base):
    df = in_df.copy()
    print_info(df, fname_base)
    print_desc(df, fname_base)
    df.to_csv(fname_base+".tsv", sep = '\t')
and made it executable with chmod +x. I would like to use these functions in several script blocks across various processes in my workflow. Currently, when I try importing a function from my utilities module:
#!/bin/bash nextflow
process transform_data {
    input:
    path(data)
    output:
    path("out.tsv"), emit: out_data
    script:
    """
    #!/usr/bin/env python3
    import pandas as pd
    import io
    from utilities import print_info
    """
}
I get the following error:
Traceback (most recent call last):
  File ".command.sh", line 4, in <module>
    from utilities import print_info
ModuleNotFoundError: No module named 'utilities'
Is it possible to import my own modules in this way?

Which version of Nextflow are you using?
I tested with v22.04.5 and the following works.
My setup is a little different: instead of specifying #!/usr/bin/env python3 in the script block, I directly invoked a Python script (test.py) that has from utilities import print_info inside it, and it works fine.
script:
"""
test.py
"""
Note that the following won't work: from .utilities import print_info. So yes, you can import a custom Python module with Nextflow.
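One likely explanation for the difference: when Python runs a script file, it puts that script's directory at the front of its import path, so a test.py stored next to utilities.py can import it, whereas the inline script block runs as .command.sh in the task's work directory. A directory being on PATH only helps the shell find executables. The sketch below illustrates this with a temporary stand-in for bin/ (the directory and the one-line utilities.py are placeholders, not the asker's files):

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the bin/ directory; the module content is a placeholder.
bin_dir = tempfile.mkdtemp()
with open(os.path.join(bin_dir, "utilities.py"), "w", encoding="utf-8") as f:
    f.write("def print_info():\n    return 'info'\n")

# Empty working directory, so nothing is importable from the current directory.
work_dir = tempfile.mkdtemp()
child_code = "from utilities import print_info; print(print_info())"


def run_child(**overrides):
    """Run child_code in a fresh interpreter with the given environment overrides."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    env.update(overrides)
    return subprocess.run(
        [sys.executable, "-c", child_code],
        env=env, cwd=work_dir, capture_output=True, text=True,
    )


# Being on PATH is not enough: Python does not search PATH for modules.
on_path_only = run_child(PATH=bin_dir + os.pathsep + os.environ.get("PATH", ""))

# Adding the directory to PYTHONPATH makes the import succeed.
on_pythonpath = run_child(PYTHONPATH=bin_dir)

print(on_path_only.returncode, on_pythonpath.stdout.strip())
```

In a Nextflow project, one option that may apply is to set PYTHONPATH for tasks through the env scope of nextflow.config; treat that as an assumption to check against the documentation for your Nextflow version.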

Python cannot import from another file name not defined

I am trying to import a custom function from another python file but keep getting an error NameError: name 'testme' is not defined. I confirmed that I am importing the file correctly according to this SO post and that the function is top level. What else can I try to fix this?
My main python file is:
import sys
import dbconn
#from dbconn import testme #<----did not work
dev=True
if(dev):
    categId='528'
    pollIds=[529,530,531]
else:
    categId=str(sys.argv[1])
    pollIds=[529,530,531]
df=testme(categIds)#callServer(categId,pollIds)
df
if(not categId.isdigit):
    print('categ id fail. expected digit got: '+categId)
.....
and dbconn.py:
import pymysql #pip3 install PyMySQL
import pandas as pd
from scipy.stats.stats import pearsonr
from scipy import stats
def testme(categIds):
    try:
        df=categIds
    except Exception as e:
        print("broke")
    return categIds
Not sure if it makes a difference, but I am running the main Python file from within a Jupyter notebook and have a compiled version of dbconn.py in the same directory.
In response to the suggestions I tried:
df=dbconn.testme(categIds)
got the error:
module 'dbconn' has no attribute 'testme'

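There are four basic import forms:
1) import <package>
2) import <module>
3) from <package> import <module or subpackage or object>
4) from <module> import <object>
In your case dbconn is the module and testme is the object, so form 4 applies:
from dbconn import testme
With a plain import dbconn, the function has to be called as dbconn.testme(...). Since both fail for you inside Jupyter, the kernel is most likely still using an older copy of dbconn (an earlier import or the stale compiled file), so restart the kernel or reload the module after editing it.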
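Because the question mentions a Jupyter notebook, a stale module object is a plausible cause: once a module has been imported, later edits to the file are not picked up until the module is reloaded or the kernel is restarted. A minimal sketch of that behaviour, using a temporary stand-in module (dbconn_demo and its contents are placeholders, not the asker's dbconn.py):

```python
import importlib
import os
import sys
import tempfile

# Avoid writing .pyc files so the example always re-reads the source.
sys.dont_write_bytecode = True

module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "dbconn_demo.py")

# First version of the module: testme does not exist yet.
with open(module_path, "w", encoding="utf-8") as f:
    f.write("VERSION = 1\n")

sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import dbconn_demo

had_testme_before = hasattr(dbconn_demo, "testme")

# Edit the file on disk, as one would while developing.
with open(module_path, "w", encoding="utf-8") as f:
    f.write("VERSION = 2\n\ndef testme(categ_ids):\n    return categ_ids\n")

# The already-imported module object is unchanged until it is reloaded.
still_missing = not hasattr(dbconn_demo, "testme")
dbconn_demo = importlib.reload(dbconn_demo)
result = dbconn_demo.testme("528")
print(had_testme_before, still_missing, result)
```

In a notebook, restarting the kernel has the same effect as the reload and also clears any names imported with from ... import.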

import ansible.module_utils in 2.2.1.0 as part of inventory module

Importing UTILS classes into Inventory - can it be done?
I have created a custom LDAP data importer as part of creating my inventory class. The LDAP schema we have wasn't similar enough to the LDAP plugin provided in samples.
My class is called ldapDataModule; the class is in:
/home/agt/ansible/agt_module_utils/ldapDataModule.py
My "$HOME/.ansible.cfg" file has the following:
module_utils = /home/agt/ansible/agt_module_utils
When running my Ansible inventory module, I get the following output:
ansible ecomtest37 -m ping
ERROR! Attempted to execute "/sites/utils/local/ansible/hosts" as
inventory script: Inventory script (/sites/utils/local/ansible/hosts) had
an execution error: Traceback (most recent call last):
  File "/sites/utils/local/ansible/hosts", line 22, in <module>
    from ansible.module_utils import ldapDataModule
ImportError: No module named module.utils
The include statement inside hosts appears like this:
import copy
import ldap
import re
import sys
import operator
import os
import argparse
import datetime
import os.path
try:
    import json
except:
    import simplejson as json
from ansible.module_utils import ldapDataModule
class agtInventory(object):
Recommendations?

I was able to do the following as a workaround. I'd still like to hear from Ansible gurus on the proper use of the module_utils variable from ansible.cfg.
sys.path.insert(0, '/home/agt/ansible/agt_module_utils')
from ldapDataModule import ldapDataModule
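An alternative to editing sys.path is to load the file by its explicit path. This is only a sketch under two assumptions: the inventory script runs on Python 3.5 or later, and load_module_from_path is a hypothetical helper name. A temporary stand-in file is used so the example runs anywhere.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, file_path):
    """Load a module from an explicit file path without editing sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in file so the sketch runs anywhere; in the question the path would be
# /home/agt/ansible/agt_module_utils/ldapDataModule.py.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "ldapDataModule.py")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("class ldapDataModule:\n    source = 'demo'\n")

ldap_module = load_module_from_path("ldapDataModule", demo_path)
ldapDataModule = ldap_module.ldapDataModule
print(ldapDataModule.source)
```

Either way, the inventory script ends up importing the class as a plain Python module rather than through the ansible.module_utils package.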

About Graphlab library importing

In Ubuntu 14.04, I have installed Graphlab based on https://dato.com/download/install-graphlab-create-command-line.html and it seems to be working fine.
However, I receive this error when trying to use a recommender module:
import graphlab
from graphlab.recommender import ranking_factorization_recommender
In the first line, graphlab is imported without any error. However, the second line causes this error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-5-34df81ffb957> in <module>()
----> 1 from graphlab.recommender import ranking_factorization_recommender
ImportError: No module named recommender
How can the problem be solved? Thanks

It's just a namespace issue. recommender actually lives in the toolkits module, so this should work:
import graphlab
from graphlab.toolkits.recommender import ranking_factorization_recommender

Graphlab has already imported everything for you in their __init__.py file.
Just do:
from graphlab import ranking_factorization_recommender
from graphlab import <any_other_recommender>
Here is a snippet of the graphlab/__init__.py file:
from graphlab.util import get_runtime_config
from graphlab.util import set_runtime_config
import graphlab.connect as _mt
import graphlab.connect.aws as aws
from . import visualization
import os as _os
import sys as _sys
if _sys.platform != 'win32' or \
        (_os.path.exists(_os.path.join(_os.path.dirname(__file__), 'cython', 'libstdc++-6.dll')) and \
         _os.path.exists(_os.path.join(_os.path.dirname(__file__), 'cython', 'libgcc_s_seh-1.dll'))):
    from graphlab.data_structures.sgraph import Vertex, Edge
    from graphlab.data_structures.sgraph import SGraph
    from graphlab.data_structures.sarray import SArray
    from graphlab.data_structures.sframe import SFrame
    from graphlab.data_structures.sketch import Sketch
    from graphlab.data_structures.image import Image
    from graphlab.data_structures.sgraph import load_sgraph, load_graph
    from graphlab.toolkits._model import Model, CustomModel
    import graphlab.aggregate
    import graphlab.toolkits
    import graphlab.toolkits.clustering as clustering
    import graphlab.toolkits.distances as distances
...
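Both answers rest on the same mechanism: a name can be re-exported as an attribute of the top-level package without being a submodule at that dotted path. The sketch below builds a tiny throwaway package to show the difference; demo_pkg and everything inside it are placeholders, not GraphLab's real code.

```python
import importlib
import importlib.util
import os
import sys
import tempfile

# Build a tiny package that mimics the layout described above. Every name here
# (demo_pkg and its contents) is a placeholder, not GraphLab's real code.
root = tempfile.mkdtemp()
toolkits_dir = os.path.join(root, "demo_pkg", "toolkits")
os.makedirs(toolkits_dir)

files = {
    # The top-level __init__.py re-exports the submodule under a short name.
    os.path.join(root, "demo_pkg", "__init__.py"):
        "from demo_pkg.toolkits import recommender\n",
    os.path.join(toolkits_dir, "__init__.py"): "",
    os.path.join(toolkits_dir, "recommender.py"):
        "def create():\n    return 'model'\n",
}
for path, source in files.items():
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)

sys.path.insert(0, root)
importlib.invalidate_caches()

# Works: the name is an attribute of the top-level package.
from demo_pkg import recommender

# Works: this is where the module file actually lives.
from demo_pkg.toolkits.recommender import create

# There is no demo_pkg/recommender.py, so this dotted path is not a module.
spec = importlib.util.find_spec("demo_pkg.recommender")
print(recommender.create(), create(), spec)
```

This mirrors why from graphlab.recommender import ... fails while the toolkits path and the top-level import both work.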
