I am working in a Jupyter notebook on an AWS SageMaker instance. For convenience, I wrote a .py file with a couple of functions, defined as follows:
# function to gather the percent of accounts in each label/feature combo
def compute_pct_accts(data, label_cnt):
    """
    data is the output from aggregate_count
    label_cnt gives the breakdown of data for each target value
    """
    label_data_combined = pd.merge(data, label_cnt, how='inner', left_on='label', right_on='label')
    label_data_combined['Act_percent'] = np.round((label_data_combined['ACT_CNT']/label_data_combined['Total_Cnt'])*100, 2)
    return label_data_combined
# function to perform aggregation for the target and feature column
def aggregate_count(df, var, target):
    """
    df is the dataframe,
    var is the feature name
    target is the label variable (0 or 1)
    """
    label_var_cnt = df.groupby([var, target], observed=True)['ID'].count()
    label_var_cnt = label_var_cnt.reset_index()
    label_var_cnt.rename(columns={'ID': 'ACT_CNT'}, inplace=True)
    return label_var_cnt
Both of these functions are stored in a .py file called file1.py. Then, to retrieve them in my notebook, I typed:
from file1 import *
import pandas as pd
This command did import both functions. But when I tried to run the function:
compute_pct_accts(GIACT_Match_label_cnt, label_cnt)
I am getting a NameError:
pd not found
Please note that I have imported pandas as pd in my Jupyter notebook. I am aware of the option
%run -i compute_pct_accts_new.py
but that forces me to write a new Python file with that function. My question is: can we have one Python file with all the functions defined in it, so that we can import all of them at once and use them interactively in the notebook?
Help is appreciated.
Try importing pandas in the .py file containing the function you want to import.
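In other words, names like pd and np have to be resolvable inside file1.py's own namespace; importing pandas in the notebook does not make them visible to the module. A minimal sketch of file1.py would therefore start with its own imports:
# file1.py
import pandas as pd
import numpy as np

def aggregate_count(df, var, target):
    ...

def compute_pct_accts(data, label_cnt):
    ...
After that, from file1 import * works from the notebook, because each function resolves pd and np in file1's own module namespace.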
I am using Jupyter in a Conda environment:
import igl
import meshplot as mp
import numpy as np
v, f = igl.read_triangle_mesh("./earth.ply")
k = igl.gaussian_curvature(v, f)
mp.plot(v, f, k, return_plot = True)
OUTPUT:
<meshplot.Viewer.Viewer at 0x1b53eb03fa0>
It is not displaying the mesh; it just outputs the object's location in memory. Please help me.
It seems that you have your meshplot.rendertype set to "OFFLINE".
If you are using this code in a jupyter notebook and want to display the mesh, then just switch rendertype to "JUPYTER", by executing mp.jupyter() somewhere before your plot() command.
If you are running the code as a normal Python program, you can export this View object as an HTML frame using the View.to_html() method. Then you can insert this frame into an HTML file and view it in a browser.
You can check out the source code for switching the rendertype here, and how the mp.plot function works here. The View class with the to_html method is defined here.
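For example, applied to the snippet above, the fix would look something like this:
import igl
import meshplot as mp

mp.jupyter()  # switch meshplot's rendertype to "JUPYTER" so the mesh renders inline

v, f = igl.read_triangle_mesh("./earth.ply")
k = igl.gaussian_curvature(v, f)
mp.plot(v, f, k)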
I'm new to PySpark and I'm trying to run the following code, which replaces a name column with fake names.
# !pip install Faker
from faker import Faker
from functools import partial
from pyspark.sql.functions import col

def synthetic_column(string, faker_function):
    return faker_function()

partial_func = partial(synthetic_column, faker_function=Faker().first_name)

spark_df = spark_df.withColumn('name', partial_func(col('name')))
display(spark_df)
This yields AssertionError: col should be Column.
I'm running the same code on an integer-type column and I don't get this AssertionError. Why is this happening? I have tried the solutions mentioned here but they aren't helpful.
Please advise.
You cannot apply a Python function directly to your dataframe; you need to turn it into a UDF. In your case, you probably also need to install Faker on all your nodes, because UDFs are executed in the local executor environment and need to be able to do their imports locally.
But once it is installed, your code just needs to import udf and apply it to your Faker function:
from pyspark.sql.functions import udf
partial_func = udf(Faker().first_name)
df = df.withColumn("Name", partial_func())
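For reference, a minimal end-to-end sketch of that approach (assuming a DataFrame with a 'name' column and Faker installed on the workers; the sample data here is made up) could look like this:
from faker import Faker
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])

# Wrap the Faker callable in a UDF; the default return type is string
fake_first_name = udf(Faker().first_name)
spark_df = spark_df.withColumn("name", fake_first_name())
spark_df.show()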
I am running the same notebook three times in parallel using the code below:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def notebook1_function(country, days):
    dbutils.notebook.run(path = "/pathtonotebook1/notebook1",
                         timeout_seconds = 300,
                         arguments = {"Country":country, "Days":days})

countries = ['US','Canada','UK']
days = [2] * len(countries)

with ThreadPoolExecutor() as executor:
    results = executor.map(notebook1_function, countries, days)
Each time, I am passing a different value for 'country' and 2 for 'days'. Inside notebook1 I have df1.
I want to know the following:
How to append all the df1's from the three concurrent runs into a single dataframe.
How to get the status [Success/Failure] of each run after completion.
Thank you in advance.
When you're using dbutils.notebook.run (so-called notebook workflows), the notebook is executed as a separate job, and the caller of the notebook doesn't share anything with it: all communication happens via the parameters that you pass to the notebook, and the notebook may return only a string value specified via a call to dbutils.notebook.exit. So your code doesn't have access to the df1 inside the notebook that you're calling.
Usually, if you're using such a notebook workflow, you need to somehow persist the content of df1 from the called notebook into some table, and then read that content from the caller notebook.
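A rough sketch of that pattern (the table naming is just an illustration) could be: in notebook1, save df1 to a table derived from the Country parameter and return the table name via dbutils.notebook.exit; in the caller, collect the returned names and union the tables.
# In notebook1 (the called notebook)
country = dbutils.widgets.get("Country")
table_name = "tmp_df1_" + country.lower()          # hypothetical naming scheme
df1.write.mode("overwrite").saveAsTable(table_name)
dbutils.notebook.exit(table_name)

# In the caller notebook
from functools import reduce

def notebook1_function(country, days):
    # dbutils.notebook.run returns whatever the child passed to dbutils.notebook.exit
    return dbutils.notebook.run("/pathtonotebook1/notebook1", 300,
                                {"Country": country, "Days": str(days)})

with ThreadPoolExecutor() as executor:
    table_names = list(executor.map(notebook1_function, countries, days))

df_all = reduce(lambda a, b: a.union(b), [spark.table(t) for t in table_names])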
Another possibility is to extract the code from the called notebook into a function that receives the arguments and returns the dataframe, include that notebook via %run, call the function with different arguments, and combine the results using union. Something like this:
Notebook 1 (called):
def my_function(country, days):
    # do something
    return dataframe
Caller notebook:
%run "./Notebook 1"
df_us = my_function('US', 10)
df_canada = my_function('Canada', 10)
df_uk = my_function('UK', 10)
df_all = df_us.union(df_canada).union(df_uk)
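Regarding the Success/Failure status of each run: dbutils.notebook.run raises an exception if the called notebook fails or times out, so one option (a sketch, not tested) is to catch that in the wrapper function:
def notebook1_function(country, days):
    try:
        result = dbutils.notebook.run("/pathtonotebook1/notebook1", 300,
                                      {"Country": country, "Days": str(days)})
        return (country, "Success", result)
    except Exception as e:   # raised when the child notebook fails or times out
        return (country, "Failure", str(e))

with ThreadPoolExecutor() as executor:
    statuses = list(executor.map(notebook1_function, countries, days))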
Currently, I am processing data in Hive using custom mappers and reducers like this:
select TRANSFORM(hostname,impressionId) using 'python process_data.py' as a,b from impressions
But when I try to apply the same logic in Spark SQL, I get a SparkSqlParser error.
I want to reuse the logic in process_data.py out of the box. Is there any way to do it?
You need to include some sort of error stack trace so that the community can answer your question quickly.
For the Python script to run in your Scala code (that is what I am assuming), you can achieve it in the following way:
Example:
Python file: code for making the input data uppercase
#!/usr/bin/python
import sys

for line in sys.stdin:
    print(line.upper())
Spark code: piping the data through the script
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkFiles

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)

val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)

val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
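If your application code is PySpark rather than Scala, a roughly equivalent sketch (untested, same idea) would be:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Pipe").getOrCreate()
sc = spark.sparkContext

sc.addFile("/path/on/driver/PipeScript.py")

ip_data = sc.parallelize(["asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"])
# Each record is written to the script's stdin; its stdout lines become the output RDD
op_data = ip_data.pipe("python " + SparkFiles.get("PipeScript.py"))
print(op_data.collect())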
You can create your own custom UDF and use it within your Spark application code. Use a custom UDF only when something can't be done with the available Spark native functions.
I am not sure what's in process_data.py, what kind of input it takes, and what you are expecting out of it.
If it's something that you want to make available to different application code, you can do it as follows:
Create a class in Python with a function inside it to do the processing, for example in somecode.py:
class MyClass:
    def __init__(self, args):
        …
    def MyFunction(self):
        …
Ship that file with your Spark application code:
spark.sparkContext.addPyFile('/py file location/somecode.py')
Import your class in the PySpark application code:
from somecode import MyClass
Create an object to access the class and its function:
myobject = MyClass()
Now you can access your class function to send and receive arguments.
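If you go the UDF route instead, a minimal sketch (assuming MyFunction is adapted to accept a single column value and return a string) for reusing it from Spark SQL could be:
from pyspark.sql.types import StringType
from somecode import MyClass

myobject = MyClass()
# Register the method as a SQL UDF so it can be called from spark.sql queries
spark.udf.register("process_data", myobject.MyFunction, StringType())
spark.sql("SELECT process_data(hostname) AS a FROM impressions").show()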
I have a notebook a.ipynb which has a function that reads a parquet file.
I am using a.ipynb in another notebook b.ipynb, and in this new notebook I am calling a function from a.ipynb to read this parquet file and create a SQL table. But it always fails with
Error: global name sqlContext is not defined,
even though it is defined in both notebooks.
The exact code:
a.ipynb (Utils):
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

def parquet_read(file_name):
    df = sqlContext.read.parquet(file_name+"*.parquet")
    return df
In b.ipynb I have used this function:
import nbimporter
import a as commonUtils
reload(commonUtils)
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df2 = commonUtils.parquet_read("abc")
This call always fails with the same error: global name sqlContext is not defined.
I would be very hesitant to use the approach you're following (i.e. importing notebooks as modules). I think you are far better served moving the utility code to a .py file rather than trying to use magic to import a notebook as a module.
Based on the documentation, it appears you overlooked some magic:
here we only run code which either defines a function or a class
It looks from your code sample like you define sqlContext as a module-level variable, not as a class or a function.
One approach would be to reorganize your code as follows. Better still, I think, would be to move this code to a .py file.
from pyspark import SparkContext
from pyspark.sql import SQLContext

def parquet_read(file_name):
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    df = sqlContext.read.parquet(file_name+"*.parquet")
    return df
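If you do move it to a .py file, a minimal sketch of that layout (the filename spark_utils.py is just an example) might be:
# spark_utils.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

def parquet_read(file_name):
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    return sqlContext.read.parquet(file_name + "*.parquet")
Then in b.ipynb:
from spark_utils import parquet_read
df2 = parquet_read("abc")
This avoids nbimporter entirely, and since the SQLContext is created inside the function, the name sqlContext is always resolvable when the function runs.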