I am running the same notebook three times in parallel using the code below:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def notebook1_function(country, days):
    dbutils.notebook.run(path="/pathtonotebook1/notebook1",
                         timeout_seconds=300,
                         arguments={"Country": country, "Days": days})

countries = ['US', 'Canada', 'UK']
days = [2] * len(countries)

with ThreadPoolExecutor() as executor:
    results = executor.map(notebook1_function, countries, days)
Each time, I am passing a different value for 'country' and 2 for 'days'. Inside notebook1 I have a dataframe df1.
I want to know the following:
How to append all the df1's from the three concurrent runs into a single dataframe.
How to get the status [Success/Failure] of each run after completion.
Thank you in advance.
When you're using dbutils.notebook.run (so-called notebook workflows), the notebook is executed as a separate job, and the caller doesn't share anything with it - all communication happens via the parameters that you pass to the notebook, and the notebook may return only a string value, specified via a call to dbutils.notebook.exit. So your code doesn't have access to the df1 inside the notebook that you're calling.
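For your second question (getting the Success/Failure status of each run): dbutils.notebook.run raises an exception when the called notebook fails or times out, so one option - just a sketch, not the only way - is to catch that exception in the wrapper function and return a status, plus whatever string the notebook passed to dbutils.notebook.exit:

def notebook1_function(country, days):
    try:
        # string passed to dbutils.notebook.exit() in notebook1, if any
        exit_value = dbutils.notebook.run(path="/pathtonotebook1/notebook1",
                                          timeout_seconds=300,
                                          arguments={"Country": country, "Days": days})
        return (country, "Success", exit_value)
    except Exception as e:
        return (country, "Failure", str(e))

with ThreadPoolExecutor() as executor:
    statuses = list(executor.map(notebook1_function, countries, days))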
Usually, if you're using such a notebook workflow, you need to somehow persist the content of df1 from the called notebook into some table, and then read that content back in the caller notebook.
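For example - a rough sketch only, where the table name my_db.df1_results is an assumption you'd replace with your own - notebook1 could append df1 to a shared table, and the caller reads everything back once all runs have finished:

Inside notebook1:
df1.write.mode("append").saveAsTable("my_db.df1_results")  # hypothetical table name

In the caller, after executor.map has completed:
df_all = spark.table("my_db.df1_results")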
Another possibility is to extract the code of the called notebook into a function that receives arguments and returns a dataframe, include that notebook via %run, call the function with different arguments, and combine the results using union. Something like this:
Notebook 1 (called):
def my_function(country, days):
    # do something
    return dataframe
Caller notebook:
%run "./Notebook 1"
df_us = my_function('US', 10)
df_canada = my_function('Canada', 10)
df_uk = my_function('UK', 10)
df_all = df_us.union(df_canada).union(df_uk)
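If the list of countries grows, you can also build the combined dataframe with a reduce over a list comprehension instead of chaining unions by hand (same my_function as above):

from functools import reduce

dfs = [my_function(c, 10) for c in ['US', 'Canada', 'UK']]
df_all = reduce(lambda a, b: a.union(b), dfs)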
I have a simple function in an R notebook (notebook A) that aggregates some data. I want to call notebook A from another notebook (notebook B) and interrogate the aggregated data from notebook A in notebook B.
So far I can run notebook A from notebook B no problem, but cannot see any returned data, variables or functions.
Code in notebook A:
function_to_aggregate_data = function(x, y){
    ...some code...
}

aggregated_data = function_to_aggregate_data(x, y)
Code in notebook B:
%python
dbutils.notebook.run("path/to/notebook_A", 60)
When you use dbutils.notebook.run, that notebook is executed as a separate job, so no variables, functions, etc. are shared between the caller notebook and the called notebook. You can return some data from the notebook using dbutils.notebook.exit, but it's limited to 1024 bytes (as I remember). You can, however, return data by registering a temp view and then accessing the data in that temp view - here is an example of doing that (although using Python for both notebooks).
The called notebook (your notebook A - in this example its path is ./Code1):
def generate_data1(n=1000, name='my_cool_data'):
    df = spark.range(0, n)
    df.createOrReplaceTempView(name)

generate_data1()

The caller notebook (your notebook B):
dbutils.notebook.run('./Code1', 60)
df = spark.sql("select * from my_cool_data")
assert(df.count() == 1000)
P.S. You can't directly share data between R & Python code, only by using temp views, etc.
The documentation by Microsoft at https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows says that you can run another notebook and pass parameters by doing the following:
notebook1:
result = dbutils.notebook.run("notebook2", 60, {"argument": "data", "argument2": "data2"})
print(f"{result}")
But it doesn't say how I can fetch the parameters argument and argument2 in my notebook2.
notebook2:
argument = ??
argument2 = ??
print(f"argument={argument} and argument2={argument2}")
dbutils.notebook.exit("Success")
How can I get the parameters in notebook2?
The documentation provides an answer for this. In order to get the parameters passed from notebook1, you must create two text widgets using dbutils.widgets.text() in notebook2, then use the dbutils.widgets.get() method to read their values.
You can try using the following code:
Notebook1
result = dbutils.notebook.run("nb2", 60, {"argument": "data", "argument2": "data2"})
print(f"{result}")
Notebook2
dbutils.widgets.text("argument","argument_default")
argument = dbutils.widgets.get("argument")
dbutils.widgets.text("argument2","argument2_default")
argument2 = dbutils.widgets.get("argument2")
ans = argument+' '+argument2
#print(f"argument={argument} and argument2={argument2}")
dbutils.notebook.exit(ans)
When you execute notebook1 to run notebook2, notebook2 runs successfully with the exit value shown below:
data data2
Note: if you pass only one value, the other argument in notebook2 takes the default value given as the 2nd parameter of dbutils.widgets.text().
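For example (a sketch based on the notebooks above), calling notebook2 with only the first argument should return the default for the second one:

result = dbutils.notebook.run("nb2", 60, {"argument": "data"})
print(result)  # expected: data argument2_default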
In a Prefect workflow, I'm trying to persist the data of every scheduled run, because I need to compare the previous result with the current one on each run. I tried LocalResult and checkpoint=True but it's not working. For example:
from prefect import Flow, task
from prefect.engine.results import LocalResult
from prefect.schedules import IntervalSchedule
from datetime import timedelta, datetime
import os
import prefect
#task("func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def file_scan():
files = os.listdir(test)
#prefect.context.a = files
return files
schedule = IntervalSchedule(interval=timedelta(seconds=61))
with Flow("Test persist data", schedule) as flow:
a = file_scan()
flow.run()
My flow is scheduled to run every 61 seconds (about a minute). On the first run I might get an empty result, but on the 2nd scheduled run I should get the previous flow's result to compare against. Can anyone help me achieve this? Thanks!
Update (15 November 2021):
I'm not sure what the reason is, but LocalResult and checkpoint actually worked when I ran the registered flow through the dashboard or the CLI (prefect run -n "your-workflow.py" --watch). They don't work when I trigger the flow manually (e.g. flow.run()) in Python code.
Try the following two options:
Option 1: using the target argument:
https://docs.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target
#task(target="func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def func_task():
return "999"
Option 2: instantiate a LocalResult instance and invoke write manually.
MY_RESULTS = LocalResult(dir="./.prefect")

@task(checkpoint=True, result=LocalResult(dir="./.prefect"))
def func_task():
    MY_RESULTS.write("999")
    return "999"
PS:
I'm having the same problem - LocalResult doesn't seem to work for me when used in the decorator, e.g.:
@task("func_task_target.txt", checkpoint=True, result=LocalResult(dir="~/.prefect"))
def file_scan():
I am working in a Jupyter notebook on an AWS SageMaker instance. For convenience, I wrote a .py file with a couple of functions, defined as follows:
# function to gather the percent of accts in each label/feature combo
def compute_pct_accts(data, label_cnt):
    """
    data is the output from aggregate_count
    label_cnt gives the breakdown of data for each target value
    """
    label_data_combined = pd.merge(data, label_cnt, how='inner', left_on='label', right_on='label')
    label_data_combined['Act_percent'] = np.round((label_data_combined['ACT_CNT']/label_data_combined['Total_Cnt'])*100, 2)
    return label_data_combined

# function to perform aggregation for the target and feature column
def aggregate_count(df, var, target):
    """
    df is the dataframe,
    var is the feature name
    target is the label variable (0 or 1)
    """
    label_var_cnt = df.groupby([var, target], observed=True)['ID'].count()
    label_var_cnt = label_var_cnt.reset_index()
    label_var_cnt.rename(columns={'ID': 'ACT_CNT'}, inplace=True)
    return label_var_cnt
Both of these functions are stored in a .py file called file1.py. Then, to retrieve them in my notebook, I typed:
from file1 import *
import pandas as pd
This command did import both functions. But when I tried to run the function:
compute_pct_accts(GIACT_Match_label_cnt, label_cnt)
I am getting a NameError:
pd not found
Please note that I have imported pandas as pd in my Jupyter notebook. I am aware of the option
%run -i compute_pct_accts_new.py
but that forces me to write a new Python file with that function. My question is: can we have one Python file with all functions defined in it, so that we can import all of them at once and use them interactively in the notebook?
Help is appreciated.
Try importing pandas in the .py file containing the function you want to import.
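In other words, a module only sees the names it imports itself, not what the notebook has imported. A minimal sketch of what file1.py would look like with its own imports at the top:

# file1.py
import pandas as pd
import numpy as np

def compute_pct_accts(data, label_cnt):
    # ... body as above, uses pd.merge and np.round ...
    ...

def aggregate_count(df, var, target):
    # ... body as above ...
    ...

With that in place, from file1 import * in the notebook works without the NameError, because pd and np are resolved inside the module itself.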
I'm writing an aggregation in pyspark.
I'm also adding tests to this project, in which I create a session, put in some data, run my aggregation, and check the results.
The code looks like the following:
def mapper_convert_row(row):
    # ... business-logic specifics; eventually returns one string value
    return my_str

def run_spark_query(spark: SparkSession, from_dt, to_dt):
    query = get_hive_query_str(from_dt, to_dt)
    df = spark.sql(query).rdd.map(lambda row: Row(mapper_convert_row(row)))
    out_schema = StructType([StructField("data", StringType())])
    df_conv = spark.createDataFrame(df, out_schema)
    df_conv.write.mode('overwrite').format("csv").save(folder)
And here is my test class:
class SparkFetchTest(unittest.TestCase):

    @staticmethod
    def getOrCreateSC():
        conf = SparkConf()
        conf.setMaster("local")
        spark = (SparkSession.builder.config(conf=conf).appName("MyPySparkApp")
                 .enableHiveSupport().getOrCreate())
        return spark

    def test_fetch(self):
        dt_from = datetime.strptime("2019-01-01-10-00", '%Y-%m-%d-%H-%M')
        dt_to = datetime.strptime("2019-01-01-10-05", '%Y-%m-%d-%H-%M')
        spark = self.getOrCreateSC()
        self.init_and_populate_table_with_test_data(spark, input_tbl, dt_from, dt_to)
        run_spark_query(spark, dt_from, dt_to)
        # assert on results
I've added the PySpark dependencies via a Conda environment and am running this code in PyCharm. Just to make it clear: there is no Spark installation on my local machine apart from the PySpark Conda package.
When I set a breakpoint, it is hit in the driver code, but execution does not stop inside the mapper_convert_row function.
How can I debug this business-logic function in a local test environment?
The same approach works perfectly in Scala, but this code has to be in Python.
PySpark is a conduit to the Spark runtime, which runs on the JVM and is written in Scala. The connection goes through Py4J, which provides a TCP-based socket from the Python executable to the JVM. Unfortunately, that means:
No local debugging
I'm no happier about it than you are. I might just write/maintain a parallel code branch in Scala to figure out some things that are tiring to do without the debugger.
Update: PyCharm is able to debug Spark programs. I have been using it nearly daily - see Pycharm Debugging of Pyspark.
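Independently of debugging through Spark, the business logic itself can often be exercised as a plain Python function, where breakpoints behave normally. A sketch of a test method you could add to the test class above, assuming mapper_convert_row needs nothing but a Row (the field names below are made up):

from pyspark.sql import Row

def test_mapper_convert_row(self):
    row = Row(field_a="value1", field_b="value2")  # hypothetical fields
    result = mapper_convert_row(row)
    self.assertIsInstance(result, str)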