Pass a pySpark script in Livy Session statement

I understand that a Livy session statement takes code as a string, as in the example below.
data = {
    'code': textwrap.dedent("""
        import random
        NUM_SAMPLES = 100000
        def sample(p):
            x, y = random.random(), random.random()
            return 1 if x*x + y*y < 1 else 0
        count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
        print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
        """)
}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
But is there a way to provide a PySpark file instead, maybe something like this:
data = {
    'pySparkFile': 'file_name.py'
}
I understand that Livy batches provide this functionality, but I want an interactive session in which users can submit multiple scripts one after another and refer to variables defined by earlier scripts, just as in an interactive PySpark session.

I am not sure this answers your question, but I managed to create a Spark session on EMR using cURL like this:
$ curl -H "Content-Type: application/json" -X POST -d '{"kind":"pyspark", "conf": {"spark.yarn.dist.pyFiles": "s3://bucket-name/test.py"}}' http://ec2-3-87-28-125.compute-1.amazonaws.com:8998/sessions
{"id":0,"name":null,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":["stdout: ","\nstderr: ","\nYARN Diagnostics: "]}
I inspected /mnt/var/log/livy/livy-livy-server.out and found this line which indicates that the session was successfully created:
20/08/31 18:02:25 INFO InteractiveSession: Interactive session 0 created [appid: application_1598896609416_0002, owner: null, proxyUser: None, state: idle, kind: pyspark, info: {driverLogUrl=http://ip-172-31-85-247.ec2.internal:8042/node/containerlogs/container_1598896609416_0002_01_000001/livy, sparkUiUrl=http://ip-172-31-95-182.ec2.internal:20888/proxy/application_1598896609416_0002/}]
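
Since the code field of a statement is just a string, one way to run a whole script in an interactive session is to read the file and send its contents as a statement. The sketch below assumes an already running session; build_statement and submit_script are hypothetical helper names, and the standard library is used in place of requests so the block is self-contained.

```python
import json
import urllib.request


def build_statement(script_path):
    """Read a local script and wrap its source in a Livy statement payload."""
    with open(script_path, "r", encoding="utf-8") as handle:
        return {"code": handle.read()}


def submit_script(statements_url, script_path):
    """POST the script's source to an existing session's /statements endpoint."""
    body = json.dumps(build_statement(script_path)).encode("utf-8")
    request = urllib.request.Request(
        statements_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like submit_script("http://localhost:8998/sessions/0/statements", "file_name.py"), where the host and session id are placeholders. Statements submitted to the same session run in the same interpreter, so variables defined by one script should remain visible to the next.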

Related

How to get progress of successful build through Jenkins Python API

I have written Python code to retrieve information about builds. It prints a summary of successful and unsuccessful builds.
from datetime import datetime
from prettytable import PrettyTable
# `server` is assumed to be a connected python-jenkins client (jenkins.Jenkins) created elsewhere.
t = PrettyTable(['Job name','Successful','Failed','Unstable','Aborted','Total Builds','Failure Rate'])
t1 = PrettyTable(['Status', 'Job name','Build #','Date','Duration','Node','User'])
aggregation ={}
jobs = server.get_all_jobs(folder_depth=None)
for job in jobs:
    print(job['fullname'])
    aggregation[job['fullname']] = {"success" : 0 , "failure" : 0 , "aborted" : 0, "unstable":0}
    info = server.get_job_info(job['fullname'])
    # Loop over builds
    builds = info['builds']
    for build in builds:
        information = server.get_build_info(job["fullname"],
                                            build['number'])
        if "SUCCESS" in information['result']:
            aggregation[job['fullname']]['success'] = str(int(aggregation[job['fullname']]['success']) + 1)
        if "FAILURE" in information['result']:
            aggregation[job['fullname']]['failure'] = str(int(aggregation[job['fullname']]['failure']) + 1)
        if "ABORTED" in information['result']:
            aggregation[job['fullname']]['aborted'] = str(int(aggregation[job['fullname']]['aborted']) + 1)
        if "UNSTABLE" in information['result']:
            aggregation[job['fullname']]['unstable'] = str(int(aggregation[job['fullname']]['unstable']) + 1)
        t1.add_row([ information['result'], job['fullname'],information["id"],datetime.fromtimestamp(information['timestamp']/1000),information["duration"],"master",information["actions"][0]["causes"][0]["userName"]])
    total_build = int(aggregation[job['fullname']]['success'])+int(aggregation[job['fullname']]['failure'])
    t.add_row([job["fullname"], aggregation[job['fullname']]['success'],aggregation[job['fullname']]['failure'],aggregation[job['fullname']]['aborted'],aggregation[job['fullname']]['unstable'],total_build,(float(aggregation[job['fullname']]['failure'])/total_build)*100])
with open('result', 'w') as w:
    w.write(str(t1))
    w.write(str(t))
This is what the output looks like:
And this is what Windows batch execute command looks like:
cd E:\airflowtmp
conda activate web_scraping
python hello.py
hello.py prints hello world. If I add something like print counter = 100 to it, how do I return that value and print it in the resulting table?
Edit:
I am trying to get some kind of variable from the code to display. For instance, if I'm scraping pages and the scraper runs successfully, I want to know the number of pages it scraped. You can think of it as a simple counter. Is there any way to return a variable from Jenkins to Python?
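
One possible approach, sketched here under assumptions rather than taken from the thread: have the job script print a marker line such as counter=100 and read it back from the build's console text. python-jenkins offers get_build_console_output(name, number) for fetching that text; the marker format and the extract_counter helper are hypothetical.

```python
import re

# Hypothetical marker format: the job script prints a line like "counter=100".
COUNTER_PATTERN = re.compile(r"^counter\s*=\s*(\d+)\s*$", re.MULTILINE)


def extract_counter(console_text):
    """Return the last counter value printed in the console text, or None."""
    matches = COUNTER_PATTERN.findall(console_text)
    return int(matches[-1]) if matches else None


# With python-jenkins, the console text for a build could be fetched with:
#   console_text = server.get_build_console_output(job['fullname'], build['number'])
sample_console = "hello world\ncounter = 100\nFinished: SUCCESS\n"
print(extract_counter(sample_console))
```

The returned value could then be added as an extra column when calling t1.add_row for that build.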

Behave - Testing using blank Example fields

I am using Behave to automate the testing of a config file. As part of this test I need to populate various fields in the config file with invalid and blank values. Where I am entering values, I can do this with a Scenario Outline, supplying the values in the Examples table. However, when I try to enter a blank field this way, Behave does not accept the missing value.
Is there an easy way to pass a blank value from the Examples table, or will I need to test these conditions using a separate Behave test?
feature
Scenario Outline: Misconfigured Identity Listener
    Given an already stopped Identity Listener
    And parameter <parameter> is configured to value <config_value>
    When the Identity Listener is started
    Then the identity listener process is not present on the system
    And the log contains a <message> showing that the parameter is not configured
    Examples: Protocols
        | parameter        | message          | config_value |
        | cache_ip_address | cache_ip_address |              |
        | cache_ip_address | cache_ip_address | 123.123.12   |
The step where I define the config value:
@given('parameter {parameter} is configured to value {config_value}')
def step_impl(context, parameter, config_value):
    context.parameter = parameter
    context.config_value = config_value
    context.identity_listener.update_config(parameter, config_value)
Changing the config file using sed -i (I am interacting with a Linux box in this test):
def update_config(self, param, config_value):
    command = 'sudo sh -c "sed -i'
    command = command + " '/" + param + "/c\\" + param + "= "+ config_value + " \\' {0}\""
    command = command.format(self.config_file)
    self.il_ssh.runcmd(command)
Thanks to the answer from @Verv, I got the working solution below.
I passed the value empty for fields where I don't want a value passed:
    | parameter        | message          | config_value |
    | cache_ip_address | cache_ip_address | empty        |
I added an if/else statement to my update_config method:
def update_config(self, param, config_value):
    if config_value == "empty":
        il_config = ""
    else:
        il_config = config_value
    command = 'sudo sh -c "sed -i'
    command = command + " '/" + param + "/c\\" + param + "= " + il_config + " \\' {0}\""
    command = command.format(self.config_file)
    self.il_ssh.runcmd(command)

You could put something like empty in the field, and tweak your method so that whenever the field's value is empty, you treat it as an actual empty string (i.e. "")
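
The sentinel idea can be isolated in a small helper so every step that reads an Examples value treats it the same way. This is a minimal sketch; EMPTY_SENTINEL and normalize_example_value are hypothetical names, not part of Behave.

```python
# Hypothetical sentinel: the literal word written in the Examples table
# wherever a blank value is wanted.
EMPTY_SENTINEL = "empty"


def normalize_example_value(raw_value, sentinel=EMPTY_SENTINEL):
    """Map the sentinel word to an empty string; pass other values through."""
    return "" if raw_value == sentinel else raw_value


print(repr(normalize_example_value("empty")))
print(repr(normalize_example_value("123.123.12")))
```

Calling the helper at the top of the step implementation keeps update_config free of test-specific conventions.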

how to run django cron job with which function

I have an app that needs a cron job. Specifically, for the ranking part I need my file to run synchronously so the score changes in the background. Here is my code.
I have rank.py in my utils folder:
from datetime import datetime, timedelta
from math import log
epoch = datetime(1970, 1, 1)
def epoch_seconds(date):
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)
def score(ups, downs):
    return ups - downs
def hot(ups, downs, date):
    s = score(ups, downs)
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / 45000, 7)
And I have the two functions under Post model, inside models.py
def get_vote_count(self):
    vote_count = self.vote_set.filter(is_up=True).count() - self.vote_set.filter(is_up=False).count()
    if vote_count >= 0:
        return "+ " + str(vote_count)
    else:
        return "- " + str(abs(vote_count))
def get_score(self):
    """
    :return: The score calculated by hot ranking algorithm
    """
    upvote_count = self.vote_set.filter(is_up=True).count()
    devote_count = self.vote_set.filter(is_up=False).count()
    return hot(upvote_count, devote_count, self.pub_date.replace(tzinfo=None))
The problem is that I'm not sure how to run a cron job for this. I've seen http://arunrocks.com/building-a-hacker-news-clone-in-django-part-4/ and it looks like I need to create another file and another function to make the whole thing run again and again, but what function? How do I use a cron job with my code? I saw there are many applications that allow me to do this, but I'm just not sure which function I need to use and how I should use it. My guess is that I need to run the get_score function in models.py, but how?

You may consider Celery and RabbitMQ.
The idea is: in your app you create a file called tasks.py and put this code there:
# tasks.py
from celery import task  # removed in Celery 5; newer projects use shared_task instead
@task
def your_task_for_async_job(data):
    # todo
    pass
Call it with your_task_for_async_job.delay(data) and it does the job for you asynchronously.
Here is the documentation for Celery; there you will also find how to set it up with Django, etc.
Hope this helps.
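
Whichever scheduler ends up being used (cron or a periodic Celery task), the function it needs to call is one that recomputes and stores every post's score. The following is a minimal sketch under stated assumptions: rank_score is a hypothetical stored field, recompute_scores is a hypothetical helper, and plain objects stand in for Post instances so the block runs without Django.

```python
from datetime import datetime
from math import log
from types import SimpleNamespace

EPOCH = datetime(1970, 1, 1)


def hot(ups, downs, date):
    """Hot-ranking score, same formula as rank.py in the question."""
    s = ups - downs
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = (date - EPOCH).total_seconds() - 1134028003
    return round(sign * order + seconds / 45000, 7)


def recompute_scores(posts):
    """Refresh the stored score of every post; a scheduler would call this."""
    for post in posts:
        # `rank_score` is a hypothetical stored field. With Django models,
        # this assignment would be followed by post.save().
        post.rank_score = hot(post.ups, post.downs, post.pub_date)
    return posts


# Plain objects stand in for Post instances in this sketch.
posts = [
    SimpleNamespace(ups=10, downs=2, pub_date=datetime(2015, 1, 1)),
    SimpleNamespace(ups=1, downs=5, pub_date=datetime(2015, 1, 1)),
]
recompute_scores(posts)
print([p.rank_score for p in posts])
```

In a real project the loop would iterate over the Post queryset, with the vote counts taken from vote_set as in get_score, and the function would live in a Celery task or a management command invoked by cron.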

How do I get event log information from pyethereum?

I've got some tests that create contracts with pyethereum and do various things with them, but I'm puzzled over how to get information about events they log.
A simplified example:
from ethereum import tester as t
s = t.state()
code = """contract LogTest {
    event LogMyNumber(uint);
    function LogTest() {
    }
    function logSomething() {
        LogMyNumber(4);
    }
}"""
logtest = t.state().abi_contract(code, language='solidity', sender=t.k0)
logtest.logSomething()
#number_i_logged = WHAT DO I DO HERE?
#print "You logged the number %d" % (number_i_logged)
I run this and get:
No handlers could be found for logger "eth.pow"
{'': 4, '_event_type': 'LogMyNumber'}
That json that's getting printed is the information I want, but can someone explain, or point me to an example, of how I might capture it and load it into a variable in python so that I can check it and do something with it? There seems to be something called log_listener that you can pass into abi_contract that looks like it's related but I couldn't figure out what to do with it.

I know you've been waiting for an answer for quite a long time, but in case someone else is wondering, here it goes:
log_listeners, which you mentioned, is the way to go. You can find some sample code using it in pyethereum's tests, and here is your fixed code:
from ethereum import tester as t
s = t.state()
code = """contract LogTest {
    event LogMyNumber(uint loggedNumber);
    function LogTest() {
    }
    function logSomething() {
        LogMyNumber(4);
    }
}"""
c = s.abi_contract(code, language='solidity', sender=t.k0)
o = []
s.block.log_listeners.append(lambda x: o.append(c._translator.listen(x)))
c.logSomething()
assert len(o) == 1
assert o == [{"_event_type": 'LogMyNumber', "loggedNumber": 4}]
number_i_logged = o[0]["loggedNumber"]
print "You logged the number %d" % (number_i_logged)

Debugging ScraperWiki scraper (producing spurious integer)

Here is a scraper I created using Python on ScraperWiki:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
    # DEBUG BEGIN
    if not type(data["arwu_rank"]) is str:
        print type(data["arwu_rank"])
        print data["arwu_rank"]
        print data["university"]
    # DEBUG END
    if "-" in data["arwu_rank"]:
        arwu_rank_bounds = data["arwu_rank"].split("-")
        data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
    if not type(data["arwu_rank"]) is int:
        data["arwu_rank"] = int(data["arwu_rank"])
    scraperwiki.sqlite.save(unique_keys=['university'], data=data)
It works perfectly except when scraping the final data row of the table (the "York University" line), at which point instead of lines 9 through 11 of the code causing the string "401-500" to be retrieved from the table and assigned to data["arwu_rank"], those lines somehow seem instead to be causing the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but also that that debugging code doesn't go very deep.
I have two questions:
What are my options for debugging scrapers run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? E.g. is there a way to step through?
Can you tell me why the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?
EDIT 6 May 2013, 20:07h UTC
The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:
import lxml.html
import re
import scraperwiki
pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/ARWU2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#UniversityRanking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank' : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # DEBUG BEGIN
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # DEBUG END
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)

Unfortunately, there's no easy way to debug your scripts on ScraperWiki. It just sends your code in its entirety and gets the results back; there's no way to execute the code interactively.
I added a couple more prints to a copy of your code, and it looks like the if check before the bit that assigns data
if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
doesn't trigger for "York University" so it will be keeping the int value (you set it later on) from the previous time around the loop.
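
The effect described above can be reproduced without the scraper. In this minimal sketch the rows are made-up stand-ins for table rows (None marks a row where the if check fails), and both function names are hypothetical.

```python
# Rows standing in for table rows; None marks a row where the `if` check
# fails (hypothetical data, not the real ARWU table).
rows = ["1", "401-500", None]


def parse_ranks_buggy(rows):
    """Reproduce the bug: `data` is only reassigned when the check passes."""
    results = []
    data = {}
    for row in rows:
        if row is not None:
            data = {"arwu_rank": row}
        # Runs for every row, so a skipped row sees the previous row's data.
        results.append(type(data["arwu_rank"]).__name__)
        if isinstance(data["arwu_rank"], str) and "-" in data["arwu_rank"]:
            low, high = data["arwu_rank"].split("-")
            data["arwu_rank"] = int((float(low) + float(high)) * 0.5)
    return results


def parse_ranks_fixed(rows):
    """Skip rows that fail the check so stale values are never reused."""
    results = []
    for row in rows:
        if row is None:
            continue
        rank = row
        if "-" in rank:
            low, high = rank.split("-")
            rank = int((float(low) + float(high)) * 0.5)
        results.append(int(rank))
    return results


print(parse_ranks_buggy(rows))
print(parse_ranks_fixed(rows))
```

In the buggy version the third row reports an int because it still holds the 450 computed for the previous row; skipping the row with continue (or nesting all the processing under the if) avoids that.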
