What is the equivalent to scala.util.Try in pyspark? - python

I've got a lousy HTTPD access_log and just want to skip the "lousy" lines.
In scala this is straightforward:
import scala.util.Try
val log = sc.textFile("access_log")
log.map(_.split(' ')).map(a => Try(a(8))).filter(_.isSuccess).map(_.get).map(code => (code,1)).reduceByKey(_ + _).collect()
For Python I've got the following solution, which explicitly defines a function instead of using the "lambda" notation:
log = sc.textFile("access_log")
def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'
log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()
Is there a better way of doing this in PySpark (e.g. like in Scala)?
Thanks a lot!

Better is a subjective term, but there are a few approaches you can try.
The simplest thing you can do in this particular case is to avoid exceptions altogether. All you need is a flatMap and some slicing:
log.flatMap(lambda s : s.split(' ')[8:9])
As you can see, there is no need for exception handling or a subsequent filter.
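To see why the slice works without Spark: indexing past the end of a list raises IndexError, while slicing past the end returns an empty list, which flatMap then flattens away. A minimal plain-Python sketch follows; the sample lines and the helper name status_codes are made up for illustration.

```python
from itertools import chain

# Hypothetical sample lines: the first has a ninth field, the second does not.
lines = [
    '1.2.3.4 - - [01/Jan/2015:00:00:00 +0000] "GET / HTTP/1.1" 200 512',
    'garbage line',
]

def status_codes(lines):
    # Local stand-in for rdd.flatMap(lambda s: s.split(' ')[8:9]):
    # each line contributes a list of zero or one elements, then all lists are flattened.
    return list(chain.from_iterable(s.split(' ')[8:9] for s in lines))

codes = status_codes(lines)
print(codes)
```

The short line simply contributes nothing, so no sentinel value or filter step is needed.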
The previous idea can be extended with a simple wrapper:
def seq_try(f, *args, **kwargs):
    try:
        return [f(*args, **kwargs)]
    except Exception:
        # Exception rather than a bare except, so KeyboardInterrupt still propagates
        return []
and example usage:
from operator import div # Python 2; on Python 3 use truediv. FYI operator provides getitem as well.
rdd = sc.parallelize([1, 2, 0, 3, 0, 5, "foo"])
rdd.flatMap(lambda x: seq_try(div, 1., x)).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
Finally, a more OO approach:
import inspect as _inspect
class _Try(object): pass
class Failure(_Try):
    def __init__(self, e):
        if Exception not in _inspect.getmro(e.__class__):
            msg = "Invalid type for Failure: {0}"
            raise TypeError(msg.format(e.__class__))
        self._e = e
        self.isSuccess = False
        self.isFailure = True
    def get(self): raise self._e
    def __repr__(self):
        return "Failure({0})".format(repr(self._e))
class Success(_Try):
    def __init__(self, v):
        self._v = v
        self.isSuccess = True
        self.isFailure = False
    def get(self): return self._v
    def __repr__(self):
        return "Success({0})".format(repr(self._v))
def Try(f, *args, **kwargs):
    try:
        return Success(f(*args, **kwargs))
    except Exception as e:
        return Failure(e)
and example usage:
tries = rdd.map(lambda x: Try(div, 1.0, x))
tries.collect()
## [Success(1.0),
## Success(0.5),
## Failure(ZeroDivisionError('float division by zero',)),
## Success(0.3333333333333333),
## Failure(ZeroDivisionError('float division by zero',)),
## Success(0.2),
## Failure(TypeError("unsupported operand type(s) for /: 'float' and 'str'",))]
tries.filter(lambda x: x.isSuccess).map(lambda x: x.get()).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
You can even use pattern matching with multipledispatch:
from multipledispatch import dispatch
from operator import getitem
@dispatch(Success)
def check(x): return "Another great success"
@dispatch(Failure)
def check(x): return "What a failure"
a_list = [1, 2, 3]
check(Try(getitem, a_list, 1))
## 'Another great success'
check(Try(getitem, a_list, 10))
## 'What a failure'
If you like this approach, I've pushed a slightly more complete implementation to GitHub and PyPI.

First, let me generate some random data to start working with.
import random
number_of_rows = int(1e6)
line_error = "error line"
text = []
for i in range(number_of_rows):
    choice = random.choice([1,2,3,4])
    if choice == 1:
        line = line_error
    elif choice == 2:
        line = "1 2 3 4 5 6 7 8 9_1"
    elif choice == 3:
        line = "1 2 3 4 5 6 7 8 9_2"
    elif choice == 4:
        line = "1 2 3 4 5 6 7 8 9_3"
    text.append(line)
Now I have a list, text, whose entries look like this:
1 2 3 4 5 6 7 8 9_2
error line
1 2 3 4 5 6 7 8 9_3
1 2 3 4 5 6 7 8 9_2
1 2 3 4 5 6 7 8 9_3
1 2 3 4 5 6 7 8 9_1
error line
1 2 3 4 5 6 7 8 9_2
....
Your solution:
def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'
log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()
#[('9_3', 250885), ('9_1', 249307), ('9_2', 249772)]
Here is my solution:
from operator import add
def myfunction(l):
    try:
        return (l.split(' ')[8],1)
    except:
        return ('MYERROR', 1)
log.map(myfunction).reduceByKey(add).collect()
#[('9_3', 250885), ('9_1', 249307), ('MYERROR', 250036), ('9_2', 249772)]
Comment:
(1) I highly recommend also counting the lines with "error". It adds little overhead and doubles as a sanity check: all the counts should add up to the total number of rows in the log. If you filter those lines out, you cannot tell whether they were truly bad lines or whether something went wrong in your code.
(2) I try to package all the line-level operations in one function to avoid chaining map and filter calls, so the code is more readable.
(3) From a performance perspective, I generated a sample of 1M records; my code finished in 3 seconds and yours in 2 seconds. This is not a fair comparison, since the data is so small and my cluster is pretty beefy. I would recommend generating a bigger file (1e12?) and benchmarking yours.
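The sanity check from comment (1) can be sketched locally with collections.Counter. Here classify is a hypothetical stand-in for myfunction without the count, and the four sample lines are invented; on a real RDD the same comparison would be made against the total row count.

```python
from collections import Counter

def classify(line):
    # Return the ninth field, or a sentinel key for lines that are too short
    fields = line.split(' ')
    return fields[8] if len(fields) > 8 else 'MYERROR'

sample = [
    "1 2 3 4 5 6 7 8 9_1",
    "error line",
    "1 2 3 4 5 6 7 8 9_2",
    "error line",
]
counts = Counter(classify(line) for line in sample)

# Every input line lands in exactly one bucket, so the totals must match
total = sum(counts.values())
print(counts, total == len(sample))
```

If the totals ever disagree, the problem is in the parsing logic rather than in the data.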

Related

How do I make a while loop print a result just once, unless the result changes?

I've made a short countdown program which starts at 4 and counts down to zero, I'd like this countdown to print each number just once before moving on to the next number (i.e 4,3,2,1,0), but it currently prints each number multiple times.
This is my Code:
import time
def timer():
    max_time = 4
    start_time = time.time()
    while max_time > 0:
        difference = time.time() - start_time
        if 1 > difference > 0:
            print(max_time)
        if 2 > difference > 1:
            max_time = 3
            print(max_time)
        elif 3 > difference > 2:
            max_time = 2
            print(max_time)
        elif 4 > difference > 3:
            max_time = 1
            print(max_time)
        elif 5 > difference > 4:
            print('Go')
            break
timer()
Currently I get a result like this:
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
Where I'd like a result like this:
4
3
2
1
Thanks
Your code consumes 100% of a CPU. That's wasteful. You do a timer by putting yourself to sleep for a while:
import time
def timer():
    max_time = 4
    for i in range(max_time,0,-1):
        print(i)
        time.sleep(1)
    print('Go')
timer()
To answer your explicit question about a print once type of function: I'd use a class that remembers the last thing printed. It's important to use a context manager (with expression) here, since the last line printed won't necessarily be garbage collected when you're done with it otherwise.
class Printer:
    '''Prints only changed lines'''
    def __init__(self):
        # a new, unique object() is never equal to anything else
        self._last = object()
    def __enter__(self):
        # let with...as statement get to print_once() directly
        return self.print_once
    def __exit__(self, *_):
        # let the last line be garbage collected
        del self._last
    def print_once(self, line):
        '''Print only a changed line'''
        # don't print the same thing again
        if line != self._last:
            # print the unique thing
            print(line)
            # remember the last thing printed
            self._last = line
with Printer() as print_once:
    for line in [4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1]:
        print_once(line)
If, instead, you just want something to count down from 4 once a second, then refer to Tim Roberts' answer. ;)

how to read this function

def cat(xx):
    if (xx<1):
        return 5
    if (xx<2):
        return 2
    if(xx<3):
        return 4
    if(xx<4):
        return 7
    if(xx<5):
        return 8
    if(xx<6):
        return 6
    if(xx<7):
        return 1
    if(xx<8):
        return 9
    if (xx<9):
        return 3
    else:
        return cat(xx-9)
print(cat(38))
The answer Python gives me is 4. I am not sure why it gives me this number. I know there are multiple if statements rather than elif, but I don't know how that leads to this answer.
The call stack of cat(38) will be:
print(cat(38))
return(cat(38-9))
return(cat(29))
return(cat(20))
return(cat(11))
return(cat(2))
<-- will return 4 since `xx<3` will evaluate to true
The function keeps subtracting 9 from xx until it returns something other than cat(xx-9). When you enter 38, xx reaches 2 on the fourth recursive call, and the output is 4.
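The chain of calls can be reproduced with a small tracing helper. trace_cat is a hypothetical name introduced here; it only follows the else branch and stops at the first argument below 9, which is where one of the if tests in cat returns.

```python
def trace_cat(xx):
    # Collect every argument the recursion passes along until xx < 9
    chain = [xx]
    while xx >= 9:
        xx -= 9
        chain.append(xx)
    return chain

steps = trace_cat(38)
print(steps)
```

For a non-negative integer the last argument is simply xx % 9, so cat(38) behaves like cat(2).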
else:
    return cat(xx-9)
You called print(cat(38)), but none of the if statements matches when the number is 9 or above, so execution reaches the else branch, which calls the function again instead of returning a number.
It gives you 4 because the function keeps recursing with xx-9 until one of the if statements matches, and then it returns that value, here 4. For example:
38 - 9 = 29, 29 - 9 = 20, 20 - 9 = 11, 11 - 9 = 2 <--- this is what the function is doing.
Solution:
else:
    return xx-9

Function calling function in python

def one_good_turn(n):
    return n + 1
def deserves_another(n):
    return one_good_turn(n) + 2
print(one_good_turn(1))
print(deserves_another(2))
I have two functions, one_good_turn(n) and deserves_another(n), and I call them with the arguments 1 and 2.
I expected the output to be:
2
4
but it shows:
2
5
Why is the output not what I had expected?
I believe you assume that one_good_turn(n) inside deserves_another(n) will return the value that was computed previously. It does not. It receives the current input n, which is 2, calls the function again, and computes 2+1, which is 3. Then 3 + 2 = 5.
Maybe to get your desired output, you should pass 1 to deserves_another:
def one_good_turn(n):
    return n + 1
def deserves_another(n):
    return one_good_turn(n) + 2
print(one_good_turn(1)) # 2
print(deserves_another(1)) # 4
A better way is to return the value from one_good_turn and pass it to deserves_another. So you don't need to call one_good_turn again inside deserves_another:
def one_good_turn(n):
    n = n + 1
    print(n) # 2
    return n
def deserves_another(n):
    return n + 2
n = one_good_turn(1)
print(deserves_another(n)) # 4
one_good_turn(2) returns 2+1=3.
Then the result is passed to deserves_another, which returns 3+2=5.

Python: Unable to stop this program's sentinel loop

So I'm making a program about hockey players that records their goals.
Here's what should happen:
Who scored? 4
Who scored? 5
Who scored? 6
Who scored? 4
Who scored? 3
Game still going?(y/n) y
Who scored? 3
Who scored? 2
Who scored? 5
Who scored? 2
Who scored? 3
Game still going?(y/n) n
Creating a histogram from values:
Element Goals Histogram
1 0
2 2 **
3 2 ***
4 2 **
5 2 **
6 1 *
7 1 *
8 0
9 0
10 0
Here is my code:
def info():
    ranking = [0,0,0,0,0,0,0,0,0,0,0]
    survey = []
    return ranking,survey
def info2(survey):
    x = ''
    for i in range(0,5):
        x = int(input("Who scored?"))
        survey.append(x)
    again(x)
    return survey
def info3(ranking,survey):
    for i in range(len(survey)):
        ranking[survey[i]]+=1
    return ranking, survey
def again(x):
    y = input("Game still on? y/n").lower()
    if y == "yes" or y == "y":
        info()
    elif y == "n" or y =="no":
        hg(x)
#create histogram
def hg():
    print("\nCreating a histogram from values: ")
    print("%3s %5s %7s"%("Element", "Goals", "Histogram"))
    #start from element 1 instead of 0
    for i in range(len(ranking)-1):
        print("%7d %5d %-s"%(i+1, ranking[i+1], "*" * ranking[i+1]))
def main():
    x,y = info()
    a = info2(y)
    d = again(x)
    b,c = info3(x,a)
    z = hg(x)
main()
When I run this as it is, I get the "Who scored?" prompts. Entering y at the y/n prompt works, but when I enter n, it prints the "Element Goals Histogram" header and then throws the following:
Traceback (most recent call last):
line 48, in <module>
main()
line 44, in main
a = info2(y)
line 17, in info2
again(x)
line 29, in again
hg(x)
line 39, in hg
for i in range(len(ranking)-1):
NameError: name 'ranking' is not defined
x = input("Game still on? y/n").lower
instead should be:
x = input("Game still on? y/n").lower()
There are a couple issues that I see with the code...
First, you've got lower instead of lower() in your again() function. This binds the function itself to x instead of calling it and assigning its return value.
Also, your hg() function expects an argument, but you don't pass one in here. The ranking defined in info() is local to that function, and not visible from hg().
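Both points can be seen in isolation in the sketch below. The names answer, make_ranking and total_goals are hypothetical and not taken from the question's code; they only illustrate the difference between referencing and calling a method, and passing a local value as an argument.

```python
answer = "Y"

method = answer.lower    # no parentheses: a bound method object, not a string
value = answer.lower()   # parentheses: the method runs and returns a new string

print(callable(method))  # the method object itself can still be called later
print(method == "y")     # comparing the method object to a string is never true
print(value == "y")

def make_ranking():
    ranking = [0] * 11   # local name: it exists only inside this function
    return ranking

def total_goals(ranking):
    # Receives the list as a parameter instead of relying on an outer name
    return sum(ranking)

print(total_goals(make_ranking()))
```

Returning the list and passing it on explicitly is what keeps a function like hg() from hitting a NameError.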
Edit in response to OP after OP's code was updated based on my comments above:
Also, there are issues with your handling of the exit case in again(). I think you shouldn't call hg() at all there, and instead return the answer to a separate variable in info2().
So that code for those two functions would then look something like this:
def info2(survey):
    x = ''
    ans = 'y'
    while ans in ('y', 'yes'):
        for i in range(0,5):
            x = int(input("Who scored?"))
            survey.append(x)
        ans = again()
    return survey
def again():
    x = input("Game still on? y/n").lower()
    if x == "yes" or x == "y":
        info()
    return x
Note the use of the additional variable, and the pass.
Edit in response to 2nd comment from OP:
info3() is unchanged. I added again() and info2() with my changes. You would keep info3() as is (at least as regards this particular question).
Also, since my change just had pass in the No case, it can actually be removed entirely. Just check for the Yes case, and otherwise return (an else isn't even required in this particular case).
When I run the code with the changes I mentioned, it appears to work as required. This is example output:
Who scored?1
Who scored?1
Who scored?1
Who scored?2
Who scored?2
Game still on? y/ny
Who scored?3
Who scored?3
Who scored?3
Who scored?2
Who scored?2
Game still on? y/nn
Creating a histogram from values:
Element Goals Histogram
1 3 ***
2 4 ****
3 3 ***
4 0
5 0
6 0
7 0
8 0
9 0
10 0
I'm not sure what all of the functions and variables were supposed to do. It's easier to debug and understand code if you use sensible names for functions and variables.
from collections import OrderedDict
def main():
    list_of_players = [str(number) for number in range(1, 11)]
    ranking = play(list_of_players)
    print_histogram(ranking)
def play(list_of_players):
    ranking = OrderedDict([(player, 0) for player in list_of_players])
    while True:
        for i in range(5):
            player = input('Who scored? ')
            try:
                ranking[player] += 1
            except KeyError:
                print('Not a player')
        if input('Game still going?(y/n) ').lower() in ['n', 'no']:
            return ranking
def print_histogram(ranking):
    template = '{player:^7} {goals:^7} {stars}'
    print('\nCreating a histogram from values: ')
    print(template.format(player='Element', goals='Goals', stars='Histogram'))
    for player, goals in ranking.items():
        print(template.format(player=player, goals=goals, stars='*' * goals))
if __name__ == '__main__':
    main()

While Loop to produce Mathematical Sequences?

I've been asked to do the following:
Using a while loop, you will write a program which will produce the following mathematical sequence:
1 * 9 + 2 = 11(you will compute this number)
12 * 9 + 3 = 111
123 * 9 + 4 = 1111
Your program should keep running as long as the results contain only "1"s. You can build your numbers as strings, then convert them to ints before the calculation. Then you can convert the result back to a string to check whether it contains only "1"s.
Sample Output:
1 * 9 + 2 = 11
12 * 9 + 3 = 111
123 * 9 + 4 = 1111
1234 * 9 + 5 = 11111
Here is my code:
def main():
    Current = 1
    Next = 2
    Addition = 2
    output = funcCalculation(Current, Addition)
    while (verifyAllOnes(output) == True):
        print(output)
        #string concat to get new current number
        Current = int(str(Current) + str(Next))
        Addition += 1
        Next += 1
        output = funcCalculation(Current, Next)
def funcCalculation(a,b):
    return (a * 9 + b)
def verifyAllOnes(val):
    Num_str = str(val)
    for ch in Num_str:
        if(str(ch)!= "1"):
            return False
    return True
main()
The bug is that the formula isn't printing next to the series of ones on each line. What am I doing wrong?
Pseudo-code:
a = 1
b = 2
result = a * 9 + b
while string representation of result contains only 1s:
    a = concat a with the old value of b, as a number
    b = b + 1
    result = a * 9 + b
This can be literally converted into Python code.
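One possible literal translation, as a sketch rather than the required solution. The function name sequence_lines and the choice to collect the formatted lines in a list before printing are my own assumptions; the line format follows the sample output in the question.

```python
def sequence_lines():
    lines = []
    a, b = 1, 2
    result = a * 9 + b
    # Keep going while the decimal representation consists only of the digit 1
    while set(str(result)) == {"1"}:
        lines.append("{0} * 9 + {1} = {2}".format(a, b, result))
        a = int(str(a) + str(b))  # concat a with the old value of b, as a number
        b += 1
        result = a * 9 + b
    return lines

lines = sequence_lines()
for line in lines:
    print(line)
```

With plain string concatenation the loop ends on its own once b has two digits: a becomes 12345678910, and the next result is no longer all ones.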
Testing all ones
Well, for starters, here is one easy way to check that the value is all ones:
def only_ones(n):
    n_str = str(n)
    return set(n_str) == set(['1'])
You could do something more "mathy", but I'm not sure that it would be any faster. It would, however, generalize much more easily to bases other than 10, if that's something you were interested in:
def only_ones(n):
    # integer division (//) so the recursion keeps working on ints
    return (n % 10 == 1) and (n == 1 or only_ones(n // 10))
Uncertainty about how to generate the specific recurrence relation...
As for actually solving the problem though, it's actually not clear what the sequence should be.
What comes next?
123456
1234567
12345678
123456789
?
Is it 1234567890? Or 12345678910? Or 1234567900?
Without answering this, it's not possible to solve the problem in any general way (unless in fact the 111... results terminate before you get to this issue).
I'm going to go with the most mathematically appealing assumption, which is that the value in question is the sum of all the 11111... values before it (note that 12 = 11 + 1, 123 = 111 + 11 + 1, 1234 = 1111 + 111 + 11 + 1, etc...).
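That observation is easy to check numerically. In the sketch below, repunit is a helper name introduced here for the number written as k ones; it is not part of the original answer.

```python
def repunit(k):
    # The number written as k ones, e.g. repunit(3) == 111
    return int("1" * k)

# 12 = 11 + 1, 123 = 111 + 11 + 1, 1234 = 1111 + 111 + 11 + 1, ...
for n in range(2, 10):
    prefix = int("".join(str(d) for d in range(1, n + 1)))
    assert prefix == sum(repunit(k) for k in range(1, n + 1))

# Past nine terms the sum carries, so it no longer reads "123..."
tenth = sum(repunit(k) for k in range(1, 11))
print(tenth)
```

Under this assumption the tenth value is 1234567900, the third of the candidates listed above, and multiplying it by 9 and adding 11 again gives a number made only of ones.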
A solution
In this case, you could do something along these lines:
def sequence_gen():
    a = 1
    b = 1
    i = 2
    while only_ones(b):
        yield b
        b = a*9 + i
        a += b
        i += 1
Notice that I've put this in a generator in order to make it easier to only grab as many results from this sequence as you actually want. It's entirely possible that this is an infinite sequence, so actually running the while code by itself might take a while ;-)
s = sequence_gen()
next(s) #=> 1
next(s) #=> 11
A generator gives you a lot of flexibility for things like this. For instance, you could grab the first 10 values of the sequence using the itertools.islice function:
import itertools as it
s = sequence_gen()
xs = [x for x in it.islice(s, 10)]
print(xs)
