What is the TRY function in Python (PySpark)?

I want to extract the 2nd element of an array, so I wrote this:
spark.sql("""
SELECT TRY(object_url[2])
I wrapped it in TRY because sometimes there is no second element.
But I have a problem: the TRY function doesn't work in PySpark (Python).
What is the equivalent function?
FYI, I always use the TRY function in AWS Athena when I extract an array element.

No need for TRY. Spark will give you a null value when the index is out of range.
spark.sql('''SELECT test, test[2] from test''').show()
+------+-------+
|  test|test[2]|
+------+-------+
|[1, 2]|   null|
+------+-------+
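If you prefer the DataFrame API, element_at behaves the same way for an out-of-range index (note that it uses 1-based indexing). A minimal sketch, assuming an existing SparkSession named spark and made-up example data:

from pyspark.sql import functions as F

# hypothetical example data: the second row has only two elements
df = spark.createDataFrame([([1, 2, 3],), ([1, 2],)], ["test"])

# element_at is 1-based and returns null when the index is out of range
df.select("test", F.element_at("test", 3).alias("third")).show()
# +---------+-----+
# |     test|third|
# +---------+-----+
# |[1, 2, 3]|    3|
# |   [1, 2]| null|
# +---------+-----+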


What is the meaning of hash if we still need to check every item?

We know that tuple objects are immutable and thus hashable. We also know that lists are mutable and non-hashable.
This can be easily illustrated
>>> set([1, 2, 3, (4, 2), (2, 4)])
{(2, 4), (4, 2), 1, 2, 3}
>>> set([1, 2, 3, [4, 2], [2, 4]])
TypeError: unhashable type: 'list'
Now, what is the meaning of hash in this context, if, in order to check for uniqueness (e.g. when building a set), we still have to check each individual item in whatever iterables are there in the set anyway?
We know two objects can have the same hash value and still be different. So the hash alone is not enough to compare objects. So what is the point of the hash? Why not just check each individual item in the iterables directly?
My intuition is that it could be for one of the following reasons:
hash is just a (pretty quick) preliminary comparison. If hashes are different, we know the objects are different.
hash signals whether an object is mutable. This should be enough to raise an exception when comparing to other objects: at that specific time, the objects could be equal, but maybe later they are not.
Am I on the right track? Or am I missing an important piece of this?
Thank you
Now, what is the meaning of hash in this context, if, in order to check for uniqueness (e.g. when building a set), we still have to check each individual item in whatever iterables are there in the set anyway?
Yes, but the hash is used to make a conservative estimate of whether two objects can be equal, and it is also used to assign a "bucket" to an item. If the hash function is designed carefully, then it is likely (not a certainty) that most, if not all, items end up in a different bucket, and as a result the membership check/insertion/removal/... algorithms run on average in constant time, O(1), instead of the O(n) that is typical for lists.
So your first point is partly correct, although one has to take into account that the buckets definitely boost performance as well, and are actually more important than the conservative check.
Background
Note: I will use a simplified model here that makes the principle clear; in reality the implementation of a dictionary is more complicated. For example, the hashes here are just made-up numbers that illustrate the principle.
A hash set and a dictionary are implemented as an array of "buckets". The hash of an element determines in which bucket we store it. If the number of elements grows, then the number of buckets is increased, and the elements that are already in the dictionary are typically "reassigned" to the new buckets.
For example, an empty dictionary might look, internally, like:
+---+
| |
| o----> NULL
| |
+---+
| |
| o----> NULL
| |
+---+
So two buckets. Say we add an element 'a' whose hash is 123. Let us consider a simple algorithm to allocate an element to a bucket: since there are two buckets, we assign elements with an even hash to the first bucket and elements with an odd hash to the second bucket. Since the hash of 'a' is odd, we assign 'a' to the second bucket:
+---+
| |
| o----> NULL
| |
+---+
| | +---+---+
| o---->| o | o----> NULL
| | +-|-+---+
+---+ 'a'
So that means that if we now check whether 'b' is a member of the dictionary, we first calculate hash('b'), which is 456, and thus, had we added 'b' to the dictionary, it would be in the first bucket. Since the first bucket is empty, we never have to look at the elements in the second bucket to know for sure that 'b' is not a member.
If we then, for example, want to add 'c', we first calculate the hash of 'c', which is 789; this is odd again, so 'c' goes into the second bucket as well, for example:
+---+
| |
| o----> NULL
| |
+---+
| | +---+---+ +---+---+
| o---->| o | o---->| o | o----> NULL
| | +-|-+---+ +-|-+---+
+---+ 'c' 'a'
So now, if we again check whether 'b' is a member, we look at the first bucket, and again we never have to iterate over 'c' and 'a' to know for sure that 'b' is not a member of the dictionary.
Now of course one might argue that if we keep adding more characters like 'e' and 'g' (here we consider these to have an odd hash), then that bucket will get quite full, and thus if we later check whether 'i' is a member, we will still need to iterate over those elements. But as the number of elements grows, typically the number of buckets increases as well, and the elements already in the dictionary are reassigned to the new buckets.
For example, if we now want to add 'b' to the dictionary, the dictionary might note that the number of elements after insertion (3) is larger than the number of buckets (2), so we create a new array of buckets:
+---+
| |
| o----> NULL
| |
+---+
| |
| o----> NULL
| |
+---+
| |
| o----> NULL
| |
+---+
| |
| o----> NULL
| |
+---+
and we reassign the members 'a' and 'c'. Now all elements with a hash h such that h % 4 == 0 will be assigned to the first bucket, h % 4 == 1 to the second bucket, h % 4 == 2 to the third bucket, and h % 4 == 3 to the last bucket. So that means that 'a' with hash 123 will be stored in the last bucket, and 'c' with hash 789 will be stored in the second bucket, so:
+---+
| |
| o----> NULL
| |
+---+
| | +---+---+
| o---->| o | o----> NULL
| | +-|-+---+
+---+ 'c'
| |
| o----> NULL
| |
+---+
| | +---+---+
| o---->| o | o----> NULL
| | +-|-+---+
+---+ 'a'
we then add 'b' with hash 456 to the first bucket, so:
+---+
| | +---+---+
| o---->| o | o----> NULL
| | +-|-+---+
+---+ 'b'
| | +---+---+
| o---->| o | o----> NULL
| | +-|-+---+
+---+ 'c'
| |
| o----> NULL
| |
+---+
| | +---+---+
| o---->| o | o----> NULL
| | +-|-+---+
+---+ 'a'
So if we want to check the membership of 'a', we calculate its hash and know that if 'a' is in the dictionary, it has to be in the last bucket, and we will find it there. If we look for 'b' or 'c', the same happens (but with a different bucket). And if we look for 'd' (here with hash 14, so 14 % 4 == 2), we search the third bucket, which is empty, so we never have to check equality with a single element to know that 'd' is not part of the dictionary.
If we want to check if 'e' is a member, then we calculate the hash of 'e' (here 345), and search in the second bucket. Since that bucket is not empty, we start iterating over it.
For every element in the bucket (here there is only one), the algorithm first checks whether the key we search for and the key in the node refer to the same object (two different objects can, however, still be equal). Since this is not the case here, we cannot yet conclude that 'e' is in the dictionary, so we continue.
Next we compare the hash of the key we search for with the hash of the key in the node. Most dictionary implementations (CPython's dictionaries and sets as well, if I recall correctly) store the hash in the node as well. So here it checks whether 345 is equal to 789; since this is not the case, we know that 'c' and 'e' are not equal. If comparing the two objects is expensive, we can save some cycles this way.
If the hashes are equal, that does not mean that the elements are equal, so in that case we check whether the two objects are equivalent. If they are, we know that the element is in the dictionary; otherwise we move on to the next element in the bucket, and only once the bucket is exhausted do we know that it is not.
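To make the bucket mechanics above concrete, here is a minimal toy hash set in Python. This is only a sketch of the principle described above, not how CPython actually lays out its tables: it assigns items to buckets with hash(x) % number_of_buckets, doubles the number of buckets when the table fills up, and on lookup checks identity, then the stored hash, then full equality.

class ToyHashSet:
    def __init__(self, n_buckets=2):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0

    def _bucket(self, h):
        # stand-in for "even hash -> first bucket, odd hash -> second bucket"
        return self.buckets[h % len(self.buckets)]

    def add(self, item):
        if item in self:
            return
        if self.size + 1 > len(self.buckets):      # grow and reassign, as in the text
            old = [entry for bucket in self.buckets for entry in bucket]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            for h, stored in old:
                self._bucket(h).append((h, stored))
        h = hash(item)
        self._bucket(h).append((h, item))          # store the hash next to the item
        self.size += 1

    def __contains__(self, item):
        h = hash(item)
        for stored_hash, stored in self._bucket(h):
            if stored is item:                      # 1. identity check
                return True
            if stored_hash != h:                    # 2. cheap hash comparison
                continue
            if stored == item:                      # 3. full equality check
                return True
        return False

s = ToyHashSet()
for x in ('a', 'c', 'b'):
    s.add(x)
print('a' in s, 'e' in s)   # True False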
This is a high level overview of what happens when you want to find a value in a set (or a key in a dict). A hash table is a sparsely populated array, with its cells being called buckets or bins.
Good hashing algorithms aim to minimize the chance of hash collisions such that in the average case foo in my_set has time complexity O(1). Performing a linear scan (foo in my_list) over a sequence has time complexity O(n). On the other hand foo in my_set has complexity O(n) only in the worst case with many hash collisions.
A small demonstration (with timings done in IPython, copy-pasted from my answer here):
>>> class stupidlist(list):
...:     def __hash__(self):
...:         return 1
...:
>>> lists_list = [[i] for i in range(1000)]
>>> stupidlists_set = {stupidlist([i]) for i in range(1000)}
>>> tuples_set = {(i,) for i in range(1000)}
>>> l = [999]
>>> s = stupidlist([999])
>>> t = (999,)
>>>
>>> %timeit l in lists_list
25.5 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit s in stupidlists_set
38.5 µs ± 61.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit t in tuples_set
77.6 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
As you can see, the membership test in our stupidlists_set is even slower than a linear scan over the whole lists_list, while you have the expected super fast lookup time (factor 500) in a set without loads of hash collisions.
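As a quick sanity check (my own addition, not part of the timings above), giving the wrapper a content-based hash brings the lookup back in line with the tuple set, which confirms the slowdown comes purely from the hash collisions:

# sketch only: hashing a mutable object like this is safe only if it is
# never mutated after insertion; this is purely for demonstration
class hashablelist(list):
    def __hash__(self):
        return hash(tuple(self))

hashablelists_set = {hashablelist([i]) for i in range(1000)}
hl = hashablelist([999])
print(hl in hashablelists_set)   # True, and roughly as fast as the tuple lookup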

Why is FlatMap after GroupByKey in Apache Beam python so slow?

Given a relatively small data source (3,000-10,000 key/value pairs), I am trying to process only records which meet a group threshold (50-100). So the simplest method is to group them by key, filter and unwind - either with FlatMap or a ParDo. The largest group has only 1,500 records so far. But this seems to be a severe bottleneck in production on Google Cloud Dataflow.
Given a list like:
(1, 1)
(1, 2)
(1, 3)
...
(2, 1)
(2, 2)
(2, 3)
...
run through a set of transforms to filter and group by key:
p | 'Group' >> beam.GroupByKey()
  | 'Filter' >> beam.Filter(lambda (key, values): len(list(values)) > 50)
  | 'Unwind' >> beam.FlatMap(lambda (key, values): values)
Any ideas on how to make this more performant? Thanks for your help!
This is an interesting corner case for a pipeline. I believe that your issue here is in the way you read the data that comes from GroupByKey. Let me give you a quick summary of how GBK works.
What's GroupByKey, and how big data systems implement it
All big data systems implement ways to realize operations over multiple elements of the same key. This was called reduce in MapReduce, and in other big data systems is called Group By Key, or Combine.
When you do a GroupByKey transform, Dataflow needs to gather all the elements for a single key into the same machine. Since different elements for the same key may be processed in different machines, data needs to be serialized somehow.
This means that when you read data that comes from a GroupByKey, you are accessing the IO of the workers (i.e. not from memory), so you really want to avoid reading shuffle data too many times.
How this translates to your pipeline
I believe that your problem here is that Filter and Unwind will both read data from shuffle separately (so you will read the data for each list twice). What you want to do is to read your shuffle data only once. You can do this with a single FlatMap that both filters and unwinds your data without double-reading from shuffle. Something like this:
def unwind_and_filter(element):
    key, values = element
    # This consumes all the data from shuffle exactly once
    value_list = list(values)
    if len(value_list) > 50:
        yield from value_list

p | 'Group' >> beam.GroupByKey()
  | 'UnwindAndFilter' >> beam.FlatMap(unwind_and_filter)
Let me know if this helps.
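As a minimal local illustration (my own sketch, not from the original answer): the same idea on the toy data from the question, with a threshold of 2 so the small sample actually passes the filter, runnable with the DirectRunner:

import apache_beam as beam

def filter_and_unwind(element, threshold=2):
    key, values = element
    value_list = list(values)        # shuffle data is read only once here
    if len(value_list) > threshold:
        yield from value_list

with beam.Pipeline() as p:
    (p
     | beam.Create([(1, 1), (1, 2), (1, 3), (2, 1)])
     | beam.GroupByKey()
     | beam.FlatMap(filter_and_unwind)
     | beam.Map(print))              # prints 1, 2, 3 (key 2 is filtered out)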

Pythonic Way to have multiple Or's when conditioning in a dataframe

Basically instead of writing
data[data['pos'] == "QB" | data['pos'] == "DST" | ...]
where there are many cases I want to check
I was trying to do something similar to this question: What's the pythonic method of doing multiple ors? However, this
data[data['pos'] in ("QB", "DST", ...)]
doesn't work.
I read the documentation here http://pandas.pydata.org/pandas-docs/stable/gotchas.html but I'm still having issues.
What you are looking for is Series.isin. Example:
data[data['pos'].isin(("QB", "DST", ...))]
This checks whether each value from pos is in the collection of values ("QB", "DST", ...), similar to what your chain of | conditions would do.
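A small, self-contained sketch (the data frame here is made up to mirror the question):

import pandas as pd

data = pd.DataFrame({'pos': ['QB', 'RB', 'DST', 'WR'],
                     'pts': [20, 12, 8, 15]})

# boolean mask: True where 'pos' is one of the wanted positions
print(data[data['pos'].isin(["QB", "DST"])])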

Format strings to make 'table' in Python 3

Right now I'm using print(), calling the variables I want that are stored in a tuple and then formatting them using: print(format(x,"<10s")+ format(y,"<40s")...) but this gives me output that isn't aligned in columns. How do I make it so that each row's elements are aligned?
So, my code is for storing student details. First, it takes a string and returns a tuple, with constituent parts like: (name,surname,student ID, year).
It reads these details from a long text file on student details, and then it parses them through a tuplelayout function (the bit which will format the tuple) and is meant to tabulate the results.
So, the argument for the tuplelayout function is a tuple, of the form:
surname | name | reg number | course | year
If you are unpacking tuples, just use a single str.format and justify the output as required using the format specification mini-language:
l = [(10, 1000), (200, 20000)]
for x, y in l:
    print("{:<3} {:<6}".format(x, y))

10  1000
200 20000
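Applied to the student tuples from the question, a sketch could look like this (the column widths here are arbitrary assumptions):

students = [
    ("Smith", "Alice", "S1234567", "Computer Science", 2),
    ("Ng", "Bob", "S7654321", "Maths", 1),
]

def tuplelayout(record):
    # surname | name | reg number | course | year, each left-justified
    print("{:<12} {:<10} {:<10} {:<20} {:<4}".format(*record))

for student in students:
    tuplelayout(student)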
My shell had its font settings changed, so the alignment was off. Back to the "Courier" font and everything is working fine.
Sorry.

Remove Rows that Contain a specific Value anywhere in the row (Pandas, Python 3)

I am trying to remove all rows in a pandas DataFrame that contain the symbol "+" anywhere in the row. So ideally this:
Keyword
+John
Mary+Jim
David
would become
Keyword
David
I've tried doing something like this in my code but it doesn't seem to be working.
excluded = ('+')
removal2 = removal[~removal['Keyword'].isin(excluded)]
The problem is that sometimes the + is contained within a word, at the beginning of a word, or at the end. Any ideas how to help? Do I need to use an index function? Thank you!
Use the vectorised str method contains and pass the '+' character (escaped, since '+' is a regex metacharacter), then negate the boolean condition using ~:
In [29]:
df[~df.Keyword.str.contains(r'\+')]
Out[29]:
  Keyword
2   David
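For completeness, a self-contained sketch reproducing the frame from the question; passing regex=False avoids having to escape the '+' at all:

import pandas as pd

removal = pd.DataFrame({'Keyword': ['+John', 'Mary+Jim', 'David']})

# '+' is a regex metacharacter, so either escape it or disable regex matching
removal2 = removal[~removal['Keyword'].str.contains('+', regex=False)]
print(removal2)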
