Pyspark 2: reset monotonically_increasing_id from 1 (Python)

I want to split a Spark dataframe into two pieces and define a row number for each of the sub-dataframes, but I found that monotonically_increasing_id still defines the row numbers based on the original dataframe.
Here is what I did in python:
from pyspark.sql.functions import monotonically_increasing_id

# df is the original Spark dataframe
splits = df.randomSplit([7.0, 3.0], 400)
# add a rowid column to each of the two subframes
set1 = splits[0].withColumn("rowid", monotonically_increasing_id())
set2 = splits[1].withColumn("rowid", monotonically_increasing_id())
# check the results
set1.select("rowid").show()
set2.select("rowid").show()
I expected the first five elements of rowid in both frames to be 1 to 5 (or 0 to 4, I can't remember exactly):
set1: 1 2 3 4 5
set2: 1 2 3 4 5
But what I actually got is:
set1: 1 3 4 7 9
set2: 2 5 6 8 10
The row ids of the two subframes are actually their row ids in the original dataframe df, not row ids within the new subframes.
As a newbie to Spark, I am seeking help on why this happened and how to fix it.

First of all, what version of Spark are you using? The implementation of monotonically_increasing_id has changed a few times. I can reproduce your problem in Spark 2.0, but the behavior seems different in Spark 2.2, so it could be a bug that was fixed in a newer Spark release.
That said, you should NOT expect the values generated by monotonically_increasing_id to increase consecutively. In your code, I believe there is only one partition for the dataframe. According to http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html:
The generated ID is guaranteed to be monotonically increasing and
unique, but not consecutive. The current implementation puts the
partition ID in the upper 31 bits, and the record number within each
partition in the lower 33 bits. The assumption is that the data frame
has less than 1 billion partitions, and each partition has less than 8
billion records.
As an example, consider a DataFrame with two partitions, each with 3
records. This expression would return the following IDs: 0, 1, 2,
8589934592 (1L << 33), 8589934593, 8589934594.
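As a quick illustration of that documented example (a minimal sketch, assuming a local SparkSession and a 6-row DataFrame laid out over two partitions; the exact ids depend on how the rows land in the partitions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# Two partitions of three records each: expect ids 0, 1, 2 in the first
# partition and 8589934592, 8589934593, 8589934594 in the second.
spark.range(6, numPartitions=2) \
     .withColumn("gen_id", monotonically_increasing_id()) \
     .show()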
So your code should not expect the rowid values to increase consecutively.
Besides, you should also consider caching and failure scenarios. Even if monotonically_increasing_id worked the way you expect -- increasing the value consecutively -- it still would not be reliable. What if a node fails? The partitions on the failed node will be regenerated from the source or from the last cache/checkpoint, which might have a different order and thus different rowids. Eviction from cache causes the same issue: suppose you generate a dataframe and cache it in memory, and it later gets evicted. A future action will regenerate the dataframe, again producing different rowids.
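If what you actually need is a consecutive, 1-based row number within each subframe, a common workaround is row_number over a window (a sketch only, reusing the splits from the question; note that a Window without a partitioning column pulls all rows into a single partition for the ordering, so Spark will warn about it and it can be slow on large data):
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Order by a freshly generated monotonic id to preserve the current row order,
# then assign a consecutive 1-based row number within each subframe.
w = Window.orderBy(monotonically_increasing_id())
set1 = splits[0].withColumn("rowid", row_number().over(w))
set2 = splits[1].withColumn("rowid", row_number().over(w))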

Related

Inner join between two large pandas dataframes using dask

I have two large tables; one of them is relatively small, ~8 million rows with one column, and the other is large, 173 million rows with one column. The index of the first dataframe is an IntervalIndex (e.g. (0,13], (13,20], (20,23], ...) and the second one uses ordered numbers (1, 2, 3, ...). Both DataFrames are sorted, so
DF1        category
(0, 13]    1
(13, 20]   2
...

DF2    Value
1      5.2
2      3.4
3      7.8

Desired:

DF3
index    value    category
1        5.2      1
2        3.4      1
3        7.8      1
I want to obtain an inner join (with a fast algorithm) that returns the same result as a MySQL inner join on data_frame2.index.
I would like to be able to perform it on a cluster, because when I produced the inner join with a smaller second dataset the result was extremely memory consuming: imagine 105 megabytes for 10 rows using map_partitions.
Another problem is that I cannot use scatter twice: if I first do DaskDF=client.scatter(dataframe2) followed by DaskDF=client.submit(fun1,DaskDF), I am unable to do something like client.submit(fun2,DaskDF).
You might try using smaller partitions. Recall that the memory use of a join depends on how many shared rows there are; depending on your data, the memory use of an output partition may be much larger than the memory use of your input partitions.
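A minimal sketch of that suggestion (assuming both tables are already dask dataframes, here called ddf1 and ddf2, and that the interval lookup has already been turned into an ordinary shared column named "key"; those names and the partition counts are placeholders to adapt):
import dask.dataframe as dd

# Repartition into many small pieces so each output partition of the join stays small.
ddf1_small = ddf1.repartition(npartitions=200)
ddf2_small = ddf2.repartition(npartitions=2000)

# Ordinary inner merge on the shared key column.
joined = dd.merge(ddf1_small, ddf2_small, on="key", how="inner")
result = joined.compute()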

How to apply an XNOR operation to rows from two dataframes

I have been reading documentation for a few hours now and I feel I am approaching the problem with the wrong mindset.
I have two tables in Hive which I read with spark.table(table_A). They have the same number and types of columns, but different origins, so their data is different. Both tables reflect flags that show whether or not a condition is met. There are at least 20 columns, and they could increase in the future.
If the first row of table_A is 0 0 1 1 0 and the corresponding row of table_B is 0 1 0 1 0, I would like the result to be the XNOR of the two, comparing positions: 1 0 0 1 1, since the values match in the first, fourth and fifth positions.
So I thought of the XNOR operation, which returns a 1 if both values match and a 0 otherwise.
I am facing a number of problems, one of them being the volume of my data (right now I am working with a one-week sample and it is already at the 300 MB mark), so I am working with pyspark and avoiding pandas, since the data usually does not fit in memory and/or it slows the operation down a lot.
Summing up, I have two objects of type pyspark.sql.dataframe.DataFrame, each holding one of the tables, and so far the best I've got is something like this:
df_bitwise = df_flags_A.flag_column_A.bitwiseXOR(df_flags_B.flag_columns_B)
But sadly this returns a pyspark.sql.column.Column, and I do not know how to read that result or how to build a dataframe from it (I would like the end result to be roughly 20 repetitions of the above operation, one per column, each forming a column of a dataframe).
What am I doing wrong? I feel like this is not the right approach.
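One way to turn those per-column expressions into an actual dataframe is a join plus one when/otherwise column per flag (a sketch only; the question does not say how rows of the two tables are paired, so the join key id below is a hypothetical assumption, as is the assumption that the flag columns share names across both tables):
from pyspark.sql import functions as F

df_a = spark.table("table_A").alias("a")
df_b = spark.table("table_B").alias("b")

# Hypothetical row key; the rows must be paired somehow for a positional comparison.
joined = df_a.join(df_b, on="id")

flag_cols = [c for c in df_a.columns if c != "id"]

# XNOR per flag column: 1 when both sides agree, 0 otherwise.
result = joined.select(
    "id",
    *[F.when(F.col("a." + c) == F.col("b." + c), 1).otherwise(0).alias(c)
      for c in flag_cols]
)
result.show()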

What is the Python syntax for accessing specific data in a function call?

I am not generally a python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1, i] = liteClient.compare(questions.iloc[0, i], questions.iloc[:, 1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3){
    df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (multiply) every question A with every question B:
import numpy as np
import pandas as pd

questions = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["question_A", "question_B"])
This gives:
question_A question_B
0 0 1
1 2 3
2 4 5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0] * questions['question_B'])
results = questions.apply(compare, axis=1)
That gives us:
0 1 2
0 0 0 0
1 2 6 10
2 4 12 20
As you pointed out in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
Based on what you've posted so far, here are some issues with what you've written, which are understandable given your R programming background:
for i in range(1, 3):
In Python 3.x, this creates a range object, which you can think of as a special kind of iterable that generates a sequence of numbers with a given step size (the default is 1), excluding the upper bound. Additionally, you need to know that most programming languages, including Python, index starting at zero, not one.
What this range object does here is generate only the sequence 1, 2.
The arrays you are indexing with i will therefore not be indexed over all of their indices. What I believe you want is something like:
for i in range(3):
Notice how there is only one number here; it is the exclusive maximum of the range, with 0 as the inclusive minimum, so this generates the sequence 0, 1, 2. If you have an array of size 3, this covers all possible indices for that array.
This next line is a bit confusing to me, since I'm not familiar with R, but I think I understand what you were trying to do. If I understand correctly, you are comparing two columns of 3 questions each, comparing each question in column 1 to every question in column 2, resulting in a 3 x 3 matrix of comparison results that you want to store in results. Assuming the sizes are already correct (i.e. results is 3x3), I'd like to point out some peculiarities in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1, i] you are indexing by row and column: i-1 is the row and i is the column. Without changing range(1, 3), this accesses only the positions (0, 1) and (1, 2) and nothing else. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do; that may not be the case, however -- I don't know what object you are calling that method on, so I don't know where its documentation lives. Assuming it does return a list of size 3 or a dataframe, you'll need to change the way you assign the data, to this:
results.iloc[i,:] = ...
What happens here is that iloc[...] takes a row position and a slice: you assign all the columns of the results matrix at that row to the values returned by compare. Combined with the change to the for statement, this iterates over all rows of the dataframe.
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you iterate over each column of the first row of questions and compare it to the entire second column (all rows) of questions.
I believe what you will want to do is change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does is, for each i in 0, 1, 2, take the question at row i of column 0 and compare it to every row of column 1. If your questions dataframe really is organized as 2 columns and 3 rows, this should work; otherwise you will need to change how you create questions as well.
In all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i, :] = liteClient.compare(questions.iloc[i, 0], questions.iloc[:, 1])
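For a self-contained check of the indexing pattern, here is a runnable sketch that uses difflib.SequenceMatcher as a stand-in for liteClient.compare (liteClient itself is not shown in the question, so the similarity function and the sample questions below are placeholders):
import difflib
import pandas as pd

questions = pd.DataFrame({
    "yr1questions": ["How old are you?", "Where do you live?", "What is your job?"],
    "yr2questions": ["What is your age?", "Where is your home?", "What do you do for work?"],
})

# 3x3 results frame: row i holds the similarity of year-1 question i to every year-2 question.
results = pd.DataFrame(index=range(3), columns=range(3), dtype=float)

def similarity(a, b):
    # Placeholder for liteClient.compare(a, b).
    return difflib.SequenceMatcher(None, a, b).ratio()

for i in range(3):
    results.iloc[i, :] = [similarity(questions.iloc[i, 0], q) for q in questions["yr2questions"]]

print(results)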

How can I efficiently run groupby() over subsets of a dataframe to avoid MemoryError

I have a decent sized dataframe (roughly: df.shape = (4000000, 2000)) that I want to use .groupby().max() on. Unfortunately, neither my laptop nor the server I have access to can do this without throwing a MemoryError (the laptop has 16 GB of RAM, the server has 64 GB). This is likely due to the datatypes in a lot of the columns. Right now, I'm treating those as fixed and immutable (many, many dates, large integers, etc.), though perhaps that could be part of the solution.
The command I would like is simply new_df = df.groupby('KEY').max().
What is the most efficient way to break down this problem to prevent running into memory problems? Some things I've tried, with varying success:
Break the df into subsets and run .groupby().max() on those subsets, then concatenate. Issue: the size of the full df can vary and is likely to grow; I'm not sure of the best way to break the df apart so that the subsets definitely will not throw the MemoryError. (See the sketch at the end of this question.)
Include a subset of columns on which to run the .groupby() in a new df, then merge this back with the original. Issue: the number of columns in this subset can vary (much smaller or larger than it currently is), though the names of these columns all share the prefix ind_.
Look for out-of-memory management tools. As of yet, I've not found anything useful.
Thanks for any insight or help you can provide.
EDIT TO ADD INFO
The data is for a predictive modeling exercise, and the number of columns stems from making binary indicator variables out of a column of discrete (non-continuous/categorical) values. The df will grow in column count if another column of values goes through the same process. Also, the data is originally pulled from a SQL query; the set of items the query finds is likely to grow over time, meaning the number of rows will grow, and (since the number of distinct values in one or more columns may grow) so will the number of columns after making the binary indicator variables. The data pulled goes through extensive transformation and gets merged with other datasets, making it implausible to run the grouping in the database itself.
There are repeated observations by KEY, which have the same values except for the column I turn into indicators (this is just to show the shape/value pattern; the actual df has dates, integers 16 digits or longer, etc.):
KEY  Col1  Col2   Col3
A    1     blue   car
A    1     blue   bike
A    1     blue   train
B    2     green  car
B    2     green  plane
B    2     green  bike
This should become, with dummied Col3:
KEY  Col1  Col2   ind_car  ind_bike  ind_train  ind_plane
A    1     blue   1        1         1          0
B    2     green  1        1         0          1
So the .groupby('KEY') gets me the groups, and the .max() gets the new indicator columns with the right values. I know the .max() step might be bogged down by computing a "max" for string or date columns.
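A sketch of the first idea from the list above (chunked groupby followed by a final regroup; this is valid because max is associative, so the max of per-chunk maxima equals the global max -- the chunk size is just a placeholder to tune against available memory):
import pandas as pd

def grouped_max_in_chunks(df, key="KEY", chunk_rows=500_000):
    # Take .groupby(key).max() on row-wise chunks, then reduce the partial
    # results with one final (much smaller) groupby-max.
    partials = []
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        partials.append(chunk.groupby(key).max())
    return pd.concat(partials).groupby(level=0).max()

# new_df = grouped_max_in_chunks(df)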

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique and will have repeats of the key).
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not).
In [bad] pseudocode:
for each key in A:
resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, with r rows and c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key against another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev., variance, etc.) and then determine whether the corresponding element in A falls within that range.
You're absolutely right that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's a better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
import pandas as pd

df = pd.DataFrame([[1,2,3,4,1,2,3,4],
                   [28,15,13,11,12,23,21,15],
                   ['keyA','keyB','keyC','keyD','keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
   SEQ  VAL   KEY
0    1   28  keyA
1    2   15  keyB
2    3   13  keyC
3    4   11  keyD
4    1   12  keyA
5    2   23  keyB
6    3   21  keyC
7    4   15  keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets with:
loop_iter = len(A) // max(A['SEQ_NUM'])   # integer division so range() gets an int
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), then this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1) how to split on keys iteratively (i.e. accessing a DF one row at a time), and 2) how to match a DF and do summary statistics, etc., on the DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its simplest is that you have a pd.Series of values (i.e. a["key"], which let's just call keys) that corresponds to rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.

    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass

# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the key value and applies the descriptive func to each sub-df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, summary consists of the returned Series as rows; and if the applied function returns a DataFrame, the result is a multi-index DataFrame. This is a core pattern in Pandas, and there's a whole, whole lot to explore here.
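As a concrete usage sketch (the SEQ/VAL/KEY layout is taken from the question; the summary statistics and the "in range" check follow the prose above, so treat the exact column names and values as illustrative):
import pandas as pd

a = pd.DataFrame({"SEQ": [1, 2, 3], "VAL": [14, 30, 13], "KEY": ["keyA", "keyB", "keyC"]})
b = pd.DataFrame({"SEQ": [1, 2, 3, 4, 1, 2, 3, 4],
                  "VAL": [28, 15, 13, 11, 12, 23, 21, 15],
                  "KEY": ["keyA", "keyB", "keyC", "keyD", "keyA", "keyB", "keyC", "keyD"]})

def descriptive_func(df):
    # Per-key summary of the matching rows in b.
    return pd.Series({"min": df["VAL"].min(), "max": df["VAL"].max(), "std": df["VAL"].std()})

keys = a["KEY"]
summary = b[b["KEY"].isin(set(keys))].groupby("KEY").apply(descriptive_func)

# Check whether each row of a falls inside the [min, max] range of its key's group in b.
checked = a.join(summary, on="KEY")
checked["in_range"] = checked["VAL"].between(checked["min"], checked["max"])
print(checked)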
