I have two tables, one of them relatively small (~8 million rows, one column) and the other large (173 million rows, one column). The index of the first DataFrame is an IntervalIndex (e.g. (0, 13], (13, 20], (20, 23], ...) and the second one is indexed by ordered numbers (1, 2, 3, ...). Both DataFrames are sorted, so:
DF1        category
(0, 13]    1
(13, 20]   2
...

DF2    value
1      5.2
2      3.4
3      7.8

Desired DF3:
index  value  category
1      5.2    1
2      3.4    1
3      7.8    1
I want to obtain an inner join (with a fast algorithm) that returns a result similar to a MySQL inner join on data_frame2.index.
I would like to be able to perform it in a scalable way on a cluster, because when I produced the inner join with a smaller second dataset using map_partitions, the result was extremely memory-consuming: imagine 105 MB for 10 rows.
Another problem is that I cannot use scatter twice: if I first do DaskDF = client.scatter(dataframe2) followed by DaskDF = client.submit(fun1, DaskDF), I am unable to do something like client.submit(fun2, DaskDF).
You might try using smaller partitions. Recall that the memory use of a join depends on how many shared rows there are. Depending on your data, the memory use of an output partition may be much larger than the memory use of your input partitions.
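As a rough sketch only (toy frames mimicking the shapes you describe, with all names as placeholders), combining many small partitions with the map_partitions approach you mention might look like this:

import pandas as pd
import dask.dataframe as dd

# Toy stand-ins for the real data: df1 has an IntervalIndex and a
# category column, df2 has an ordered integer index and a value column.
df1 = pd.DataFrame({'category': [1, 2]},
                   index=pd.IntervalIndex.from_breaks([0, 13, 20]))
df2 = pd.DataFrame({'value': [5.2, 3.4, 7.8]}, index=[1, 2, 3])

# Many small partitions keep each output chunk's memory footprint modest.
ddf2 = dd.from_pandas(df2, npartitions=8)

def attach_category(part):
    # IntervalIndex.get_indexer returns, for each integer index value,
    # the position of the interval containing it (-1 if none); dropping
    # the -1 rows gives inner-join semantics.
    pos = df1.index.get_indexer(part.index)
    keep = pos >= 0
    out = part[keep].copy()
    out['category'] = df1['category'].to_numpy()[pos[keep]]
    return out

result = ddf2.map_partitions(attach_category)
print(result.compute())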
I have two dataframes. One is very large with over 4 million rows of data, while the other has about 26k. I'm trying to create a dictionary where the keys are the strings of the smaller dataframe. This dataframe (df1) contains substrings or incomplete names, and the larger dataframe (df2) contains full names/strings; I want to check whether the substring from df1 is in the strings in df2 and then create my dict.
No matter what I try, my code takes a long time, and I keep looking for faster ways to iterate through the dataframes.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        # collect every full name that contains this part substring
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
part_name
0 aaa
1 bb
2 856
3 cool
4 man
5 a0
df2
full_name
0 aaa35688d
1 coolbbd
2 8564578
3 coolaaa
4 man4857684
5 a03567
expected output:
{'aaa':['aaa35688d','coolaaa'],
'bb':['coolbbd'],
'856':['8564578']
...}
etc
The issue here is that nested for loops perform very badly time-wise as the data grows larger. Luckily, pandas allows us to perform vectorised operations across rows/columns.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr)].tolist() for substr in df1.part_name}
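One caveat: str.contains treats the pattern as a regular expression by default, so if any part names contain regex metacharacters, a plain substring match may be safer (same column names as above assumed):

org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
            for substr in df1.part_name}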
I am working with a huge volume of data, around 50 million rows.
I want to find the unique column values across multiple columns. I use the script below:
dataAll[['Frequency', 'Period', 'Date']].drop_duplicates()
But this is taking a long time, more than 40 minutes.
I found an alternative:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
but the script above returns an array, whereas I need a DataFrame like the one the first script returns.
Generally, your new code is impossible to convert to a DataFrame, because:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
creates one big 1d numpy array, so after removing duplicates it is impossible to recreate the rows.
E.g. if there are 2 unique values, 3 and 1, it is impossible to find which datetimes belong to 3 and which to 1.
But if there is only one unique value of Frequency, and for each Period it is possible to find the Date as in your sample, then a solution is possible.
EDIT:
One possible alternative is to use dask.dataframe.DataFrame.drop_duplicates.
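A minimal sketch of that alternative, assuming dataAll is a regular pandas DataFrame with these three columns (the partition count is only a starting point to tune):

import dask.dataframe as dd

# Partition the big frame so drop_duplicates can run in parallel /
# out-of-core, then bring the (much smaller) result back to pandas.
ddf = dd.from_pandas(dataAll, npartitions=16)
unique_rows = (ddf[['Frequency', 'Period', 'Date']]
               .drop_duplicates()
               .compute())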
I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each element appearing in the pipe-delimited lists, I want to create a new column that acts as an indicator variable. For instance, the row
...'Parts'...
...'12|34|56'
should be transformed to
...'Part_12' 'Part_34' 'Part_56'...
...1 1 1...
Because there are a lot of unique parts, these columns are obviously going to be sparse (mostly zeros), since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data
ID YEAR AMT PARTZ
1202 2007 99.34
9321 1988 1012.99 2031|8942
2342 2012 381.22 1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 1 1 1
Edit for comment and counting duplicates.
df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|',expand=True).stack()).sum(level=0).add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 2 1 1
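On newer pandas versions, where Series.sum(level=...) is no longer available, an equivalent (assuming the same toy frame) would be:

import pandas as pd

df = pd.DataFrame({'Parts': ['12|34|56|12']}, index=[0])
counts = (pd.get_dummies(df.Parts.str.split('|', expand=True).stack())
          .groupby(level=0).sum()    # replaces the deprecated .sum(level=0)
          .add_prefix('Part_'))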
I want to split a Spark dataframe into two pieces and define the row number for each of the sub-dataframes. But I found that the function monotonically_increasing_id still defines the row numbers from the original dataframe.
Here is what I did in python:
# df is the original sparkframe
splits = df.randomSplit([7.0,3.0],400)
# add column rowid for the two subframes
set1 = splits[0].withColumn("rowid", monotonically_increasing_id())
set2 = splits[1].withColumn("rowid", monotonically_increasing_id())
# check the results
set1.select("rowid").show()
set2.select("rowid").show()
I would expect the first five elements of rowid for the two frames to both be 1 to 5 (or 0 to 4, I can't remember clearly):
set1: 1 2 3 4 5
set2: 1 2 3 4 5
But what I actually got is:
set1: 1 3 4 7 9
set2: 2 5 6 8 10
The two subframes' row ids are actually their row ids in the original Spark dataframe df, not new ones.
As a Spark newbie, I am seeking help on why this happened and how to fix it.
First of all, what version of Spark are you using? The monotonically_increasing_id implementation has been changed a few times. I can reproduce your problem in Spark 2.0, but it seems the behavior is different in Spark 2.2, so it could be a bug that was fixed in a newer Spark release.
That being said, you should NOT expect the values generated by monotonically_increasing_id to increase consecutively. In your code, I believe there is only one partition for the dataframe. According to http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
So your code should not expect the rowid values to increase consecutively.
Besides, you should also consider caching and failure scenarios. Even if monotonically_increasing_id worked the way you expect and increased the values consecutively, it still would not work. What if your node fails? The partitions on the failed node will be regenerated from the source or the last cache/checkpoint, which might have a different order and thus different rowids. Eviction from cache also causes issues: suppose that after generating a dataframe you cache it in memory and it is later evicted; a future action will try to regenerate the dataframe again, thus giving out different rowids.
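As a small illustrative sketch of the documented behavior (not your actual data; the partition count is arbitrary), you can see the IDs jump between partitions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# Six rows spread over two partitions: IDs within a partition are
# consecutive, but the partition number is encoded in the upper bits,
# so the values jump between partitions.
df = spark.range(6).repartition(2)
df.withColumn("rowid", monotonically_increasing_id()).show()
# Depending on how rows land, you get something like 0, 1, 2 in one
# partition and 8589934592, 8589934593, 8589934594 in the other.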
I have a decent sized dataframe (roughly: df.shape = (4000000, 2000)) that I want to use .groupby().max() on. Unfortunately, neither my laptop nor the server I have access to can do this without throwing a MemoryError (the laptop has 16G of RAM, the server has 64). This is likely due to the datatypes in a lot of the columns. Right now, I'm treating those as fixed and immutable (many, many dates, large integers, etc.), though perhaps that could be part of the solution.
The command I would like is simply new_df = df.groupby('KEY').max().
What is the most efficient way to break down this problem to prevent running into memory problems? Some things I've tried, to varying success:
Break the df into subsets and run .groupby().max() on those subsets, then concatenate the results (roughly the approach sketched after this list). Issue: the size of the full df can vary, and is likely to grow. I'm not sure of the best way to break the df apart so that the subsets are definitely not going to throw the MemoryError.
Include a subset of columns on which to run the .groupby() in a new df, then merge this with the original. Issue: the number of columns in this subset can vary (much smaller or larger than current), though the names of the columns all include the prefix ind_.
Look for out-of-memory management tools. So far, I've not found anything useful.
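For reference, a rough sketch of the chunking approach from the first item above (key name and chunk size are placeholders):

import pandas as pd

def chunked_groupby_max(df, key='KEY', chunk_rows=500_000):
    # Group each slice of rows separately, then group the concatenated
    # partial results again so KEYs that span chunk boundaries collapse
    # into a single row with the overall max.
    partials = [df.iloc[start:start + chunk_rows].groupby(key).max()
                for start in range(0, len(df), chunk_rows)]
    return pd.concat(partials).groupby(level=0).max()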
Thanks for any insight or help you can provide.
EDIT TO ADD INFO
The data is for a predictive modeling exercise, and the number of columns stems from making binary variables from a column of discrete (non-continuous/categorical) values. The df will grow in column size if another column of values goes through the same process. Also, the data is originally pulled from a SQL query; the set of items the query finds is likely to grow over time, meaning the number of rows will grow and, since the number of distinct values in one or more columns may also grow, so will the number of columns after making binary indicator variables. The data pulled goes through extensive transformation and gets merged with other datasets, making it implausible to run the grouping in the database itself.
There are repeated observations by KEY, which have the same values except for the column I turn into indicators (this is just to show shape/value samples; the actual df has dates, integers 16 digits or longer, etc.):
KEY Col1 Col2 Col3
A 1 blue car
A 1 blue bike
A 1 blue train
B 2 green car
B 2 green plane
B 2 green bike
This should become, with dummied Col3:
KEY Col1 Col2 ind_car ind_bike ind_train ind_plane
A 1 blue 1 1 1 0
B 2 green 1 1 0 1
So the .groupby('KEY') gets me the groups, and the .max() gets the new indicator columns with the right values. I know the .max() process might be bogged down by getting a "max" for a string or date column.
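Purely as an illustrative toy reproduction of the transformation described above (not the real data):

import pandas as pd

# Repeated observations per KEY, with Col3 turned into ind_* indicator
# columns before grouping; .max() then collapses each KEY to one row.
df = pd.DataFrame({'KEY':  ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Col1': [1, 1, 1, 2, 2, 2],
                   'Col2': ['blue'] * 3 + ['green'] * 3,
                   'Col3': ['car', 'bike', 'train', 'car', 'plane', 'bike']})

dummied = pd.concat([df.drop(columns='Col3'),
                     pd.get_dummies(df['Col3'], prefix='ind', dtype=int)],
                    axis=1)
new_df = dummied.groupby('KEY').max()
print(new_df)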