I've implemented a fuzzy string matching algorithm between two dataframes using pure pandas. My issue is how to convert this into a Dask operation that uses multiple cores. My program takes about 3-4 days to run in pure Python, and I want to parallelize the operations to cut that time down. I've already used the multiprocessing package to get the number of cores with the code below:
import multiprocessing
import pandas as pd

numCores = multiprocessing.cpu_count()
fields = ['id', 'phase', 'new']
emb = pd.read_csv('my_csv.csv', skipinitialspace=True, usecols=fields)
Then I subdivide the dataframe emb into two dataframes (emb1, emb2) based on the numeric value associated with each string: rows with phase 3.0 in one dataframe are matched by string against rows with phase 2.0 (or 1.5) in the other. The pure pandas code is below.
emb1 = emb[emb.phase.isin([3.0])]
emb1.set_index('id',inplace=True)
emb2 = emb[emb.phase.isin([2.0,1.5])]
emb2.set_index('id',inplace=True)
from fuzzywuzzy import fuzz, process

def fuzzy_match(x, choices, scorer, cutoff):
    return process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)

FuzzyWuzzyResults = pd.DataFrame(
    emb1.sort_index().loc[:, 'strings'].apply(
        fuzzy_match, args=(emb2.loc[:, 'strings'], fuzz.ratio, 90)))
I made a rough attempt at a Dask implementation using this code:
import dask.dataframe as dd

emb1 = dd.from_pandas(emb1, npartitions=numCores)
emb2 = dd.from_pandas(emb2, npartitions=numCores)
But running the lambda function for two dataframes is confusing me. Any ideas?
So I fixed my code to remove the manual partitioning of the dataframe and used groupby instead.
Here's the code:
for i in [2.0, 1.5]:
    FuzzyWuzzyResults = emb.map_partitions(
        lambda df: df.groupby('phase').get_group(3.0)['drugs'].apply(
            fuzzy_match,
            args=(df.groupby('phase').get_group(i)['drugs'], fuzz.ratio, 90)),
        meta=('results')).compute()
I'm not sure whether it's accurate, but at least it runs, and it runs on all CPU cores.
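For comparison, here is a minimal sketch of a more direct route that avoids the nested groupby calls, assuming emb2 is small enough to be shipped to every worker as a plain pandas Series (column names follow the pandas version above, so adjust 'strings' as needed):
import dask.dataframe as dd
from fuzzywuzzy import fuzz, process

# keep the choices in memory as a pandas Series so every partition can use them
choices = emb2['strings']

demb1 = dd.from_pandas(emb1, npartitions=numCores)

FuzzyWuzzyResults = demb1['strings'].map_partitions(
    lambda s: s.apply(fuzzy_match, args=(choices, fuzz.ratio, 90)),
    meta=('strings', 'object'),
).compute(scheduler='processes')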
Related
Question and Problem statement
I have data coming from two sources. Each source contains groups identified by an ID column, coordinates, and attributes. I would like to process this data by first matching these groups, then finding nearest neighbours within these groups, and then studying how the attributes from different sources compare between the neighbours. My learning challenge was how to process this data using parallel processing.
Question is: "Using Dask for parallel processing, what might be the simplest and most straightforward way to process this kind of data?"
Background and my solution thus far
The data is in CSV files like dummy data below (real files are in the 100 MiB range):
source1.csv:
ID,X_COORDINATE,Y_COORDINATE,ATTRIB1,PARAM1
B,-63802.84728184705,-21755.63629150563,3,36.136464492674556
B,-63254.41147034371,405.6973789009853,1,18.773534321367528
A,-9536.906537069272,32454.934987740824,0,14.043507555168809
A,15250.802157581298,-40868.390394552596,0,6.680542212635015
source2.csv:
ID,X_COORDINATE,Y_COORDINATE,ATTRIB1,PARAM1
B,-6605.150024790153,39733.35763934722,3,5.599467583303852
B,53264.28797042654,24647.24183964514,0,27.938127686688162
A,6690.836682554512,34643.0606728128,0,10.02914141165683
A,15243.16,-40954.928,0,18.130371948545935
What I would like to do is to
Load the data into dataframes
Split them into groups by ID column
For each group in source1 and source2, let's call the sub-dataframes in each group source1_sub and source2_sub
construct kd-tree objects k1 and k2 based on columns X_COORDINATE and Y_COORDINATE
For each pair of objects (k1, k2)
find nearest neighbours for the trees
construct three dataframes:
matches_sub: containing the matched rows in source1_sub and source2_sub
source1_sub_only: rows in source1_sub which are not matched
source2_sub_only: rows in source2_sub which are not matched
Concatenate all matches_sub, source1_sub_only, and source2_sub_only dataframes into three dataframes: matches, source1_only, source2_only
Analyze these dataframes
This is a problem that should parallelize beautifully, since each pair of groups is independent of the other pairs. I decided to use scipy.spatial.cKDTree for the actual coordinate matching, but the difficulty arises from the fact that it operates on positional indices into the raw numpy arrays, which isn't straightforwardly compatible with how Dask arrays can be accessed. At least that's my understanding.
My first, futile attempts were rather awkward:
Trying to use two Dask dataframes, aligning them and finding matches. This was dreadfully slow and hard to understand.
Reading the data with Dask DataFrame and processing it with Dask Bag. This was slightly less complex but still not satisfactory.
Answering my own question, the simplest approach I could think of is to
Read data from sources 1 and 2 into dataframes df_source1 and df_source2, using dask.dataframe.read_csv.
Upon reading, assign a new column SOURCE to these dataframes to identify the source. The groups I'm interested in are now specified by the columns ID and SOURCE, which can be used for grouping.
Concatenate these dataframes into a new dataframe: df = dd.concat([df_source1, df_source2], axis=0)
Group the data by the columns ID and SOURCE, and use apply to find matches.
Analyze the data.
Done.
Something along the lines of:
import dask.dataframe as dd
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
def find_matches(x):
    x_by_source = x.groupby('SOURCE')
    grp1 = x_by_source.get_group(1)
    grp2 = x_by_source.get_group(2)
    # one kd-tree per source for this ID group
    tree1 = cKDTree(grp1[['X_COORDINATE', 'Y_COORDINATE']])
    tree2 = cKDTree(grp2[['X_COORDINATE', 'Y_COORDINATE']])
    neighbours = tree1.query_ball_tree(tree2, r=70000)
    # pairs of positional indices: (row in grp1, row in grp2)
    matches = np.array([[n, k] for (n, j) in enumerate(neighbours) if j for k in j])
    indices1 = grp1.index[matches[:, 0]]
    indices2 = grp2.index[matches[:, 1]]
    m1 = grp1.loc[indices1]
    m2 = grp2.loc[indices2]
    # arrange matches side by side; reset the indices so the rows pair up positionally
    res = pd.concat([m1.reset_index(drop=True), m2.reset_index(drop=True)],
                    ignore_index=True, axis=1)
    return res
df_source1 = dd.read_csv('source1.csv').assign(SOURCE = 1)
df_source2 = dd.read_csv('source2.csv').assign(SOURCE = 2)
df = dd.concat([df_source1, df_source2], axis=0)
meta = pd.DataFrame(columns=np.arange(0, 2*len(df.columns)))
result = (df.groupby('ID')
.apply(find_matches, meta=meta)
.persist()
)
# Proceed with further analysis
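As a possible follow-up (a sketch, not part of the original code), the persisted result can then be pulled back into pandas for the analysis step; after the positional concat above, columns 0..n-1 come from source 1 and columns n..2n-1 from source 2, where n = len(df.columns):
matches = result.compute()   # pandas DataFrame of matched row pairs, one pair per row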
I have the following functions to apply a bunch of regexes to each element in a dataframe. The dataframe that I am applying the regexes to is a 5 MB chunk.
import re
from functools import partial

def apply_all_regexes(data, regexes):
    # find all regex matches, applied element-wise to the pandas dataframe
    new_df = data.applymap(partial(apply_re_to_cell, regexes))
    return new_df
def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches
Due to the serial execution of applymap, the processing time is roughly elements * (time to run the regexes on one element). Is there any way to invoke parallelism? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.
Have you tried splitting your one big dataframe into as many small dataframes as you have cores/threads, applying the regex map to them in parallel, and sticking the small dataframes back together?
I was able to do something similar with a dataframe about gene expression.
I would run it at small scale first and check that you get the expected output.
Unfortunately I don't have enough reputation to comment
import multiprocessing
from multiprocessing import Pool
import numpy as np
import pandas as pd

num_cores = multiprocessing.cpu_count()
num_partitions = num_cores  # split the dataframe into one chunk per core

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
This is the general function I used.
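A hedged usage sketch for the regex question above; my_regexes is a placeholder for the real pattern list, and the wrapped function must be defined at module level so that Pool can pickle it:
from functools import partial

my_regexes = [r'\d+', r'[A-Z]\w+']   # placeholder patterns
result_df = parallelize_dataframe(df, partial(apply_all_regexes, regexes=my_regexes))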
I am trying to concatenate multiple dataframes using the unionAll function in pyspark.
This is what I do:
from functools import reduce
from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# unionAll, some_df, something and index_df are defined elsewhere in my code
df_list = []
for i in range(something):
    normalizer = Normalizer(inputCol="features", outputCol="norm", p=1)
    norm_df = normalizer.transform(some_df)
    norm_df = norm_df.repartition(320)
    data = index_df(norm_df)
    data.persist()
    mat = IndexedRowMatrix(
        data.select("id", "norm")
            .rdd.map(lambda row: IndexedRow(row.id, row.norm.toArray()))).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
    df = dot.toIndexedRowMatrix().rows.toDF()
    df_list.append(df)

big_df = reduce(unionAll, df_list)
big_df.write.mode('append').parquet('some_path')
I want to do this because the writing part takes time, and in my case it is much faster to write one big file than n small files.
The problem is that when I write big_df and check Spark UI, I have way too many tasks for writing parquet. While my goal is to write ONE big dataframe, it actually writes all the sub-dataframes.
Any guess?
Spark is lazily evaluated.
The write operation is the action that triggers all of the previous transformations. Those tasks are therefore for those transformations, not just for writing the parquet files.
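If the goal is for the final job to look like "just the write", one hedged option (a sketch, not a guaranteed speed-up) is to materialize the union first, so that the subsequent write job mostly reads the cached result:
from functools import reduce
from pyspark.sql import DataFrame

big_df = reduce(DataFrame.unionAll, df_list).persist()
big_df.count()   # action: forces all upstream transformations to run here
big_df.write.mode('append').parquet('some_path')   # now reads mainly from the cache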
I have 2 pyspark dataframes, expected_df and actual_df, as shown in the attached file.
In my unit test I am trying to check if both are equal or not.
My code for this is:
expected = map(lambda row: row.asDict(), expected_df.collect())
actual = map(lambda row: row.asDict(), actual_df.collect())
assert expected = actual
Since both dfs are the same but the row order is different, the assert fails here.
What is the best way to compare such dfs?
You can try pyspark-test
https://pypi.org/project/pyspark-test/
This is inspired by the pandas testing module, built for pyspark.
Usage is simple
from pyspark_test import assert_pyspark_df_equal
assert_pyspark_df_equal(df_1, df_2)
Apart from just comparing dataframes, and just like the pandas testing module, it also accepts many optional params that you can check in the documentation.
Note:
The datatypes in pandas and pyspark are a bit different; that's why directly converting with .toPandas and using the pandas testing module might not be the right approach.
This package is for unit/integration testing, so it is meant to be used with small dfs.
This is done in some of the pyspark documentation:
assert sorted(expected_df.collect()) == sorted(actual_df.collect())
We solved this by hashing each row with Spark's hash function and then summing the resultant column.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
def hash_df(df):
    """Hashes a DataFrame for comparison.

    Arguments:
        df (DataFrame): A dataframe to generate a hash from

    Returns:
        int: Summed value of hashed rows of an input DataFrame
    """
    # Hash every row into a new hash column
    df = df.withColumn('hash_value', F.hash(*sorted(df.columns))).select('hash_value')
    # Sum the hashes, see https://shortest.link/28YE
    value = df.agg(F.sum('hash_value')).collect()[0][0]
    return value
expected_hash = hash_df(expected_df)
actual_hash = hash_df(actual_df)
assert expected_hash == actual_hash
Unfortunately this cannot be done without applying a sort on some of the columns (ideally the key column), the reason being that there isn't any guarantee of the ordering of records in a DataFrame: you cannot predict the order in which the records will appear. The approach below works fine for me:
expected = expected_df.orderBy('period_start_time').collect()
actual = actual_df.orderBy('period_start_time').collect()
assert expected == actual
If the overhead of an additional library such as pyspark_test is a problem, you could try sorting both dataframes by the same columns, converting them to pandas, and using pd.testing.assert_frame_equal.
I know that the .toPandas method for pyspark dataframes is generally discouraged because the data is loaded into the driver's memory (see the pyspark documentation here), but this solution works for relatively small unit tests.
For example:
sort_cols = actual_df.columns
pd.testing.assert_frame_equal(
    actual_df.sort(sort_cols).toPandas(),
    expected_df.sort(sort_cols).toPandas()
)
Try using "==" instead of "=":
assert expected == actual
I have two dataframes with the same order.
To compare the two I use:
def test_df(df1, df2):
    assert df1.values.tolist() == df2.values.tolist()
Another way to go about it, ensuring sort order, would be:
from pandas.testing import assert_frame_equal
def assert_frame_with_sort(results, expected, key_columns):
    results_sorted = results.sort_values(by=key_columns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=key_columns).reset_index(drop=True)
    assert_frame_equal(results_sorted, expected_sorted)
I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_variable_length_series(x):
    '''returns a pd.Series of variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_variable_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this always supposed to work, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata, if the number of columns is not known beforehand?
Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
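For concreteness, a hedged sketch of what such an extraction function might look like (the threshold and the row layout are made-up assumptions):
def extract_event_times(row):
    # hypothetical: an "event" is a time step where the simulated signal
    # exceeds a threshold; the number of events varies from row to row
    signal = row.values
    event_indices = np.flatnonzero(signal > 0.5)
    return pd.Series(event_indices)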
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()

# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_variable_length_series, axis=1))

# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]

# calculate the result; this gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)

# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this.
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html