Question and Problem statement
I have data coming from two sources. Each source contains groups identified by an ID column, plus coordinates and attributes. I would like to process this data by first matching the groups, then finding nearest neighbours within each matched group, and then studying how the attributes from the two sources compare between the neighbours. My learning challenge was to do this with parallel processing.
The question is: "Using Dask for parallel processing, what might be the simplest and most straightforward way to process this kind of data?"
Background and my solution thus far
The data is in CSV files like dummy data below (real files are in the 100 MiB range):
source1.csv:
ID,X_COORDINATE,Y_COORDINATE,ATTRIB1,PARAM1
B,-63802.84728184705,-21755.63629150563,3,36.136464492674556
B,-63254.41147034371,405.6973789009853,1,18.773534321367528
A,-9536.906537069272,32454.934987740824,0,14.043507555168809
A,15250.802157581298,-40868.390394552596,0,6.680542212635015
source2.csv:
ID,X_COORDINATE,Y_COORDINATE,ATTRIB1,PARAM1
B,-6605.150024790153,39733.35763934722,3,5.599467583303852
B,53264.28797042654,24647.24183964514,0,27.938127686688162
A,6690.836682554512,34643.0606728128,0,10.02914141165683
A,15243.16,-40954.928,0,18.130371948545935
What I would like to do is to
Load the data into dataframes
Split them into groups by ID column
For each group in source1 and source2 (let's call the sub-dataframes in each group source1_sub and source2_sub):
construct KD-tree objects k1 and k2 from the columns X_COORDINATE and Y_COORDINATE
For each pair of trees (k1, k2):
find the nearest neighbours between the two trees
construct three dataframes:
matches_sub: containing the matched rows in source1_sub and source2_sub
source1_sub_only: rows in source1_sub which are not matched
source2_sub_only: rows in source2_sub which are not matched
Concatenate all matches_sub, source1_sub_only, and source2_sub_only dataframes into three dataframes: matches, source1_only, source2_only
Analyze these dataframes
This is a problem that should parallelize beautifully, as each pair of groups is independent of the other pairs. I decided to use scipy.spatial.cKDTree for the actual coordinate matching, but the difficulty arises from the fact that it operates on positional indices into the raw numpy arrays, which isn't so straightforwardly compatible with how Dask arrays can be accessed. At least that's my understanding.
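To illustrate the positional-index issue, here is a minimal pandas-only sketch with toy data (not part of my actual pipeline):
import pandas as pd
from scipy.spatial import cKDTree
# Toy group with a non-default index, to show that cKDTree knows nothing about it
grp = pd.DataFrame({'X_COORDINATE': [0.0, 10.0, 20.0],
                    'Y_COORDINATE': [0.0, 0.0, 0.0]},
                   index=[101, 205, 309])
tree = cKDTree(grp[['X_COORDINATE', 'Y_COORDINATE']].to_numpy())
dist, pos = tree.query([[9.0, 1.0]])   # pos is a positional index into the numpy array
nearest_row = grp.iloc[pos[0]]         # it has to be translated back to a dataframe row via iloc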
My first, futile attempts were rather awkward:
Trying to use two Dask dataframes, aligning them and finding matches. This was dreadfully slow and hard to understand.
Reading the data with Dask DataFrame and processing it with a Dask Bag. This was slightly less complex but still not satisfactory.
Answering my own question, the simplest approach I could think of is to:
Read data from sources 1 and 2 into dataframes df_source1 and df_source2, using dask.dataframe.read_csv.
Upon reading, assign a new column SOURCE to these dataframes to identify the source. Now the groups I'm interested in are specified by the columns ID and SOURCE, which can be used for grouping.
Concatenate these dataframes into a new dataframe: df = dd.concat([df_source1, df_source2], axis=0)
Group the data by the columns ID and SOURCE, and use apply to find matches.
Analyze the data.
Done.
Something along the lines of:
import dask.dataframe as dd
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
def find_matches(x):
    x_by_source = x.groupby('SOURCE')
    # Assumes every ID occurs in both sources; get_group raises KeyError otherwise
    grp1 = x_by_source.get_group(1)
    grp2 = x_by_source.get_group(2)
    tree1 = cKDTree(grp1[['X_COORDINATE', 'Y_COORDINATE']])
    tree2 = cKDTree(grp2[['X_COORDINATE', 'Y_COORDINATE']])
    # For each point in tree1: list of positional indices of tree2 points within radius r
    neighbours = tree1.query_ball_tree(tree2, r=70000)
    matches = np.array([[n, k] for (n, j) in enumerate(neighbours) if j != [] for k in j])
    # query_ball_tree returns positional indices, so use iloc to get the rows back
    m1 = grp1.iloc[matches[:, 0]].reset_index(drop=True)
    m2 = grp2.iloc[matches[:, 1]].reset_index(drop=True)
    # arrange matches side by side; resetting the indices makes the rows align positionally
    res = pd.concat([m1, m2], ignore_index=True, axis=1)
    return res
df_source1 = dd.read_csv('source1.csv').assign(SOURCE = 1)
df_source2 = dd.read_csv('source2.csv').assign(SOURCE = 2)
df = dd.concat([df_source1, df_source2], axis=0)
# meta describes the output of find_matches: the columns of both sources side by side
meta = pd.DataFrame(columns=np.arange(0, 2*len(df.columns)))
result = (df.groupby('ID')
            .apply(find_matches, meta=meta)
            .persist()
          )
# Proceed with further analysis
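The plan above also calls for source1_sub_only and source2_sub_only (the unmatched rows), which find_matches does not produce yet. A minimal sketch of how they could be derived inside the same function, reusing grp1, grp2 and the positional matches array (sketch only, not tested against the real data):
# Positions that never appear in the matches array are the unmatched rows
matched_pos1 = np.unique(matches[:, 0]) if len(matches) else np.array([], dtype=int)
matched_pos2 = np.unique(matches[:, 1]) if len(matches) else np.array([], dtype=int)
source1_sub_only = grp1.iloc[np.setdiff1d(np.arange(len(grp1)), matched_pos1)]
source2_sub_only = grp2.iloc[np.setdiff1d(np.arange(len(grp2)), matched_pos2)]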
Related
I have a dictionary that is filled with multiple dataframes. Now I am searching for an efficient way of changing the key structure, but the solution I have found is rather slow when more dataframes / bigger dataframes are involved. That's why I wanted to ask if anyone knows a more convenient / efficient / faster way or approach than mine. So first, I created this example to show where I initially started:
import pandas as pd
import numpy as np
# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}
# fill dic with random entries
for t1 in teams:
    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30),
                                  'Goals': pd.Series(np.random.randint(0, 5, size=30)),
                                  'Chances': pd.Series(np.random.randint(0, 15, size=30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size=30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size=30))})
    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None
Now I basically have a dictionary where every key is a team, which means I have a dataframe for every team with information on their game performance over time. Now I would prefer to change this particular dictionary so I get a structure where the key is the date, instead of a team. This would mean that I have a dataframe for every date, which is filled with the performance of each team on that date. I managed to do that using the following code, which works but is really slow once I add more teams and performance factors:
# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index=teams, columns=perf)
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
Because I am using a nested loop, the restructuring of my dictionary is slow. Does anyone have an idea of how I could improve the second piece of code? I'm not necessarily searching just for a solution, but also for the logic or an idea of how to do this better.
Thanks in advance, any help is highly appreciated
Creating Pandas dataframes the way you do is (strangely) awfully slow, as is direct indexing.
Copying a dataframe is surprisingly quite fast. Thus you can use an empty reference dataframe copied multiple times. Here is the code:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}
# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
This is about 2 times faster than the reference on my machine.
Overcoming the slow dataframe direct indexing is tricky. We can use numpy to do that. Indeed, we can convert the dataframes to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices into dataframes again. Note that this approach assumes that all values are integers and that the input dataframes are well structured.
Here is the final implementation:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}
# Create a numpy array from the Pandas dataframes
# Assume the `dates` and `perf` indices are the same in all dataframes (and in the same order)
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId, tName in enumerate(teams):
    full[tId, :, :] = dic_teams[tName].to_numpy()
# New structure where key = date, created from the numpy array
for dId, dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:, dId, pId] for pId, pName in enumerate(perf)}, index=teams)
This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is sadly spent in the pd.DataFrame calls. Thus, if you want faster code, use a basic 3D numpy array!
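For illustration, a minimal sketch of what working directly with the 3D array could look like, reusing full, teams, dates and perf from above (this lookup pattern is my own assumption, not part of the measured code):
# Look up one team's performance on one date straight from the 3D array,
# using positions in the teams / dates / perf lists as coordinates
team_pos = teams.index("Arsenal")
goals_pos = perf.index("Goals")
arsenal_first_day = full[team_pos, 0, :]               # all performance factors on the first date
arsenal_first_day_goals = full[team_pos, 0, goals_pos]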
I have 2 sets of split data frames from a big data frame. Say for example,
import pandas as pd, numpy as np
np.random.seed([3,1415])
ind1 = ['A_p','B_p','C_p','D_p','E_p','F_p','N_p','M_p','O_p','Q_p']
col1 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df1 = pd.DataFrame(np.random.randint(10, size=(10, 7)), columns=col1,index=ind1)
ind2 = ['G_l','I_l','J_l','K_l','L_l','M_l','R_l','N_l']
col2 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df2 = pd.DataFrame(np.random.randint(20, size=(8, 7)), columns=col2,index=ind2)
# Split the dataframes into two parts
pc_1,pc_2 = np.array_split(df1, 2)
lnc_1,lnc_2 = np.array_split(df2, 2)
And now I need to concatenate each split data frame from df1 (pc_1, pc_2) with each data frame from df2 (lnc_1, lnc_2). Currently, I am doing it as follows:
# concatenate each split data frame pc1 with lnc1
pc1_lnc_1 =pd.concat([pc_1,lnc_1])
pc1_lnc_2 =pd.concat([pc_1,lnc_2])
pc2_lnc1 =pd.concat([pc_2,lnc_1])
pc2_lnc2 =pd.concat([pc_2,lnc_2])
On every concatenated data frame I need to run a correlation analysis function, for example,
correlation(pc1_lnc_1)
And I wanted to save the results separately, for example,
pc1_lnc1= correlation(pc1_lnc_1)
pc1_lnc2= correlation(pc1_lnc_2)
......
pc1_lnc1.to_csv(output,sep='\t')
The question is whether there is a way to automate the concatenation part above with some sort of loop, rather than coding every combination by hand. Currently, I run the correlation function separately for every concatenated data frame, and I have a pretty long list of split data frames.
You can loop over the split dataframes:
for pc in np.array_split(df1, 2):
    for lnc in np.array_split(df2, 2):
        print(correlation(pd.concat([pc, lnc])))
Here is another thought,
def correlation(data):
    # do some complex operation..
    return data

# {"pc_1" : split_1, "pc_2" : split_2}
pc = {f"pc_{i + 1}": v for i, v in enumerate(np.array_split(df1, 2))}
lc = {f"lc_{i + 1}": v for i, v in enumerate(np.array_split(df2, 2))}

for pc_k, pc_v in pc.items():
    for lc_k, lc_v in lc.items():
        # (pc_1, lc_1), (pc_1, lc_2) ..
        correlation(pd.concat([pc_v, lc_v])).to_csv(f"{pc_k}_{lc_k}.csv", sep="\t", index=False)

# will create csv files like pc_1_lc_1.csv, pc_1_lc_2.csv .. in the current working dir
If you don't have your individual dataframes in an array (and assuming you have a nontrivial number of dataframes), the easiest way (with minimal code modification) would be to throw an eval in with a loop.
Something like
for counter in range(0, n):
    for counter2 in range(0, n):
        exec("pc{}_lnc{} = correlation(pd.concat([pc_{}, lnc_{}]))".format(counter, counter2, counter, counter2))
        eval("pc{}_lnc{}.to_csv(filename, sep='\t')".format(counter, counter2))
The standard disclaimer around eval does still apply (don't do it because it's lazy programming practice and unsafe inputs could cause all kinds of problems in your code).
See here for more details about why eval is bad
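If you want to avoid exec/eval entirely, a dictionary keyed by the generated name gives the same "named results" without dynamically created variables. A rough sketch under the same assumptions, with the split dataframes held in two lists (here called splits1 and splits2, e.g. from np.array_split):
# Sketch: store results under generated names in a dict instead of exec-created variables
results = {}
for counter in range(0, n):
    for counter2 in range(0, n):
        key = "pc{}_lnc{}".format(counter, counter2)
        results[key] = correlation(pd.concat([splits1[counter], splits2[counter2]]))
        results[key].to_csv("{}.csv".format(key), sep='\t')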
Edit: updated answer for the updated question.
I tried and searched for many hours to find a solution to my problem, without success.
I hope I can describe my problem as clearly as possible:
General situation:
My goal is to find the best combinations of features from different DataFrames for machine learning.
Therefore, I want to calculate the correlation of each feature on its own vs. my "target_column", but also of the combinations (e.g. df1+df2 vs. target_column, df2+df3+df4 vs. target_column, ...).
Restrictive issues:
The DataFrames are very large --> 5,000,000 observations per DataFrame
My computer has 16 GB of memory (I got a lot of memory errors during my work)
Description of my main problem:
I have a list of Dataframes: list_of_dfs = [df1, df2, df3, df4, df5]
Each DataFrame is in pandas.get_dummies representation (because there are many categorical features).
I can pd.concat them, because they already represent information for the same observations.
I want to have all combinations of the list_of_dfs for the following goal:
pd.concat them to a new DataFrame
Calculate the correlation of this DataFrame to my target_column (not included in the dfs from list_of_dfs)
My current approach:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name

def corr(df1, df2):
    n = len(df1)
    v1, v2 = df1.values, df2.values
    sums = np.multiply.outer(v2.sum(0), v1.sum(0))
    stds = np.multiply.outer(v2.std(0), v1.std(0))
    print(get_df_name(df1))
    return pd.DataFrame((v2.T.dot(v1) - sums / n) / stds / n, df2.columns, df1.columns)

from itertools import combinations
for i in range(6):
    list_of_dummies_comb = combinations(list_of_dfs, i)
    # test = pd.concat(list_of_dummies_comb, axis=1)
    # print(corr(test, target_column).median())
In the last two lines of code, I have the problem that I have to deal with tuples (class tuple, coming from combinations) and I get an error that I can't concat them.
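For clarity, this is roughly the loop I have in mind; a sketch only, starting at combinations of size 1 because pd.concat cannot handle the empty combination (and the memory problem remains):
from itertools import combinations
for r in range(1, len(list_of_dfs) + 1):
    for combo in combinations(list_of_dfs, r):
        # combo is a tuple of DataFrames; turn it into a list for pd.concat
        test = pd.concat(list(combo), axis=1)
        print(corr(test, target_column).median())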
Does somebody have an idea or a solution for my problem or my approach?
Thanks a lot! :)
I have a dataframe of ~20M lines
I have a column called A that gives me an id (there are ~10K ids in total).
The value of this id defines a random distribution's parameters.
Now I want to generate a column B that is randomly drawn from the distribution defined by the value in column A.
What is the fastest way to do this? Doing something with iterrows or apply is extremely slow. Another possibility is to group by A and generate all my data for each value of A (so I only draw from one distribution). But then I don't end up with a DataFrame but with a GroupBy object, and I don't know how to go back to having the initial dataframe, plus my new column.
I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.
import numpy as np

num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
loc_params = np.random.random(num_ids)
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)
for idx in ids:
    A_idxs = A == idx
    # draw one sample per matching row from this id's distribution
    # (here a normal distribution whose mean comes from loc_params[idx])
    B[A_idxs] = np.random.normal(loc_params[idx], 1, np.sum(A_idxs))
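To get back to a DataFrame with the new column, the same loop can be run with A taken from the real data and the result assigned directly (df here stands for the hypothetical 20M-row dataframe from the question):
A = df['A'].to_numpy()   # ids from the real dataframe instead of randomly generated ones
B = np.zeros(A.shape)
# ... fill B with the same loop as above ...
df['B'] = B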
This question is pretty vague, but how would this work for you?
df['B'] = df.apply(lambda row: distribution(row.A), axis=1)
Edit, following the question's edits (apply is too slow):
You could create a mapping dictionary for the 10k ids to their generated value, then do something like
df['B'] = df['A'].map(dictionary)
I'm unsure if this will be faster than apply, but it will require fewer calls to your random distribution generator
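A minimal sketch of what building that dictionary could look like, assuming a hypothetical distribution(id) function that draws one value per id (note that every row sharing an id then gets the same value):
# Hypothetical distribution(id): returns one random draw for the given id
dictionary = {i: distribution(i) for i in df['A'].unique()}
df['B'] = df['A'].map(dictionary)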
I would like to apply a function to a dask.DataFrame that returns a Series of variable length. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_varibale_length_series(x):
    '''returns pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_varibale_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this always supposed to work, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?
Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()
# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_varibale_length_series, axis=1))
# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]
# calculate the result. This gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)
# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html