Calculating the minimum distance between two DataFrames - python

I'd like to find, for each item in DF1, the item in DF2 that is closest to it.
The distance is the Euclidean distance.
For example, for A in DF1, F in DF2 is the closest one.
>>> DF1
   X  Y name
0  1  2    A
1  3  4    B
2  5  6    C
3  7  8    D
>>> DF2
   X  Y name
0  3  8    E
1  2  4    F
2  1  9    G
3  6  4    H
My code is
DF1 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'X': [1, 3, 5, 7], 'Y': [2, 4, 6, 8]})
DF2 = pd.DataFrame({'name': ['E', 'F', 'G', 'H'], 'X': [3, 2, 1, 6], 'Y': [8, 4, 9, 4]})

def ndis(row):
    X, Y = row['X'], row['Y']
    # squared Euclidean distance from this row to every row of DF2
    DF2['DIS'] = (DF2.X - X) ** 2 + (DF2.Y - Y) ** 2
    # .loc replaces the deprecated .ix; select the name of the nearest row
    return DF2.loc[DF2.DIS.idxmin(), 'name']

DF1['Z'] = DF1.apply(ndis, axis=1)
This works fine, but it takes too long for large data sets.
Another question is how to find the 2nd and 3rd closest ones.

There is more than one approach; for example, one can use numpy:
>>> xy = ['X', 'Y']
>>> distance_array = np.sum((DF1[xy].values - DF2[xy].values) ** 2, axis=1)
>>> distance_array.argmin()
1
Note that this computes the distance between corresponding rows of DF1 and DF2 (row 0 against row 0, and so on), not between all pairs; for all pairs, broadcast the arrays or use a KD-tree as below.
Top 3 closest (not the fastest approach, I suppose, but the simplest):
>>> distance_array.argsort()[:3]
array([1, 3, 2])
If speed is a concern, run performance tests.

Look at scipy.spatial.KDTree and the related cKDTree, which is faster but offers only a subset of the functionality. For large sets, you probably won't beat that for speed.
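A minimal sketch of that idea (my own, using the DF1/DF2 from the question): cKDTree.query returns the k nearest neighbours of every query point in one call, which also answers the 2nd/3rd-closest part.

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

DF1 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'X': [1, 3, 5, 7], 'Y': [2, 4, 6, 8]})
DF2 = pd.DataFrame({'name': ['E', 'F', 'G', 'H'], 'X': [3, 2, 1, 6], 'Y': [8, 4, 9, 4]})

tree = cKDTree(DF2[['X', 'Y']].values)
# for each DF1 point: distances and DF2 row positions of the 3 nearest neighbours
dist, idx = tree.query(DF1[['X', 'Y']].values, k=3)
DF1['Z'] = DF2['name'].values[idx[:, 0]]   # closest
DF1['Z2'] = DF2['name'].values[idx[:, 1]]  # 2nd closest
DF1['Z3'] = DF2['name'].values[idx[:, 2]]  # 3rd closest
print(DF1)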

Related

Filter DataFrame for most matches

I have a list (list_to_match = ['a','b','c','d']) and a dataframe like the one below:
Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d
4      c    b    c      d
5      a    b    c      g
6      a    b    c
7      a    s    c      f
8      a    f    c
9      a    b
10     a    b    t      d
11     a    b    g
...    ...  ...  ...    ...
100    a    b    c      d
My goal is to filter for the rows with the most matches with the list, in the corresponding positions (position 1 of the list has to match column One, position 2 column Two, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; if row 100 were included, row 100 and every other row matching all elements would be selected.
Also, the list might change in length, e.g. list_to_match = ['a','b'].
Thanks for your help!
I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]
print(out)
Output (ignoring the input row 100):
One Two Three Four
Index
5 a b c g
6 a b c None
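A toy check of my own to show why the cummin step enforces in-order matching: for a row that breaks the sequence at the second position, only the leading run of matches is counted.

import pandas as pd

row = pd.DataFrame([['a', 'x', 'c', 'd']], columns=['One', 'Two', 'Three', 'Four'])
mask = row.eq(['a', 'b', 'c', 'd'])     # [[True, False, True, True]]
print(mask.cummin(axis=1))              # [[True, False, False, False]]
print(mask.cummin(axis=1).sum(axis=1))  # 1 -- only the leading run counts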
Here is my approach. Descriptions are commented below.
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine

data = {'One': ['a', 'a', 'a', 'a'],
        'Two': ['b', 'b', 'b', 'b'],
        'Three': ['c', 'c', 'y', 'c'],
        'Four': ['g', 'g', 'z', 'd']}
dataframe_ = pd.DataFrame(data)

# encode letters as numerical values so we can compute cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32) - 64

# our input data, which we are going to compare with the other rows
input_data = np.array(['a', 'b', 'c', 'd'])
# encode the input data the same way
input_data = input_data.astype('<U1').view(np.uint32) - 64

# compute the cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)
output:
0 0.999343
1 0.999343
2 0.973916
3 1.000000
Filtering rows based on their cosine similarities (dataframe_out is a Series, so we mask it directly; where keeps non-matching rows as NaN):
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0 0.999343
1 0.999343
2 NaN
3 1.000000
From here on you can easily find the rows with non-NaN values by their indexes.
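For example (my own guess at that last step, with df_filtered as computed above):

# index labels of the rows that passed the similarity threshold
matching_index = df_filtered.dropna().index
print(matching_index.tolist())  # [0, 1, 3]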

Fill NA values in dataframe by mapping a double indexed groupby object [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define a lambda function.
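A quick illustration (my own toy example) of the shape difference between the aggregated mean and the transformed mean:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B'], 'value': [1.0, np.nan, 3.0]})
print(df.groupby('name')['value'].mean())             # one row per group: A, B
print(df.groupby('name')['value'].transform('mean'))  # one row per original row: 0, 1, 2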
@DSM has, IMO, the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by and multiple value columns:
df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed in proportion to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now() - t)
# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now() - t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value'] = df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...                    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
...                    'class': list('ppqqrrsss')})
>>> df['value'] = df.groupby(['name', 'class'])['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way:
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured high-ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize the above concerning the efficiency of the possible solutions:
I have a dataset with 97,906 rows and 48 columns.
I want to fill 4 columns with the median of each group.
The column I want to group by has 26,200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed that once a column was not numeric, the times went up exponentially (which makes sense, as I was computing the median).
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

# use apply rather than transform here: the function expects each
# group as a DataFrame with a 'value' column
dft = df.groupby("name", group_keys=False).apply(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second-worst thing to do after iterating rows, from a timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity of apply/lambda-based solutions made me doubt my instinct), and it is indeed about 2.5× faster than the most upvoted solutions.
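A rough harness of my own for reproducing the comparison (the column names and sizes are made up; timings will vary with data and pandas version):

import numpy as np
import pandas as pd
import timeit

df = pd.DataFrame({'name': np.random.choice(list('ABC'), 100_000),
                   'value': np.random.randn(100_000)})
df.loc[df.sample(frac=0.1).index, 'value'] = np.nan  # knock out 10% of values

t_transform = timeit.timeit(
    "df['value'].fillna(df.groupby('name')['value'].transform('mean'))",
    globals=globals(), number=20)
t_apply = timeit.timeit(
    "df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))",
    globals=globals(), number=20)
print(f'transform: {t_transform:.3f}s  apply: {t_apply:.3f}s')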
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
Beware, though: fillna with a DataFrame aligns on the row index, and the grouped means here are indexed 0..n_groups-1 rather than by the original row labels, so this will generally not fill the right values.
You can also use your_dataframe.apply(lambda x: x.fillna(x.mean())), but note that this fills each column with its overall column mean, not the group mean.

How can I retain all index levels from 2 multi-index series in pandas multiply?

In my application I am multiplying two Pandas Series which both have multiple index levels. Sometimes, a level contains only a single unique value, in which case I don't get all the index levels from both Series in my result.
To illustrate the problem, let's take two series:
s1 = pd.Series(np.random.randn(4), index=[[1, 1, 1, 1], [1,2,3,4]])
s1.index.names = ['A', 'B']
A B
1 1 -2.155463
2 -0.411068
3 1.041838
4 0.016690
s2 = pd.Series(np.random.randn(4), index=[['a', 'a', 'a', 'a'], [1,2,3,4]])
s2.index.names = ['C', 'B']
C B
a 1 0.043064
2 -1.456251
3 0.024657
4 0.912114
Now, if I multiply them, I get the following:
s1.mul(s2)
A B
1 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
While my desired result would be
A C B
1 a 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
How can I keep index level C in the multiplication?
I have so far been able to get the right result as shown below, but would much prefer a neater solution which keeps my code simpler and more readable.
s3 = s2.mul(s1).to_frame()
s3['C'] = 'a'
s3.set_index('C', append=True, inplace=True)
You can use Series.unstack with DataFrame.stack:
s = s2.unstack(level=0).mul(s1, level=1, axis=0).stack().reorder_levels(['A','C','B'])
print (s)
A C B
1 a 1 0.827482
2 -0.476929
3 -0.473209
4 -0.520207
dtype: float64
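An alternative sketch of my own, assuming level C really does hold a single value: drop the constant level, multiply aligning on the shared level B, then re-attach C.

# take the constant value of level C, align the multiplication on B only
c_val = s2.index.get_level_values('C')[0]
res = s1.mul(s2.droplevel('C'), level='B')
# re-attach C as an outer level and put the levels in the desired order
res = pd.concat({c_val: res}, names=['C']).reorder_levels(['A', 'C', 'B'])
print(res)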

Python Pandas - Combining Multiple Columns into one Staggered Column

How do you combine multiple columns into one staggered column? For example, if I have data:
Column 1 Column 2
0 A E
1 B F
2 C G
3 D H
And I want it in the form:
Column 1
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
What is a good, vectorized pythonic way to go about doing this? I could probably do some sort of df.apply() hack but I'm betting there is a better way. The application is putting multiple dimensions of time series data into a single stream for ML applications.
First stack the columns and then drop the multiindex:
df.stack().reset_index(drop=True)
Out:
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
dtype: object
To get a dataframe:
pd.DataFrame(df.values.reshape(-1, 1), columns=['Column 1'])
For a Series answering the OP's question:
pd.Series(df.values.flatten(), name='Column 1')
For the Series used in the timing tests:
pd.Series(get_df(n).values.flatten(), name='Column 1')
Timing
Code:
def get_df(n=1):
    df = pd.DataFrame({'Column 2': {0: 'E', 1: 'F', 2: 'G', 3: 'H'},
                       'Column 1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}})
    return pd.concat([df for _ in range(n)])
(The answer showed timing plots for three cases: the given sample, the sample repeated 10,000 times, and the sample repeated 1,000,000 times.)
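A hedged sketch of how such a comparison could be run (my own harness, reusing get_df from above; the original plots are not reproduced here, and the largest case takes a while):

import timeit

for n in (1, 10_000, 1_000_000):
    big = get_df(n)
    t_stack = timeit.timeit(lambda: big.stack().reset_index(drop=True), number=5)
    t_flat = timeit.timeit(lambda: pd.Series(big.values.flatten(), name='Column 1'), number=5)
    print(n, t_stack, t_flat)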

Creating percentile buckets in pandas

I am trying to classify my data in percentile buckets based on their values. My data looks like,
a = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], columns=['data'])
a['data'] = np.random.randn(10)
print(a)
print('\nthese are ranked as shown')
print(a.rank())
data
a -0.310188
b -0.191582
c 0.860467
d -0.458017
e 0.858653
f -1.640166
g -1.969908
h 0.649781
i 0.218000
j 1.887577
these are ranked as shown
data
a 4
b 5
c 9
d 3
e 8
f 2
g 1
h 7
i 6
j 10
To rank this data, I am using the rank function. However, I am interested in creating a bucket of the top 20%. In the example shown above, this would be a list containing the labels ['c', 'j'].
desired result: ['c', 'j']
How do I get the desired result?
In [13]: df[df > df.quantile(0.8)].dropna()
Out[13]:
data
c 0.860467
j 1.887577
In [14]: list(df[df > df.quantile(0.8)].dropna().index)
Out[14]: ['c', 'j']
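Two more sketches of my own (using the OP's frame a), if you want the bucket via percentile ranks or pd.qcut:

# percentile rank > 0.8 picks the top 20%
top20 = a[a['data'].rank(pct=True) > 0.8].index.tolist()   # ['c', 'j']

# or bucket everything into five 20% bins and take the top bin
buckets = pd.qcut(a['data'], 5, labels=False)              # bin labels 0..4
top20_qcut = a.index[buckets == 4].tolist()                # ['c', 'j']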
