I am trying to classify my data into percentile buckets based on its values. My data looks like this:
import numpy as np
import pandas as pnd

a = pnd.DataFrame(index = ['a','b','c','d','e','f','g','h','i','j'], columns=['data'])
a.data = np.random.randn(10)
print a
print '\nthese are ranked as shown'
print a.rank()
data
a -0.310188
b -0.191582
c 0.860467
d -0.458017
e 0.858653
f -1.640166
g -1.969908
h 0.649781
i 0.218000
j 1.887577
these are ranked as shown
data
a 4
b 5
c 9
d 3
e 8
f 2
g 1
h 7
i 6
j 10
To rank this data, I am using the rank function. However, I am interested in creating a bucket of the top 20%. In the example shown above, this would be a list containing the labels ['c', 'j'].
desired result : ['c','j']
How do I get the desired result?
In [13]: df[df > df.quantile(0.8)].dropna()
Out[13]:
data
c 0.860467
j 1.887577
In [14]: list(df[df > df.quantile(0.8)].dropna().index)
Out[14]: ['c', 'j']
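Since you are already using rank, an equivalent route (a small sketch, assuming a is the frame built in the question) is to compute percentile ranks and keep everything above 0.8:

top20 = a[a['data'].rank(pct=True) > 0.8]
list(top20.index)  # ['c', 'j'] for the sample above

Note that with ties the rank-based and quantile-based cutoffs can differ slightly; for this data they agree.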
Related
I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:
Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d      NaN
4      c    b    c      d
5      a    b    c      g
6      a    b    c      NaN
7      a    s    c      f
8      a    f    c      NaN
9      a    b    NaN    NaN
10     a    b    t      d
11     a    b    g      NaN
...    ...  ...  ...    ...
100    a    b    c      d
My goal would be to filter for the rows with the most matches to the list in the corresponding positions (e.g. position 1 in the list has to match column 1, position 2 column 2, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; if row 100 were included, row 100 and all the other rows matching all elements would be selected.
Also the list might change in length e.g. list_to_match = ['a','b'].
Thanks for your help!
I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]
print(out)
Output (ignoring the input row 100):
One Two Three Four
Index
5 a b c g
6 a b c None
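The cummin(axis=1) step is what enforces in-order matching: after the first mismatch in a row, every later column is forced to False, so only the leading run of matches is counted. The same code handles a shorter list unchanged, since the mask only looks at the first len(list_to_match) columns; e.g. (a sketch with the shorter list mentioned in the question):

list_to_match = ['a', 'b']
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
mask2 = mask.cummin(axis=1).sum(axis=1)
out = df[mask2.eq(mask2.max())]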
Here is my approach. Descriptions are commented below.
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
data = {'One': ['a', 'a', 'a', 'a'],
'Two': ['b', 'b', 'b', 'b'],
'Three': ['c', 'c', 'y', 'c'],
'Four': ['g', 'g', 'z', 'd']}
dataframe_ = pd.DataFrame(data)
#encoding Letters into numerical values so we can compute the cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32)-64
#Our input data which we are going to compare with other rows
input_data = np.array(['a', 'b', 'c', 'd'])
#encode input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32)-64
#compute cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)
output:
0 0.999343
1 0.999343
2 0.973916
3 1.000000
Filtering rows based on their cosine similarities:
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0 0.999343
1 0.999343
2 NaN
3 1.000000
From here on you can easily find the rows with non-NaN values by their indexes.
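For example (a minimal sketch reusing df_filtered from above):

matching_index = df_filtered.dropna().index
print(list(matching_index))  # [0, 1, 3] for the sample data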
I have a dataframe of floats
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
1 0.433127 0.479051 0.159739 0.734577 0.113672
2 0.391228 0.516740 0.430628 0.586799 0.737838
3 0.956267 0.284201 0.648547 0.696216 0.292721
4 0.001490 0.973460 0.298401 0.313986 0.891711
5 0.585163 0.471310 0.773277 0.030346 0.706965
6 0.374244 0.090853 0.660500 0.931464 0.207191
7 0.630090 0.298163 0.741757 0.722165 0.218715
I can divide it into quantiles for a single column like so:
def groupby_quantiles(df, column, groups: int):
    quantiles = df[column].quantile(np.linspace(0, 1, groups + 1))
    bins = pd.cut(df[column], quantiles, include_lowest=True)
    return df.groupby(bins)
>>> df.pipe(groupby_quantiles, "a", 2).apply(lambda x: print(x))
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
2 0.391228 0.516740 0.430628 0.586799 0.737838
4 0.001490 0.973460 0.298401 0.313986 0.891711
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
3 0.956267 0.284201 0.648547 0.696216 0.292721
5 0.585163 0.471310 0.773277 0.030346 0.706965
7 0.630090 0.298163 0.741757 0.722165 0.218715
Now, I want to repeat the same operation on each of the groups for the next column. The code becomes ridiculous:
>>> (
    df
    .pipe(groupby_quantiles, "a", 2)
    .apply(
        lambda df_group: (
            df_group
            .pipe(groupby_quantiles, "b", 2)
            .apply(lambda x: print(x))
        )
    )
)
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
2 0.391228 0.51674 0.430628 0.586799 0.737838
4 0.001490 0.97346 0.298401 0.313986 0.891711
a b c d e
3 0.956267 0.284201 0.648547 0.696216 0.292721
7 0.630090 0.298163 0.741757 0.722165 0.218715
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
5 0.585163 0.471310 0.773277 0.030346 0.706965
My goal is to repeat this operation for as many columns as I want, then aggregate the groups at the end. Here's what the final function could look like, along with the desired result, assuming we aggregate with the mean.
>>> groupby_quantiles(df, columns=["a", "b"], groups=[2, 2], agg="mean")
a b c d e
0 0.229947 0.163832 0.730887 0.756813 0.150660
1 0.196359 0.745100 0.364515 0.450392 0.814774
2 0.793179 0.291182 0.695152 0.709190 0.255718
3 0.509145 0.475180 0.466508 0.382462 0.410319
Any ideas on how to achieve this?
Here is a way. First, the quantile-then-cut step can be rewritten with qcut. Then use a recursive operation similar to this:
def groupby_quantiles(df, cols, grs, agg_func):
    # to store all the results
    _dfs = []

    # recursive function
    def recurse(_df, depth):
        col = cols[depth]
        gr = grs[depth]
        # iterate over the groups per quantile
        for _, _dfgr in _df.groupby(pd.qcut(_df[col], gr)):
            if depth != -1:
                recurse(_dfgr, depth + 1)   # recurse if not at the last column
            else:
                _dfs.append(_dfgr.agg(agg_func))  # else perform the aggregation

    # using a negative depth makes it easier to access the right column and quantile
    depth = -len(cols)
    recurse(df, depth)  # start the recursion
    return pd.concat(_dfs, axis=1).T  # concat the results and transpose

print(groupby_quantiles(df, cols=['a', 'b'], grs=[2, 2], agg_func='mean'))
# a b c d e
# 0 0.229946 0.163832 0.730887 0.756813 0.150660
# 1 0.196359 0.745100 0.364515 0.450392 0.814774
# 2 0.793179 0.291182 0.695152 0.709190 0.255718
# 3 0.509145 0.475181 0.466508 0.382462 0.410318
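As a side note on the design: one might be tempted to pass several pd.qcut keys to a single groupby, as in the sketch below, but in general that is not equivalent. It bins b by its quantiles over the whole frame, whereas the recursion recomputes b's quantiles inside each a bin, which is what the question asks for.

# not equivalent in general: b is binned on the full column, not within each a-group
flat = df.groupby([pd.qcut(df['a'], 2), pd.qcut(df['b'], 2)]).agg('mean')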
I have a simple-looking problem. I have a dataframe df with two columns. For each string that occurs in either of these columns, I would like to count the number of rows that contain that symbol in either column.
E.g.
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
The following code works but is very inefficient.
for elem in set(df.values.flat):
    print elem, len(df.loc[(df[0] == elem) | (df[1] == elem)])
a 2
c 1
b 1
e 1
d 3
g 1
i 4
h 3
k 1
j 1
This is, however, very inefficient, and my dataframe is large. The inefficiency comes from calling df.loc[(df[0] == elem) | (df[1] == elem)] separately for every distinct symbol in df.
Is there a fast way of doing this?
You can use loc to filter out the 'col2' values that match 'col1' in the same row (so those rows are not counted twice), append the remaining 'col2' values to 'col1', and then call value_counts:
counts = df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
The resulting output:
i 4
d 3
h 3
a 2
j 1
k 1
c 1
g 1
b 1
e 1
Note: You can add .sort_index() to the end of the counting code if you want the output to appear in alphabetical order.
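Note that Series.append was removed in pandas 2.0; on newer versions the same idea can be written with pd.concat (a sketch with identical semantics):

counts = pd.concat([df['col1'], df.loc[df['col1'] != df['col2'], 'col2']]).value_counts()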
Timings
Using the following setup to produce a larger sample dataset:
from string import ascii_lowercase
n = 10**5
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['col1', 'col2'])
def edchum(df):
    vals = np.unique(df.values)
    count = np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0),
                       df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
    return count
I get the following timings:
%timeit df['col1'].append(df.loc[df['col1'] != df['col2'], 'col2']).value_counts()
10 loops, best of 3: 19.7 ms per loop
%timeit edchum(df)
1 loop, best of 3: 3.81 s per loop
OK, this is much trickier than I thought, and I'm not sure how it will scale, but if you have a lot of repeating values it will be more efficient than your current method. Basically, we can use str.get_dummies and reindex the columns of each result to generate a dummies df over all unique values; we can then take np.maximum of the 2 dfs and sum:
In [77]:
t="""col1 col2
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
np.maximum(df['col1'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0), df['col2'].str.get_dummies().reindex_axis(vals, axis=1).fillna(0)).sum()
Out[77]:
a 2
b 1
c 1
d 3
e 1
g 1
h 3
i 4
j 1
k 1
dtype: float64
vals here is just the unique values:
In [80]:
vals = np.unique(df.values)
vals
Out[80]:
array(['a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'j', 'k'], dtype=object)
How do you combine multiple columns into one staggered column? For example, if I have data:
Column 1 Column 2
0 A E
1 B F
2 C G
3 D H
And I want it in the form:
Column 1
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
What is a good, vectorized pythonic way to go about doing this? I could probably do some sort of df.apply() hack but I'm betting there is a better way. The application is putting multiple dimensions of time series data into a single stream for ML applications.
First stack the columns and then drop the multiindex:
df.stack().reset_index(drop=True)
Out:
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
dtype: object
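As a side note on the stack approach: if you want the result back as a single named column rather than a Series, you can wrap it with to_frame (a small sketch):

df.stack().reset_index(drop=True).to_frame('Column 1')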
To get a dataframe:
pd.DataFrame(df.values.reshape(-1, 1), columns=['Column 1'])
For a Series, answering the OP's question:
pd.Series(df.values.flatten(), name='Column 1')
For a Series, as used in the timing tests below:
pd.Series(get_df(n).values.flatten(), name='Column 1')
Timing
code:
def get_df(n=1):
    df = pd.DataFrame({'Column 2': {0: 'E', 1: 'F', 2: 'G', 3: 'H'},
                       'Column 1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}})
    return pd.concat([df for _ in range(n)])
[Timing plots omitted: results were shown for the given sample, the given sample * 10,000, and the given sample * 1,000,000.]
I would like to find the item in DF2 that is closest to each item in DF1.
The distance is the Euclidean distance.
For example, for A in DF1, F in DF2 is the closest one.
>>> DF1
X Y name
0 1 2 A
1 3 4 B
2 5 6 C
3 7 8 D
>>> DF2
X Y name
0 3 8 E
1 2 4 F
2 1 9 G
3 6 4 H
My code is
DF1 = pd.DataFrame({'name' : ['A', 'B', 'C', 'D'],'X' : [1,3,5,7],'Y' : [2,4,6,8]})
DF2 = pd.DataFrame({'name' : ['E', 'F', 'G', 'H'],'X' : [3,2,1,6],'Y' : [8,4,9,4]})
def ndis(row):
    try:
        X, Y = row['X'], row['Y']
        DF2['DIS'] = (DF2.X - X) * (DF2.X - X) + (DF2.Y - Y) * (DF2.Y - Y)
        temp = DF2.ix[DF2.DIS.idxmin()]
        return temp[2]  # print temp[2]
    except:
        pass

DF1['Z'] = DF1.apply(ndis, axis=1)
This works fine, but it takes too long for large data sets.
Another question is how to find the 2nd and 3rd closest ones.
There is more than one approach; for example, one can use numpy:
>>> xy = ['X', 'Y']
>>> distance_array = numpy.sum((df1[xy].values - df2[xy].values)**2, axis=1)
>>> distance_array.argmin()
1
Top 3 closest (not the fastest approach, I suppose, but simplest)
>>> distance_array.argsort()[:3]
array([1, 3, 2])
If speed is a concern, run performance tests.
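To get the nearest DF2 row for every DF1 row at once, and the 2nd/3rd closest as well, the same idea can be broadcast over all pairs (a sketch, assuming DF1 and DF2 as defined in the question):

import numpy as np

xy = ['X', 'Y']
# pairwise squared distances, shape (len(DF1), len(DF2))
diff = DF1[xy].to_numpy()[:, None, :] - DF2[xy].to_numpy()[None, :, :]
dist2 = (diff ** 2).sum(axis=2)

# nearest DF2 name for each DF1 row
DF1['Z'] = DF2['name'].to_numpy()[dist2.argmin(axis=1)]

# positions of the 3 closest DF2 rows per DF1 row (nearest first)
closest3 = dist2.argsort(axis=1)[:, :3]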
Look at scipy.spatial.KDTree and the related cKDTree, which is faster but offers only a subset of the functionality. For large sets, you probably won't beat that for speed.
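As a rough sketch of that route (again assuming DF1 and DF2 from the question), cKDTree gives both the nearest neighbour and the k nearest in one query:

from scipy.spatial import cKDTree

tree = cKDTree(DF2[['X', 'Y']].to_numpy())
# k=3 returns distances and DF2 positions of the 3 closest points per DF1 row
dist, idx = tree.query(DF1[['X', 'Y']].to_numpy(), k=3)
DF1['Z'] = DF2['name'].to_numpy()[idx[:, 0]]   # nearest neighbour's name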