I am trying to take a DataFrame column that contains repeating values from a finite set and substitute each value with an index number. So if the distinct values are [200, 20, 1000, 1], the indexes of their occurrence will be [1, 2, 3, 4].
Actual data example is:
0 aaa
1 aaa
2 bbb
3 aaa
4 bbb
5 bbb
6 ccc
7 ddd
8 ccc
9 ddd
The desired output is
0 1
1 1
2 2
3 1
4 2
5 2
6 4
7 3
8 4
9 3
I want to replace the values, which make little sense on their own, with numbers. That's all... I do not care about the order of indexing, i.e. 1 could be 3 and so on, as long as the mapping is consistent. That is, I don't care whether ['aaa','bbb','ccc','ddd'] is indexed by [1,2,3,4] or [2,4,3,1].
Suppose that the DF name is tbl and I want to change only a subset of the indexes in column 'aaa'. Let's denote these indexes by tbl_ind. The way I do that currently is:
tmp_r = tbl[tbl_ind]
un_r_ind = np.unique(tmp_r)
for r_ind in range(len(un_r_ind)):
    # positions within tmp_r where the current unique value occurs
    r_ind_ind = np.array(np.where(tmp_r == un_r_ind[r_ind])[0])
    for j_ind in range(len(r_ind_ind)):
        tbl['aaa'].iloc[tbl_ind[r_ind_ind[j_ind]]] = r_ind
It works. And it is REALLY slow on large data sets.
Pandas does not let me update tbl['aaa'].iloc[tbl_ind[r_ind_ind]] in one go, since that is a list of indexes...
Help please? How can I speed this up?
Many thanks!
I'd construct a dict of the values you want to replace and then call map:
In [7]:
df
Out[7]:
data
0 aaa
1 aaa
2 bbb
3 aaa
4 bbb
5 bbb
6 ccc
7 ddd
8 ccc
9 ddd
In [8]:
d = {'aaa':1,'bbb':2,'ccc':3,'ddd':4}
df['data'] = df['data'].map(d)
df
Out[8]:
data
0 1
1 1
2 2
3 1
4 2
5 2
6 3
7 4
8 3
9 4
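If you don't want to hard-code the dict, a minimal sketch that builds it from the column's unique values (assuming the column is named 'data', as above):
# build {'aaa': 1, 'bbb': 2, ...} from whatever distinct values appear
d = {v: i + 1 for i, v in enumerate(df['data'].unique())}
df['data'] = df['data'].map(d)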
You could use rank with the dense method:
>>> df[0].rank(method="dense")
0 1
1 1
2 2
3 1
4 2
5 2
6 3
7 4
8 3
9 4
Name: 0, dtype: float64
This basically sorts the values and maps the lowest to 1, the second-lowest to 2, and so on.
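Note that rank returns floats; if you want integer labels instead, a small follow-up sketch:
df[0].rank(method="dense").astype(int)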
I am not sure I have understood your example correctly.
Is this what you are trying to achieve (apart from the index starting at zero instead of one)?
df=['aaa','aaa','bbb','aaa','bbb','bbb','ccc','ddd','ccc','ddd']
idx = {}

def index_data(v):
    # return the existing id for v, or assign the next free id
    global idx
    if v in idx:
        return idx[v]
    else:
        n = len(idx)
        idx[v] = n
        return n

if __name__ == "__main__":
    outlist = []
    for i in df:
        outlist.append(index_data(i))
    for i, v in enumerate(outlist):
        print(i, v)
It outputs:
0 0
1 0
2 1
3 0
4 1
5 1
6 2
7 3
8 2
9 3
Obviously it can be optimised (e.g. simply incrementing a counter for n instead of checking the size of the index)
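For comparison, a vectorized sketch of the same idea in pandas — pd.factorize assigns one integer code per distinct value (the Series s here is just for illustration):
import pandas as pd
s = pd.Series(['aaa','aaa','bbb','aaa','bbb','bbb','ccc','ddd','ccc','ddd'])
codes, uniques = pd.factorize(s)
# codes -> [0 0 1 0 1 1 2 3 2 3], uniques -> ['aaa' 'bbb' 'ccc' 'ddd']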
Related
When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
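If you want the ids to start at 1, as in the question's example output, a one-line tweak (just a sketch):
df['idx'] = df.groupby(['a', 'b']).ngroup() + 1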
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge(df.drop_duplicates(group_vars).reset_index(), on=group_vars)
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
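For example, a sketch of that extra step, reusing group_vars and df from above (reset the deduplicated frame's index before materialising it as a column):
df.merge(df.drop_duplicates(group_vars).reset_index(drop=True).reset_index(), on=group_vars)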
Update: Newer versions of pandas (0.20.2+) offer a simpler way to do this with the ngroup method, as noted in a comment on the question by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes, as the former seems to be deprecated
Edit 2: added a separator, as suggested by Authman Apatira
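The separator matters because, without it, different combinations can collide; a quick illustrative sketch with made-up values:
# without a separator, (1, 11) and (11, 1) both become the string '111'
str(1) + str(11) == str(11) + str(1)                # True  -> same categorical code
str(1) + '_' + str(11) == str(11) + '_' + str(1)    # False -> distinct codes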
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
# create a dummy grouper id by just joining the desired columns as strings
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x), axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)
print(df)
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns and then checks whether each row is different from the previous row, incrementing a counter whenever it is. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
Breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row is different from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs a single column to differ, which is what .ne(0).any(1) checks: not equal to 0 for any of the columns. Then a cumulative sum keeps track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach by checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
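To attach these group ids back onto the original df, plain assignment aligns on the index labels, so the sorted order does not matter (a small sketch):
df['idx'] = df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)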
I have two dataframes, df_diff and df_three. Each column of df_three contains the index values of the three largest values from the corresponding column of df_diff. For example, let's say df_diff looks like this:
A B C
0 4 7 8
1 5 5 7
2 8 2 1
3 10 3 4
4 1 12 3
Using
df_three = df_diff.apply(lambda s: pd.Series(s.nlargest(3).index))
df_three would look like this:
A B C
0 3 4 0
1 2 0 1
2 1 1 3
How could I match the index values in df_three to the column values of df_diff? In other words, how could I get df_three to look like this:
A B C
0 10 12 8
1 8 7 7
2 5 5 4
Am I making this problem too complicated? Would there be an easier way?
Any help is appreciated!
def top_3(s, top_values):
    # take the top `top_values` entries of the column, largest first,
    # and re-index them 0..top_values-1 so apply stacks them by position
    res = s.sort_values(ascending=False)[:top_values]
    res.index = range(top_values)
    return res

res = df.apply(lambda x: top_3(x, 3))  # df here is the question's df_diff
print(res)
Use numpy.sort with dataframe values:
n=3
arr = df.copy().to_numpy()
df_three = pd.DataFrame(np.sort(arr, 0)[::-1][:n], columns=df.columns)
print(df_three)
A B C
0 10 12 8
1 8 7 7
2 5 5 4
I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
No problem with the first line, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but that didn't work either.)
I believe you need to assign back to the filtered DataFrame after converting to a NumPy array with .values, so that pandas does not align on the index:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
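For contrast, a rough sketch of why the original attempt appears to do nothing: without .values, pandas aligns the sorted rows back onto their original index labels, which undoes the sort:
df2 = df.iloc[i:j].sort_values(by='A')
df.iloc[i:j] = df2   # df2 still carries the old index labels, so rows snap back into place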
I have a DataFrame like
Sou Des
1 3
1 4
2 3
2 4
3 1
3 2
4 1
4 2
I need to assign a random value between 0 and 1 to each pair, but both orientations of a pair, like "1-3" and "3-1", have to get the same random value. I'm expecting a result DataFrame like
Sou Des Val
1 3 0.1
1 4 0.6
2 3 0.9
2 4 0.5
3 1 0.1
3 2 0.9
4 1 0.6
4 2 0.5
How can I assign the same random value to mirrored pairs like "A-B" and "B-A" in pandas?
Let's first create a helper DF sorted along axis=1:
In [304]: x = pd.DataFrame(np.sort(df, axis=1), df.index, df.columns)
In [305]: x
Out[305]:
Sou Des
0 1 3
1 1 4
2 2 3
3 2 4
4 1 3
5 2 3
6 1 4
7 2 4
now we can group by its columns:
In [306]: df['Val'] = (x.assign(c=1)
.groupby(x.columns.tolist())
.transform(lambda x: np.random.rand(1)))
In [307]: df
Out[307]:
Sou Des Val
0 1 3 0.989035
1 1 4 0.918397
2 2 3 0.463653
3 2 4 0.313669
4 3 1 0.989035
5 3 2 0.463653
6 4 1 0.918397
7 4 2 0.313669
Here is another way:
s=pd.crosstab(df.Sou,df.Des)
b = np.random.random_integers(-2000,2000,size=(len(s),len(s)))
sy = (b + b.T)/2
s.mul(sy).replace(0,np.nan).stack().reset_index()
Out[292]:
Sou Des 0
0 1 3 -60.0
1 1 4 -867.0
2 2 3 269.0
3 2 4 1152.0
4 3 1 -60.0
5 3 2 269.0
6 4 1 -867.0
7 4 2 1152.0
The trick here is to do a bit of work away from the dataframe. You can break this down into three steps:
assemble a list of all tuples (a,b)
assign a random value to each pair so that (a,b) and (b,a) have the same value
fill in the new column
Assuming your dataframe is called df, we can make a list of all the pairs ordered so that a <= b. I think this will be easier than trying to keep track of both (a,b) and (b,a).
pairs = {(a, b) if a <= b else (b, a)
         for a, b in df.itertuples(index=False, name=None)}
It's simple enough to assign a random number to each of these pairs and store it in a dictionary, so I'll leave that to you. Call it pair_dict.
Now, we just have to lookup the values. We'll ultimately want to write
df['Val'] = df.apply(<some function>, axis=1)
where our function looks up the appropriate value in pair_dict.
Rather than try to cram it into a lambda (though we could), let's write it separately.
def func(row):
    if row['Sou'] <= row['Des']:
        key = (row['Sou'], row['Des'])
    else:
        key = (row['Des'], row['Sou'])
    return pair_dict[key]
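Putting the pieces together, a minimal end-to-end sketch (the random-number assignment is my own filler for the step the answer leaves to you):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sou': [1, 1, 2, 2, 3, 3, 4, 4],
                   'Des': [3, 4, 3, 4, 1, 2, 1, 2]})

# one unordered pair -> one random value in [0, 1)
pairs = {(a, b) if a <= b else (b, a)
         for a, b in df.itertuples(index=False, name=None)}
pair_dict = {p: np.random.rand() for p in pairs}

def func(row):
    if row['Sou'] <= row['Des']:
        key = (row['Sou'], row['Des'])
    else:
        key = (row['Des'], row['Sou'])
    return pair_dict[key]

df['Val'] = df.apply(func, axis=1)
print(df)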
If you are OK with the "random" value coming from the hash() method, you can achieve this with frozenset():
df = pd.DataFrame([[1,1,2,2,3,3,4,4],[3,4,3,4,1,2,1,2]]).T
df.columns = ['Sou','Des']
df['Val'] = df.apply(lambda x: hash(frozenset([x["Sou"], x["Des"]])), axis=1)
print(df)
which gives:
Sou Des Val
0 1 3 1580307032
1 1 4 -1736016661
2 2 3 741508915
3 2 4 -1930135584
4 3 1 1580307032
5 3 2 741508915
6 4 1 -1736016661
7 4 2 -1930135584
reference:
Why aren't Python sets hashable?
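If you need values in [0, 1) rather than raw hashes, one option (a sketch, not part of the original answer) is to re-label each distinct hash with pd.factorize and draw one random number per label:
import numpy as np
import pandas as pd

codes, _ = pd.factorize(df['Val'])
rng = np.random.RandomState(0)
df['Val'] = rng.rand(codes.max() + 1)[codes]   # rows that shared a hash now share a random value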