Assuming the following DataFrame:
df = pd.DataFrame({'id': [8,16,23,8,23], 'count': [5,8,7,1,2]}, columns=['id', 'count'])
id count
0 8 5
1 16 8
2 23 7
3 8 1
4 23 2
...is there some Pandas magic that allows me to remap the ids so that the ids become sequential? Looking for a result like:
id count
0 0 5
1 1 8
2 2 7
3 0 1
4 2 2
where the original ids [8,16,23] were remapped to [0,1,2]
Note: the remapping doesn't have to maintain original order of ids. For example, the following remapping would also be fine: [8,16,23] -> [2,0,1], but the id space after remapping should be contiguous.
I'm currently using a for loop and a dict to keep track of the remapping, but it feels like Pandas might have a better solution.
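For reference, this is roughly what the loop-and-dict approach looks like (a sketch of what I'm doing now; the exact code is assumed, not quoted):
remap = {}
new_ids = []
for i in df['id']:
    if i not in remap:
        remap[i] = len(remap)  # assign the next free sequential id
    new_ids.append(remap[i])
df['id'] = new_ids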
Use factorize:
>>> df
id count
0 8 5
1 16 8
2 23 7
3 8 1
4 23 2
>>> df['id'] = pd.factorize(df['id'])[0]
>>> df
id count
0 0 5
1 1 8
2 2 7
3 0 1
4 2 2
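Note that factorize numbers the ids in order of first appearance. If you would rather number them in sorted order of the original ids, factorize accepts a sort flag (a small sketch, not part of the output above):
>>> df['id'] = pd.factorize(df['id'], sort=True)[0]  # 8 -> 0, 16 -> 1, 23 -> 2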
You can do this via a groupby's labels:
In [11]: df
Out[11]:
id count
0 8 5
1 16 8
2 23 7
3 8 1
4 23 2
In [12]: g = df.groupby("id")
In [13]: g.grouper.labels
Out[13]: [array([0, 1, 2, 0, 2])]
In [14]: df["id"] = g.grouper.labels[0]
In [15]: df
Out[15]:
id count
0 0 5
1 1 8
2 2 7
3 0 1
4 2 2
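Note that grouper.labels is no longer available in recent pandas versions; the groupby ngroup method gives the same group codes (a minimal sketch, assuming a current pandas release):
df["id"] = df.groupby("id").ngroup()  # yields 0, 1, 2, 0, 2 (groups numbered in sorted key order)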
This may be helpful if you also want to keep a mapping from the new sequential ids back to the original ones. factorize returns both the codes and the unique values:
codes, uniques = pd.factorize(df['id'])
remap = dict(enumerate(uniques))  # {0: 8, 1: 16, 2: 23}
Suppose I have this dataframe df:
A B count
0 1 2 3
1 3 4 2
2 5 6 1
3 7 8 2
Then I want to do a row-replication operation based on the count column, and add a new column that acts as a counter. So the resulting outcome is:
counter A B count
0 0 1 2 3
1 1 1 2 3
2 2 1 2 3
3 0 3 4 2
4 1 3 4 2
5 0 5 6 1
6 0 7 8 2
7 1 7 8 2
My idea was to duplicate the rows accordingly (using NumPy and a pandas DataFrame), then add a counter column that increments for each repetition of the same row and resets to 0 when a new row starts. But I suspect this may be slow. Is there an easier and faster way to do it?
Let's try index.repeat to scale up the DataFrame, then groupby + cumcount to create the counter, and insert it into the DataFrame at the front:
df = df.loc[df.index.repeat(df['count'])]
df.insert(0, 'counter', df.groupby(level=0).cumcount())
df = df.reset_index(drop=True)
df:
counter A B count
0 0 1 2 3
1 1 1 2 3
2 2 1 2 3
3 0 3 4 2
4 1 3 4 2
5 0 5 6 1
6 0 7 8 2
7 1 7 8 2
DataFrame constructor:
import pandas as pd
df = pd.DataFrame({
'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8], 'count': [3, 2, 1, 2]
})
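For comparison, the NumPy-flavoured idea from the question can also be written compactly; a rough sketch (assuming the same df), though the index.repeat version above reads more cleanly:
import numpy as np
out = df.loc[np.repeat(df.index.to_numpy(), df['count'].to_numpy())].reset_index(drop=True)
out.insert(0, 'counter', np.concatenate([np.arange(c) for c in df['count']]))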
Let us take a sample dataframe
df = pd.DataFrame(np.arange(10).reshape((5,2)))
df
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and concatenate the two columns into a single column
temp = pd.concat([df[0], df[1]]).to_frame()
temp
0
0 0
1 2
2 4
3 6
4 8
0 1
1 3
2 5
3 7
4 9
What would be the most efficient way to get the original dataframe, i.e. df, back from temp?
The following way using groupby works, but is there any more efficient way (for example, without groupby-apply or pivot) to do the whole round trip: concatenate, do some operation on the single column, and then revert back to the original dataframe?
pd.DataFrame(temp.groupby(level=0)[0]
.apply(list)
.to_numpy().tolist())
I think we can do a pivot after assigning a helper column with cumcount:
check = temp.assign(c=temp.groupby(level=0).cumcount()).pivot(columns='c', values=0)
Out[66]:
c 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can use groupby + cumcount to create a sequential counter per level=0 group, then append it to the index of the dataframe and use unstack to reshape:
temp.set_index(temp.groupby(level=0).cumcount(), append=True)[0].unstack()
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can try this:
In [1267]: temp['g'] = temp.groupby(level=0)[0].cumcount()
In [1273]: temp.pivot(columns='g', values=0)
Out[1279]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
OR:
In [1281]: temp['g'] = (temp.index == 0).cumsum() - 1
In [1282]: temp.pivot(columns='g', values=0)
Out[1282]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df = pd.DataFrame(np.arange(10).reshape((5,2)))
temp = pd.concat([df[0], df[1]]).to_frame()
duplicated_index = temp.index.duplicated()
pd.concat([temp[~duplicated_index], temp[duplicated_index]], axis=1)
This works for this specific case (as pointed out in the comments, it will fail if you have more than one set of duplicate index values), so I don't think it's a better solution.
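Another option (my addition, not from the answers above): since each original column contributes an equally long block to temp, a plain NumPy reshape also reverses the concatenation, assuming exactly two equal-length columns as in this example:
pd.DataFrame(temp[0].to_numpy().reshape(2, -1).T)  # each stacked block becomes a row, then transpose back to columns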
I am trying to get the columns that are unique to a data frame.
DF_A has 10 columns
DF_B has 3 columns (all three match column names in DF_A).
Before I was using:
cols_to_use = DF_A.columns - DF_B.columns
Since my pandas update, I am getting this error:
TypeError: cannot perform sub with this index type:
What should I be doing now instead?
Thank you!
You can use the difference method:
Demo:
In [12]: df
Out[12]:
a b c d
0 0 8 0 3
1 3 4 1 7
2 0 5 4 0
3 0 9 7 0
4 5 8 5 4
In [13]: df2
Out[13]:
a d
0 4 3
1 3 1
2 1 2
3 3 4
4 0 3
In [14]: df.columns.difference(df2.columns)
Out[14]: Index(['b', 'c'], dtype='object')
In [15]: cols = df.columns.difference(df2.columns)
In [16]: df[cols]
Out[16]:
b c
0 8 0
1 4 1
2 5 4
3 9 7
4 8 5
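One caveat worth adding (not part of the answer above): Index.difference returns its result sorted, so if you need to keep the columns in their original order in the larger frame, a boolean mask is an alternative:
cols = df.columns[~df.columns.isin(df2.columns)]
df[cols]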
I have a pandas data frame that consists of 5 columns. The second column has the numbers 1 to 500 repeated 5 times. As a shorter example, the second column is something like (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3) and I want to sort it to look like (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code I am using to sort is df = res.sort([2], ascending=True), but it sorts it as (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks
How about this: sort by the cumcount and then the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]
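If you would rather avoid creating and deleting the helper column, the same ordering can be obtained with numpy.lexsort (a sketch using the df from In [11]; lexsort treats its last key as the primary sort key):
import numpy as np
df = df.iloc[np.lexsort((df["s"], df.groupby("s").cumcount()))]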
I have the following python pandas data frame:
df = pd.DataFrame( {
'A': [1,1,1,1,2,2,2,3,3,4,4,4],
'B': [5,5,6,7,5,6,6,7,7,6,7,7],
'C': [1,1,1,1,1,1,1,1,1,1,1,1]
} );
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to have another column storing the sum of the C values for each fixed (A, B) pair. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
group_by_B = group_by_A.groupby('B', as_index = False)
res[a] = group_by_B['C'].sum()
but I don't know how to 'get' the results from res back into df in an orderly fashion. I would be very happy with any advice on this. Thank you.
Here's one way (though it feels like this should be doable in one go with an apply, I can't get it to work).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The groupby size function is the one you want; we have to align it with 'A' and 'B' as the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one-liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
You could also do a one-liner using merge as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
You can use this method:
columns = ['col1', 'col2', ...]
df.groupby('col')[columns].sum()
If you want, you can also append .sort_values(by='colx', ascending=True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
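For example, applied to the df in this question, the pattern might look like this (a sketch that aggregates per group rather than adding the per-row D column shown earlier):
grouped = df.groupby(['A', 'B'])[['C']].sum()
grouped.sort_values(by='C', ascending=False)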