I've looked into pandas join, merge, and concat with different parameter values (how to join, indexing, axis=1, etc.), but nothing solves it!
I have two dataframes:
x = pd.DataFrame(np.random.randn(4,4))
y = pd.DataFrame(np.random.randn(4,4),columns=list(range(2,6)))
x
Out[67]:
0 1 2 3
0 -0.036327 -0.594224 0.469633 -0.649221
1 1.891510 0.164184 -0.010760 -0.848515
2 -0.383299 1.416787 0.719434 0.025509
3 0.097420 -0.868072 -0.591106 -0.672628
y
Out[68]:
2 3 4 5
0 -0.328402 -0.001436 -1.339613 -0.721508
1 0.408685 1.986148 0.176883 0.146694
2 -0.638341 0.018629 -0.319985 -1.832628
3 0.125003 1.134909 0.500017 0.319324
I'd like to combine them into one dataframe where the values from y in columns 2 and 3 overwrite those of x, and columns 4 and 5 are appended at the end:
new
Out[100]:
0 1 2 3 4 5
0 -0.036327 -0.594224 -0.328402 -0.001436 -1.339613 -0.721508
1 1.891510 0.164184 0.408685 1.986148 0.176883 0.146694
2 -0.383299 1.416787 -0.638341 0.018629 -0.319985 -1.832628
3 0.097420 -0.868072 0.125003 1.134909 0.500017 0.319324
You can try combine_first:
df = y.combine_first(x)
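combine_first keeps the calling frame's values wherever they exist and fills the gaps from the argument, so calling it on y (not x) is what lets y's columns 2 and 3 win. A quick check (a sketch, assuming the x and y above):
new = y.combine_first(x)
# columns 0 and 1 can only come from x; columns 2 and 3 come from y,
# whose values take precedence; columns 4 and 5 exist only in y
assert list(new.columns) == [0, 1, 2, 3, 4, 5]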
You need both update and combine_first: update overwrites x's columns 2 and 3 in place but never adds new columns, and combine_first then appends y's columns 4 and 5.
x.update(y)
x.combine_first(y)
Out[1417]:
0 1 2 3 4 5
0 -1.075266 1.044069 -0.423888 0.247130 0.008867 2.058995
1 0.122782 -0.444159 1.528181 0.595939 0.155170 1.693578
2 -0.825819 0.395140 -0.171900 -0.161182 -2.016067 0.223774
3 -0.009081 -0.148430 -0.028605 0.092074 1.355105 -0.003027
Or use pd.concat plus a column intersection, dropping the overlapping columns from x first:
pd.concat([x.drop(columns=x.columns.intersection(y.columns)), y], axis=1)
Out[1432]:
0 1 2 3 4 5
0 -1.075266 1.044069 -0.423888 0.247130 0.008867 2.058995
1 0.122782 -0.444159 1.528181 0.595939 0.155170 1.693578
2 -0.825819 0.395140 -0.171900 -0.161182 -2.016067 0.223774
3 -0.009081 -0.148430 -0.028605 0.092074 1.355105 -0.003027
When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
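Note that ngroup numbers the groups from 0 in group-sort order; if you want the 1-based labels shown in the question, you can simply add 1 (a small tweak, not in the original answer):
df['idx'] = df.groupby(['a', 'b']).ngroup() + 1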
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge(df.drop_duplicates(group_vars).reset_index(), on=group_vars)
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of the original index), but this could easily be changed to 0,1,2,3 with an additional reset_index(drop=True), as sketched below.
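For example, the 0,1,2,3 version could look like this (a sketch, not in the original answer): drop the old index first, then reset again so the new 'index' column runs 0..n-1:
group_vars = ['a', 'b']
keys = df.drop_duplicates(group_vars).reset_index(drop=True).reset_index()
df.merge(keys, on=group_vars)  # the 'index' column now runs 0, 1, 2, 3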
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method as noted in a comment to the question above by #Constantino and a subsequent answer by #CalumYou. I'll leave this here as an alternate approach but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert the result to a pandas Categorical and keep only its codes:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes, as the former seems to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
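The separator matters because, without one, distinct groups can collide after concatenation (a sketch of the failure mode with hypothetical values):
# without '_', (a=1, b=12) and (a=11, b=2) map to the same key:
str(1) + str(12)   # '112'
str(11) + str(2)   # '112'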
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
# create a dummy grouper id by just joining the desired columns
df["idx"] = df[["a", "b"]].astype(str).apply(lambda x: "".join(x), axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group ids and their 0-based integer representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)
print(df)
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
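For comparison, pd.factorize reaches the same 0-based codes in one step, assigning integers in order of first appearance (a sketch, not part of the original answer):
df['idx'] = pd.factorize(df['a'].astype(str) + '_' + df['b'].astype(str))[0]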
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    # note: this sorts df in place
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns and then checks whether each row differs from the previous one, incrementing a counter whenever it does. See further below for an answer covering string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, let's see the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs a single column to differ, which is what .ne(0).any(axis=1) checks: not equal to 0 in any column. A cumulative sum then keeps track of the groups.
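Because the result keeps the original index labels, pandas aligns on the index when you assign it, so it can go straight back onto df in the original row order (a sketch, assuming the df above):
df['idx'] = df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)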
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach, checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
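As in the numeric case, the result keeps the original index labels, so it can be assigned directly back to df (a sketch, assuming the df and df1 above):
df['idx'] = df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)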
I have a df like this:
A B
1 1
1 2
1 3
2 2
2 1
3 2
3 3
3 4
I would like to extract the rows of any group whose col B is not ascending, like
A B
2 2
2 1
I tried
df.groupby("A").filter()...
but I'm stuck on how to express the filter. If you have any solution, please let me know.
One way is to use pandas.Series.is_monotonic:
df[df.groupby('A')['B'].transform(lambda x:not x.is_monotonic)]
Output:
A B
3 2 2
4 2 1
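Note that in newer pandas versions Series.is_monotonic is deprecated in favor of the equivalent is_monotonic_increasing, so the same filter would read (a sketch):
df[df.groupby('A')['B'].transform(lambda x: not x.is_monotonic_increasing)]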
Use GroupBy.transform with Series.diff, compare with Series.lt to flag negative steps, check for at least one negative value with Series.any, and filter by boolean indexing:
df1 = df[df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())]
print (df1)
A B
3 2 2
4 2 1
I have a dataframe,
x y z new_col
1 2 3 1
1 2 3 4
1 2 3 7
1 2 3 10
1 2 3 13
I want to create a new column whose first row is 1. Each subsequent value adds the previous row's z, so the column runs 1, 1+3=4, 4+3=7, and so on.
You can perform a shifted cumulative sum:
df['new'] = 1 + df['z'].shift().fillna(0).astype(int).cumsum()
print(df)
x y z new
0 1 2 3 1
1 1 2 3 4
2 1 2 3 7
3 1 2 3 10
4 1 2 3 13
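Step by step (a sketch, assuming the df above):
s = df['z'].shift()                   # NaN, 3, 3, 3, 3
s = s.fillna(0).astype(int).cumsum()  # 0, 3, 6, 9, 12
df['new'] = 1 + s                     # 1, 4, 7, 10, 13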
You can use the cumsum method. If your DataFrame is called df:
df['new_column'] = df.z.cumsum() - df.z[0] + 1
Subtracting df.z[0] and adding 1 (a net -2 here) makes the sum start at 1, as you requested.
You can do it like this:
df.assign(new_col = lambda x: 1 + x['z'].shift().cumsum()).fillna(1).astype(int)
x y z new_col
0 1 2 3 1
1 1 2 3 4
2 1 2 3 7
3 1 2 3 10
4 1 2 3 13
If you want more specific control over the type cast and NA filling, you can also use the more verbose:
df.assign(new_col = lambda x: 1 + x['z'].shift().cumsum()
).fillna({'new_col':1}).astype({'new_col': int})
Actually, you can use the same logic as in jpp's answer but wrap it in an assign call:
df.assign(new_col = lambda x: 1+ x['z'].shift().fillna(0).astype(int).cumsum())
There are various ways to do it, but you have two very simple ones here:
df['new_col'] = (3*df.x).cumsum() - 2
df['new_col'] = 3*df.index + 1
The former assumes that your 'x' column only contains value 1 (if not, you can easily create a column like this df['temp'] = 1).
And the latter assumes that your index has no holes (which can happen after drops, for instance). These two methods are easy to implement and very fast (much faster than shifted cumsums, for instance).
Moreover if the step you need depends on the values contained in the column z it can easily be adapted:
df['new_col'] = (df.z*df.x).cumsum() - 2
x y z new_col
0 1 2 3 1
1 1 2 3 4
2 1 2 3 7
3 1 2 3 10
4 1 2 3 13
I have a DataFrame of the following form:
df = pd.DataFrame({('a','A'):[3,4,5,6],
                   ('a','B'):[1,1,3,5],
                   ('b','A'):[9,7,0,3],
                   ('b','B'):[2,0,1,6]})
which looks like this:
a b
A B A B
0 3 1 9 2
1 4 1 7 0
2 5 3 0 1
3 6 5 3 6
I group it by the second level using the following command:
grouped = df.groupby(level=1,axis=1)
And get:
Group A
________
a b
A A
0 3 9
1 4 7
2 5 0
3 6 3
Group B
________
a b
B B
0 1 2
1 1 0
2 3 1
3 5 6
How can I take each group's two columns, pair them up row-wise into tuples, and collect the result into a new DataFrame? Basically I'm trying to get at this:
A B
0 (3,9) (1,2)
1 (4,7) (1,0)
2 (5,0) (3,1)
3 (6,3) (5,6)
I've been trying
grouped.apply(lambda x : tuple(x))
But it doesn't do the job and instead gives me tuples of column names. Is there a simple way to do this without resorting to for loops?
Try
grouped.apply(lambda x: pd.Series([tuple(i) for i in x.values]))
This seems to do the trick:
grouped.apply(lambda x: pd.Series(list(x.itertuples(index=False))))
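Putting it together as a self-contained sketch (using the first approach; note that axis=1 in groupby is deprecated in newer pandas releases, so this assumes a version that still supports it):
import pandas as pd

df = pd.DataFrame({('a','A'):[3,4,5,6],
                   ('a','B'):[1,1,3,5],
                   ('b','A'):[9,7,0,3],
                   ('b','B'):[2,0,1,6]})

# group the columns by the second level of the MultiIndex
grouped = df.groupby(level=1, axis=1)
# each group is a 4x2 sub-frame; zip its rows into plain tuples
result = grouped.apply(lambda x: pd.Series([tuple(i) for i in x.values]))
print(result)
#         A       B
# 0  (3, 9)  (1, 2)
# 1  (4, 7)  (1, 0)
# 2  (5, 0)  (3, 1)
# 3  (6, 3)  (5, 6)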