I'm trying to reverse this but I can't figure out how.
I'm starting from
>>> d = {'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'col2': [1, 2, 3, 4, 5, 6, 7, 7]}
>>> df = pd.DataFrame(data=d)
>>> df
  col1  col2
0    A     1
1    A     2
2    A     3
3    B     4
4    B     5
5    B     6
6    C     7
7    C     7
And I want to obtain:
  col1 new_1 new_2  new_3
0    A     1     2      3
1    B     4     5      6
2    C     7     7  empty
where the number of new_x columns is based on the maximum number of times a col1 item is repeated.
It seems to be a pretty standard transpose, but I can't find a solution.
Sorry if this is a duplicate.
Thanks,
Sirius
It's not a one-liner but maybe a bit simpler / easier to follow.
First, aggregate into one column of lists:
df_ = pd.DataFrame(df.groupby('col1').col2.agg(list))
which gives
           col2
col1
A     [1, 2, 3]
B     [4, 5, 6]
C        [7, 7]
Then, build a new DataFrame from these lists:
df2 = (pd.DataFrame(df_.col2.tolist(), index=df_.index)
         .add_prefix('new_')
         .reset_index())
which gives
  col1  new_0  new_1  new_2
0    A      1      2    3.0
1    B      4      5    6.0
2    C      7      7    NaN
Please note that:
I interpreted empty as an empty cell (NaN), not as the string 'empty';
NaN is stored as a float, which is why pandas cast the values in the last column to floats.
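If you want to keep integer values, one possible tweak (my own sketch, assuming a pandas version with the nullable Int64 dtype, i.e. 0.24+) is to cast before adding the prefix:
df2 = (pd.DataFrame(df_.col2.tolist(), index=df_.index)
         .astype('Int64')  # nullable integer dtype: missing values show as <NA>
         .add_prefix('new_')
         .reset_index())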
Use .cumcount() and .unstack() after setting your indices.
cumcount() here groups by your target column and applies a sequential count along the index; this lets us unstack() it and create your new pivoted structure.
The rest of the code is there to reach your target dataframe; you could also do this with pivot or crosstab.
df1 = (df.set_index([df.groupby('col1').cumcount() + 1, df['col1']])
         .drop(columns='col1')
         .unstack(0)
         .droplevel(0, axis=1)
         .add_prefix('new_')
         .fillna('empty')
         .reset_index())

  col1  new_1  new_2  new_3
0    A    1.0    2.0    3.0
1    B    4.0    5.0    6.0
2    C    7.0    7.0  empty

Or with pivot:
(df.assign(k=df.groupby('col1').cumcount() + 1)
   .pivot(index='col1', columns='k', values='col2')
   .add_prefix('new_')
   .reset_index())
d = {'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'col2': [1, 2, 3, 4, 5, 6, 7, 7]}
df = pd.DataFrame(data=d)
print(df)
print(df.pivot_table(index='col1', columns=df.index, values='col2').fillna(0))
output:
        0    1    2    3    4    5    6    7
col1
A     1.0  2.0  3.0  0.0  0.0  0.0  0.0  0.0
B     0.0  0.0  0.0  4.0  5.0  6.0  0.0  0.0
C     0.0  0.0  0.0  0.0  0.0  0.0  7.0  7.0
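A possible refinement of this idea (my own addition, not part of the original answer): use the position of each row within its col1 group as the column key, which gives the compact target shape directly:
pos = df.groupby('col1').cumcount() + 1  # 1-based position within each col1 group
print(df.pivot_table(index='col1', columns=pos, values='col2'))
#         1    2    3
# col1
# A     1.0  2.0  3.0
# B     4.0  5.0  6.0
# C     7.0  7.0  NaN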
I might be doing something wrong, but I was trying to calculate a rolling average (let's use sum instead in this example, for simplicity) after grouping the dataframe. Up to that point it all works well, but when I apply a shift I find the values spill over into the group below. See the example:
import pandas as pd
df = pd.DataFrame({'X': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
grouped_df = df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum().shift(periods=1)
print(grouped_df)
Expected result:
X
A  0    NaN
   1    NaN
   2    3.0
B  3    NaN
   4    NaN
   5    3.0
C  6    NaN
   7    NaN
   8    3.0
Result I actually get:
X
A  0    NaN
   1    NaN
   2    3.0
B  3    5.0
   4    NaN
   5    3.0
C  6    5.0
   7    NaN
   8    3.0
You can see the result of A2 gets passed to B3, and the result of B5 to C6. I'm not sure whether this is the intended behaviour and I'm doing something wrong, or whether there is a bug in pandas?
Thanks
The problem is that
df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
returns a new series; when you then chain .shift(), you shift the series as a whole, not within each group.
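You can see this by inspecting the intermediate result: it is one flat series with a (group, original index) MultiIndex, so a plain shift() walks straight across the group boundaries:
inter = df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
print(inter.index[:4].tolist())
# [('A', 0), ('A', 1), ('A', 2), ('B', 3)] -> shifting by 1 pushes the ('A', 2) value into 'B'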
You need another groupby to shift within the group:
grouped_df = (df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
                .groupby(level=0).shift(periods=1))
Or use groupby.transform:
grouped_df = (df.groupby('X')['Y']
                .transform(lambda x: x.rolling(window=2, min_periods=2)
                                      .sum().shift(periods=1)))
Output:
X
A  0    NaN
   1    NaN
   2    3.0
B  3    NaN
   4    NaN
   5    3.0
C  6    NaN
   7    NaN
   8    3.0
Name: Y, dtype: float64
I'm trying to return, row-wise, the top two records by value and the bottom record by value, either as additional columns or as a separate dataframe entirely (either works).
Let's say I have the following dataframe of values:
example = pd.DataFrame({'a': [3, 8, 5, 3, 2, 1, 3],
                        'b': [6, 5, 8, 0, 3, 2, 1],
                        'c': [1, 4, 5, 3, 6, 2, 7],
                        'd': [4, 6, 5, 3, 9, 11, 3],
                        'e': [8, 0, 5, 2, 1, 1, 3]})
example
   a  b  c   d  e
0  3  6  1   4  8
1  8  5  4   6  0
2  5  8  5   5  5
3  3  0  3   3  2
4  2  3  6   9  1
5  1  2  2  11  1
6  3  1  7   3  3
Since I want to find the top values, I end up ranking this dataframe row-wise (across the columns), with no repeats in rank.
rank_df = example.rank(axis=1, method='first', ascending=False)
rank_df
     a    b    c    d    e
0  4.0  2.0  5.0  3.0  1.0
1  1.0  3.0  4.0  2.0  5.0
2  2.0  1.0  3.0  4.0  5.0
3  1.0  5.0  2.0  3.0  4.0
4  4.0  3.0  2.0  1.0  5.0
5  4.0  2.0  3.0  1.0  5.0
6  2.0  5.0  1.0  3.0  4.0
Now, lastly, I would like to map the ranks back to their column names and pull the top two and the bottom one out into a dataframe. For example, row 0 has rank 1 in column e, rank 2 in column b, and rank 5 in column c, so the three columns would be e, b, c.
Expected Output:
  top_1 top_2 bottom_1
0     e     b        c
1     a     d        e
2     b     a        e
3     a     c        b
4     d     c        e
5     d     b        e
6     c     a        b
Use numpy.argsort for the positions of the sorted values and get the column names by indexing:
a = example.columns.to_numpy()[(-example).to_numpy().argsort()]
# for older pandas versions:
# a = example.columns.values[(-example).values.argsort()]
print(a)
[['e' 'b' 'd' 'a' 'c']
['a' 'd' 'b' 'c' 'e']
['b' 'a' 'c' 'd' 'e']
['a' 'c' 'd' 'e' 'b']
['d' 'c' 'b' 'a' 'e']
['d' 'b' 'c' 'a' 'e']
['c' 'a' 'd' 'e' 'b']]
Then select the first, second and last columns and convert to a DataFrame:
df = pd.DataFrame(a[:, [0, 1, -1]], index=example.index,
                  columns=['top_1', 'top_2', 'bottom_1'])
print(df)
  top_1 top_2 bottom_1
0     e     b        c
1     a     d        e
2     b     a        e
3     a     c        b
4     d     c        e
5     d     b        e
6     c     a        b
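If you'd rather stay in pandas, here is an alternative sketch of mine (typically slower than the numpy route on large frames); note the keep='last' in nsmallest, which mirrors the tie-breaking of the descending method='first' rank used above:
def top_bottom(r):
    top = r.nlargest(2).index                # two largest values, ties broken by column order
    bot = r.nsmallest(1, keep='last').index  # smallest value; 'last' matches the rank tie-break
    return pd.Series([top[0], top[1], bot[0]], index=['top_1', 'top_2', 'bottom_1'])

print(example.apply(top_bottom, axis=1))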
I work in Python with pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df_1
   A  B  C
0  2  8  6
1  5  2  5
2  3  4  9
3  5  1  1

# df_2
   A  B    C
0  2  7  NaN
1  5  1  NaN
2  3  3  NaN
3  5  0  NaN
I want to process it to join/merge them to get a new dataframe which looks like that (EXPECTED OUTPUT):
   A  B    C
0  2  7  NaN
1  5  1    1
2  3  3  NaN
3  5  0  NaN
So basically it is a right merge/join, but preserving the row order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
   A  B    C
0  5  1  1.0
1  2  7  NaN
2  3  3  NaN
3  5  0  NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want, but to start with, I am quite surprised that .merge() does not do this very simple thing too.
I think it is a bug.
A possible solution with a left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_', '')).drop('C_', axis=1)
print(df_2)
     A    B    C
0  2.0  7.0  NaN
1  5.0  1.0  1.0
2  3.0  3.0  NaN
3  5.0  0.0  NaN
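Another way to keep the right merge itself (a sketch of mine; _order is just a throwaway helper column): tag df_2 with its row order, merge, then sort back:
tmp = df_2.assign(_order=range(len(df_2)))  # remember the original row order
df_2 = (df_1.merge(tmp[['A', 'B', '_order']], on=['A', 'B'], how='right')
            .sort_values('_order')
            .drop(columns='_order')
            .reset_index(drop=True))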
You can play with the indices of the two dataframes (note that this trick relies on column B uniquely identifying the rows here):
print(df)
#    A  B    C
# 0  5  1  1.0
# 1  2  7  NaN
# 2  3  3  NaN
# 3  5  0  NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
#    A    B    C
# 0  2  7.0  NaN
# 1  5  1.0  1.0
# 2  3  3.0  NaN
# 3  5  0.0  NaN
One quick way is:
df_2 = df_2.set_index(['A', 'B'])
temp = df_1.set_index(['A', 'B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As discussed with @jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
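For reference, with the example frames from the question, those four lines should leave df_2 matching the expected output:
print(df_2)
#      A    B    C
# 0  2.0  7.0  NaN
# 1  5.0  1.0  1.0
# 2  3.0  3.0  NaN
# 3  5.0  0.0  NaN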
I'm working with a huge dataframe in Python, and sometimes I need to add an empty row, or several rows, at a definite position in the dataframe. For this question I create a small dataframe df in order to show what I want to achieve.
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=['A', 'B', 'C'])
   A  B  C
0  4  5  2
1  6  7  0
2  8  1  9
Let's say I need to add an empty row if there is a zero value in column 'C'. Here the empty row should be added after the second row, so in the end I want to have a new dataframe like:
new_df
     A    B    C
0  4.0  5.0  2.0
1  6.0  7.0  0.0
2  NaN  NaN  NaN
3  8.0  1.0  9.0
I tried with concat and append, but I didn't get what I wanted. Could you help me, please?
You can try it this way:
l = df[df['C'] == 0].index.tolist()
for c, i in enumerate(l):
    pos = i + 1 + c  # shift the split point by the number of rows already inserted
    df = pd.concat([df.iloc[:pos],
                    pd.DataFrame([[np.nan, np.nan, np.nan]], columns=df.columns),
                    df.iloc[pos:]], ignore_index=True)
print(df)
Input:
   A  B  C
0  4  3  0
1  4  0  4
2  4  4  2
3  3  2  1
4  3  1  2
5  4  1  4
6  1  0  4
7  0  2  0
8  2  0  3
9  4  1  3
Output:
      A    B    C
0   4.0  3.0  0.0
1   NaN  NaN  NaN
2   4.0  0.0  4.0
3   4.0  4.0  2.0
4   3.0  2.0  1.0
5   3.0  1.0  2.0
6   4.0  1.0  4.0
7   1.0  0.0  4.0
8   0.0  2.0  0.0
9   NaN  NaN  NaN
10  2.0  0.0  3.0
11  4.0  1.0  3.0
Last thing: it can happen that the last row has 0 in 'C'. The loop above already appends the empty row at the end in that case, but if you build the result differently you can add it manually:
if df['C'].iloc[-1] == 0:
    df.loc[len(df)] = [np.nan, np.nan, np.nan]
Try slicing.
First, you need to find the rows where C == 0, so let's create a boolean mask for this. I'll just name it a:
a = (df['C'] == 0)
So, whenever C == 0, a == True.
Now we need to find the index of each row where C == 0, create an empty row and add it to the df:
df2 = df.copy()  # make a copy because we want to be safe here
for offset, i in enumerate(df.loc[a].index):
    i += offset  # earlier insertions shift the later rows down by one
    empty_row = pd.DataFrame([], index=[i])  # creating the empty data
    df2 = pd.concat([df2.loc[:i], empty_row, df2.loc[i + 1:]])  # slicing the df
    df2 = df2.reset_index(drop=True)  # reset the index
I must say... I don't know the size of your df and if this is fast enough, but give it a try
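With the example frame from the question (a single zero in column 'C' at row 1), this should give:
print(df2)
#      A    B    C
# 0  4.0  5.0  2.0
# 1  6.0  7.0  0.0
# 2  NaN  NaN  NaN
# 3  8.0  1.0  9.0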
In case you know the index where you want to insert a new row, concat can be a solution.
Example dataframe:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
#    A  B  C
# 0  1  4  7
# 1  2  5  8
# 2  3  6  9
Your new row as a dataframe with index 1:
new_row = pd.DataFrame({'A': np.nan, 'B': np.nan, 'C': np.nan}, index=[1])
Inserting your new row after the second row:
new_df = pd.concat([df.loc[:1], new_row, df.loc[2:]]).reset_index(drop=True)
#      A    B    C
# 0  1.0  4.0  7.0
# 1  2.0  5.0  8.0
# 2  NaN  NaN  NaN
# 3  3.0  6.0  9.0
Something like this should work for you (collect the rows, inserting a blank one after each match, then rebuild the frame):
rows = []
for key, row in df.iterrows():
    rows.append(row)
    if row['C'] == 0:
        rows.append(pd.Series(np.nan, index=df.columns))  # blank row after the match
df = pd.DataFrame(rows).reset_index(drop=True)
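A more vectorized sketch (mine; it assumes df still has its default RangeIndex): give each blank row a half-step index, so that sorting the index drops it right after its match:
mask = df['C'] == 0
blank = pd.DataFrame(np.nan, index=df.index[mask] + 0.5, columns=df.columns)
df = pd.concat([df, blank]).sort_index().reset_index(drop=True)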
Suppose I have two tables
import pandas as pd
import numpy as np
first_table = pd.DataFrame({'key1': [1, 2, 2, 2, 3, 3],
                            'key2': ['a', 'a', 'a', 'b', 'a', 'b'],
                            'key3': ['A', 'A', 'B', 'A', 'A', 'A'],
                            'value_first': range(6)})
second_table = pd.DataFrame({'key1': [1, 1, 2, 2, 3],
                             'key2': [np.nan, np.nan, 'a', 'a', 'b'],
                             'key3': [np.nan, np.nan, 'A', 'B', np.nan],
                             'value_second': [6, 4, 2, 0, -2]})
so the first table looks like this
   key1 key2 key3  value_first
0     1    a    A            0
1     2    a    A            1
2     2    a    B            2
3     2    b    A            3
4     3    a    A            4
5     3    b    A            5
while the second table looks like this
   key1 key2 key3  value_second
0     1  NaN  NaN             6
1     1  NaN  NaN             4
2     2    a    A             2
3     2    a    B             0
4     3    b  NaN            -2
Now I want an outer merge of first_table and second_table based on the three keys. Note that the second table is not unique on the three keys, but the first is. Hence key2 and key3 are not necessary when key1 is unique in the first table. In the same way, key3 is not necessary when the first two keys are unique in combination.
If the second table were properly filled in, then the merge would be straightforward:
pd.merge(first_table, second_table,
         how='outer',
         on=['key1', 'key2', 'key3'])
but how do I get the desired result in this case? The desired result looks like this
   key1 key2 key3  value_first  value_second
0     1    a    A          0.0           6.0
1     1    a    A          0.0           4.0
2     2    a    A          1.0           2.0
3     2    a    B          2.0           0.0
4     2    b    A          3.0           NaN
5     3    a    A          4.0           NaN
6     3    b    A          5.0          -2.0
The idea is to first merge the dataframes on key1 only, then fill the NaN key values with the corresponding values from the first table, drop the rows where the key values differ, and finally merge with the first table again to get the remaining rows (where value_second is NaN).
df = pd.merge(first_table, second_table, on='key1', how='outer')
df['key2_y'] = df['key2_y'].fillna(df['key2_x'])
df['key3_y'] = df['key3_y'].fillna(df['key3_x'])
df = df[(df['key2_x'] == df['key2_y']) & (df['key3_x'] == df['key3_y'])]
df.drop(['key2_y', 'key3_y'], axis=1, inplace=True)
df = pd.merge(first_table, df,
              left_on=['key1', 'key2', 'key3', 'value_first'],
              right_on=['key1', 'key2_x', 'key3_x', 'value_first'],
              how='outer')
df.drop(['key2_x', 'key3_x'], axis=1, inplace=True)
   key1 key2 key3  value_first  value_second
0     1    a    A            0           6.0
1     1    a    A            0           4.0
2     2    a    A            1           2.0
3     2    a    B            2           0.0
4     2    b    A            3           NaN
5     3    a    A            4           NaN
6     3    b    A            5          -2.0