pandas get original dataframe after vertical concatenation - python

Let us take a sample dataframe
df = pd.DataFrame(np.arange(10).reshape((5,2)))
df
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and concatenate the two columns into a single column
temp = pd.concat([df[0], df[1]]).to_frame()
temp
0
0 0
1 2
2 4
3 6
4 8
0 1
1 3
2 5
3 7
4 9
What would be the most efficient way to get the original dataframe, i.e. df, back from temp?
The following way using groupby works. But is there a more efficient way (e.g. without groupby-apply or pivot) to do the whole task: concatenate, apply some operation, and then revert to the original dataframe?
pd.DataFrame(temp.groupby(level=0)[0]
                 .apply(list)
                 .to_numpy()
                 .tolist())
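As an aside, a minimal sketch (not from the question itself): if the stacking is done with DataFrame.unstack instead of pd.concat, the column label is kept as an extra index level, which makes the reversal a one-liner:
long = df.unstack()           # Series indexed by (column, row); same values, column-major
long = long * 10              # ...some operation on the stacked values...
restored = long.unstack(0)    # pivot the column level back out -> original 5x2 shape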

I think we can do a pivot after assigning a counter column with cumcount (note that the value column is the integer 0, not the string '0'):
check = temp.assign(c=temp.groupby(level=0).cumcount()).pivot(columns='c', values=0)
Out[66]:
c 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
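A small follow-up sketch: pivot leaves the counter's name ('c') on the columns axis (visible in the header above); clearing it makes the result match df exactly:
check.columns.name = None    # drop the 'c' label that pivot left behind
check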

You can use groupby + cumcount to create a sequential counter per level=0 group, then append it to the index of the dataframe and use unstack to reshape:
temp.set_index(temp.groupby(level=0).cumcount(), append=True)[0].unstack()
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
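Putting it together, a sketch of the full task the question describes (concatenate, apply some operation, then revert), built on the same cumcount/unstack idea:
temp = pd.concat([df[0], df[1]]).to_frame()
temp[0] = temp[0] * 10    # ...some operation on the stacked values...
restored = temp.set_index(temp.groupby(level=0).cumcount(), append=True)[0].unstack()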

You can try this:
In [1267]: temp['g'] = temp.groupby(level=0)[0].cumcount()
In [1273]: temp.pivot(columns='g', values=0)
Out[1279]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
OR:
In [1281]: temp['g'] = (temp.index == 0).cumsum() - 1
In [1282]: temp.pivot(columns='g', values=0)
Out[1282]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9

df = pd.DataFrame(np.arange(10).reshape((5,2)))
temp = pd.concat([df[0], df[1]]).to_frame()
duplicated_index = temp.index.duplicated()
pd.concat([temp[~duplicated_index], temp[duplicated_index]], axis=1)
Works for this specific case (as pointed out in the comments, it will fail if you have more than one set of duplicate index values), so I don't think it's a better solution.
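For illustration, a sketch of that failure mode with three stacked columns, so index values repeat more than once:
df3 = pd.DataFrame(np.arange(15).reshape((5, 3)))
temp3 = pd.concat([df3[0], df3[1], df3[2]]).to_frame()
dup = temp3.index.duplicated()
# temp3[~dup] keeps only the first block of 5 rows, while temp3[dup] still has
# duplicate index labels, so the axis=1 concat can no longer line the pieces up.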

Related

pop rows from dataframe based on conditions

From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
A pop function for removing rows does not exist in pandas; you need to filter first and then drop the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
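If you want something closer to a single 'pop' statement, one option is to wrap the filter-and-drop into a small helper (a sketch; pop_rows is a hypothetical name, not a pandas function), resetting the indices so they match the desired output:
def pop_rows(df, mask, n):
    # Remove the first n rows matching mask from df (in place) and return them.
    popped = df[mask].head(n)
    df.drop(popped.index, inplace=True)
    return popped.reset_index(drop=True)

df2 = pop_rows(df1, df1.A.eq(2), 2)
df1 = df1.reset_index(drop=True)    # optional: renumber the remaining rows 0..5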

Flip DataFrame column order while keeping the index

I have a similar question to this one: Reverse DataFrame Column, But Maintain the Index
Reversing the rows works fine:
import pandas as pd
df = pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df.iloc[:] = df.iloc[::-1].values
How can I reverse the columns to get this result?
0 1 2
0 3 2 1
1 6 5 4
2 9 8 7
Reverse along the column axis by adding a , to the slice:
df.iloc[:] = df.iloc[:,::-1].values
df
0 1 2
0 3 2 1
1 6 5 4
2 9 8 7
You can use numpy's flip to reverse the columns; flip along axis=1 so only the column order is reversed:
import numpy as np
pd.DataFrame(np.flip(df.to_numpy(), axis=1))
0 1 2
0 3 2 1
1 6 5 4
2 9 8 7
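An equivalent label-preserving sketch without going through numpy: reverse the column order with a slice and then restore the original labels:
out = df[df.columns[::-1]]    # reverse the column order; labels travel with the data
out.columns = df.columns      # put the original 0, 1, 2 labels back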

Setting values in DataFrames using .loc

I have a configuration where it would be extremely useful to modify a value of a dataframe using a combination of loc and iloc.
df = pd.DataFrame([[1,2],[1,3],[1,4],[2,6],[2,5],[2,7]],columns=['A','B'])
Basically, in the dataframe above, I would like to take only the rows where a column equals something (i.e. A == 2), which would give:
A B
3 2 6
4 2 5
5 2 7
And then modify the value of B at the second position of that selection (which is actually index 4 in this case).
I can access the value I want using this command:
df.loc[df['A'] == 2,'B'].iat[1]
(or .iloc instead of .iat, but I heard that for changing many single values, iat is faster)
It yields: 5
However, I cannot seem to modify it using the same command:
df.loc[df['A'] == 2,'B'].iat[1] = 0
The dataframe is left unchanged:
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 5
5 2 7
I would like to get this :
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 0
5 2 7
Thank you !
We should not chain .loc and .iloc (or .iat/.at): the chained call assigns into a temporary copy, so the original dataframe is not modified. Do the lookup in a single loc call instead:
df.loc[df.index[df.A==2][1],'B']=0
df
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 0
5 2 7
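If you need this pattern repeatedly, a small helper along the same lines keeps the positional lookup in one place (a sketch; set_nth_match is a hypothetical name, not a pandas method):
def set_nth_match(df, mask, n, col, value):
    # Set `col` to `value` on the n-th row (0-based) that satisfies `mask`.
    df.loc[df.index[mask][n], col] = value

set_nth_match(df, df.A == 2, 1, 'B', 0)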
You can work around it with cumsum, which counts the matching instances:
s = df['A'].eq(2)
df.loc[s & s.cumsum().eq(2), 'B'] = 0
Output:
A B
0 1 2
1 1 3
2 1 4
3 2 6
4 2 0
5 2 7

Assign values to dataframe based on index

Having a dataframe as
col
1 1
2 2
3 3
and another dataframe where I need to put values calculated from the previous df. The val column is the product of the value and the index:
i j val
1 1 1
1 2 2
1 3 3
2 1 2
2 2 4
2 3 6
3 1 3
3 2 6
3 3 9
I've tried to calculate it using a loop, but I don't think that approach is the fastest one. How can I accomplish this in a more efficient way?
IIUC.
df2 = pd.DataFrame(index=pd.MultiIndex.from_product([df.index, df.col])).reset_index()
df2.columns = ['i', 'j']
df2['val'] = df2.i * df2.j
df2
Out[45]:
i j val
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 2
4 2 2 4
5 2 3 6
6 3 1 3
7 3 2 6
8 3 3 9
I would suggest building the i and j columns directly from the index and the column, repeating/tiling them so that every combination appears, and then multiplying:
df2 = pd.DataFrame({'i': df.index.repeat(len(df)),
                    'j': list(df['col']) * len(df)})
df2['val'] = df2['i'] * df2['j']
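Another sketch, assuming pandas 1.2 or newer: build the same cross product with a cross merge and then multiply:
df2 = pd.DataFrame({'i': df.index}).merge(pd.DataFrame({'j': df['col']}), how='cross')
df2['val'] = df2['i'] * df2['j']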

python pandas groupby() result

I have the following python pandas data frame:
df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to have another column storing the sum of the C values for each fixed pair of A and B. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index=False)
    res[a] = group_by_B['C'].sum()
but I don't know how to get the results from res back into df in an orderly fashion. I would be very happy with any advice on this. Thank you.
Here's one way (though it feels like this should work in one go with apply, I can't get it to).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The groupby size function is the one you want; we have to match it to 'A' and 'B' by making them the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one-liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
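A short sketch of why transform fits here: it returns a result aligned to the original rows, whereas a plain aggregation collapses to one row per group:
g = df.groupby(['A', 'B'])['C']
g.sum()               # one value per (A, B) pair, indexed by the group keys
g.transform('sum')    # same length and index as df, so it can be assigned to a new column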
You could also do a one-liner using merge as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
You can use this method:
columns = ['col1', 'col2', ...]
df.groupby('col')[columns].sum()
If you want, you can also append .sort_values(by='colx', ascending=True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
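Applied to the question's frame, that recipe would look something like this (a sketch; the sort is only there to illustrate sort_values):
out = (df.groupby(['A', 'B'])[['C']].sum()
         .sort_values(by='C', ascending=False)
         .reset_index())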
