pandas dataframe reshape cast [duplicate] - python

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 6 years ago.
I have a dataframe like this:
import pandas
df=pandas.DataFrame([['a','b'],['a','c'],['b','c'],['b','d'],['c','f']],columns=['id','key'])
print(df)
  id key
0  a   b
1  a   c
2  b   c
3  b   d
4  c   f
the result that I wanted:
  id  key
0  a  b,c
1  b  c,d
2  c    f
I tried to use the pivot function, but I didn't get this result. The cast package in R seems to tackle this problem. Thanks!

You need groupby with apply and join:
df1 = df.groupby('id')['key'].apply(','.join).reset_index()
print (df1)
  id  key
0  a  b,c
1  b  c,d
2  c    f
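An equivalent spelling of the same groupby-join idea uses agg; with as_index=False the reset_index step is not needed (a sketch, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['a', 'c'], ['b', 'c'], ['b', 'd'], ['c', 'f']],
                  columns=['id', 'key'])

# as_index=False keeps 'id' as a regular column, so no reset_index is needed
df1 = df.groupby('id', as_index=False).agg({'key': ','.join})
```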

a numpy approach
import numpy as np
import pandas as pd

g = df.id.values
k = df.key.values

# stable sort so keys keep their original order within each id
a = g.argsort(kind='mergesort')
gg = g[a]
kg = k[a]

# positions where consecutive sorted ids differ, i.e. group boundaries
w = np.where(gg[:-1] != gg[1:])[0]

pd.DataFrame(dict(
    id=gg[np.append(w, len(a) - 1)],
    key=[','.join(l.tolist()) for l in np.split(kg, w + 1)]
))

  id  key
0  a  b,c
1  b  c,d
2  c    f


Create a dataframe of combinations with an ID with pandas [duplicate]

This question already has answers here:
cartesian product in pandas
(13 answers)
Closed 19 days ago.
I'm running into a wall in terms of how to do this with Pandas. Given a dataframe (df1) with an ID column, and a separate dataframe (df2), how can I combine the two to make a third dataframe that preserves the ID column with all the possible combinations it could have?
df1
ID name.x
1 a
2 b
3 c
df2
name.y
l
m
dataframe creation:
df1 = pd.DataFrame({'ID':[1,2,3],'name.x':['a','b','c']})
df2 = pd.DataFrame({'name.y':['l','m']})
combined df
ID name.x name.y
1 a l
1 a m
2 b l
2 b m
3 c l
3 c m
Create a column on each that is the same, do a full outer join, then keep the columns you want:
df1 = pd.DataFrame({'ID':[1,2,3],'name.x':['a','b','c']})
df2 = pd.DataFrame({'name.y':['l','m']})
df1['join_col'] = True
df2['join_col'] = True
df3 = pd.merge(df1, df2, how='outer', on='join_col')
print(df3[['ID','name.x','name.y']])
will output:
   ID name.x name.y
0   1      a      l
1   1      a      m
2   2      b      l
3   2      b      m
4   3      c      l
5   3      c      m
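On newer pandas (1.2+), the helper column can be skipped entirely with how='cross', which should give the same result:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'name.x': ['a', 'b', 'c']})
df2 = pd.DataFrame({'name.y': ['l', 'm']})

# cross join: every row of df1 paired with every row of df2
df3 = df1.merge(df2, how='cross')
```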

How can a duplicate row be dropped with some condition [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 9 months ago.
Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
   A  B  C
0  1  0  a
1  1  1  b
2  2  2  c
3  2  3  d
I wish, for every value (group) of column A, to get the value of column C for which column B is maximum. For example, for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
   A  C
0  1  b
1  2  d
No need to assume column B is sorted, performance is of top priority, then elegance.
Check with sort_values + drop_duplicates:
df.sort_values('B').drop_duplicates(['A'], keep='last')
Out[127]:
   A  B  C
1  1  1  b
3  2  3  d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
# 1    b
# 2    d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
   .groupby('A')['B']
   .nlargest(1)
   .index
   .to_frame()
   .reset_index(drop=True))

   A  C
0  1  b
1  2  d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
   A  C
0  1  b
1  2  d
Similar solution to @Jondiedoop, but avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=1)
   A  C
0  1  b
1  2  d
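Another option, not in the answers above, is a transform-based mask; unlike the idxmax approaches, it keeps every row when a group's maximum is tied (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})

# mark rows where B equals its group's maximum, then keep only A and C
mask = df['B'] == df.groupby('A')['B'].transform('max')
out = df.loc[mask, ['A', 'C']].reset_index(drop=True)
```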

How to create conditional pandas series/column?

Here is a sample df:
   A  B  C  D  E (New Column)
0  1  2  a  n  ?
1  3  3  p  d  ?
2  5  9  f  z  ?
If column A == column B, pick column C's value for column E;
otherwise, pick column D's value for column E.
I have tried many ways but failed. I am new, please teach me how to do it, thank you!
Note:
It needs to pick the value from column C or column D in this case, so there are no pre-specified values to fill into column E (this is the main difference from other similar questions).
Use numpy.where:
import numpy as np

df['E'] = np.where(df['A'] == df['B'], df['C'], df['D'])
df
   A  B  C  D  E
0  1  2  a  n  n
1  3  3  p  d  p
2  5  9  f  z  z
Try pandas where:
df['E'] = df['C'].where(df['A'].eq(df['B']), df['D'])
df
Out[570]:
   A  B  C  D  E
0  1  2  a  n  n
1  3  3  p  d  p
2  5  9  f  z  z
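If more conditions are ever needed, numpy.select generalizes the same idea to several condition/choice pairs with a default (a sketch, beyond what the question asked):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 3, 9],
                   'C': ['a', 'p', 'f'], 'D': ['n', 'd', 'z']})

# the first matching condition wins; default covers all remaining rows
conditions = [df['A'].eq(df['B'])]
choices = [df['C']]
df['E'] = np.select(conditions, choices, default=df['D'])
```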

How can I do a specific operation on my Dataframe?

Hello, I have this dataframe using Python and pandas:
   b  c
d  1  4
e  2  5
f  3  6
I would like to have this:
a  b  c
d  1  4
e  2  5
f  3  6
How can I do this operation?
Thank you very much!
df['a'] = df.index
df = df[['a','b','c']].reset_index(drop=True)
To change the index column's name:
df.index.name = "a"
To change the index column to be a regular column:
df.reset_index(inplace=True)
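The two steps can also be chained in one expression: rename_axis names the index 'a' before reset_index promotes it to a regular column (a sketch of the same idea):

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3], 'c': [4, 5, 6]}, index=['d', 'e', 'f'])

# name the index 'a', then turn it into a regular first column
out = df.rename_axis('a').reset_index()
```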

How to shuffle several lists or arrays in Python? [duplicate]

This question already has answers here:
Better way to shuffle two related lists
(8 answers)
Closed 6 years ago.
Suppose I have a list A and shuffled it:
import random
random.shuffle(A)
Now I wish to shuffle a second list B with THE SAME permutation as was applied by shuffle to A.
How is that possible?
What about pandas?
You could shuffle a list of indices:
import random

def reorderList(l, order):
    ret = []
    for i in order:
        ret.append(l[i])
    return ret

# random.shuffle works in place and returns None, so build the list first
order = list(range(len(a)))
random.shuffle(order)
a = reorderList(a, order)
b = reorderList(b, order)
You could zip the lists before shuffling, then unzip them. It's not particularly memory efficient (since you're essentially copying the lists during the shuffle).
a = [1, 2, 3]
b = [4, 5, 6]
# in Python 3, zip returns an iterator, so materialize it before shuffling
c = list(zip(a, b))
random.shuffle(c)
a, b = zip(*c)
With pandas, you would permute the index instead:
df = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 2)), columns = list('AB'))
df
Out[39]:
A B
0 c e
1 b e
2 c e
3 a d
4 e d
5 d d
6 b b
7 a e
8 e b
9 a b
Sampling with frac=1 gives you the shuffled dataframe:
df.sample(frac=1)
Out[40]:
A B
6 b b
5 d d
1 b e
2 c e
9 a b
8 e b
4 e d
7 a e
3 a d
0 c e
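A shared permutation can also be drawn once with numpy and applied to any number of arrays by fancy indexing (a sketch, assuming the lists are first converted to arrays):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array(['w', 'x', 'y', 'z'])

# one random permutation of positions, applied identically to both arrays
p = np.random.permutation(len(a))
a_shuffled, b_shuffled = a[p], b[p]
```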
