The pandas DataFrame includes two columns, 'A' and 'B':
A  B
1  a b
2  a c d
3  x
Each value in column 'B' is a string containing a variable number of letters separated by spaces.
Is there a simple way to construct:
A B
1 a
1 b
2 a
2 c
2 d
3 x
You can use the following:
splitted = df.set_index("A")["B"].str.split(expand=True)
stacked = splitted.stack().reset_index(1, drop=True)
result = stacked.to_frame("B").reset_index()
print(result)
A B
0 1 a
1 1 b
2 2 a
3 2 c
4 2 d
5 3 x
For the sub steps, see below:
print(splitted)
0 1 2
A
1 a b None
2 a c d
3 x None None
print(stacked)
A
1 a
1 b
2 a
2 c
2 d
3 x
dtype: object
Or you may also use pd.melt:
splitted = df["B"].str.split(expand=True)
pd.melt(splitted.assign(A=df.A), id_vars="A", value_name="B")\
.dropna()\
.drop("variable", axis=1)\
.sort_values("A")
A B
0 1 a
3 1 b
1 2 a
4 2 c
7 2 d
2 3 x
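On pandas 0.25 or later, DataFrame.explode may be a shorter route to the same result. The following is a minimal sketch, assuming the sample frame from the question is rebuilt as shown; it is an alternative to the stack and melt answers, not a restatement of them.

```python
import pandas as pd

# Sample frame matching the question: 'B' holds space-separated letters.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["a b", "a c d", "x"]})

# Split each string into a list, then give every list element its own row.
result = (
    df.assign(B=df["B"].str.split())
      .explode("B")
      .reset_index(drop=True)
)
print(result)
```

explode keeps the original row order, so no sorting step is needed afterwards.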
I am working on a data frame as below,
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','C','C','C','C'],
'B':['a','a','b','a','b','a','b','c','c'],
})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B b
5 C a
6 C b
7 C c
8 C c
I want to create a new column that numbers the subgroups of column B within each group of column A, like below:
A B C
0 A a 1
1 A a 1
2 A b 2
3 B a 1
4 B b 2
5 C a 3
6 C b 1
7 C c 2
8 C c 2
I tried this, but it does not give me the desired output:
df['C'] = df.groupby(['A','B']).cumcount()+1
IIUC, I think you want something like this:
df['C'] = df.groupby('A')['B'].transform(lambda x: (x != x.shift()).cumsum())
Output:
A B C
0 A a 1
1 A a 1
2 A b 2
3 B a 1
4 B b 2
5 C a 1
6 C b 2
7 C c 3
8 C c 3
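The shift/cumsum idiom numbers consecutive runs, which is not always the same as numbering distinct values. The sketch below, using the frame from the question, puts the two readings side by side; the column names run_id and value_id and the small demo series are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "B": ["a", "a", "b", "a", "b", "a", "b", "c", "c"],
})

# Run-based numbering: increments whenever B differs from the previous row of the group.
df["run_id"] = df.groupby("A")["B"].transform(lambda s: (s != s.shift()).cumsum())

# Value-based numbering: each distinct B value gets one number per group,
# in order of first appearance, even if it reappears later.
df["value_id"] = df.groupby("A")["B"].transform(
    lambda s: pd.Series(pd.factorize(s)[0] + 1, index=s.index)
)

# The two differ only when a value comes back after an interruption.
demo = pd.Series(["a", "b", "a"])
run_ids = (demo != demo.shift()).cumsum().tolist()   # [1, 2, 3]
value_ids = (pd.factorize(demo)[0] + 1).tolist()     # [1, 2, 1]
```

For the data in the question both columns come out identical, so either reading fits; the choice matters once a letter reappears non-adjacently within a group.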
Let k be a pandas DataFrame with a column of letters and a column of integers:
>>> import numpy
>>> import pandas as pd
>>> k = pd.DataFrame({
...     "a": numpy.random.choice(list("abcde"), 10),
...     "b": numpy.random.choice(range(5), 10)
... })
>>> k
a b
0 a 1
1 c 2
2 e 1
3 b 3
4 c 2
5 d 2
6 e 2
7 c 3
8 b 0
9 a 3
Using value_counts(), the counts of the letters are found:
>>> counts = k["a"].value_counts()
>>> counts
c 3
e 2
b 2
a 2
d 1
Name: a, dtype: int64
How can I add each count to its respective row? It should result in
>>> k
a b count
0 a 1 2
1 c 2 3
2 e 1 2
[...]
9 a 3 2
Here's an alternative to using transform:
First, you can extract the value_counts() into a dataframe:
mycounts = k['a'].value_counts().rename_axis('a').reset_index(name = 'counts')
The step above is useful in many different scenarios (and good to know in general).
Then, a left-join will put the value counts into the original dataframe:
k = k.merge(mycounts, on = 'a', how = 'left')
You can try with transform
k['count']=k.groupby('a').a.transform('count')
k
Out[330]:
a b count
0 d 1 2
1 e 3 3
2 e 3 3
3 d 3 2
4 b 4 4
5 b 1 4
6 b 0 4
7 a 2 1
8 b 0 4
9 e 4 3
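A third option is to map the letters through the value_counts() result directly. This is a minimal sketch, assuming the random frame is replaced by the fixed sample printed in the question so that the outcome is reproducible.

```python
import pandas as pd

# Deterministic copy of the sample printed in the question (instead of random data).
k = pd.DataFrame({"a": list("acebcdecba"), "b": [1, 2, 1, 3, 2, 2, 2, 3, 0, 3]})

# value_counts() is indexed by letter, so map() can look each row's letter up directly.
k["count"] = k["a"].map(k["a"].value_counts())
print(k)
```

This avoids both the intermediate frame needed for the merge and the groupby, at the cost of computing the counts as a separate Series.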
Having the following data frames:
d1 = pd.DataFrame({'A':[1,1,1,2,2,2,3,3,3]})
A C
0 1 'x'
1 1 'x'
2 1 'x'
3 2 'y'
4 2 'y'
5 2 'y'
6 3 'z'
7 3 'z'
8 3 'z'
d2 = pd.DataFrame({'B':['a','b','c']})
B
0 a
1 b
2 c
I would like to apply the values of d2 to the groups of A and C of d1 so the resulting DF would look like this:
A C B
0 1 x a
1 1 x a
2 1 x a
3 2 y b
4 2 y b
5 2 y b
6 3 z c
7 3 z c
8 3 z c
How can I achieve this using Pandas?
If the values of A are consecutive integers starting at 1, you can use Series.map with an enumerate object converted to a dictionary:
d1['b'] = d1['A'].map(dict(enumerate(d2['B'], 1)))
print (d1)
A b
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
A more general solution uses factorize to number the groups from 0 and maps those numbers through a dictionary:
d = dict(zip(*pd.factorize(d2['B'])))
d1['B'] = pd.Series(pd.factorize(d1['A'])[0], index=d1.index).map(d)
#alternative
#d1['B'] = d1.groupby('A', sort=False).ngroup().map(d)
print (d1)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
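To take duplicate categories in your d2 into account, we will use drop_duplicates with Series.map:
# reset the index after dropping duplicates so the keys stay consecutive (1, 2, 3, ...)
values = d2['B'].drop_duplicates().reset_index(drop=True)
values.index = values.index + 1
d1['B'] = d1['A'].map(values)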
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
You can use df.merge here.
d2.index+=1
d1.merge(d2,left_on='A',right_index=True)
A B
0 1 a
1 1 a
2 1 a
3 2 b
4 2 b
5 2 b
6 3 c
7 3 c
8 3 c
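The question asks about groups of A and C together, while the answers above key on A alone. The following sketch assumes d1 really has both columns, as in the printed table, and that d2 keeps its default 0-based index; it numbers the (A, C) groups and uses that number as the lookup key.

```python
import pandas as pd

# d1 rebuilt with both columns shown in the question's table (an assumption).
d1 = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "C": ["x", "x", "x", "y", "y", "y", "z", "z", "z"],
})
d2 = pd.DataFrame({"B": ["a", "b", "c"]})

# Number each (A, C) group 0, 1, 2, ... in order of first appearance.
group_no = d1.groupby(["A", "C"], sort=False).ngroup()

# d2 has a default 0-based index, so the group number doubles as a lookup key.
d1["B"] = group_no.map(d2["B"])
print(d1)
```

Because the key is the group's position rather than the value of A, this also works when A is not a run of consecutive integers.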
The df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
The df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I can get the desired df by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method:
df.loc['d']=df.loc['b']-df.loc['a']
fails with KeyError: 'b'
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i,'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
Or maybe
k = df.groupby(df.index.get_level_values(0), as_index=False).apply(lambda s: pd.DataFrame([s.loc[(s.name,'b')].values - s.loc[(s.name, 'a')].values],
columns=s.columns,
index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0],[0]])
)).reset_index(drop=True, level=0)
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to manipulate a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
.assign(d=lambda x:x.b-x.a)
.stack()
.unstack(0))
Use pd.IndexSlice to slice a and b. Call diff and slice on b and rename it to d. Finally, concatenate it with the original df (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
idx = pd.IndexSlice
df1 = df.loc[idx[:,['a','b']],:].diff().loc[idx[:,'b'],:].rename({'b': 'd'})
df2 = pd.concat([df, df1]).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
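Another way to compute b - a per ID is to take cross-sections of the second index level. The sketch below is self-contained; the index level names id and row are hypothetical (the original index is unnamed), and the sample values follow the desired output shown in the question.

```python
import pandas as pd

# Level names 'id' and 'row' are illustrative; the original frame has unnamed levels.
index = pd.MultiIndex.from_product([["X", "Y"], ["a", "b", "c"]], names=["id", "row"])
df = pd.DataFrame(
    [[1, 2, 3], [2, 1, 4], [1, 1, 1], [1, 2, 3], [2, 2, 4], [1, 1, 1]],
    index=index, columns=["A", "B", "C"],
)

# Cross-sections drop the 'row' level, so the subtraction aligns on 'id' alone.
diff = df.xs("b", level="row") - df.xs("a", level="row")

# Put the 'row' level back with the new label 'd', then combine and sort.
diff = diff.assign(row="d").set_index("row", append=True)
result = pd.concat([df, diff]).sort_index()
print(result)
```

Since the subtraction is vectorised over all IDs at once, no explicit loop over groups is needed.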
I have three columns, A, B and C. I want to create a fourth column D that contains values of A or B, based on the value of C. For example:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
In the above example, column D takes the value of column A if the value of C is 1 and the value of column B if the value of C is 0. Is there an elegant way to do it in Pandas? Thank you for your help.
Use numpy.where:
In [20]: df
Out[20]:
A B C
0 1 2 1
1 2 3 0
2 3 4 0
3 4 5 1
In [21]: df['D'] = np.where(df.C, df.A, df.B)
In [22]: df
Out[22]:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
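np.where(df.C, ...) relies on 0 and 1 being treated as False and True. A slightly more explicit variant, sketched here with the question's data rebuilt from scratch, spells the condition out so it still reads correctly if C ever holds other values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5], "C": [1, 0, 0, 1]})

# Spell out the condition instead of relying on 0/1 truthiness:
# take A where C equals 1, otherwise take B.
df["D"] = np.where(df["C"] == 1, df["A"], df["B"])
print(df)
```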
pandas
In consideration of the OP's request, "Is there an elegant way to do it in Pandas?", and my own opinion of elegance and idiomatic pure pandas:
assign + pd.Series.where
# where expects a boolean condition, so compare C to 1 explicitly
df.assign(D=df.A.where(df.C.eq(1), df.B))
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
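Response to a comment:
"How would you modify the pandas answer if instead of 0, 1 in column C you had A, B?"
# DataFrame.lookup was removed in pandas 2.0; factorize plus NumPy indexing picks one cell per row (needs numpy as np)
codes, labels = pd.factorize(df.C)
df.assign(D=df.reindex(columns=labels).to_numpy()[np.arange(len(df)), codes])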
A B C D
0 1 2 A 1
1 2 3 B 3
2 3 4 B 4
3 4 5 A 4