df I have:
   A  B  C
a  1  2  3
b  2  1  4
c  1  1  1
df I want:
   A  B  C
a  1  2  3
b  2  1  4
c  1  1  1
d  1 -1  1
I am able to get the df I want by using:
df.loc['d'] = df.loc['b'] - df.loc['a']
However, my actual df has 'a', 'b', 'c' rows for multiple IDs 'X', 'Y', etc.:
     A  B  C
X a  1  2  3
  b  2  1  4
  c  1  1  1
Y a  1  2  3
  b  2  1  4
  c  1  1  1
How can I create the same output with multiple IDs?
My original method:
df.loc['d'] = df.loc['b'] - df.loc['a']
fails with KeyError: 'b', because 'b' now lives in the second level of the MultiIndex while df.loc looks in the first.
Desired output:
     A  B  C
X a  1  2  3
  b  2  1  4
  c  1  1  1
  d  1 -1  1
Y a  1  2  3
  b  2  2  4
  c  1  1  1
  d  1  0  1
IIUC, loop over the first index level with groupby:
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
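As a self-contained sketch (the two-level sample frame below is an assumption that mirrors the X/Y data in the question):

```python
import pandas as pd

# Rebuild the example: a two-level index of (ID, row label)
idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1]] * 2,
                  index=idx, columns=['A', 'B', 'C'])

# For each ID, add a 'd' row holding b - a
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]

print(df.sort_index())
```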
Or maybe:
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame(
           [s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
           columns=s.columns,
           index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to do a manipulation on a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))
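For reference, a runnable version of this reshape (the sample frame is assumed, matching the question's data):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1]] * 2,
                  index=idx, columns=['A', 'B', 'C'])

# unstack/transpose turns 'a'/'b'/'c' into columns that assign()
# can address by name; stack/unstack then restores the original shape
result = (df.unstack(0).T
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))
print(result)
```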
Use pd.IndexSlice to slice rows 'a' and 'b'. Call diff, keep only the 'b' rows, and rename them to 'd'. Finally, concatenate the result onto the original df (DataFrame.append did the same job before its removal in pandas 2.0):
idx = pd.IndexSlice
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
df2 = pd.concat([df, df1]).sort_index().astype(int)
Out[106]:
     A  B  C
X a  1  2  3
  b  2  1  4
  c  1  1  1
  d  1 -1  1
Y a  1  2  3
  b  2  2  4
  c  1  1  1
  d  1  0  1
I am trying to create a "two-entry table" from the columns of my df. I tried pivot_table / crosstab / groupby, but none of them produce the "two-entry table" layout I am after.
For example, if I have a dataframe like this:
df
A  B  C  D  E
1  0  0  1  1
0  1  0  1  0
1  1  1  1  1
I would like to transform my df into one that reads like a "two-entry table":
   A  B  C  D  E
A  2  1  1  2  2
B  1  2  1  2  1
C  1  1  1  1  1
D  2  2  1  3  1
E  2  1  1  1  2
To explain the first row: A has two 1s in its column, so A-A = 2; A-B = 1 because they share a 1 in the third row of df; A-C = 1 because they also share a 1 in the third row; and finally A-E = 2 because they share 1s in the first and third rows of df.
Use pd.DataFrame.dot with T:
df.T.dot(df)  # or df.T @ df
Output:
   A  B  C  D  E
A  2  1  1  2  2
B  1  2  1  2  1
C  1  1  1  1  1
D  2  2  1  3  2
E  2  1  1  2  2
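A quick check of the matrix-product identity behind this, with the question's data rebuilt by hand:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1], 'B': [0, 1, 1], 'C': [0, 0, 1],
                   'D': [1, 1, 1], 'E': [1, 0, 1]})

# Cell (i, j) of df.T @ df sums the products of columns i and j,
# i.e. it counts the rows where both 0/1 indicators are 1
co = df.T.dot(df)
print(co)
```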
Consider the following DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [list('abc'), list('def'), list('ghi')]})
>>> df
   A          B
0  1  [a, b, c]
1  2  [d, e, f]
2  3  [g, h, i]
This is one way to get the desired result:
>>> df['B'] = df['B'].apply(enumerate).apply(list)
>>> df = df.explode('B', ignore_index=True)
>>> df[['B1', 'B2']] = pd.DataFrame(df['B'].tolist())
>>> df.drop(columns='B')
   A  B1 B2
0  1   0  a
1  1   1  b
2  1   2  c
3  2   0  d
4  2   1  e
5  2   2  f
6  3   0  g
7  3   1  h
8  3   2  i
Is there a neater way?
A groupby on the index is an option:
df.explode('B').assign(
    B1=lambda df: df.groupby(level=0).cumcount())
   A  B  B1
0  1  a   0
0  1  b   1
0  1  c   2
1  2  d   0
1  2  e   1
1  2  f   2
2  3  g   0
2  3  h   1
2  3  i   2
You can always reset the index if you have no use for it:
df.explode('B').assign(
    B1=lambda df: df.groupby(level=0).cumcount()).reset_index(drop=True)
   A  B  B1
0  1  a   0
1  1  b   1
2  1  c   2
3  2  d   0
4  2  e   1
5  2  f   2
6  3  g   0
7  3  h   1
8  3  i   2
Since pandas version 1.3.0 you can explode multiple columns out of the box:
df.assign(
    B1=df.B.apply(len).apply(range)).explode(['B', 'B1'], ignore_index=True)
A B B1
0 1 a 0
1 1 b 1
2 1 c 2
3 2 d 0
4 2 e 1
5 2 f 2
6 3 g 0
7 3 h 1
8 3 i 2
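Put together as a runnable snippet (pandas >= 1.3; the sample frame is assumed):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [list('abc'), list('def'), list('ghi')]})

# Build a parallel range column, then explode both columns in one call;
# this requires the per-row list-likes to have matching lengths
out = df.assign(B1=df.B.apply(len).apply(range)).explode(
    ['B', 'B1'], ignore_index=True)
print(out)
```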
I think a faster option would be to run the reshaping outside pandas and then join back to the dataframe (of course, only tests can confirm or deny this):
from itertools import chain

# you can use np.concatenate instead:
# np.concatenate(df.B)
flattened = chain.from_iterable(df.B)
index = df.index.repeat([*map(len, df.B)])
flattened = pd.Series(flattened, index, name='B1')
(pd.concat([df.A, flattened], axis=1)
   .assign(B2=lambda df: df.groupby(level=0).cumcount()))
   A B1  B2
0  1  a   0
0  1  b   1
0  1  c   2
1  2  d   0
1  2  e   1
1  2  f   2
2  3  g   0
2  3  h   1
2  3  i   2
The pandas dataframe includes two columns, 'A' and 'B':
A  B
1  a b
2  a c d
3  x
Each value in column 'B' is a string containing a variable number of letters separated by spaces.
Is there a simple way to construct:
A B
1 a
1 b
2 a
2 c
2 d
3 x
You can use the following:
splitted = df.set_index("A")["B"].str.split(expand=True)
stacked = splitted.stack().reset_index(1, drop=True)
result = stacked.to_frame("B").reset_index()
print(result)
A B
0 1 a
1 1 b
2 2 a
3 2 c
4 2 d
5 3 x
For the sub steps, see below:
print(splitted)
0 1 2
A
1 a b None
2 a c d
3 x None None
print(stacked)
A
1 a
1 b
2 a
2 c
2 d
3 x
dtype: object
Or you may also use pd.melt:
splitted = df["B"].str.split(expand=True)
pd.melt(splitted.assign(A=df.A), id_vars="A", value_name="B")\
.dropna()\
.drop("variable", axis=1)\
.sort_values("A")
A B
0 1 a
3 1 b
1 2 a
4 2 c
7 2 d
2 3 x
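The split/stack route end to end, with the question's data assumed:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a b', 'a c d', 'x']})

# expand=True splits into as many columns as the longest row needs,
# padding with None; stack() then drops those gaps
splitted = df.set_index('A')['B'].str.split(expand=True)
result = (splitted.stack()
                  .reset_index(1, drop=True)
                  .to_frame('B')
                  .reset_index())
print(result)
```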
I have a dataframe with multiple group columns and a value column.
   a  b  val
0  A  C    1
1  A  D    1
2  A  D    1
3  A  D    2
4  B  E    0
For any one group, e.g. a == A, b == C, I can do value_counts on the series slice. How can I get the value counts of all possible combinations of the group columns, in a dataframe format similar to:
   a  b  val  counts
0  A  C    1       1
1  A  D    1       2
2  A  D    2       1
3  B  E    0       1
Is that what you want?
In [47]: df.groupby(['a','b','val']).size().reset_index()
Out[47]:
a b val 0
0 A C 1 1
1 A D 1 2
2 A D 2 1
3 B E 0 1
or this?
In [43]: df['counts'] = df.groupby(['a','b'])['val'].transform('size')
In [44]: df
Out[44]:
a b val counts
0 A C 1 1
1 A D 1 3
2 A D 1 3
3 A D 2 3
4 B E 0 1
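A self-contained version of the first approach; passing name= to reset_index labels the count column directly instead of leaving it as 0 (sample data copied from the question):

```python
import pandas as pd

df = pd.DataFrame({'a': list('AAAAB'),
                   'b': list('CDDDE'),
                   'val': [1, 1, 1, 2, 0]})

# size() counts rows per unique (a, b, val) combination
counts = (df.groupby(['a', 'b', 'val'])
            .size()
            .reset_index(name='counts'))
print(counts)
```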
I have n-by-n data in a csv in the following format:
-  A  B  C  D
A  0  1  2  4
B  2  0  3  1
C  1  0  0  5
D  2  5  4  0
...
I would like to read it and convert it to a long-format pandas dataframe like this:
Origin Dest Distance
A A 0
A B 1
A C 2
...
What is the best way to convert it? In the worst case, I'll write a for loop to read each line and append the transpose of it but there must be an easier way. Any help would be appreciated.
Use pd.melt()
Assuming, your dataframe looks like
In [479]: df
Out[479]:
- A B C D
0 A 0 1 2 4
1 B 2 0 3 1
2 C 1 0 0 5
3 D 2 5 4 0
In [480]: pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
.....: var_name='Dest', value_name='Distance')
Out[480]:
- Dest Distance
0 A A 0
1 B A 2
2 C A 1
3 D A 2
4 A B 1
5 B B 0
6 C B 0
7 D B 5
8 A C 2
9 B C 3
10 C C 0
11 D C 4
12 A D 4
13 B D 1
14 C D 5
15 D D 0
Here df.columns.values.tolist()[1:] is the list of remaining columns, ['A', 'B', 'C', 'D'].
To rename '-' to 'Origin', you could use DataFrame.rename(columns={...}):
pd.melt(df, id_vars=['-'], value_vars=df.columns.values.tolist()[1:],
var_name='Dest', value_name='Distance').rename(columns={'-': 'Origin'})
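Runnable end to end (the matrix from the question, rebuilt by hand; value_vars can be omitted, since melt defaults to all non-id columns):

```python
import pandas as pd

df = pd.DataFrame({'-': list('ABCD'),
                   'A': [0, 2, 1, 2],
                   'B': [1, 0, 0, 5],
                   'C': [2, 3, 0, 4],
                   'D': [4, 1, 5, 0]})

# melt keeps '-' as the row label and turns every other column
# into (Dest, Distance) pairs; rename gives '-' a proper name
long_df = (pd.melt(df, id_vars=['-'], var_name='Dest', value_name='Distance')
             .rename(columns={'-': 'Origin'}))
print(long_df)
```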