How can I join Series A multiindexed by (A, B) with Series B indexed by A?
Currently the only way is to bring the indices to a common footing -- e.g. move the B level of the series_A MultiIndex to a column so that both series_A and series_B are indexed only by A:
import pandas as pd
series_A = pd.Series(1, index=pd.MultiIndex.from_product([['A1', 'A4'],['B1','B2']], names=['A','B']), name='series_A')
# A B
# A1 B1 1
# B2 1
# A4 B1 1
# B2 1
# Name: series_A, dtype: int64
series_B = pd.Series(2, index=pd.Index(['A1', 'A2', 'A3'], name='A'), name='series_B')
# A
# A1 2
# A2 2
# A3 2
# Name: series_B, dtype: int64
tmp = series_A.to_frame().reset_index('B')
result = tmp.join(series_B, how='outer').set_index('B', append=True)
print(result)
yields
        series_A  series_B
A  B
A1 B1        1.0       2.0
   B2        1.0       2.0
A2 NaN       NaN       2.0
A3 NaN       NaN       2.0
A4 B1        1.0       NaN
   B2        1.0       NaN
Another way to join them would be to unstack the B level from series_A:
In [215]: series_A.unstack('B').join(series_B, how='outer')
Out[215]:
B1 B2 series_B
A
A1 1.0 1.0 2.0
A2 NaN NaN 2.0
A3 NaN NaN 2.0
A4 1.0 1.0 NaN
unstack moves the B index level to the column index. Thus the theme is the
same (bring the indices to a common footing), though the result is different.
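The same common footing can also be reached by resetting both indices to columns and merging on the shared A level. A minimal sketch, assuming the series_A and series_B defined above:
# Sketch: reset both indices, merge on the shared 'A' column,
# then rebuild the (A, B) MultiIndex. Same idea, different spelling.
merged = (pd.merge(series_A.reset_index(), series_B.reset_index(),
                   on='A', how='outer')
            .set_index(['A', 'B'])
            .sort_index())
print(merged)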
Related
Probably easier to understand with an example so here we go:
import numpy as np
import pandas as pd

random = np.random.uniform(size=3)
what_i_have = pd.DataFrame({
    ('a', 'a'): random,
    ('b', 'b1'): np.linspace(3, 5, 3),
    ('b', 'b2'): np.linspace(6, 8, 3),
    ('b', 'b3'): np.linspace(9, 11, 3)
})
what_i_want = pd.DataFrame({
    ('a', 'a'): np.concatenate((random, random, random)),
    ('b', 'b_category'): ['b1']*3 + ['b2']*3 + ['b3']*3,
    ('b', 'b_value'): np.linspace(3, 11, 9)
})
print(what_i_have)
print('----------------------------------')
print(what_i_want)
Output:
a b
a b1 b2 b3
0 0.587075 3.0 6.0 9.0
1 0.798710 4.0 7.0 10.0
2 0.206860 5.0 8.0 11.0
----------------------------------
a b
a b_category b_value
0 0.587075 b1 3.0
1 0.798710 b1 4.0
2 0.206860 b1 5.0
3 0.587075 b2 6.0
4 0.798710 b2 7.0
5 0.206860 b2 8.0
6 0.587075 b3 9.0
7 0.798710 b3 10.0
8 0.206860 b3 11.0
My issue is that my data doesn't just have b1, b2, b3; it also has b4, b5, b6, and so on, all the way to about b90. The obvious solution would be to make a loop creating 90 dataframes, one for each category, and then concatenate them into one dataframe, but I imagine there must be a better way of doing it.
Edit:
what_i_have.unstack() doesn't really solve the issue, as can be seen below. It could be an intermediate step, but there's still some work to do with this result before reaching what I want, and I don't see much of an advantage over the loop solution mentioned above:
a a 0 0.587075
1 0.798710
2 0.206860
b b1 0 3.000000
1 4.000000
2 5.000000
b2 0 6.000000
1 7.000000
2 8.000000
b3 0 9.000000
1 10.000000
2 11.000000
Keeping the MultiIndex (there might be a better way out there, though):
df = df.melt(id_vars=[df.columns[0]], var_name=['b','b_category'], value_name='b_value')
a = df[[('a','a')]]
b = df[['b', 'b_category', 'b_value']].pivot(columns='b').swaplevel(0,1, axis=1)
df = pd.concat([a, b], axis=1)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
Output:
a b
a b_category b_value
0 0.737076 b1 3.0
1 0.718409 b1 4.0
2 0.269516 b1 5.0
3 0.737076 b2 6.0
4 0.718409 b2 7.0
5 0.269516 b2 8.0
6 0.737076 b3 9.0
7 0.718409 b3 10.0
8 0.269516 b3 11.0
what_i_have.columns = what_i_have.columns.droplevel()
what_i_have.melt(id_vars=['a'], value_vars=['b1', 'b2', 'b3'], var_name='b_category', value_name='b_value')
I think these commands should solve your issue. You lose the MultiIndex columns due to the use of melt; not sure how to get around that at the moment, but otherwise this should meet your requirements.
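If you do want the two-level columns back afterwards, one possible way, as a sketch (melted is just a hypothetical name for the melt result above, which has the flat columns a, b_category, b_value), is to rebuild them from tuples:
# Sketch: rebuild the two-level columns after the melt above.
melted = what_i_have.melt(id_vars=['a'], value_vars=['b1', 'b2', 'b3'],
                          var_name='b_category', value_name='b_value')
melted.columns = pd.MultiIndex.from_tuples(
    [('a', 'a'), ('b', 'b_category'), ('b', 'b_value')])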
Pull out b and melt:
out = what_i_have.pop('b')
# we need this length to extend
# what_i_have
size = out.columns.size
out = out.melt()
out.columns = ['b_category', 'b_value']
out.columns = [['b', 'b'], out.columns]
Reindex what is left of what_i_have and concatenate:
index = np.tile(what_i_have.a.index, size)
what_i_want = what_i_have.reindex(index).reset_index(drop = True)
pd.concat([what_i_want, out], axis = 1)
a b
a b_category b_value
0 0.883754 b1 3.0
1 0.172427 b1 4.0
2 0.631352 b1 5.0
3 0.883754 b2 6.0
4 0.172427 b2 7.0
5 0.631352 b2 8.0
6 0.883754 b3 9.0
7 0.172427 b3 10.0
8 0.631352 b3 11.0
I have a pandas dataframe and I need to subtract rows if they have the same group.
Input Dataframe:
A   B   Value
A1  B1  10.0
A1  B1  5.0
A1  B2  5.0
A2  B1  3.0
A2  B1  5.0
A2  B2  1.0
Expected Dataframe:
A   B   Value
A1  B1  5.0
A1  B2  5.0
A2  B1  -2.0
A2  B2  1.0
Logic: For example, the first and second rows of the dataframe are in group A1/B1, so the value must be 10.0 - 5.0 = 5.0. The 4th and 5th rows are in the same group as well, so the value must be 3.0 - 5.0 = -2.0.
Only subtract rows if they have the same A and B values.
Thank you!
Try:
subtract = lambda x: x.iloc[0] - (x.iloc[1] if len(x) == 2 else 0)
out = df.groupby(['A', 'B'])['Value'].apply(subtract).reset_index()
print(out)
# Output:
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
You can prepare the duplicated rows to be subtracted and then sum after grouping. This works for more than one duplicated row in the correct order, too.
import pandas as pd
df = pd.read_html('https://stackoverflow.com/q/70438208/14277722')[0]
df.loc[df.duplicated(subset=['A','B']), 'Value'] *=-1
df.groupby(['A','B'], as_index=False).sum()
Output
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
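For reference, here is a self-contained sketch of the same idea that builds the example frame directly (instead of reading it from the question URL via read_html):
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B1', 'B1', 'B2'],
                   'Value': [10.0, 5.0, 5.0, 3.0, 5.0, 1.0]})

# Negate every repeated (A, B) row, then sum within each group:
# 10.0 - 5.0 = 5.0 for A1/B1, 3.0 - 5.0 = -2.0 for A2/B1.
df.loc[df.duplicated(subset=['A', 'B']), 'Value'] *= -1
out = df.groupby(['A', 'B'], as_index=False).sum()
print(out)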
Let us pass the condition within groupby apply:
out = df.groupby(['A','B'])['Value'].apply(lambda x : x.iloc[0]-x.iloc[-1] if len(x)>1 else x.iloc[0]).reset_index(name = 'Value')
Out[18]:
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
One option is to use numpy's reduce function (for a single-row group, np.subtract.reduce simply returns that row's value):
df.groupby(['A', 'B'], as_index = False).Value.agg(np.subtract.reduce)
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
I have a dataframe with three columns and a function that calculates the values of columns y and z given the value of column x. I need to calculate the values only if they are missing (NaN).
import numpy as np
import pandas as pd

def calculate(x):
    return 1, 2

df = pd.DataFrame({'x': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'y': [np.nan, np.nan, np.nan, 'a1', 'b2', 'c3'],
                   'z': [np.nan, np.nan, np.nan, 'a2', 'b1', 'c4']})
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d a1 a2
4 e b2 b1
5 f c3 c4
mask = (df.isnull().any(axis=1))
df[['y', 'z']] = df[mask].apply(calculate, axis=1, result_type='expand')
However, I get the following result, although I only apply to the masked set. Unsure what I'm doing wrong.
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d NaN NaN
4 e NaN NaN
5 f NaN NaN
If the mask is inverted I get the following result:
df[['y', 'z']] = df[~mask].apply(calculate, axis=1, result_type='expand')
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d 1.0 2.0
4 e 1.0 2.0
5 f 1.0 2.0
Expected result:
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d a1 a2
4 e b2 b1
5 f c3 c4
You can fillna after calculating for the full dataframe and set_axis:
out = df.fillna(df.apply(calculate, axis=1, result_type='expand')
                  .set_axis(['y', 'z'], axis=1))
print(out)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
Try:
df.loc[mask,["y","z"]] = pd.DataFrame(df.loc[mask].apply(calculate, axis=1).to_list(), index=df[mask].index, columns = ["y","z"])
print(df)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
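A closely related variant, as a sketch: apply the function only to the masked rows, name the resulting columns, and let .loc align the assignment on the index (assuming the mask, calculate, and df from the question):
# Sketch: compute values for the masked rows only; .loc aligns on index
# and columns, so the unmasked rows keep their existing y/z values.
vals = (df.loc[mask].apply(calculate, axis=1, result_type='expand')
          .set_axis(['y', 'z'], axis=1))
df.loc[mask, ['y', 'z']] = vals
print(df)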
I have a dataframe below
A B
a0 1
b0 1
c0 2
a1 3
b1 4
b2 3
First, if df.A starts with "a", I would like to cut df:
df[df.A.str.startswith("a")]
A B
a0 1
a1 3
Therefore I would like to cut df as below.
sub1
A B
a0 1
b0 1
c0 2
sub2
A B
a1 3
b1 4
b2 3
Then I would like to extract the rows whose column B value matches the row whose column A starts with "a":
sub1
A B
a0 1
b0 1
sub2
A B
a1 3
b2 3
Then append them:
result
A B
a0 1
b0 1
a1 3
b2 3
How can I cut and append df like this?
I tried the cut method but it didn't work well.
I think you can use where with the mask to create NaN values, which are then forward filled with the B values by ffill:
Note that for ffill to work, the row whose A starts with "a" has to be first in each group.
print (df.B.where(df.A.str.startswith("a")))
0 1.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
Name: B, dtype: float64
print (df.B.where(df.A.str.startswith("a")).ffill())
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
5 3.0
Name: B, dtype: float64
df = df[df.B == df.B.where(df.A.str.startswith("a")).ffill()]
print (df)
A B
0 a0 1
1 b0 1
3 a1 3
5 b2 3
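An equivalent way to express the same block logic, as a sketch (with the same assumption that each block starts with an "a" row): label the blocks with cumsum and compare each B against the first B of its block.
import pandas as pd

df = pd.DataFrame({'A': ['a0', 'b0', 'c0', 'a1', 'b1', 'b2'],
                   'B': [1, 1, 2, 3, 4, 3]})

# Each "a" row starts a new block; label blocks by counting "a" rows so far.
starts = df.A.str.startswith('a')
blocks = starts.cumsum()

# Keep rows whose B equals the B of the "a" row that opens their block.
result = df[df.B == df.groupby(blocks).B.transform('first')]
print(result)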
Here is a simple DataFrame:
> df = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
'b': ['optional1', None, 'optional3'],
'c': ['c1', 'c2', 'c3'],
'd': [1, 2, 3]})
> df
a b c d
0 a1 optional1 c1 1
1 a2 None c2 2
2 a3 optional3 c3 3
Pivot method 1
The data can be pivoted to this:
> df.pivot_table(index=['a','b'], columns='c')
d
c c1 c3
a b
a1 optional1 1.0 NaN
a3 optional3 NaN 3.0
Downside: data in the 2nd row is lost because df['b'][1] == None.
Pivot method 2
> df.pivot_table(index=['a'], columns='c')
d
c c1 c2 c3
a
a1 1.0 NaN NaN
a2 NaN 2.0 NaN
a3 NaN NaN 3.0
Downside: column b is lost.
How can the two methods be combined so that columns b and the 2nd row are kept like so:
d
c c1 c2 c3
a b
a1 optional1 1.0 NaN NaN
a2 None NaN 2.0 NaN
a3 optional3 NaN NaN 3.0
More generally: How can information from a row be retained during pivoting if a key has a NaN value?
Use set_index and unstack to perform the pivot:
df = df.set_index(['a', 'b', 'c']).unstack('c')
This is essentially what pandas does under the hood for pivot. The stack and unstack methods are closely related to pivot, and can generally be used to perform pivot-like operations that don't quite conform with the built-in pivot functions.
The resulting output:
d
c c1 c2 c3
a b
a1 optional1 1.0 NaN NaN
a2 NaN NaN 2.0 NaN
a3 optional3 NaN NaN 3.0
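To illustrate the stack/unstack relationship mentioned above: stacking the c level of this result moves it back into the row index and recovers the original long layout. A minimal sketch (df here is the unstacked result from the code above):
# Sketch: stack moves the 'c' column level back into the row index;
# the all-NaN combinations introduced by unstack are dropped again.
roundtrip = df.stack('c')
print(roundtrip)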
You could use fillna to replace the None entry:
df['b'] = df['b'].fillna('foo')
df.pivot_table(index=['a','b'], columns=['c'])
----
d
c c1 c2 c3
a b
a1 optional1 1.0 NaN NaN
a2 foo NaN 2.0 NaN
a3 optional3 NaN NaN 3.0
Use this one:
def pivot_table(df, index, columns, values):
    df = df[index + columns + values]
    i = len(index)
    df = df.set_index(index + columns).unstack(columns).reset_index()
    df.columns = df.columns.droplevel(1)[:i].append(df.columns.droplevel(0)[i:])
    return df

pivot_table(df, index=['a', 'b'], columns=['c'], values=['d'])
You can use fillna to replace the None entry with the string "NULL".
Say...
df.fillna("NULL").pivot_table(index=['a', 'b'], columns='c')