Assign series to DataFrame with unequal indices - python

I have the following DataFrame and Series with different indices, and I would like to add Series 's' to DataFrame df2.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [1, 1, 2, 2], 'c': [1, 2, 3,4]})
>>> df
a b c
0 1 1 1
1 2 1 2
2 2 2 3
3 3 2 4
>>> df2 = df.set_index(['a', 'b'])
>>> df2
c
a b
1 1 1
2 1 2
2 3
3 2 4
>>> s = pd.Series([10, 20, 30], pd.MultiIndex.from_tuples([[1], [2], [3]], names=['a']))
>>> s
a
1 10
2 20
3 30
dtype: int64
>>> df2['x'] = s
>>> df2
c x
a b
1 1 1 NaN
2 1 2 NaN
2 3 NaN
3 2 4 NaN
I know column 'x' is set to NaN because the row indices don't match, but is there a way to add series 's' by only taking into account the matching index levels?
The expected result is
>>> df2
c x
a b
1 1 1 10
2 1 2 20
2 3 20 # because index a=2 (ignored 'b' because it didn't exist in series 's')
3 2 4 30

You can use DataFrame.join:
>>> df2.join(pd.DataFrame({"x": s}))
c x
a b
1 1 1 10
2 1 2 20
2 3 20
3 2 4 30
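Alternatively, if s is keyed by a plain single-level index named 'a' (an assumption; the question builds it as a one-level MultiIndex), you can map the 'a' level of df2's index through s. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [1, 1, 2, 2], 'c': [1, 2, 3, 4]})
df2 = df.set_index(['a', 'b'])

# Assumes s uses a plain Index (not a MultiIndex) on level 'a'
s = pd.Series([10, 20, 30], index=pd.Index([1, 2, 3], name='a'))

# Look up each row's 'a' level in s, ignoring the 'b' level entirely
df2['x'] = df2.index.get_level_values('a').map(s)
print(df2)
```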

Related

Pandas add new column with CumSum of two columns, restart with new value in other column

I have the following df:
A B C
1 10 2
1 15 0
2 5 2
2 5 0
I add column D through:
df["D"] = (df.B - df.C).cumsum()
A B C D
1 10 2 8
1 15 0 23
2 5 2 26
2 5 0 31
I want the cumsum to restart in row 3 where the value in column A is different from the value in row 2.
Desired output:
A B C D
1 10 2 8
1 15 0 23
2 5 2 3
2 5 0 8
Try with
df['new'] = (df.B-df.C).groupby(df.A).cumsum()
Out[343]:
0 8
1 23
2 3
3 8
dtype: int64
Use groupby and cumsum
df['D'] = df.assign(D=df['B']-df['C']).groupby('A')['D'].cumsum()
A B C D
0 1 10 2 8
1 1 15 0 23
2 2 5 2 3
3 2 5 0 8
import pandas as pd
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
df['D'] = df['B'] - df['C']
df['D'] = df.groupby('A')['D'].cumsum()  # cumsum only D, keeping A, B, C intact
print(df)
output:
   A   B  C   D
0  1  10  2   8
1  1  15  0  23
2  2   5  2   3
3  2   5  0   8
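One caveat worth noting: groupby(df.A) groups all rows with equal A values even when they are not adjacent. If the intent is to restart the cumsum on every change of A (consecutive runs), grouping by a run identifier is safer. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})

# A new run starts whenever A differs from the previous row;
# cumsum over those change markers yields a run identifier
run_id = df['A'].ne(df['A'].shift()).cumsum()
df['D'] = (df['B'] - df['C']).groupby(run_id).cumsum()
print(df)
```

On this example it gives the same result as grouping by A, but it also behaves correctly if a value of A reappears later.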

Pandas - Attach column to a DataFrame

I have two dataframes, which for simplicity look like:
A B C D E
1 2 3 4 5
5 4 3 2 1
1 3 5 7 9
9 7 5 3 1
And the second one looks like:
F
0
1
0
1
So, both dataframes have the SAME number of rows.
I want to attach column F to the first dataframe:
A B C D E F
1 2 3 4 5 0
5 4 3 2 1 1
1 3 5 7 9 0
9 7 5 3 1 1
I have already tried various methods such as joins, iloc, and assigning df['F'] manually, but I can't find an answer. Most of the time column F gets added filled with NaN: the rows where the first dataframe had data get NaN in F, and I end up with double the number of rows, NaN everywhere except in F.
It seems you want to add column F to the first dataframe regardless of the indices of both dataframes. In that case, just assign the underlying ndarray of column F:
df1['F'] = df2['F'].to_numpy()
Out[131]:
A B C D E F
0 1 2 3 4 5 0
1 5 4 3 2 1 1
2 1 3 5 7 9 0
3 9 7 5 3 1 1
You just have to create a new column on the original dataframe, assigning it the column from the second dataframe:
generating the example
import pandas as pd
data1 = {"A": [1, 5, 1, 9],
"B": [2, 4, 3, 7],
"C": [3, 3, 5, 5],
"D": [4, 2, 7, 3],
"E": [5, 1, 9, 1]}
data2 = {"F": [0, 1, 0, 1]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
#creating the column
df1["F"] = df2.F
df1
> A B C D E F
> 0 1 2 3 4 5 0
> 1 5 4 3 2 1 1
> 2 1 3 5 7 9 0
> 3 9 7 5 3 1 1
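When the two indices genuinely differ, another option (a sketch, not from the answers above) is to reset both indices so the rows pair up by position, then concatenate side by side:

```python
import pandas as pd

# Hypothetical non-matching indices to illustrate the problem
df1 = pd.DataFrame({"A": [1, 5, 1, 9], "B": [2, 4, 3, 7]}, index=[10, 11, 12, 13])
df2 = pd.DataFrame({"F": [0, 1, 0, 1]})

# Align positionally, not by index label
out = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(out)
```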

use meshgrid for rows with common values in column

my dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine the values of column a from both df's in all combinations, but only where the values in columns b and c are equal.
Right now I only have a solution for all combinations in general, with this code:
x = np.array(np.meshgrid(df1.a.values,
df2.a.values)).T.reshape(-1,2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to match rows on the selected columns b and c.
You should try DataFrame.merge with an inner join (the default):
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
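By default merge labels the overlapping 'a' columns with the '_x' and '_y' suffixes; the suffixes parameter lets you pick clearer names. A sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]), columns=['a', 'b', 'c'])

# Inner merge on b and c; label the two 'a' columns explicitly
out = df1.merge(df2, on=['b', 'c'], suffixes=('_df1', '_df2'))[['a_df1', 'a_df2']]
print(out)
```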

Reset column index of pandas dataframe

Is it possible to reset the columns so they become the first row of the DataFrame? For example,
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a b
0 1 4
1 2 5
2 3 6
Desired output:
df2 = df.reset_column() ???
0 1
0 a b
1 1 4
2 2 5
3 3 6
You can also chain reset_index after transposing:
df.T.reset_index().T.reset_index(drop=True)
0 1
0 a b
1 1 4
2 2 5
3 3 6
Use
In [57]: pd.DataFrame(np.vstack([df.columns, df]))
Out[57]:
0 1
0 a b
1 1 4
2 2 5
3 3 6
Insert the column names as the first row and reset the index:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.loc[-1] = df.columns
df.index = df.index + 1
df = df.sort_index()
df.columns = [0,1]
df
0 1
0 a b
1 1 4
2 2 5
3 3 6
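A note on all of the approaches above: once labels and data share a column, every column becomes object dtype. An equivalent one-liner using concat (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Turn the column labels into a one-row frame and stack it on top of the data
out = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
out.columns = range(df.shape[1])
print(out)
```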

Pandas number rows within group in increasing order

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
Use the groupby.rank function. Here is a working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
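The two answers agree whenever the ranking column is already in increasing order within each group. A quick check on the example data (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# rank(method='first') orders by C2; cumcount numbers by position.
# They coincide here because C2 is sorted within each C1 group.
by_rank = df.groupby('C1')['C2'].rank(method='first', ascending=True)
by_count = df.groupby('C1').cumcount() + 1
print(by_rank.tolist(), by_count.tolist())
```

If C2 were unsorted within a group, rank would follow the values while cumcount would follow row order, so pick the one that matches your intent.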
