Reset column index of pandas dataframe - python

Is it possible to reset the columns so they become the first row of the DataFrame? For example,
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a b
0 1 4
1 2 5
2 3 6
Desired output:
df2 = df.reset_column() ???
0 1
0 a b
1 1 4
2 2 5
3 3 6

You can also chain reset_index calls:
df.T.reset_index().T.reset_index(drop=True)
0 1
0 a b
1 1 4
2 2 5
3 3 6

Use numpy.vstack to stack the column labels on top of the values (this assumes import numpy as np):
In [57]: pd.DataFrame(np.vstack([df.columns, df]))
Out[57]:
0 1
0 a b
1 1 4
2 2 5
3 3 6
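A self-contained version of the snippet above: np.vstack stacks the column labels on top of the values as one 2-D array, and the rebuilt frame gets default integer labels on both axes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Stack the column labels on top of the values, then rebuild the frame.
out = pd.DataFrame(np.vstack([df.columns, df]))
print(out)
```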

Insert the column names as the first row, then reset the index:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.loc[-1] = df.columns
df.index = df.index + 1
df = df.sort_index()
df.columns = [0,1]
df
0 1
0 a b
1 1 4
2 2 5
3 3 6

Pandas - Attach column to a DataFrame

I have two dataframes, which for simplicity look like:
A B C D E
1 2 3 4 5
5 4 3 2 1
1 3 5 7 9
9 7 5 3 1
And the second one looks like:
F
0
1
0
1
So, both dataframes have the SAME number of rows.
I want to attach column F to the first dataframe:
A B C D E F
1 2 3 4 5 0
5 4 3 2 1 1
1 3 5 7 9 0
9 7 5 3 1 1
I have already tried various methods, such as joins, iloc, and assigning df['F'] manually, and I don't seem to find an answer. Most of the time I get F added to the dataframe but filled with NaN: the rows where the first dataframe had data get NaN in F, and I end up with double the number of rows, NaN everywhere except in F, where the data is OK.
It seems you want to add column F to the first dataframe regardless of the index of either dataframe. In that case, assign the underlying NumPy array of column F:
df1['F'] = df2['F'].to_numpy()
Out[131]:
A B C D E F
0 1 2 3 4 5 0
1 5 4 3 2 1 1
2 1 3 5 7 9 0
3 9 7 5 3 1 1
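The NaN behaviour described in the question is index alignment at work. A minimal reproduction (the indices here are invented for illustration), followed by the .to_numpy() fix:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 5, 1, 9]}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({'F': [0, 1, 0, 1]}, index=[10, 11, 12, 13])  # different index

# Plain assignment aligns on the index; nothing matches, so G is all NaN.
df1['G'] = df2['F']
print(df1['G'].isna().all())

# Stripping the index with .to_numpy() assigns purely by position.
df1['F'] = df2['F'].to_numpy()
print(df1)
```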
You just have to create a new column on the original dataframe, assigning it the column of the second dataframe (this works because both frames share the same default index here):
Generating the example:
import pandas as pd
data1 = {"A": [1, 5, 1, 9],
"B": [2, 4, 3, 7],
"C": [3, 3, 5, 5],
"D": [4, 2, 7, 3],
"E": [5, 1, 9, 1]}
data2 = {"F": [0, 1, 0, 1]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
#creating the column
df1["F"] = df2.F
df1
> A B C D E F
> 0 1 2 3 4 5 0
> 1 5 4 3 2 1 1
> 2 1 3 5 7 9 0
> 3 9 7 5 3 1 1

Use meshgrid for rows with common values in columns

my dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine the values of column a from both df's in all combinations, but only for rows where the values in columns b and c are equal.
Right now I only have a solution for all combinations in general, with this code:
x = np.array(np.meshgrid(df1.a.values,
                         df2.a.values)).T.reshape(-1, 2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to group by common rows in the selected columns b and c.
You should try DataFrame.merge with an inner merge (the default):
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
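A self-contained version of the merge. The suffixes parameter (a standard DataFrame.merge option, shown here as an optional refinement) replaces the default _x/_y with more readable names:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]), columns=['a', 'b', 'c'])

# Inner merge on b and c keeps every pairing of rows whose (b, c) values match;
# the clashing 'a' columns get the given suffixes instead of _x/_y.
out = df1.merge(df2, on=['b', 'c'], suffixes=('_df1', '_df2'))[['a_df1', 'a_df2']]
print(out)
```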

Get values and column names

I have a pandas data frame that looks something like this:
data = {'1' : [0, 2, 0, 0], '2' : [5, 0, 0, 2], '3' : [2, 0, 0, 0], '4' : [0, 7, 0, 0]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df
1 2 3 4
a 0 5 2 0
b 2 0 0 7
c 0 0 0 0
d 0 2 0 0
I know I can get the maximum value and the corresponding column name for each row by doing (respectively):
df.max(1)
df.idxmax(1)
How can I get the values and the column name for every cell that is not zero?
So in this case, I'd want 2 tables, one giving me each value != 0 for each row:
a 5
a 2
b 2
b 7
d 2
And one giving me the column names for those values:
a 2
a 3
b 1
b 4
d 2
Thanks!
You can use stack to get a Series, filter it by boolean indexing, then rename_axis and reset_index, and finally drop a column or select a subset of columns:
s = df.stack()
df1 = s[s != 0].rename_axis(['a','b']).reset_index(name='c')
print (df1)
a b c
0 a 2 5
1 a 3 2
2 b 1 2
3 b 4 7
4 d 2 2
df2 = df1.drop('b', axis=1)
print (df2)
a c
0 a 5
1 a 2
2 b 2
3 b 7
4 d 2
df3 = df1.drop('c', axis=1)
print (df3)
a b
0 a 2
1 a 3
2 b 1
3 b 4
4 d 2
df3 = df1[['a','c']]
print (df3)
a c
0 a 5
1 a 2
2 b 2
3 b 7
4 d 2
df3 = df1[['a','b']]
print (df3)
a b
0 a 2
1 a 3
2 b 1
3 b 4
4 d 2
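An alternative sketch, not from the answer above: numpy.nonzero yields the (row, column) positions of every non-zero cell directly, which gives both requested tables in one pass.

```python
import numpy as np
import pandas as pd

data = {'1': [0, 2, 0, 0], '2': [5, 0, 0, 2], '3': [2, 0, 0, 0], '4': [0, 7, 0, 0]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# np.nonzero scans row by row and returns the (row, col) indices of non-zero cells.
rows, cols = np.nonzero(df.to_numpy())

# The non-zero values, labelled by their row index ...
values = pd.Series(df.to_numpy()[rows, cols], index=df.index[rows])
# ... and the column name of each of those cells.
names = pd.Series(df.columns[cols], index=df.index[rows])

print(values)
print(names)
```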

Pandas number rows within group in increasing order

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
Use the groupby.rank function. Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
0 a 1 1.0
1 a 2 2.0
2 a 3 3.0
3 b 4 1.0
4 b 5 2.0
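As an aside (not part of either answer; the data is invented for illustration): cumcount numbers rows by position within each group, while rank(method='first') numbers them by value, so the two only agree when each group's values already appear in order.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [3, 1, 2, 5, 4]})  # deliberately unsorted within groups

# cumcount numbers rows by their position within each group ...
df['POS'] = df.groupby('C1').cumcount() + 1

# ... while rank(method='first') numbers them by the value of C2
# (rank returns floats, hence the astype).
df['RANK'] = df.groupby('C1')['C2'].rank(method='first').astype(int)
print(df)
```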

Assign series to DataFrame with unequal indices

I have the following dataframe and series with different indices, and I would like to add series 's' to dataframe df2.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [1, 1, 2, 2], 'c': [1, 2, 3,4]})
>>> df
a b c
0 1 1 1
1 2 1 2
2 2 2 3
3 3 2 4
>>> df2 = df.set_index(['a', 'b'])
>>> df2
c
a b
1 1 1
2 1 2
2 3
3 2 4
>>> s = pd.Series([10, 20, 30], pd.MultiIndex.from_tuples([[1], [2], [3]], names=['a']))
>>> s
a
1 10
2 20
3 30
dtype: int64
>>> df2['x'] = s
>>> df2
c x
a b
1 1 1 NaN
2 1 2 NaN
2 3 NaN
3 2 4 NaN
I know column 'x' is set to NaN because the row indices don't match, but is there a way to add series 's' by only taking into account the matching index level?
The expected result is
>>> df2
c x
a b
1 1 1 10
2 1 2 20
2 3 20 # because index a=2 (ignored 'b' because it didn't exist in series 's')
3 2 4 30
You can use DataFrame.join:
>>> df2.join(pd.DataFrame({"x": s}))
c x
a b
1 1 1 10
2 1 2 20
2 3 20
3 2 4 30
