Pandas MultiIndex DataFrame - python

I have an array:
w = np.array([1, 2, 3])
and I need to create a DataFrame with a MultiIndex that looks like this:
df =
       0  1  2
0 0    1  1  1
  1    1  1  1
  2    1  1  1
1 0    2  2  2
  1    2  2  2
  2    2  2  2
2 0    3  3  3
  1    3  3  3
  2    3  3  3
How can I assign the values of my array to the correct positions in the DataFrame?

The exact logic is unclear, but assuming w is the only input and you want to broadcast each of its values across the inner index level and the columns:
import numpy as np
import pandas as pd

w = np.array([1, 2, 3])
N = len(w)
df = pd.DataFrame(
    np.repeat(w, N**2).reshape((-1, N)),
    index=pd.MultiIndex.from_product([np.arange(N)] * 2),
)
Output:
       0  1  2
0 0    1  1  1
  1    1  1  1
  2    1  1  1
1 0    2  2  2
  1    2  2  2
  2    2  2  2
2 0    3  3  3
  1    3  3  3
  2    3  3  3
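An equivalent sketch using NumPy broadcasting instead of np.repeat (same input, same result; the .copy() is only there because np.broadcast_to returns a read-only view):

```python
import numpy as np
import pandas as pd

w = np.array([1, 2, 3])
N = len(w)

# Expand each value of w into its own N x N block, then stack the blocks.
values = np.broadcast_to(w[:, None, None], (N, N, N)).reshape(N * N, N).copy()
df = pd.DataFrame(values, index=pd.MultiIndex.from_product([np.arange(N)] * 2))
print(df)
```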

Related

Duplicate a selected row and put the duplicate just below in a Pandas DataFrame

I have a Pandas DataFrame like this:
id  A  B
 0  2  2
 1  1  1
 2  3  3
 3  7  7
And I want to insert 3 duplicates of the first row directly below it:
id  A  B
 0  2  2
 1  2  2
 2  2  2
 3  2  2
 4  1  1
 5  3  3
 6  7  7
Is there a method that already exists in the Pandas library?
There is no built-in method for doing just this. However, you can build a list of per-row repeat counts and use df.loc + df.index.repeat:
new_df = df.loc[df.index.repeat([4] + [1] * (len(df) - 1))].reset_index(drop=True)
Output:
>>> new_df
   id  A  B
0   0  2  2
1   0  2  2
2   0  2  2
3   0  2  2
4   1  1  1
5   2  3  3
6   3  7  7
Use reindex and Index.repeat to create your dataframe:
>>> df.reindex(df.index.repeat([3] + [1] * (len(df) - 1)))
   id  A  B
0   0  2  2
0   0  2  2
0   0  2  2
1   1  1  1
2   2  3  3
3   3  7  7
Another way:
>>> df.loc[[df.index[0]] * 3 + df.index[1:].tolist()]
   id  A  B
0   0  2  2
0   0  2  2
0   0  2  2
1   1  1  1
2   2  3  3
3   3  7  7
A more generalized way, proposed by @MuhammadHassan:
row_index = 0
repeat_time = 3
>>> df.reindex(df.index.tolist() + [row_index] * repeat_time).sort_index()
   id  A  B
0   0  2  2
0   0  2  2
0   0  2  2
0   0  2  2
1   1  1  1
2   2  3  3
3   3  7  7
Let us try:
n = 3
row = 0
# df.append was deprecated and later removed; pd.concat is the modern equivalent.
df = pd.concat([df, df.loc[[row] * (n - 1)]]).sort_index()
df
   id  A  B
0   0  2  2
0   0  2  2
0   0  2  2
1   1  1  1
2   2  3  3
3   3  7  7
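All of the snippets above hard-code the first row. A small helper (the name repeat_row is my own, not a pandas API) generalizes the repeat-counts idea to any row position; using iloc with positional repeats also avoids any assumption about index uniqueness:

```python
import numpy as np
import pandas as pd

def repeat_row(df, row_pos, n):
    """Return a copy of df in which the row at positional index `row_pos`
    appears n times in place, with the index renumbered."""
    counts = np.ones(len(df), dtype=int)
    counts[row_pos] = n
    return df.iloc[np.arange(len(df)).repeat(counts)].reset_index(drop=True)

df = pd.DataFrame({'id': [0, 1, 2, 3], 'A': [2, 1, 3, 7], 'B': [2, 1, 3, 7]})
print(repeat_row(df, 0, 4))
```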

Python: create new column conditionally on values from two other columns

I would like to combine two columns into a new column.
Let's suppose I have:
Index  A  B
    0  1  0
    1  1  0
    2  1  0
    3  1  0
    4  1  0
    5  1  2
    6  1  2
    7  1  2
    8  1  2
    9  1  2
   10  1  2
Now I would like to create a column C with the entries from column A for Index 0 to 4 and from column B for Index 5 to 10. It should look like this:
Index  A  B  C
    0  1  0  1
    1  1  0  1
    2  1  0  1
    3  1  0  1
    4  1  0  1
    5  1  2  2
    6  1  2  2
    7  1  2  2
    8  1  2  2
    9  1  2  2
   10  1  2  2
Is there Python code to do this? Thanks in advance!
If Index is an actual column, you can use numpy.where and specify your condition:
import numpy as np
df['C'] = np.where(df['Index'] <= 4, df['A'], df['B'])
    Index  A  B  C
0       0  1  0  1
1       1  1  0  1
2       2  1  0  1
3       3  1  0  1
4       4  1  0  1
5       5  1  2  2
6       6  1  2  2
7       7  1  2  2
8       8  1  2  2
9       9  1  2  2
10     10  1  2  2
If Index is your actual index, you can slice it with iloc and create your column with concat:
df['C'] = pd.concat([df['A'].iloc[:5], df['B'].iloc[5:]])
print(df)
    A  B  C
0   1  0  1
1   1  0  1
2   1  0  1
3   1  0  1
4   1  0  1
5   1  2  2
6   1  2  2
7   1  2  2
8   1  2  2
9   1  2  2
10  1  2  2
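A third option, assuming the default RangeIndex, is Series.where, which keeps A wherever the condition holds and falls back to B elsewhere:

```python
import pandas as pd

df = pd.DataFrame({'A': [1] * 11, 'B': [0] * 5 + [2] * 6})

# Keep A for index positions 0..4, take B from index 5 onward.
df['C'] = df['A'].where(df.index <= 4, df['B'])
print(df)
```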

group consecutive equal values and count up

The df:
a  b
0  1
0  3
0  3
0  1
1  1
1  2
1  4
I would like to group by a and count up the equal consecutive rows in a group:
a  b  c
0  1  1
0  3  2
0  3  2
0  1  3
1  1  1
1  2  2
1  4  3
I tried:
df['c'] = df.b.groupby([df.a, df.b.diff().ne(0).cumsum()])
which gave me a type error:
Length of values does not match length of index
In your case, use factorize:
s = df.b.diff().ne(0).cumsum().groupby(df.a).transform(lambda x: x.factorize()[0]) + 1
Out[276]:
0    1
1    2
2    2
3    3
4    1
5    2
6    3
Name: b, dtype: int32
df['c'] = s
Or
df.b.groupby(df.a).apply(lambda x : x.diff().ne(0).cumsum())
Out[277]:
0    1
1    2
2    2
3    3
4    1
5    2
6    3
Name: b, dtype: int32
Another approach:
s = df.ne(df.shift()).any(axis=1).astype(int)
df['c'] = s.groupby(df['a']).cumsum()
Output:
   a  b  c
0  0  1  1
1  0  3  2
2  0  3  2
3  0  1  3
4  1  1  1
5  1  2  2
6  1  4  3
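A minimal, self-contained reproduction of the factorize approach on the sample data, for checking the expected c column:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1],
                   'b': [1, 3, 3, 1, 1, 2, 4]})

# Label each run of equal consecutive b values, then number the runs
# from 1 within each group of a.
runs = df['b'].diff().ne(0).cumsum()
df['c'] = runs.groupby(df['a']).transform(lambda x: x.factorize()[0]) + 1
print(df)
```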

Pandas - updating sequence of values

I have this sample DataFrame:
pd.DataFrame(data={1: [0, 3, 4, 1], 2: [4, 1, 0, 0], 3: [0, 0, 1, 2], 4: [1, 2, 3, 4]})
   1  2  3  4
0  0  4  0  1
1  3  1  0  2
2  4  0  1  3
3  1  0  2  4
But I want to convert it to the format below:
pd.DataFrame(data={1: [1, 1, 1, 1], 2: [0, 2, 0, 2], 3: [0, 3, 3, 0], 4: [4, 0, 4, 4]})
   1  2  3  4
0  1  0  0  4
1  1  2  3  0
2  1  0  3  4
3  1  2  0  4
Is there a way or a function to do this? I have more than 100,000 rows, so row-wise loops, dictionaries, and lists will be too slow.
My entry:
data = df.reset_index().melt("index").query("value > 0")
out = data.pivot(index="index", columns="value", values="value").fillna(0).astype(int)
giving
In [273]: out
Out[273]:
value  1  2  3  4
index
0      1  0  0  4
1      1  2  3  0
2      1  0  3  4
3      1  2  0  4
Unfortunately you'd have to clear the index and column names if you want to get rid of them, using either df.index.name = df.columns.name = None or df.rename_axis(None).rename_axis(None, axis=1).
Using get_dummies:
s = pd.get_dummies(df, columns=df.columns, prefix_sep='', prefix='')
out = s.groupby(s.columns, axis=1).sum().drop('0', axis=1)
out.mask(out.ne(0)).fillna(dict(zip(out.columns, out.columns))).astype(int)
   1  2  3  4
0  1  0  0  4
1  1  2  3  0
2  1  0  3  4
3  1  2  0  4
Using zip and np.isin (testing which column labels occur in each row; note the argument order — np.isin(x, y) checks each column label x against the row values y):
pd.DataFrame([np.isin(x, y) * df.columns.values
              for x, y in zip([df.columns.values] * len(df), df.values)])
Out[900]:
   0  1  2  3
0  1  0  0  4
1  1  2  3  0
2  1  0  3  4
3  1  2  0  4
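The Python-level loop in the last answer can also be avoided with a single broadcast comparison. This is a sketch assuming, as in the question, that the column labels are the same integers that appear as cell values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={1: [0, 3, 4, 1], 2: [4, 1, 0, 0],
                        3: [0, 0, 1, 2], 4: [1, 2, 3, 4]})

cols = df.columns.to_numpy()
# present[i, j] is True when column label cols[j] occurs anywhere in row i.
present = (df.to_numpy()[:, :, None] == cols[None, None, :]).any(axis=1)
out = pd.DataFrame(present * cols, columns=df.columns)
print(out)
```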

How to count distinct values in a column of a pandas group by object?

I have a pandas data frame and group it by two columns (for example col1 and col2). For fixed values of col1 and col2 (i.e. for a group) I can have several different values in col3. I would like to count the number of distinct values in the third column.
For example, If I have this as my input:
1 1 1
1 1 1
1 1 2
1 2 3
1 2 3
1 2 3
2 1 1
2 1 2
2 1 3
2 2 3
2 2 3
2 2 3
I would like to have this table (data frame) as the output:
1 1 2
1 2 1
2 1 3
2 2 1
df.groupby(['col1','col2'])['col3'].nunique().reset_index()
In [17]: df
Out[17]:
    0  1  2
0   1  1  1
1   1  1  1
2   1  1  2
3   1  2  3
4   1  2  3
5   1  2  3
6   2  1  1
7   2  1  2
8   2  1  3
9   2  2  3
10  2  2  3
11  2  2  3
In [19]: df.groupby([0, 1])[2].apply(lambda x: len(x.unique()))
Out[19]:
0  1
1  1    2
   2    1
2  1    3
   2    1
dtype: int64
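For reference, the nunique one-liner above reproduces the requested output on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1] * 6 + [2] * 6,
                   'col2': [1, 1, 1, 2, 2, 2] * 2,
                   'col3': [1, 1, 2, 3, 3, 3, 1, 2, 3, 3, 3, 3]})

# One distinct-count row per (col1, col2) group.
out = df.groupby(['col1', 'col2'])['col3'].nunique().reset_index()
print(out)
```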
