I defined a MultiIndex DataFrame as follows:
columns = pd.MultiIndex.from_product(
    [assets, ['A', 'B', 'C']],
    names=['asset', 'var']
)
res = pd.DataFrame(0, index=data.index, columns=columns)
However, I have not had success setting or updating single values in this DataFrame. Any suggestions? Or should I switch to NumPy arrays for efficiency?
Use DataFrame.loc with a tuple to select the MultiIndex column and set the new value:
assets = ['X','Y']
columns = pd.MultiIndex.from_product(
    [assets, ['A', 'B', 'C']],
    names=['asset', 'var']
)
res = pd.DataFrame(0, index=range(3), columns=columns)
print (res)
asset  X        Y
var    A  B  C  A  B  C
0      0  0  0  0  0  0
1      0  0  0  0  0  0
2      0  0  0  0  0  0
res.loc[0, ('X','B')] = 100
print (res)
asset  X          Y
var    A    B  C  A  B  C
0      0  100  0  0  0  0
1      0    0  0  0  0  0
2      0    0  0  0  0  0
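If you are updating many single cells one at a time, DataFrame.at takes the same row label and column tuple and is a lighter-weight scalar accessor; a small sketch of the equivalent update:
res.at[0, ('X', 'B')] = 100    # same cell as the .loc call above
print (res.at[0, ('X', 'B')])  # 100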
Related
I have a dataframe that looks something like this:
   i  j
0  a  b
1  a  c
2  b  c
I would like to convert it to another dataframe that looks like this:
   a  b  c
0  1 -1  0
1  1  0 -1
2  0  1 -1
The idea is to look at each row in the first dataframe and assign the value 1 to the item in the first column, the value -1 to the item in the second column, and 0 to all other items in the new dataframe.
The second dataframe will have as many rows as the first and as many columns as the number of unique entries in the first dataframe. Thank you.
Couldn't really get a start on this.
Example:
data = {'i': {0: 'a', 1: 'a', 2: 'b'}, 'j': {0: 'b', 1: 'c', 2: 'c'}}
df = pd.DataFrame(data)
df
   i  j
0  a  b
1  a  c
2  b  c
First, make dummies:
df1 = pd.get_dummies(df)
df1
   i_a  i_b  j_b  j_c
0    1    0    1    0
1    1    0    0    1
2    0    1    0    1
Second, convert df1's columns to a MultiIndex:
df1.columns = df1.columns.map(lambda x: tuple(x.split('_')))
df1
   i     j
   a  b  b  c
0  1  0  1  0
1  1  0  0  1
2  0  1  0  1
Third, make the j values negative:
df1.loc[:, 'j'] = df1.loc[:, 'j'].mul(-1).to_numpy()
df1
   i      j
   a  b   b  c
0  1  0  -1  0
1  1  0   0 -1
2  0  1   0 -1
Finally, sum i and j:
df1.sum(level=1, axis=1)
   a  b  c
0  1 -1  0
1  1  0 -1
2  0  1 -1
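Note that sum(level=...) was deprecated in pandas 1.3 and removed in 2.0; an equivalent for current pandas, assuming the same df1, groups the transposed frame:
df1.T.groupby(level=1).sum().T  # same result on pandas >= 2.0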
We can put the target columns in a list instead of hard-coding them:
columns = ['a', 'b', 'c']

def get_order(input_row):
    # start from all zeros, then mark the 'i' item with 1 and the 'j' item with -1
    output_row = dict.fromkeys(columns, 0)
    output_row[input_row['i']] = 1
    output_row[input_row['j']] = -1
    return pd.Series(output_row)

ordering_df = original_df.apply(get_order, axis=1)
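A quick check of the function against the example data from above (assuming pandas is imported as pd):
print (pd.DataFrame(data).apply(get_order, axis=1))
   a  b  c
0  1 -1  0
1  1  0 -1
2  0  1 -1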
I am creating a pandas DataFrame and want to create a new column using assign and reindex. I pull data which may have, say, columns 'A', 'B', 'C', 'D', 'E', and I want to create a new column, say 'XX'. (Of course there are other columns in the DataFrame and it is a huge one; I just show this sample below.) The XX column is usually the OR logic, or max, of the columns A->E.
Like
INPUT:
df
A B C D E
0 0 1 0 1
0 0 0 0 0
1 0 0 0 0
OUTPUT:
df
A B C D E XX
0 0 1 0 1 1
0 0 0 0 0 1
1 0 0 0 0 1
So this is the way I am doing it:
ICOLS = ["A", "B", "C", "D", "E]
df = (df.assign(XX=df.reindex(ICOLS, axis=1).dropna().max(axis=1)).dropna(axis=1, how='all'))
The script is working fine, but only when I have all the columns from A to E. Many times in the database some column (say C or E) is missing, but I still want the same logic, and XX should give similar output.
So if the database has only columns A, B & E, then:
INPUT:
df
A B E
0 0 1
0 0 0
1 0 0
OUTPUT:
df
A B E XX
0 0 1 1
0 0 0 1
1 0 0 1
I am not sure how to achieve that in the way I am doing it, from the list of input columns ICOLS. I would appreciate help in the direction in which I am trying to fix this. Thanks.
You can create a base list of columns, then check which of those columns exist in your df:
BASE_COLUMNS = ["A", "B", "C", "D", "E"]
available_cols = [column for column in df.columns if column in BASE_COLUMNS]
Finally, apply your solution, but now passing available_cols as the columns:
df = (df.assign(XX=df.reindex(available_cols, axis=1).dropna().max(axis=1)).dropna(axis=1, how='all'))
This will handle the situation where some column is missing.
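A minimal end-to-end sketch of this approach (the frame below, missing columns C and D, is a made-up example; df[available_cols] stands in for the reindex call since the remaining columns are known to exist):
import pandas as pd

BASE_COLUMNS = ["A", "B", "C", "D", "E"]
df = pd.DataFrame({'A': [0, 0, 1], 'B': [0, 0, 0], 'E': [1, 0, 0]})
available_cols = [column for column in df.columns if column in BASE_COLUMNS]
df = df.assign(XX=df[available_cols].max(axis=1))
print (df)
   A  B  E  XX
0  0  0  1   1
1  0  0  0   0
2  1  0  0   1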
Do it in one line.
Filter the required columns: put the columns you need in a list and pass it to filter, which will select them. Then take the max of each row, and then the max of that resulting column.
Data
print(df)
   A  B  C   f  D  E
0  0  0  1   2  0  1
1  0  0  0  56  0  0
2  1  0  0  70  0  0
Solution:
df['xx'] = df.filter(items=['A', 'B', 'E', 'D']).max(1).max(0)
OR
ICOLS = ["A", "B", "C", "D", "E"]
df['xx'] = df.filter(items=ICOLS).max(1).max(0)
print(df)
   A  B  C   f  D  E  xx
0  0  0  1   2  0  1   1
1  0  0  0  56  0  0   1
2  1  0  0  70  0  0   1
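Note that .max(1).max(0) reduces everything to one overall value that is broadcast to every row; if a per-row OR/max is wanted instead, drop the final .max(0):
df['xx'] = df.filter(items=ICOLS).max(axis=1)  # one max per row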
Note: Using filter as suggested by @wwnde is probably better.
If your main problem is selecting columns based on which ones are available, you could simply look at df.columns.
>>> df = pd.DataFrame(
... [
... [0, 0, 1, 0, 1],
... [0, 0, 0, 0, 0],
... [1, 0, 0, 0, 0]
... ],
... columns=['A', 'B', 'C', 'D', 'E']
... )
>>> df
   A  B  C  D  E
0  0  0  1  0  1
1  0  0  0  0  0
2  1  0  0  0  0
>>> df.columns
Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
Then using a Python set you could find the intersection.
>>> ICOLS = ["A", "B", "C", "D", "E"]
>>> set(df.columns) & set(ICOLS)
{'D', 'B', 'C', 'E', 'A'}
Together that could be:
>>> df.assign(XX=df[list(set(df.columns) & set(ICOLS))].max(1))
   A  B  C  D  E  XX
0  0  0  1  0  1   1
1  0  0  0  0  0   0
2  1  0  0  0  0   1
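One caveat: a Python set has no defined order, so the selected columns may come out in any order. If the column order matters, a list comprehension over ICOLS keeps it stable:
>>> cols = [c for c in ICOLS if c in df.columns]  # preserves ICOLS order
>>> df.assign(XX=df[cols].max(1))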
I have the following DataFrame as a toy example:
a = [5,2,6,8]
b = [2,10,19,16]
c = [3,8,15,17]
d = [3,8,12,20]
df = pd.DataFrame([a,b,c,d], columns = ['a','b','c','d'])
df
I want to create a new DataFrame df1 that keeps only the diagonal elements and converts upper and lower triangular values to zero.
My final dataset should look like:
   a   b   c   d
0  5   0   0   0
1  0  10   0   0
2  0   0  15   0
3  0   0   0  20
You could use numpy.diag:
import numpy as np

# np.diag(df) extracts the diagonal; the outer np.diag rebuilds a diagonal matrix
df = pd.DataFrame(data=np.diag(np.diag(df)), columns=df.columns)
print(df)
Output
   a   b   c   d
0  5   0   0   0
1  0  10   0   0
2  0   0  15   0
3  0   0   0  20
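If you want to preserve the original index and dtypes rather than rebuild the frame, here is a sketch using DataFrame.where with a NumPy identity mask (assuming the original square df from the question):
import numpy as np

mask = np.eye(len(df), dtype=bool)  # True on the diagonal only
df1 = df.where(mask, 0)             # keep the diagonal, zero everything else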
import pandas as pd

def diag(df):
    # start from an all-zero frame with the same labels
    res_df = pd.DataFrame(0, index=df.index, columns=df.columns)
    # copy over only the diagonal entries
    for i in range(min(df.shape)):
        res_df.iloc[i, i] = df.iloc[i, i]
    return res_df
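Called on the example frame, the helper reproduces the expected output:
print (diag(df))
   a   b   c   d
0  5   0   0   0
1  0  10   0   0
2  0   0  15   0
3  0   0   0  20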
Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
                   'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
#    A_a  A_b  B_a  B_b  B_c
# 0    1    0    0    1    0
# 1    0    1    1    0    0
# 2    1    0    0    0    1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: The multi-index tuple is converted into a string and the same suffix mechanism applies as with the simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
                   ('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
#    ('i','A')_a  ('i','A')_b  ('ii','B')_a  ('ii','B')_b  ('ii','B')_c
# 0            1            0             0             1             0
# 1            0            1             1             0             0
# 2            1            0             0             0             1
#
# str
What I would like to get, however, is that the dummies create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
#    i     ii
#    A     B
#    a  b  a  b  c
# 0  1  0  0  1  0
# 1  0  1  1  0  0
# 2  1  0  0  0  1
#
# tuple
#
# Note that the ret would be equivalent to the following:
#    ('i','A','a')  ('i','A','b')  ('ii','B','a')  ('ii','B','b')  ('ii','B','c')
# 0              1              0               0               1               0
# 1              0              1               1               0               0
# 2              1              0               0               0               1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast

def parse_dummy(x):
    # "('i', 'A')_a" -> ('i', 'A') + ('a',) -> ('i', 'A', 'a')
    parts = x.split('_')
    return ast.literal_eval(parts[0]) + (parts[1],)

ret.columns = pd.Series(ret.columns).apply(parse_dummy)
#   (i, A, a)  (i, A, b)  (ii, B, a)  (ii, B, b)  (ii, B, c)
# 0         1          0           0           1           0
# 1         0          1           1           0           0
# 2         1          0           0           0           1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
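If a real three-level MultiIndex is what you want, the parsed tuples can be promoted directly:
ret.columns = pd.MultiIndex.from_tuples(ret.columns)
print(ret.columns.nlevels)  # 3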
I had a similar need, but in a more complex DataFrame, with a MultiIndex as the row index and numerical columns which should not be converted to dummies. So my case required scanning through the columns, expanding to dummies only the columns of dtype='object', and building new column labels by concatenating the name of the original column with the value of the dummy variable itself, because I didn't want to add a new column index level.
Here is the code
First, build a DataFrame in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10, size=(len(idx), len(cols))), columns=cols, index=idx)
c1 = map(lambda i: lab1[i], np.random.randint(len(lab1), size=len(idx)))
c2 = map(lambda i: lab2[i], np.random.randint(len(lab2), size=len(idx)))
df[('obj1', 'a3')] = list(c1)
df[('obj2', 'a2')] = list(c2)
print(df)
objects   obj1           obj2
attributes  a1  a2     a3  a1   a2  a3
time node
0    n1      6   5  truck   3  ros   3
     n2      5   6    car   9  lol   7
1    n1      0   8  truck   7  zoe   8
     n2      4   3  truck   8  bay   3
2    n1      5   8    van   0  bay   0
     n2      4   8    car   5  lol   4
And here is the code to dummify only the object columns
# loop over the object-dtype (categorical) columns only
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    # rebuild a two-level column key: (object, 'attribute_value')
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [f'{t[1]}_{c}' for c in dummy_block.columns]],
        names=df.columns.names)
    # swap the original column for its dummy block
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
objects   obj1                              obj2
attributes  a1  a2  a3_car  a3_truck  a3_van  a1  a2_bay  a2_lol  a2_ros  a2_zoe  a3
time node
0    n1      6   5       0         1       0   3       0       0       1       0   3
     n2      5   6       1         0       0   9       0       1       0       0   7
1    n1      0   8       0         1       0   7       0       0       0       1   8
     n2      4   3       0         1       0   8       1       0       0       0   3
2    n1      5   8       0         0       1   0       1       0       0       0   0
     n2      4   8       1         0       0   5       0       1       0       0   4
It can easily be changed to answer the original use case by adding one more level to the columns MultiIndex.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
                   ('ii','B'): ['b', 'a', 'c']})
print(df)
   i ii
   A  B
0  a  b
1  b  a
2  a  c
df.columns = pd.MultiIndex.from_tuples([t + ('',) for t in df.columns])
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [t[1]], [c for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
   i     ii
   A     B
   a  b  a  b  c
0  1  0  0  1  0
1  0  1  1  0  0
2  1  0  0  0  1
Note that it still works if there are numerical columns - it just adds an empty additional level for them in the columns index as well.
I am trying to create a new DataFrame with binary (0 or 1) values from an existing DataFrame. For every row in the given DataFrame, the program should take the value from each cell and set 1 in the corresponding column of the row with the same index in the new DataFrame.
I have tried executing the following code snippet.
for col in products:
    index = 0
    for item in products.loc[:, col]:
        # .ix was removed from pandas; .loc does the label-based assignment
        products_coded.loc[index, 'prod_' + str(item)] = 1
        index = index + 1
It works for a small number of rows, but it takes a lot of time on any large dataset. What would be the best way to get the desired outcome?
I think you need:
first get_dummies, casting the values to strings
then aggregate the maximum per column name with max
convert the columns to int for correct ordering
reindex to fix the ordering and append missing columns, replacing NaN with 0 via fill_value=0 and dropping the 0 column (the range starts at 1)
add_prefix to rename the columns
df = pd.DataFrame({'B':[3,1,12,12,8],
                   'C':[0,6,0,14,0],
                   'D':[0,14,0,0,0]})
print (df)
    B   C   D
0   3   0   0
1   1   6  14
2  12   0   0
3  12  14   0
4   8   0   0
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         .max(level=0, axis=1)
         .rename(columns=lambda x: int(x))
         .reindex(columns=range(1, df.values.max() + 1), fill_value=0)
         .add_prefix('prod_'))
print (df1)
   prod_1  prod_2  prod_3  prod_4  prod_5  prod_6  prod_7  prod_8  prod_9  \
0       0       0       1       0       0       0       0       0       0
1       1       0       0       0       0       1       0       0       0
2       0       0       0       0       0       0       0       0       0
3       0       0       0       0       0       0       0       0       0
4       0       0       0       0       0       0       0       1       0
   prod_10  prod_11  prod_12  prod_13  prod_14
0        0        0        0        0        0
1        0        0        0        0        1
2        0        0        1        0        0
3        0        0        1        0        1
4        0        0        0        0        0
Another similar solution:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         .max(level=0, axis=1))
df1.columns = df1.columns.astype(int)
df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
          .add_prefix('prod_'))
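Note that max(level=..., axis=1) was removed in pandas 2.0; a hedged equivalent of the same pipeline for current pandas groups over a transpose (the astype(int) also undoes the bool dtype that newer get_dummies returns):
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         .T.groupby(level=0).max().T.astype(int))
df1.columns = df1.columns.astype(int)
df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
          .add_prefix('prod_'))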