I have a data frame that looks like this:
df = pd.DataFrame({"value": [4, 5, 3], "item1": [0, 1, 0], "item2": [1, 0, 0], "item3": [0, 0, 1]})
df
value item1 item2 item3
0 4 0 1 0
1 5 1 0 0
2 3 0 0 1
Basically, what I want to do is replace each non-zero element of the one-hot encoded columns with the value from the "value" column and then delete the "value" column. The resulting data frame should look like this:
df_out = pd.DataFrame({"item1": [0, 5, 0], "item2": [4, 0, 0], "item3": [0, 0, 3]})
item1 item2 item3
0 0 4 0
1 5 0 0
2 0 0 3
Why not just multiply?
df.pop('value').values[:, None] * df
item1 item2 item3
0 0 4 0
1 5 0 0
2 0 0 3
DataFrame.pop has the nice effect of removing a column in place and returning it, so you can do this in a single step. The [:, None] reshapes the popped values into a column vector so they broadcast down the rows rather than across the columns.
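For comparison, a minimal two-step sketch of what the one-liner above is doing (assuming the same df):
value = df['value']               # grab the column
df = df.drop(columns='value')     # remove it from the frame
res = value.values[:, None] * df  # same row-wise broadcast as above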
if the "item_*" columns have anything besides 1 in them, then you can multiply with bools:
df.pop('value').values * df.astype(bool)
item1 item2 item3
0 0 5 0
1 4 0 0
2 0 0 3
If your DataFrame has other columns, then do this:
df
value name item1 item2 item3
0 4 John 0 1 0
1 5 Mike 1 0 0
2 3 Stan 0 0 1
# cols = df.columns[df.columns.str.startswith('item')]
cols = df.filter(like='item').columns
df[cols] = df.pop('value').values[:, None] * df[cols]
df
name item1 item2 item3
0 John 0 4 0
1 Mike 5 0 0
2 Stan 0 0 3
You could do something like this (note the .T, since building a DataFrame from a list of Series stacks them as rows):
df = pd.DataFrame([df['value']*df['item1'], df['value']*df['item2'], df['value']*df['item3']]).T
df.columns = ['item1', 'item2', 'item3']
EDIT:
As @coldspeed comments, this will not scale well to many columns, so it is better to iterate over them in a loop:
cols = ['item1', 'item2', 'item3']
for c in cols:
    df[c] *= df['value']
df.drop('value', axis=1, inplace=True)
You need:
col = ['item1', 'item2', 'item3']
for c in col:
    df[c] = df[c] * df['value']
df.drop(['value'], axis=1, inplace=True)
pd.DataFrame.mul
You can use mul (or, equivalently, multiply), with either label-based or integer positional indexing:
# label-based indexing
res = df.filter(regex='^item').mul(df['value'], axis='index')
# integer positional indexing
res = df.iloc[:, 1:].mul(df.iloc[:, 0], axis='index')
print(res)
# item1 item2 item3
# 0 0 4 0
# 1 5 0 0
# 2 0 0 3
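If the frame also carries other columns (like the name column in the first answer), a sketch of the same idea that writes the result back in place:
items = df.filter(regex='^item').columns
df[items] = df[items].mul(df.pop('value'), axis='index')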
Is there a way to convert the MultiIndex levels into normal value columns? I have a multi-indexed table like this:
   level_0  level_1  Value
0        0        0      A
1        0        1      B
2        1        0      C
3        1        1      D
I want to convert level_0 and level_1 to normal columns:
ID  col0  col1  Value
0      0     0      A
1      0     1      B
2      1     0      C
3      1     1      D
Any suggestion?
Thank you!
You can use reset_index followed by rename.
# Setup
my_index = pd.MultiIndex.from_arrays([(0, 1, 2, 3),
                                      (0, 0, 1, 1),
                                      (0, 1, 0, 1)],
                                     names=[None, 'level_0', 'level_1'])
df = pd.DataFrame({'Value': ['A', 'B', 'C', 'D']}, index=my_index)
>>> # level=['level_0', 'level_1'] works, too
>>> df = df.reset_index(level=[1, 2])
>>> df
level_0 level_1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
To rename the columns, you can do
>>> df.rename(columns={'level_0': 'col0', 'level_1': 'col1'})
col0 col1 Value
0 0 0 A
1 0 1 B
2 1 0 C
3 1 1 D
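The two steps can also be chained in one go; a sketch, starting from the original multi-indexed frame built in the setup above:
out = (df.reset_index(level=['level_0', 'level_1'])
         .rename(columns={'level_0': 'col0', 'level_1': 'col1'}))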
I'm trying to replace the value in each cell with 1 if it is equal to the highest value in its row, and 0 otherwise.
This is the data I have.
This is where I want to get to.
This is what i tried so far:
df_ref['max'] = df_ref.max(axis=1)
df_ref['col1'] = df_ref.col1.apply(lambda x:1 if (x==df_ref['max']) else 0)
Thanks in advance
You're almost there. You don't need the max column; just compute it inside your lambda function and use .any(), and run the whole thing in a loop over the columns:
import pandas as pd

# data
d = {'col1': [0, 1, 0.170531, 0.170533, 0.170531],
     'col2': [0, 0, 0.005285, 0.005285, 0.005285],
     'col3': [0, 0, 0.047557, 0.047557, 0.047557],
     'col4': [1, 0, 0.482381, 0.003104, 0.482381],
     'col5': [0, 0, 0.003104, 0.482458, 0.003104],
     'col6': [0, 0, 0.001109, 0.001108, 0.001109]}

# create dataframe
df = pd.DataFrame(data=d)

# list of columns
columns = df.columns.tolist()

# loop over columns
for col in columns:
    # change to 1 if value equals the max in that row
    df[col] = df[col].apply(lambda x: 1 if (x == df.max(axis=1)).any() else 0)

print(df)
col1 col2 col3 col4 col5 col6
0 0 0 0 1 0 0
1 1 0 0 0 0 0
2 0 0 0 1 0 0
3 0 0 0 0 1 0
4 0 0 0 1 0 0
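For what it's worth, a vectorized sketch on the original df (before the loop above) that marks each row's maximum directly and avoids the per-element apply:
res = df.eq(df.max(axis=1), axis=0).astype(int)
print(res)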
I have the following data
attr1_A attr1_B attr1_C attr1_D attr2_A attr2_B attr2_C
1 0 0 1 1 0 0
0 1 1 0 0 0 1
0 0 0 0 0 1 0
1 1 1 0 1 1 0
I want to retain attr1_A, attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:
attr1_A attr1_B attr1_others attr2_A attr2_others
1 0 1 1 0
0 1 1 0 1
0 0 0 0 1
1 1 1 1 1
In other words, for any group of attr, I want to retain a few known columns and combine the remaining ones (I don't know in advance how many remaining attributes each group has).
I am thinking of doing each group separately: processing all attr1_*, and then attr2_* because there are a limited number of groups in my dataset, but many attr under each group.
What I can think of right now is to retrieve the "others" columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine them, I am thinking of using the any function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
                            if "attr1_" in x
                            and "A" not in x
                            and "B" not in x]].any(axis = 'column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>
DataFrames let you manipulate data column-wise without writing complex Python loops.
To create your attr1_others and attr2_others columns, you can combine the relevant columns with an OR condition like this:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If instead, you wanted an and condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
Create a list of kept columns. Drop those from df and assign the left-over columns to a new dataframe df1. Group df1 along axis=1 by the split column-name prefixes; call any; add_suffix '_others' and assign the result to df2. Finally, join and sort_index:
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(keep_cols, axis=1)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
attr1_others attr2_others
0 1 0
1 1 1
2 0 1
3 1 1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
attr1_A attr1_B attr1_others attr2_A attr2_others
0 1 0 1 1 0
1 0 1 1 0 1
2 0 0 0 0 1
3 1 1 1 1 1
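On recent pandas versions, where grouping with axis=1 is deprecated, a sketch of the same idea via a transpose:
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0]).any()
          .T.add_suffix('_others').astype(int))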
You can use a custom list to select columns, and then .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd
df = pd.DataFrame({
'attr1_A': [1, 0, 0, 1],
'attr1_B': [0, 1, 0, 1],
'attr1_C': [0, 1, 0, 1],
'attr1_D': [1, 0, 0, 0],
'attr2_A': [1, 0, 0, 1],
'attr2_B': [0, 0, 1, 1],
'attr2_C': [0, 1, 0, 0]})
cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
print(df)
Prints:
attr1_A attr1_B attr2_A attr1_others attr2_others
0 1 0 1 1 0
1 0 1 0 1 1
2 0 0 0 0 1
3 1 1 1 1 1
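Since the question mentions there may be many groups, here is a minimal sketch that applies the same pattern per group to the original df, where the keep dict below is a placeholder for your real kept columns:
keep = {'attr1': ['attr1_A', 'attr1_B'], 'attr2': ['attr2_A']}
for prefix, kept in keep.items():
    others = [c for c in df.columns if c.startswith(prefix + '_') and c not in kept]
    df[prefix + '_others'] = df[others].any(axis=1).astype(int)
    df.drop(others, axis=1, inplace=True)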
I have data of the following format:
Col1 Col2 Col3
1, 1424549456, "3 4"
2, 1424549457, "2 3 4 5"
& have successfully read it into pandas.
How can I turn Col3 to a numpy matrix of the following form:
# each value needs to become a 1 at that column index
# i.e. in the above example, 3 is the 4th position (0-indexed),
# so it contributes [0 0 0 1 ...]
mtx = [0 0 0 1 1 0 # corresponds to first row
0 0 1 1 1 1]; # corresponds to second row
Thanks for any help you can provide!
Since 0.13.1 there's str.get_dummies:
In [11]: s = pd.Series(["3 4", "2 3 4 5"])
In [12]: s.str.get_dummies(sep=" ")
Out[12]:
2 3 4 5
0 0 1 1 0
1 1 1 1 1
You have to ensure the columns are integers (rather than strings) and reindex:
In [13]: df = s.str.get_dummies(sep=" ")
In [14]: df.columns = df.columns.map(int)
In [15]: df.reindex(columns=np.arange(6), fill_value=0)
Out[15]:
0 1 2 3 4 5
0 0 0 0 1 1 0
1 0 0 1 1 1 1
To get the numpy values use .values:
In [16]: df.reindex(columns=np.arange(6), fill_value=0).values
Out[16]:
array([[0, 0, 0, 1, 1, 0],
[0, 0, 1, 1, 1, 1]])
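Putting the pieces together on the question's frame, as one possible chain (assuming Col3 holds the space-separated strings):
import numpy as np
mtx = (df['Col3'].str.get_dummies(sep=' ')
                 .rename(columns=int)
                 .reindex(columns=np.arange(6), fill_value=0)
                 .values)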
If there's not a lot of data you can do something like:
res = []

def f(v):
    r = np.zeros(6, dtype=int)          # np.int is deprecated; use plain int
    r[[int(i) for i in v.split()]] = 1  # fancy indexing needs a list, not a map iterator
    res.append(r)

df.Col3.apply(f)
mat = np.array(res)
# if you really want it to be a matrix, you can do
mat = np.matrix(res)
check out this link for more info
I have a column, 'col2', that holds comma-separated lists of strings. The current code I have is too slow; there are about 2000 unique strings (the letters in the example below) and 4000 rows, ending up as 2000 columns and 4000 rows.
In [268]: df.head()
Out[268]:
col1 col2
0 6 A,B
1 15 C,G,A
2 25 B
Is there a fast way to get this into a get-dummies format, where each string has its own column and that column holds a 1 if the row contains that string in col2 and a 0 otherwise?
def get_list(df):
    d = []
    for row in df.col2:
        row_list = row.split(',')
        for string in row_list:
            if string not in d:
                d.append(string)
    return d

df_list = get_list(df)

def make_cols(df, lst):
    for string in lst:
        df[string] = 0
    return df

df = make_cols(df, df_list)

for idx in range(0, len(df['col2'])):
    row_list = df['col2'].iloc[idx].split(',')
    for string in row_list:
        df[string].iloc[idx] += 1
Out[113]:
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
This is my current code for it, but it's too slow.
Thanks for any help!
You can use:
>>> df['col2'].str.get_dummies(sep=',')
A B C G
0 1 1 0 0
1 1 0 1 1
2 0 1 0 0
To join the DataFrames:
>>> pd.concat([df, df['col2'].str.get_dummies(sep=',')], axis=1)
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
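If you no longer need col2 afterwards, one possible variant drops it in the same step:
out = pd.concat([df.drop(columns='col2'), df['col2'].str.get_dummies(sep=',')], axis=1)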