For example, I have a data set like this:
data = {
    "A": [1, 2, 3],
    "B": [3, 5, 1],
    "C": [9, 0, 1]
}
data_df = pd.DataFrame(data)
data_df
A B C
0 1 3 9
1 2 5 0
2 3 1 1
I want to replace the max value in each column with 0. My desired output is:
A B C
0 1 3 0
1 2 0 0
2 0 1 1
Thank you in advance!
You can iterate through the columns, get each column's max value, and replace the entries holding it:
for col in data_df.columns:
    data_df[col] = data_df[col].apply(lambda x: 0 if x == data_df[col].max() else x)
This works if your max value is unique. Just be aware that idxmax() returns only the first index of the maximum value; if the value occurs more than once, the other occurrences won't be replaced:
for col in df.columns:
    df.loc[df[col].idxmax(), col] = 0
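If all occurrences of a duplicated max should be zeroed, a vectorized alternative is possible. Here is a minimal sketch using mask(), assuming the same data_df as above; it avoids the explicit loop:
import pandas as pd
data_df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 5, 1], "C": [9, 0, 1]})
# data_df.max() is a Series of per-column maxima; eq() broadcasts it against
# every row, and mask() replaces each matching cell with 0
result = data_df.mask(data_df.eq(data_df.max()), 0)
print(result)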
I'm a Python newbie and need help with a specific task. My main goal is to identify all indices with their specific values and column names that are greater than 0 within a row, and to write these values below each other into another column in the same row.
Here is what I tried:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
# create a new column that sums up the row
df['summary'] = 'NoData'
# print the header
print(df.columns.values)
A B C D summary
0 0 4 2 1 NoData
1 2 1 9 0 NoData
2 0 3 0 1 NoData
3 5 0 6 6 NoData
# get length of rows and columns
row = len(df.index)
column = len(df.columns)
# If a value at a specific index is greater than 0, take the column name
# and the value at that index and write them into the column 'summary'.
# Also write all values greater than 0 within a row below each other
for i in range(row):
    for j in range(column):
        if df.iloc[i][j] > 0:
            df.at[i, 'summary'] = df.columns(df.iloc[i][j]) + '\n'
I hope it is a bit clear what I want to achieve. Here is a picture of how the result should look in the column 'summary'.
You don't really need a for loop.
Starting with df:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
You can do:
# Define a helper function
def f(val, col_name):
    # You can modify this function in order to customize the summary string
    return "" if val == 0 else str(val) + col_name + "\n"
# Assign the summary column: apply f column-wise (x is a column, x.name its label),
# concatenate the string fragments across each row, then strip the trailing newline
df["summary"] = df.apply(lambda x: x.apply(f, args=(x.name,))).sum(axis=1).str[:-1]
Output:
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
It works for longer column names as well:
one two three four summary
0 0 4 2 1 4two\n2three\n1four
1 2 1 9 0 2one\n1two\n9three
2 0 3 0 1 3two\n1four
3 5 0 6 6 5one\n6three\n6four
Try this:
import pandas as pd
import numpy as np
table = {
    'A': [0, 2, 0, 5],
    'B': [4, 1, 3, 0],
    'C': [2, 9, 0, 6],
    'D': [1, 0, 1, 6]
}
df = pd.DataFrame(table)
print(df)
print(f'\n\n-------------BREAK-----------\n\n')
def func(line):
    templist = ''
    list_col = line.index.values.tolist()
    temp = line.values.tolist()
    for x in range(0, len(temp)):
        if temp[x] <= 0:
            continue
        # start the string on the first positive value; append on a new line afterwards
        if templist == '':
            templist = f"{temp[x]}{list_col[x]}"
        else:
            templist = f"{templist}\n{temp[x]}{list_col[x]}"
    return templist
df['summary'] = df.apply(func, axis=1)
print(df)
Output:
A B C D
0 0 4 2 1
1 2 1 9 0
2 0 3 0 1
3 5 0 6 6
-------------BREAK-----------
A B C D summary
0 0 4 2 1 4B\n2C\n1D
1 2 1 9 0 2A\n1B\n9C
2 0 3 0 1 3B\n1D
3 5 0 6 6 5A\n6C\n6D
I am trying to add another column based on the values of two columns. Here is a mini version of my dataframe.
data = {'current_pair': ['"["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]"',
                         '"["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]"',
                         '"["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]"',
                         '"["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]"',
                         '"["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]"'],
        'B': [1, 0, 1, 1, 0]}
df = pd.DataFrame(data)
df
current_pair B
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0
I want the result to be:
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
I used numpy's select command:
conditions=[(data['B']==1 & data['current_pair'].str.contains('Emo/', na=False)),
(data['B']==1 & data['current_pair'].str.contains('Neu/', na=False)),
data['B']==0]
choices = [0, 1, 2]
data['C'] = np.select(conditions, choices, default=np.nan)
Unfortunately, it gives me this dataframe, never assigning 1 in column "C":
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 0
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 0
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
Any help counts! Thanks a lot.
The problem is operator precedence: & binds more tightly than ==, so each comparison needs its own parentheses:
conditions=[(data['B']==1) & data['current_pair'].str.contains('Emo/', na=False),
(data['B']==1) & data['current_pair'].str.contains('Neu/', na=False),
data['B']==0]
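To see what goes wrong without the parentheses, here is a minimal illustration (the series names are made up): a == 1 & b is evaluated as a == (1 & b), not as (a == 1) & b:
import pandas as pd
b = pd.Series([1, 0, 1])
flag = pd.Series([True, False, True])
# & binds tighter than ==, so the first line compares b against (1 & flag)
print(b == 1 & flag)    # True, True, True -- not what was intended
print((b == 1) & flag)  # True, False, True -- the intended expression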
I think some logic went wrong here; this works:
df.assign(C=np.select([df.B == 0,
                       df.current_pair.str.contains('Emo/'),
                       df.current_pair.str.contains('Neu/')],
                      [2, 0, 1]))
Here is a slightly more generalized suggestion, easily applicable to more complex cases. You should, however, mind the execution speed:
import pandas as pd
df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'], 'col_2': [1, 2, 3, 4, 5]})
def someLogic(col_1, col_2):
    if 'A' in col_1 and col_2 == 1:
        return 111
    elif 'X' in col_1 and col_2 == 4:
        return 999
    return 888
df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2), axis=1, result_type="expand")
print(df)
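If execution speed becomes a concern, the same branching can usually be vectorized. As a sketch, here is an equivalent np.select version of someLogic(), using the same illustrative data and return values as above:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'],
                   'col_2': [1, 2, 3, 4, 5]})
# each condition mirrors one branch of someLogic(); the default covers the rest
conditions = [df['col_1'].str.contains('A') & (df['col_2'] == 1),
              df['col_1'].str.contains('X') & (df['col_2'] == 4)]
df['NewCol'] = np.select(conditions, [111, 999], default=888)
print(df)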
Say there's Dataframe df with columns A and B
A B
0 1 1
1 0 1
2 0 1
3 0 1
4 1 0
If I want to 'equalize' the cases of column A I just have to drop one of the rows [1, 2, 3]. If I want to equalize the cases of col B then I'd have to drop three of the rows [0, 1, 2, 3].
However, if I want to equalize the cases of both columns so that the overall imbalance is minimized, how could I do that with pandas? Bear in mind that efficiency is very important.
Use:
def remove(df, col):
    # get counts of the column's values
    s = df[col].value_counts()
    # number of rows to remove from the majority value
    d = s.sub(s.min())
    # drop a random sample of rows holding the majority value
    return df.drop(df[df[col].eq(d.idxmax())].sample(d.max()).index)
df = remove(df, 'A')
print (df)
A B
0 1 1
1 0 1
3 0 1
4 1 0
df = remove(df, 'B')
print (df)
A B
3 0 1
4 1 0
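As a quick sanity check on the final df above, both columns now have equal counts:
print(df['A'].value_counts())  # 0 and 1 each appear once
print(df['B'].value_counts())  # 0 and 1 each appear once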
I have the following data
attr1_A attr1_B attr1_C attr1_D attr2_A attr2_B attr2_C
1 0 0 1 1 0 0
0 1 1 0 0 0 1
0 0 0 0 0 1 0
1 1 1 0 1 1 0
I want to retain attr1_A and attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:
attr1_A attr1_B attr1_others attr2_A attr2_others
1 0 1 1 0
0 1 1 0 1
0 0 0 0 1
1 1 1 1 1
In other words, for any group of attrs, I want to retain a few known columns and combine the remaining ones (I don't know how many remaining attrs each group has). I am thinking of doing each group separately, processing all attr1_* and then all attr2_*, because there are a limited number of groups in my dataset but many attrs under each group.
What I can think of right now is to retrieve the "others" columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine them, I am thinking of using the any function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
if "attr1_" in x
and "A" not in x
and "B" not in x]].any(axis = 'column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>
DataFrames have the great ability to manipulate data in place, without having to write complex Python logic. To create your attr1_others and attr2_others columns, you can combine the source columns with the element-wise OR operator:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If, instead, you wanted an AND condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
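Equivalently, the four del statements can be collapsed into a single drop call, a minor stylistic alternative:
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])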
Create a list of kept columns. Drop those columns and assign the left-over columns to a new dataframe df1. Group df1 by the split column-name prefixes along axis=1, call any on each group, add_suffix '_others', and convert to int with astype; assign the result to df2. Finally, join and sort_index:
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(keep_cols, axis=1)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
attr1_others attr2_others
0 1 0
1 1 1
2 0 1
3 1 1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
attr1_A attr1_B attr1_others attr2_A attr2_others
0 1 0 1 1 0
1 0 1 1 0 1
2 0 0 0 0 1
3 1 1 1 1 1
You can use a custom list to select the columns, then call .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd
df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})
cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
print(df)
Prints:
attr1_A attr1_B attr2_A attr1_others attr2_others
0 1 0 1 1 0
1 0 1 0 1 1
2 0 0 0 0 1
3 1 1 1 1 1
I have a long list of data in which the meaningful values are sandwiched between runs of 0 values. Here is what it looks like:
0
0
1
0
0
2
3
1
0
0
0
0
1
0
The lengths of both the 0 runs and the meaningful sequences are variable. I want to extract each meaningful sequence into its own row of a dataframe. For example, the above data would be extracted to this:
1
2 3 1
1
I used this code to 'slice' the meaningful data:
import pandas as pd
import numpy as np
raw = pd.read_csv('data.csv')
df = pd.DataFrame(index=np.arange(0, 10000),columns = ['DT01', 'DT02', 'DT03', 'DT04', 'DT05', 'DT06', 'DT07', 'DT08', 'DT02', 'DT09', 'DT10', 'DT11', 'DT12', 'DT13', 'DT14', 'DT15', 'DT16', 'DT17', 'DT18', 'DT19', 'DT20',])
a = 0
b = 0
n=0
for n in range(0, 999999):
    if raw.iloc[n].values > 0:
        df.iloc[a, b] = raw.iloc[n].values
        a = a + 1
        if raw[n+1] == 0:
            b = b + 1
            a = 0
but I keep getting KeyError: n, where n is the row right after the first row with a value different from 0.
Where is the problem with my code? And is there any way to improve it, in terms of speed and memory cost?
Thank you very much
You can use:
df['Group'] = df['col'].eq(0).cumsum()
df = df.loc[df['col'] != 0]
s = df.groupby('Group')['col'].apply(list)
print (s)
Group
2 [1]
4 [2, 3, 1]
8 [1]
Name: col, dtype: object
And if you need a DataFrame instead:
df = pd.DataFrame(s.values.tolist())
print (df)
0 1 2
0 1 NaN NaN
1 2 3.0 1.0
2 1 NaN NaN
Let's try this; it outputs a dataframe:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(lambda x: x.reset_index(drop=True)).unstack(1)
Output:
0 1 2
0 1.0 NaN NaN
1 2.0 3.0 1.0
2 1.0 NaN NaN
Or a string:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(lambda x: ' '.join(x.astype(str)))
Output:
0 1
1 2 3 1
2 1
dtype: object
Or as a list:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(list)
Output:
0 [1]
1 [2, 3, 1]
2 [1]
dtype: object
Try this; I'll break down the steps:
df.LIST = df.LIST.replace({0: np.nan})
df['Group'] = df.LIST.isnull().cumsum()
df = df.dropna()
df.groupby('Group').LIST.apply(list)
Out[384]:
Group
2 [1]
4 [2, 3, 1]
8 [1]
Name: LIST, dtype: object
Data Input
df = pd.DataFrame({'LIST' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})
Let's start with packing your original data into a pandas dataframe (in real life, you will probably use pd.read_csv() to generate this dataframe):
raw = pd.DataFrame({'0' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})
The default index will help you locate zero spans:
s1 = raw.reset_index()
s1['index'] = np.where(s1['0'] != 0, np.nan, s1['index'])
s1['index'] = s1['index'].fillna(method='ffill').fillna(0).astype(int)
s1[s1['0'] != 0].groupby('index')['0'].apply(list).tolist()
#[[1], [2, 3, 1], [1]]