Replace a particular column value with 1 and the rest with 0 - python

I have a DataFrame with a column containing these values along with their % occurrence.
I want to convert the value with the highest occurrence to 1 and the rest to 0.
How can I do this using pandas?

Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'availability': np.random.randint(0, 100, 10), 'some_col': np.random.randn(10)})
print(df)
"""
availability some_col
0 9 -0.332662
1 35 0.193257
2 1 2.042402
3 50 -0.298372
4 52 -0.669655
5 3 -1.031884
6 44 -0.763867
7 28 1.093086
8 67 0.723319
9 87 -1.439568
"""
df['availability'] = np.where(df['availability'] == df['availability'].max(), 1, 0)
print(df)
"""
availability some_col
0 0 -0.332662
1 0 0.193257
2 0 2.042402
3 0 -0.298372
4 0 -0.669655
5 0 -1.031884
6 0 -0.763867
7 0 1.093086
8 0 0.723319
9 1 -1.439568
"""
Edit
If you are trying to mask the rows with the values that occur most often instead, try this:
df = pd.DataFrame(
    {
        'availability': [10, 10, 20, 30, 40, 40, 50, 50, 50, 50],
        'some_col': np.random.randn(10)
    }
)
print(df)
"""
availability some_col
0 10 0.954199
1 10 0.779256
2 20 -0.438860
3 30 -2.547989
4 40 0.587108
5 40 0.398858
6 50 0.776177 # <--- Most Frequent is 50
7 50 -0.391724 # <--- Most Frequent is 50
8 50 -0.886805 # <--- Most Frequent is 50
9 50 1.989000 # <--- Most Frequent is 50
"""
df['availability'] = np.where(df['availability'].isin(df['availability'].mode()), 1, 0)
print(df)
"""
availability some_col
0 0 0.954199
1 0 0.779256
2 0 -0.438860
3 0 -2.547989
4 0 0.587108
5 0 0.398858
6 1 0.776177
7 1 -0.391724
8 1 -0.886805
9 1 1.989000
"""

Try:
df.availability.apply(lambda x: 1 if x == df.availability.value_counts().idxmax() else 0)
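Note that this lambda recomputes value_counts() for every row. A small tweak (my addition, not part of the original answer) is to compute the most frequent value once up front:
top = df.availability.value_counts().idxmax()
df.availability.eq(top).astype(int)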

You can use Series.mode() to get the most frequent value(s) and isin to check whether each value in the column is among them:
df['col'] = df['availability'].isin(df['availability'].mode()).astype(int)

You can compare to the mode with isin, then convert the boolean to integer (True -> 1, False -> 0):
df['col2'] = df['col'].isin(df['col'].mode()).astype(int)
Example (here, 2 and 4 are tied as the most frequent values), shown as a new column "col2" for clarity:
col col2
0 0 0
1 2 1
2 2 1
3 2 1
4 4 1
5 4 1
6 4 1
7 1 0

Related

Auto re-assign ids in a dataframe

I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
        'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large (6-digit) integers. I want a way to simplify them, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc...
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize enumerates the keys in the order they first occur in your data, while the groupby().ngroup() solution enumerates the keys in increasing order. You can mimic the increasing order with factorize by sorting the data first, or replicate the data order with groupby() by passing sort=False to it.
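For illustration, here is a minimal sketch of both orderings (the column names by_appearance and by_value are mine, just for this example):
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [542605, 542588, 542605, 542594],
                   'label': [1, 3, 2, 3]})
# sort=False numbers keys in order of first appearance, like factorize:
df['by_appearance'] = df.groupby('id', sort=False).ngroup() + 10
# mapping sorted unique keys numbers them in increasing order,
# like the default groupby().ngroup():
mapping = {k: i + 10 for i, k in enumerate(np.sort(df['id'].unique()))}
df['by_value'] = df['id'].map(mapping)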
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
This is a naive way of looping through the IDs, and every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10, incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many of the next and previous rows have values greater than the current row's A value but less than their own B value.
So my expected output is :
A  B  next count  previous count
0  9           2               0
3  8           0               0
2  3           0               1
9  5           0               0
1  5           0               0
0  5           2               1
4  5           1               0
7  8           0               0
3  0           0               2
2  4           0               0
Explanation :
First row: next count is 2, since the next two values 3 and 2 are greater than 0 and less than their corresponding B values (8 and 3).
Second row: next count is 0, since the next value 2 is not greater than 3.
Third row: next count is 0, since the next value 9 is greater than 2 but not less than its corresponding B value (5).
The previous counts are calculated similarly, scanning backwards.
Note : I know how to solve this problem by looping, using either a list comprehension or the pandas apply method, but I was looking for a more pandas-idiomatic approach. That said, I wouldn't mind a clear and concise apply-based answer.
My Solution
Here is my apply solution, which I think is inefficient. As others have said, there may be no vectorized solution for this question, so a more efficient apply-based solution would also be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
- You don't need reset_index(); you can access the index with .name.
- Passing only df[['A']] instead of the whole frame may help.
- prev_nrow.empty is the same as (prev_nrow.size == 0).
- Different logic to get the desired value via first_false speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy()  # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
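For completeness, a fully vectorized sketch is possible with NumPy broadcasting, at the cost of materializing an n x n boolean matrix, so it only suits moderate row counts (at 25k rows that matrix alone is around 600 MB). This is my own sketch, not from the answers above:
import numpy as np
def consecutive_counts(A, B):
    # cond[i, j] is True when A[j] > A[i] and A[j] < B[j]
    n = len(A)
    cond = (A[None, :] > A[:, None]) & (A < B)[None, :]
    at_or_before = np.arange(n)[None, :] <= np.arange(n)[:, None]
    # let the cumulative product pass through columns <= i, then count how
    # long the run of True starting at column i+1 survives
    chain = np.where(at_or_before, 1, cond)
    return (np.cumprod(chain, axis=1) * ~at_or_before).sum(axis=1)
A, B = df['A'].to_numpy(), df['B'].to_numpy()
df['next count'] = consecutive_counts(A, B)
df['previous count'] = consecutive_counts(A[::-1], B[::-1])[::-1]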

Python crossover or switch formula

I would like a formula or anything that acts like a "switch". If the column 'position' goes to 3 or above, the switch is turned on (=1). If 'position' goes above 5, the switch is turned off (=0). And if position goes below 3, the switch is also turned off (=0). I have included the column 'desired' to display what I would like this new column to automate.
df = pd.DataFrame()
df['position'] = [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]
df['desired'] = [0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0]
I would use .shift() to create a column with the shifted position, so that the current and previous values are in one row. Then I can check whether the position crosses above 3, above 5, or below 3, and choose the value to assign to the 'desired' column.
After creating the 'desired' column, I drop the shifted helper column.
import pandas as pd
df = pd.DataFrame()
df['position'] = [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]
#df['desired'] = [0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0]
df['previous'] = df['position'].shift()
# ---
value = 0
def change(row):
    global value
    #print(row)
    if (row['previous'] < 3) and (row['position'] >= 3):
        value = 1
    if (row['previous'] >= 3) and (row['position'] < 3):
        value = 0
    if (row['previous'] <= 5) and (row['position'] > 5):
        value = 0
    return value
# ---
#for ind, row in df.iterrows():
#    print(int(row['position']), change(row))
df['desired'] = df.apply(change, axis=1)
df = df.drop('previous', axis=1)  # assign back so the helper column is actually removed
print(df)
Result
position desired
0 1 0
1 2 0
2 3 1
3 4 1
4 5 1
5 6 0
6 7 0
7 8 0
8 7 0
9 6 0
10 5 0
11 4 0
12 3 0
13 2 0
14 1 0
15 2 0
16 3 1
17 4 1
18 5 1
19 4 1
20 3 1
21 2 0
22 1 0
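A loop-free alternative is to mark only the on/off transition events and forward-fill the state between them. This is my own sketch of the same logic; the column name desired2 is just for comparison:
import numpy as np
prev = df['position'].shift()
on = (prev < 3) & (df['position'] >= 3)
off = ((prev >= 3) & (df['position'] < 3)) | ((prev <= 5) & (df['position'] > 5))
# off is listed first so it wins when both fire, matching the if-cascade above
events = np.select([off, on], [0, 1], default=np.nan)
df['desired2'] = pd.Series(events, index=df.index).ffill().fillna(0).astype(int)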

Access next, previous, or current row in pandas .loc[] assignment

Under the if-then section of the pandas cookbook, we can assign values in one column based on a condition being met in a separate column, using .loc[]:
df = pd.DataFrame({'AAA' : [4,5,6,7],
                   'BBB' : [10,20,30,40],
                   'CCC' : [100,50,-30,-50]})
# AAA BBB CCC
# 0 4 10 100
# 1 5 20 50
# 2 6 30 -30
# 3 7 40 -50
df.loc[df.AAA >= 5,'BBB'] = -1
# AAA BBB CCC
# 0 4 10 100
# 1 5 -1 50
# 2 6 -1 -30
# 3 7 -1 -50
But what if I want to write a condition that involves the previous or subsequent row using .loc[]? For example, say I want to assign df.BBB=5 wherever the difference between the df.CCC of the current row and the df.CCC of the next row is greater than or equal to 50. Then I would like to create a condition that gives me the following data frame:
# AAA BBB CCC
# 0 4 5 100 <-| 100 - 50 = 50, assign df.BBB = 5
# 1 5 5 50 <-| 50 - (-30) = 80, assign df.BBB = 5
# 2 6 -1 -30 <-| -30 - (-50) = 20, don't assign df.BBB = 5
# 3 7 -1 -50 <-| -50 - 0 = -50, don't assign df.BBB = 5
How can I get this result?
Edit
The answer I'm hoping to find is something like
mask = df['CCC'].current - df['CCC'].next >= 50
df.loc[mask, 'BBB'] = 5
because I'm interested in the general problem of how I can access values above or below the current row being considered in a dataframe.(not necessarily solving this one toy example.)
diff() will work on the example I first described, but what of other cases, say, where we want to compare two elements instead of subtracting them?
What if I take the previous data frame and want to find all rows where the current entry in df.BBB matches the next one, and assign df.CCC based on that comparison?
if df.BBB.current == df.BBB.next:
    df.CCC = 1
# AAA BBB CCC
# 0 4 5 1 <-| 5 == 5, assign df.CCC = 1
# 1 5 5 50 <-| 5 != -1, do nothing
# 2 6 -1 1 <-| -1 == -1, assign df.CCC = 1
# 3 7 -1 -50 <-| -1 != 0, do nothing
Is there a way to do this with pandas using .loc[]?
Given
>>> df
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
you can compute a boolean mask first via
>>> mask = df['CCC'].diff(-1) >= 50
>>> mask
0 True
1 True
2 False
3 False
Name: CCC, dtype: bool
and then issue
>>> df.loc[mask, 'BBB'] = 5
>>>
>>> df
AAA BBB CCC
0 4 5 100
1 5 5 50
2 6 30 -30
3 7 40 -50
More generally, you can compute a shift
>>> df['CCC_next'] = df['CCC'].shift(-1) # or df['CCC'].shift(-1).fillna(0)
>>> df
AAA BBB CCC CCC_next
0 4 5 100 50.0
1 5 5 50 -30.0
2 6 30 -30 -50.0
3 7 40 -50 NaN
... and then do whatever you want, such as:
>>> df['CCC'].sub(df['CCC_next'], fill_value=0)
0 50.0
1 80.0
2 20.0
3 -50.0
dtype: float64
>>> mask = df['CCC'].sub(df['CCC_next'], fill_value=0) >= 50
>>> mask
0 True
1 True
2 False
3 False
dtype: bool
although for the specific problem in your question the diff approach is sufficient.
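For the equality example from the question, the same shift idea yields a mask directly. A sketch (assuming the frame where BBB has already been set to [5, 5, -1, -1]):
mask = df['BBB'].eq(df['BBB'].shift(-1))  # compare each row's BBB to the next row's
df.loc[mask, 'CCC'] = 1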
You can use the enumerate function to access a row and its index simultaneously, and from the index of the current row you can obtain the previous and next rows. An example script is below for reference:
import pandas as pd
df = pd.DataFrame({'AAA' : [4,5,6,7],
                   'BBB' : [10,20,30,40],
                   'CCC' : [100,50,-30,-50]}, index=['a','b','c','d'])
print('row_pre','row_pre_AAA','row','row_AA','row_next','row_next_AA')
for irow, row in enumerate(df.index):
    if irow == 0:
        row_next = df.index[irow+1]
        print('row_pre', "df.loc[row_pre,'AAA']", row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    elif irow > 0 and irow < df.index.size-1:
        row_pre = df.index[irow-1]
        row_next = df.index[irow+1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    else:
        row_pre = df.index[irow-1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], 'row_next', "df.loc[row_next,'AAA']")
Output as below:
row_pre row_pre_AAA row row_AA row_next row_next_AA
row_pre df.loc[row_pre,'AAA'] a 4 b 5
a 4 b 5 c 6
b 5 c 6 d 7
c 6 d 7 row_next df.loc[row_next,'AAA']

How to sum columns with the same name [duplicate]

I have a dataframe with about 100 columns that looks like this:
   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56            1            1          0          1             0            0
1  11            0            0          0          0             1            0
2   6            0            0          1          0             0            1
3  43            0            0          0          1             0            1
4  14            0            1          0          0             1            0

   Histo  Economics-51  Literature-re  Literatureu4
0      1             0              1             0
1      0             0              0             1
2      0             0              0             0
3      0             1              1             0
4      1             0              0             0
My goal is to keep only the global categories -- Economics, English, History, Literature -- and write the sum of each category's components into this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe (about 100 columns and 400,000 rows), so I'm looking for an optimized solution, like using loc in pandas.
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character.
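Applied to the question's frame, that would look something like this sketch (note that "Histo" and the "Literature..." columns don't share a clean '-' prefix, which the mapping-based answer further down addresses):
summed = df.set_index('Id').T.groupby(lambda col: col.split('-')[0]).sum().T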
You can use this to create the sum of all columns starting with a specific name (anchor the pattern, e.g. '^Economics', if the substring could also appear mid-name):
df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
Using DSM's brilliant idea:
from __future__ import print_function
import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns), axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of the "Histo/History" problem:
from __future__ import print_function
import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern -> desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)), axis=1).sum())

rslt = df.groupby(correct_categories(df.columns), axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
P.S. You may want to add any missing categories to the categories mapping/dictionary.
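Note that on recent pandas versions DataFrame.groupby(axis=1) is deprecated; the same column-wise grouping can be written by transposing first (a sketch equivalent to the calls above):
rslt = df.T.groupby(correct_categories(df.columns)).sum().T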
