Mapping values inside pandas column - python

I used the code below to map the value 2 in the S column to 0, but it didn't work. Any suggestions on how to solve this?
N.B.: I want to use an external function inside the map.
import pandas as pd

df = pd.DataFrame({
    'Age': [30, 40, 50, 60, 70, 80],
    'Sex': ['F', 'M', 'M', 'F', 'M', 'F'],
    'S':   [1, 1, 2, 2, 1, 2]
})
def app(value):
    for n in df['S']:
        if n == 1:
            return 1
        if n == 2:
            return 0

df["S"] = df.S.map(app)

Use eq to create a boolean Series and convert that boolean Series to int with astype:
df['S'] = df['S'].eq(1).astype(int)
OR
df['S'] = (df['S'] == 1).astype(int)
Output:
   Age Sex  S
0   30   F  1
1   40   M  1
2   50   M  0
3   60   F  0
4   70   M  1
5   80   F  0

Don't use apply, simply use loc to assign the values:
df.loc[df.S.eq(2), 'S'] = 0
   Age Sex  S
0   30   F  1
1   40   M  1
2   50   M  0
3   60   F  0
4   70   M  1
5   80   F  0
If you need a more performant option, use np.select. This is also more scalable, as you can always add more conditions:
df['S'] = np.select([df.S.eq(2)], [0], 1)
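For instance, if a hypothetical extra code 3 also had to become 0, it would just be one more condition/choice pair (a sketch; the value 3 does not occur in this data):
df['S'] = np.select([df.S.eq(2), df.S.eq(3)], [0, 0], 1)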

You're close, but you need a few corrections. The for loop is the actual bug: the function iterates over the whole column and returns on the first element it sees, ignoring value entirely. Remove the loop and test value directly; either map or apply will then work, since both call the function once per element of a Series. See this answer for how to properly use apply vs applymap vs map.
def app(value):
    if value == 1:
        return 1
    elif value == 2:
        return 0

df['S'] = df.S.apply(app)
   Age Sex  S
0   30   F  1
1   40   M  1
2   50   M  0
3   60   F  0
4   70   M  1
5   80   F  0

If you only wish to change values equal to 2, you can use pd.DataFrame.loc:
df.loc[df['S'] == 2, 'S'] = 0
pd.Series.apply is not recommended, as it is just a thinly veiled, inefficient loop.

You could use .replace as follows:
df["S"] = df["S"].replace([2], 0)
This replaces all 2 values with 0 in one line.
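If you later need to remap several values at once, replace also accepts a dict (a minimal sketch):
df["S"] = df["S"].replace({2: 0})
Values not listed in the dict are left untouched, unlike with map.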

Go with a vectorized NumPy operation:
df['S'] = np.abs(df['S'] - 2)
and stand out from the competition in interviews and SO answers :) (This relies on S containing only the values 1 and 2, which map to 1 and 0 respectively; 2 - df['S'] would do the same.)

>>> df = pd.DataFrame({'Age': [30, 40, 50, 60, 70, 80],
...                    'Sex': ['F', 'M', 'M', 'F', 'M', 'F'],
...                    'S': [1, 1, 2, 2, 1, 2]})
>>> def app(value):
...     return 1 if value == 1 else 0
# or app = lambda value: 1 if value == 1 else 0
>>> df["S"] = df["S"].map(app)
>>> df
   Age  S Sex
0   30  1   F
1   40  1   M
2   50  0   M
3   60  0   F
4   70  1   M
5   80  0   F

You can do:
import numpy as np
df['S'] = np.where(df['S'] == 2, 0, df['S'])
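Since the question asks about map specifically: Series.map also accepts a dict, which handles this remapping without writing a function at all (a minimal sketch):
df['S'] = df['S'].map({1: 1, 2: 0})
Unlike replace, map turns any value missing from the dict into NaN, so list every value you want to keep.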

Related

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0
Desired output:
user  longest_run
A     2
B     1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i])  # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0  # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python, and I'm totally stumped.
Use groupby with size on the user and usage columns together with a helper Series that numbers runs of consecutive values:
print (df)
  user  day  usage
0    A    1      0
1    A    2      0
2    A    3      1
3    B    1      0
4    B    2      1
5    B    3      0
6    C    1      1
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
                longest_run
user val usage
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1
Then filter only zero rows, get max and add reindex for append non 0 groups:
df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
  user  longest_run
0    A            2
1    B            1
2    C            0
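Note that in newer pandas the level argument to max is deprecated (and removed in 2.0); a sketch of the equivalent with an explicit groupby:
df2 = (df1.query('val == 0')
          .groupby(level=0).max()
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())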
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    5
6    6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a Series:
def max0(sr):
    # label every element with the count of non-zero values seen so far;
    # a run of zeros keeps the label of the non-zero element before it
    counts = (sr != 0).cumsum().value_counts()
    # the most frequent label marks the longest run; unless that label is 0
    # (zeros at the very start), the non-zero element itself is included
    # in the count, so subtract 1
    return counts.max() - (0 if counts.idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A    2
B    1
C    0
dtype: int64
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23

Apply a custom function iteratively in a subset of rows

I am trying to write a function that enables me to do some arithmetic iteratively on a subset of rows when a condition is met in another column. My DataFrame looks like this:
Value store flag
0 16051.249 0 0
36 16140.792 0.019822 0
0 16150.500 AAA 1
37 16155.223 1.24698 0
1 16199.700 BBB 1
38 16235.732 1.90162 0
41 16252.594 2.15627 0
2 16256.300 CCC 1
42 16260.678 2.15627 0
1048 17071.513 14.7752 0
3 17071.600 DDD 1
1049 17072.347 14.7752 0
1391 17134.538 16.7026 0
4 17134.600 EEE 1
1392 17134.635 16.7026 0
1675 17227.600 19.4348 0
5 17227.800 EFG 1
1676 17228.796 19.4348 0
1722 17262.189 20.5822 0
6 17264.300 XYZ 1
1723 17266.625 20.6702 0
2630 17442.770 32.7927 0
7 17442.800 ZZZ 1
2631 17442.951 32.7927 0
3068 17517.492 37.6485 0
8 17517.500 TTT 1
3069 17518.296 37.6485 0
3295 17565.776 38.2871 0
9 17565.800 SDF 1
3296 17565.888 38.2871 0
... ... ... ...
I'd like to apply the following function to all rows where the flag value equals 1:
def f(x):
    return df.iloc[0,1] + (df.iloc[2,1] - df.iloc[0,1]) * ((df.iloc[1,0] - df.iloc[0,0]) / (df.iloc[2,0] - df.iloc[0,0]))
and finally put the return value into a dictionary with its corresponding key; for example {'AAA': 123, 'BBB': 456, ...}.
This function requires the rows above and below the row where flag == 1.
I have tried to re-structure my df in a way that lets me use a rolling window with my function, i.e.:
idx = (df['flag'] == "1").fillna(False)
idx |= idx.shift(1) | idx.shift(2)
idx |= idx.shift(-1) | idx.shift(-2)
df=df[idx]
df.rolling(window=3, min_periods=1).apply(f)[::3].reset_index(drop=True)
but this doesn't work!
Since the function is location dependent I am not sure how to apply it to all triplet of rows where flag value is 1. Any suggestion is much appreciated!
IIUC, your calculation could be handled directly at the DataFrame column level; there is no need to apply a function to specific rows.
# convert to numeric so that the column can be used for arithmetic calculations
df['store2'] = pd.to_numeric(df.store, errors='coerce')
# calculate the f(x) based on 'Value' and 'store2' column
df['result'] = df.store2.shift(1) + (df.store2.shift(-1) - df.store2.shift(1))*(df.Value - df.Value.shift(1))/(df.Value.shift(-1) - df.Value.shift(1))
# export the resultset:
df.loc[df.flag==1,['store','result']].set_index('store')['result'].to_json()
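If you want an actual Python dict ({'AAA': ..., 'BBB': ...}) rather than a JSON string, to_dict works the same way (a sketch of just the last step):
result = df.loc[df.flag == 1, ['store', 'result']].set_index('store')['result'].to_dict()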
Just keep the state and use apply:
zero_vals = []

def func(row):
    if row.flag == 0:
        zero_vals.append(row)
    elif row.flag == 1:
        # do math here using previous rows of data and the current row
        zero_vals.clear()
    else:
        raise ValueError('unexpected flag value')
then it's just:
df.apply(func, axis=1)

Increasing column value pandas

I have a dataframe of 143999 rows which contains position and time data.
I already made a column "dt" which calculates the time difference between rows.
Now I want to create a new column which gives the dt values a group number.
So it starts with group = 0 and when dt > 60 the group number should increase by 1.
I tried the following:
def group(x):
    c = 0
    if densdata["dt"] < 60:
        densdata["group"] = c
    elif densdata["dt"] >= 60:
        c += 1
        densdata["group"] = c

densdata["group"] = densdata.apply(group, axis=1)
The error that I get is: The truth value of a Series is ambiguous.
Any ideas how to fix this problem?
This is what I want:
dt group
0.01 0
2 0
0.05 0
300 1
2 1
60 2
You can take advantage of the fact that True evaluates to 1 and use .cumsum().
import numpy as np
import pandas as pd

densdata = pd.DataFrame({'dt': np.random.randint(low=50, high=70, size=20),
                         'group': np.zeros(20, dtype=np.int32)})
print(densdata.head())
   dt  group
0  52      0
1  59      0
2  69      0
3  55      0
4  63      0
densdata['group'] = (densdata.dt >= 60).cumsum()
print(densdata.head())
   dt  group
0  52      0
1  59      0
2  69      1
3  55      1
4  63      2
If you want to guarantee that the first value of group will be 0, even if the first value of dt is >= 60, exclude the first row from the comparison (a replace(densdata.dt[0], np.nan) trick would also knock out any later rows that happen to share that value):
cond = densdata.dt >= 60
cond.iloc[0] = False
densdata['group'] = cond.cumsum()

How to iterate over a pandas dataframe while referencing previous rows?

I am iterating over a pandas DataFrame and finding it to be extremely slow. I understand that in pandas you try to vectorize everything, but in this case I specifically need to iterate (or, if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal". If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signal is whatever it was previously. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
   A  B  signal
0  0  1       0
1  1  0       1
2  0  0       1
3  0  0       1
4  0  1       0
5  0  0       0
6  1  0       1
7  0  0       1
8  0  1       0
9  0  1       0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('head', 1)
df
   A  B  signal
0  0  1       0
1  1  0       1
2  0  0       1
3  0  0       1
4  0  1       0
5  0  0       0
6  1  0       1
7  0  0       1
8  0  1       0
9  0  1       0
The logic here involves dividing your series into groups based on the inequality between A and B; every group's signal is the first value of A within that group.
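If transform('head', 1) reads as opaque, transform('first') expresses the same idea more directly: take the first A of each group (a sketch; it produces the same output on this data):
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('first')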
You don't need to iterate at all; you can do some Boolean indexing:
#set condition for A
indata.loc[indata.A == 1,'signal'] = 1
#set condition for B
indata.loc[indata.B == 1,'signal'] = 0
#forward fill NaN values
indata.signal.fillna(method='ffill',inplace=True)
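One caveat: loc only assigns the matched rows, so any leading rows before the first A or B event remain NaN even after the forward fill. Assuming the state should start as "off", a final fill handles that:
indata.signal.fillna(0, inplace=True)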
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed and djk's answers were faster than my solution (about 4.5ms) but in practice I'll probably just iterate through series even though that is not optimal.

For loop in pandas ('transfer' solution from R to pandas)

I want to perform a for loop in pandas: for each row i, I want to take column x1 and perform a test (if/else statements).
In R I will do like this:
df <- data.frame(x1 = rnorm(10), x2 = rexp(10))
for (i in 1:length(df$x1)) {
  if (df[i, 'x1'] > 0) {
    print('+')
  } else {
    print('-')
  }
}
How can I do this in pandas data frame?
P.S. I need to perform a loop like this. But if you have better ideas, I will appreciate them.
EDIT:
In case of multiple comparisons:
Thank you for the answer!
And maybe you can give me some advice: how can I do the iteration if I have multiple if/else statements? For example:
if x > 0:
    if x % 2 == 0:
        pass  # do stuff 1
    else:
        pass  # do other stuff 2
elif x < 0:
    if x % 2 == 0:
        pass  # do stuff 3
    else:
        pass  # do other stuff 4
If you need a new column, use numpy.where:
np.random.seed(54)
df = pd.DataFrame({'x1':np.random.randint(10, size=10)}) - 5
df['new'] = np.where(df['x1'] > 0, '+', '-')
print (df)
   x1 new
0   0   -
1  -3   -
2   2   +
3  -4   -
4  -5   -
5   3   +
6   2   +
7  -4   -
8   4   +
9   1   +
But if you need a loop (obviously avoid it, because it is slow), it is possible to use iteritems or items():
for i, x in df['x1'].iteritems():
    if x > 0:
        print ('+')
    else:
        print ('-')
EDIT:
df['new'] = np.where(df['x1'] > 0, 'a',
                     np.where(df['x1'] & 2, 'b', 'c'))  # & is a bitwise AND: non-zero when bit 1 of x1 is set
print (df)
   x1 new
0   0   c
1  -3   c
2   2   a
3  -4   c
4  -5   b
5   3   a
6   2   a
7  -4   c
8   4   a
9   1   a
But if you have many conditions (4 or more), use apply with a custom function:
def f(x):
    y = 5  # default, covers x == 0
    if x > 0:
        if x % 2 == 0:
            y = 0  # do stuff 1
        else:
            y = 1  # do other stuff 2
    elif x < 0:
        if x % 2 == 0:
            y = 2  # do stuff 3
        else:
            y = 3  # do other stuff 4
    return y

df['new'] = df['x1'].apply(f)
print (df)
   x1  new
0   0    5
1  -3    3
2   2    0
3  -4    2
4  -5    3
5   3    1
6   2    0
7  -4    2
8   4    0
9   1    1
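As a vectorized alternative for the four-branch case, np.select keeps the condition/choice pairs explicit (a sketch that produces the same new column as the apply version above):
conditions = [(df['x1'] > 0) & (df['x1'] % 2 == 0),
              (df['x1'] > 0),
              (df['x1'] < 0) & (df['x1'] % 2 == 0),
              (df['x1'] < 0)]
df['new'] = np.select(conditions, [0, 1, 2, 3], default=5)
np.select picks the first matching condition, so the broad x1 > 0 test can safely follow the more specific even-number test.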
You can use this code to print out the correct symbol for each row:
print(df['x1'].map(lambda x: '+' if x > 0 else '-').to_string(index=False))
The above code creates a new Series object, using map to convert each value to '+' if x > 0 and to '-' otherwise. The Series is then converted to a string and printed without indices.
But if you absolutely need to loop through each row, you can use the following code, which is what you have but condensed into 2 lines:
for i in df['x1']:
print('+' if i > 0 else '-')
