Generate Column Value in Pandas based on previous rows - python

Let us assume I am taking a temperature measurement at a regular interval and recording the values in a Pandas DataFrame:
day temperature [F]
0 89
1 91
2 93
3 88
4 90
Now I want to create another column that is set to 1 if and only if two consecutive values (the current row and the one before it) are above a certain level. In my scenario the level is 90, thus yielding
day temperature Above limit?
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
Despite some SO and Google digging, it's not clear if I can use iloc[x], loc[x] or something else in a for loop?

You are looking for the shift function in pandas.
import io
import pandas as pd
data = """
day temperature Expected
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
"""
data = io.StringIO(data)
df = pd.read_csv(data, sep=r'\s+')
df['Result'] = ((df['temperature'].shift(1) > 90) & (df['temperature'] > 90)).astype(int)
# Validation
(df['Result'] == df['Expected']).all()

Try this:
df = pd.DataFrame({'temperature': [89, 91, 93, 88, 90, 91, 91, 93]})
limit = 90
df['Above'] = ((df['temperature']>limit) & (df['temperature'].shift(1)>limit)).astype(int)
df
In the future, please include code for testing (in this case, the df construction line).

df['limit']=""
df.iloc[0,2]=0
for i in range(1,len(df)):
    if df.iloc[i,1]>90 and df.iloc[i-1,1]>90:
        df.iloc[i,2]=1
    else:
        df.iloc[i,2]=0
Here iloc[i,2] refers to the row at position i and the column at position 2 (the limit column). Hope this helps.

Solution using shift():
>> threshold = 90
>> df['Above limit?'] = 0
>> df.loc[((df['temperature [F]'] > threshold) & (df['temperature [F]'].shift(1) > threshold)), 'Above limit?'] = 1
>> df
day temperature [F] Above limit?
0 0 89 0
1 1 91 0
2 2 93 1
3 3 88 0
4 4 90 0

Try using rolling(window = 2) and then apply() as follows:
df["limit"]=df['temperature'].rolling(2).apply(lambda x: int(x[0]>90)&int(x[-1]>90), raw=True)
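The same idea extends to more than two readings without a lambda. A minimal sketch, assuming the temperatures from the question's expected table and a hypothetical parameter n for the number of consecutive readings: the minimum of a window is above the limit exactly when every value in the window is.

```python
import pandas as pd

df = pd.DataFrame({"temperature": [89, 91, 93, 88, 91, 91, 93]})
limit = 90  # threshold from the question
n = 2       # number of consecutive readings required (assumed parameter)

# The smallest of the last n readings exceeds the limit
# exactly when all n readings exceed it.
rolling_min = df["temperature"].rolling(n).min()

# The first n-1 rows have no full window, so rolling_min is NaN there;
# NaN > limit evaluates to False, which becomes 0.
df["above_limit"] = (rolling_min > limit).astype(int)
print(df)
```

With n = 2 this should reproduce the shift-based answers above; a larger n only changes the window size.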

Related

Counting the number of entries in a dataframe that satisfies multiple criteria

I have a dataframe with 9 columns, two of which are gender and smoker status. Every row in the dataframe is a person, and each column is their entry on a particular trait.
I want to count the number of entries that satisfy the condition of being both a smoker and male.
I have tried using a sum function:
maleSmoke = sum(1 for i in data['gender'] if i is 'm' and i in data['smoker'] if i is 1 )
but this always returns 0. This method works when I only check one criterion, however, and I can't figure out how to expand it to a second.
I also tried writing a function that counts its way through every entry in the dataframe, but this also returns 0 for all entries.
def countSmokeGender(df):
    maleSmoke = 0
    femaleSmoke = 0
    maleNoSmoke = 0
    femaleNoSmoke = 0
    for i in range(20000):
        if df['gender'][i] is 'm' and df['smoker'][i] is 1:
            maleSmoke = maleSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 1:
            femaleSmoke = femaleSmoke + 1
        if df['gender'][i] is 'm' and df['smoker'][i] is 0:
            maleNoSmoke = maleNoSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 0:
            femaleNoSmoke = femaleNoSmoke + 1
    return maleSmoke, femaleSmoke, maleNoSmoke, femaleNoSmoke
I've tried pulling out the data sets as numpy arrays and counting those but that wasn't working either.
Are you using pandas?
Assuming you are, you can simply do this:
# How many male smokers
len(df[(df['gender']=='m') & (df['smoker']==1)])
# How many female smokers
len(df[(df['gender']=='f') & (df['smoker']==1)])
# How many male non-smokers
len(df[(df['gender']=='m') & (df['smoker']==0)])
# How many female non-smokers
len(df[(df['gender']=='f') & (df['smoker']==0)])
Or, you can use groupby (since smoker is 0/1, the sum gives the number of smokers per gender):
df.groupby(['gender'])['smoker'].sum()
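For what it's worth, a likely reason the attempts in the question return 0 is that `is` tests object identity rather than equality, so values read from a DataFrame rarely satisfy `is 1`; the answers here use `==` instead. A small sketch on made-up sample data (the six rows below are an assumption, only the column names come from the question) that gets the male-smoker count and all four combinations at once:

```python
import pandas as pd

# Tiny made-up sample; only the column names follow the question.
df = pd.DataFrame({
    "gender": ["m", "f", "m", "m", "f", "f"],
    "smoker": [1, 1, 0, 1, 0, 0],
})

# One boolean mask per condition, combined element-wise with &.
# Summing the mask counts the True values.
male_smokers = ((df["gender"] == "m") & (df["smoker"] == 1)).sum()

# size() counts the rows in every (gender, smoker) combination.
counts = df.groupby(["gender", "smoker"]).size()

print(male_smokers)
print(counts)
```

Grouping on both columns also reports the non-smokers, which the sum over smoker alone does not.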
Another alternative, which is great for data exploration: .pivot_table
With a DataFrame like this
id gender smoker other_trait
0 0 m 0 0
1 1 f 1 1
2 2 m 1 0
3 3 m 1 1
4 4 f 1 0
.. .. ... ... ...
95 95 f 0 0
96 96 f 1 1
97 97 f 0 1
98 98 m 0 0
99 99 f 1 0
you could do
result = df.pivot_table(
    index="smoker", columns="gender", values="id", aggfunc="count"
)
to get a result like
gender f m
smoker
0 32 16
1 27 25
If you want to display the partial counts you can add the margins=True option and get
gender f m All
smoker
0 32 16 48
1 27 25 52
All 59 41 100
If you don't have a column to count over (you can't use smoker and gender because they are used for the labels) you could add a dummy column:
result = df.assign(dummy=1).pivot_table(
    index="smoker", columns="gender", values="dummy", aggfunc="count",
    margins=True
)

Count items with condition in DataFrame Python

I have a DataFrame like this:
index column1 column2 column3
1 30 55 62
2 69 20 40
3 23 62 23
...
May I know how to count the number of values which are > 50 for all elements in the above table?
I'm trying below method:
count = 0
for column in df.items():
count += df[df[column] > 50][column].count()
Is this a proper way to do it? Or any other more effective suggestion?
You can just check all the values at once and then sum() them since True evaluates to 1 and False to 0:
df.gt(50).sum().sum()
(df > 54).values.sum() will do what you're looking for (swap in your own threshold). Here is the full code to get the results:
>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 2)), columns=list('AB'))
>>> df
A B
0 68 92
1 47 53
2 5 35
3 75 82
4 51 89
>>> (df > 54).values.sum()
5
>>>
Basically, I'm creating a mask of True/False values over the entire data frame based on the condition (in this case > 54) and then summing it up, because True/False count as 1/0 when added.
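To make the mask idea concrete, here is a short sketch on the three rows shown in the question (the rows hidden behind the "..." are not included, so the numbers apply to this sample only):

```python
import pandas as pd

# The three visible rows from the question's table.
df = pd.DataFrame({
    "column1": [30, 69, 23],
    "column2": [55, 20, 62],
    "column3": [62, 40, 23],
})

mask = df > 50                       # DataFrame of True/False, same shape as df
per_column = mask.sum()              # number of True values in each column
total = int(mask.to_numpy().sum())   # number of True values overall

print(per_column)
print(total)
```

No explicit loop over the columns is needed; the comparison is applied to every cell in one step.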

Binary Vectorization Encoding for categorical variable grouped by date issue

I'm having trouble vectorizing this into a kind of binary encoding that is aggregated when there is more than one row per group (the values of the categorical variable are not mutually exclusive), while keeping different dates separate. (Python and pandas)
Let's say this is the data
id1 id2 type month.measure
105 50 growing 04-2020
105 50 advancing 04-2020
44 29 advancing 04-2020
105 50 retreating 05-2020
105 50 shrinking 05-2020
It would have to end like this
id1 id2 growing shrinking advancing retreating month.measure
105 50 1 0 1 0 04-2020
44 29 0 0 1 0 04-2020
105 50 0 1 0 1 05-2020
I've been trying with transformations of all kinds, lambda functions, pandas get_dummies and trying to aggregate them grouped by the 2 ids and the date but I couldn't find a way.
Hope we can sort it out! Thanks in advance! :)
This solution uses pandas get_dummies to one-hot encode the "TYPE" column, then concatenates the one-hot encoded dataframe back with the original, followed by a groupby applied to the ID columns and "MONTH":
import pandas as pd
# Set up the dataframe
ID1 = [105,105,44,105,105]
ID2 = [50,50,29,50,50]
TYPE = ['growing','advancing','advancing','retreating','shrinking']
MONTH = ['04-2020','04-2020','04-2020','05-2020','05-2020']
df = pd.DataFrame({'ID1':ID1,'ID2':ID2, 'TYPE':TYPE, 'MONTH.MEASURE':MONTH})
# Apply get_dummies and groupby operations
df = pd.concat([df.drop('TYPE',axis=1),pd.get_dummies(df['TYPE'])],axis=1)\
.groupby(['ID1','ID2','MONTH.MEASURE']).sum().reset_index()
# These bits are just cosmetic to get the output to look more like your required output
df.columns = [c.upper() for c in df.columns]
col_order = ['GROWING','SHRINKING','ADVANCING','RETREATING','MONTH.MEASURE']
df[['ID1','ID2']+col_order]
# ID1 ID2 GROWING SHRINKING ADVANCING RETREATING MONTH.MEASURE
# 0 44 29 0 0 1 0 04-2020
# 1 105 50 1 0 1 0 04-2020
# 2 105 50 0 1 0 1 05-2020
This is crosstab:
pd.crosstab([df['id1'],df['id2'],df['month.measure']], df['type']).reset_index()
Output:
type id1 id2 month.measure advancing growing retreating shrinking
0 44 29 04-2020 1 0 0 0
1 105 50 04-2020 1 1 0 0
2 105 50 05-2020 0 0 1 1
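One caveat: crosstab (like the groupby sum above) counts occurrences, so a type that appears twice in the same group would show up as 2 rather than 1. A hedged sketch of one way to keep the result strictly binary, using the question's data plus one duplicated row that I added purely for illustration:

```python
import pandas as pd

# The question's data, with the last row duplicated on purpose
# (the duplicate is an assumption made for this illustration).
df = pd.DataFrame({
    "id1": [105, 105, 44, 105, 105, 105],
    "id2": [50, 50, 29, 50, 50, 50],
    "type": ["growing", "advancing", "advancing",
             "retreating", "shrinking", "shrinking"],
    "month.measure": ["04-2020", "04-2020", "04-2020",
                      "05-2020", "05-2020", "05-2020"],
})

# Count occurrences of each type per (id1, id2, month.measure) group.
counts = pd.crosstab([df["id1"], df["id2"], df["month.measure"]], df["type"])

# Cap the counts at 1 before resetting the index, so the id columns are untouched.
flags = counts.clip(upper=1).reset_index()
print(flags)
```

If the types never repeat within a group, the clip step changes nothing and can be dropped.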

Python Pandas Get Values According to If/Else

My input dataframe:
Order Need WarehouseStock StoreStock
1 3 74 5
0 4 44 44
0 0 44 44
6 12 44 44
0 6 644 44
6 6 44 44
I want to count the rows where the "Order" and "Need" values differ, using the code below:
difference = df['Need'] - df['Order']
mask = difference.between(-1,1)
print (f'Count: {(~mask).sum()}')
I want to do something like this:
If (WarehouseStock-StoreStock) >= Need:
    difference1 = df['Need'] - df['Order']
    mask1 = difference1.between(-1,1)
    print (f'Count: {(~mask1).sum()}')
Else:
    difference2 = df['Need'] - df['Order']
    mask2 = difference2.between(-5,5)
    print (f'Count: {(~mask2).sum()}')
The desired outputs are:
Count 3
Order Need WarehouseStock StoreStock
1 3 74 5
6 12 44 44
0 6 644 44
Could you please help me with this?
Using numpy.where with pandas.Series.between:
import pandas as pd
import numpy as np
s = df['Need'] - df['Order']
ind = np.where((df['WarehouseStock'] - df['StoreStock']).ge(df['Need']), ~s.between(-1, 1), ~s.between(-5 , 5))
Output:
ind.sum()
# 3
df[ind]
Order Need WarehouseStock StoreStock
0 1 3 74 5
3 6 12 44 44
4 0 6 644 44
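For anyone who wants to run this end to end, here is a self-contained variant on the question's data. It is a sketch of an equivalent formulation rather than the exact code above: np.where picks a per-row tolerance (1 or 5) and the absolute difference is compared against it. The names enough_stock, tolerance and flagged are mine.

```python
import numpy as np
import pandas as pd

# Input frame copied from the question.
df = pd.DataFrame({
    "Order": [1, 0, 0, 6, 0, 6],
    "Need": [3, 4, 0, 12, 6, 6],
    "WarehouseStock": [74, 44, 44, 44, 644, 44],
    "StoreStock": [5, 44, 44, 44, 44, 44],
})

diff = df["Need"] - df["Order"]

# True where the warehouse surplus covers the need.
enough_stock = (df["WarehouseStock"] - df["StoreStock"]) >= df["Need"]

# Allowed difference per row: 1 when stock is sufficient, otherwise 5.
tolerance = np.where(enough_stock, 1, 5)

# abs(diff) > t is the same condition as "not between(-t, t)".
flagged = diff.abs() > tolerance

print(f"Count: {flagged.sum()}")
print(df[flagged])
```

Keeping the tolerance as its own array makes it easy to add further cases later, for example with np.select.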

how to add complementary intervals in pandas dataframe

Let's say that I have a signal of 100 samples (L=100).
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
c = pd.DataFrame(np.array([[10,26],[50,84]]),columns=['Start','End'])
c['Value']='OK'
How can I add the complementary intervals in another dataframe in order to have something like this?
d = pd.DataFrame(np.array([[0,9],[10,26],[27,49],[50,84],[85,100]]),columns=['Start','End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge, as suggested by @jezrael:
d = pd.DataFrame({"Start":[0] + sorted(pd.concat([c.Start , c.End+1])), "End": sorted(pd.concat([c.Start-1 , c.End]))+[100]} )
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
d = d.reindex(columns=["Start","End","Value"])
output
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
I think you need:
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT:
You can use numpy.concatenate with numpy.sort, numpy.column_stack and the DataFrame constructor to build the new df. Finally, merge and use fillna with a dict to replace the missing values in the Value column:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT1 :
A new row is added to c with loc, then the frame is reshaped to a Series with stack and shifted. Finally, the df is rebuilt with unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)
Start End Value
0 0 9 Check
1 27 49 Check
-1 85 100 Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
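If the merge and stack tricks feel opaque, the same result can be built with a plain loop. This is only a sketch under stated assumptions: the helper name add_complement and its lo/hi parameters are hypothetical, the OK intervals must not overlap, and every input row is relabelled "OK".

```python
import pandas as pd

def add_complement(ok, lo=0, hi=100):
    """Return the OK intervals plus the gaps between them, labelled 'Check'.

    Assumes the intervals in `ok` do not overlap and lie within [lo, hi].
    """
    rows = []
    cursor = lo  # first sample not yet covered by any interval
    ordered = ok.sort_values("Start")[["Start", "End"]]
    for start, end in ordered.itertuples(index=False):
        if start > cursor:
            # Gap before this OK interval.
            rows.append((cursor, start - 1, "Check"))
        rows.append((start, end, "OK"))
        cursor = end + 1
    if cursor <= hi:
        # Trailing gap after the last OK interval.
        rows.append((cursor, hi, "Check"))
    return pd.DataFrame(rows, columns=["Start", "End", "Value"])

c = pd.DataFrame([[10, 26], [50, 84]], columns=["Start", "End"])
d = add_complement(c)
print(d)
```

The loop skips empty gaps, so an OK interval that starts at 0 or ends at 100 does not produce a zero-length "Check" row.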
