I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic.
My current dataframe looks like the df below, with four columns. I would like to add the two columns named below, under the following conditions:
The desired output is shown below as the dataframe df_so_v2.
Days since activity
*Find most recent prior row with same ID, then subtract dates column
*If no most recent value, return NA
Chg. Avg. Value
Condition 1: If Count = 0, NA
Condition 2: If Count !=0, find most recent prior row with BOTH the same ID and Count!=0, then find the difference in Avg. Value column.
However, I am building off simple np.where queries like the one below and do not know how to combine the multiple conditions needed in this case.
df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission , df['CASH'])
Thank you very much for your help on this.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                  '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)
df_dict_v2 = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                         '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                         '2017-08-01', '2017-08-01', '2017-08-01'],
              'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                     '553', '559', '914', '553', '559', '914'],
              'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
              'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
              'Days_since_activity': [4, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 'NA', 'NA', 'NA'],
              'Chg. Avg Value': ['NA', -0.7, -1.1, 'NA', -0.8, 1.3, 2.3, -1.4, 'NA', -1.4, 'NA', 'NA', 'NA', 'NA', 'NA']}
df_so_v2 = pd.DataFrame(df_dict_v2)
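For background on the general pattern (a generic sketch of mine, not part of the question or answers; the column names are toy examples): boolean conditions combine inside np.where with the bitwise operators & and |, each comparison parenthesized, and np.select handles several mutually exclusive branches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Count': [0, 4, 5], 'Value': [1.0, 2.0, 3.0]})

# Two conditions combined with & (each comparison parenthesized)
df['flag'] = np.where((df['Count'] != 0) & (df['Value'] > 1.5), 'both', 'other')

# np.select evaluates several conditions in order; the first match wins
conditions = [df['Count'] == 0, df['Value'] > 2.5]
choices = [np.nan, df['Value']]
df['out'] = np.select(conditions, choices, default=0)
```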
Here is the answer to the first part of the question; I need more clarification on the conditions of part 2.
1) Days since activity *Find most recent prior row with same ID, then subtract dates column *If no most recent value, return NA
First you need to convert the strings to datetime, then sort the dates in ascending order. Finally, use .transform to find the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                  '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)
df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
df_so['Days_since_activity'] = df_so.groupby(['ID'])['DateOf'].transform(pd.Series.diff)
df_so = df_so.sort_index()
Edited based on your comment:
Find the most recent previous day that does not have a count of Zero and calculate the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                  '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df = pd.DataFrame(df_dict)
df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID','DateOf'], inplace=True)
df['Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff()
mask = df.ID != df.ID.shift(1)
mask2 = df.groupby('ID').Count.shift(1) == 0
df.loc[mask, 'Days_since_activity'] = np.nan
df.loc[mask2, 'Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff(2)
df['Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff()
df.loc[mask2, 'Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff(2)
conditions = [df['Count'] == 0]
choices = [np.nan]
df['Chg. Avg Value'] = np.select(conditions, choices, default=df['Chg. Avg Value'])
# df = df.sort_index()
df
New unsorted Output for easy comparison:
DateOf ID Count Avg. Value Days_since_activity Chg. Avg Value
12 2017-08-01 553 4 4.4 NaT NaN
9 2017-08-02 553 1 3.0 1 days -1.4
6 2017-08-03 553 3 5.3 1 days 2.3
3 2017-08-04 553 0 0.0 1 days NaN
0 2017-08-07 553 0 0.0 4 days NaN
13 2017-08-01 559 4 6.4 NaT NaN
10 2017-08-02 559 0 0.0 1 days NaN
7 2017-08-03 559 9 5.0 2 days -1.4
4 2017-08-04 559 11 4.2 1 days -0.8
1 2017-08-07 559 4 3.5 3 days -0.7
14 2017-08-01 914 0 0.0 NaT NaN
11 2017-08-02 914 2 2.0 NaT NaN
8 2017-08-03 914 0 0.0 1 days NaN
5 2017-08-04 914 10 3.3 2 days 1.3
2 2017-08-07 914 5 2.2 3 days -1.1
Index 11 should be NaT because the most recent previous row has a count of zero and there is nothing earlier to compare it to.
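As an aside (a sketch of mine, not part of the original answer): the shift(1)/diff(2) logic above only skips a single zero-count row, which is why index 11 misbehaves. Diffing only the rows with activity handles runs of zeros of any length, because rows excluded from the subset are simply left as NaN when assigning back:

```python
import pandas as pd

# Toy reproduction of the '914' rows from the question
df = pd.DataFrame({
    'DateOf': pd.to_datetime(['2017-08-01', '2017-08-02', '2017-08-03',
                              '2017-08-04', '2017-08-07']),
    'ID': ['914'] * 5,
    'Count': [0, 2, 0, 10, 5],
    'Avg. Value': [0, 2.0, 0, 3.3, 2.2],
})
df.sort_values(['ID', 'DateOf'], inplace=True)

# Diff Avg. Value across non-zero-count rows only, per ID;
# zero-count rows and each ID's first active row stay NaN
nz = df[df['Count'] != 0]
df['Chg. Avg Value'] = nz.groupby('ID')['Avg. Value'].diff()
```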
Related
I have a dataframe that looks like this:
df
Daily Risk Score
0 13.0
1 10.0
2 25.0
3 7.0
4 18.0
... ...
672 14.0
673 9.0
674 15.0
675 6.0
676 13.0
I want to count the number of times a value of 0<x<9, 9<x<17 and >=17 occurs. I tried doing this:
df1=pd.cut(df['Daily Risk Score'], bins=[0, 9, 17, np.inf], labels=['Green','Orange','Red'])
However, all this does is change the value to the label. What I want is a new dataframe that just has the counts of the values like this:
df1
Green Orange Red
x y z
What am I missing to accomplish this task?
Use .groupby and .transpose at the end of this code:
df1 = (pd.cut(df['Daily Risk Score'],
              bins=[0, 9, 17, np.inf],
              labels=['Green', 'Orange', 'Red'])
       .reset_index()
       .groupby('Daily Risk Score')
       .count()
       .transpose())
df1
output:
Daily Risk Score Green Orange Red
index 3 4 2
I tried it with a slightly different method; it's easy too. Try it out and let me know if you face any issues or errors.
Here you go:
df["col"] = 0
for i in range(len(df)):
    if 0 < df["Daily Risk Score"][i] < 9:
        df.loc[i, "col"] = "0<Daily Risk Score<9"
    elif 9 < df["Daily Risk Score"][i] < 17:
        df.loc[i, "col"] = "9<Daily Risk Score<17"
    elif df["Daily Risk Score"][i] >= 17:
        df.loc[i, "col"] = "Daily Risk Score>=17"
    else:
        df.loc[i, "col"] = "other"
df["col"].value_counts()
df = df.drop(columns=["col"])
Try:
df1=df.groupby(pd.cut(df['Daily Risk Score'], bins=[0, 9, 17, np.inf], labels=['Green','Orange','Red'])).size()
df1:
Daily Risk Score
Green 3
Orange 5
Red 2
dtype: int64
OR
df1=df.groupby(pd.cut(df['Daily Risk Score'], bins=[0, 9, 17, np.inf], labels=['Green','Orange','Red'])).size()
df2 = pd.DataFrame(df1.reset_index().values.T)
df2.columns = df2.iloc[0]
df2 = df2[1:]
df2:
Green Orange Red
1 3 5 2
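For comparison (an alternative sketch of mine, not one of the answers above): the binned Series can be tallied directly with value_counts, and to_frame().T produces the requested one-row layout; sort=False keeps the bins in category order rather than ordering by frequency:

```python
import numpy as np
import pandas as pd

# The ten visible values from the question
df = pd.DataFrame({'Daily Risk Score': [13.0, 10.0, 25.0, 7.0, 18.0,
                                        14.0, 9.0, 15.0, 6.0, 13.0]})

counts = pd.cut(df['Daily Risk Score'],
                bins=[0, 9, 17, np.inf],
                labels=['Green', 'Orange', 'Red']).value_counts(sort=False)
row = counts.to_frame().T  # one row with Green / Orange / Red columns
```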
Let's say I have this dataframe containing the difference in number of active cases from previous value in each country:
[in]
import pandas as pd
import numpy as np
active_cases = {'Day(s) since outbreak':['0', '1', '2', '3', '4', '5'], 'Australia':[np.NaN, 10, 10, -10, -20, -20], 'Albania':[np.NaN, 20, 0, 15, 0, -20], 'Algeria':[np.NaN, 25, 10, -10, 20, -20]}
df = pd.DataFrame(active_cases)
df
[out]
Day(s) since outbreak Australia Albania Algeria
0 0 NaN NaN NaN
1 1 10.0 20.0 25.0
2 2 10.0 0.0 10.0
3 3 -10.0 15.0 -10.0
4 4 -20.0 0.0 20.0
5 5 -20.0 -20.0 -20.0
I need to find the average length of days for a local outbreak to peak in this COVID-19 dataframe.
My solution is to find the nth row with the first negative value in each column (e.g., nth row of first negative value in 'Australia': 3, nth row of first negative value in 'Albania': 5) and average it.
However, I have no idea how to do this in Panda/Python.
Are there any ways to perform this task with simple lines of Python/Panda code?
You can set_index on the column Day(s) since outbreak, then use iloc to select all rows except the first one and check where the values are less than 0 with lt. idxmax then gives the first row where each column is negative, and you take the mean. With your input, it gives:
print (df.set_index('Day(s) since outbreak')\
.iloc[1:, :].lt(0).idxmax().astype(float).mean())
3.6666666666666665
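One caveat (my note, not raised in the answer): idxmax on an all-False column silently returns the first index label, so a country with no negative value yet would wrongly count as peaking on the first day. Masking with any() guards against that:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Day(s) since outbreak': ['0', '1', '2', '3', '4', '5'],
                   'Australia': [np.nan, 10, 10, -10, -20, -20],
                   'Albania': [np.nan, 20, 0, 15, 0, -20],
                   'Algeria': [np.nan, 25, 10, -10, 20, -20]})

neg = df.set_index('Day(s) since outbreak').iloc[1:, :].lt(0)
# Keep idxmax only for columns that actually contain a negative value
first_neg = neg.idxmax().where(neg.any()).astype(float)
print(first_neg.mean())
```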
IIUC: using df.where, keep the negatives, replace everything else with np.NaN, and then calculate the mean:
cols= ['Australia','Albania','Algeria']
df.set_index('Day(s) since outbreak', inplace=True)
m = df< 0
df2=df.where(m, np.NaN)
#df2 = df2.replace(0, np.NaN)
df2.mean()
Result
I have a dataframe with two month value columns as 'month1' and 'month2'. If the value in 'month1' column is not 'NA', then sum the corresponding 'amount' values as per 'month1' column. If the value in 'month1' column is 'NA', then sum the corresponding 'amount' values of 'month2' column.
import pandas as pd
df = pd.DataFrame({'month1': [1,2,'NA', 1, 4, 'NA', 'NA'],
'month2': ['NA',5,1, 2, 'NA', 1, 3],
'amount': [10,20,40, 50, 60, 70, 100]})
The input and output dataframes are as follows:
Input dataframe
month1 month2 amount
0 1.0 NaN 10
1 2.0 5.0 20
2 NaN 1.0 40
3 1.0 2.0 50
4 4.0 NaN 60
5 NaN 1.0 70
6 NaN 3.0 100
Output dataframe
Since your NA values are strings, you can simply groupby on the two columns:
# ignore month2 if month1 is NA
df.loc[df.month1.ne('NA'), 'month2'] = 'NA'
# groupby and sum
df.groupby(['month1','month2']).amount.transform('sum')
if you don't want to alter your data, you can do
s = np.where(df.month1.ne('NA'), 'NA', df['month2'])
df.groupby(['month1', s]).amount.transform('sum')
Output:
0 60
1 20
2 110
3 60
4 60
5 110
6 100
Name: amount, dtype: int64
You can use:
c=df.month1.eq('NA')
np.select([c,~c],[df.groupby('month2')['amount'].transform('sum')
,df.groupby('month1')['amount'].transform('sum')],default='NA') #assign to new column
array(['60', '20', '110', '60', '60', '110', '100'], dtype='<U21')
Edit: as #rafael pointed out, your data may be a mix of numbers and strings, so converting it all to numeric before processing is needed.
A simple way is to groupby and transform month1 and month2 separately, then fillna the result of month1 with month2.
df = df.apply(pd.to_numeric, errors='coerce')
m1 = df.groupby('month1').amount.transform('sum')
m2 = df.groupby('month2').amount.transform('sum')
m1.fillna(m2)
Out[406]:
0 60.0
1 20.0
2 110.0
3 60.0
4 60.0
5 110.0
6 100.0
Name: amount, dtype: float64
I have a dataframe with two numeric columns. I want to add a third column to calculate the difference. But the condition is if the values in the first column are blank or Nan, the difference should be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
df['August Sales'] - df['July Sales'])
However, consider that this is precisely the same as treating NaN values in df['July Sales'] as zero. So you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
This isn't really a situation with conditions; it is just a math operation. Consider your df using the .sub() method with fill_value:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
returns output:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
Used a sample dataframe, but it shouldn't be hard to comprehend:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})
def diff(row):
    return row['B'] if pd.isnull(row['A']) else (row['B'] - row['A'])

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
try this:
def diff(row):
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col2'] - row['col1']

df['col3'] = df.apply(diff, axis=1)
I have a dataframe that looks like so
time usd hour day
0 2015-08-30 07:56:28 1.17 7 0
1 2015-08-30 08:56:28 1.27 8 0
2 2015-08-30 09:56:28 1.28 9 0
3 2015-08-30 10:56:28 1.29 10 0
4 2015-08-30 11:56:28 1.29 11 0
14591 2017-04-30 23:53:46 9.28 23 609
Given this, how would I go about building a numpy 2D matrix with hour on one axis, day on the other, and usd as the value stored in the matrix?
Consider the dataframe df
df = pd.DataFrame(dict(
time=pd.date_range('2015-08-30', periods=14000, freq='H'),
usd=(np.random.randn(14000) / 100 + 1.0005).cumprod()
))
Then we can set the index with the date and hour of df.time column and unstack. We take the values of this result in order to access the numpy array.
a = df.set_index([df.time.dt.date, df.time.dt.hour]).usd.unstack().values
I would do a pivot_table and leave the data as a pandas DataFrame but the conversion to a numpy array is trivial if you don't want labels.
import pandas as pd
data = <data>
data.pivot_table(values='usd', index='hour', columns='day').values
Edit: Thank you #pyRSquared for the "Value"able tip. (changed np.array(data) to df...values)
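On the design choice (my note, under the assumption of duplicate timestamps in the data): df.pivot raises a ValueError when an (hour, day) pair repeats, while pivot_table aggregates duplicates (mean by default), which makes pivot_table the safer default for raw logs:

```python
import pandas as pd

# Two readings for hour=7 on day=0: pivot would raise, pivot_table averages
df = pd.DataFrame({'usd': [1.17, 1.20, 1.27],
                   'hour': [7, 7, 8],
                   'day': [0, 0, 0]})

m = df.pivot_table(values='usd', index='hour', columns='day').values
```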
You can use the pivot functionality of pandas, as described here. You will get NaN values for usd, when there is no value for the day or hour.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'usd': [1.17, 1.27, 1.28, 1.29, 1.29, 9.28], 'hour': [7, 8, 9, 10, 11, 23], 'day': [0, 0, 0, 0, 0, 609]})
In [3]: df
Out[3]:
day hour usd
0 0 7 1.17
1 0 8 1.27
2 0 9 1.28
3 0 10 1.29
4 0 11 1.29
5 609 23 9.28
In [4]: df.pivot(index='hour', columns='day', values='usd')
Out[4]:
day 0 609
hour
7 1.17 NaN
8 1.27 NaN
9 1.28 NaN
10 1.29 NaN
11 1.29 NaN
23 NaN 9.28