Count values and create new dataframe - python

I have a dataframe that looks like this:
df
Daily Risk Score
0 13.0
1 10.0
2 25.0
3 7.0
4 18.0
... ...
672 14.0
673 9.0
674 15.0
675 6.0
676 13.0
I want to count the number of times a value of 0<x<9, 9<x<17 and >=17 occurs. I tried doing this:
df1=pd.cut(df['Daily Risk Score'], bins=[0, 9, 17, np.inf], labels=['Green','Orange','Red'])
However, all this does is change the value to the label. What I want is a new dataframe that just has the counts of the values like this:
df1
Green Orange Red
x y z
What am I missing to accomplish this task?

Use .groupby and .transpose at the end of this code.
df1 = (pd.cut(df['Daily Risk Score'],
              bins=[0, 9, 17, np.inf],
              labels=['Green', 'Orange', 'Red'])
       .reset_index()
       .groupby('Daily Risk Score')
       .count()
       .transpose())
df1
output:
Daily Risk Score Green Orange Red
index 3 4 2

I tried it with a slightly different method. It's easy too. Try it out and let me know if you run into any issue or error.
Here you go:
# start every row as "other"; rows matching a range below are relabelled
df["col"] = "other"
for i in range(len(df)):
    score = df["Daily Risk Score"].iloc[i]
    if 0 < score < 9:
        df.loc[df.index[i], "col"] = "0<Daily Risk Score<9"
    elif 9 < score < 17:
        df.loc[df.index[i], "col"] = "9<Daily Risk Score<17"
    elif score >= 17:
        df.loc[df.index[i], "col"] = "Daily Risk Score>=17"
print(df["col"].value_counts())
# drop the helper column again (drop returns a new frame, so reassign it)
df = df.drop(columns=["col"])

Try:
df1=df.groupby(pd.cut(df['Daily Risk Score'], bins=[0, 9, 17, np.inf], labels=['Green','Orange','Red'])).size()
df1:
Daily Risk Score
Green 3
Orange 5
Red 2
dtype: int64
OR
df1=df.groupby(pd.cut(df['Daily Risk Score'], bins=[0, 9, 17, np.inf], labels=['Green','Orange','Red'])).size()
df2 = pd.DataFrame(df1.reset_index().values.T)
df2.columns = df2.iloc[0]
df2 = df2[1:]
df2:
Green Orange Red
1 3 5 2
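A more compact route to the one-row frame is to count the labels that pd.cut already produces. The sketch below is only an illustration: it uses the ten scores visible in the question as sample data (the remaining rows are elided, so the counts describe this sample only), and the names count_row, names, df1 and df1_left_closed are made up for the example. Note that pd.cut closes bins on the right by default, so a score of exactly 9 lands in Green and exactly 17 in Orange; passing right=False gives [0, 9), [9, 17), [17, inf), which is closer to the >=17 rule in the question.

```python
import numpy as np
import pandas as pd

# Sample data: only the ten scores that are visible in the question.
df = pd.DataFrame({"Daily Risk Score": [13.0, 10.0, 25.0, 7.0, 18.0,
                                        14.0, 9.0, 15.0, 6.0, 13.0]})

bins = [0, 9, 17, np.inf]
names = ["Green", "Orange", "Red"]


def count_row(scores, right):
    """Return a one-row DataFrame with the number of scores per label."""
    labels = pd.cut(scores, bins=bins, labels=names, right=right)
    counts = labels.value_counts()
    # Look each label up explicitly so the column order is always Green, Orange, Red.
    return pd.DataFrame([[int(counts.loc[name]) for name in names]], columns=names)


# Default behaviour: bins are (0, 9], (9, 17], (17, inf]
df1 = count_row(df["Daily Risk Score"], right=True)

# Left-closed bins: [0, 9), [9, 17), [17, inf)
df1_left_closed = count_row(df["Daily Risk Score"], right=False)

print(df1)
print(df1_left_closed)
```

With the default bins this reproduces the 3 / 5 / 2 split shown above; with right=False the score of 9.0 moves from Green to Orange.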

Related

Sorting Pandas MultiIndex by the last value of level 0 index

I have a df called df_world with the following shape:
Cases Death Delta_Cases Delta_Death
Country/Region Date
Brazil 2020-01-22 0.0 0 NaN NaN
2020-01-23 0.0 0 0.0 0.0
2020-01-24 0.0 0 0.0 0.0
2020-01-25 0.0 0 0.0 0.0
2020-01-26 0.0 0 0.0 0.0
... ... ... ...
World 2020-05-12 4261747.0 291942 84245.0 5612.0
2020-05-13 4347018.0 297197 85271.0 5255.0
2020-05-14 4442163.0 302418 95145.0 5221.0
2020-05-15 4542347.0 307666 100184.0 5248.0
2020-05-16 4634068.0 311781 91721.0 4115.0
I'd like to sort the country index by the value of the column 'Cases' on the last recording, i.e. compare the 'Cases' values on 2020-05-16 for all countries and return the sorted country list.
I thought about creating another df with only the 2020-05-16 values and then use the df.sort_values() method but I am sure there has to be a more efficient way.
While I'm at it, I've also tried to only select the countries that have a number of cases on 2020-05-16 above a certain value and the only way I found to do it was to iterate over the Country index:
for a_country in df_world.index.levels[0]:
if df_world.loc[(a_country, last_date), 'Cases'] < cut_off_val:
df_world = df_world.drop(index=a_country)
But it's quite a poor way to do it.
If anyone has an idea on how to improve the efficiency of this code I'd be very happy.
Thank you :)
You can first group the dataset by "Country/Region", then sort each group by "Date", take the last row, and sort again by "Cases".
Faking some data myself (data types are different but you see my point):
df = pd.DataFrame([['a', 1, 100],
['a', 2, 10],
['b', 2, 55],
['b', 3, 15],
['c', 1, 22],
['c', 3, 80]])
df.columns = ['country', 'date', 'cases']
df = df.set_index(['country', 'date'])
print(df)
# cases
# country date
# a 1 100
# 2 10
# b 2 55
# 3 15
# c 1 22
# 3 80
Then,
# sort by date, then group by country (both are index levels here)
grp_by_country = df.sort_index(level='date').groupby(level='country')
# for each group, take the last row (latest date)
latest_per_grp = grp_by_country.last()
# sort again by cases
sorted_by_cases = latest_per_grp.sort_values(by='cases')
print(sorted_by_cases)
# cases
# country
# a 10
# b 15
# c 80
Stay safe!
last_recs = df_world.reset_index().groupby('Country/Region').last()
# 'Country/Region' is the index of last_recs after the groupby, not a column
sorted_countries = last_recs.sort_values('Cases').index
As I don't have your raw data, I can't test it, but this should do what you need. All methods are self-explanatory, I believe.
You may need to sort df_world by date in the first line if it isn't sorted already.
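For the second part of the question (dropping countries below a cut-off without looping), one possible approach is to take a cross-section of the last date and filter on it. This is a sketch on invented toy numbers, not the real df_world; the country names "A", "B", "C" and the variables last_cases, keep and filtered are illustrative, while last_date and cut_off_val follow the names used in the question.

```python
import pandas as pd

# Toy frame with the same index layout as df_world (the values are made up).
idx = pd.MultiIndex.from_product(
    [["A", "B", "C"], pd.to_datetime(["2020-05-15", "2020-05-16"])],
    names=["Country/Region", "Date"],
)
df_world = pd.DataFrame({"Cases": [10.0, 30.0, 5.0, 50.0, 100.0, 20.0]}, index=idx)

# Latest date present in the index.
last_date = df_world.index.get_level_values("Date").max()

# One 'Cases' value per country on that date (indexed by Country/Region).
last_cases = df_world.xs(last_date, level="Date")["Cases"]

# Countries ordered by their latest 'Cases' value, largest first.
sorted_countries = last_cases.sort_values(ascending=False).index.tolist()

# Keep only countries whose latest value reaches the cut-off.
cut_off_val = 25
keep = last_cases[last_cases >= cut_off_val].index
filtered = df_world[df_world.index.get_level_values("Country/Region").isin(keep)]

print(sorted_countries)
print(filtered)
```

The boolean mask removes all rows of a country in one step, which avoids calling drop once per country.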

Pandas: How to find the average length of days for a local outbreak to peak in a COVID-19 dataframe?

Let's say I have this dataframe containing the difference in number of active cases from previous value in each country:
[in]
import pandas as pd
import numpy as np
active_cases = {'Day(s) since outbreak':['0', '1', '2', '3', '4', '5'], 'Australia':[np.nan, 10, 10, -10, -20, -20], 'Albania':[np.nan, 20, 0, 15, 0, -20], 'Algeria':[np.nan, 25, 10, -10, 20, -20]}
df = pd.DataFrame(active_cases)
df
[out]
Day(s) since outbreak Australia Albania Algeria
0 0 NaN NaN NaN
1 1 10.0 20.0 25.0
2 2 10.0 0.0 10.0
3 3 -10.0 15.0 -10.0
4 4 -20.0 0.0 20.0
5 5 -20.0 -20.0 -20.0
I need to find the average length of days for a local outbreak to peak in this COVID-19 dataframe.
My solution is to find the nth row with the first negative value in each column (e.g., nth row of first negative value in 'Australia': 3, nth row of first negative value in 'Albania': 5) and average it.
However, I have no idea how to do this in pandas/Python.
Is there a way to perform this task with a few simple lines of pandas/Python code?
You can set_index on the column Day(s) since outbreak, then use iloc to select all rows except the first one, then check where the values are less than (lt) 0. Use idxmax to get the first row where the value is less than 0 and take the mean. With your input, it gives:
print (df.set_index('Day(s) since outbreak')\
.iloc[1:, :].lt(0).idxmax().astype(float).mean())
3.6666666666666665
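One caveat with idxmax on a boolean frame: for a column that never goes negative it still returns the first row label, which would pull the average down. The sketch below shows one way to guard against that by masking such columns out. The 'Andorra' column is hypothetical and added only to demonstrate the case; negative and first_negative_day are illustrative names.

```python
import numpy as np
import pandas as pd

active_cases = {
    "Day(s) since outbreak": ["0", "1", "2", "3", "4", "5"],
    "Australia": [np.nan, 10, 10, -10, -20, -20],
    "Albania": [np.nan, 20, 0, 15, 0, -20],
    "Algeria": [np.nan, 25, 10, -10, 20, -20],
    # Hypothetical column (not in the question): it has no negative value.
    "Andorra": [np.nan, 5, 5, 5, 5, 5],
}
df = pd.DataFrame(active_cases).set_index("Day(s) since outbreak")
df.index = df.index.astype(int)  # the day labels are strings in the question

# True where the daily change is negative; NaN compares as False.
negative = df.lt(0)

# First day with a negative change per country, NaN if there is none.
first_negative_day = negative.idxmax().where(negative.any())

# mean() skips NaN, so countries that have not peaked are ignored.
print(first_negative_day)
print(first_negative_day.mean())
```

On the three original columns this gives the same 3.67 days as above, and the hypothetical column is left out of the average instead of counting as day 0.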
IIUC
Use df.where to keep the negative values and replace everything else with np.nan, then calculate the mean
cols= ['Australia','Albania','Algeria']
df.set_index('Day(s) since outbreak', inplace=True)
m = df< 0
df2=df.where(m, np.nan)
#df2 = df2.replace(0, np.nan)
df2.mean()
Result

How to get the previous value of the same row (previous column) from the pandas dataframe?

I want to fetch the value from the previous column but the same row and I need to multiply that value with 5 and write it to the current place.
I have tried the shift method of pandas, but it's not working. After that, I wrote a separate function to get the previous column name, but I don't think that's a good approach.
'''
def get_previous_column_name(wkName):
    v = int(wkName.strip('W'))
    newv = str(v - 1)
    if len(newv) == 1:
        newv = '0' + newv
    return 'W' + newv
'''
dataframe:
W01,W02,W03,W04,W05
7, 8
10,20
20, 40
expected result:
W01,W02,W03,W04,W05
7, 8, 40, 200, 1000
10, 20, 100, 500, 2500
20, 40, 200, 1000, 5000
Here is one way, using ffill + cumsum:
df=df.ffill(axis=1)*(5)**df.isnull().cumsum(axis=1)
df
Out[230]:
W01 W02 W03 W04 W05
0 7.0 8.0 40.0 200.0 1000.0
1 10.0 20.0 100.0 500.0 2500.0
2 20.0 40.0 200.0 1000.0 5000.0
import pandas as pd
data = pd.read_csv('C:/d1', sep=',', header=None,names=['W1','W2'])
df=pd.DataFrame(data)
dfNew=pd.DataFrame(columns=['W1','W2','W3','W4','W5'])
(rows,columns)=df.shape
for index in range(rows):
    tempRow=[df.iat[index,0],df.iat[index,1],df.iat[index,1]*5,df.iat[index,1]*25,df.iat[index,1]*125]
    dfNew.loc[len(dfNew)]=tempRow
print()
print(dfNew)
If you indeed have only three columns to fill, just do the multiplication:
df['W03'] = df['W02'] * 5
df['W04'] = df['W03'] * 5
df['W05'] = df['W04'] * 5
df
# W01 W02 W03 W04 W05
#0 7 8 40 200 1000
#1 10 20 100 500 2500
#2 20 40 200 1000 5000
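If the number of week columns is not fixed, the same multiplication can be written as a loop over neighbouring column pairs. This is a sketch that assumes the columns are already in week order and that the empty cells are NaN; prev_col and cur_col are illustrative names.

```python
import numpy as np
import pandas as pd

# The frame from the question, with the empty cells as NaN.
df = pd.DataFrame({
    "W01": [7, 10, 20],
    "W02": [8, 20, 40],
    "W03": np.nan,
    "W04": np.nan,
    "W05": np.nan,
})

# Walk over (previous column, current column) pairs from left to right and
# fill only the missing cells with 5 times the value to their left.
for prev_col, cur_col in zip(df.columns[:-1], df.columns[1:]):
    df[cur_col] = df[cur_col].fillna(df[prev_col] * 5)

print(df)
```

Because fillna only touches missing cells, the values already present in W02 are left as they are.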

How to add conditions when calculating using Python?

I have a dataframe with two numeric columns. I want to add a third column to calculate the difference. But the condition is if the values in the first column are blank or Nan, the difference should be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
df['August Sales'] - df['July Sales'])
However, note that this is exactly the same as treating NaN values in df['July Sales'] as zero, so you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
This isn't really a situation with conditions; it is just a math operation. Given your df, you can use the .sub() method with fill_value=0:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
returns output:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
Used a sample dataframe, but it shouldn't be hard to comprehend:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})
def diff(row):
    return row['B'] if (pd.isnull(row['A'])) else (row['B'] - row['A'])
df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
Try this:
def diff(row):
    # NaN is truthy, so "not row['col1']" would miss it; test for NaN explicitly
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col2'] - row['col1']
df['col3'] = df.apply(diff, axis=1)
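To check that the vectorised variants suggested in this thread agree, here is a short sketch on the July/August sample shown in one of the answers. The names via_where, via_fillna and via_sub are illustrative.

```python
import numpy as np
import pandas as pd

# Sample values taken from the output table shown in the thread.
df = pd.DataFrame({
    "July Sales": [459, 422, 348, 397, np.nan, 191, 435, np.nan, 475, 284],
    "August Sales": [477, 125, 483, 271, 563, 325, 463, 479, 473, 496],
})

# 1) explicit condition
via_where = np.where(df["July Sales"].isnull(),
                     df["August Sales"],
                     df["August Sales"] - df["July Sales"])

# 2) treat a missing July value as zero
via_fillna = df["August Sales"] - df["July Sales"].fillna(0)

# 3) subtraction that substitutes 0 for a value missing on one side
via_sub = df["August Sales"].sub(df["July Sales"], fill_value=0)

print(via_fillna.tolist())
```

The three only differ when 'August Sales' itself has gaps: fill_value=0 in .sub() also fills a missing August value, whereas the fillna variant leaves that difference as NaN.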

Pandas DF Multiple Conditionals using np.where

I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic.
My current dataframe looks like the df below, with four columns. I would like to add two columns, named the below, with the following conditions:
The desired output is below - the df df_so_v2
Days since activity
*Find most recent prior row with same ID, then subtract dates column
*If no most recent value, return NA
Chg. Avg. Value
Condition 1: If Count = 0, NA
Condition 2: If Count !=0, find most recent prior row with BOTH the same ID and Count!=0, then find the difference in Avg. Value column.
However, I am building off simple np.where queries like the below and do not know how to combine the multiple conditions needed in this case.
df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission , df['CASH'])
Thank you very much for your help on this.
df_dict={'DateOf': ['2017-08-07','2017-08-07','2017-08-07','2017-08-04','2017-08-04','2017-08-04'
, '2017-08-03','2017-08-03','2017-08-03','2017-08-02','2017-08-02','2017-08-02','2017-08-01','2017-08-01','2017-08-01'],
'ID': ['553','559','914','553','559','914','553','559','914','553','559','914','553','559','914'], 'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0,1,0,2,4,4,0],
'Avg. Value': [0,3.5,2.2,0,4.2,3.3,5.3,5,0,3,0,2,4.4,6.4,0]}
df_so=pd.DataFrame(df_dict)
df_dict_v2={'DateOf': ['2017-08-07','2017-08-07','2017-08-07','2017-08-04','2017-08-04','2017-08-04'
, '2017-08-03','2017-08-03','2017-08-03','2017-08-02','2017-08-02','2017-08-02','2017-08-01','2017-08-01','2017-08-01'],
'ID': ['553','559','914','553','559','914','553','559','914','553','559','914','553','559','914'], 'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0,1,0,2,4,4,0],
'Avg. Value': [0,3.5,2.2,0,4.2,3.3,5.3,5,0,3,0,2,4.4,6.4,0],
'Days_since_activity': [4,3,1,1,1,2,1,2,1,1,1,1,'NA','NA','NA'],
'Chg. Avg Value': ['NA',-0.7,-1.1,'NA',-0.8,1.3,2.3,-1.4,'NA',-1.4,'NA','NA','NA','NA','NA']
}
df_so_v2=pd.DataFrame(df_dict_v2)
Here is the answer to this part of the question. I need more clarification on the conditions of 2.
1) Days since activity *Find most recent prior row with same ID, then subtract dates column *If no most recent value, return NA
First you need to convert strings to datetime, then sort the dates in ascending order. Finally use .transform to find the difference.
df_dict={'DateOf': ['2017-08-07','2017-08-07','2017-08-07','2017-08-04','2017-08-04','2017-08-04'
, '2017-08-03','2017-08-03','2017-08-03','2017-08-02','2017-08-02','2017-08-02','2017-08-01','2017-08-01','2017-08-01'],
'ID': ['553','559','914','553','559','914','553','559','914','553','559','914','553','559','914'], 'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0,1,0,2,4,4,0],
'Avg. Value': [0,3.5,2.2,0,4.2,3.3,5.3,5,0,3,0,2,4.4,6.4,0]}
df_so = pd.DataFrame(df_dict)
df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
df_so['Days_since_activity'] = df_so.groupby(['ID'])['DateOf'].transform(pd.Series.diff)
df_so.sort_index()
Edited based on your comment:
Find the most recent previous day that does not have a count of Zero and calculate the difference.
df_dict={'DateOf': ['2017-08-07','2017-08-07','2017-08-07','2017-08-04','2017-08-04','2017-08-04'
, '2017-08-03','2017-08-03','2017-08-03','2017-08-02','2017-08-02','2017-08-02','2017-08-01','2017-08-01','2017-08-01'],
'ID': ['553','559','914','553','559','914','553','559','914','553','559','914','553','559','914'], 'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0,1,0,2,4,4,0],
'Avg. Value': [0,3.5,2.2,0,4.2,3.3,5.3,5,0,3,0,2,4.4,6.4,0]}
df = pd.DataFrame(df_dict)
df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID','DateOf'], inplace=True)
df['Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff()
mask = df.ID != df.ID.shift(1)
mask2 = df.groupby('ID').Count.shift(1) == 0
# use .loc instead of chained indexing so the assignments hit df itself
df.loc[mask, 'Days_since_activity'] = pd.NaT
df.loc[mask2, 'Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff(2)
df['Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff()
df.loc[mask2, 'Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff(2)
conditions = [((df['Count'] == 0)),]
choices = [np.nan,]
df['Chg. Avg Value'] = np.select(conditions, choices, default = df['Chg. Avg Value'])
# df = df.sort_index()
df
New unsorted Output for easy comparison:
DateOf ID Count Avg. Value Days_since_activity Chg. Avg Value
12 2017-08-01 553 4 4.4 NaT NaN
9 2017-08-02 553 1 3.0 1 days -1.4
6 2017-08-03 553 3 5.3 1 days 2.3
3 2017-08-04 553 0 0.0 1 days NaN
0 2017-08-07 553 0 0.0 4 days NaN
13 2017-08-01 559 4 6.4 NaT NaN
10 2017-08-02 559 0 0.0 1 days NaN
7 2017-08-03 559 9 5.0 2 days -1.4
4 2017-08-04 559 11 4.2 1 days -0.8
1 2017-08-07 559 4 3.5 3 days -0.7
14 2017-08-01 914 0 0.0 NaT NaN
11 2017-08-02 914 2 2.0 NaT NaN
8 2017-08-03 914 0 0.0 1 days NaN
5 2017-08-04 914 10 3.3 2 days 1.3
2 2017-08-07 914 5 2.2 3 days -1.1
Index 11 should be NaT because the most recent previous row has a count of zero and there is nothing else to compare it to.
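The diff(2) approach looks back at most two rows, so it would miss cases with two or more consecutive zero-count rows. A possible generalisation is sketched below. It assumes that both columns should be measured against the most recent prior row of the same ID with Count != 0; this reading is inferred from the output above and does not match every entry of the expected table in the question. The names active_date, prev_active and nonzero are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "DateOf": ["2017-08-07"] * 3 + ["2017-08-04"] * 3 + ["2017-08-03"] * 3
              + ["2017-08-02"] * 3 + ["2017-08-01"] * 3,
    "ID": ["553", "559", "914"] * 5,
    "Count": [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
    "Avg. Value": [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
})
df["DateOf"] = pd.to_datetime(df["DateOf"])
df = df.sort_values(["ID", "DateOf"])

# Date of each row if it had activity (Count != 0), otherwise NaT.
active_date = df["DateOf"].where(df["Count"] != 0)

# For every row: date of the latest *earlier* active row of the same ID.
prev_active = active_date.groupby(df["ID"]).shift()
prev_active = prev_active.groupby(df["ID"]).ffill()
df["Days_since_activity"] = (df["DateOf"] - prev_active).dt.days

# Change in Avg. Value between consecutive active rows of the same ID;
# rows with Count == 0 are absent from `nonzero` and therefore end up as NaN.
nonzero = df[df["Count"] != 0]
df["Chg. Avg Value"] = nonzero.groupby("ID")["Avg. Value"].diff()

print(df)
```

On this data the result agrees with the table above, including NaN for index 11, and it keeps working if an ID has longer runs of zero-count days.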
