After searching several forums on similar questions, it appears that one fast way to apply a conditional statement over a column is NumPy's np.where() function on Pandas. I am having trouble with the following task:
I have a dataset that looks like several rows of:
PatientID Date1 Date2 ICD
1234 12/14/10 12/12/10 313.2, 414.2, 228.1
3213 8/2/10 9/5/12 232.1, 221.0
I am trying to create a conditional statement such that:
1. if strings '313.2' or '414.2' exist in df['ICD'] return 1
2. if strings '313.2' or '414.2' exist in df['ICD'] and Date1>Date2 return 2
3. Else return 0
Given that Date1 and Date2 are in date-time format and my data frame is coded as df, I have the following code:
df['NewColumn'] = np.where(df.ICD.str.contains('313.2|414.2').astype(int), 1, np.where(((df.ICD.str.contains('313.2|414.2').astype(int))&(df['Date1']>df['Date2'])), 2, 0))
However this code only returns a series with 1's and 0's and does not include a 2. How else can I complete this task?
You almost had it; you needed to pass a raw string (prepend with r) to contains so it treats the pattern as a regex:
In [115]:
df['NewColumn'] = np.where(df.ICD.str.contains(r'313.2|414.2').astype(int), 1, np.where(((df.ICD.str.contains(r'313.2|414.2').astype(int))&(df['Date1']>df['Date2'])), 2, 0))
df
Out[115]:
PatientID Date1 Date2 ICD NewColumn
0 1234 2010-12-14 2010-12-12 313.2,414.2,228.1 1
1 3213 2010-08-02 2012-09-05 232.1,221.0 0
You get 1 returned because the outer np.where stops at the first condition, which is met; if you want 2 returned, you need to rearrange the order of evaluation:
In [122]:
df['NewColumn'] = np.where((df.ICD.str.contains(r'313.2|414.2').astype(int)) & (df['Date1'] > df['Date2']), 2,
                  np.where(df.ICD.str.contains(r'313.2|414.2').astype(int), 1, 0))
df
Out[122]:
PatientID Date1 Date2 ICD NewColumn
0 1234 2010-12-14 2010-12-12 313.2,414.2,228.1 2
1 3213 2010-08-02 2012-09-05 232.1,221.0 0
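As an aside (not part of the original answer), np.select expresses the same priority-ordered conditions without nesting; a minimal sketch assuming the same df columns as above:

conditions = [
    df.ICD.str.contains(r'313.2|414.2') & (df['Date1'] > df['Date2']),
    df.ICD.str.contains(r'313.2|414.2'),
]
# Conditions are checked in order, so the stricter Date1 > Date2 case wins first
df['NewColumn'] = np.select(conditions, [2, 1], default=0)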
It is much easier to use the pandas functionality itself. Using numpy to do something that pandas already does is a good way to get unexpected behaviour.
Assuming you want to check for a cell value equal to 313.2 only (so 2313.25 returns False):
df['ICD'].astype(str) == '313.2'
returns a Series object of True or False next to each index entry. So:
boolean = (df['ICD'].astype(str) == '313.2') | (df['ICD'].astype(str) == '414.2')
if boolean.any():
    # do something
    return 1

boolean2 = ((df['ICD'].astype(str) == '313.2') | (df['ICD'].astype(str) == '414.2')) & (df['Date1'] > df['Date2'])
if boolean2.any():
    return 2

etc.
Pandas also has the function isin() which can simplify things further.
The docs are here: http://pandas.pydata.org/pandas-docs/stable/indexing.html
Also, you do not get 2 returned because of the order in which you evaluate the conditional statements. In any circumstance where condition 2 evaluates as true, condition 1 must evaluate as true also, so since you test condition 1 first, it always returns 1 or falls through.
In short, you need to test condition 2 first, as there is no circumstance where 1 can be false and 2 can be true.
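Putting this together, a minimal sketch of the pandas-native, per-row version (assuming df has a string ICD column and datetime Date1/Date2 columns):

has_code = df['ICD'].str.contains(r'313.2|414.2')
df['NewColumn'] = 0
# Later assignments overwrite earlier ones, so assign the broad case first
df.loc[has_code, 'NewColumn'] = 1
df.loc[has_code & (df['Date1'] > df['Date2']), 'NewColumn'] = 2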
Related
Let's say I have a Pandas dataframe df:
start_time Event
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
I want to set the value of the Event column to -1 when the corresponding start_time lies between two values, so I define this function:
def test(time):
    if (time['start_time'] >= 5) and (time['start_time'] <= 8):
        return -1
    else:
        return time
To apply this to the event column, I do the following:
df[['Event']] = df[['Event']].apply(test,axis=1)
which yields this error: KeyError: ('start_time', 'occurred at index 0')
Why is this happening? Should be a simple fix.
Simply do:
df['Event'] = df.apply(test, axis=1)['Event']
The function that you are passing to .apply() uses the start_time field of the input argument (in the conditional checks if (time['start_time'] >= 5) and (time['start_time'] <= 8)). So it should be applied to a DataFrame or Series that has a start_time column.
However, before you call apply you are first selecting df[['Event']], which returns a DataFrame containing only the Event column. So df[['Event']].apply() will apply the function to each row of that one-column DataFrame. When the function reaches the expression time['start_time'], it looks for a start_time entry in that row, can't find it (because only the Event column was kept), and raises a KeyError.
The solution is to pass a DataFrame or a Series that has a start_time column in it. In your case you want to apply the function to the entire DataFrame so replace df[['Event']] with the whole DataFrame df.
df = df.apply(test, axis=1)
and change your function to modify the Event column instead of returning a value: set time['Event'] = -1 when the condition holds, and return the (possibly modified) row in all cases, as sketched below.
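A minimal sketch of the corrected function (note the row still has to be returned in every case so that apply can rebuild the frame):

def test(time):
    # Flag events whose start_time falls in the window [5, 8]
    if (time['start_time'] >= 5) and (time['start_time'] <= 8):
        time['Event'] = -1
    return time

df = df.apply(test, axis=1)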
I am trying to find the number of times a certain value appears in one column.
I have made the dataframe with data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to find the number of times something appears in a column. How is this done?
I thought it was the below, where I am looking in the education column and counting the number of times ? occurs.
The code below shows that I am trying to find the number of times 9th appears; the error is what I get when I run it.
Code
missing2 = df.education.value_counts()['9th']
print(missing2)
Error
KeyError: '9th'
You can create a subset of the data with your condition and then use shape or len:
print df
col1 education
0 a 9th
1 b 9th
2 c 8th
print df.education == '9th'
0 True
1 True
2 False
Name: education, dtype: bool
print df[df.education == '9th']
col1 education
0 a 9th
1 b 9th
print df[df.education == '9th'].shape[0]
2
print len(df[df['education'] == '9th'])
2
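Incidentally, the KeyError in the question occurs when the label is missing from the value_counts() result; Series.get returns a default instead of raising (a small defensive sketch):

# returns 0 instead of raising KeyError when '9th' is absent
missing2 = df['education'].value_counts().get('9th', 0)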
Performance is interesting; the fastest solution is to compare the numpy array and sum:
Code:
import string
import numpy as np
import pandas as pd
import perfplot

np.random.seed(123)

def shape(df):
    return df[df.education == 'a'].shape[0]

def len_df(df):
    return len(df[df['education'] == 'a'])

def query_count(df):
    return df.query('education == "a"').education.count()

def sum_mask(df):
    return (df.education == 'a').sum()

def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()

def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
A couple of ways, using count or sum:
In [338]: df
Out[338]:
col1 education
0 a 9th
1 b 9th
2 c 8th
In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2
In [336]: (df.education == '9th').sum()
Out[336]: 2
In [337]: df.query('education == "9th"').education.count()
Out[337]: 2
An elegant way to count the occurrences of '?' (or any symbol) in any column is to use the built-in isin function of a DataFrame object.
Suppose that we have loaded the 'Automobile' dataset into the df object.
We do not know which columns contain missing values (the '?' symbol), so let's do:
df.isin(['?']).sum(axis=0)
The official documentation for DataFrame.isin(values) says it returns a boolean DataFrame showing whether each element in the DataFrame is contained in values.
Note that isin accepts an iterable as input, thus we need to pass a list containing the target symbol to this function. df.isin(['?']) will return a boolean dataframe as follows.
symboling normalized-losses make fuel-type aspiration ...
0 False True False False False
1 False True False False False
2 False True False False False
3 False False False False False
4 False False False False False
5 False True False False False
...
To count the number of occurrences of the target symbol in each column, take the sum over all rows of the above dataframe by passing axis=0.
The final (truncated) result shows what we expect:
symboling 0
normalized-losses 41
...
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
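As a small extension (not in the original answer), summing once more gives the grand total of '?' cells across the whole frame:

# total number of '?' cells in the entire frame
df.isin(['?']).sum(axis=0).sum()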
Try this:
(df['education'] == '9th').sum()
easy but not efficient:
list(df.education).count('9th')
Simple example to count occurrences (unique values) in a column in Pandas data frame:
import pandas as pd
# URL to .csv file
data_url = 'https://yoursite.com/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
# pandas count distinct values in column
df['education'].value_counts()
Outputs:
Education 47516
9th 41164
8th 25510
7th 25198
6th 25047
...
3rd 2
2nd 2
1st 2
Name: education, Length: 190, dtype: int64
For finding the count of a specific value in a column, you can use attribute access on the value_counts() result (use whichever method you prefer):
df.col_name.value_counts().Value_you_are_looking_for
Take the Titanic dataset as an example:
df.Sex.value_counts().male
This gives a count of all males on the ship.
However, if the value you are counting is numeric, you cannot use the method above, because attribute access with a number fails, so you need a second method.
The second method is to filter the frame and count the matching rows:
# an example of counting matching rows in a data frame
len(df[(df['Survived'] == 1) & (df['Sex'] == 'male')])
This is not as efficient as value_counts(), but it surely helps if you want to count rows of a data frame that satisfy several conditions.
Hope this helps.
EDIT --
If you want to look for a value with a space in it (attribute access won't work there), you may use bracket indexing instead:
df.country.value_counts()['united states']
I believe this should solve the problem.
I think this could be an easier solution. Suppose you have the following data frame.
DATE LANG POSTS
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 2
2008-08-01 c 85
2008-08-01 python 11
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 62
2008-08-01 c 85
2008-08-01 python 14
You can find the per-LANG sum like this:
df.groupby('LANG').sum()
and you will have the sum of POSTS for each individual language.
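A side note (this is an assumption about the goal, not part of the original answer): if you want the number of occurrences of each LANG rather than the sum of POSTS, size() or value_counts() is the direct tool:

# number of rows per language
df.groupby('LANG').size()
# or equivalently
df['LANG'].value_counts()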
How would I delete all rows from a dataframe that come after a certain fulfilled condition? As an example, I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that deletes the last 4 rows and keeps the upper 2, since in row 2 the condition x == xEnd and y == yEnd is fulfilled?
EDITED: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending, and I would still like to get the upper rows.
To slice your dataframe until the first time a condition across 2 series is satisfied, first calculate the required index and then slice via iloc.
You can calculate the index via set_index, isin and np.ndarray.argmax:
idx = df.set_index(['x', 'y']).isin((xEnd, yEnd)).values.argmax()
res = df.iloc[:idx+1]
print(res)
x y id
0 1 1 0
1 1 2 1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
Not 100% sure I understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
id x y
0 0 1 1
1 1 1 2
If x and y are not strictly increasing and you want what is above the first line that satisfies the condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]
df = df.iloc[0:yEnd, :]
Select just the first two rows and keep all columns, and put the result in a new dataframe.
Or you can reuse the same variable name, as here.
Question:
I would like to gain a better understanding of the Pandas DataFrame.query method and what the following expression represents:
match = dfDays.query('index > @x.name & price >= @x.target')
What does @x.name represent?
I understand what the resulting output is for this code (a new column with pandas.tslib.Timestamp data) but don't have a clear understanding of the expression used to get this end result.
Data:
From here:
Vectorised way to query date and price data
np.random.seed(seed=1)
rng = pd.date_range('1/1/2000', '2000-07-31',freq='D')
weeks = np.random.uniform(low=1.03, high=3, size=(len(rng),))
ts2 = pd.Series(weeks, index=rng)
dfDays = pd.DataFrame({'price':ts2})
dfWeeks = dfDays.resample('1W-Mon').first()
dfWeeks['target'] = (dfWeeks['price'] + .5).round(2)
def find_match(x):
    match = dfDays.query('index > @x.name & price >= @x.target')
    if not match.empty:
        return match.index[0]
dfWeeks.assign(target_hit=dfWeeks.apply(find_match, 1))
@x.name - the @ helps .query() to understand that x is an external object (it doesn't belong to the DataFrame on which the query() method was called). In the demo below x is a DataFrame. It could be a scalar value as well.
I hope this small demonstration will help you to understand it:
In [79]: d1
Out[79]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
In [80]: d2
Out[80]:
a x
0 1 10
1 7 11
In [81]: d1.query("a in @d2.a")
Out[81]:
a b c
0 1 2 3
2 7 8 9
In [82]: d1.query("c < @d2.a")
Out[82]:
a b c
1 4 5 6
Scalar x:
In [83]: x = 9
In [84]: d1.query("c == @x")
Out[84]:
a b c
2 7 8 9
Everything @MaxU said is perfect!
I wanted to add some context to the specific problem that this was applied to.
find_match
This is a helper function used in dfWeeks.apply. Two things to note:
find_match takes a single argument x. This will be a single row of dfWeeks.
Each row is a pd.Series object and each row will be passed through this function. This is the nature of using apply.
When apply passes this row to the helper function, the row has a name attribute that is equal to the index value for that row in the dataframe. In this case, I know that the index value is a pd.Timestamp and I'll use it to do the comparing I need to do.
find_match references dfDays which is outside the scope of find_match itself.
I didn't have to use query... I like using query. It is my opinion that it makes some code prettier. The following function, as provided by the OP, could've been written differently
def find_match(x):
    """Original"""
    match = dfDays.query('index > @x.name & price >= @x.target')
    if not match.empty:
        return match.index[0]

dfWeeks.assign(target_hit=dfWeeks.apply(find_match, 1))
find_match_alt
Or we could've done this, which may help to explain what the query string is doing above
def find_match_alt(x):
    """Alternative to OP's"""
    date_is_afterwards = dfDays.index > x.name
    price_target_is_met = dfDays.price >= x.target
    both_are_true = price_target_is_met & date_is_afterwards
    if both_are_true.any():
        return dfDays[both_are_true].index[0]

dfWeeks.assign(target_hit=dfWeeks.apply(find_match_alt, 1))
Comparing these two functions should give good perspective.
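As a quick sanity check (a sketch, assuming the dfWeeks and dfDays objects built above), the two helpers should agree:

res_query = dfWeeks.apply(find_match, 1)
res_alt = dfWeeks.apply(find_match_alt, 1)
# Both approaches should produce identical target_hit values
assert res_query.equals(res_alt)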
I am trying to read a csv file of horse track information.
I am attempting to code for the post positions (col 3) in race 1 the max value for the field qpts (col 210). I have spent days researching this and can find no clear answer on the web or YouTube.
When I run the code below, I get "The truth value of a Series is ambiguous....."
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
df = pd.read_csv('track.csv', header=None, na_values=['.'])
index = list(range(0,200,1))
columns = list(range(0,1484,1))
if df.ix[2] == 1:
    qpts = (df.max([210]))
    print (qpts)
The problem is with if df.ix[2] == 1. The expression df.ix[2] == 1 will return a pd.Series of truth values. By putting an if in front, you are attempting to evaluate a whole series of values as either True or False, which is what throws the error.
There are several ways to produce a series where the value is 210 and the indices are those where df.ix[2] == 1.
This is one way
pd.Series(210, df.index[df.ix[2] == 1])
Here df.ix[2] == 1 is going to return a Series. You need a function such as .any() or .all() to collapse the Series into a single value that you can use in a truth statement. For example:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
df = pd.read_csv('track.csv', header=None, na_values=['.'])
index = list(range(0,200,1))
columns = list(range(0,1484,1))
if (df.ix[2] == 1).any():
    qpts = (df.max([210]))
    print (qpts)
In the case above we are checking to see whether any of the Series elements are equal to 1. If so, the if statement body will run. If we did not do this, we could have a situation as follows:
print(df)
Out[1]:
1 3
2 7
3 1
4 5
5 6
print(df.ix[2]== 1)
Out[2]:
1 False
2 False
3 True
4 False
5 False
The Series contains both True and False values, so its overall truth value is ambiguous, which is exactly what the error message is complaining about.
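For reference, a minimal sketch (reusing the comparison from the example above) of the usual ways to collapse a boolean Series into a single truth value:

mask = df.ix[2] == 1   # element-wise comparison -> boolean Series
mask.any()             # True if at least one element is True
mask.all()             # True only if every element is True
mask.sum()             # count of True elements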