I have two datasets, as below.
CURRENT YEAR dataset:
Year  ROA  Borrowings
2020  1.2  23681
PREVIOUS YEAR dataset:
Year  ROA  Borrowings
2019  2.3  24682
So there are two datasets for different years. I don't want to combine them.
I am checking boolean logic as below:
for key6, data6 in bank.items():
    cy = data6[data6['index']=='2020']
    py = data6[data6['index']=='2019']
    ROA_FS = cy['ROA'].apply(lambda x:1 if x>0 else 0)
    CFO_FS = cy['CashfromOperatingActivity'].apply(lambda x:1 if x>0 else 0)
    C_ROA_FS = (cy['ROA']>py['ROA']).apply(lambda x:1 if x==True else 0)
The first two flag lines in the for loop (ROA_FS and CFO_FS) work perfectly, as their output is an integer. But in the third one I am comparing two different DataFrames. I then tried converting the ROA columns to integer and float, as follows:
(int(cy['ROA'])>int(py['ROA'])).apply(lambda x: 1 if x=='True' else 0)
When I ran it, I got the following error:
'bool' object has no attribute 'apply'
Please note that I am comparing the ROA of different years.
Expected output,
change_in_ROA = a['ROA']>b['ROA'], if true print 1 else 0. So output should be 0/1.
Thanks
If both DataFrames have the same number of rows (and the same index), use:
a['new'] = np.where(a['ROA'].astype(int) > b['ROA'].astype(int), 1, 0)
If they do not have the same rows, first join the DataFrames together, e.g. on a column col (which column depends on the data and on what you need), and then test. After the merge, the two ROA columns get the default suffixes _x and _y:
df = a.merge(b, on='col')
df['new'] = np.where(df['ROA_x'].astype(int) > df['ROA_y'].astype(int), 1, 0)
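As a minimal sketch of why the question's attempt failed and how the comparison can be done without combining the frames: the frames below are hypothetical stand-ins built from the question's sample values, and the names change_in_roa and change_flags are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical single-row frames mirroring the question's sample data.
cy = pd.DataFrame({"Year": [2020], "ROA": [1.2], "Borrowings": [23681]})
py = pd.DataFrame({"Year": [2019], "ROA": [2.3], "Borrowings": [24682]})

# Comparing two scalars gives a plain Python bool, which has no .apply
# method; convert the bool itself to 0/1 instead.
change_in_roa = int(cy["ROA"].iloc[0] > py["ROA"].iloc[0])

# For frames with several rows, compare by position through NumPy arrays so
# that differing index labels do not matter (assumes equal length and order).
change_flags = np.where(cy["ROA"].to_numpy() > py["ROA"].to_numpy(), 1, 0)

print(change_in_roa)  # 0
print(change_flags)   # [0]
```

Comparing the float values directly, rather than casting to int first, avoids truncating 1.2 and 2.3 to 1 and 2 before the comparison.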
I've been trying to print out a Pandas dataframe to html and have specific entire rows highlighted if the value of one specific column's value for that row is over a threshold. I've looked through the Pandas Styler Slicing and tried to vary the highlight_max function for such a use, but seem to be failing miserably; if I try, say, to replace the is_max with a check for whether a given row's value is above said threshold (e.g., something like
is_x = df['column_name'] >= threshold
), it isn't apparent how to properly pass such a thing or what to return.
I've also tried to simply define it elsewhere using df.loc, but that hasn't worked too well either.
Another concern also came up: If I drop that column (currently the criterion) afterwards, will the styling still hold? I am wondering if a df.loc would prevent such a thing from being a problem.
This solution allows for you to pass a column label or a list of column labels to highlight the entire row if that value in the column(s) exceeds the threshold.
import pandas as pd
import numpy as np
np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
               axis=1)
df.iloc[0, 2] = np.nan
def highlight_greaterthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] >= threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]
df.style.apply(highlight_greaterthan, threshold=1.0, column=['C', 'B'], axis=1)
Or for one column
df.style.apply(highlight_greaterthan, threshold=1.0, column='E', axis=1)
Here is a simpler approach:
Assume you have a 100 x 10 dataframe, df. Also assume you want to highlight all the rows corresponding to a column, say "duration", greater than 5.
You first need to define a function that highlights the cells. The real trick is that you need to return a row, not a single cell. For example:
def highlight(s):
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)
Note that the returned value must be a list with one entry per column (10 here). This is the key part.
Now you can apply this to the dataframe style as:
df.style.apply(highlight, axis=1)
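A minimal sketch of the same idea with the column and threshold passed as parameters, assuming a small hypothetical frame (the task column and the default arguments are illustrative, not from the original answer). Calling the function on single rows shows exactly what the Styler gets back for each row.

```python
import pandas as pd

# Hypothetical data; "duration" is the column named in the answer above.
df = pd.DataFrame({"task": ["a", "b", "c"], "duration": [3, 8, 5]})

def highlight(row, column="duration", threshold=5, color="yellow"):
    """Return one CSS string per cell so the whole row is styled."""
    css = f"background-color: {color}" if row[column] > threshold else ""
    return [css] * len(row)

# Each call receives one row (a Series) and returns a list of len(row).
print(highlight(df.iloc[0]))  # ['', '']
print(highlight(df.iloc[1]))  # ['background-color: yellow', 'background-color: yellow']

# With Jinja2 installed, the same function plugs into the Styler; extra
# keyword arguments are forwarded to the function:
# styled = df.style.apply(highlight, axis=1, threshold=5)
```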
Assume you have the following dataframe and you want to highlight the rows where id is greater than 3 to red
id char date
0 0 s 2022-01-01
1 1 t 2022-02-01
2 2 y 2022-03-01
3 3 l 2022-04-01
4 4 e 2022-05-01
5 5 r 2022-06-01
You can try Styler.set_properties with pandas.IndexSlice
# Subset your original dataframe with condition
df_ = df[df['id'].gt(3)]
# Pass the subset dataframe index and column to pd.IndexSlice
slice_ = pd.IndexSlice[df_.index, df_.columns]
s = df.style.set_properties(**{'background-color': 'red'}, subset=slice_)
s.to_html('test.html')
You can also try Styler.apply with axis=None which passes the whole dataframe.
def styler(df):
    color = 'background-color: {}'.format
    mask = pd.concat([df['id'].gt(3)] * df.shape[1], axis=1)
    style = np.where(mask, color('red'), color('green'))
    return style
s = df.style.apply(styler, axis=None)
I am trying to divide a combined number by 2 if both of the inputs are more than 0
data = {'test':[1,1,0],'test2':[0, 1, 0,]}
df = pd.DataFrame(data)
df['combined'] = df['test'] +df['test2']
df
I am looking for a way (probably an if-statement) to divide df['combined'] by 2 if both test and test2 have a value of 1.
I've tried this, however it gives an error:
if ((df['test']> 1) and (df['test2']>1)):
    df['combined'] / 2
else:
    df['combined']
What is the best way to do this?
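There are two problems of the same kind in your if statement. First, you should know that the result of (df['test'] > 1) is a pandas Series object:
0    False
1    False
2    False
Name: test, dtype: bool
Neither if nor and can handle a pandas Series, because a Series has no single truth value; that is why you see the error.
Instead, use & to combine the conditions element-wise and np.where() to choose values by the resulting mask. Note that the stated goal is "more than 0", so the comparison should be > 0, and the halved value belongs in the first branch (the one used where the mask is True):
mask = (df['test'] > 0) & (df['test2'] > 0)
df['combined'] = np.where(mask, df['combined'] / 2, df['combined'])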
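To make the failure mode concrete, here is a small sketch on the question's own data. It first triggers the error that if/and produce on a Series, then applies the element-wise alternative; the name both_positive is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"test": [1, 1, 0], "test2": [0, 1, 0]})
df["combined"] = df["test"] + df["test2"]

# A Series has no single truth value, so `if` / `and` raise ValueError.
try:
    if (df["test"] > 0) and (df["test2"] > 0):
        pass
except ValueError as err:
    print(type(err).__name__)  # ValueError

# Element-wise alternative: & builds a boolean mask, np.where picks per row.
both_positive = (df["test"] > 0) & (df["test2"] > 0)
df["combined"] = np.where(both_positive, df["combined"] / 2, df["combined"])
print(df["combined"].tolist())  # [1.0, 1.0, 0.0]
```

Only the middle row has both inputs above 0, so only its combined value (2) is halved.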
I have this df:
id  started                  completed
1   2022-02-20 15:00:10.157  2022-02-20 15:05:10.044
and I have this other one, data:
timestamp                x   y
2022-02-20 14:59:47.329  16  0.0
2022-02-20 15:01:10.347  16  0.2
2022-02-20 15:06:35.362  16  0.3
What I want to do is filter the rows in data where timestamp > started and timestamp < completed (which will leave me with the middle row only).
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] < '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why, and where I am making the mistake? Do I have to convert df['started'] to a string or something?
You have two issues here.
The first is the use of and. If you want to combine multiple masks (boolean arrays) element-wise with "and" logic, use & instead of and.
The second is the use of df['started'] and df['completed'] in the comparison. If you use a debugger, you can see that
df['started'] is a Series with its own index, and so is data['timestamp']. The rule for comparing two pandas objects is described in the pandas documentation. Essentially, you can only compare two Series that have identical index labels. But here df has only one row, while data has several. Try extracting the value from df as a scalar rather than a Series, using loc for instance.
For instance:
Using masks
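from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 10
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
# One-row frame holding the bounds
df2 = pd.DataFrame(
    {
        "max_val" : [0],
        "min_val" : [-0.5]
    }
)
# loc[0, ...] extracts scalars, which broadcast over every row of df
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]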
Out[95]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
Using query
df2 = pd.DataFrame(
    {
        "max_val" : np.repeat(0, n),
        "min_val" : np.repeat(-0.5, n)
    }
)
df.query("y < @df2.max_val and y > @df2.min_val")
Out[124]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
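Applied to the frames from the question, the scalar-extraction approach might look like the sketch below. The frames are rebuilt from the sample rows shown in the question, and the variable names started and completed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1],
    "started": pd.to_datetime(["2022-02-20 15:00:10.157"]),
    "completed": pd.to_datetime(["2022-02-20 15:05:10.044"]),
})
data = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-02-20 14:59:47.329",
        "2022-02-20 15:01:10.347",
        "2022-02-20 15:06:35.362",
    ]),
    "x": [16, 16, 16],
    "y": [0.0, 0.2, 0.3],
})

# Pull the bounds out as scalar Timestamps so they broadcast over every row.
started = df.loc[0, "started"]
completed = df.loc[0, "completed"]

# Combine the two masks element-wise with & (not Python's `and`).
res = data[(data["timestamp"] > started) & (data["timestamp"] < completed)]
print(res["y"].tolist())  # [0.2]
```

Because the bounds are scalars, no index alignment between df and data is involved, so the "identically-labeled" error cannot occur.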
To make the comparison, pandas needs both Series to be identically labeled, which in practice means the same number of rows in both DataFrames: the first row of the data['timestamp'] Series is compared with the first row of the df['started'] Series, and so on.
The error arises because the second row of the data['timestamp'] Series has nothing to compare with.
To make the code work, you can add a row to df for every row of data to match against. That way, pandas returns a Boolean result for every row, and you can combine the two masks with a logical AND to keep the rows where both are True.
pandas does not accept Python's and operator here, so you need the & operator. Once df has a matching row for every row of data, your code will look like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
I am writing a custom error message when 2 Pandas series are not equal and want to use '<' to point at the differences.
Here's the workflow for a failed equality:
Convert both lists to pandas Series: pd.Series([list])
Side-by-side comparison in a DataFrame: table = pd.concat([list1, list2], axis=1)
Add column and index names: table.columns = ['...', '...'], table.index = ['...', '...']
Current output:
|Yours|Actual|
|1|1|
|2|2|
|4|3|
Desired output:
|Yours|Actual|-|
|1|1||
|2|2||
|4|3|<|
The naive solution is iterating through each list index and if it's not equal, appending '<' to another list then putting this list into pd.concat() but I am looking for a method using Pandas. For example,
error_series = '<' if (abs(yours - actual) >= 1).all(axis=None) else ''
Ideally it would append '<' to a list if the difference between the results is greater than the margin of error of 1, and otherwise append nothing.
Note: Removed tables due to StackOverflow being picky and not letting me post my question
You can create the DF and give index and column names in one line:
import pandas as pd
list1 = [1,2,4]
list2 = [1,2,10]
df = pd.DataFrame(zip(list1, list2), columns=['Yours', 'Actual'])
Create a boolean mask to find the rows whose difference is too large:
margin_of_error = 1
mask = df.diff(axis=1)['Actual'].abs() > margin_of_error
Add a column to the DF and map the mask values to the markers you want:
df['too_different'] = mask.map({True: '<', False: ''})
Output:
   Yours  Actual too_different
0      1       1
1      2       2
2      4      10             <
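For the exact data and threshold in the question (a difference of at least 1 counts as a mismatch), a vectorized sketch with np.where could look like this. The "-" column name is taken from the desired output in the question; everything else is illustrative.

```python
import numpy as np
import pandas as pd

yours = pd.Series([1, 2, 4])
actual = pd.Series([1, 2, 3])
margin_of_error = 1  # threshold stated in the question

# Side-by-side table, then a marker column computed in one vectorized step.
table = pd.concat([yours, actual], axis=1)
table.columns = ["Yours", "Actual"]
too_far = (table["Yours"] - table["Actual"]).abs() >= margin_of_error
table["-"] = np.where(too_far, "<", "")

print(table["-"].tolist())  # ['', '', '<']
```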
Or you can do something like this:
df = df.assign(diffr=df.apply(lambda x: '<'
                              if (abs(x['yours'] - x['actual']) >= 1)
                              else '', axis=1))
print(df)
'''
   yours  actual diffr
0      1       1
1      2       2
2      4       3     <
'''
After searching several forums on similar questions, it appears that one way to iterate a conditional statement quickly is using Numpy's np.where() function on Pandas. I am having trouble with the following task:
I have a dataset that looks like several rows of:
PatientID Date1 Date2 ICD
1234 12/14/10 12/12/10 313.2, 414.2, 228.1
3213 8/2/10 9/5/12 232.1, 221.0
I am trying to create a conditional statement such that:
1. if strings '313.2' or '414.2' exist in df['ICD'] return 1
2. if strings '313.2' or '414.2' exist in df['ICD'] and Date1>Date2 return 2
3. Else return 0
Given that Date1 and Date2 are in date-time format and my data frame is coded as df, I have the following code:
df['NewColumn'] = np.where(df.ICD.str.contains('313.2|414.2').astype(int), 1, np.where(((df.ICD.str.contains('313.2|414.2').astype(int))&(df['Date1']>df['Date2'])), 2, 0))
However this code only returns a series with 1's and 0's and does not include a 2. How else can I complete this task?
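You almost had it. It is good practice to pass a raw string (prefix it with r) to contains, which treats the pattern as a regex by default; note also that an unescaped . matches any character, so r'313\.2|414\.2' would be the stricter pattern: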
In [115]:
df['NewColumn'] = np.where(df.ICD.str.contains(r'313.2|414.2').astype(int), 1, np.where(((df.ICD.str.contains(r'313.2|414.2').astype(int))&(df['Date1']>df['Date2'])), 2, 0))
df
Out[115]:
PatientID Date1 Date2 ICD NewColumn
0 1234 2010-12-14 2010-12-12 313.2,414.2,228.1 1
1 3213 2010-08-02 2012-09-05 232.1,221.0 0
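You get 1 because the outer np.where tests the first condition first, and every row that meets it is assigned 1 without ever reaching the nested check. If you want 2 returned, you need to rearrange the order of evaluation so that the more specific condition is tested first: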
In [122]:
df['NewColumn'] = np.where( (df.ICD.str.contains(r'313.2|414.2').astype(int)) & ( df['Date1'] > df['Date2'] ), 2 ,
np.where( df.ICD.str.contains(r'313.2|414.2').astype(int), 1, 0 ) )
df
Out[122]:
PatientID Date1 Date2 ICD NewColumn
0 1234 2010-12-14 2010-12-12 313.2,414.2,228.1 2
1 3213 2010-08-02 2012-09-05 232.1,221.0 0
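As an alternative sketch to nesting np.where calls, np.select takes a list of conditions and uses the first one that matches, which makes the required ordering explicit. The frame below is rebuilt from the question's sample rows; has_code and later are illustrative names, and the escaped pattern is an assumption about the intended exact codes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PatientID": [1234, 3213],
    "Date1": pd.to_datetime(["12/14/10", "8/2/10"], format="%m/%d/%y"),
    "Date2": pd.to_datetime(["12/12/10", "9/5/12"], format="%m/%d/%y"),
    "ICD": ["313.2, 414.2, 228.1", "232.1, 221.0"],
})

# Escape the dots so that "." matches a literal period only.
has_code = df["ICD"].str.contains(r"313\.2|414\.2")
later = df["Date1"] > df["Date2"]

# Conditions are checked in order, so the more specific one comes first.
df["NewColumn"] = np.select([has_code & later, has_code], [2, 1], default=0)
print(df["NewColumn"].tolist())  # [2, 0]
```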
It is much easier to use the pandas functionality itself. Using numpy to do something that pandas already does is a good way to get unexpected behaviour.
Assuming you want to check for a cell value containing 313.2 only (so 2313.25 returns False).
df['ICD'].astype(str) == '313.2'
returns a Series Object of True or False next to each index entry.
So, as a sketch (wrapped in a function so that return is valid):
def classify(df):
    is_code = (df['ICD'].astype(str) == '313.2') | (df['ICD'].astype(str) == '414.2')
    # Test the more specific condition first
    if (is_code & (df['Date1'] > df['Date2'])).any():
        return 2
    if is_code.any():
        return 1
    return 0
Pandas also has the function isin() which can simplify things further.
The docs are here: http://pandas.pydata.org/pandas-docs/stable/indexing.html
Also, you never get 2 because of the order in which you evaluate the conditional statements. In any circumstance where condition 2 evaluates as true, condition 1 must also evaluate as true. So because you test condition 1 first, it always returns 1 or falls through.
In short, you need to test condition 2 first, as there is no circumstance where condition 1 can be false and condition 2 can be true.
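The isin() suggestion can be sketched per row as follows, assuming the ICD cells hold comma-separated codes as in the question. The third value is a hypothetical cell added to show that a substring such as 2313.25 is not matched; target_codes and has_code are illustrative names.

```python
import pandas as pd

# First two values come from the question; the third is hypothetical.
icd = pd.Series(["313.2, 414.2, 228.1", "232.1, 221.0", "2313.25"])
target_codes = ["313.2", "414.2"]

# Split each cell into individual codes, one code per row (index repeated).
codes = icd.str.split(",").explode().str.strip()

# Exact membership test per code, then collapse back to one flag per cell.
has_code = codes.isin(target_codes).groupby(level=0).any()
print(has_code.tolist())  # [True, False, False]
```

The resulting boolean Series lines up with the original rows, so it can be combined with the date comparison using & in the same way as the other masks.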