I'd like to add a new column to a Pandas DataFrame, populated with True or False based on the other values in each row. My approach was to apply a function that checks boolean conditions across each row of the dataframe and fills the new column with either True or False.
This is the dataframe:
import pandas as pd

l = {'DayTime': ['2018-03-01', '2018-03-02', '2018-03-03'],
     'Pressure': [9, 10.5, 10.5], 'Feed': [9, 10.5, 11], 'Temp': [9, 10.5, 11]}
df1 = pd.DataFrame(l)
This is the function I wrote:
def ops_on(row):
    return row[('Feed' > 10)
               & ('Pressure' > 10)
               & ('Temp' > 10)
               ]
The function ops_on is used to create the new column ['ops_on']:
df1['ops_on'] = df1.apply(ops_on, axis='columns')
Unfortunately, I get this error message:
TypeError: ("'>' not supported between instances of 'str' and 'int'", 'occurred at index 0')
Thanks in advance for any help.
You should work column-wise (vectorised, efficient) rather than row-wise (inefficient, Python loop):
df1['ops_on'] = (df1['Feed'] > 10) & (df1['Pressure'] > 10) & (df1['Temp'] > 10)
The & ("and") operator is applied to Boolean series element-wise. An arbitrary number of such conditions can be chained.
Alternatively, for the special case where you are performing the same comparison multiple times:
df1['ops_on'] = df1[['Feed', 'Pressure', 'Temp']].gt(10).all(axis=1)
The error in your current setup comes from comparing the string literal 'Feed' with the integer 10 instead of indexing the row first. If you want to keep the apply approach, just rewrite your function like this:
def ops_on(row):
    return (row['Feed'] > 10) & (row['Pressure'] > 10) & (row['Temp'] > 10)
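For reference, with the sample frame from the question both approaches give the same result; a quick check with the vectorised version:
import pandas as pd

l = {'DayTime': ['2018-03-01', '2018-03-02', '2018-03-03'],
     'Pressure': [9, 10.5, 10.5], 'Feed': [9, 10.5, 11], 'Temp': [9, 10.5, 11]}
df1 = pd.DataFrame(l)

df1['ops_on'] = (df1['Feed'] > 10) & (df1['Pressure'] > 10) & (df1['Temp'] > 10)
print(df1['ops_on'].tolist())  # [False, True, True]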
Related
Let's say I have 20 columns like this:
df.columns = ['col1','col2','col3', ..., 'col20']
I am trying to sum all these columns and create a new column whose value is 1 if the sum of all the above columns is > 0, and 0 otherwise. I am currently doing it in two steps, as shown here:
df = df.withColumn("temp_col", col("col1")+col("col2")+...+col("col20"))
df = df.withColumn("new_col_2", when(col("temp_col") > 0, 1).otherwise(0))
Is there any way to do this in one step, and in a cleaner way, so I don't need to type out all these column names?
I tried something like this, but I got an error:
df.na.fill(0).withColumn("new_col" ,reduce(add, [col(col(f'{x}') for x in range(0,20))]))
An error was encountered:
name 'add' is not defined
Traceback (most recent call last):
NameError: name 'add' is not defined
You can do the following:
cols = ['col' + str(i) for i in range(1, 21)] # ['col1', 'col2',..., 'col20']
df['new_col'] = df[cols].sum(axis=1) > 0
If you want 1/0 instead of True/False, you can use .astype(int):
df['new_col'] = (df[cols].sum(axis=1) > 0).astype(int)
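Since the question itself uses PySpark, a roughly equivalent one-step version there could look like this (a sketch assuming the same cols list; Python's built-in sum works because Spark Column objects support +):
from pyspark.sql import functions as F

cols = ['col' + str(i) for i in range(1, 21)]
df = df.withColumn("new_col_2", F.when(sum(F.col(c) for c in cols) > 0, 1).otherwise(0))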
From a Spark point of view, everything is fine using the two withColumn calls. If you are concerned about performance due to the extra column, let Spark's Catalyst optimizer deal with that.
If you already have a Python list of the columns to be summed up, you can build a SQL expression from it to save some typing:
from pyspark.sql import functions as F

cols = df.columns  # or cols = [f'col{c}' for c in range(1, 21)]
sum_expr = "+".join(cols)

df.withColumn("temp_col", F.expr(sum_expr)) \
  .withColumn("new_col_2", F.when(F.col("temp_col") > 0, 1).otherwise(0)) \
  .show()
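If you really want a single step, the same SQL expression can feed when directly; a minimal sketch reusing the sum_expr built above:
df.withColumn("new_col_2", F.when(F.expr(sum_expr) > 0, 1).otherwise(0)).show()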
A solution along the lines of your second attempt (but, imho, over-engineered) would be to calculate the sum with a combination of array and aggregate:
df.withColumn("temp_col", F.aggregate(F.array(cols), F.lit(0).cast("long"), lambda l,r: l+r))
cols=['col1','col2','col3', ..., 'col20']
Pandas
df.assign(x=np.where(df.loc[:, cols].sum(axis=1) > 0, 1, 0))
Pyspark
from pyspark.sql.functions import array, expr, when

new = (df.withColumn('x', array(*cols))  # create an array of all required columns
         .withColumn('x', when(expr("reduce(x, cast(0 as double), (c, i) -> c + i)") > 0, 1)
                     .otherwise(0)))  # combine when and reduce
new.show()
I'm trying to loop through the 'vol' dataframe and check whether the sample_date falls between certain dates. If it does, assign a value to another column.
Here's the code I have:
import numpy as np
import pandas as pd

vol = pd.DataFrame(data=pd.date_range(start='11/3/2015', end='1/29/2019'))
vol.columns = ['sample_date']
vol['hydraulic_vol'] = np.nan
for i in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= vol.loc[i, 'sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[i, 'hydraulic_vol'] = 319779
Here's the error I received:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is how you would do it properly:
cond = ((pd.Timestamp('2015-11-03') <= vol.sample_date) &
        (vol.sample_date <= pd.Timestamp('2018-06-07')))
vol.loc[cond, 'hydraulic_vol'] = 319779
Another way to do this is to use the np.where function from numpy, in combination with the Series .between method.
This method works like this:
np.where(condition, value if true, value if false)
Code example
cond = vol.sample_date.between('2015-11-03', '2018-06-07')
vol['hydraulic_vol'] = np.where(cond, 319779, np.nan)
Or you can combine them in one single line of code:
vol['hydraulic_vol'] = np.where(vol.sample_date.between('2015-11-03', '2018-06-07'), 319779, np.nan)
Edit
I see that you're new here, so here's something I had to learn as well coming to python/pandas.
Looping over a dataframe should be your last resort; try to use vectorized solutions instead, in this case .loc or np.where, which will perform far better than looping.
I have a dataframe in which I'm trying to create a binary 1/0 column when certain conditions are met. The code I'm using is as follows:
sd_threshold = 5
df1["signal"] = np.where(np.logical_and(df1["high"] >= df1["break"], df1["low"]
<= df1["break"], df1["sd_round"] > sd_threshold), 1, 0)
The code returns TypeError: return arrays must be of ArrayType when the last condition, df1["sd_round"] > sd_threshold, is included; otherwise it works fine. There isn't any issue with the data in the df1["sd_round"] column.
Any insight would be much appreciated, thank you!
Check the documentation: np.logical_and() compares only the first two arguments you give it and treats the third as the output array. You could use a nested call, but I would just go with & (element-wise boolean operations on pandas Series):
df1["signal"] = np.where((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold),
1, 0)
EDIT: You could actually just skip numpy and cast your boolean Series to int to yield 1s and 0s:
mask = ((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold))
df1["signal"] = mask.astype(int)
So I have a large pandas DataFrame that contains about two months of information, with one row of info per second. That is way too much information to deal with at once, so I want to grab specific timeframes. The following code will grab everything before February 5th 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and at a multitude of times, so being able to do it more easily would be nice. I know pandas is capable of this, because it's easy to do when the dates are the index rather than a column; but they can't be the index here, because the dates are repeated, and you then receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
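On recent pandas versions, Series.between offers a similar shorthand; note it is inclusive on both ends by default, and pandas 1.3+ accepts inclusive='neither' to match the strict comparisons above:
data = sunflower[sunflower['time'].between('2012-02-01', '2012-02-05', inclusive='neither')]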
I've got a Pandas Panel with many DataFrames with the same rows/column labels. I want to make a new panel with DataFrames that fulfill certain criteria based on a couple columns.
This is easy with dataframes and rows: Say I have a df, zHe_compare. I can get the suitable rows with:
zHe_compare[((zHe_compare['zHe_calc'] > 100) & (zHe_compare['zHe_med'] > 100)) |
            ((zHe_obs_lo_2s <= zHe_compare['zHe_calc']) & (zHe_compare['zHe_calc'] <= zHe_obs_hi_2s))]
but how do I do (pseudocode, simplified boolean):
good_results_panel = results_panel[ all_dataframes[ sum ('zHe_calc' < 'zHe_obs') > min_num ] ]
I know how to write the inner boolean part, but how do I specify this for each dataframe in a panel? Because I need multiple columns from each df, I haven't had success using the panel.minor_xs slicing techniques.
thanks!
As mentioned in its documentation, Panel is currently a bit under-developed, so the sweet syntax you've come to rely on when working with DataFrame isn't there yet.
Meanwhile, I would suggest using the Panel.select method:
def is_good_result(item_label):
    # whatever condition over the selected item
    df = results_panel[item_label]
    return df['col1'].sum() > 5
good_results = results_panel.select(is_good_result)
The is_good_result function returns a boolean value. Note that its argument is not a DataFrame instance: Panel.select calls the function with each item label, not with the DataFrame content of that item.
Of course, you can stuff that whole criterion function into a lambda in one statement, if you're into the whole brevity thing:
good_results = results_panel.select(
    lambda item_label: results_panel[item_label]['col1'].sum() > 5
)
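As an aside, Panel has since been deprecated and removed in later pandas releases; the same idea translates to a plain dict of DataFrames. A minimal sketch, assuming frames is such a dict keyed by item label:
good_results = {label: df for label, df in frames.items() if df['col1'].sum() > 5}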