I'd like to add a new column to a Pandas DataFrame, populated with True or False based on the other values in each row. My approach was to apply a function that checks boolean conditions across each row of the dataframe and fills the new column with either True or False.
This is the dataframe:
import pandas as pd

l = {'DayTime': ['2018-03-01', '2018-03-02', '2018-03-03'],
     'Pressure': [9, 10.5, 10.5], 'Feed': [9, 10.5, 11], 'Temp': [9, 10.5, 11]}
df1 = pd.DataFrame(l)
This is the function I wrote:
def ops_on(row):
    return row[('Feed' > 10)
               & ('Pressure' > 10)
               & ('Temp' > 10)
               ]
The function ops_on is used to create the new column ['ops_on']:
df1['ops_on'] = df1.apply(ops_on, axis='columns')
Unfortunately, I get this error message:
TypeError: ("'>' not supported between instances of 'str' and 'int'", 'occurred at index 0')
Thanks in advance for any help.
You should work column-wise (vectorised, efficient) rather than row-wise (inefficient, Python loop):
df1['ops_on'] = (df1['Feed'] > 10) & (df1['Pressure'] > 10) & (df1['Temp'] > 10)
The & ("and") operator is applied to Boolean series element-wise. An arbitrary number of such conditions can be chained.
Alternatively, for the special case where you are performing the same comparison multiple times:
df1['ops_on'] = df1[['Feed', 'Pressure', 'Temp']].gt(10).all(axis=1)
The error in your current setup comes from comparing the string literal 'Feed' with the integer 10 instead of indexing the row first. If you want to keep the apply approach, just rewrite your function like this:
def ops_on(row):
    return (row['Feed'] > 10) & (row['Pressure'] > 10) & (row['Temp'] > 10)
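For reference, with the sample frame from the question both approaches give the same result; a quick check with the vectorised version:
import pandas as pd

l = {'DayTime': ['2018-03-01', '2018-03-02', '2018-03-03'],
     'Pressure': [9, 10.5, 10.5], 'Feed': [9, 10.5, 11], 'Temp': [9, 10.5, 11]}
df1 = pd.DataFrame(l)

df1['ops_on'] = (df1['Feed'] > 10) & (df1['Pressure'] > 10) & (df1['Temp'] > 10)
print(df1['ops_on'].tolist())  # [False, True, True]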
Related
Let's say I have 20 columns like this:
df.columns = ['col1','col2','col3', ..., 'col20']
I am trying to sum all these columns and create a new column whose value is 1 if the sum of all the above columns is > 0, and 0 otherwise. I am currently doing it in two steps, as shown here:
df = df.withColumn("temp_col", col("col1")+col("col2")+...+col("col20"))
df = df.withColumn("new_col_2", when(col("temp_col") > 0, 1).otherwise(0))
Is there any way to do this in one step, and in a cleaner way, so I don't need to type out all these column names?
I tried something like this, but I got an error:
df.na.fill(0).withColumn("new_col" ,reduce(add, [col(col(f'{x}') for x in range(0,20))]))
An error was encountered:
name 'add' is not defined
Traceback (most recent call last):
NameError: name 'add' is not defined
You can do the following:
cols = ['col' + str(i) for i in range(1, 21)] # ['col1', 'col2',..., 'col20']
df['new_col'] = df[cols].sum(axis=1) > 0
If you want 1/0 instead of True/False, you can use .astype(int):
df['new_col'] = (df[cols].sum(axis=1) > 0).astype(int)
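Since the question itself uses PySpark, a roughly equivalent one-step version there could look like this (a sketch assuming the same cols list; Python's built-in sum works because Spark Column objects support +):
from pyspark.sql import functions as F

cols = ['col' + str(i) for i in range(1, 21)]
df = df.withColumn("new_col_2", F.when(sum(F.col(c) for c in cols) > 0, 1).otherwise(0))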
From a Spark point of view, everything is fine using the two withColumn calls. If you are concerned about performance due to the extra column, let Spark's Catalyst optimizer deal with that.
If you already have a Python list of the columns to be summed up, you can build a SQL expression from it to save some typing:
from pyspark.sql import functions as F

cols = df.columns  # or cols = [f'col{c}' for c in range(1, 21)]
sum_expr = "+".join(cols)

df.withColumn("temp_col", F.expr(sum_expr)) \
  .withColumn("new_col_2", F.when(F.col("temp_col") > 0, 1).otherwise(0)) \
  .show()
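If you really want a single step, the same SQL expression can feed when directly; a minimal sketch reusing the sum_expr built above:
df.withColumn("new_col_2", F.when(F.expr(sum_expr) > 0, 1).otherwise(0)).show()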
A solution along the lines of your second attempt (but, imho, over-engineered) would be to calculate the sum with a combination of array and aggregate:
df.withColumn("temp_col", F.aggregate(F.array(cols), F.lit(0).cast("long"), lambda l,r: l+r))
cols=['col1','col2','col3', ..., 'col20']
Pandas
df.assign(x=np.where(df.loc[:, cols].sum(axis=1) > 0, 1, 0))
Pyspark
from pyspark.sql.functions import array, expr, when

new = (df.withColumn('x', array(*cols))  # create an array of all required columns
         .withColumn('x', when(expr("reduce(x, cast(0 as double), (c, i) -> c + i)") > 0, 1)
                     .otherwise(0)))  # combine when and reduce
new.show()
I'm trying to loop through the 'vol' dataframe and check whether the sample_date falls between certain dates. If it does, assign a value to another column.
Here's the code I have:
import numpy as np
import pandas as pd

vol = pd.DataFrame(data=pd.date_range(start='11/3/2015', end='1/29/2019'))
vol.columns = ['sample_date']
vol['hydraulic_vol'] = np.nan
for i in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= vol.loc[i, 'sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[i, 'hydraulic_vol'] = 319779
Here's the error I received:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is how you would do it properly:
cond = ((pd.Timestamp('2015-11-03') <= vol.sample_date) &
        (vol.sample_date <= pd.Timestamp('2018-06-07')))
vol.loc[cond, 'hydraulic_vol'] = 319779
Another way to do this is to use the np.where function from numpy, in combination with the Series .between method.
This method works like this:
np.where(condition, value if true, value if false)
Code example
cond = vol.sample_date.between('2015-11-03', '2018-06-07')
vol['hydraulic_vol'] = np.where(cond, 319779, np.nan)
Or you can combine them in one single line of code:
vol['hydraulic_vol'] = np.where(vol.sample_date.between('2015-11-03', '2018-06-07'), 319779, np.nan)
Edit
I see that you're new here, so here's something I had to learn as well coming to python/pandas.
Looping over a dataframe should be your last resort; try to use vectorized solutions instead, in this case .loc or np.where, which will perform far better than looping.
I have a dataframe in which I'm trying to create a binary 1/0 column when certain conditions are met. The code I'm using is as follows:
sd_threshold = 5
df1["signal"] = np.where(np.logical_and(df1["high"] >= df1["break"], df1["low"]
<= df1["break"], df1["sd_round"] > sd_threshold), 1, 0)
The code returns TypeError: return arrays must be of ArrayType when the last condition, df1["sd_round"] > sd_threshold, is included; otherwise it works fine. There isn't any issue with the data in the df1["sd_round"] column.
Any insight would be much appreciated, thank you!
Check the documentation: np.logical_and() compares only the first two arguments you give it and treats the third as the output array. You could use a nested call, but I would just go with & (element-wise boolean operations on pandas Series):
df1["signal"] = np.where((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold),
1, 0)
EDIT: You could actually just skip numpy and cast your boolean Series to int to yield 1s and 0s:
mask = ((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold))
df1["signal"] = mask.astype(int)
So I have a large pandas DataFrame that contains about two months of information, with one row of info per second. That is way too much information to deal with at once, so I want to grab specific timeframes. The following code will grab everything before February 5th 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and at a multitude of times, so being able to do it more easily would be nice. I know pandas is capable of this, because it's easy to do when the dates are the index rather than a column; but they can't be the index here, because the dates are repeated, and you then receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
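On recent pandas versions, Series.between offers a similar shorthand; note it is inclusive on both ends by default, and pandas 1.3+ accepts inclusive='neither' to match the strict comparisons above:
data = sunflower[sunflower['time'].between('2012-02-01', '2012-02-05', inclusive='neither')]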
I've got a Pandas Panel with many DataFrames with the same rows/column labels. I want to make a new panel with DataFrames that fulfill certain criteria based on a couple columns.
This is easy with dataframes and rows: Say I have a df, zHe_compare. I can get the suitable rows with:
zHe_compare[((zHe_compare['zHe_calc'] > 100) & (zHe_compare['zHe_med'] > 100)) |
            ((zHe_obs_lo_2s <= zHe_compare['zHe_calc']) & (zHe_compare['zHe_calc'] <= zHe_obs_hi_2s))]
but how do I do (pseudocode, simplified boolean):
good_results_panel = results_panel[ all_dataframes[ sum ('zHe_calc' < 'zHe_obs') > min_num ] ]
I know how to write the inner boolean part, but how do I specify this for each dataframe in a panel? Because I need multiple columns from each df, I haven't had success using the panel.minor_xs slicing techniques.
thanks!
As mentioned in its documentation, Panel is currently a bit under-developed, so the sweet syntax you've come to rely on when working with DataFrame isn't there yet.
Meanwhile, I would suggest using the Panel.select method:
def is_good_result(item_label):
    # whatever condition over the selected item
    df = results_panel[item_label]
    return df['col1'].sum() > 5
good_results = results_panel.select(is_good_result)
The is_good_result function returns a boolean value. Note that its argument is not a DataFrame instance: Panel.select calls the function with each item label, not with the DataFrame content of that item.
Of course, you can stuff that whole criterion function into a lambda in one statement, if you're into the whole brevity thing:
good_results = results_panel.select(
    lambda item_label: results_panel[item_label]['col1'].sum() > 5
)
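As an aside, Panel has since been deprecated and removed in later pandas releases; the same idea translates to a plain dict of DataFrames. A minimal sketch, assuming frames is such a dict keyed by item label:
good_results = {label: df for label, df in frames.items() if df['col1'].sum() > 5}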