Python: inconsistent handling of IF statement in loop

I have a dataframe df containing conditions and values.
import pandas as pd
df=pd.DataFrame({'COND':['X','X','X','Y','Y','Y'], 'VALUE':[1,2,3,1,2,3]})
Therefore df looks like:
COND  VALUE
   X      1
   X      2
   X      3
   Y      1
   Y      2
   Y      3
I'm using a loop to subset df according to COND and write a separate text file of values for each condition:
conditions = {'X','Y'}
for condition in conditions:
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
The end result is two text files, X_values.txt and Y_values.txt, both of which contain 1 2 3. Up until this point everything is working as expected.
I would like to further subset df for one condition only. For example, perhaps I want all values from condition Y, but ONLY values < 3 from condition X. In this scenario, X_values.txt should contain 1 2 and Y_values.txt should contain 1 2 3. I tried implementing this with an IF statement:
conditions = {'X','Y'}
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
Here is where the inconsistency occurs. The above code works fine (i.e. X_values.txt contains 1 2, and Y_values.txt contains 1 2 3, as intended), but when I use if condition == 'Y' instead of if condition == 'X', it breaks, and both text files contain only 1 2.
In other words, if I specify the first element of conditions in the IF statement, it works as intended; if I specify the second element, it breaks and applies the < 3 subset to the values from both conditions.
What is going on here and how can I resolve it?
Thanks!

The problem you are encountering arises because you are overwriting df inside the loop.
conditions = {'X','Y'}
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]  # <-- HERE'S YOUR ISSUE
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
What slightly surprised me is that when you loop over the set conditions you get condition = 'Y' first, then condition = 'X'. But as a set is an unordered collection (i.e. it doesn't claim any inherent order of its elements), this ought not to be too disturbing: Python is just reading out the elements in whatever way is internally most convenient.
You could use conditions = ['X', 'Y'] to loop over a list (an ordered collection) instead. Then it will do X first, then Y. However, if you do that you will get the same bug but in reverse (i.e. it works for if condition == 'Y' but not if condition == 'X').
This is because after the loop runs once, df has been reassigned to the subset of the original df that only contains values less than three. That's why you get only the values 1 and 2 in both files if the if condition statement triggers on the first pass through the loop.
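You can see this concretely by printing the length of df on each pass (a quick sketch, assuming the df from the question and a list so that 'X' comes first):
conditions = ['X', 'Y']
for condition in conditions:
    print(condition, len(df))  # prints "X 6", then "Y 4": df has already shrunk
    if condition == 'X':
        df = df[df['VALUE'] < 3]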
Now for the fix:
conditions = ['X', 'Y']
for condition in conditions:
    csv_name = f"{condition}_values.txt"
    if condition == 'X':
        df_filter = f"VALUE < 3 & COND == '{condition}'"
    else:
        df_filter = f"COND == '{condition}'"
    df.query(df_filter).VALUE.to_csv(csv_name, header=False, index=False)
Here I've introduced the DataFrame.query method, which is typically more concise than creating a Boolean series to use as a mask, as you were doing.
The f-string syntax only works on Python 3.6+; if you're on an older version, modify as appropriate (e.g. df_filter = "COND == '{}'".format(condition)).
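If you'd rather stay with Boolean masks, a minimal sketch (assuming the original df from the question) is to filter into a fresh variable on each pass, so that df itself is never reassigned:
conditions = ['X', 'Y']
for condition in conditions:
    # subset for this condition only; df is left untouched
    df2 = df[df['COND'] == condition][['VALUE']]
    if condition == 'X':
        df2 = df2[df2['VALUE'] < 3]
    df2.to_csv(condition + '_values.txt', header=False, index=False)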

We can put the cutoff for each condition in a dict, then use map to filter df before groupby:
cond = {'X': 3, 'Y': 4}
subdf = df[df['VALUE'] < df.COND.map(cond)]
for x, y in subdf.groupby('COND'):
    y[['VALUE']].to_csv(x + '_values.txt', header=False, index=False)
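For reference, a quick look at the intermediate Series that map produces here (using the df from the question), which is what each VALUE is compared against row by row:
print(df.COND.map(cond))
# 0    3
# 1    3
# 2    3
# 3    4
# 4    4
# 5    4
# Name: COND, dtype: int64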

df = pd.DataFrame({'COND':['X','X','X','Y','Y','Y'], 'VALUE':[1,2,3,1,2,3]})
conditions = df.COND
for condition in conditions:
    print(condition)
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
You didn't define the variable conditions, so it gave you an error.
Try doing:
conditions = df.COND
before the for loop.

Related

faster way to run a for loop for a very large dataframe list

I am using two for loops inside each other to calculate a value using combinations of elements in a list of dataframes. The list consists of a large number of dataframes, and using two for loops takes a considerable amount of time.
Is there a way I can do the operation faster?
The functions I refer to with dummy names are the ones where I calculate the results.
My code looks like this:
conf_list = []
for tr in range(len(trajectories)):
    df_1 = trajectories[tr]
    if len(df_1) == 0:
        continue
    for tt in range(len(trajectories)):
        df_2 = trajectories[tt]
        if len(df_2) == 0:
            continue
        if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
            continue
        df_temp = cartesian_product_basic(df_1, df_2)
        flg, df_temp = another_function(df_temp)
        if flg == 0:
            continue
        flg_h = some_other_function(df_temp)
        if flg_h == 1:
            conf_list.append(1)
My input list consists of around 5000 dataframes (each having several hundred rows) that look like:
id  x  y  z  time
 1  5  7  2     5
What I do is get the cartesian product of combinations of two dataframes, and for each pair I calculate another value 'c'. If this value c meets a condition, I add an element to my conf_list so that I can get the final number of pairs meeting the requirement.
For further info:
cartesian_product_basic(df_1, df_2) is a function getting the cartesian product of the two dataframes.
another_function looks like this:
def another_function(df_temp):
    # nwh is assumed to be an alias for np.where
    df_temp['z_dif'] = nwh((df_temp['time_x'] == df_temp['time_y']),
                           abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
    df_temp = df_temp.dropna()
    df_temp['vert_conf'] = nwh((df_temp['z_dif'] >= 1000),
                               np.nan, 1)
    df_temp = df_temp.dropna()
    if len(df_temp) == 0:
        flg = 0
    else:
        flg = 1
    return flg, df_temp
and some_other_function looks like this:
def some_other_function(df_temp):
    df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
    df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
    df_temp['hor_dif'] = hypot(df_temp['x_dif'], df_temp['y_dif'])
    df_temp['conf'] = np.where((df_temp['hor_dif'] <= 5), 1, np.nan)
    # flag whether any pair is within horizontal range
    if df_temp['conf'].sum() > 0:
        flg_h = 1
    else:
        flg_h = 0
    return flg_h
The following are ways to make your code run faster (a sketch illustrating the first two appears after these tips):
Instead of a for loop, use a list comprehension.
Use built-in functions like map, filter, sum etc.; these would make your code faster.
Avoid repeated '.' (dot operator) lookups, for example:
import datetime
a = datetime.datetime.now()  # don't use this

from datetime import datetime
timenow = datetime.now  # bind the attribute once
a = timenow()  # use this
Use C/C++-based libraries like numpy.
Don't convert datatypes unnecessarily.
In infinite loops, use while 1 instead of while True.
Use built-in libraries.
If the data will not change, convert it to a tuple.
Use string concatenation.
Use multiple assignments.
Use generators.
When using if-else to check a Boolean value, avoid the explicit comparison operator.
# Instead of the below approach
if a == 1:
    print('a is 1')
else:
    print('a is 0')

# Try this approach
if a:
    print('a is 1')
else:
    print('a is 0')
# This helps because it skips the time spent comparing the two values.
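As a small illustration of the first two tips applied to the question's pattern (a sketch only; trajectories below is toy data standing in for the real list), the empty-dataframe checks can be hoisted out of the nested loops with a list comprehension:
import pandas as pd

# toy stand-in for the question's list of trajectory dataframes
trajectories = [pd.DataFrame({'time': [1, 2]}), pd.DataFrame(), pd.DataFrame({'time': [3]})]

# filter out the empty dataframes once, instead of an if/continue on every pass
non_empty = [t for t in trajectories if len(t) > 0]
print(len(non_empty))  # 2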
Useful references:
Speeding up Python Code: Fast Filtering and Slow Loops
Speed Up Python Code

Better way to do computation over pandas

Below is my pandas snippet; it works. Given a df, I wish to know whether there exists any row where c1 > 10 and c2 and c3 are both True. The code below works, but I wish to know if there is a better way to do the same.
import pandas as pd
inp = [{'c1':10, 'c2':True, 'c3': False}, {'c1':9, 'c2':True, 'c3': True}, {'c1':11, 'c2':True, 'c3': True}]
df = pd.DataFrame(inp)
def check(df):
    for index, row in df.iterrows():
        if (row['c1'] > 10) & (row['c2'] == True) & (row['c3'] == True):
            return True
        else:
            continue

t = check(df)
When using pandas you rarely need to iterate over rows and apply operations to each row separately. In many cases, if you apply the same operation to the whole dataframe or column, you get the same or a similar result with faster, more readable code. In your case:
(df['c1'] > 10) & df['c2'] & df['c3']
# will lead to a Series:
# 0 False
# 1 False
# 2 True
# dtype: bool
(Note that I am calling the operation on the whole df rather than on a single row.) This Series signifies for which rows the condition holds. If you just need to know whether any row satisfies the condition, you can call any:
((df['c1'] > 10) & df['c2'] & df['c3']).any()
# True
So your whole check function would be:
def check(df):
    return ((df['c1'] > 10) & df['c2'] & df['c3']).any()
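For the sample df defined in the question, only the third row (c1=11, c2=True, c3=True) passes all three tests, so:
print(check(df))  # True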
It is not clear what you want to change or improve about your solution, but you can achieve the same without a separate function and loops as well -
df[(df['c1'] > 10) & (df['c2']) & (df['c3'])].index.size > 0
The condition in question is (df.c1 > 10) & df.c2 & df.c3
You can either check if there are any rows in the dataframe df that satisfy this condition:
>>> print(((df.c1 > 10) & df.c2 & df.c3).any())
True
Or, you can check the length of the dataframe returned from the original dataframe for this condition (which will be df[condition]):
>>> print(len(df[(df.c1 > 10) & df.c2 & df.c3]) > 0)
True

How to create conditional columns in Pandas with any?

I'm working with Pandas. I need to create a new column in a dataframe according to conditions in other columns. For each value in a series, I try to check whether it contains a given value (a condition to return text). This works when the values are exactly the same, but not when the target is only part of the value in the series.
Sample data :
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    if "ores" in df5["Symptom"]:
        return "Things"

df["new_column"] = df.swifter.apply(conditions, axis=1)
It doesn't work because any("something") is always True.
So I tried:
df['new_column'] = np.where(df2["Symptom"].str.contains('ores'), 'yes', 'no') : return "Things"
It doesn't work because it's inside a loop.
I can't use np.select because it needs two separate lists, and my code has to be easily editable (and it can't come from a dict).
It also doesn't work with find_all, and also not with:
df["new_column"] == "ores" is True: return "things"
I don't really understand why nothing works or what I have to do.
Edit:
df5 = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    (df5["Symptom"].str.contains('ores'), 'Things')

df5["Deversement Service"] = np.where(conditions)
df5
For the moment I have a length-of-values problem.
To add a new column with a condition, use np.where:
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things', "")
print (df)
             Symptom     new
0               ores  Things
1  ores + more texts  Things
2      anything else
If you need a single boolean value, use pd.Series.any:
if df["Symptom"].str.contains('ores').any():
print ("Things")
# Things
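If you later need several substring-to-label rules, one way to extend this pattern (a sketch only; the 'else' substring and 'Other' label are made-up placeholders for a second rule) is to nest np.where calls:
import numpy as np
import pandas as pd

df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
# 'else' / 'Other' are hypothetical placeholders for a second rule
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things',
            np.where(df["Symptom"].str.contains('else'), 'Other', ""))
print(df)
#              Symptom     new
# 0               ores  Things
# 1  ores + more texts  Things
# 2      anything else   Other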

Splitting a dataframe based on condition

I am trying to split my dataframe into two based on medical_plan_id: if it is empty, into df1; if not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
  wellthie_issuer_identifier  ...  medical_plan_id
0                   UHC99806  ...             None
1                   UHC99806  ...             None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
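A quick sketch of the difference (string interning makes is results implementation-dependent, hence the hedged comment):
a = ''.join(['medical', '_plan_id'])
b = 'medical_plan_id'
print(a == b)  # True: the contents are equal
print(a is b)  # typically False: they are two distinct objects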
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values in the NumPy arrays underlying Pandas are represented by np.nan, and np.nan != np.nan by design.
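A one-line check of the NaN inequality just described:
import numpy as np
print(np.nan == np.nan)  # False: NaN compares unequal to itself by design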
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something, as a programmer, you should look to avoid. In this case, there's no need to create two new variables, you can use GroupBy with dict to give a dictionary of dataframes with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). A variant of the above, you can forego dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
   medical_plan_id  values
2           2134.0       3
3           4325.0       4
4           6543.0       5

   medical_plan_id  values
0              NaN       1
1              NaN       2
5              NaN       6
6              NaN       7
Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second being the dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
_ is used in Python to mark variables one is not interested in keeping. I have split the code over two lines for readability.
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1

Python - Population of PANDAS dataframe column based on conditions met in other dataframes' columns

I have 3 dataframes (df1, df2, df3) which are identically structured (same number and labels of rows/columns), but populated with different values.
I want to populate df3 based on values in the associated column/rows in df1 and df2. I'm doing this with a FOR loop and a custom function:
for x in range(len(df3.columns)):
    df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
    if df1.iloc[:, y] != 1 and df2.iloc[:, y] == 0:
        return "NEW"
    elif df2.iloc[:, y] == 2:
        return "OLD"
    else:
        return "NEITHER"
I understand why I get an error message when I run this, but I can't figure out how to apply this function to a series. I could do it row by row with more complex code, but I'm hoping there's a more efficient solution. I fear my approach is flawed.
v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
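A minimal, self-contained sketch of this (with made-up 2x2 frames) to show the elementwise broadcast:
import numpy as np
import pandas as pd

# toy stand-ins for the question's identically structured frames
df1 = pd.DataFrame({'a': [2, 1], 'b': [1, 1]})
df2 = pd.DataFrame({'a': [0, 2], 'b': [0, 5]})
df3 = pd.DataFrame('', index=df1.index, columns=df1.columns)

v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
print(df3)
#          a        b
# 0      NEW  NEITHER
# 1      OLD  NEITHER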
Yeah, try to avoid loops in pandas; they're inefficient, and pandas is built to be used with the underlying numpy vectorization.
You want to use the apply function.
Something like the following (assuming customFunction is rewritten to accept a row rather than a column index):
df3['new_col'] = df3.apply(lambda row: customFunction(row), axis=1)
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
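For reference, a hedged sketch of a row-wise apply with toy data; the frame and rule below are made up, standing in for the question's logic:
import pandas as pd

df3 = pd.DataFrame({'a': [2, 1], 'b': [0, 2]})

def customFunction(row):
    # toy row-wise rule standing in for the question's column-wise logic
    if row['a'] != 1 and row['b'] == 0:
        return 'NEW'
    return 'OLD'

df3['new_col'] = df3.apply(customFunction, axis=1)
print(df3)
#    a  b new_col
# 0  2  0     NEW
# 1  1  2     OLD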
