Better way to do computation over pandas - python

Below is my pandas snippet; it works. Given a df, I wish to know whether there exists any row where c1 > 10 and both c2 and c3 are True. The code below works, but I wish to know if there is a better way to do the same.
import pandas as pd
inp = [{'c1':10, 'c2':True, 'c3': False}, {'c1':9, 'c2':True, 'c3': True}, {'c1':11, 'c2':True, 'c3': True}]
df = pd.DataFrame(inp)
def check(df):
    for index, row in df.iterrows():
        if (row['c1'] > 10) & (row['c2'] == True) & (row['c3'] == True):
            return True
        else:
            continue

t = check(df)

When using pandas you rarely need to iterate over rows and apply operations to each row separately. In many cases, applying the same operation to the whole dataframe or column gives the same or similar result with faster and more readable code. In your case:
(df['c1'] > 10) & df['c2'] & df['c3']
# will lead to a Series:
# 0 False
# 1 False
# 2 True
# dtype: bool
(note that I am calling the operation on the whole df rather than on a single row), which signifies for which rows the condition holds. If you just need to know whether any row satisfies the condition, you can call any:
((df['c1'] > 10) & df['c2'] & df['c3']).any()
# True
So your whole check function would be:
def check(df):
    return ((df['c1'] > 10) & df['c2'] & df['c3']).any()

It is not clear what you want to change or improve about your solution, but you can achieve the same without a separate function and loops as well -
df[(df['c1'] > 10) & (df['c2']) & (df['c3'])].index.size > 0

The condition in question is (df.c1 > 10) & df.c2 & df.c3
You can either check whether there are any rows in the dataframe df that satisfy this condition:
>>> print(((df.c1 > 10) & df.c2 & df.c3).any())
True
Or you can check the length of the dataframe returned by filtering the original dataframe with this condition (which will be df[condition]):
>>> print(len(df[(df.c1 > 10) & df.c2 & df.c3]) > 0)
True
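If you prefer DataFrame.query, the same check can also be written as an expression string. A minimal sketch on the question's data (this assumes c2 and c3 really are boolean columns, so they can be referenced directly in the expression):
import pandas as pd

inp = [{'c1': 10, 'c2': True, 'c3': False},
       {'c1': 9, 'c2': True, 'c3': True},
       {'c1': 11, 'c2': True, 'c3': True}]
df = pd.DataFrame(inp)

# Boolean columns can appear directly in the query expression.
print(not df.query("c1 > 10 and c2 and c3").empty)  # True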

Related

Comparing and dropping columns based on greater or smaller timestamp

I have this df:
id  started                  completed
1   2022-02-20 15:00:10.157  2022-02-20 15:05:10.044
and I have this other one data:
timestamp                x   y
2022-02-20 14:59:47.329  16  0.0
2022-02-20 15:01:10.347  16  0.2
2022-02-20 15:06:35.362  16  0.3
What I want to do is filter the rows in data where timestamp > started and timestamp < completed (which will leave me with the middle row only).
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] < '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why, and where I am making the mistake? Do I have to convert df['started'] to a string or something?
You have two issues here.
The first is the use of and. If you want to combine multiple masks (boolean arrays) element-wise with "and" logic, you want to use & instead of and.
Then, the use of df['started'] and df['completed'] for the comparison. If you use a debugger, you can see that df['started'] is a Series with its own index, and the same goes for data['timestamp']. The rules for comparing two Series are described here. Essentially, you can only compare two Series with the same index, but here df has only one row while data has several. Try extracting the element from df as a scalar rather than a Series, using loc for instance.
For instance:
Using masks
import numpy as np
import pandas as pd
from string import ascii_lowercase

n = 10
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
df2 = pd.DataFrame(
    {
        "max_val": [0],
        "min_val": [-0.5],
    }
)
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
Using query
df2 = pd.DataFrame(
    {
        "max_val": np.repeat(0, n),
        "min_val": np.repeat(-0.5, n),
    }
)
df.query("y < @df2.max_val and y > @df2.min_val")
Out[124]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
To make the comparison, pandas needs the same row count in both dataframes: a comparison is made between the first row of the data['timestamp'] series and the first row of the df['started'] series, and so on.
The error occurs because the second row of the data['timestamp'] series has nothing to compare against.
To make the code work, you can add, for every row of data, a row in df to match against. That way pandas returns a Boolean result for every row, and you can combine the two conditions to keep the rows where both are True.
Pandas doesn't accept Python's and operator here, so you need to use the & operator; your code will then look like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
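For the single-row df in the question, a minimal sketch of the scalar-extraction idea from the first answer: pull started and completed out with loc so that no index alignment is involved (the frames below are rebuilt from the question's sample data):
import pandas as pd

df = pd.DataFrame({'id': [1],
                   'started': pd.to_datetime(['2022-02-20 15:00:10.157']),
                   'completed': pd.to_datetime(['2022-02-20 15:05:10.044'])})
data = pd.DataFrame({'timestamp': pd.to_datetime(['2022-02-20 14:59:47.329',
                                                  '2022-02-20 15:01:10.347',
                                                  '2022-02-20 15:06:35.362']),
                     'x': [16, 16, 16],
                     'y': [0.0, 0.2, 0.3]})

started = df.loc[0, 'started']      # scalar Timestamp, not a Series
completed = df.loc[0, 'completed']
res = data[(data['timestamp'] > started) & (data['timestamp'] < completed)]
print(res)  # only the middle row remains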

Python: inconsistent handling of IF statement in loop

I have a dataframe df containing conditions and values.
import pandas as pd
df=pd.DataFrame({'COND':['X','X','X','Y','Y','Y'], 'VALUE':[1,2,3,1,2,3]})
Therefore df looks like:
COND VALUE
X 1
X 2
X 3
Y 1
Y 2
Y 3
I'm using a loop to subset df according to COND, and write separate text files containing values for each condition
conditions = {'X','Y'}
for condition in conditions:
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
The end result is two text files, X_values.txt and Y_values.txt, both of which contain 1 2 3. Up until this point everything is working as expected.
I would like to further subset df for one condition only. For example, perhaps I want all values from condition Y, but ONLY values < 3 from condition X. In this scenario, X_values.txt should contain 1 2 and Y_values.txt should contain 1 2 3. I tried implementing this with an IF statement:
conditions = {'X','Y'}
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
Here is where the inconsistency occurs. The above code works fine (i.e. X_values.txt contains 1 2, and Y_values.txt 1 2 3, as intended), but when I use if condition=='Y' instead of if condition=='X', it breaks, and both text files only contain 1 2.
In other words, if I specify the first element of conditions in the IF statement then it works as intended, however if I specify the second element then it breaks and applies the < 3 subset to values from both conditions.
What is going on here and how can I resolve it?
Thanks!
The problem you are encountering arises because you are overwriting df inside the loop.
conditions = {'X','Y'}
for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]  # <-- HERE'S YOUR ISSUE
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
What slightly surprised me is that when you are looping over the set conditions you get condition = 'Y' first, then condition = 'X'. But as a set is an unordered collection (i.e. it doesn't claim to have an inherent order of its elements), this ought not to be too disturbing: python is just reading out the elements in the most internally convenient way.
You could use conditions = ['X', 'Y'] to loop over a list (an ordered collection) instead. Then it will do X first, then Y. However, if you do that you will get the same bug but in reverse (i.e. it works for if condition == 'Y' but not if condition == 'X').
This is because after the loop runs once, df has been reassigned to the subset of the original df that only contains values less than three. That's why you get only the values 1 and 2 in both files if the if condition statement triggers on the first pass through the loop.
Now for the fix:
conditions = ['X', 'Y']
for condition in conditions:
    csv_name = f"{condition}_values.txt"
    if condition == 'X':
        df_filter = f"VALUE < 3 & COND == '{condition}'"
    else:
        df_filter = f"COND == '{condition}'"
    df.query(df_filter).VALUE.to_csv(csv_name, header=False, index=False)
Here I've introduced the DataFrame.query method, which is typically more concise than trying to create a Boolean series to use as a mask as you were doing.
The f-string syntax only works on Python 3.6+; if you're on a lower version, modify as appropriate (e.g. df_filter = "COND == '{}'".format(condition)).
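For completeness, a minimal sketch of the same fix without query: leave df untouched and filter into a temporary variable instead (the name subset is just illustrative):
conditions = ['X', 'Y']
for condition in conditions:
    subset = df[df['COND'] == condition]
    if condition == 'X':
        subset = subset[subset['VALUE'] < 3]  # extra filter applies only to X
    subset[['VALUE']].to_csv(condition + '_values.txt', header=False, index=False)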
We can also write the maximum VALUE to keep for each condition into a dict, then use map to filter the df before groupby:
cond = {'X': 2, 'Y': 3}
subdf = df[df['VALUE'] <= df.COND.map(cond)]
for x, y in subdf.groupby('COND'):
    y.to_csv(x + '_values.txt')
df = pd.DataFrame({'COND': ['X','X','X','Y','Y','Y'], 'VALUE': [1,2,3,1,2,3]})
conditions = df.COND
for condition in conditions:
    print(condition)
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)

for condition in conditions:
    if condition == 'X':
        df = df[df['VALUE'] < 3]
    df2 = df[df['COND'].isin([condition])][['VALUE']]
    df2.to_csv(condition + '_values.txt', header=False, index=False)
You didn't specify the variable "Condition", so it gave you an error.
try doing :
conditions = df.COND
before the for loop

Iterate pandas row by row and modify specific "cells" in a python way

I'm new to python and I have a pandas dataframe that I want to iterate row by row (like for example a 2d array in other languages).
The goal is something like this as a logic: (if df was a like 2d array)
for row in df:
    if df[row, 2] == '' AND df[row, 1] != '':
        df[row-1, 1] = df[row, 1]
        df[row, 1] = ''
The point is: I want to move the contents of the current row to the previous one in column 1, if the current row's column 2 is empty and its column 1 is not.
How would I do that in a python way? (without for example iterating with for loops). I saw something about vectorization but I don't really get how it works.
Or is it easier to convert the df into a list of lists, or an array? The files are big, so I would like a fast approach. I read from an Excel file, so I just used pandas' read_excel to import it into a df.
Try this (assuming by column 1 you meant the column at index 0, and by column 2, the one at index 1):
import pandas as pd
import numpy as np
col1, col2 = df.columns[0], df.columns[1]
mask = (df.loc[:, col1] != '') & (df.loc[:, col2] == '')
mask.iloc[0] = False # don't wrap around first row (even if the condition applies)
df.loc[mask.shift(-1, fill_value=False), col1] = df.loc[mask, col1].values
The key point here is using Series.shift to shift the boolean mask backwards by one. This only uses pandas/numpy vectorized functions, so it will be much better than iterating with a plain Python for loop.
Step-by-step
[Get the labels of your columns: col1, col2 = df.columns[0], df.columns[1]]
Create a boolean mask which is True for the rows which satisfy your condition, i.e. nonempty first column and empty second column:
mask = (df.loc[:, col1] != '') & (df.loc[:, col2] == '')
mask.iloc[0] = False
Here we manually set the first element of the mask to False, since even if the first row satisfies the condition, we can't do anything with it (there is no previous row to copy the value of the first column to). (This isn't a problem for Series.shift, which doesn't wrap around, but it is when we're using this mask, in step 3, to select the values that we're going to assign, with df.loc[mask, col1].values: if mask.iloc[0] were True, we would have one more value than targets.)
Shift the mask backwards by one to obtain a mask of the rows to be modified (i.e. the rows that come immediately before a row that satisfies the condition):
mask.shift(-1, fill_value=False)
Since we're shifting the mask backwards by one, the last element won't be defined, so we set it to False by using fill_value=False—we don't want to modify the last row.
Within column 1, assign the values of the rows satisfying the condition to their respective previous rows, using the two masks that we computed:
df.loc[mask.shift(-1, fill_value=False), col1] = df.loc[mask, col1].values
Here we must use .values on the right-hand-side to get the raw numpy array of values, since if we leave it as a Series, pandas will try to align the indices of the lhs and rhs (and since we shifted the rows by one, the indices won't match, so the end result will contain NaNs); instead, we simply want to assign the first element of the rhs to the first slot of the lhs, the second element to the second slot, etc.
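To see that alignment behaviour concretely, here is a tiny made-up example (unrelated to the question's data) of assigning a Series with a mismatched index, with and without .values:
import pandas as pd

left = pd.Series(['a', 'b', 'c'])             # index 0, 1, 2
right = pd.Series(['x', 'y'], index=[1, 2])   # index 1, 2

aligned = left.copy()
aligned.loc[[0, 1]] = right                   # assignment aligns on the index labels
print(aligned.tolist())                       # [nan, 'x', 'c'] -- label 0 has no match in right

positional = left.copy()
positional.loc[[0, 1]] = right.values         # .values drops the index, so assignment is positional
print(positional.tolist())                    # ['x', 'y', 'c']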
This is more or less the same approach as the one outlined by Chaos in the comments.
Example
>>> sample = pd.DataFrame([("spam", ""), ("foo", "bar"), ("baz", ""), ("", "eggs")])
>>> df = sample.copy()
>>> df
0 1
0 spam
1 foo bar
2 baz
3 eggs
>>> col1, col2 = df.columns[0], df.columns[1]
>>> mask = (df.loc[:, col1] != '') & (df.loc[:, col2] == '')
>>> mask.iloc[0] = False
>>> df.loc[mask.shift(-1, fill_value=False), col1] = df.loc[mask, col1].values
>>> df
0 1
0 spam
1 baz bar
2 baz
3 eggs
Addendum
If you actually do want the value of the first row to wrap around to the last row (if the condition applies to the first row), i.e. you want to move the values around circularly, you can use np.roll instead of Series.shift:
mask = (df.loc[:, col1] != '') & (df.loc[:, col2] == '')
df.loc[np.roll(mask, -1), col1] = np.roll(df.loc[mask, col1].values, -1)
Then, continuing the previous example:
>>> df = sample.copy()
>>> mask = (df.loc[:, col1] != '') & (df.loc[:, col2] == '')
>>> df.loc[np.roll(mask, -1), col1] = np.roll(df.loc[mask, col1].values, -1)
>>> df
0 1
0 spam
1 baz bar
2 baz
3 spam eggs
In case you do not find a more pythonic way, here is a corrected version of the loop-based code:
for i in range(1, len(df)):
    if df.iloc[i, 2] == '' and df.iloc[i, 1] != '':
        df.iloc[i-1, 1] = df.iloc[i, 1]
        df.iloc[i, 1] = ''

Multiple check for empty dataframe

I have a situation where I need to move the dataframe forward in code only if it is not empty. Illustrated below:
----- Filter 1 -------
Check if df.empty then return emptydf
else
----- Filter 2 ------
Check if df.empty then return emptydf
else
----- Filter 3 ------
return df
The code for the above is written as below(Just a part of code):
def filter_df(df):
    df = df[df.somecolumn > 2].copy()
    if df.empty:
        return df
    df = df[df.someother == 2].copy()
    if df.empty:
        return df
    df = df[df.all <= 10].copy()
    return df
If I have many such filters which expect dataframe not to be empty, I need to check empty after each filter. Is there any better way of checking dataframe empty rather than checking at each level.
Repeatedly subsetting your dataframe is expensive. Repeatedly copying your dataframe may also be expensive. It's also expensive to pre-calculate a large number of Boolean masks. The tricky part is finding a way to apply the masks lazily in a for loop.
While the below functional solution may seem ugly, it does address the above concerns. The idea is to combine a Boolean mask iteratively with an aggregate mask. Check in your loop whether your mask has all False values, not whether a dataframe is empty. Apply the aggregate mask once at the end of your logic:
from operator import methodcaller
import numpy as np

def filter_df(df):
    masks = [('somecolumn', 'gt', 2),
             ('someother', 'eq', 2),
             ('all', 'le', 10)]
    agg_mask = np.ones(len(df.index)).astype(bool)  # "all True" mask
    for col, op, val in masks:
        mask = methodcaller(op, val)(df[col])
        agg_mask = agg_mask & mask
        if not agg_mask.any():
            return df[agg_mask]
    return df[agg_mask]
Note that for this solution the Series comparison operators >, ==, <= have functional equivalents pd.Series.gt, pd.Series.eq, pd.Series.le.
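A tiny illustration of that equivalence on a made-up Series:
import pandas as pd
from operator import methodcaller

s = pd.Series([1, 3, 5])
print((s > 2).tolist())                   # [False, True, True]
print(s.gt(2).tolist())                   # [False, True, True] -- same comparison, method form
print(methodcaller('gt', 2)(s).tolist())  # [False, True, True] -- what the loop above builds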
You can put the check into a function and call it after every filter. Note that the helper can only report emptiness; the early return still has to happen in the calling function:
def check_empty(df):
    return df.empty

def filter_df(df):
    df = df[df.somecolumn > 2].copy()
    if check_empty(df):
        return df
    df = df[df.someother == 2].copy()
    if check_empty(df):
        return df
    df = df[df.all <= 10].copy()
    return df

Python - Population of PANDAS dataframe column based on conditions met in other dataframes' columns

I have 3 dataframes (df1, df2, df3) which are identically structured (# and labels of rows/columns), but populated with different values.
I want to populate df3 based on values in the associated column/rows in df1 and df2. I'm doing this with a FOR loop and a custom function:
for x in range(len(df3.columns)):
    df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
    if df1.iloc[:, y] != 1 and df2.iloc[:, y] == 0:
        return "NEW"
    elif df2.iloc[:, y] == 2:
        return "OLD"
    else:
        return "NEITHER"
I understand why I get an error message when I run this, but I can't figure out how to apply this function to a series. I could do it row by row with more complex code, but I'm hoping there's a more efficient solution. I fear my approach is flawed.
import numpy as np

v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
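As a quick sanity check, a small self-contained run of the nested np.where approach above, on made-up 2x2 frames (not the asker's data):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [7, 4]})
df2 = pd.DataFrame({'a': [0, 2], 'b': [3, 0]})
df3 = pd.DataFrame('', index=df1.index, columns=df1.columns)

v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
print(df3)
#          a        b
# 0  NEITHER  NEITHER
# 1      OLD      NEW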
Yeah, try to avoid loops in pandas; they're inefficient, and pandas is built to be used with the underlying numpy vectorization.
You want to use the apply function.
Something like:
df3['new_col'] = df3.apply(lambda x: customFunction(x))
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
