Pandas Error Combining Conditions in Where Clause - python

I have a column with binary flag values and I'm trying to clean it up when there are mistakes. A mistake is when a particular group contains both 0s and 1s; my rule is that this column can only contain all 0s or all 1s within a group. I'm trying to come up with an np.where() clause that tests whether a group's Flag column holds more than one distinct value and whether the first value of that column in the group isn't 1. If the first value of the group isn't 1 and there's a mix of values, flip them all to 0 in that group.
Here's what I'm trying:
df['Flag'] = np.where((df.groupby('CombBitSeq')['Flag'].transform('std') != 0) & (df.groupby('CombBitSeq')['Flag'].nth(0) != 1), 0, df['Flag'])
The error I'm getting is this, and I'm not sure how the lengths of the two combined conditions are off by 1:
ValueError: operands could not be broadcast together with shapes (336661,) () (336660,)

If you want to get the first item per group and broadcast it across the entire group, use groupby + transform with head instead of nth. nth(0) returns one row per group rather than a value aligned to every row of the dataframe, which is why the two conditions end up with mismatched lengths; transform broadcasts the per-group result back onto the original index:
df.groupby('CombBitSeq')['Flag'].transform('head', 1)
Your condition now becomes:
g = df.groupby('CombBitSeq')['Flag'] # let's compute this only once
df['Flag'] = np.where(
    g.transform('std').ne(0) & g.transform('head', 1).ne(1), 0, df['Flag']
)
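A minimal reproducible sketch with made-up data (it uses transform('first'), which likewise broadcasts each group's first value back onto every row):
import numpy as np
import pandas as pd

df = pd.DataFrame({'CombBitSeq': ['a', 'a', 'a', 'b', 'b'],
                   'Flag':       [0,   1,   0,   1,   1]})

g = df.groupby('CombBitSeq')['Flag']
mixed = g.transform('std').ne(0)            # group contains both 0s and 1s
first_not_one = g.transform('first').ne(1)  # first value in the group isn't 1

df['Flag'] = np.where(mixed & first_not_one, 0, df['Flag'])
print(df)   # group 'a' is flipped to all zeros, group 'b' is untouched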

Related

Xarray: grouping by contiguous identical values

In Pandas, it is simple to slice a series (or array) such as [1,1,1,1,2,2,1,1,1,1] to return groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where I obtain the individual groups via df[key][variable] == some condition. Groups that satisfy the same condition but are not contiguous are treated as separate groups. If the condition were x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
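For reference, here is a minimal sketch of that pandas idiom on a plain Series, assuming the condition is x < 2 as in the example:
import pandas as pd

s = pd.Series([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

# Keep only the values meeting the condition, then split the surviving index
# wherever it jumps by more than 1: each contiguous run gets its own id.
kept = s[s < 2]
run_id = kept.index.to_series().diff().ne(1).cumsum()
print([grp.tolist() for _, grp in kept.groupby(run_id)])   # [[1, 1, 1, 1], [1, 1, 1, 1]]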
I am attempting to do the same thing in xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work.
What I have managed to do so far:
a) apply some condition to separate the values I want by NaNs:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have groups as in the example above, [1,1,1,1,NaN,NaN,1,1,1,1] (if the condition was x < 2). The question is, how do I cut these groups apart so that they become [1,1,1,1], [1,1,1,1]?
b) Alternatively, group by some condition...
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with the groups [1,1,1,1,1,1,1,1] and [2,2]. Is there a way to then group those groups by non-contiguous index values?
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas-style functionality with something like:
import pandas as pd
import xarray as xr
dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')
# Use `diff()` to get groups of contiguous values
(dat.diff('x') != 0)
# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')
# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum() )
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
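Pulling those steps together into one runnable sketch; the only addition is giving the grouper a name (here 'run', an arbitrary choice) so it can be passed to groupby:
import xarray as xr

dat = xr.DataArray([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dims='x')

# A new run starts wherever the value differs from its predecessor;
# cumsum over that boolean turns run starts into run ids.
run_id = xr.concat([xr.DataArray(0), dat.diff('x') != 0], 'x').cumsum().rename('run')

for label, run in dat.groupby(run_id):
    print(label, run.values)
# 0 [1 1 1 1]
# 1 [2 2]
# 2 [1 1 1 1]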
The xarray "How do I ..." documentation page could use some recipes like this ("group contiguous values"); I suggest you contact the maintainers and have them added.
My use case was a bit more complicated than the minimal example I posted, due to the use of time-series indices and the desire to subselect on certain conditions; however, I was able to adapt smci's answer above in the following way:
(1) create indexnumber variable:
from numpy import arange
from xarray import Dataset

df = Dataset(
    data_vars={
        'some_data'   : (('date'), some_data),
        'more_data'   : (('date'), more_data),
        'indexnumber' : (('date'), arange(0, len(date_arr)))
    },
    coords={
        'date' : date_arr
    }
)
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby('sumcum')
Hope this helps anyone else out there looking to do this.

Filtering for rows in a Pandas dataframe containing at least one zero

I am trying to delete all rows in a Pandas data frame that don't have a zero in either of two columns. My data frame is indexed from 0 to 620. This is my code:
for index in range(0, 621):
    if((zeroes[index,1] != 0) and (zeroes[index,3] != 0)):
        del(zeroes[index,])
I keep getting a key error.
KeyError: (0, 1)
My instructor suggested I change the range to test to see if I have bad lines in my data frame. I did. I checked the tail of my dataframe and then changed the range to (616, 621). Then I got the key error: (616, 1).
Does anyone know what is wrong with my code or why I am getting a key error?
This code also produces a key error of (0,1):
index = 0
while (index < 621):
    if((zeroes[index,1] != 0) and (zeroes[index,3] != 0)):
        del(zeroes[index,])
    index = index + 1
Don't use a manual loop here. Your error occurs because zeroes[index,1] calls df.__getitem__((index, 1)), and indexing a dataframe with a tuple like that looks up a column labelled (index, 1), which doesn't exist, hence the KeyError.
Instead, use vectorised operations and Boolean indexing. For example, to keep only the rows where column 1 or column 3 (by position) contains a zero:
df = df[df.iloc[:, [1, 3]].eq(0).any(1)]
This works because eq(0) creates a dataframe of Boolean values indicating equality to zero, and any(1) reduces each row to True if it contains at least one True value; that Boolean Series then selects the rows to keep.
The full form is df.iloc[:, [1, 3]].eq(0).any(axis=1), or df.iloc[:, [1, 3]].eq(0).any(axis='columns') for even more clarity. See the docs for pd.DataFrame.any for more details.
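For instance, with a small made-up frame standing in for zeroes (the values here are purely illustrative):
import pandas as pd

zeroes = pd.DataFrame([[5, 0, 7, 3],
                       [5, 2, 7, 3],
                       [5, 2, 7, 0]])

# Keep the rows that have a zero in positional column 1 or column 3.
kept = zeroes[zeroes.iloc[:, [1, 3]].eq(0).any(axis=1)]
print(kept)   # rows 0 and 2 survive; row 1 has no zero in columns 1 or 3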

Getting rid of outliers rows in multiple columns pandas dataframe

I have a pandas data frame with many columns (>100). I standardized all the columns, so every column is centered at 0 (mean 0 and std 1). I want to get rid of every row that has a value below -2 or above 2 in any of the columns. By this I mean: say in the first column rows 2, 3, 4 are outliers and in the second column rows 3, 4, 5, 6 are outliers; then I would like to get rid of the rows [2,3,4,5,6].
What I am trying to do is use a for loop to pass over every column, collect the row indexes that are outliers and store them in a list. At the end I have a list containing, for every column, the list of its outlier row indexes, and I take the unique values to obtain the row indexes I should get rid of. My problem is that I don't know how to slice the data frame so it doesn't contain these rows. I was thinking of using something like an %in% operator, but it doesn't accept a list of lists. I show my code below.
### Getting rid of the outliers
'''
We are going to get rid of the outliers who are outside the range of -2 to 2.
'''
aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []
for i in range(n_cols):
    variable = aux_features[:, i]                 # We take one column at a time
    condition = (variable < -2) | (variable > 2)  # We establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)
outliers = [j for i in outliers_index for j in i]
outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2))  # This is the final list with all the indexes that contain outliers.
total_index = list(range(n_rows))
aux = (total_index in unique_index)
outliers_2 contains all the row indexes (with repetitions), and unique_index then keeps only the unique values, so I end up with every row index that holds an outlier. I am stuck at this part. Does anyone know how to complete it, or have a better idea of how to get rid of these outliers? (I guess my method would be very time-consuming for really large datasets.)
df = pd.DataFrame(np.random.standard_normal(size=(1000, 5))) # example data
cleaned = df[~(np.abs(df) > 2).any(1)]
Explanation:
Compare the absolute values against 2 to flag entries above 2 or below -2. This returns a dataframe of Booleans:
np.abs(df) > 2
Check if row contains outliers. Evaluates to True for each row where an outlier exists:
(np.abs(df) > 2).any(1)
Finally, select all rows without an outlier using the ~ operator:
df[~(np.abs(df) > 2).any(1)]
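Applied to the question's set-up, the same pattern would look like this (a minimal sketch; features_scaled is assumed to be the standardized frame, and is generated randomly here only as a stand-in):
import numpy as np
import pandas as pd

features_scaled = pd.DataFrame(np.random.standard_normal(size=(1000, 120)))  # stand-in data

mask = (features_scaled.abs() > 2).any(axis=1)   # True for every row containing an outlier
cleaned = features_scaled[~mask]                 # keep only rows with all values in [-2, 2]
print(len(features_scaled), '->', len(cleaned))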

Why is there no order invariance between the two sets of operations?

I'm handling a CSV file/pandas dataframe, where the first column contains the date.
I want to do some conversion here to datetime, some filtering, sorting and reindexing.
What I experience is that if I change the order of the two sets of operations, I get different results (the result of the first configuration is bigger than that of the other one). Probably the first one is the "good" one.
Can anyone tell me which sub-operations cause the difference between the results?
Which of these is the "bad" and which is the "good" solution?
Is it possible to guarantee order independence, so that the user can call these two methods in any order and still get the good results? (Is it possible to get the good results by implementing interchangeable sets of operations?)
jdf1 = x.copy(deep=True)
jdf2 = x.copy(deep=True)
interval = [DATE_START, DATE_END]
dateColName = "Date"
Configuration 1:
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf1.drop_duplicates(subset=dateColName, inplace=True)
jdf1.sort_values(dateColName, inplace=True)
jdf1.reset_index(drop=True, inplace=True)
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf1[dateColName] = pd.to_datetime(jdf1[jdf1.columns[0]], format="%Y-%m-%d")
maskL = jdf1[dateColName] < interval[0]
maskR = jdf1[dateColName] > interval[1]
mask = maskL | maskR
jdf1.drop(jdf1[mask].index, inplace=True)
vs.
Configuration 2:
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf2[dateColName] = pd.to_datetime(jdf2[jdf2.columns[0]], format="%Y-%m-%d")
maskL = jdf2[dateColName] < interval[0]
maskR = jdf2[dateColName] > interval[1]
mask = maskL | maskR
jdf2.drop(jdf2[mask].index, inplace=True)
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf2.drop_duplicates(subset=dateColName, inplace=True)
jdf2.sort_values(dateColName, inplace=True)
jdf2.reset_index(drop=True, inplace=True)
Results:
val1 = set(jdf1["Date"].values)
val2 = set(jdf2["Date"].values)
# bigger:
val1 - val2
# empty:
val2 - val1
Thank you for your help!
At first glance the two configurations look the same, but they are not, because the two filtering steps affect each other:
Configuration 1:
drop_duplicates() -> removes M rows, leaving ALL - M rows
boolean indexing with mask -> removes N rows, leaving ALL - M - N rows
Configuration 2:
boolean indexing with mask -> removes K rows, leaving ALL - K rows
drop_duplicates() -> removes L rows, leaving ALL - K - L rows
In general K != M and L != N.
So if you swap the operations, the result can differ, because both steps remove rows and the order in which they are called matters: some rows are removed only by drop_duplicates, others only by the boolean indexing.
In my opinion both methods are right; it depends on what you need.

Complex dataframe filtering request on the last occurrence of a value in Pandas/Python [EDIT]

I'm having a hard time with a complex dataframe filter.
Here is the problem:
For each set of rows with the same 'id' value, the column 'job' can take the values 'fireman', NaN or 'policeman'.
I would like to filter my dataframe so that, for each id,
I keep only the rows starting from where the value 'fireman' for job occurs for the last consecutive time. I think I first have to group by 'job' values to filter on:
df.groupby("job").filter(lambda x: f(x))
I don't know which function f is appropriate.
Any ideas ?
To try:
df = pd.DataFrame([[79,1,], [79,2,'fireman'],[79,3,'fireman'],[79,4,],[79,5,],[79,6,'fireman'],[79,7,'fireman'],[79,8,'policeman']], columns=['id','day','job'])
output = pd.DataFrame([[79,6,'fireman'],[79,7,'fireman'],[79,8,'policeman']], columns=['id','day','job'])
Here is a version that doesn't need any extra variables:
df.groupby('imo').apply(lambda grp: grp[grp.index >=
((grp.polygon.shift() != grp.polygon) &
(grp.polygon.shift(-1) == grp.polygon) &
(grp.polygon == 'FE')
).cumsum().idxmax()]
).reset_index(level=0, drop=True)
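The column names in that snippet (imo, polygon, 'FE') come from a different dataframe than the example in the question; mapping them onto id, job and 'fireman' gives the sketch below. The mask marks the first row of each consecutive 'fireman' run, cumsum numbers those runs, and idxmax returns the index where the last run starts, so every row from that point onward is kept:
import pandas as pd

df = pd.DataFrame([[79, 1, None], [79, 2, 'fireman'], [79, 3, 'fireman'],
                   [79, 4, None], [79, 5, None], [79, 6, 'fireman'],
                   [79, 7, 'fireman'], [79, 8, 'policeman']],
                  columns=['id', 'day', 'job'])

result = df.groupby('id').apply(lambda grp: grp[grp.index >=
    ((grp.job.shift() != grp.job) &      # value changed compared to the previous row
     (grp.job.shift(-1) == grp.job) &    # ...and is repeated on the next row
     (grp.job == 'fireman')              # ...and that value is 'fireman'
    ).cumsum().idxmax()]                 # start of the last such run
).reset_index(level=0, drop=True)

print(result)   # days 6, 7 and 8, matching the expected output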
