Dynamically make comparisons between columns in pandas - python

I have a large DataFrame from which I need to drop rows that don't satisfy a boolean criterion, but the criterion may involve several dozen columns.
I have the following, which works if I copy and paste the column names:
df = df[~( (df['FirstCol'] > df['SecondCol']) |
           (df['ThirdCol'] > df['FifthCol']) |
           ...
           (df['FiftiethCol'] > df['TweniethCol']) |
           (df['ThisCouldBeHundredsCol'] > df['LastOne'])
         )]
However, I want to do this with less code. Suppose I have the column names that need to be compared in a list, like so:
list_of_comparison_cols = ['FirstCol', 'SecondCol', 'ThirdCol', 'FifthCol', ..., 'FiftiethCol', 'TweniethCol', 'ThisCouldBeHundredsCol', 'LastOne']
How would I go about doing this as concisely and dynamically as possible?
Many thanks.

You can do it by taking every second element of your list: [::2] gives ['FirstCol', 'ThirdCol', ...] (the left-hand columns) and [1::2] gives ['SecondCol', 'FifthCol', ...] (the right-hand columns). Use these to select the columns and compare their to_numpy arrays on both sides of the inequality. Then apply any over axis=1, which corresponds to the | used in your condition.
# example
import numpy as np
import pandas as pd

list_of_comparison_cols = ['FirstCol', 'SecondCol', 'ThirdCol', 'FifthCol',
                           'FiftiethCol', 'TweniethCol', 'ThisCouldBeHundredsCol',
                           'LastOne']
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 50, 8*10).reshape(10, 8),
                  columns=list_of_comparison_cols)

# create the mask
mask = (df[list_of_comparison_cols[::2]].to_numpy()
        > df[list_of_comparison_cols[1::2]].to_numpy()
       ).any(axis=1)
print(df[~mask])
FirstCol SecondCol ThirdCol FifthCol FiftiethCol TweniethCol \
0 44 47 0 3 3 39
ThisCouldBeHundredsCol LastOne
0 9 19
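Equivalently, if you prefer to keep the pairs explicit, here is a sketch that OR-reduces one comparison per column pair (same assumption as above: the list alternates left-hand and right-hand columns):

pairs = zip(list_of_comparison_cols[::2], list_of_comparison_cols[1::2])
mask = np.logical_or.reduce([(df[left] > df[right]).to_numpy()
                             for left, right in pairs])
print(df[~mask])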

Related

Creating a Numpy Array from two Series

I have a DataFrame called "segments" that looks like the one below:
   ORIGIN_AIRPORT_ID  DEST_AIRPORT_ID  FL_COUNT  ORIGIN_INDEX  DEST_INDEX  OUTDEGREE    WEIGHT
0              10135            10397        77           119         373          3  0.333333
1              10135            11433        85           119        1375          3  0.333333
Using this, I created two Boolean Series objects: one marking the ORIGIN_INDEX groups whose WEIGHT values are all 0, and one marking those whose WEIGHT values are all nonzero:
Zeroes = (segments['WEIGHT'] == 0).groupby(segments['ORIGIN_INDEX']).all()
Non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()
I want to do two things (because I'm not sure which this task needs):
Create a NumPy vector where all "True" values in the Non_zeroes Series are set to 1/4191 (~0.024%) and all "True" values in the Zeroes Series are set to 0 (or the same logic using the True and False values of a single Series), keeping the IDs (e.g. ORIGIN_INDEX 119: 0.024%, etc.)
And I'd also like to create a NumPy vector that is JUST a list of the percentages and zeroes WITHOUT the IDs
EDIT to add extra detail requested!
I tried using a condition as a variable, then using .loc to apply it:
cond_array = copied.WEIGHT is not 0
df.loc[cond_array, ID] = 1/4191
I tried using from_coo(), toarray(), and DataFrame to convert:
pd.Series.sparse.from_coo(P, dense_index=True)
P.toarray()
pd.DataFrame(P)
Finally, I tried applying logic to the DF instead of the COO Matrix. I THINK this gets close, but it is still failing. I believe it fails because it is not including the 0s (copied is just a DF that's a copy of segments):
copied['WEIGHT'] = copied.loc[copied['WEIGHT'] != 0, 'WEIGHT'] = float((1/len(copied))) #0.00023860653
The last code passes the first two tests (testing that it's an array and that it sums to 1.0), but fails the last:
assert np.isclose(x0.max(), 1.0/n_actual, atol=10*n*np.finfo(float).eps), "x0 values seem off..."
EDIT 2:
Had the wrong count. It was supposed to be 1/300, not 1/4191. All fixed now, thanks all who took a look :)
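For the record, here is a minimal sketch of one way to build both vectors, using the column names from the question and n = 300 per EDIT 2; the per-origin .any() logic is an assumption about what was intended:

import numpy as np

# True for each ORIGIN_INDEX that has at least one nonzero WEIGHT
has_nonzero = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).any()

n = 300  # per EDIT 2; the question originally used 4191
with_ids = has_nonzero.astype(float) / n   # Series: keeps the ORIGIN_INDEX labels
just_values = with_ids.to_numpy()          # plain NumPy vector, without the IDs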

How to remove a list of columns from pydatatable dataframe?

I have a datatable Frame created as:
import datatable as dt

comidas_gen_dt = dt.Frame({
    'country': list('ABCDE'),
    'id': [1, 2, 3, 4, 5],
    'egg': [10, 20, 30, 5, 40],
    'veg': [30, 40, 10, 3, 5],
    'fork': [5, 10, 2, 1, 9],
    'beef': [90, 50, 20, None, 4]})
I have created a custom function to select a list of required columns from a frame DT:
def pydt_select_cols(DT, *dt_cols):
    return DT[:, [*dt_cols]]
So, here is the recommended syntax to remove columns from DT:
DT[:, f[:].remove([f.a, f.b, f.c])]
Following the above syntax, I've created another custom function to remove a list of columns:
def pydt_remove_cols(DT, *rmcols):
    dt_cols = [*rmcols]
    return DT[:, f[:].remove(dt_cols)]
I'm executing the function as
pydt_remove_cols(comidas_gen_dt, 'id', 'country', 'egg')
and it's throwing the error
TypeError: Computed columns cannot be used in .remove()
Could you please help me figure out how to proceed?
Removing columns (or rows) from a Frame is easy: take any syntax that you would normally use to select those columns, and then apply the python del keyword.
Thus, if you want to delete columns 'id', 'country', and 'egg', run
>>> del comidas_gen_dt[:, ['id', 'country', 'egg']]
>>> comidas_gen_dt
   | veg  fork  beef
-- + ---  ----  ----
 0 |  30     5    90
 1 |  40    10    50
 2 |  10     2    20
 3 |   3     1    NA
 4 |   5     9     4

[5 rows x 3 columns]
If you want to keep the original frame unmodified, and then select a new frame with some of the columns removed, then the easiest way would be to first copy the frame, and then use the del operation:
>>> DT = comidas_gen_dt.copy()
>>> del DT[:, columns_to_remove]
(note that .copy() makes a shallow copy, i.e. its cost is typically negligible).
You can also use the f[:].remove() approach. It's a bit strange that it didn't work the way you've written it, but going from a list of strings to a list of f-symbols is quite straightforward:
def pydt_remove_cols(DT, *rmcols):
    return DT[:, f[:].remove([f[col] for col in rmcols])]
Here I use the fact that f.A is the same as f["A"], where the inner string "A" might as well be replaced with any variable.
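For example, a quick usage sketch (assuming a freshly created comidas_gen_dt and that f has been imported):

from datatable import f

trimmed = pydt_remove_cols(comidas_gen_dt, 'id', 'country', 'egg')  # new frame
# comidas_gen_dt itself is left unmodified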

Pandas Create DataFrame with two lists behaving differently

I am trying to create a pandas data frame using two lists, and the output is erroneous for certain lengths of the lists (this is not due to mismatched lengths).
Here are two cases, one that works as expected and one that doesn't (commented out):
import string
from pandas import DataFrame

d = dict.fromkeys(string.ascii_lowercase, 0).keys()
groups = sorted(d)[:3]
numList = range(0,4)
# groups = sorted(d)[:20]
# numList = range(0,25)
df = DataFrame({'Number':sorted(numList)*len(groups), 'Group':sorted(groups)*len(numList)})
df.sort_values(['Group', 'Number'])
Expected output: every item in groups corresponds to all items in numList
Group Number
a 0
a 1
a 2
a 3
b 0
b 1
b 2
b 3
c 0
c 1
c 2
c 3
Actual results: this works when the lists are sized 3 and 4, but not 20 and 25 (I have commented out that case in the above code).
Why is that, and how can I fix it?
If I understand this correctly, you want to make a dataframe which will have all pairs of groups and numbers. That operation is called a Cartesian product.
Your approach happens to work when the two lengths are coprime (as 3 and 4 are, and as any two lengths differing by exactly 1 are), because the repeated lists then cycle through every combination exactly once; for 20 and 25, which share the factor 5, some pairs repeat and others never appear. For the general case, you want to do this:
df1 = DataFrame({'Number': sorted(numList)})
df2 = DataFrame({'Group': sorted(groups)})
df = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop(columns='key')
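On pandas 1.2 or newer you can skip the dummy key entirely, since merge supports a cross join directly:

df = df1.merge(df2, how='cross')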
And a note about sorting dataframes: remember that in pandas, most DataFrame operations return a new DataFrame by default and don't modify the old one, unless you pass the inplace=True parameter.
So you should do
df = df.sort_values(['Group', 'Number'])
or
df.sort_values(['Group', 'Number'], inplace=True)
and it should work now.

Pandas Multiindex df - slicing multiple sub-ranges of an index

I have a dataframe that looks likes this:
Sweep      Index
Sweep0001  0        -70.434570
           1        -67.626953
           2        -68.725586
           3        -70.556641
           4        -71.899414
           5        -69.946289
           6        -63.964844
           7        -73.974609
...
Sweep0039  79985    -63.964844
           79986    -66.406250
           79987    -67.993164
           79988    -68.237305
           79989    -66.894531
           79990    -71.411133
I want to slice out different ranges of Sweeps.
So for example, I want Sweep0001 : Sweep0003, Sweep0009 : Sweep0015, etc.
I know I can do this in separate lines with ix, i.e.:
df.ix['Sweep0001':'Sweep0003']
df.ix['Sweep0009':'Sweep0015']
And then put those back together into one dataframe (I'm doing this so I can average sweeps together, but I need to select some of them and remove others).
Is there a way to do that selection in one line though? I.e. without having to slice each piece separately, followed by recombining all of it into one dataframe.
Use Pandas IndexSlice
import pandas as pd
idx = pd.IndexSlice
df.loc[idx[["Sweep0001", "Sweep0002", "Sweep0003", "Sweep0009", ..., "Sweep0015"]]]
You can retrieve the labels you want this way:
list1 = df.index.get_level_values(0).unique()
list2 = [x for x in list1]
list3 = list2[1:4]          # for your Sweep0001:Sweep0003
list3.extend(list2[9:16])   # for your Sweep0009:Sweep0015
df.loc[idx[list3]]          # note that you need one set of "[]" less around
                            # "list3", as this list comes with its own "[]"
In case you want to also slice by columns you can use:
df.loc[idx[list3], :]                  # same as above, includes all columns
df.loc[idx[list3], :"column label"]    # returns data up to that "column label"
More information on slicing is on the Pandas website (http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers) or in this similar Stackoverflow Q/A: Python Pandas slice multiindex by second level index (or any other level)
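Alternatively, here is a sketch (not from the original answers) that stitches several label ranges together in one expression with pd.concat:

import pandas as pd

subset = pd.concat([df.loc['Sweep0001':'Sweep0003'],
                    df.loc['Sweep0009':'Sweep0015']])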

Adding calculated column(s) to a dataframe in pandas

I have an OHLC price data set, that I have parsed from CSV into a Pandas dataframe and resampled to 15 min bars:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 500047 entries, 1998-05-04 04:45:00 to 2012-08-07 00:15:00
Freq: 15T
Data columns:
Close 363152 non-null values
High 363152 non-null values
Low 363152 non-null values
Open 363152 non-null values
dtypes: float64(4)
I would like to add various calculated columns, starting with simple ones such as the period range (H - L), and then booleans indicating the occurrence of price patterns that I will define - e.g. a hammer candle pattern, for which a sample definition follows:
def closed_in_top_half_of_range(h, l, c):
    return c > l + (h-l)/2

def lower_wick(o, l, c):
    return min(o, c) - l

def real_body(o, c):
    return abs(c-o)

def lower_wick_at_least_twice_real_body(o, l, c):
    return lower_wick(o, l, c) >= 2 * real_body(o, c)

def is_hammer(row):
    return lower_wick_at_least_twice_real_body(row["Open"], row["Low"], row["Close"]) \
       and closed_in_top_half_of_range(row["High"], row["Low"], row["Close"])
Basic problem: how do I map the function to the column, specifically where I would like to reference more than one other column or the whole row or whatever?
This post deals with adding two calculated columns off of a single source column, which is close, but not quite it.
And slightly more advanced: for price patterns that are determined with reference to more than a single bar (T), how can I reference different rows (e.g. T-1, T-2 etc.) from within the function definition?
The exact code will vary for each of the columns you want to add, but it's likely you'll want to use the map and apply functions. In some cases you can just compute using the existing columns directly, since the columns are Pandas Series objects, which also work as NumPy arrays and automatically operate element-wise for the usual mathematical operations.
>>> d
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
>>> (d.A + d.B) / d.C
0 4.800000
1 3.250000
2 1.833333
3 1.714286
4 0.142857
>>> d.A > d.C
0 True
1 True
2 True
3 False
4 False
If you need to use operations like max and min within a row, you can use apply with axis=1 to apply any function you like to each row. Here's an example that computes min(A, B)-C, which seems to be like your "lower wick":
>>> d.apply(lambda row: min([row['A'], row['B']])-row['C'], axis=1)
0 6
1 2
2 -3
3 -3
4 -7
Hopefully that gives you some idea of how to proceed.
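For the question's is_hammer specifically, the row-wise version defined in the question could be applied the same way (a sketch using apply):

df['Hammer'] = df.apply(is_hammer, axis=1)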
Edit: to compare rows against neighboring rows, the simplest approach is to slice the underlying arrays so you drop the beginning or end, then compare the slices positionally. (Use the raw NumPy arrays here: comparing two pandas Series aligns them by index, which would undo the shift.) For instance, this will tell you for which rows the element in column A is less than the next row's element in column C:
d['A'].to_numpy()[:-1] < d['C'].to_numpy()[1:]
and this does it the other way, telling you which rows have A less than the preceding row's C:
d['A'].to_numpy()[1:] < d['C'].to_numpy()[:-1]
Slicing with [:-1] drops the last element of column A, and [1:] drops the first element of column C, so when you line these two up and compare them, you're comparing each element in A with the C from the following row.
You could have is_hammer in terms of row["Open"] etc. as follows
def is_hammer(rOpen, rLow, rClose, rHigh):
    return lower_wick_at_least_twice_real_body(rOpen, rLow, rClose) \
       and closed_in_top_half_of_range(rHigh, rLow, rClose)
Then you can use map (wrapped in list, since map returns an iterator in Python 3):
df["isHammer"] = list(map(is_hammer, df["Open"], df["Low"], df["Close"], df["High"]))
For the second part of your question, you can also use shift, for example:
df['t-1'] = df['t'].shift(1)
t-1 would then contain the values from t one row above.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
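For instance, a hypothetical two-bar condition (the pattern itself is made up here; the column names are from the question):

df['higher_close'] = df['Close'] > df['High'].shift(1)  # close above previous bar's high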
The first four functions you list will work on vectors as well, with the exception that lower_wick needs to be adapted. Something like this,
import numpy

def lower_wick_vec(o, l, c):
    min_oc = numpy.where(o > c, c, o)
    return min_oc - l
where o, l and c are vectors.
You could do it this way instead, which just takes the df as input and avoids using numpy, although it will be much slower:
def lower_wick_df(df):
    min_oc = df[['Open', 'Close']].min(axis=1)
    return min_oc - df['Low']
The other three will work on columns or vectors just as they are. Then you can finish off with
def is_hammer(df):
    lw = lower_wick_at_least_twice_real_body(df["Open"], df["Low"], df["Close"])
    cl = closed_in_top_half_of_range(df["High"], df["Low"], df["Close"])
    return cl & lw
Bitwise operators perform element-wise logic on boolean vectors: & for and, | for or, etc. This is enough to completely vectorize the sample calculations you gave, and it should be relatively fast. You could probably speed things up even more by temporarily working with the NumPy arrays underlying the data while performing these calculations.
For the second part, I would recommend introducing a column indicating the pattern for each row and writing a family of functions which deal with each pattern. Then groupby the pattern and apply the appropriate function to each group.
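A minimal sketch of that last suggestion (the 'pattern' column and the use of np.where are assumptions for illustration, not the original answer's code):

import numpy as np

df['pattern'] = np.where(is_hammer(df), 'hammer', 'none')  # tag each bar with a pattern label
pattern_counts = df.groupby('pattern').size()              # e.g. count occurrences per pattern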
