I have a datatable Frame created as:
comidas_gen_dt = dt.Frame({
'country':list('ABCDE'),
'id':[1,2,3,4,5],
'egg':[10,20,30,5,40],
'veg':[30,40,10,3,5],
'fork':[5,10,2,1,9],
'beef':[90,50,20,None,4]})
I have created a custom function to select a list of required columns from a frame DT as,
def pydt_select_cols(DT, *rmcols):
return DT[:, *dt_cols]
So, here is the recommend syntax to remove columns from DT:
DT[:, f[:].remove([f.a, f.b, f.c])
following the above syntax of DT, I've create another custom function to keep a side a list of columns as
def pydt_remove_cols(DT, *rmcols):
dt_cols = [*rmcols]
return DT[:, f[:].remove(dt_cols)]
I'm executing the function as
pydt_remove_cols(comidas_gen_dt, 'id', 'country', 'egg')
and it's throwing the error
TypeError: Computed columns cannot be used in .remove()
Could you please help me how to go ahead with it?
Removing columns (or rows) from a Frame is easy: take any syntax that you would normally use to select those columns, and then append the python del keyword.
Thus, if you want to delete columns 'id', 'country', and 'egg', run
>>> del comidas_gen_dt[:, ['id','country','egg']]
>>> comidas_gen_dt
| veg fork beef
-- + --- ---- ----
0 | 30 5 90
1 | 40 10 50
2 | 10 2 20
3 | 3 1 NA
4 | 5 9 4
[5 rows x 3 columns]
If you want to keep the original frame unmodified, and then select a new frame with some of the columns removed, then the easiest way would be to first copy the frame, and then use the del operation:
>>> DT = comidas_gen_dt.copy()
>>> del DT[:, columns_to_remove]
(note that .copy() makes a shallow copy, i.e. its cost is typically negligible).
You can also use the f[:].remove() approach. It's a bit strange that it didn't work the way you've written it, but going from a list of strings to a list of f-symbols is quite straightforward:
def pydt_remove_cols(DT, *rmcols):
return DT[:, f[:].remove([f[col] for col in rmcols])]
Here I use the fact that f.A is the same as f["A"], where the inner string "A" might as well be replaced with any variable.
Related
I have a large database in which I need to drop entries that don't satisfy a boolean criteria, but the criteria may involve several dozen columns.
I have the following which works with copying and pasting the names
df = df[~( (df['FirstCol'] > df['SecondCol']) |
(df['ThirdCol'] > df['FifthCol']) |
...
(df['FiftiethCol'] > df['TweniethCol']) |
(df['ThisCouldBeHundredsCol'] > df['LastOne'])
)]
However, I want to be able to do this in shorter amounts of code. If I have the column names that need to be compared in a list, like so
list_of_comparison_cols = ['FirstCol', 'SecondCol', 'ThirdCol', 'FifthCol', ..., 'FiftiethCol', 'TweniethCol', 'ThisCouldBeHundredsCol', 'LastOne']
How would I go about doing this in as little code and more dynamically as possible?
Many thanks.
You can do it by selecting every two elements of your list with [::2] to get ['FirstCol', 'ThirdCol',...] and [1::2] to get ['SecondCol', 'FifthCol', .... Use it to select the columns and compare to_numpy arrays between both side of the inequality. Then use any over axis=1 that correspond to the | used in your condition.
#example
list_of_comparison_cols = ['FirstCol', 'SecondCol', 'ThirdCol', 'FifthCol',
'FiftiethCol', 'TweniethCol', 'ThisCouldBeHundredsCol',
'LastOne']
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50,8*10).reshape(10,8),
columns=list_of_comparison_cols)
# create the mask
mask = (df[list_of_comparison_cols[::2]].to_numpy()
>df[list_of_comparison_cols[1::2]].to_numpy()
).any(1)
print (df[~mask])
FirstCol SecondCol ThirdCol FifthCol FiftiethCol TweniethCol \
0 44 47 0 3 3 39
ThisCouldBeHundredsCol LastOne
0 9 19
How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns indices labels, not integers.
Example': if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
if you want the integer position of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
back as of Pandas 0.16, argmax used to exist and perform the same function (though appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.iloc[dfrm['A'].idxmax()] # .ix instead of .iloc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both above answers would only return one index if there are multiple rows that take the maximum value. If you want all the rows, there does not seem to have a function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() would provide the index corresponding to the max value for the columnX. iloc can be used to get the row of the DataFrame df for this index.
A more compact and readable solution using query() is like this:
import pandas as pd
df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
The direct ".argmax()" solution does not work for me.
The previous example provided by #ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So that my solution is :
df['A'].values.argmax()
mx.iloc[0].idxmax()
This one line of code will give you how to find the maximum value from a row in dataframe, here mx is the dataframe and iloc[0] indicates the 0th index.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one want to know the rows where column "C" is max, the following will do the work
[In]: df[df['C']==df['C'].max()])
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idmax of the DataFrame returns the label index of the row with the maximum value and the behavior of argmax depends on version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that if you use np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.iloc[data['A'].idxmax()]
data['A'].idxmax() -finds max value location in terms of row
data.iloc() - returns the row
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
what worked for me is:
df[df['colX'] == df['colX'].max()
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.
I have a Pandas data frame which is MultiIndexed. The second level contains a year ([2014,2015]) and the third contains the month number ([1, 2, .., 12]). I would like to merge these two into a single level like - [1/2014, 2/2014 ..., 6/2015]. How could this be done?
I'm new to Pandas. Searched a lot but could not find any similar question/solution.
Edit: I found a way to avoid having to do this altogether with the answer to this question. I should have been creating my data frame that way. This seems to be the way to go for indexing by DateTime.
Consider the pd.MultiIndex and pd.DataFrame, mux and df
mux = pd.MultiIndex.from_product([list('ab'), [2014, 2015], range(1, 3)])
df = pd.DataFrame(dict(A=1), mux)
print(df)
A
a 2014 1 1
2 1
2015 1 1
2 1
b 2014 1 1
2 1
2015 1 1
2 1
We want to reassign to the index a list if lists that represent the index we want.
I want the 1st level the same
df.index.get_level_values(0)
I want the new 2nd level to be a string concatenation of the current 2nd and 3rd levels but reverse the order
df.index.map('{0[2]}/{0[1]}'.format)
df.index = [df.index.get_level_values(0), df.index.map('{0[2]}/{0[1]}'.format)]
print(df)
A
a 1/2014 1
2/2014 1
1/2015 1
2/2015 1
b 1/2014 1
2/2014 1
1/2015 1
2/2015 1
You can use a list comprehension to restructure your index. For example, if you have a 3 levels index and you want to combine the second and the third levels:
lst = [(i, f'{k}/{j}') for i, j, k in df.index]
df.index = pd.MultiIndex.from_tuples(lst)
This is just an explanation to the answer of piRSquared.
df.index.map('{0[2]}/{0[1]}'.format)
the map() method has one argument, which is a callback that is executed on each element of the index. In this example, the method happens to be the python built-in str.format function.
The format function is pretty mighty and has a lot of functionality (see the docs). One of those functions is to refer to positional arguments by specifying their position. This means that
"Hello {1}, I am {0}, how are you?".format("Bob", "Alice")
--> Hello Alice, I am Bob, how are you?
That's where the zero in piRSquared's answer comes from.
Normally, it is not required if only one argument is replaced in the string:
"Hello {}".format("Bob")
--> Hello Bob
However, in this case, two additional features are required:
using an element multiple times in the same string, and
selecting a sub-element from the argument.
Since the map method will pass a single index entry as argument to the format function, "{0[2]}" refers to the third element of that index.
Now the index in the original questions has three levels, so each argument passed to the format function is a tuple containing the three elements corresponding to the row's index.
A more verbose, but equivalent solution would be:
df.index.map(lambda idx: str(idx[2]) + '/' + str(idx[1]))
or
df.index.map(lambda idx: f'{idx[2]}/{idx[1]}')
To pass multiple variables to a normal python function you can just write something like:
def a_function(date,string,float):
do something....
convert string to int,
date = date + (float * int) days
return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col']) = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date']) = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one,two,three):
return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','b'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one,two,three):
return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific rows, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
11: Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
In this situation, this is the only option, since there is no vectorized way to access the individual date components. However, in some cases you can use vectorized operations:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis =1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72
I have an OHLC price data set, that I have parsed from CSV into a Pandas dataframe and resampled to 15 min bars:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 500047 entries, 1998-05-04 04:45:00 to 2012-08-07 00:15:00
Freq: 15T
Data columns:
Close 363152 non-null values
High 363152 non-null values
Low 363152 non-null values
Open 363152 non-null values
dtypes: float64(4)
I would like to add various calculated columns, starting with simple ones such as period Range (H-L) and then booleans to indicate the occurrence of price patterns that I will define - e.g. a hammer candle pattern, for which a sample definition:
def closed_in_top_half_of_range(h,l,c):
return c > l + (h-l)/2
def lower_wick(o,l,c):
return min(o,c)-l
def real_body(o,c):
return abs(c-o)
def lower_wick_at_least_twice_real_body(o,l,c):
return lower_wick(o,l,c) >= 2 * real_body(o,c)
def is_hammer(row):
return lower_wick_at_least_twice_real_body(row["Open"],row["Low"],row["Close"]) \
and closed_in_top_half_of_range(row["High"],row["Low"],row["Close"])
Basic problem: how do I map the function to the column, specifically where I would like to reference more than one other column or the whole row or whatever?
This post deals with adding two calculated columns off of a single source column, which is close, but not quite it.
And slightly more advanced: for price patterns that are determined with reference to more than a single bar (T), how can I reference different rows (e.g. T-1, T-2 etc.) from within the function definition?
The exact code will vary for each of the columns you want to do, but it's likely you'll want to use the map and apply functions. In some cases you can just compute using the existing columns directly, since the columns are Pandas Series objects, which also work as Numpy arrays, which automatically work element-wise for usual mathematical operations.
>>> d
A B C
0 11 13 5
1 6 7 4
2 8 3 6
3 4 8 7
4 0 1 7
>>> (d.A + d.B) / d.C
0 4.800000
1 3.250000
2 1.833333
3 1.714286
4 0.142857
>>> d.A > d.C
0 True
1 True
2 True
3 False
4 False
If you need to use operations like max and min within a row, you can use apply with axis=1 to apply any function you like to each row. Here's an example that computes min(A, B)-C, which seems to be like your "lower wick":
>>> d.apply(lambda row: min([row['A'], row['B']])-row['C'], axis=1)
0 6
1 2
2 -3
3 -3
4 -7
Hopefully that gives you some idea of how to proceed.
Edit: to compare rows against neighboring rows, the simplest approach is to slice the columns you want to compare, leaving off the beginning/end, and then compare the resulting slices. For instance, this will tell you for which rows the element in column A is less than the next row's element in column C:
d['A'][:-1] < d['C'][1:]
and this does it the other way, telling you which rows have A less than the preceding row's C:
d['A'][1:] < d['C'][:-1]
Doing ['A"][:-1] slices off the last element of column A, and doing ['C'][1:] slices off the first element of column C, so when you line these two up and compare them, you're comparing each element in A with the C from the following row.
You could have is_hammer in terms of row["Open"] etc. as follows
def is_hammer(rOpen,rLow,rClose,rHigh):
return lower_wick_at_least_twice_real_body(rOpen,rLow,rClose) \
and closed_in_top_half_of_range(rHigh,rLow,rClose)
Then you can use map:
df["isHammer"] = map(is_hammer, df["Open"], df["Low"], df["Close"], df["High"])
For the second part of your question, you can also use shift, for example:
df['t-1'] = df['t'].shift(1)
t-1 would then contain the values from t one row above.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
The first four functions you list will work on vectors as well, with the exception that lower_wick needs to be adapted. Something like this,
def lower_wick_vec(o, l, c):
min_oc = numpy.where(o > c, c, o)
return min_oc - l
where o, l and c are vectors.
You could do it this way instead which just takes the df as input and avoid using numpy, although it will be much slower:
def lower_wick_df(df):
min_oc = df[['Open', 'Close']].min(axis=1)
return min_oc - l
The other three will work on columns or vectors just as they are. Then you can finish off with
def is_hammer(df):
lw = lower_wick_at_least_twice_real_body(df["Open"], df["Low"], df["Close"])
cl = closed_in_top_half_of_range(df["High"], df["Low"], df["Close"])
return cl & lw
Bit operators can perform set logic on boolean vectors, & for and, | for or etc. This is enough to completely vectorize the sample calculations you gave and should be relatively fast. You could probably speed up even more by temporarily working with the numpy arrays underlying the data while performing these calculations.
For the second part, I would recommend introducing a column indicating the pattern for each row and writing a family of functions which deal with each pattern. Then groupby the pattern and apply the appropriate function to each group.