I want to create pandas dataframes and be able to manipulate them in optimised code (numba).
Most optimised code will take series or dataframe inputs and store results in preallocated outputs.
from numba import njit

@njit
def calc(u, v, w):
    for i in range(w.shape[0]):
        w[i] = some_f(u[i], v[i])
where some_f is a placeholder for operations that can be complex, with tests and loops, hence the use of numba. Most importantly, I want to avoid any useless copies of data in the process.
For vectorized functions such as the above, I want to use the same code for series and dataframes.
So for series u, v, w, I'll use:
calc(u.values, v.values, w.values)
For dataframes, I thought of reusing the same function with
calc(u.values.reshape(-1), v.values.reshape(-1), w.values.reshape(-1))
This only works if:
- the array ordering of the DataFrame (C or Fortran) is consistent across the three DataFrames, and
- the reshape method is passed an argument order='C' or 'F' matching the original DataFrame ordering; otherwise a copy is made.
Pandas does not seem to have a consistent policy for dataframe ordering.
For instance, the constructor
df = pandas.DataFrame(index=..., columns=..., data=0.0)
will return a C-ordered array, while
df.copy()
will be Fortran-ordered.
I wanted to know whether others have encountered similar issues and found a consistent way to ensure that DataFrames are always in the same order (C or Fortran) without cluttering ordinary pandas code too much.
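A minimal sketch of one possible workaround, assuming single-dtype frames so that .values exposes the backing array without a copy; flat_view is a hypothetical helper, not pandas API, and np.ravel(..., order='K') flattens in whatever order the data is already laid out:

import numpy as np

def flat_view(obj):
    # hypothetical helper: flatten the backing array of a Series/DataFrame
    # without copying, whatever its memory order (C or Fortran)
    arr = obj.values                    # 1-D for a Series, 2-D for a DataFrame
    flat = np.ravel(arr, order='K')     # 'K' follows the existing layout: a view for contiguous arrays
    assert np.shares_memory(flat, arr), "flattening made a copy"
    return flat

# the same kernel then works for Series and DataFrames alike, provided
# u, v and w share the same memory layout so that elements line up:
# calc(flat_view(u), flat_view(v), flat_view(w))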
Related
I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here are the rules; subsequent rules override earlier ones:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy as it is evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work.
The chained indexing is two separate Python operations and thus cannot be reliably intercepted by pandas (you will often get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
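For concreteness, a small runnable illustration of the single-.loc form, using a frame built like the one in the question:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 8), columns=list('ABCDEFGH'), index=range(1, 9))

# one .loc call selects the rows and columns together and sets in place
df.loc[df.C <= df.B, 'B':'E'] = 7654321

# the chained form df[df.C <= df.B].loc[:, 'B':'E'] = 7654321 may only
# modify a temporary copy; at best you get a SettingWithCopyWarning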
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to all be references to df, but the last one is not!
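One way to probe this, assuming a single-dtype frame so that .values does not itself copy (results can differ across pandas versions):

import numpy as np

print(u is df)          # True: u is literally the same object
for name, obj in [('v', v), ('w', w), ('z', z)]:
    # does the new object still share its buffer with df?
    print(name, np.shares_memory(obj.values, df.values))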
I need to perform some simple calculations on a large number of combinations of rows or columns for a pandas dataframe. I need to figure out how to do so most efficiently because the number of combinations might go up above a billion.
The basic approach is easy: just means, comparison operators, and sums on sub-selections of a DataFrame. But the only way I've figured out involves a loop over the combinations, which isn't very pythonic and isn't very efficient. Since efficiency will matter as the number of samples goes up, I'm hoping there might be some smarter way to do this.
Right now I am building the list of combinations and then selecting those rows and doing the calculations using built-in pandas tools (see pseudo-code below). One possibility is to parallelize this, which should be pretty easy. However, I wonder if I'm missing a deeper way to do this more efficiently.
A few thoughts, ordered from big to small:
Is there some smart pandas/python or even some smart linear-algebra way to do this? I haven't found one, but want to check.
Is the best approach to stick with pandas? Or convert to a numpy array and just do everything using numeric indices there, and then convert back to easier-to-understand data-frames?
Is the built-in mean() the best approach, or should I use some kind of apply()?
Is it faster to select rows or columns in any way? The matrix is symmetric so it's easy to grab either.
I'm currently selecting 18 rows because each of the 6 rows actually has three entries with slightly different parameters; I could combine those into individual rows beforehand if it's faster to select 6 rows than 18 for some reason.
Here's a rough-sketch of what I'm doing:
from itertools import combinations
import pandas as pd

df = from_excel()  # placeholder loader; test case is 30 rows & cols
df = df.set_index('Col1')  # column and row 1 are names, the rest are the actual matrix values

allSets = combinations(df.columns, 6)
temp = []
for s in allSets:
    avg1 = df.loc[list(s)].mean().mean()
    cnt1 = df.loc[list(s)].gt(0).sum().sum()
    temp.append([s, avg1, cnt1])
dfOut = pd.DataFrame(temp, columns=['Set', 'Average', 'Count'])
A few general considerations that should help:
Not that I know of, though the best place to ask would be Mathematics or Math Professionals, and it is worth a try. There may be a better way to frame the question if you are doing something very specific with the results, e.g. looking for a minimum/maximum.
In general you are right that pandas, as a layer on top of NumPy, is probably not speeding things up. However, most of the heavy lifting is done at the NumPy level, and until you are sure pandas is to blame, use it.
mean is better than your own function applied across rows or columns, because it uses the C implementation of mean in NumPy under the hood, which is always going to be faster than pure Python.
Given that pandas organizes data by column (i.e. each column is a contiguous NumPy array), it is better to go row-wise.
It would be great to see an example of data here.
Now, some comments on the code:
use iloc and numeric indices instead of loc - it is way faster
it is unnecessary to turn the tuple into a list here: df.loc[list(s)].gt(0).sum().sum()
just use: df.loc[s].gt(0).sum().sum()
you should rather use a generator instead of the for loop where you append elements to a temporary list (this is slow and unnecessary, because you are creating a pandas DataFrame either way). Also, use tuples instead of lists wherever possible for maximum speed:
def gen_fun():
    allSets = combinations(df.columns, 6)
    for s in allSets:
        avg1 = df.loc[list(s)].mean().mean()
        cnt1 = df.loc[list(s)].gt(0).sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])
Another thing is that you can preprocess the DataFrame so that the positive values are identified only once, avoiding the gt(0) operation in each loop iteration. In this way you spare both memory and CPU time (see the sketch below).
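A hedged sketch of that preprocessing, reusing df from the question: it assumes the matrix has no NaNs (so a block mean equals mean().mean()), that rows and columns come in the same order (the matrix is symmetric), and that df.to_numpy() is available (otherwise use df.values); gen_fun_fast and col_pos are made-up names.

import pandas as pd
from itertools import combinations

vals = df.to_numpy()                     # raw 30 x 30 matrix, extracted once
pos = vals > 0                           # positivity mask, computed a single time
col_pos = {c: i for i, c in enumerate(df.columns)}

def gen_fun_fast():
    for s in combinations(df.columns, 6):
        idx = [col_pos[c] for c in s]    # positional indices for this combination
        yield (s, vals[idx].mean(), pos[idx].sum())

dfOut = pd.DataFrame(gen_fun_fast(), columns=['Set', 'Average', 'Count'])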
I am running into a weird inconsistency. So I had to learn the difference between immutable and mutable data types. For my purpose, I need to convert my pandas DataFrame to NumPy, apply the operations there, and convert it back, as I do not wish to alter my input.
I am converting as follows:
mix = pd.DataFrame(array, columns=columns)

def mix_to_pmix(mix, p_tank):
    previous = 0
    columns, mix_in = np.array(mix)  # <---
    mix_in *= p_tank
    previous = 0
    for count, val in enumerate(mix_in):
        mix_in[count] = val + previous
        previous += val
    return pd.DataFrame(mix_in, columns=columns)
This works perfectly fine, but the line
columns, mix_in = np.array(mix)
does not seem to behave consistently; for example, in this case:
def to_molfrac(mix):
    columns, mix_in = np.array(mix)
    shape = mix_in.shape
    for i in range(shape[0]):
        mix_in[i, :] *= 1 / max(mix_in[i, :])
    for k in range(shape[1] - 1, 0, -1):
        mix_in[:, k] += -mix_in[:, k - 1]
    mix_in = mix_in / mix_in.sum(axis=1)[:, np.newaxis]
    return pd.DataFrame(mix_in, columns=columns)
I receive the error:
ValueError: too many values to unpack (expected 2)
The input of the latter function is the output of the previous function. So it should be the same case.
It's impossible to fully understand the inputs of to_molfrac and mix_to_pmix without an example, but the unpacking line is the culprit: columns, mix_in = np.array(mix) iterates over the first axis of the converted array, so it only succeeds when the frame happens to have exactly two rows; for any other number of rows you get the "too many values to unpack" error.
Pandas objects have a .values attribute which gives access to the underlying NumPy array, so it is probably better to use mix_in = mix.values instead, i.e.
columns, values = df.columns, df.values
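As a sketch, mix_to_pmix rewritten with that substitution, keeping the loop from the question untouched; the explicit .copy() (an addition here) is what keeps the input frame unaltered, assuming the data is numeric:

import pandas as pd

def mix_to_pmix(mix, p_tank):
    columns = mix.columns
    mix_in = mix.values.copy()   # .values exposes the ndarray; copy() protects the input frame
    mix_in *= p_tank
    previous = 0
    for count, val in enumerate(mix_in):
        mix_in[count] = val + previous
        previous += val
    return pd.DataFrame(mix_in, columns=columns)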
Is there a nicer way of summing over all the DataArrays in an xarray Dataset than
sum(d for d in ds.data_vars.values())
This works, but seems a bit clunky. Is there an equivalent to summing over pandas DataFrame columns?
Note that the ds.sum() method applies to each of the DataArrays separately, but I want to combine the DataArrays.
I assume you want to sum each data variable as well, e.g., sum(d.sum() for d in ds.data_vars.values()). In a future version of xarray (not yet in v0.10) this will be more succinct: you will be able to write sum(d.sum() for d in ds.values()).
Another option is to convert the Dataset into a single DataArray and sum it at once, e.g., ds.to_array().sum(). This will be less efficient if you have data variables with different dimensions.
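A tiny made-up Dataset to illustrate: ds.to_array() stacks the data variables along a new 'variable' dimension, so reducing over that dimension gives the element-wise combination the question asks for, while a plain .sum() collapses everything.

import numpy as np
import xarray as xr

ds = xr.Dataset({'a': ('x', np.arange(3.0)),
                 'b': ('x', np.ones(3))})

total = ds.to_array().sum()             # scalar DataArray: (0+1+2) + (1+1+1) = 6
per_x = ds.to_array().sum('variable')   # element-wise sum of the DataArrays, keeps dim x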
Question
Let's assume the following DataFrame is given
ID IBM MSFT APPL ORCL FB TWTR
date
1986-08-31 -1.332298 0.396217 0.574269 -0.679972 -0.470584 0.234379
1986-09-30 -0.222567 0.281202 -0.505856 -1.392477 0.941539 0.974867
1986-10-31 -1.139867 -0.458111 -0.999498 1.920840 0.478174 -0.315904
1986-11-30 -0.189720 -0.542432 -0.471642 1.506206 -1.506439 0.301714
1986-12-31 1.061092 -0.922713 -0.275050 0.776958 1.371245 -2.540688
and I want to do some operations on it. This could be some complicated mathematical method. The columns are structurally the same.
Q1: What is the best method wrt. performance and/or implementation design?
Q2: Should I write a method that disassembles the DataFrame into its numerical parts (NumPy arrays) and indices? The necessary calculations would then be carried out by a submodule on the NumPy array, and the main method would only be responsible for collecting the data returned by the submodule and rejoining it with the corresponding indices (see the example code below, and the runnable sketch at the end of this question).
def submodule(np_array):
    # some fancy calculations here
    return modified_array

def main(df):
    cols = df.columns
    indices = df.index
    values = df.values
    modified_values = submodule(values)
    new_df = pd.DataFrame(modified_values, columns=cols, index=indices)
    return new_df
Q3: Or should I do the calculations with DataFrames directly?
Q4: Or should I work with objects instead?
Q5: What is better with respect to performance, design, or code structure?
Addendum
A more practical example would be a portfolio optimization.
Q6: Should I pass the whole DataFrame into the optimization, or use only the numerical matrix? Strictly speaking, I don't think the DataFrame's label information should be passed to a numerical method, but I am not sure whether my thinking is outdated.
Another example would be calculating the Delta for a number of options (an operation on every single series instead of a matrix operation).
P.S.:
I know that I wouldn't need a separate function for disassembling, but it highlights my intentions.
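To make the Q2 round trip concrete, here is a hedged usage sketch reusing main from the example code above: submodule is filled in with a made-up operation (demeaning each column) purely as a stand-in for the fancy calculations, and the frame mimics the one at the top of the question. The point is only that the index and columns survive the NumPy detour.

import numpy as np
import pandas as pd

def submodule(np_array):
    # made-up stand-in for the fancy calculations: demean each column
    return np_array - np_array.mean(axis=0)

df = pd.DataFrame(np.random.randn(5, 6),
                  columns=['IBM', 'MSFT', 'APPL', 'ORCL', 'FB', 'TWTR'],
                  index=pd.date_range('1986-08-31', periods=5, freq='M'))

out = main(df)   # same index and columns as df, transformed values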