Pandas: Series of arrays to series of transposed arrays - python

Ok, this is an easy one, I hope.
Using Pandas, I have a Series of 100 equal length Numpy arrays each with 30000 elements. I'd like to quickly transpose them into a series of 30000 arrays with 100 elements.
I can of course do it with list comprehensions or by pulling out the arrays, but is there an efficient Pandas way to do it? Thanks!
UPDATE:
As per the request by @Alexander to make this a better example, here is some toy data.
import numpy as np
import pandas

s1 = pandas.Series([np.array(range(10)) for i in range(10)])
And what I want returned in this example is:
s2 = pandas.Series([np.ones(10)*i for i in range(10)])
That is, an element-wise transpose of a Series of arrays into a new Series of arrays. Thanks!

OK, this actually works. Anyone have a more efficient solution?
pandas.Series(np.asarray(s1.tolist()).T.tolist())
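For comparison, here is a minimal sketch of an equivalent formulation that stacks the arrays with np.stack instead of round-tripping through Python lists (not from the original thread, and I have not benchmarked it against the line above):

import numpy as np
import pandas

s1 = pandas.Series([np.arange(10) for _ in range(10)])

# stack the per-element arrays into one 2-D block, transpose, and re-wrap the rows
stacked = np.stack(s1.to_numpy())     # shape (len(s1), 10)
s2 = pandas.Series(list(stacked.T))   # one array per original element position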

Related

Efficiently perform cheap calculations on many (1e6-1e10) combinations of rows in a pandas dataframe in python

I need to perform some simple calculations on a large number of combinations of rows or columns for a pandas dataframe. I need to figure out how to do so most efficiently because the number of combinations might go up above a billion.
The basic approach is easy--just performing means, comparison operators, and sums on subselections of a dataframe. But the only way I've figured out involves doing a loop over the combinations, which isn't very pythonic and isn't super efficient. Since efficiency will matter as the number of samples goes up I'm hoping there might be some smarter way to do this.
Right now I am building the list of combinations and then selecting those rows and doing the calculations using built-in pandas tools (see pseudo-code below). One possibility is to parallelize this, which should be pretty easy. However, I wonder if I'm missing a deeper way to do this more efficiently.
A few thoughts, ordered from big to small:
Is there some smart pandas/Python or even some smart linear-algebra way to do this? I haven't found one, but I want to check.
Is the best approach to stick with pandas? Or convert to a numpy array and just do everything using numeric indices there, and then convert back to easier-to-understand data-frames?
Is the built-in mean() the best approach, or should I use some kind of apply()?
Is it faster to select rows or columns in any way? The matrix is symmetric so it's easy to grab either.
I'm currently selecting 18 rows because each of the 6 rows actually has three entries with slightly different parameters; I could combine those into individual rows beforehand if it's faster to select 6 rows than 18 for some reason.
Here's a rough-sketch of what I'm doing:
from itertools import combinations
import pandas as pd

df = from_excel()  # test case is 30 rows & cols
df = df.set_index('Col1')  # column and row 1 are names, the rest are the actual matrix values
allSets = combinations(df.columns, 6)
temp = []
for s in allSets:
    avg1 = df.loc[list(s)].mean().mean()
    cnt1 = df.loc[list(s)].gt(0).sum().sum()
    temp.append([s, avg1, cnt1])
dfOut = pd.DataFrame(temp, columns=['Set', 'Average', 'Count'])
A few general considerations that should help:
Not that I know of, though the best place to ask would be the Mathematics or Math Professionals communities, and it is worth a try. There may be a better way to frame the question if you are doing something very specific with the results, such as looking for a minimum or maximum.
In general, you are right that pandas, as a layer on top of NumPy, is probably not speeding things up. However, most of the heavy lifting is done at the NumPy level, so until you are sure pandas is to blame, keep using it.
The built-in mean is better than your own function applied across rows or columns, because it uses NumPy's C implementation of mean under the hood, which will always be faster than pure Python.
Given that pandas organizes data by column (i.e. each column is a contiguous NumPy array), it is better to go row-wise.
It would be great to see an example of data here.
Now, some comments on the code:
use iloc with numeric indices instead of loc; it is much faster
it is unnecessary to turn the tuple into a list here: df.loc[list(s)].gt(0).sum().sum()
just use: df.loc[s].gt(0).sum().sum()
rather than the for loop that appends elements to a temporary list (which is awfully slow and unnecessary, because you are creating a pandas DataFrame either way), use a generator. Also, use tuples instead of lists wherever possible for maximum speed:
def gen_fun():
    allSets = combinations(df.columns, 6)
    for s in allSets:
        avg1 = df.loc[list(s)].mean().mean()
        cnt1 = df.loc[list(s)].gt(0).sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])
Another thing: you can preprocess the dataframe (for example, compute the positive-value comparison once up front) so the gt(0) operation is not repeated in each iteration; a sketch follows below.
This way you save both memory and CPU time.
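A minimal sketch of that preprocessing idea, assuming the df and combinations setup from the question (the positive name is mine, not from the original answer):

from itertools import combinations
import pandas as pd

# compute the boolean comparison once instead of calling gt(0) per combination
positive = df.gt(0)

def gen_fun():
    for s in combinations(df.columns, 6):
        rows = list(s)
        yield (s, df.loc[rows].mean().mean(), positive.loc[rows].sum().sum())

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])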

using unique function in pandas for a 2D array

In this question's answer I got the idea of using pandas unique function instead of numpy unique. When looking into the documentation here I discovered that this can only be done for 1D arrays or tuples. As my data has the format:
example = [[25.1, 0.03], [25.1, 0.03], [24.1, 15]]
it would be possible to convert it to tuples and, after using the unique function, convert it back to an array. Does someone know a 'better' way to do this? This question might be related, but it is dealing with cells. I don't want to use numpy as I have to keep the order in the array the same.
You can convert the rows to tuples and then build a unique list:
list(dict.fromkeys(map(tuple, example)))
Output:
[(25.1, 0.03), (24.1, 15)]
If you'd like to use Pandas:
To find the unique pairs in example, use DataFrame instead of Series and then drop_duplicates:
pd.DataFrame(example).drop_duplicates()
      0      1
0  25.1   0.03
2  24.1  15.00
(And .values will give you back a 2-D array.)
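A quick end-to-end check of that route, assuming the example list from the question:

import pandas as pd

example = [[25.1, 0.03], [25.1, 0.03], [24.1, 15]]

# drop duplicate rows, then pull the result back out as a 2-D NumPy array
unique_rows = pd.DataFrame(example).drop_duplicates().values
# unique_rows is now a (2, 2) array: [[25.1, 0.03], [24.1, 15.0]]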

List to 2d array in pandas per line NEED MORE EFFICIENT WAY

I have a pandas dataframe of lists, and each list can be converted to a NumPy array with np.asarray(list). The shape of each such array should be (263, 300), so I do this:
import numpy as np

a = dataframe.to_numpy()
# a.shape is (100000,)
output_array = np.array([])
for list in a:
    output_array = np.append(output_array, np.asarray(list))
Since there are 100000 rows in my dataframe, I expect to get
output_array.shape == (100000, 263, 300)
It works, but it takes a long time.
I want to know which part of my code costs the most and how to fix it.
Is there a more efficient method to achieve this? Thanks!
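One common alternative (a sketch, not an answer from the original thread): build the 3-D array in a single pass with np.stack instead of growing it with repeated np.append calls, which copy the whole array on every iteration.

import numpy as np

# a is the (100000,) object array of nested lists from the question;
# np.stack converts each nested list once and writes it into a single
# (100000, 263, 300) array, avoiding the per-iteration copies of np.append
output_array = np.stack([np.asarray(row) for row in a])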

Get rows from a DataFrame by using a list of slices

I have a several million row data frame, and a list of interesting sections I need to select out of it. I'm looking for a highly efficient (read as: fastest possible) way of doing this.
I know I can do this:
slices = [slice(0,10), slice(20,50), slice(1000,5000)]
for sl in slices:
    df.loc[sl, 'somecolumn'] = True
... but that just seems like an inefficient way of getting the job done. It's really slow.
This seems faster than the for loop above, but I'm not sure if this is the best possible approach:
from itertools import chain
ranges = chain.from_iterable(slices)
df.loc[ranges, 'somecolumns'] = True
This also doesn't work, even though it seems that maybe it should:
df.loc[slices, 'somecolumns'] = True
TypeError: unhashable type: 'slice'
My primary concern in this is performance. I need the best I can get out of this due to the size of the data frames I am dealing with.
pandas
You can try a couple of tricks:
Use np.r_ to concatenate slice objects into a single NumPy array. Indexing with NumPy arrays is usually efficient as these are used internally in the Pandas framework.
Use positional integer indexing via pd.DataFrame.iloc instead of primarily label-based loc. The former is more restrictive and more closely aligned with NumPy indexing.
Here's a demo:
# some example dataframe
df = pd.DataFrame(dict(zip('ABCD', np.arange(100).reshape((4, 25)))))
# concatenate multiple slices
slices = np.r_[slice(0, 3), slice(6, 10), slice(15, 20)]
# use integer indexing
df.iloc[slices, df.columns.get_loc('C')] = 0
numpy
If your series is held in a contiguous memory block, which is usually the case with numeric (or Boolean) arrays, you can try updating the underlying NumPy array in-place. First define slices via np.r_ as above, then use:
df['C'].values[slices] = 0
This bypasses the Pandas interface and any associated checks that occur via the regular indexing methods.
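Putting the two pieces together on the demo frame above (a sketch; the in-place route assumes 'C' is an ordinary numeric column backed by a contiguous block, and newer pandas versions with copy-on-write enabled may not propagate such writes):

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(zip('ABCD', np.arange(100).reshape((4, 25)))))
slices = np.r_[slice(0, 3), slice(6, 10), slice(15, 20)]

# write through the underlying NumPy array, skipping the pandas indexing layer
df['C'].values[slices] = 0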
IIUC, you are looking to slice on axis = 0 (row index). Instead of slices, I'm using numpy's arange method, and using df.ix:
slices = np.concatenate([np.arange(0, 10), np.arange(20, 50), np.arange(1000, 5000)])  # add other row slices here
df.ix[slices, 'some_col']
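Since df.ix has been removed from current pandas releases, a hedged modern equivalent of the same idea (my translation, not part of the original answer) uses iloc with the positional row indices:

import numpy as np

rows = np.concatenate([np.arange(0, 10), np.arange(20, 50), np.arange(1000, 5000)])
df.iloc[rows, df.columns.get_loc('some_col')]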
You can try building a full indexer for the rows first, then do your assignment:
# build one combined row indexer from the per-slice index pieces
row_indexer = np.concatenate([df.index[sub_slice] for sub_slice in slices])
df.loc[row_indexer, column] = True

Combining DataArrays in an xarray Dataset

Is there a nicer way of summing over all the DataArrays in an xarray Dataset than
sum(d for d in ds.data_vars.values())
This works, but seems a bit clunky. Is there an equivalent to summing over pandas DataFrame columns?
Note the ds.sum() method applies to each of the DataArrays - but I want to combine the DataArrays.
I assume you want to sum each data variable as well, e.g., sum(d.sum() for d in ds.data_vars.values()). In a future version of xarray (not yet in v0.10) this will be more succinct: you will be able to write sum(d.sum() for d in ds.values()).
Another option is to convert the Dataset into a single DataArray and sum it at once, e.g., ds.to_array().sum(). This will be less efficient if you have data variables with different dimensions.
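A small illustration of both routes, using a toy Dataset that I am assuming here (two variables sharing one dimension):

import numpy as np
import xarray as xr

ds = xr.Dataset({'a': ('x', np.arange(3.0)), 'b': ('x', np.ones(3))})

# element-wise combination of the variables: a DataArray over dimension x
combined = sum(d for d in ds.data_vars.values())

# grand total over all variables and points: a 0-d DataArray
total = ds.to_array().sum()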
