Is there a Pandas equivalent to each_slice to operate on dataframes - python

I am wondering if there is a Python or Pandas function that approximates the Ruby #each_slice method. In this example, the Ruby #each_slice method will take the array or hash and break it into groups of 100.
var.each_slice(100) do |batch|
  # do some work on each batch
end
I am trying to do this same operation on a Pandas dataframe. Is there a Pythonic way to accomplish the same thing?
I have checked out this answer: Python equivalent of Ruby's each_slice(count)
However, it is old and is not Pandas specific. I am checking it out but am wondering if there is a more direct method.

There isn't a built-in method as such, but you can use numpy's array_split: you can pass it the dataframe and the number of slices.
In order to get slices of roughly 100 rows, you'll have to calculate the number of slices yourself, which is simply the number of rows divided by 100:
import numpy as np
# df.shape returns the dimensions in a tuple, the first dimension is the number of rows
np.array_split(df, df.shape[0] // 100)
This returns a list of dataframes sliced as evenly as possible
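For example, a minimal sketch of iterating over the resulting chunks (the DataFrame here is invented for illustration):
import numpy as np
import pandas as pd

# hypothetical example frame; any DataFrame works
df = pd.DataFrame({'value': range(1050)})

# number of ~100-row chunks; max() guards against frames shorter than 100 rows
n_chunks = max(df.shape[0] // 100, 1)

for batch in np.array_split(df, n_chunks):
    # do some work on each batch, e.g. inspect its size
    print(len(batch))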

Related

What is the fastest way to compare entries from two different pandas DataFrames?

I have two lists in the form of pandas DataFrames which both contain a column of names. Now I want to compare these names and return a list of names which appear in both DataFrames. The problem is that my solution is way too slow since both lists have several thousand entries.
Now I want to know if there is anything else I can do to accelerate the solution of my problem.
I already sorted my pandas dataframe alphabetically using "df.sort_values" in order to create an alphabetical index, so that a name in the first list which starts with the letter "X" will only be compared to entries with the same first letter in the second list.
I suspect that the main reason my program is running so slow is my way of accessing the fields which I am comparing.
I use a specific comparison function to compare the names and access the dataframe elements through the df.at[i, 'column_title'] method.
Edit: Note that this specific comparison function is more complex than a simple "==" since I am doing a kind of fuzzy string comparison to make sure names with slightly different spelling still get marked as a match. I use the whoswho library, which returns a match rate between 0 and 100. A simplified example focusing on my slow solution for the pandas dataframe comparison looks as follows:
for i in range(len(list1)):
    for j in range(len(list2)):
        # who.ratio returns a match rate between two strings
        ratio = who.ratio(list1.at[i, 'name'], list2.at[j, 'name'])
        if ratio > 75:
            save(i, j)  # stores values i and j in a result list
I also thought about switching from pandas to numpy, but I read that this might slow it down even further since pandas is faster for large amounts of data.
Can anybody tell me if there is a faster way of accessing specific elements in a pandas dataframe? Or is there a faster way in general to run a custom comparison function over two pd dataframes?
Edit2: spelling, additional information.
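One commonly suggested change, sketched here under the assumption that list1 and list2 keep their 'name' columns and that who.ratio and save() are the same helpers shown above, is to pull the columns out into plain Python lists once so the inner loop never touches df.at:
# extract the columns once instead of calling df.at inside the loops
names1 = list1['name'].tolist()
names2 = list2['name'].tolist()

for i, name1 in enumerate(names1):
    for j, name2 in enumerate(names2):
        if who.ratio(name1, name2) > 75:
            save(i, j)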

numpy/pandas NaN difference confusion

I happened onto this when trying to find the means/sums of non-nan elements in rows of a pandas dataframe. It seems that
df.apply(np.mean, axis=1)
works fine.
However, applying np.mean to a numpy array containing nans returns a nan.
Is this all spec'd out somewhere? I would not want to get burned down the road...
numpy's mean function first checks whether its input has a mean method, as @EdChum explains in this answer.
When you use df.apply, the input passed to the function is a pandas.Series. Since pandas.Series has a mean method, numpy uses that instead of using its own function. And by default, pandas.Series.mean ignores NaN.
You can access the underlying numpy array by the values attribute and pass that to the function:
df.apply(lambda x: np.mean(x.values), axis=1)
This will use numpy's version.
Divakar has correctly suggested using np.nanmean.
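A small sketch of the difference (the array values here are invented for illustration):
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, np.nan, 4.0])

np.mean(arr)            # nan -- numpy's own mean propagates NaN
np.nanmean(arr)         # 2.333..., NaN elements are ignored
pd.Series(arr).mean()   # 2.333..., pandas skips NaN by default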
If I may answer the question still standing, the semantics differ because Numpy supports masked arrays, while Pandas does not.
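A masked array makes that distinction visible; for example (an illustrative sketch, values invented):
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0])

# mask out the invalid (NaN) entries explicitly, then reduce
np.ma.masked_invalid(arr).mean()   # 2.333..., the masked element is excluded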

python array initialisation (preallocation) with nans

I want to initialise an array that will hold some data. I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
To further explain my situation: I have data I need to store in an array. Say I have 8 rows of data. The number of elements in each row is not equal, so my matrix row length needs to be as long as the longest row. In other rows, some elements will not be filled. I don't want to use zeros since some of my data might actually be zeros.
I realise I can use some value I know my data will never contain, but NaNs are definitely clearer. Just wondering if that can cause any issues later with processing. I realise I need to use nanmax instead of max and so on.
I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
You can use np.full, for example:
np.full((100, 100), np.nan)
However, depending on your needs, you could have a look at numpy.ma for masked arrays or scipy.sparse for sparse matrices. It may or may not be suitable, though. Either way you may need to use different functions from the corresponding module instead of the normal numpy ufuncs.
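For the use case in the question, a minimal sketch (the shapes and values are invented for illustration) might look like this:
import numpy as np

# 3 rows of ragged data, padded to the length of the longest row (5 here)
data = np.full((3, 5), np.nan)

data[0, :3] = [0.0, 1.5, 2.0]
data[1, :5] = [4.0, 0.0, 7.5, 1.0, 3.2]
data[2, :2] = [2.5, 0.0]

np.nanmax(data)            # 7.5 -- the NaN padding is ignored
np.nanmean(data, axis=1)   # per-row means, ignoring the padding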
A way I like to do it, which probably isn't the best but is easy to remember, is adding a 'nans' function to the numpy module this way:
import numpy as np

def nans(n):
    return np.array([np.nan for i in range(n)])

setattr(np, 'nans', nans)
and now you can simply use np.nans as if it were np.zeros:
np.nans(10)

Python - How to construct a numpy array out of a list of objects efficiently

I am building a python application where I retrieve a list of objects and I want to plot them (for plotting I use matplotlib). Each object in the list contains two properties.
For example, let's say I have the list rawdata and the objects stored in it have the properties timestamp and power:
rawdata[0].timestamp == 1
rawdata[1].timestamp == 2
rawdata[2].timestamp == 3
etc
rawdata[0].power == 1232.547
rawdata[1].power == 2525.423
rawdata[2].power == 1125.253
etc
I want to be able to plot the two dimensions that these two properties represent, and I want to do it in a time- and space-efficient way. That means that I want to avoid iterating over the list and sequentially constructing something like a numpy array out of it.
Is there a way to apply an on-the-fly transformation of the list? Or somehow plot it as it is? Since all the information is already included in the list, I believe there should be a way.
The closest answer I found was this, but it includes sequential iteration over the list.
update
As pointed out by Antonio Ragagnin, I can use the map builtin function to construct a numpy array efficiently. But that also means that I will have to create a second data structure. Can I use map to transform the list on the fly into a two-dimensional numpy array?
From the matplotlib tutorial (emphasis mine):
If matplotlib were limited to working with lists, it would be fairly useless for numeric processing. Generally, you will use numpy arrays. In fact, all sequences are converted to numpy arrays internally.
So you lose nothing by converting it to a numpy array; if you don't do it, matplotlib will.
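As an illustration of that conversion (a sketch, assuming rawdata objects expose the timestamp and power attributes described in the question):
import numpy as np
import matplotlib.pyplot as plt

# build each column in a single pass; np.fromiter avoids an intermediate list
timestamps = np.fromiter((obj.timestamp for obj in rawdata), dtype=float)
power = np.fromiter((obj.power for obj in rawdata), dtype=float)

plt.plot(timestamps, power)
plt.xlabel('timestamp')
plt.ylabel('power')
plt.show()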

Find max non-infinity element in pytables CArray

This must be easy, but I'm very new to pytables. My application has dataset sizes so large they cannot be held in memory, thus I use PyTables CArrays. However, I need to find the maximum element in an array that is not infinity. Naively in numpy I'd do this:
max_element = numpy.max(array[array != numpy.inf])
Obviously that won't work in PyTables without introducing a whole array into memory. I could loop through the CArray in windows that fit in memory, but it'd be surprising to me if there weren't a max/min reduction operation. Is there an elegant mechanism to get the conditional maximum element of that array?
If your CArray is one dimensional, it is probably easier to stick it in a single-column Table. Then you have access to the where() method and can easily evaluate expressions like the following.
import numpy as np
from itertools import imap  # Python 2; in Python 3 the builtin map() is already lazy

inf = np.inf  # plain names in the condition string are resolved from the caller's namespace
max(imap(lambda r: r['col'], tab.where('col != inf')))
This works because where() never reads in all the data at once and returns an iterator, which is handed off to imap(), which is handed off to max(). Note that in Python 3 you don't need to import imap(); it becomes just the builtin map().
Not using a table means that you need to use the Expr class and do more of the wiring yourself.
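If you'd rather keep the data in a CArray, the windowed scan mentioned in the question is straightforward; a minimal sketch, assuming a 1-D CArray named arr and a window size that fits comfortably in memory:
import numpy as np

def max_non_inf(arr, chunk_size=1000000):
    # scan the CArray window by window; only one window is in memory at a time
    best = -np.inf
    for start in range(0, arr.nrows, chunk_size):
        chunk = arr[start:start + chunk_size]   # slicing a CArray returns a numpy array
        kept = chunk[chunk != np.inf]           # mirror the numpy one-liner from the question
        if kept.size:
            best = max(best, kept.max())
    return best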
