Find max value of a dataframe column containing numpy arrays - python

I was trying to find the maximum value of a column in a dataframe that contains numpy arrays.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 33, 4],
                   'a': [1, 22, 23, 44],
                   'b': [1, 42, 23, 42]})
df['new'] = df.apply(lambda r: tuple(r), axis=1).apply(np.array)
This is how the dataframe looks:
   id   a   b           new
0   1   1   1     [1, 1, 1]
1   2  22  42   [2, 22, 42]
2  33  23  23  [33, 23, 23]
3   4  44  42   [4, 44, 42]
Now I want to find the maximum (single) value of the column new; in this case it is 44. Is there a quick and easy way?

Because your new column is actually constructed from the columns id, a and b, you can take the max before you create the new column:
single_max = np.max(df.values)
Or, if your dataframe must contain the new column, drop it and then take the max:
single_max = np.max(df.drop('new', axis=1).values)

You can apply a lambda to the values that calls the array's max method. This would result in a Series that also has a max method.
df['new'].apply(lambda arr: arr.max()).max()
Just guessing, but this should be faster than .apply(max) because it uses the optimized array method instead of converting the numpy ints to Python ints one by one.
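A minimal sketch of how one might check that claim with timeit, using the example frame from the question (numbers will vary with machine and data size):
import timeit

# Informal comparison of the built-in max versus the array's own method.
t_builtin = timeit.timeit(lambda: df['new'].apply(max).max(), number=1000)
t_method = timeit.timeit(lambda: df['new'].apply(lambda arr: arr.max()).max(),
                         number=1000)
print(t_builtin, t_method)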

A possible solution:
df.new.explode().max()
Or a faster alternative:
np.max(np.vstack(df.new.values))
Returns 44.

Assuming you only want to consider the column "new":
import numpy as np
out = np.max(tuple(df['new'])) # or np.max(df['new'].tolist())
Output: 44

If the values in new are stored as strings rather than arrays, parse them first:
df.new.map(pd.eval).explode().max()
Output: 44

1. Combination of explode() and max()
df['new'].explode().max()
# 44
2. List comprehension
max([max(e) for e in df['new']])
# 44

pandas create new columns from tuple values in one column

I have a dataframe that looks like
   RMSE       SELECTED DATA        information
0   100    [12, 15, 19, 13]  (arr1, str1, fl1)
1   200          [7, 12, 3]  (arr2, str2, fl2)
2   300  [5, 9, 3, 3, 3, 3]  (arr3, str3, fl3)
Here, I want to break up the information column into three distinct columns: the first containing the arrays, the second containing the string, and the last containing the float. Thus the new dataframe would look like
   RMSE       SELECTED DATA ARRAYS STRING FLOAT
0   100    [12, 15, 19, 13]   arr1   str1   fl1
1   200          [7, 12, 3]   arr2   str2   fl2
2   300  [5, 9, 3, 3, 3, 3]   arr3   str3   fl3
I thought one way would be to isolate the information column and then slice it using .apply like so:
df['arrays'] = df['information'].apply(lambda row : row[0])
and do this for each entry. But I was curious whether there is a better way, since with many more entries this may become tedious or slow with a for loop.
Let us recreate the dataframe:
tojoin = pd.DataFrame(df.pop('information').to_numpy().tolist(),
                      index=df.index,
                      columns=['ARRAYS', 'STRING', 'FLOAT'])
df = df.join(tojoin)
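For reference, a self-contained sketch of this approach on stand-in data (the strings and floats below are placeholders for the question's arr/str/fl values):
import pandas as pd

df = pd.DataFrame({
    "RMSE": [100, 200, 300],
    "SELECTED DATA": [[12, 15, 19, 13], [7, 12, 3], [5, 9, 3, 3, 3, 3]],
    "information": [("arr1", "str1", 1.0), ("arr2", "str2", 2.0),
                    ("arr3", "str3", 3.0)],
})

# pop removes the tuple column; building a DataFrame from the list of
# tuples splits each tuple into its own column in one vectorized step.
tojoin = pd.DataFrame(df.pop('information').to_numpy().tolist(),
                      index=df.index,
                      columns=['ARRAYS', 'STRING', 'FLOAT'])
df = df.join(tojoin)
print(df)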

What does numpy.ix_() function do and what is the output used for?

Below is the output from the numpy.ix_() function. What is the use of this output? Its structure is quite unique.
>>> import numpy as np
>>> gfg = np.ix_([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16],
...              [21, 22, 23, 24, 25, 26], [31, 32, 33, 34, 35, 36])
>>> gfg
(array([[[[1]]],
        [[[2]]],
        [[[3]]],
        [[[4]]],
        [[[5]]],
        [[[6]]]]),
 array([[[[11]],
         [[12]],
         [[13]],
         [[14]],
         [[15]],
         [[16]]]]),
 array([[[[21],
          [22],
          [23],
          [24],
          [25],
          [26]]]]),
 array([[[[31, 32, 33, 34, 35, 36]]]]))
According to numpy doc:
Construct an open mesh from multiple sequences.
This function takes N 1-D sequences and returns N outputs with N dimensions each, such that the shape is 1 in all but one dimension and the dimension with the non-unit shape value cycles through all N dimensions.
Using ix_ one can quickly construct index arrays that will index the cross product. a[np.ix_([1,3],[2,5])] returns the array [[a[1,2] a[1,5]], [a[3,2] a[3,5]]].
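The shape pattern the doc describes can be checked directly; a minimal sketch (shorter sequences than the question's, for readability):
import numpy as np

# Each of the N returned arrays has N dimensions, and the single
# non-unit axis moves one position to the right for each input.
ixgrid = np.ix_([1, 2, 3], [11, 12, 13], [21, 22, 23], [31, 32, 33])
print([g.shape for g in ixgrid])
# [(3, 1, 1, 1), (1, 3, 1, 1), (1, 1, 3, 1), (1, 1, 1, 3)]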
numpy.ix_()'s main use is to create an open mesh so that we can use it to select specific indices from an array (a specific sub-array). An easy example to understand it is:
Say you have a 2D array of shape (5, 5), and you would like to select the sub-array constructed by selecting rows 1 and 3 and columns 0 and 3. You can use np.ix_ to create an index mesh so as to be able to select the sub-array, as in the example below:
a = np.arange(5 * 5).reshape(5, 5)
print(a)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]
#  [20 21 22 23 24]]

sub_indices = np.ix_([1, 3], [0, 3])
print(sub_indices)
# (array([[1],
#         [3]]), array([[0, 3]]))

print(a[sub_indices])
# [[ 5  8]
#  [15 18]]
which is basically the selected sub-array from a that is in rows array([[1],[3]]) and columns array([[0, 3]]):
 col 0    col 3
   |        |
   v        v
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]   <- row 1
 [10 11 12 13 14]
 [15 16 17 18 19]   <- row 3
 [20 21 22 23 24]]
Please note that in the output of np.ix_, the N arrays returned for the N 1-D input sequences are ordered so that the first indexes rows, the second columns, the third depth, and so on. That is why, in the example above, array([[1],[3]]) selects rows and array([[0, 3]]) selects columns; the same goes for the example the OP provided in the question. The reason is the way numpy applies advanced indexing to multi-dimensional arrays.
It's basically used to create N arrays of indexes (index masks), each one referring to a different dimension.
For example, if I have a 3D np.ndarray and I want to get only some entries of it, I can use numpy.ix_ to create 3 arrays with shapes like (N, 1, 1), (1, N, 1) and (1, 1, N), containing the corresponding indices for each of the 3 axes.
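A minimal sketch of that 3D case (array shape and indices chosen purely for illustration):
import numpy as np

a = np.arange(4 * 4 * 4).reshape(4, 4, 4)

# np.ix_ returns index arrays of shapes (2, 1, 1), (1, 2, 1) and (1, 1, 2),
# one per axis; broadcasting them together selects a (2, 2, 2) sub-block.
idx = np.ix_([0, 2], [1, 3], [0, 1])
print(a[idx].shape)  # (2, 2, 2)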
Take a look at the examples on the numpy documentation page. They're self-explanatory.
This function isn't commonly used.
I think it appears in some algebra operations, like the cross product and its generalisations.
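As one hedged illustration of that algebraic flavour (a plain outer product; not necessarily what this answer had in mind):
import numpy as np

u = np.array([1, 2, 3])
v = np.array([10, 20])

# np.ix_ reshapes u to (3, 1) and v to (1, 2); broadcasting the
# multiplication then yields the 3x2 outer product.
U, V = np.ix_(u, v)
print(U * V)
# [[10 20]
#  [20 40]
#  [30 60]]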

Manipulate lists in a pandas data frame column (e.g. divide by another column)

I have a pandas data frame with one column containing lists. I wish to divide each list element in each row by a scalar value in another column. In the following example, I wish to divide each element in a by b:
              a   b
0  [11, 22, 33]  11
1  [12, 24, 36]   2
2  [33, 66, 99]   3
Thus yielding the following result:
              a   b                   c
0  [11, 22, 33]  11     [1.0, 2.0, 3.0]
1  [12, 24, 36]   2   [6.0, 12.0, 18.0]
2  [33, 66, 99]   3  [11.0, 22.0, 33.0]
I can achieve this by the following code:
import pandas as pd

df = pd.DataFrame({"a": [[11, 22, 33], [12, 24, 36], [33, 66, 99]],
                   "b": [11, 2, 3]})
result = {"c": []}
for _, row in df.iterrows():
    result["c"].append([x / row["b"] for x in row["a"]])
df_c = pd.DataFrame(result)
df = pd.concat([df, df_c], axis="columns")
But explicit iteration over rows, collecting the result in a dictionary, converting it to a dataframe and then concatenating it to the original data frame seems very inefficient and inelegant.
Does anyone have a better solution?
Thanks in advance and cheers!
PS: In case you are wondering why I would store lists in a column: these are the resulting amplitudes of a Fourier transform.
Why don't I use one column for each frequency?
Creating a new column for each frequency is horribly slow
With different sampling rates and FFT-window sizes in my project, there are multiple sets of frequencies.
Zip the two columns, divide each entry in column a by its corresponding entry in column b through a combination of product and starmap, and convert the iterator back into a list. Note that floordiv performs integer division; substitute operator.truediv if you need the float results shown in the question.
from itertools import product, starmap
from operator import floordiv

df['c'] = [list(starmap(floordiv, product(num, [denom])))
           for num, denom in zip(df.a, df.b)]
              a   b             c
0  [11, 22, 33]  11     [1, 2, 3]
1  [12, 24, 36]   2   [6, 12, 18]
2  [33, 66, 99]   3  [11, 22, 33]
Alternatively, you could just use a numpy array within the iteration:
import numpy as np

df['c'] = [list(np.array(num) / denom) for num, denom in zip(df.a, df.b)]
Thanks to @jezrael for the suggestion: all of this might be unnecessary, as scipy has dedicated FFT functionality; have a look at the link and see if it helps out.
I would convert the lists to numpy arrays:
df['c'] = df['a'].apply(np.array) / df['b']
You will get np.arrays in column c. If you really need lists, you will have to convert them back:
df['c'] = df['c'].apply(list)
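Putting it together, a self-contained sketch of this approach on the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [[11, 22, 33], [12, 24, 36], [33, 66, 99]],
                   "b": [11, 2, 3]})

# Convert each list to an array, divide by the scalar column
# (element-wise broadcasting), then convert back to lists.
df['c'] = (df['a'].apply(np.array) / df['b']).apply(list)
print(df['c'].tolist())
# [[1.0, 2.0, 3.0], [6.0, 12.0, 18.0], [11.0, 22.0, 33.0]]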

numpy.argmax() not working on my sorted pandas.Series

Can someone please explain to me why the argmax() function does not work after using sort_values() on my pandas Series?
Below is an example of my code. The indices in the output are based on the original DataFrame, and not on the sorted Series.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.idxmax()

df.apply(sec_largest)
Then the output is
a 1
b 3
c 0
dtype: int64
And when I check the Series using xsorted.iloc[0], it gives me the maximum value in the series.
Can someone explain to me how this works? Thank you very much.
The problem is that you are sorting the pandas Series, whose index labels travel along with the sort, and idxmax returns the original index label of the highest value, not its position in the sorted series.
def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.values.argmax()
By using the values of xsorted we work on the underlying numpy array instead of the pandas data structure, and everything works as expected.
If you print xsorted in the function you can see that the indices also get sorted along:
1 5
0 4
2 3
4 2
3 1
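To make the distinction concrete, a minimal sketch on a single column:
import pandas as pd

s = pd.Series([4, 5, 3, 1, 2]).sort_values(ascending=False)
print(s.idxmax())         # 1 -- original label of the max, kept through the sort
print(s.values.argmax())  # 0 -- position of the max in the sorted numpy array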

Apply elementwise, concatenate resulting rows into a DataFrame

I have a list of values (could easily become a Series or DataFrame), on which I want to apply a function element-wise.
x = [1, 5, 14, 27]
The function itself returns a single row of a DataFrame (returning the original value x and two result value columns), and I want to end up with a single DataFrame.
    x  Val1  Val2
0   1     4    23
1   5    56    27
2  14    10     9
3  27     8    33
The simple method is a for loop over the list, row-binding the results with df.append(), but I'm sure there is a way to do this with the .apply() family of functions. I just can't figure out exactly which one to use. I'm pretty familiar with doing this type of thing in R, and am familiar with Python; I just need to get my head around the pandas syntax.
EDIT: More concrete example for clarity
Example function:
def returnsquares(x):
    return pd.DataFrame({"input": [x], "sq": x**2, "cube": x**3})
The input of the function is a scalar; the output is a DataFrame with a single row (not a Series).
Code that works:
result = pd.DataFrame({}, columns=["input", "sq", "cube"])
for entry in x:
    result = result.append(returnsquares(entry))
(The values of the output are obviously not the same as above, but are the same shape). Is there a better method for doing this?
Consider the following function, which returns the same values you show in your example:
def special_function(x):
    idx = ['x', 'Val1', 'Val2']
    d = {
        1: pd.Series([x, 4, 23], idx),
        5: pd.Series([x, 56, 27], idx),
        14: pd.Series([x, 10, 9], idx),
        27: pd.Series([x, 8, 33], idx),
    }
    return d[x]
Then combine into a single dataframe using pd.DataFrame.from_records:
pd.DataFrame.from_records([special_function(i).squeeze() for i in x])
Or use pd.concat (the more future-proof option, since DataFrame.append was deprecated and later removed in pandas 2.0):
pd.concat([special_function(i) for i in x])
Or make x a Series and use apply:
x = pd.Series([1, 5, 14, 27])
x.apply(lambda y: special_function(y).iloc[0])
Be aware of timings: the original answer included a timing plot (not reproduced here) comparing these options. Don't fear the list comprehension.
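A hedged sketch of how one might redo such a comparison with timeit (exact numbers will vary by machine and pandas version; x and special_function as defined above):
import timeit

t_records = timeit.timeit(
    lambda: pd.DataFrame.from_records(
        [special_function(i).squeeze() for i in x]),
    number=1000)
t_concat = timeit.timeit(
    lambda: pd.concat([special_function(i) for i in x]),
    number=1000)
print(t_records, t_concat)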
