I have a list of values (could easily become a Series or DataFrame), on which I want to apply a function element-wise.
x = [1, 5, 14, 27]
The function itself returns a single-row DataFrame (containing the original value x and two result columns), and I want to end up with a single DataFrame.
x Val1 Val2
0 1 4 23
1 5 56 27
2 14 10 9
3 27 8 33
The simple method is a for loop over the list, row-binding the results with df.append(), but I'm sure there is a way to do this with the .apply() family of functions; I just can't figure out exactly which one to use. I'm pretty familiar with doing this type of thing in R, and familiar with Python; I just need to get my head around the pandas syntax.
EDIT: More concrete example for clarity
Example function:
def returnsquares(x):
    return pd.DataFrame({"input": [x], "sq": x**2, "cube": x**3})
The input of the function is a scalar; the output is a DataFrame with a single row (not a Series).
Code that works:
result = pd.DataFrame({}, columns=["input", "sq", "cube"])
for entry in x:
    result = result.append(returnsquares(entry))
(The values of the output are obviously not the same as above, but are the same shape). Is there a better method for doing this?
Consider the following function, which returns the same values you show in your example:
def special_function(x):
    idx = ['x', 'Val1', 'Val2']
    d = {
        1: pd.Series([x, 4, 23], idx),
        5: pd.Series([x, 56, 27], idx),
        14: pd.Series([x, 10, 9], idx),
        27: pd.Series([x, 8, 33], idx),
    }
    return d[x].to_frame().T  # a one-row DataFrame, as described in the question
Then you can combine the results into a single DataFrame using pd.DataFrame.from_records:
pd.DataFrame.from_records([special_function(i).squeeze() for i in x])
Or use pd.concat
pd.concat([special_function(i) for i in x])
Or make x a series and use apply
x = pd.Series([1, 5, 14, 27])
x.apply(lambda y: special_function(y).iloc[0])
Be aware of timings: the list comprehension approaches are typically as fast as or faster than apply, so don't fear the list comprehension.
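For instance, a rough comparison can be run with timeit (a sketch only; the statements assume the definitions above are available, and actual numbers depend on your data and pandas version):

import timeit

setup = 'from __main__ import special_function, x, pd'
for stmt in ['pd.DataFrame.from_records([special_function(i).squeeze() for i in x])',
             'pd.concat([special_function(i) for i in x])']:
    # time each approach 100 times and print the total
    print(stmt, '->', timeit.timeit(stmt, setup=setup, number=100))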
I was trying to find the maximum value of a column in a dataframe that contains numpy arrays.
df = pd.DataFrame({'id': [1, 2, 33, 4],
                   'a': [1, 22, 23, 44],
                   'b': [1, 42, 23, 42]})
df['new'] = df.apply(lambda r: tuple(r), axis=1).apply(np.array)
This is how the dataframe looks:
id a b new
0 1 1 1 [1, 1, 1]
1 2 22 42 [2, 22, 42]
2 33 23 23 [33, 23, 23]
3 4 44 42 [4, 44, 42]
Now I want to find the maximum (single) value of the column new; in this case it is 44. Is there a quick and easy way to do this?
Because your new column is constructed from the columns id, a and b, you can take the maximum before you create the new column:
single_max = np.max(df.values)
Or, if your dataframe already contains the new column, drop it before taking the max:
single_max = np.max(df.drop('new', axis=1).values)
You can apply a lambda to the values that calls the array's max method. This would result in a Series that also has a max method.
df['new'].apply(lambda arr: arr.max()).max()
Just guessing, but this should be faster than .apply(max) because it uses the optimized array method instead of converting the numpy ints to Python ints one by one.
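A rough way to check that guess (a sketch only; df as defined in the question, and the numbers will vary by machine):

import timeit

setup = 'from __main__ import df'
# built-in max iterates over the array element by element
print(timeit.timeit("df['new'].apply(max).max()", setup=setup, number=1000))
# arr.max() stays inside numpy
print(timeit.timeit("df['new'].apply(lambda arr: arr.max()).max()", setup=setup, number=1000))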
A possible solution:
df.new.explode().max()
Or a faster alternative:
np.max(np.vstack(df.new.values))
Returns 44.
Assuming you only want to consider the column "new":
import numpy as np
out = np.max(tuple(df['new'])) # or np.max(df['new'].tolist())
Output: 44
If the arrays are stored as strings rather than actual arrays, parse them first:
df.new.map(pd.eval).explode().max()
Output: 44
1- Combination of max and explode()
df['new'].explode().max()
# 44
2- List comprehension
max([max(e) for e in df['new']])
# 44
I have a data frame with intervals like:
begin end
2 4
6 8
9 11
I want to compare a value against each of these intervals. If the value falls within any interval, the result should be 'yes'; otherwise 'no'.
For example: x = 3 => yes (because 2<x<4), x=5 => no
I currently do this with a nested loop over each value of x and each interval, but I have many values of x and many intervals, so the nested loop is really slow. Is there any way I can do this efficiently without a loop? Thank you!
You can use broadcasting to speed up the comparisons:
def check_intervals(x):
    if any((intervals.begin < x) & (x < intervals.end)):
        return 'yes'
    else:
        return 'no'
>>> intervals = pd.DataFrame({'begin': [2, 6, 9], 'end': [4, 8, 11]})
>>> intervals
begin end
0 2 4
1 6 8
2 9 11
>>> check_intervals(3)
'yes'
>>> values = [3, 5, 10, 11]
>>> [check_intervals(x) for x in values]
['yes', 'no', 'yes', 'no']
That should be fast enough to handle a few thousand values per second.
Not sure this improves efficiency at all, but you could create a column indicating whether your value falls in each interval, like so:
df = pd.DataFrame({'a': [2, 6, 9], 'b': [4, 8, 11]})
df['a'] = df['a'].astype(str)
df['b'] = df['b'].astype(str)
df['t'] = df['a'].str.cat(df['b'], sep=",")

def in_between(x, val):
    a, b = x.split(',')
    if val > int(a) and val < int(b):
        return 'yes'
    else:
        return 'no'

val = 3
df['bw'] = df['t'].map(lambda x: in_between(x, val))
df
Here is another take on it, just for the sake of benchmarking, that relies only on numpy, which probably makes it the fastest option at the expense of using more memory. On my machine, this takes around 2 seconds to evaluate 10000 values against 10000 intervals.
import numpy as np
import pandas as pd
N = 10
values = np.random.randn(N) * N # Your x values
df = pd.DataFrame({"begin": [2, 6, 9],
"end": [4, 8, 11]})
data = df.to_numpy()
begin = data[:, 0] < values[np.newaxis].T
end = values[np.newaxis].T < data[:, 1]
answer = np.stack((begin, end), axis=-1)
answer = answer.all(axis=-1).any(axis=-1)
Here answer has the same shape as values. Afterwards, you can simply replace True and False with 'yes' and 'no', respectively.
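For example (a small sketch, with answer and values as defined above):

labels = np.where(answer, 'yes', 'no')  # elementwise: True -> 'yes', False -> 'no'
print(labels)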
I have a pandas data frame with one column containing lists. I wish to divide each list element in each row by a scalar value in another column. In the following example, I wish to divide each element in a by b:
a b
0 [11, 22, 33] 11
1 [12, 24, 36] 2
2 [33, 66, 99] 3
Thus yielding the following result:
a b c
0 [11, 22, 33] 11 [1.0, 2.0, 3.0]
1 [12, 24, 36] 2 [6.0, 12.0, 18.0]
2 [33, 66, 99] 3 [11.0, 22.0, 33.0]
I can achieve this by the following code:
import pandas as pd

df = pd.DataFrame({"a": [[11, 22, 33], [12, 24, 36], [33, 66, 99]],
                   "b": [11, 2, 3]})
result = {"c": []}
for _, row in df.iterrows():
    result["c"].append([x / row["b"] for x in row["a"]])
df_c = pd.DataFrame(result)
df = pd.concat([df, df_c], axis="columns")
But explicit iteration over rows, collecting the results in a dictionary, converting it to a dataframe and then concatenating it to the original data frame seems very inefficient and inelegant.
Does anyone have a better solution?
Thanks in advance and cheers!
PS: In case you are wondering why I would store lists in a column: These are the resulting amplitudes of a Fourier-Transformation.
Why don't I use one column for each frequency?
Creating a new column for each frequency is horribly slow
With different sampling rates and FFT-window sizes in my project, there are multiple sets of frequencies.
Zip the two columns, divide each entry in column a by its corresponding entry in column b through a combination of product and starmap, and convert the iterator back into a list:
from itertools import product, starmap
from operator import truediv

df['c'] = [list(starmap(truediv, product(num, [denom])))
           for num, denom in zip(df.a, df.b)]

a b c
0 [11, 22, 33] 11 [1.0, 2.0, 3.0]
1 [12, 24, 36] 2 [6.0, 12.0, 18.0]
2 [33, 66, 99] 3 [11.0, 22.0, 33.0]
Alternatively, you could just use a numpy array within the iteration:
import numpy as np

df['c'] = [list(np.array(num) / denom) for num, denom in zip(df.a, df.b)]
Thanks to @jezrael for the suggestion: all of this might be unnecessary, as scipy has its own FFT routines; have a look and see if they help.
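For what it's worth, a minimal sketch of getting amplitudes with scipy (the signal array here is just a placeholder, not data from the question):

import numpy as np
from scipy.fft import rfft

signal = np.random.randn(256)        # placeholder time-domain samples
amplitudes = np.abs(rfft(signal))    # amplitude spectrum as a plain numpy array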
I would convert the lists to numpy arrays:
import numpy as np

df['c'] = df['a'].apply(np.array) / df['b']
You will get np.arrays in column c. If you really need lists, you will have to convert them back:
df['c'] = df['c'].apply(list)
I have two dataframes: main_df (cols=['Technology', 'Condition1', 'Condition2']) and database_df (cols=['Technology', 'Values1', 'Values2', 'Character']).
I have grouped the database_df depending on the Technology column:
grouped = database_df.groupby(['Technology'])
Now, what I would like to do is the following: for every row of main_df, take its 'Technology' value, retrieve the relevant group, filter it according to conditions that depend on other column values of main_df, and return the 'Character' value of the first row of database_df that fulfils the conditions.
I.e. I would like to do something like:
grouped = database_df.groupby(['Technology'])
main_df['New column'] = (
    grouped.get_group(main_df['Technology']).loc[
        (grouped.get_group(main_df['Technology'])['Values1'] > main_df['Condition1'])
        & (grouped.get_group(main_df['Technology'])['Values2'] > main_df['Condition2'])
    ]['Character'][0])
However, I cannot pass a pd.Series as an argument to the get_group method. I realise I could probably pass main_df['Technology'] as a string for every entry by applying a lambda function, but I would like to perform this operation in a vectorized way. Is there any way?
MINIMAL VIABLE EXAMPLE:
main_df = pd.DataFrame({'Technology': ['A', 'A', 'B'],
                        'Condition1': [20, 10, 10],
                        'Condition2': [100, 200, 100]})

database_df = pd.DataFrame({'Technology': ['A', 'A', 'A', 'B', 'B', 'B'],
                            'Values1': [10, 20, 30, 10, 20, 30],
                            'Values2': [100, 200, 300, 100, 200, 300],
                            'Character': [1, 2, 3, 1, 2, 3]})
I would like the outcome of the above mentioned operation with these dfs to be:
main_df['New column'] = [3, 3, 2]
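For reference, the per-row apply fallback I mention above would look roughly like this (a non-vectorized sketch using the example data; the lookup helper is purely illustrative):

grouped = database_df.groupby(['Technology'])

def lookup(row):
    group = grouped.get_group(row['Technology'])
    match = group[(group['Values1'] > row['Condition1'])
                  & (group['Values2'] > row['Condition2'])]
    return match['Character'].iloc[0] if not match.empty else None

main_df['New column'] = main_df.apply(lookup, axis=1)  # gives [3, 3, 2] here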
If you want to compare the two DataFrames, use an outer join with the index converted to a column, then filter by the conditions and finally take the first matching value per row:
df = main_df.reset_index().merge(database_df, on='Technology', how='outer')
m = (df['Values1'] > df['Condition1']) & (df['Values2'] > df['Condition2'])
main_df['New column'] = df[m].groupby('index')['Character'].first()
print (main_df)
Technology Condition1 Condition2 New column
0 A 20 100 3
1 A 10 200 3
2 B 10 100 2
Can someone please explain to me why the argmax() function does not work after using sort_values() on my pandas Series?
Below is an example of my code. The indices in the output are based on the original DataFrame, not on the sorted Series.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.idxmax()
df.apply(sec_largest)
Then the output is
a 1
b 3
c 0
dtype: int64
And when I check the Series using xsorted.iloc[0], it gives me the maximum value in the series.
Can someone explain to me how this works? Thank you very much.
The problem is that you are sorting the pandas Series: the index labels are carried along while sorting, and idxmax returns the original index label of the highest value, not the position within the sorted series.
def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.values.argmax()
By using the values of xsorted we work with the underlying numpy array instead of the pandas data structure, and everything works as expected.
If you print xsorted inside the function you can see that the index is carried along with the sorted values:
1 5
0 4
2 3
4 2
3 1
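As an aside, if what you ultimately want is the second-largest value in each column (as the name sec_largest suggests), you can take it from the sorted Series directly, for example:

def sec_largest(x):
    # second row of the descending sort is the second-largest value
    return x.sort_values(ascending=False).iloc[1]

df.apply(sec_largest)
# a     4
# b    40
# c    20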