Min of Str Column in Pandas - python

I have a dataframe where one column contains a list of values, e.g.
dict = {'a' : [0, 1, 2], 'b' : [4, 5, 6]}
df = pd.DataFrame(dict)
df.loc[:, 'c'] = -1
df['c'] = df.apply(lambda x: [x.a, x.b], axis=1)
So I get:
a b c
0 0 4 [0, 4]
1 1 5 [1, 5]
2 2 6 [2, 6]
I now would like to save the minimum value of each entry of column c in a new column d, which should give me the following data frame:
a b c d
0 0 4 [0, 4] 0
1 1 5 [1, 5] 1
2 2 6 [2, 6] 2
Somehow, though, I always fail to do it with min() or similar. Right now I am using df.apply(lambda x: min(x['c']), axis=1), but that is too slow in my case. Do you know of a faster way of doing it?
Thanks!

You can get help from numpy:
import numpy as np
df['d'] = np.array(df['c'].tolist()).min(axis=1)
As stated in the comments, if you don't need the column c then:
df['d'] = df[['a','b']].min(axis=1)
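As a quick sanity check (rebuilding the question's toy frame), both routes give the same row-wise minimum:

```python
import numpy as np
import pandas as pd

# Rebuild the question's toy frame
df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 6]})
df['c'] = df.apply(lambda x: [x.a, x.b], axis=1)

# Row-wise minimum via numpy on the list column ...
via_numpy = np.array(df['c'].tolist()).min(axis=1)
# ... and directly on the original columns
via_pandas = df[['a', 'b']].min(axis=1)

print(via_numpy.tolist())   # [0, 1, 2]
print(via_pandas.tolist())  # [0, 1, 2]
```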

Remember that series (like df['c']) are iterable. You can then create a new list and set it as a key, just like you would a dictionary. The list will automatically be cast to a pd.Series object. No need to use fancy pandas functions unless you are dealing with really (really) big data.
df['d'] = [min(c) for c in df['c']]
Edit: update to comments below
df['d'] = [min(c, key=lambda v: v - df.a) for c in df['c']]
This doesn't work because v is a value (in the first iteration it is passed 0, then 4, for example), while df.a is a Series. v - df.a is therefore a new Series with the elements [v - df.a[0], v - df.a[1], ...]. min then tries to compare these Series keys with each other, which produces a boolean Series like [True, False, ...] rather than a single boolean, so pandas raises an error because the comparison is ambiguous. What you need is
df['d'] = [min(c, key=lambda v: v - df['a'][i]) for i, c in enumerate(df['c'])]
# I prefer to use df['a'] rather than df.a
so you subtract each value of df['a'] in turn from v, not the entire series df['a'].
However, subtracting a constant when calculating the minimum will do absolutely nothing, but I'm guessing this is simplified from what you are actually doing. The two samples above will do exactly the same thing.
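To see the constant-shift point concretely, here is the corrected comprehension run end-to-end on the question's toy frame; subtracting a per-row scalar shifts every candidate equally, so the winner is the same as with a plain min:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 6]})
df['c'] = df.apply(lambda x: [x.a, x.b], axis=1)

# The key subtracts a per-row scalar, so the minimum element is unchanged
df['d'] = [min(c, key=lambda v: v - df['a'][i])
           for i, c in enumerate(df['c'])]
print(df['d'].tolist())  # [0, 1, 2]
```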

This is a functional solution.
df['d'] = list(map(min, df['c']))
It works because:
df['c'] is a pd.Series, which is an iterable object.
map is a lazy operator which applies a function to each element of an iterable.
Since map is lazy, we must apply list in order to assign to a series.
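A small illustration of that laziness (a toy list column assumed):

```python
import pandas as pd

df = pd.DataFrame({'c': [[0, 4], [1, 5], [2, 6]]})

m = map(min, df['c'])    # lazy: nothing has been computed yet
df['d'] = list(m)        # list() forces evaluation so pandas can build a column
print(df['d'].tolist())  # [0, 1, 2]
```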


Is there a shorter syntax using multiple index method returns boolean array? (Pandas dataframe) [duplicate]

Let’s say I have the following Pandas dataframe:
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
I can subset based on a specific value:
x = df[df['A'] == 3]
x
A B
2 3 3
But how can I subset based on a list of values? - something like this:
list_of_values = [3,6]
y = df[df['A'] in list_of_values]
To get:
A B
1 6 2
2 3 3
You can use the isin method:
In [1]: df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
In [2]: df
Out[2]:
A B
0 5 1
1 6 2
2 3 3
3 4 5
In [3]: df[df['A'].isin([3, 6])]
Out[3]:
A B
1 6 2
2 3 3
And to get the opposite use ~:
In [4]: df[~df['A'].isin([3, 6])]
Out[4]:
A B
0 5 1
3 4 5
You can use the method query:
df.query('A in [6, 3]')
# df.query('A == [6, 3]')
or
lst = [6, 3]
df.query('A in @lst')
# df.query('A == @lst')
Another method:
df.loc[df.apply(lambda x: x.A in [3,6], axis=1)]
Unlike the isin method, this is particularly useful in determining if the list contains a function of the column A. For example, with f(A) = 2*A - 5 as the function:
df.loc[df.apply(lambda x: 2*x.A-5 in [3,6], axis=1)]
It should be noted that this approach is slower than the isin method.
You can store your values in a list as:
lis = [3,6]
then
df1 = df[df['A'].isin(lis)]
list_of_values doesn't have to be a list; it can be a set, tuple, dictionary, numpy array, pandas Series, generator, range, etc., and isin() and query() will still work.
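For instance (a sketch on the question's frame), the same selection works with several container types:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})

# All of these select the same rows (index labels 1 and 2):
print(df[df['A'].isin([3, 6])].index.tolist())          # list  -> [1, 2]
print(df[df['A'].isin({3, 6})].index.tolist())          # set   -> [1, 2]
print(df[df['A'].isin((3, 6))].index.tolist())          # tuple -> [1, 2]
print(df[df['A'].isin(range(3, 7, 3))].index.tolist())  # range -> [1, 2]
```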
Some common problems with selecting rows
1. list_of_values is a range
If you need to filter within a range, you can use between() method or query().
list_of_values = [3, 4, 5, 6] # a range of values
df[df['A'].between(3, 6)] # or
df.query('3<=A<=6')
2. Return df in the order of list_of_values
In the OP, the values in list_of_values don't appear in that order in df. If you want df to return in the order they appear in list_of_values, i.e. "sort" by list_of_values, use loc.
list_of_values = [3, 6]
df.set_index('A').loc[list_of_values].reset_index()
If you want to retain the old index, you can use the following.
list_of_values = [3, 6, 3]
df.reset_index().set_index('A').loc[list_of_values].reset_index().set_index('index').rename_axis(None)
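Run end-to-end on the question's frame, the set_index/loc trick returns the rows in the order of list_of_values:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
list_of_values = [3, 6]

out = df.set_index('A').loc[list_of_values].reset_index()
print(out['A'].tolist())  # [3, 6] -- same order as list_of_values
print(out['B'].tolist())  # [3, 2]
```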
3. Don't use apply
In general, isin() and query() are the best methods for this task; there's no need for apply(). For example, for function f(A) = 2*A - 5 on column A, both isin() and query() work much more efficiently:
df[(2*df['A']-5).isin(list_of_values)] # or
df[df['A'].mul(2).sub(5).isin(list_of_values)] # or
df.query("A.mul(2).sub(5) in @list_of_values")
4. Select rows not in list_of_values
To select rows not in list_of_values, negate isin()/in:
df[~df['A'].isin(list_of_values)]
df.query("A not in @list_of_values") # df.query("A != @list_of_values")
5. Select rows where multiple columns are in list_of_values
If you want to filter using both (or multiple) columns, there's any() and all() to reduce columns (axis=1) depending on the need.
Select rows where at least one of A or B is in list_of_values:
df[df[['A','B']].isin(list_of_values).any(axis=1)]
df.query("A in @list_of_values or B in @list_of_values")
Select rows where both of A and B are in list_of_values:
df[df[['A','B']].isin(list_of_values).all(axis=1)]
df.query("A in @list_of_values and B in @list_of_values")
Bonus:
You can also call isin() inside query():
df.query("A.isin(@list_of_values).values")
It's trickier with f-strings:
list_of_values = [3,6]
df.query(f'A in {list_of_values}')
The above answers are correct, but if you still are not able to filter out rows as expected, make sure both DataFrames' columns have the same dtype.
source = source.astype({1: 'int64'})
to_rem = to_rem.astype({'some col': 'int64'})
works = source[~source[1].isin(to_rem['some col'])]
Took me long enough.
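A minimal sketch of the dtype pitfall (the column and variable names are made up for illustration):

```python
import pandas as pd

source = pd.DataFrame({'key': ['1', '2', '3']})  # strings, perhaps from a CSV
to_rem = pd.Series([1, 3])                       # ints

# Silently matches nothing, because '1' != 1
print(source[source['key'].isin(to_rem)].empty)  # True

# Align the dtypes first, then filter as usual
source['key'] = source['key'].astype('int64')
print(source[~source['key'].isin(to_rem)]['key'].tolist())  # [2]
```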
A non-pandas alternative that may compare favourably in speed (note that it returns the set of values of A that are not in the list, rather than rows of the DataFrame):
filtered_column = set(df.A) - set(list_of_values)

Can a pd.Series be assigned to a column in an out-of-order pd.DataFrame without mapping to index (i.e. without reordering the values)?

I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
       a      b
2  alpha  gamma
0   beta  alpha
1  gamma   beta
I think this is happening because the pd.Series has an index [0, 1, 2] which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order" ignoring index:
       a      b
2  alpha  alpha
0   beta   beta
1  gamma  gamma
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
   num  num_times_two
2    1              6
0    2              2
1    3              4
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
If you don't want dtype conversions between pandas and numpy (for example, with datetimes), you can set the index of the Series to match the index of the DataFrame before assigning it to a column:
either with .set_axis()
The original Series will have its index preserved - by default this operation is not in place:
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
or you can change the index of the original Series:
ser.index = df.index  # or: ser = ser.set_axis(df.index); note that set_axis(..., inplace=True) was removed in pandas 2.0
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
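Side by side, on the question's frame, index-aligned assignment versus positional assignment:

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])
ser = pd.Series(['alpha', 'beta', 'gamma'])

df['aligned'] = ser                # matched by index label: shuffled
df['positional'] = ser.to_numpy()  # no index to align by: order preserved

print(df['aligned'].tolist())      # ['gamma', 'alpha', 'beta']
print(df['positional'].tolist())   # ['alpha', 'beta', 'gamma']
```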
The new column assignment is aligned on the index, whether or not that is what you intended. Do you need to maintain the old indexes?
If the answer is no, you can simply reset the index before adding the new column:
df = df.reset_index(drop=True)
In your example, I don't see any reason to make it a new Series (even if something strips the index, like converting to a list):
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
num num_times_two
2 1 2
0 2 4
1 3 6

How to sum columns containing arrays

I have a problem summing the columns of a dataframe that contains arrays in each cell.
I tried to sum the columns using df.sum(), expecting to get the total column array, for example [4,1,1,4,1] for the column 'common'.
But I got only an empty Series.
df_sum = df.sum()
print(df_sum)
Series([], dtype: float64)
How can I get the summarized column in this case?
Well, working with object dtypes in pandas DataFrames is usually not a good idea, especially filling cells with Python lists, because you lose performance.
Nevertheless, you may accomplish this by using itertools.chain.from_iterable:
import itertools as it
df.apply(lambda s: list(it.chain.from_iterable(s.dropna())))
You may also use sum, but I'd say it's slower
df.apply(lambda s: s.dropna().sum())
I can see why you'd think df.sum would work here, even when setting skipna=True explicitly, but the vectorized df.sum shows weird behavior in this situation. Then again, these are the downsides of using a DataFrame with lists in it.
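Here is the chain approach as a runnable sketch, using a toy frame with NaNs like the one in the example further below:

```python
import itertools as it

import numpy as np
import pandas as pd

df = pd.DataFrame({'d1': [np.nan, [1, 2], [4]], 'd2': [[3], np.nan, np.nan]})

# For each column, drop the NaNs and concatenate the remaining lists
flat = df.apply(lambda s: list(it.chain.from_iterable(s.dropna())))
print(flat['d1'])  # [1, 2, 4]
print(flat['d2'])  # [3]
```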
IIUC, you can probably just use list comprehension to handle your task:
df = pd.DataFrame({'d1':[np.nan, [1,2], [4]], 'd2':[[3], np.nan, np.nan]})
>>> df
d1 d2
0 NaN [3]
1 [1, 2] NaN
2 [4] NaN
df_sum = [i for a in df['d1'] if type(a) is list for i in a]
>>> df_sum
[1, 2, 4]
If you need to do sum on the whole DataFrame (or multiple columns), then use numpy.ravel() to flatten the dataframe before using the list comprehension.
df_sum = [i for a in np.ravel(df.values) if type(a) is list for i in a]
>>> df_sum
[3, 1, 2, 4]

Loop through subsets of rows in a DataFrame

I am trying to loop through the rows of a DataFrame with a function that calculates the most frequent element in a series. The function works perfectly when I manually supply a series to it:
# Create DataFrame
df = pd.DataFrame({'a' : [1, 2, 1, 2, 1, 2, 1, 1],
'b' : [1, 1, 2, 1, 1, 1, 2, 2],
'c' : [1, 2, 2, 1, 2, 2, 2, 1]})
# Create function calculating most frequent element
from collections import Counter
def freq_value(series):
    return Counter(series).most_common()[0][0]
# Test function on one row
freq_value(df.iloc[1])
# Another test
freq_value((df.iloc[1, 0], df.iloc[1, 1], df.iloc[1, 2]))
With both tests I get the desired result. However, when I try to apply this function in a loop through the DataFrame rows and save the result into a new column, I get the error "'Series' object is not callable", 'occurred at index 0'. The line producing the error is as follows:
# Loop through rows of a dataframe and write the result into a new column
df['result'] = df.apply(lambda row: freq_value((row('a'), row('b'), row('c'))), axis = 1)
How exactly does row work inside apply()? Shouldn't it supply my freq_value() function with the values from columns 'a', 'b' and 'c'?
@jpp's answer addresses how to apply your custom function, but you can also get the desired result using df.mode with axis=1. This avoids apply entirely and still gives you a column with the most common value of each row.
df['result'] = df.mode(axis=1)[0]
>>> df
a b c result
0 1 1 1 1
1 2 1 2 2
2 1 2 2 2
3 2 1 1 1
4 1 1 2 1
5 2 1 2 2
6 1 2 2 2
7 1 2 1 1
row is not a function within your lambda, so the parentheses are not appropriate. Instead, you should use the __getitem__ method or the loc accessor to access values. The syntactic sugar for the former is []:
df['result'] = df.apply(lambda row: freq_value((row['a'], row['b'], row['c'])), axis=1)
Using the loc alternative:
def freq_value_calc(row):
    return freq_value((row.loc['a'], row.loc['b'], row.loc['c']))
To understand exactly why this is the case, it helps to rewrite your lambda as a named function:
def freq_value_calc(row):
    print(type(row))  # useful for debugging
    return freq_value((row['a'], row['b'], row['c']))

df['result'] = df.apply(freq_value_calc, axis=1)
Running this, you'll find that row is of type <class 'pandas.core.series.Series'>, i.e. a series indexed by column labels if you use axis=1. To access the value in a series for a given label, you can either use __getitem__ / [] syntax or loc.
df['CommonValue'] = df.apply(lambda x: x.mode()[0], axis = 1)

Passing row and column name to get value [duplicate]

I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
A B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A -0.133653
B -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
A B C
0 -0.074172 -0.090626 0.038272
1 -0.128545 0.762088 -0.714816
2 0.201498 -0.734963 0.558397
3 1.563307 -1.186415 0.848246
4 0.205171 0.962514 0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
Most answers are using iloc which is good for selection by position.
If you need selection-by-label, loc would be more convenient.
For getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A'), and also to df1.at['a', 'A']):
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like behaviour changed somewhere between pandas 0.10.1 and 0.13.1.
I upgraded from 0.10.1 to 0.13.1. Before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] gets a single-value array rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26    118.2
Name: Close, dtype: float64
The quickest and easiest options I have found are the following (501 represents the row index):
df.at[501, 'column_name']
df.get_value(501, 'column_name')  # deprecated since pandas 0.21, removed in 1.0
In later versions, you can fix it by simply doing:
val = float(d2['col_name'].iloc[0])
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is a good practice, but I noticed I can also get just the value by casting the series as float.
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
],
index=pd.MultiIndex.from_tuples( [('i', 1), ('ii', 2), ('iii', 3)] ),
columns=pd.MultiIndex.from_tuples( [('A', 'a'), ('B', 'b'), ('C', 'c')] )
)
> df
A B C
a b c
i 1 1 2 3
ii 2 4 5 6
iii 3 7 8 9
> df.loc['ii', 'B']
b
2 5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you don't need conditionals), you then still, AFAIK, need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first row data for the column VALUE:
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')
If the filter matches more than one row, this obtains the first row's value. An exception will be raised if the filter results in an empty data frame.
Converting it to integer worked for me:
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series), and it only works if there is a single element selected. It's much safer than .values[0] which will return the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
