How to sum columns containing arrays - python

I am having trouble summing the columns of a DataFrame that contains arrays in each cell.
I tried df.sum(), expecting to get the combined array for each column, for example [4,1,1,4,1] for the column 'common'.
But I only got an empty Series.
df_sum = df.sum()
print(df_sum)
Series([], dtype: float64)
How can I get the summed column in this case?

Well, working with object dtypes in pandas DataFrames is usually not a good idea, especially filling cells with Python lists, because you lose performance.
Nevertheless, you can accomplish this with itertools.chain.from_iterable:
import itertools as it
df.apply(lambda s: list(it.chain.from_iterable(s.dropna())))
You may also use sum, but I'd say it's slower
df.apply(lambda s: s.dropna().sum())
I can see why you'd think df.sum would work here, even setting skipna=True explicitly, but the vectorized df.sum shows a weird behavior in this situation. But then again, these are the downsides of using a DataFrame with lists in it
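As a rough sketch of the chain.from_iterable idea, the snippet below flattens each column with a plain dict comprehension instead of df.apply, so the shape of the result does not depend on how apply wraps list results. The sample data and the name flattened are made up for illustration.

```python
import itertools as it

import numpy as np
import pandas as pd

# Hypothetical data: each cell holds either a list or NaN.
df = pd.DataFrame({
    'd1': [np.nan, [1, 2], [4]],
    'd2': [[3], np.nan, np.nan],
})

# Concatenate the lists in each column, skipping the missing cells.
flattened = {
    col: list(it.chain.from_iterable(df[col].dropna()))
    for col in df.columns
}
print(flattened)
```

The result is an ordinary dict of lists keyed by column name, which can be wrapped in pd.Series(flattened) if a Series is needed.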

IIUC, you can probably just use list comprehension to handle your task:
df = pd.DataFrame({'d1':[np.nan, [1,2], [4]], 'd2':[[3], np.nan, np.nan]})
>>> df
d1 d2
0 NaN [3]
1 [1, 2] NaN
2 [4] NaN
df_sum = [i for a in df['d1'] if type(a) is list for i in a]
>>> df_sum
[1, 2, 4]
If you need to sum over the whole DataFrame (or multiple columns), use numpy.ravel() to flatten the DataFrame before applying the list comprehension.
df_sum = [i for a in np.ravel(df.values) if type(a) is list for i in a]
>>> df_sum
[3, 1, 2, 4]

Related

Can a pd.Series be assigned to a column in an out-of-order pd.DataFrame without mapping to index (i.e. without reordering the values)?

I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
index  a      b
2      alpha  gamma
0      beta   alpha
1      gamma  beta
I think this is happening because the pd.Series has an index [0, 1, 2] which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order" ignoring index:
index  a      b
2      alpha  alpha
0      beta   beta
1      gamma  gamma
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
index  num  num_times_two
2      1    6
0      2    2
1      3    4
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
If you don't want dtype conversions between pandas and numpy (for example, with datetimes), you can give the Series the same index as the DataFrame before assigning it to a column.
Either use .set_axis(), which returns a new Series and leaves the index of the original Series unchanged:
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
or you can change the index of the original Series:
ser.index = df.index # before pandas 2.0, ser.set_axis(df.index, inplace=True) was an alternative; the inplace argument has since been removed
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
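To make the difference concrete, here is a small sketch that contrasts index-aligned assignment with the two workarounds described above. It reuses the data from the question; the variable names aligned, positional and relabelled are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])
ser = pd.Series(['alpha', 'beta', 'gamma'])  # default index 0, 1, 2

# Assigning a Series aligns on index labels, so the values get reordered.
aligned = df.assign(b=ser)

# A numpy array has no index, so the values are assigned by position.
positional = df.assign(b=ser.to_numpy())

# Relabelling the Series with the DataFrame's index also keeps the order.
relabelled = df.assign(b=ser.set_axis(df.index))

print(aligned)
print(positional)
print(relabelled)
```

Both workarounds give the same column here; set_axis keeps the data in a pandas Series, which is what avoids the round trip through numpy.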
I don't know if it is on purpose, but new column assignment is based on the index. Do you need to keep the old index?
If the answer is no, you can simply reset the index before adding the new column. Note that reset_index returns a new DataFrame, so assign the result back:
df = df.reset_index(drop=True)
In your example, I don't see any reason to make it a new Series? (Even if something strips the index, like converting to a list)
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
.assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
num num_times_two
2 1 2
0 2 4
1 3 6

Why get different results when comparing two dataframes?

I am comparing two DataFrames. .equals() gives me False, but if I append the two DataFrames together and use drop_duplicates(), it gives me nothing. Can someone explain this?
TL;DR
These are completely different operations and I'd have never expected them to produce the same results.
pandas.DataFrame.equals
Will return a boolean value depending on whether Pandas determines that the dataframes being compared are the "same". That means that the index of one is the "same" as the index of the other, the columns of one are the "same" as the columns of the other, and the data of one is the "same" as the data of the other.
See docs
It is NOT the same as pandas.DataFrame.eq which will return a dataframe of boolean values.
Setup
Consider these three dataframes
df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])
df2 = pd.DataFrame([[0, 1], [2, 3]], ['foo', 'bar'], ['A', 'B'])
df0 df1 df2
A B B A A B
0 0 1 0 1 0 foo 0 1
1 2 3 1 3 2 bar 2 3
If we check whether df1 equals df0, we get
df0.equals(df1)
False
Even though all elements are the same
df0.eq(df1).all().all()
True
And that is because the columns are not aligned. If I sort the columns then ...
df0.equals(df1.sort_index(axis=1))
True
pandas.DataFrame.drop_duplicates
Compares the values in rows and doesn't care about the index.
So, both of these produce the same looking results
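df0.append(df2).drop_duplicates() # pandas < 2.0; in 2.0+ use pd.concat([df0, df2]).drop_duplicates()
and
df0.append(df1, sort=True).drop_duplicates() # pandas < 2.0; in 2.0+ use pd.concat([df0, df1], sort=True).drop_duplicates()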
A B
0 0 1
1 2 3
When I append (or use pandas.concat), Pandas aligns the columns and adds the appended dataframe as new rows. Then drop_duplicates does its thing. But it was the inherent aligning of the columns that did what I did above with sort_index and axis=1.
Maybe the rows in the two dataframes are not ordered the same way? Dataframes will be equal when the rows corresponding to the same index are the same.
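A short sketch of the same comparison, written with pd.concat because DataFrame.append is no longer available in pandas 2.0 and later. The dataframes are the ones from the setup above; the result names are illustrative.

```python
import pandas as pd

df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])

# equals() is strict about column order, so this is False ...
same_as_is = df0.equals(df1)
# ... until the columns are put in the same order.
same_after_sort = df0.equals(df1.sort_index(axis=1))

# concat aligns columns by name, so the rows of df1 become exact
# duplicates of the rows of df0 and drop_duplicates removes them.
deduped = pd.concat([df0, df1]).drop_duplicates()

print(same_as_is, same_after_sort)
print(deduped)
```

So a False from equals and a fully collapsed result from drop_duplicates are compatible: the first checks labels and layout, the second only compares row values after alignment.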

How to locate a row containing a particular list in a DataFrame

Let's have this DataFrame
d = {'col1': [[0,1], [0,2], [1,2], [2,3]], 'col2': ["a", "b", "c", "d"]}
df = pandas.DataFrame(data=d)
col1 col2
0 [0, 1] a
1 [0, 2] b
2 [1, 2] c
3 [2, 3] d
Now I need to find a particular list in col1 and return the value from col2 of that line
For example I want to lookup [0,2] and get "b" in return
I have read this thread about how to do it: extract column value based on another column pandas dataframe
But when I try to apply the answers there, I don't get the result I need
df.loc[df['col1'] == [0,2], 'col2']
ValueError: Arrays were different lengths: 4 vs 2
df.query('col1==[0,2]')
SystemError: <built-in method view of numpy.ndarray object at 0x000000000D67FA80> returned a result with an error set
One possible solution is to compare tuples or sets (note that the set comparison ignores element order and duplicates):
mask = df['col1'].apply(tuple) == tuple([0,2])
mask = df['col1'].apply(set) == set([0,2])
Or compare as arrays, if every value in the Series has the same length and the list or array being compared has that length too:
mask = (np.array(df['col1'].values.tolist())== [0,2]).all(axis=1)
s = df.loc[mask, 'col2']
print (s)
1 b
Name: col2, dtype: object
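As a variation on the same idea, the sketch below builds the mask with an element-wise list comparison through apply, which avoids relying on how a Series compares against a tuple or set. The name target is an assumption introduced for the example.

```python
import pandas as pd

d = {'col1': [[0, 1], [0, 2], [1, 2], [2, 3]], 'col2': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

target = [0, 2]  # the list to look up

# Compare each cell with the target list one at a time.
mask = df['col1'].apply(lambda cell: cell == target)

matches = df.loc[mask, 'col2']
print(matches.tolist())
```

Because the comparison is plain list equality, both order and length have to match for a row to be selected.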
Not sure if you can do logical indexing in pandas DataFrames with values that are neither numeric nor strings. Here's a simple one-line workaround that compares strings instead of lists.
df.loc[df['col1'].apply(str) == str([0,1]), 'col2'].iloc[0]
Essentially, this converts all the lists in column 1 to strings and then compares them to the string str([0,1]).
Note the .iloc[0] at the end of the line. This is because more than one of the rows might contain the list [0,1]; it selects the first value that shows up, by position rather than by index label.

Min of Str Column in Pandas

I have a dataframe where one column contains a list of values, e.g.
dict = {'a' : [0, 1, 2], 'b' : [4, 5, 6]}
df = pd.DataFrame(dict)
df.loc[:, 'c'] = -1
df['c'] = df.apply(lambda x: [x.a, x.b], axis=1)
So I get:
a b c
0 0 4 [0, 4]
1 1 5 [1, 5]
2 2 6 [2, 6]
I now would like to save the minimum value of each entry of column c in a new column d, which should give me the following data frame:
a b c d
0 0 4 [0, 4] 0
1 1 5 [1, 5] 1
2 2 6 [2, 6] 2
Somehow, though, I always fail to do it with min() or similar. Right now I am using df.apply(lambda x: min(x['c']), axis=1). But that is too slow in my case. Do you know of a faster way of doing it?
Thanks!
You can get help from numpy:
import numpy as np
df['d'] = np.array(df['c'].tolist()).min(axis=1)
As stated in the comments, if you don't need the column c then:
df['d'] = df[['a','b']].min(axis=1)
Remember that a Series (like df['c']) is iterable. You can build a new list from it and assign that list to a new column, using the same syntax you would use to set a key in a dictionary. The list will automatically be cast to a pd.Series object. No need to use fancy pandas functions unless you are dealing with really (really) big data.
df['d'] = [min(c) for c in df['c']]
Edit: update to comments below
df['d'] = [min(c, key=lambda v: v - df.a) for c in df['c']]
This doesn't work because v is a single value (in the first iteration it is 0, then 4, for example), whereas df.a is a Series. v - df.a is therefore a new Series with the elements [v - df.a[0], v - df.a[1], ...]. min then tries to compare these Series keys, which amounts to testing the truth value of a boolean Series like [True, False, ...]; pandas raises an error for that because the truth value is ambiguous. What you need is
df['d'] = [min(c, key=lambda v: v - df['a'][i]) for i, c in enumerate(df['c'])]
# I prefer to use df['a'] rather than df.a
so you take each value of df['a'] in turn from v, not the entire series df['a'].
However, subtracting a constant when calculating the minimum will do absolutely nothing, but I'm guessing this is simplified from what you are actually doing. The two samples above will do exactly the same thing.
This is a functional solution.
df['d'] = list(map(min, df['c']))
It works because:
df['c'] is a pd.Series, which is an iterable object.
map is a lazy operator which applies a function to each element of an iterable.
Since map is lazy, we must apply list in order to assign to a series.
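The approaches suggested above can be checked against each other with a small sketch like the one below. It builds column c directly from a list of lists rather than with apply, and the names d_numpy, d_listcomp and d_map are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [0, 1, 2],
    'b': [4, 5, 6],
    'c': [[0, 4], [1, 5], [2, 6]],  # one list per row
})

# numpy: stack the equal-length lists into a 2-D array, reduce per row.
d_numpy = np.array(df['c'].tolist()).min(axis=1)

# Plain Python: a list comprehension and the equivalent map() call.
d_listcomp = [min(c) for c in df['c']]
d_map = list(map(min, df['c']))

df['d'] = d_numpy
print(df)
```

The numpy version assumes every list in c has the same length; the pure-Python versions do not need that.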

Make Tuples from Specific Pandas Columns

I have a pandas dataframe, e.g.
one two three four five
0 1 2 3 4 5
1 1 1 1 1 1
What I would like is to be able to convert only a select number of columns to a list, such that we obtain:
[[1,2],[1,1]]
This is the rows 0,1, where we are selecting columns one and two.
Similarly if we selected columns one, two, four:
[[1,2,4],[1,1,1]]
Ideally I would like to avoid iteration of rows as it is slow!
You can select just those columns with:
In [11]: df[['one', 'two']]
Out[11]:
one two
0 1 2
1 1 1
and get the list of lists from the underlying numpy array using tolist:
In [12]: df[['one', 'two']].values.tolist()
Out[12]: [[1, 2], [1, 1]]
In [13]: df[['one', 'two', 'four']].values.tolist()
Out[13]: [[1, 2, 4], [1, 1, 1]]
Note: this should never really be necessary unless this is your end game... it's going to be much more efficient to do the work inside pandas or numpy.
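For completeness, a minimal sketch of the column selection above, plus one way to get actual tuples (as in the title) using itertuples. The dataframe is rebuilt from the example in the question; as_lists and as_tuples are made-up names.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [1, 1, 1, 1, 1]],
    columns=['one', 'two', 'three', 'four', 'five'],
)

cols = ['one', 'two', 'four']

# List of lists from the underlying numpy array.
as_lists = df[cols].values.tolist()

# Plain tuples, one per row; name=None disables the namedtuple wrapper.
as_tuples = list(df[cols].itertuples(index=False, name=None))

print(as_lists)
print(as_tuples)
```

Both calls only touch the selected columns, so no explicit Python loop over the rows is needed.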
So I worked out how to do it.
Firstly we select the columns we would like the values from:
y = x[['one','two']]
This gives us a subset df.
Now we can choose the values:
> y.values
array([[1, 2],
[1, 1]])
