Pandas: how to compare several cells with a list/tuple - python

I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use the method below:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns that should be compared as a whole. The solution above becomes very messy if I have to list all of them, so I thought comparing them as a whole against a list might solve the problem:
#something just like this:
df.loc[df.loc[:,'A':'B']==[1,4],'A':'B']
That didn't work. So I came up with the idea of first combining all the desired columns into a new column holding a list value, then comparing this new column with the list. The latter part is solved in Pandas: compare list objects in Series.
Although I've generally solved my case, I still want to know if there is an easier way to do this. Thanks.

Or use [[]] to select multiple columns:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
   A  B
0  1  4
>>>

You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
   A  B
0  1  4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
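If there are tens of columns, one way to keep this readable is to hold the expected values in a dict keyed by column name and build the mask from it. A minimal sketch, assuming such a dict (the name expected and its contents are hypothetical):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [4, 5, 6]})

# hypothetical mapping of column -> expected value; extend to as many columns as needed
expected = {'A': 1, 'B': 4}

cols = list(expected)
# compare all listed columns at once and keep rows where every value matches
mask = (df[cols].values == list(expected.values())).all(axis=1)
print(df.loc[mask, cols])
#    A  B
# 0  1  4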

Related

How can I assign a list's elements to corresponding rows of a dataframe in pandas?

I have numbers in a List that should get assigned to certain rows of a dataframe consecutively.
List=[2,5,7,12….]
In my dataframe, which looks similar to the table below, I need to do the following: each row whose frame_index equals 1 gets the next element of List as its "sequence_number". So the first time Frame_Index == 1, assign the first element of List as Sequence_number; the next time Frame_index == 1, assign the second element of List as Sequence_number; and so on.
So my goal is to achieve a new dataframe like this:
I don't know which functions to use. In another language I would use a for loop and check where frame_index == 1, but my dataset is large and I need a pythonic way to achieve this. I appreciate any help.
EDIT: I tried the following to fill in my List values, intending to use fillna with ffill afterwards:
concatenated_df['Sequence_number']=[List[i] for i in
concatenated_df.index if (concatenated_df['Frame_Index'] == 1).any()]
But of course I'm getting a "list index out of range" error.
I think you could do that in two steps:
1. Add a column and fill it with your list where frame_index == 1.
2. Use df.fillna() with the method="ffill" kwarg (or its df.ffill() shorthand).
import pandas as pd
df = pd.DataFrame({"frame_index": [1,2,3,4,1,2]})
sequence = [2,5]
df.loc[df["frame_index"] == 1, "sequence_number"] = sequence
df.ffill(inplace=True) # alias for df.fillna(method="ffill")
This leaves sequence_number as float64, which might be acceptable in your use case; if you want int64, you can force the dtype when creating the column (the df.loc assignment) or cast it afterwards.
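For example, a small sketch of casting it back afterwards (this assumes every row has been filled by the forward fill; if NaNs can remain, use the nullable "Int64" dtype instead):

# after the forward fill, cast the column back to integers
df["sequence_number"] = df["sequence_number"].astype("int64")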

Avoid duplicate count columns with pandas groupby

pandas.DataFrame.groupby(['date','some_category']).agg([np.sum, np.size]) produces a count that is repeated for each sum column. Is it possible to output just a single count column when passing a list of aggregate functions?
a = df_all.groupby(['date','some_category']).sum()
b = df_all.groupby(['date','some_category']).size()
pd.concat([a,b], axis=1)
produces basically what I want but seems awkward.
df.pivot_table(index=['date', 'some_category'], aggfunc=['sum', 'size']) is what I was looking for. This produces a single size column (though I am not sure why it is labeled '0'), rather than a repeated (identical) size for each summed column. Thanks all, I learned some useful things along the way.
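A minimal, self-contained sketch of that call; the frame and column names below are made up for illustration:

import pandas as pd

df_all = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'some_category': ['x', 'x', 'y'],
    'v1': [1, 2, 3],
    'v2': [10, 20, 30],
})

# one 'size' column for the whole group, plus one 'sum' column per value column
res = df_all.pivot_table(index=['date', 'some_category'], aggfunc=['sum', 'size'])
print(res)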

Remove element from every list in a column in pandas DataFrame

I have a pretty simple question, but I'm having trouble achieving what I want.
I have a DataFrame that looks like this:
base
[a,b,c]
[c,d,e]
[a,b,h]
I want to remove the second element of every list, so I would get this:
base
[a,c]
[c,e]
[a,h]
I suppose there's an easy way to do this, but it's not that usual to work with lists in DataFrames, so I'm not finding anything.
Thanks in advance.
Edit: The DataFrame is just one column, which consists of lists, all of the same length. I need to remove one element so that the length of each list matches the number of columns of the DataFrame it will become.
IIUC
df.base = pd.DataFrame(df.base.values.tolist()).drop(1, axis=1).values.tolist()
df
Out[635]:
base
0 [a, c]
1 [c, e]
2 [a, h]
Don't use lists in a Series
Pandas Series are not designed to hold lists. You lose functionality and performance through two layers of pointers: one for your object dtype array, and another for each list within your series.
Since each list has the same number of elements, separate into columns instead:
df = pd.DataFrame({'base': [list('abc'), list('cde'), list('abh')]})
res = pd.DataFrame(df['base'].values.tolist()).iloc[:, [0, 2]]
print(res)
   0  2
0  a  c
1  c  e
2  a  h
You could work on the underlying np.array:
import numpy as np
df['base'] = np.stack(df.base.values)[:, [0, 2]].tolist()
>>> df
base
0 [a, c]
1 [c, e]
2 [a, h]
You can use df['base'].apply(lambda x: x.pop(1)). Note that pop acts in place, so you don't need to assign the result to base (in fact, if you do so, you'll get the removed element instead of the remaining list).
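A quick sketch of that in-place behaviour, using the same made-up data as above:

import pandas as pd

df = pd.DataFrame({'base': [list('abc'), list('cde'), list('abh')]})

# pop(1) mutates each list in place; the value returned by apply is the
# Series of removed elements, not the shortened lists
removed = df['base'].apply(lambda x: x.pop(1))
print(removed.tolist())  # ['b', 'd', 'b']
print(df)                # 'base' now holds [a, c], [c, e], [a, h]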
However, as #jpp says, you should consider using some other data structure, such as a dataframe with multi-index or a three-dimensional numpy array.
And considering your edit, it's probably easier to convert the data to a dataframe with multiple columns, and then delete the extra column, rather than trying to manipulate a column of lists and then turn it into your final dataframe. It may seem simpler to have "only one column", but you're just putting the extra complexity into a separate layer, rather than getting rid of it. Pandas was built around two-dimensional data being represented as columns and rows, not a single column of lists, so you're going out of your way to not use the tools that pandas was built to provide.
Presumably, you had something like this:
data=[['a','b','c'],
['c','d','e'],
['a','b','h']]
And you did something like this:
df = pd.DataFrame({'base':data})
You should instead do
df = pd.DataFrame(data)
df = df[[0,2]]

pandas iterrows changes ints into floats

I'm trying to iterate over the rows of a DataFrame that contains some int64s and some floats. iterrows() seems to be turning my ints into floats, which breaks everything I want to do downstream:
>>> import pandas as pd
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [id for id in df.id]
[10000000000000001, 10000000000000002]
>>> [r['id'] for (idx,r) in df.iterrows()]
[10000000000000000.0, 10000000000000002.0]
Iterating directly over df.id is fine. But through iterrows(), I get different values. Is there a way to iterate over the rows in such a way that I can still index by column name and get all the correct values?
Here's the relevant part of the docs:
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames) [...] To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
Example for your data:
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [t[1] for t in df.itertuples()]
[10000000000000001, 10000000000000002]
If possible you're better off avoiding iteration. Check if you can vectorize your work first.
If vectorization is impossible, you probably want DataFrame.itertuples. That will return an iterable of (named)tuples where the first element is the index label.
In [2]: list(df.itertuples())
Out[2]:
[Pandas(Index=0, id=10000000000000001, prc=1.5),
Pandas(Index=1, id=10000000000000002, prc=2.5)]
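The fields are also available by attribute name, which reads more clearly than positional indexing; a small sketch using the same df as above:

# index=False drops the index label from each tuple; access columns by name
ids = [row.id for row in df.itertuples(index=False)]
print(ids)  # [10000000000000001, 10000000000000002]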
iterrows returns a Series for each row. Since series are backed by numpy arrays, whose elements must all share a single type, your ints were cast as floats.

Adding individual items and sequences of items to dataframes and series

Say I have a dataframe df
import pandas as pd
df = pd.DataFrame()
and I have the following tuple and value:
column_and_row = ('bar', 'foo')
value = 56
How can I most easily add this tuple to my dataframe so that:
df['bar']['foo']
returns 56?
What if I have a list of such tuples and list of values? e.g.
columns_and_rows = [A, B, C, ...]
values = [5, 10, 15]
where A, B and C are tuples of columns and rows (similar to column_and_row).
Along the same lines, how would this be done with a Series?, e.g.:
import pandas as pd
srs = pd.Series()
and I want to add one item to it with index 'foo' and value 2 so that:
srs['foo']
returns 2?
Note:
I know that none of these are efficient ways of creating dataframes or series, but I need a solution that allows me to grow my structures organically in this way when I have no other choice.
For a series, you can do it with append, but you have to create a series from your value first:
>>> print(x)
A    1
B    2
C    3
>>> print(x.append(pandas.Series([8, 9], index=["foo", "bar"])))
A      1
B      2
C      3
foo    8
bar    9
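If you are on a newer pandas where Series.append is no longer available (it was removed in pandas 2.0), pd.concat does the same job; a minimal sketch:

import pandas as pd

x = pd.Series([1, 2, 3], index=["A", "B", "C"])
# concatenate the existing series with a new one built from the extra items
x = pd.concat([x, pd.Series([8, 9], index=["foo", "bar"])])
print(x)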
For a DataFrame, you can also use append or concat, but it doesn't make sense to do this for a single cell only. DataFrames are tabular, so you can only add a whole row or a whole column. The documentation has plenty of examples and there are other questions about this.
Edit: Apparently you actually can set a single value with df.set_value('newRow', 'newCol', newVal). However, if that row/column doesn't already exist, this will actually create an entire new row and/or column, with the rest of the values in the created row/column filled with NaN. Note that in this case a new object will be returned, so you'd have to do df = df.set_value('newRow', 'newCol', newVal) to modify the original.
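On current pandas, where set_value has been removed, label-based assignment with .loc (or .at) gives the same setting-with-enlargement behaviour, and it modifies df in place; a minimal sketch:

import pandas as pd

df = pd.DataFrame()
# assigning to a label that doesn't exist yet creates the row and column,
# filling any other cells in that row/column with NaN
df.loc['foo', 'bar'] = 56
print(df.loc['foo', 'bar'])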
However, no matter how you do it, this is going to be inefficient. Pandas data structures are based on NumPy and are fundamentally reliant on knowing the size of the array ahead of time. You can add rows and columns, but every time you do so, an entirely new data structure is created, so if you do this a lot, it will be slower than using ordinary Python lists/dicts.
