pandas selection by callables - python

EDIT: Turns out my initial question was simply a versionitis issue. However, in the course of answering my initial question a few other questions were addressed, so I've reworded the questions and listed them below:
I'm familiarizing myself with some pandas capabilities, namely selection by callables. The docs advise use of lambda functions, e.g. to extract all samples in dataframe df1 with value > 0 for feature 'A':
df1.loc[lambda df: df.A > 0, :]
Is there a more compact, pythonic way to do this?
Let's say df1 is now a dataframe with feature A, but the values are mixed doubles and triples (2- and 3-tuples). How can I extract the samples which contain only doubles? I tried doing this as df1.loc[len(df1.A)>2,:], but it's clear that pandas doesn't broadcast the values the way I expect.

You have to restart your IDE so the upgraded pandas version is actually picked up.
For your other question, use apply with len:
import pandas as pd
data = {'A': [(1,2), (1,2), (1,2), (1,2), (1,2,4), (1,2,3)],
        'B': [13, 98, 23, 45, 64, 10]}
df = pd.DataFrame(data)
print (df)
           A   B
0     (1, 2)  13
1     (1, 2)  98
2     (1, 2)  23
3     (1, 2)  45
4  (1, 2, 4)  64
5  (1, 2, 3)  10
print (df[df.A.apply(len) > 2])
           A   B
4  (1, 2, 4)  64
5  (1, 2, 3)  10
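If you instead want the doubles the question asks for, flip the condition: df[df.A.apply(len) == 2].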

You can do what you want without the lambda function, as follows:
df1.loc[df1.A > 0, :]
The docs aren't so much outdated as aimed at a different use case; the callable form earns its keep in method chains, as sketched below.
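Where the callable form does pay off is method chaining, since the intermediate result has no name you can index with. A minimal sketch:
import pandas as pd

df1 = pd.DataFrame({'A': [1, -2, 3], 'B': [4, 5, 6]})

# with a named frame, plain boolean indexing is the compact option
df1.loc[df1.A > 0, :]

# in a chain the intermediate frame is anonymous, so a callable helps
(df1.rename(columns=str.lower)
    .loc[lambda df: df.a > 0, :])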

Related

Pythonic way of making combinations in groups

Given I have the following list:
group  code
A      1
A      2
A      3
B      4
B      5
B      6
B      7
How do I create the following list in a pythonic way?:
group  code  code
A      1     2
A      1     3
A      2     3
B      4     5
B      4     6
B      4     7
B      5     6
B      5     7
B      6     7
I saw another question that suggests using itertools' combinations. But how do I get past the grouping restriction? I don't want all matches, just the ones within each group.
You need to use itertools.combinations to get all possible code combinations for each group:
from itertools import combinations
A = [1, 2, 3]
B = [4, 5, 6, 7]
comb_A = combinations(A, 2)
comb_B = combinations(B, 2)
# to see the results, iterate through the combinations
# for A (the same applies for B)
for comb in comb_A:
    print(comb)
>>> (1, 2)
>>> (1, 3)
>>> (2, 3)
Note: It would be more helpful if you could provide the actual "list", so we could give a more specific answer.
Since you didn't post a MWE of what you tried, I'll show you steps that you can implement yourself.
Build a dictionary of data groups, say group_dict.
Create an empty list for result.
Iterate through the items in group_dict, where each item is a group name and a list of codes for that group.
For each group, use the combinations function to generate all possible combinations of 2 codes.
For each combination, append a tuple of the group name, the first code, and the second code to result.
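A minimal sketch of those steps, assuming the input has already been collected into a dictionary (here called group_dict, reconstructed from the question's table):
from itertools import combinations

# groups reconstructed from the question's table
group_dict = {'A': [1, 2, 3], 'B': [4, 5, 6, 7]}

result = []
for group, codes in group_dict.items():
    # all unordered pairs of codes within this one group
    for first, second in combinations(codes, 2):
        result.append((group, first, second))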
Printing result should then give you something like this:
[('A', 1, 2), ('A', 1, 3), ('A', 2, 3), ('B', 4, 5), ('B', 4, 6), ('B', 4, 7), ('B', 5, 6), ('B', 5, 7), ('B', 6, 7)]

Selecting a subset based on multiple slices in pandas/NumPy?

I want to select a subset of some pandas DataFrame columns based on several slices.
In [1]: df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})
        df.head()
Out[1]:
          A         B         C
0  0.745487  0.146733  0.594006
1  0.212324  0.692727  0.244113
2  0.954276  0.318949  0.199224
3  0.606276  0.155027  0.247255
4  0.155672  0.464012  0.229516
Something like:
In [2]: df.loc[[slice(1, 4), slice(42, 44)], ['B', 'C']]
Expected output:
Out[2]:
           B         C
1   0.692727  0.244113
2   0.318949  0.199224
3   0.155027  0.247255
42  0.335285  0.000997
43  0.019172  0.237810
I've seen that NumPy's r_ object can help when wanting to use multiple slices, e.g.:
In [3]: arr = np.array([1, 2, 3, 4, 5, 5, 5, 5])
arr[np.r_[1:3, 4:6]]
Out[3]: array([2, 3, 5, 5])
But I can't get this to work with a predefined collection (list) of slices. Ideally I would like to be able to specify a collection of ranges/slices and subset based on it. It doesn't seem like r_ accepts iterables? I've seen that one could, for example, create an array with hstack and then use it as an index, like:
In [4]: idx = np.hstack((np.arange(1, 4), np.arange(42, 44)))
df.loc[idx, ['B', 'C']]
Out[4]:
           B         C
1   0.692727  0.244113
2   0.318949  0.199224
3   0.155027  0.247255
42  0.335285  0.000997
43  0.019172  0.237810
Which gets me what I need, but is there any other faster/cleaner/preferred/whatever way of doing this?
A bit late, but it might also help others:
pd.concat([df.loc[sl, ['B', 'C']] for sl in [slice(1, 4), slice(42, 44)]])
This also works when you are dealing with other slices, e.g. time windows.
You can do:
df.loc[list(range(1, 4)) + list(range(42, 44)), ['B', 'C']]
which took about a quarter of the time of your np.hstack option.
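For what it's worth, np.r_ can take a predefined collection of slices if you pass it as a tuple, since indexing with a tuple is exactly what the literal np.r_[1:4, 42:44] expands to. A sketch, reusing the df from the question:
import numpy as np
# a tuple (not a list) makes np.r_ treat each element as a
# separate slice rather than as array data
slices = (slice(1, 4), slice(42, 44))
df.loc[np.r_[slices], ['B', 'C']]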

Pandas - Filter dataframe using list of tuples

I need to create a function that filters a dataframe using a list of tuples - taking as arguments a dataframe and a tuple list, as follows:
tuplelist=[('A', 5, 10), ('B', 0, 4),('C', 10, 11)]
What is the proper way to do this?
I have tried the following:
def multcolfilter(data_frame, tuplelist):
    def apply_single_cond(df_0, cond):
        df_1 = df_0[(df_0[cond[0]] > cond[1]) & (df_0[cond[0]] < cond[2])]
        return df_1
    for x in range(len(tuplelist)-1):
        df = apply_single_cond(apply_single_cond(data_frame, tuplelist[x-1]), tuplelist[x])
    return df
Example dataframe and tuplelist:
df = pd.DataFrame({'A':range(1,10), 'B':range(1,10), 'C':range(1,10)})
tuplelist=[('A', 2, 10), ('B', 0, 4),('C', 3, 5)]
Instead of working with tuples, create a dictionary from them:
filters = {x[0]:x[1:] for x in tuplelist}
print(filters)
{'A': (5, 10), 'B': (0, 4), 'C': (10, 11)}
You can use pd.cut to bin the values of the dataframe's columns:
rows = np.asarray([~pd.cut(df[i], filters[i], retbins=False, include_lowest=True).isnull()
                   for i in filters.keys()]).all(axis=0)
Use rows as a boolean indexer of df:
df[rows]
   A  B  C
2  3  3  3
3  4  4  4
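For comparison, a minimal sketch that keeps the question's strict inequalities by AND-ing one boolean mask per (column, lower, upper) tuple:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 10), 'B': range(1, 10), 'C': range(1, 10)})
tuplelist = [('A', 2, 10), ('B', 0, 4), ('C', 3, 5)]

# build one strict-inequality mask per tuple and AND them all together
mask = np.logical_and.reduce([(df[col] > lo) & (df[col] < hi)
                              for col, lo, hi in tuplelist])
df[mask]
On this sample data the strict version comes back empty, while the pd.cut version above keeps rows 2 and 3; the difference is that pd.cut's bins include their edges (include_lowest plus right-closed intervals), so boundary values like 3 and 4 pass.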

Combine multiple columns into 1 column [python,pandas]

I have a pandas data frame with 2 columns:
{'A':[1, 2, 3],'B':[4, 5, 6]}
I want to create a new column where:
{'C':[1 4,2 5,3 6]}
Setup
df = pd.DataFrame({'A':[1, 2, 3],'B':[4, 5, 6]})
Solution
Keep in mind, per your expected output, [1 4,2 5,3 6] isn't a thing. I'm interpreting you to mean either [(1, 4), (2, 5), (3, 6)] or ["1 4", "2 5", "3 6"]
First assumption
df.apply(lambda x: tuple(x.values), axis=1)
0    (1, 4)
1    (2, 5)
2    (3, 6)
dtype: object
Second assumption
df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
0    1 4
1    2 5
2    3 6
dtype: object
If you don't mind a zip object, then you can use df['C'] = zip(df.A, df.B).
If you would rather have tuples, cast the zip object with list(). Please refer to this post. zip is pretty handy in this kind of scenario.
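A quick end-to-end sketch of the zip approach:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# materialize the zip iterator into a list of tuples before assigning
df['C'] = list(zip(df.A, df.B))
print(df)
   A  B       C
0  1  4  (1, 4)
1  2  5  (2, 5)
2  3  6  (3, 6)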

select rows in pandas DataFrame using comparisons against two columns

I have a pandas dataframe:
df = pd.DataFrame({'one' : [1, 2, 3, 4] ,'two' : [5, 6, 7, 8]})
   one  two
0    1    5
1    2    6
2    3    7
3    4    8
Column "one" and column "two" together comprise (x,y) coordinates
Lets say I have a list of coordinates: c = [(1,5), (2,6), (20,5)]
Is there an elegant way of obtaining the rows in df
with matching coordinates? In this case, given c, the matching rows would be 0 and 1
Related question: Using pandas to select rows using two different columns from dataframe?
And: Selecting rows from pandas DataFrame using two columns
This approach using pd.merge should perform better than the iterative solutions.
import pandas as pd
df = pd.DataFrame({"one" : [1, 2, 3, 4] ,"two" : [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]
df2 = pd.DataFrame(c, columns=["one", "two"])
pd.merge(df, df2, on=["one", "two"], how="inner")
   one  two
0    1    5
1    2    6
You can use
>>> set.union(*(set(df.index[(df.one == i) & (df.two == j)]) for i, j in c))
{0, 1}
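Another option that avoids both the merge and the per-pair scan is to build a MultiIndex from the two columns and test it against the coordinate list with isin. A sketch (pd.MultiIndex.from_frame needs pandas >= 0.24):
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]

# boolean mask: True where the row's (one, two) pair appears in c
mask = pd.MultiIndex.from_frame(df[['one', 'two']]).isin(c)
df[mask]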
