EDIT: Turns out my initial question was simply a versionitis issue. However, in the course of answering it a few other questions were addressed, so I've reworded them and listed them below:
I'm familiarizing myself with some pandas capabilities, namely selection by callables. The docs advise use of lambda functions, e.g. to extract all samples in dataframe df1 with value > 0 for feature 'A':
df1.loc[lambda df: df.A > 0, :]
Is there a more compact, pythonic way to do this?
Let's say df1 is now a dataframe with feature A, but the values are mixed doubles and triples (2- and 3-tuples). How can I extract the samples which contain only doubles? I tried df1.loc[len(df1.A) > 2, :], but it's clear that pandas doesn't broadcast the values the way I expect.
You have to restart your IDE (or at least its Python kernel) so the upgraded pandas version is actually picked up.
As for your other question:
Use apply with len:
import pandas as pd
data = {'A': [(1,2), (1,2), (1,2), (1,2), (1,2,4), (1,2,3)],
        'B': [13, 98, 23, 45, 64, 10]}
df = pd.DataFrame(data)
print (df)
A B
0 (1, 2) 13
1 (1, 2) 98
2 (1, 2) 23
3 (1, 2) 45
4 (1, 2, 4) 64
5 (1, 2, 3) 10
print (df[df.A.apply(len) > 2])
A B
4 (1, 2, 4) 64
5 (1, 2, 3) 10
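As a side note, the same length test can be written with the .str accessor, which also applies element-wise to tuples held in an object column (a sketch, using the same df):
print (df[df.A.str.len() > 2])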
You can do what you want to do without the lambda function, as follows:
df1.loc[df1.A > 0, :]
Perhaps the docs are outdated.
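For row selection alone, plain boolean indexing is even shorter (equivalent here, since the column slice selects everything):
df1[df1.A > 0]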
Given I have the following list:
group  code
A      1
A      2
A      3
B      4
B      5
B      6
B      7
How do I create the following list in a pythonic way?:
group  code  code
A      1     2
A      1     3
A      2     3
B      4     5
B      4     6
B      4     7
B      5     6
B      5     7
B      6     7
I saw another question that suggests using itertools.combinations. But how do I get past the grouping restriction? I don't want all pairs, just the ones within each group.
You need to use itertools.combinations to get all possible code combinations for each group:
from itertools import combinations
A = [1,2,3]
B = [4,5,6,7]
comb_A = combinations(A, 2)
comb_B = combinations(B, 2)
# to see the results, iterate through all the combinations
# for A (the same applies to B)
for comb in comb_A:
    print(comb)
(1, 2)
(1, 3)
(2, 3)
Note: it would be more helpful if you could provide the actual "list", so we can give a more specific answer.
Since you didn't post an MWE of what you tried, I'll show you steps that you can implement yourself (sketched in code below the expected result).
Build a dictionary of data groups, say group_dict.
Create an empty list for result.
Iterate through the items in group_dict, where each item is a group name and a list of codes for that group.
For each group, use the combinations function to generate all possible combinations of 2 codes.
For each combination, append a tuple of the group name, the first code, and the second code to result.
This should give you a result like this:
[('A', 1, 2), ('A', 1, 3), ('A', 2, 3), ('B', 4, 5), ('B', 4, 6), ('B', 4, 7), ('B', 5, 6), ('B', 5, 7), ('B', 6, 7)]
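Putting those steps together, a minimal sketch (group_dict here is a hypothetical dict built from the grouped data above):
from itertools import combinations

group_dict = {'A': [1, 2, 3], 'B': [4, 5, 6, 7]}
result = []
for group, codes in group_dict.items():
    # all unordered pairs of codes within this group
    for first, second in combinations(codes, 2):
        result.append((group, first, second))
print(result)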
I want to select a subset of some pandas DataFrame columns based on several slices.
In [1]: df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})
df.head()
Out[1]: A B C
0 0.745487 0.146733 0.594006
1 0.212324 0.692727 0.244113
2 0.954276 0.318949 0.199224
3 0.606276 0.155027 0.247255
4 0.155672 0.464012 0.229516
Something like:
In [2]: df.loc[[slice(1, 4), slice(42, 44)], ['B', 'C']]
Expected output:
Out[2]: B C
1 0.692727 0.244113
2 0.318949 0.199224
3 0.155027 0.247255
42 0.335285 0.000997
43 0.019172 0.237810
I've seen that NumPy's r_ object can help when wanting to use multiple slices, e.g:
In [3]: arr = np.array([1, 2, 3, 4, 5, 5, 5, 5])
arr[np.r_[1:3, 4:6]]
Out[3]: array([2, 3, 5, 5])
But I can't get this to work with a predefined collection (list) of slices. Ideally I would like to specify a collection of ranges/slices and subset based on it. It doesn't seem like r_ accepts iterables? I've seen that one could, for example, create an array with hstack and then use it as an index, like:
In [4]: idx = np.hstack((np.arange(1, 4), np.arange(42, 44)))
df.loc[idx, ['B', 'C']]
Out[4]: B C
1 0.692727 0.244113
2 0.318949 0.199224
3 0.155027 0.247255
42 0.335285 0.000997
43 0.019172 0.237810
Which gets me what I need, but is there any other faster/cleaner/preferred/whatever way of doing this?
A bit late, but it might also help others:
pd.concat([df.loc[sl, ['B', 'C']] for sl in [slice(1, 4), slice(42, 44)]])
This also works when you are dealing with other kinds of slices, e.g. time windows.
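For instance, with a hypothetical DatetimeIndex the same pattern collects arbitrary time windows:
import numpy as np
import pandas as pd

ts = pd.DataFrame({'B': np.random.rand(10)},
                  index=pd.date_range('2020-01-01', periods=10, freq='D'))
windows = [slice('2020-01-02', '2020-01-04'), slice('2020-01-08', '2020-01-09')]
pd.concat([ts.loc[w] for w in windows])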
You can do:
df.loc[list(range(1, 4)) + list(range(42, 44)), ['B', 'C']]
which took about a quarter of the time of your np.hstack option.
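Regarding np.r_ and iterables from the question: it won't take a list directly, but its indexing does accept a tuple of slice objects, so a predefined list of slices can be converted first (a sketch; note this relies on the RangeIndex from the setup, where labels coincide with positions):
slices = [slice(1, 4), slice(42, 44)]
df.loc[np.r_[tuple(slices)], ['B', 'C']]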
I need to create a function that filters a dataframe using a list of tuples - taking as arguments a dataframe and a tuple list, as follows:
tuplelist=[('A', 5, 10), ('B', 0, 4),('C', 10, 11)]
What is the proper way to do this?
I have tried the following:
def multcolfilter(data_frame, tuplelist):
    def apply_single_cond(df_0, cond):
        df_1 = df_0[(df_0[cond[0]] > cond[1]) & (df_0[cond[0]] < cond[2])]
        return df_1
    for x in range(len(tuplelist) - 1):
        df = apply_single_cond(apply_single_cond(data_frame, tuplelist[x - 1]), tuplelist[x])
    return df
Example dataframe and tuplelist:
df = pd.DataFrame({'A':range(1,10), 'B':range(1,10), 'C':range(1,10)})
tuplelist=[('A', 2, 10), ('B', 0, 4),('C', 3, 5)]
Instead of working with tuples, create a dictionary from them:
filters = {x[0]:x[1:] for x in tuplelist}
print(filters)
{'A': (5, 10), 'B': (0, 4), 'C': (10, 11)}
You can use pd.cut to bin the values of the dataframe's columns:
rows = np.asarray([~pd.cut(df[i], filters[i], retbins=False, include_lowest=True).isnull()
                   for i in filters.keys()]).all(axis=0)
Use rows as a boolean indexer of df:
df[rows]
A B C
2 3 3 3
3 4 4 4
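For comparison, the chaining that the question's multcolfilter attempts can also be done with a single accumulated boolean mask, keeping the strict > / < bounds from the original function (a sketch):
def multcolfilter(data_frame, tuplelist):
    # start with an all-True mask and AND in one condition per tuple
    mask = pd.Series(True, index=data_frame.index)
    for col, low, high in tuplelist:
        mask &= (data_frame[col] > low) & (data_frame[col] < high)
    return data_frame[mask]

# apply to the example df and tuplelist above
multcolfilter(df, tuplelist)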
I have a pandas data frame with 2 columns:
{'A':[1, 2, 3],'B':[4, 5, 6]}
I want to create a new column where:
{'C':[1 4,2 5,3 6]}
Setup
df = pd.DataFrame({'A':[1, 2, 3],'B':[4, 5, 6]})
Solution
Keep in mind, per your expected output, [1 4,2 5,3 6] isn't a thing. I'm interpreting you to mean either [(1, 4), (2, 5), (3, 6)] or ["1 4", "2 5", "3 6"].
First assumption
df.apply(lambda x: tuple(x.values), axis=1)
0 (1, 4)
1 (2, 5)
2 (3, 6)
dtype: object
Second assumption
df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
0 1 4
1 2 5
2 3 6
dtype: object
If you don't mind a zip object, you can use df['C'] = zip(df.A, df.B).
If you'd like tuples, you can cast the zip object with list(). Please refer to this post. It's pretty handy to use zip in this kind of scenario.
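In Python 3, zip returns a lazy iterator rather than a list, so the cast mentioned above is the form to use (a sketch, using the setup df):
df['C'] = list(zip(df.A, df.B))
# df.C is now: (1, 4), (2, 5), (3, 6)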
I have a pandas dataframe:
df = pd.DataFrame({'one' : [1, 2, 3, 4] ,'two' : [5, 6, 7, 8]})
one two
0 1 5
1 2 6
2 3 7
3 4 8
Column "one" and column "two" together comprise (x,y) coordinates
Lets say I have a list of coordinates: c = [(1,5), (2,6), (20,5)]
Is there an elegant way of obtaining the rows in df
with matching coordinates? In this case, given c, the matching rows would be 0 and 1
Related question: Using pandas to select rows using two different columns from dataframe?
And: Selecting rows from pandas DataFrame using two columns
This approach using pd.merge should perform better than the iterative solutions.
import pandas as pd
df = pd.DataFrame({"one" : [1, 2, 3, 4] ,"two" : [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]
df2 = pd.DataFrame(c, columns=["one", "two"])
pd.merge(df, df2, on=["one", "two"], how="inner")
one two
0 1 5
1 2 6
You can use
>>> set.union(*(set(df.index[(df.one == i) & (df.two == j)]) for i, j in c))
{0, 1}
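Another vectorized option, for the record: build a MultiIndex from the two columns and test membership against c (a sketch):
mask = df.set_index(['one', 'two']).index.isin(c)
df[mask]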