Choosing particular values from a Pandas Series - python

I have the Pandas Series s, part of which can be seen below. I basically want to insert the indices of those values of s which are not 0 into a list l, but don't know how to do this.
2003-05-13 1
2003-11-2 0
2004-05-1 3

In [7] is what you're looking for below:
In [5]: s = pd.Series(np.random.choice([0,1,2], 10))
In [6]: print s
0 0
1 1
2 0
3 1
4 0
5 2
6 1
7 1
8 2
9 2
dtype: int64
In [7]: print list(s.index[s != 0])
[1, 3, 5, 6, 7, 8, 9]

Related

Pad selection range in Pandas Dataframe?

If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get the each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also a row before or after any such row.
edit I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
edit 2 here's another example illustrating what I'm going for
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
you can use shift with bitwise or |
c = df['A'] == 1
df[c|c.shift()|c.shift(-1)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
For arbitrary pad sizes, you may try where, interpolate, and notna to create the mask
n = 2
c = df.where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
df[(df['A'] == 1).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
Here it is in action:
def padsearch(df, column, value, pad):
return df[(df[column] == value).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted the command is far less nice, and its a little clunky to be converting the False to and from null. But it's still using all Pandas builtins, so it is fairly quick still.
I found another solution but not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine set of indicies
indices = set(
[
x for x in filter(
lambda x: x>=0,
[
x+y
for x in df[df['A'] == 1].index
for y in range(-pad, pad+1)
]
)
]
)
# fetch rows
df.iloc[[*indices]]

Vectorizing for-loop

I have a very large dataframe (~10^8 rows) where I need to change some values. The algorithm I use is complex so I tried to break down the issue into a simple example below. I mostly programmed in C++, so I keep thinking in for-loops. I know I should vectorize but I am new to python and very new to pandas and cannot come up with a better solution. Any solutions which increase performance are welcome.
#!/usr/bin/python3
import numpy as np
import pandas as pd
data = {'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
'types': [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1]
}
mydf = pd.DataFrame(data, columns=['eventID', 'types'])
print(mydf)
MyIntegerCodes = np.array([0, 1])
eventIDs = np.unique(mydf.eventID.values) # can be up to 10^8 values
for val in eventIDs:
currentTypes = mydf[mydf.eventID == val].types.values
if (0 in currentTypes) & ~(1 in currentTypes):
mydf.loc[mydf.eventID == val, 'types'] = 0
if ~(0 in currentTypes) & (1 in currentTypes):
mydf.loc[mydf.eventID == val, 'types'] = 1
print(mydf)
Any ideas?
EDIT: I was ask to explain what I do with my for-loops.
For every eventID I want to know if all corresponding types contain a 1 or a 0 or both. If they contain a 1, all values which are equal to -1 should be changed to 1. If the values are 0, all values equal to -1 should be changed to 0. My problem is to do this efficiently for each eventID independently. There can be one or multiple entries per eventID.
Input of example:
eventID types
0 1 0
1 1 -1
2 1 -1
3 2 -1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 -1
9 6 -1
10 6 -1
11 6 1
12 7 -1
13 8 -1
Output of example:
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
First we create boolean masks m1 and m2 using Series.eq then use DataFrame.groupby on this mask and transform using any, then using np.select chose the elements from 1, 0 depending upon the conditions m1 or m2:
m1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
m2 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf['types'] = np.select([m1 , m2], [1, 0], mydf['types'])
Result:
# print(mydf)
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1

Alter columns for each row in dataframe basing on other column

I have a dataframe that has got many columns.
I want to apply a function on each row that alters all the columns based on different column.
def mark(row):
columns = get_columns_to_alter(row['Text'])
for c in columns:
row[c] = True
And I was trying to use apply function
df.apply(mark, axis=1)
But it does not alter these columns. What am I doing wrong? The function I gave is a psuedocode but it gets names of columns to change basing on "Text" column.
OK,
This is a bit confusing, to be honest.
Several issues I see:
First, DataFrame.apply a function to each column should look more like:
df.apply(lambda x: mark(x), axis=1)
so that you actually loop through each row.
Second, DataFrame.apply creates a copy Series for each row (in your case); thus, the changes are not applied to df but to the new row value. If you want to change df, you need to (a) make sure that mark returns something and (b) to assign it to something else:
def mark(row):
columns = get_columns_to_alter(row['Text'])
if len(columns) > 0:
row[columns] = True
return row
new_df = df.apply(lambda x: mark(x), axis=1)
Something like this should do what you expect.
Here is one solution via numpy and itertools.chain. As far as possible, it's a good idea to remove loops.
from itertools import chain
import numpy as np
df = pd.DataFrame(np.random.randint(0, 9, (10, 10)))
df['cols'] = [np.random.randint(0, 9, 3) for _ in df]
def calc_cols(s):
arr = s.values.tolist()
# apply function on arr here, e.g.
# arr = list(map(f, arr))
idx = np.repeat(list(range(len(arr))), list(map(len, arr)))
return idx, list(chain(*arr))
A, cols = df.values, calc_cols(df['cols'])
A[cols[0], cols[1]] = -1
df_res = pd.DataFrame(A, columns=df.columns)
# 0 1 2 3 4 5 6 7 8 9 cols
# 0 2 4 -1 4 -1 2 6 6 8 1 [4, 4, 2]
# 1 4 -1 -1 3 4 4 -1 5 6 7 [2, 1, 6]
# 2 -1 1 7 1 2 -1 2 2 -1 0 [8, 0, 5]
# 3 2 4 -1 6 -1 8 6 -1 0 3 [7, 2, 4]
# 4 -1 5 5 2 8 2 -1 8 -1 6 [8, 6, 0]
# 5 5 6 0 3 5 -1 -1 5 3 7 [6, 5, 6]
# 6 -1 0 7 1 4 -1 -1 6 1 8 [5, 6, 0]
# 7 2 6 4 6 -1 6 -1 5 7 6 [6, 4, 6]
# 8 -1 8 1 -1 0 7 8 -1 2 3 [3, 0, 7]
# 9 2 4 6 6 -1 -1 0 2 -1 0 [4, 8, 5]

Count Total number of sequences that meet condition, without for-loop

I have the following Dataframe as input:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
df = pd.DataFrame(l)
print(df)
0
0 2
1 2
2 2
3 5
4 5
5 5
6 3
7 3
8 2
9 2
10 4
11 4
12 6
13 5
14 5
15 3
16 5
As an output I would like to have a final count of the total sequences that meet a certain condition. For example, in this case, I want the number of sequences that the values are greater than 3.
So, the output is 3.
1st Sequence = [555]
2nd Sequence = [44655]
3rd Sequence = [5]
Is there a way to calculate this without a for-loop in pandas ?
I have already implemented a solution using for-loop, and I wonder if there is better approach using pandas in O(N) time.
Thanks very much!
Related to this question: How to count the number of time intervals that meet a boolean condition within a pandas dataframe?
You can use:
m = df[0] > 3
df[1] = (~m).cumsum()
df = df[m]
print (df)
0 1
3 5 3
4 5 3
5 5 3
10 4 7
11 4 7
12 6 7
13 5 7
14 5 7
16 5 8
#create tuples
df = df.groupby(1)[0].apply(tuple).value_counts()
print (df)
(5, 5, 5) 1
(4, 4, 6, 5, 5) 1
(5,) 1
Name: 0, dtype: int64
#alternativly create strings
df = df.astype(str).groupby(1)[0].apply(''.join).value_counts()
print (df)
5 1
44655 1
555 1
Name: 0, dtype: int64
If need output as list:
print (df.astype(str).groupby(1)[0].apply(''.join).tolist())
['555', '44655', '5']
Detail:
print (df.astype(str).groupby(1)[0].apply(''.join))
3 555
7 44655
8 5
Name: 0, dtype: object
If you don't need pandas this will suit your needs:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
def consecutive(array, value):
result = []
sub = []
for item in array:
if item > value:
sub.append(item)
else:
if sub:
result.append(sub)
sub = []
if sub:
result.append(sub)
return result
print(consecutive(l,3))
#[[5, 5, 5], [4, 4, 6, 5, 5], [5]]

Pandas indexing by both boolean `loc` and subsequent `iloc`

I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assign value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work but is a little ugly, basically we use the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc using this index, we can now use this to select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0
How about.
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum but more elegant as suggested in the comment.
EDIT: Have to be a little bit careful with this one as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by either of the label in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it would be safer to use .iloc with integer indexing with something like
ix = np.where(mask)[0][:2]
df.iloc[ix, 'c'] = 1
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0

Categories