Count total number of sequences that meet a condition, without a for-loop - python

I have the following DataFrame as input:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
df = pd.DataFrame(l)
print(df)
0
0 2
1 2
2 2
3 5
4 5
5 5
6 3
7 3
8 2
9 2
10 4
11 4
12 6
13 5
14 5
15 3
16 5
As output I would like a final count of the total number of sequences that meet a certain condition. For example, in this case, I want the number of sequences whose values are all greater than 3.
So the output is 3.
1st Sequence = [555]
2nd Sequence = [44655]
3rd Sequence = [5]
Is there a way to calculate this in pandas without a for-loop?
I have already implemented a solution using a for-loop, and I wonder if there is a better approach using pandas in O(N) time.
Thanks very much!
Related to this question: How to count the number of time intervals that meet a boolean condition within a pandas dataframe?

You can use:
m = df[0] > 3
df[1] = (~m).cumsum()
df = df[m]
print (df)
0 1
3 5 3
4 5 3
5 5 3
10 4 7
11 4 7
12 6 7
13 5 7
14 5 7
16 5 8
#create tuples
df = df.groupby(1)[0].apply(tuple).value_counts()
print (df)
(5, 5, 5) 1
(4, 4, 6, 5, 5) 1
(5,) 1
Name: 0, dtype: int64
#alternatively create strings
df = df.astype(str).groupby(1)[0].apply(''.join).value_counts()
print (df)
5 1
44655 1
555 1
Name: 0, dtype: int64
If you need the output as a list:
print (df.astype(str).groupby(1)[0].apply(''.join).tolist())
['555', '44655', '5']
Detail:
print (df.astype(str).groupby(1)[0].apply(''.join))
3 555
7 44655
8 5
Name: 0, dtype: object
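If all you need is the final count (3 here), you can count run starts directly instead of building the groups; a minimal sketch, recomputing the mask on the original frame:
m = df[0] > 3
# a run starts wherever the mask is True but the previous row's mask was not
print((m & ~m.shift(fill_value=False)).sum())
# 3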

If you don't need pandas, this will suit your needs:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]

def consecutive(array, value):
    result = []
    sub = []
    for item in array:
        if item > value:
            sub.append(item)
        else:
            if sub:
                result.append(sub)
                sub = []
    if sub:
        result.append(sub)
    return result

print(consecutive(l, 3))
# [[5, 5, 5], [4, 4, 6, 5, 5], [5]]
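The count the question asks for is then just the length of the returned list:
print(len(consecutive(l, 3)))
# 3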

Pad selection range in Pandas Dataframe?

If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also a row before or after any such row.
Edit: I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
Edit 2: here's another example illustrating what I'm going for:
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
You can use shift with bitwise or |:
c = df['A'] == 1
df[c|c.shift()|c.shift(-1)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
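The shift idea also generalizes to arbitrary pad sizes by OR-ing the mask shifted by every offset in the window; a sketch, assuming a pad variable:
from functools import reduce

pad = 2
c = df['A'] == 1
# keep a row if the mask is True anywhere within `pad` positions of it
m = reduce(lambda a, b: a | b,
           (c.shift(k, fill_value=False) for k in range(-pad, pad + 1)))
df[m]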
For arbitrary pad sizes, you may try where, interpolate, and notna to create the mask:
n = 2
# mask the column itself, so df[m] filters rows rather than masking values
c = df['A'].where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
df[(df['A'] == 1).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
Here it is in action:
def padsearch(df, column, value, pad):
    return df[(df[column] == value)
              .replace(False, np.nan)
              .ffill(limit=pad)
              .bfill(limit=pad)
              .replace(np.nan, False)
              .astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted, the command is far less nice, and it's a little clunky to be converting False to and from null. But it still uses all pandas builtins, so it remains fairly quick.
I found another solution, but it's not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2

# determine the set of indices: every matching index padded by every
# offset in [-pad, pad], kept within the frame's bounds
indices = set(
    filter(
        lambda x: 0 <= x < len(df),
        [x + y
         for x in df[df['A'] == 1].index
         for y in range(-pad, pad + 1)]
    )
)

# fetch rows (sorted, since sets are unordered)
df.iloc[sorted(indices)]
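For what it's worth, a centered rolling window is another compact way to express an arbitrary pad, assuming a default integer index; a sketch:
pad = 2
c = (df['A'] == 1).astype(int)
# a row survives if any row within `pad` positions of it matches
m = c.rolling(2 * pad + 1, center=True, min_periods=1).max().astype(bool)
df[m]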

How to split a dataframe into some dataframes according to a list of row numbers?

Given a dataframe, I want to obtain a list of distinct dataframes which together concatenate into the original.
The separation is by row indices, like so:
import pandas as pd
import numpy as np
data = {"a": np.arange(10)}
df = pd.DataFrame(data)
print(df)
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
separate_by = [1, 5, 6, ]
should give a list of
df1 =
a
0 0
df2 =
a
1 1
2 2
3 3
4 4
df3 =
a
5 5
df4 =
a
6 6
7 7
8 8
9 9
How can this be done in pandas?
Try:
groups = (pd.Series(1, index=separate_by)
            .reindex(df.index, fill_value=0)
            .cumsum())
out = {k: v for k, v in df.groupby(groups)}
Then, for example, out[2]:
a
5 5
Similar logic:
groups = np.zeros(len(df))
groups[separate_by] = 1
groups = np.cumsum(groups)
out = {k:v for k,v in df.groupby(groups)}
separate_by = [1, 5, 6, ]
separate_by.append(len(df))
separate_by.insert(0, 0)
# use iloc: its end point is exclusive, whereas loc slicing is inclusive
dfs = [df.iloc[separate_by[i]:separate_by[i + 1]] for i in range(len(separate_by) - 1)]
Let us try:
d = dict(tuple(df.groupby(df.index.isin(separate_by).cumsum())))
d[0]
Out[364]:
a
0 0
d[2]
Out[365]:
a
5 5
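Since separate_by holds positional split points, numpy's split can also be applied to the frame directly; a brief sketch (a long-standing idiom, though very recent pandas versions may emit a deprecation warning for it):
dfs = np.split(df, separate_by)
# dfs[2] is the single-row frame holding index 5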

Slicing multiple ranges of columns in Pandas, by list of names

I am trying to select multiple columns in a pandas DataFrame using two different approaches:
1) via the column numbers, for example, columns 1-3 and column 6 onwards,
and
2) via a list of column names, for instance:
years = list(range(2000, 2017))
months = list(range(1, 13))
years_month = ["A", "B", "C"]
for y in years:
    for m in months:
        y_m = str(y) + "-" + str(m)
        years_month.append(y_m)
Then, years_month would produce the following:
['A',
'B',
'C',
'2000-1',
'2000-2',
'2000-3',
'2000-4',
'2000-5',
'2000-6',
'2000-7',
'2000-8',
'2000-9',
'2000-10',
'2000-11',
'2000-12',
'2001-1',
'2001-2',
'2001-3',
'2001-4',
'2001-5',
'2001-6',
'2001-7',
'2001-8',
'2001-9',
'2001-10',
'2001-11',
'2001-12']
That said, what is the best (or correct) way to select only the columns whose names are in the list years_month, in each of the two approaches?
I think you need numpy.r_ for concatenating positions of columns, then use iloc for selecting:
print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
and for second approach subset by list:
print (df[years_month])
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'2000-1': [1, 3, 5],
                   '2000-2': [5, 3, 6],
                   '2000-3': [7, 8, 9],
                   '2000-4': [1, 3, 5],
                   '2000-5': [5, 3, 6],
                   '2000-6': [7, 8, 9],
                   '2000-7': [1, 3, 5],
                   '2000-8': [5, 3, 6],
                   '2000-9': [7, 4, 3],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
print (df)
2000-1 2000-2 2000-3 2000-4 2000-5 2000-6 2000-7 2000-8 2000-9 A \
0 1 5 7 1 5 7 1 5 7 1
1 3 3 8 3 3 8 3 3 4 2
2 5 6 9 5 6 9 5 6 3 3
B C
0 4 7
1 5 8
2 6 9
print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
2000-2 2000-3 2000-7 2000-8 2000-9 A B C
0 5 7 1 5 7 1 4 7
1 3 8 3 3 4 2 5 8
2 6 9 5 6 3 3 6 9
You can also sum ranges (casting to list is necessary in Python 3):
rng = list(range(1,3)) + list(range(6, len(df.columns)))
print (rng)
[1, 2, 6, 7, 8, 9, 10, 11]
print (df.iloc[:, rng])
2000-2 2000-3 2000-7 2000-8 2000-9 A B C
0 5 7 1 5 7 1 4 7
1 3 8 3 3 4 2 5 8
2 6 9 5 6 3 3 6 9
I'm not sure what exactly you are asking, but in general DataFrame.loc allows you to select by label, DataFrame.iloc by integer position.
For example selecting columns # 0, 1 and 4:
dataframe.iloc[:, [0, 1, 4]]
and selecting columns labelled 'A', 'B' and 'C':
dataframe.loc[:, ['A', 'B', 'C']]
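For the name-list approach, DataFrame.filter is also worth knowing: it keeps only the listed columns that actually exist, so missing names don't raise a KeyError:
df.filter(items=years_month)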

How to sort a pandas dataframe by a list of categories?

So I have this data set below that I want to sort based on mylist for column 'name', as well as ascending by 'A' and descending by 'B'.
import pandas as pd
import numpy as np
# pd.DataFrame.from_items was removed in modern pandas; plain dicts
# preserve column order in Python 3.7+
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'name': ['x', 'x', 'x']})
df2 = pd.DataFrame({'B': [5, 6, 7], 'A': [8, 9, 10], 'name': ['y', 'y', 'y']})
df3 = pd.DataFrame({'C': [5, 6, 7], 'D': [8, 9, 10], 'A': [1, 2, 3], 'B': [4, 5, 7], 'name': ['z', 'z', 'z']})
df_list = [df1,df2,df3[['A','B','name']]]
df = pd.concat(df_list, ignore_index=True)
So my list is:
mylist = ['z','x','y']
I want the dataset to be sorted by my list first, then ascending by column A and descending by column B.
Is there a way to do this in Python?
======== Edit ==========
I want my final result to be something like the z-first ordering shown in the answer below.
OK, a way to sort by a custom order is to create a dict that defines how the 'name' column should be ordered, call map to add a new column holding this order, then call sort_values with that column plus the others, passing the ascending param to selectively decide whether each column is sorted ascending or not, and finally drop the helper column:
In [20]:
name_sort = {'z':0,'x':1,'y':2}
df['name_sort'] = df.name.map(name_sort)
df
Out[20]:
A B name name_sort
0 1 4 x 1
1 2 5 x 1
2 3 6 x 1
3 8 5 y 2
4 9 6 y 2
5 10 7 y 2
6 1 4 z 0
7 2 5 z 0
8 3 7 z 0
In [23]:
df = df.sort_values(['name_sort','A','B'], ascending=[True, True, False])
df
Out[23]:
A B name name_sort
6 1 4 z 0
7 2 5 z 0
8 3 7 z 0
0 1 4 x 1
1 2 5 x 1
2 3 6 x 1
3 8 5 y 2
4 9 6 y 2
5 10 7 y 2
In [25]:
df = df.drop('name_sort', axis=1)
df
Out[25]:
A B name
6 1 4 z
7 2 5 z
8 3 7 z
0 1 4 x
1 2 5 x
2 3 6 x
3 8 5 y
4 9 6 y
5 10 7 y
We can also do this using an ordered categorical dtype:
t = pd.CategoricalDtype(categories=['z','x','y'], ordered=True)
df['sort'] = pd.Series(df.name, dtype=t)
df.sort_values(by=['sort','A','B'], ascending=[True, True, False], inplace=True)
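In pandas 1.1+ the helper column can be skipped entirely via the key argument of sort_values; a sketch, assuming the concatenated df from the question:
order = {'z': 0, 'x': 1, 'y': 2}
df.sort_values(['name', 'A', 'B'],
               ascending=[True, True, False],
               # key is applied to each sort column; remap only 'name'
               key=lambda s: s.map(order) if s.name == 'name' else s)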

Python Pandas add column with relative order numbers

How do I add an order number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])
def add_stats(row):
    row['sum'] = sum([row['a'], row['b'], row['c']])
    row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
    row['max'] = max(row['a'], row['b'], row['c'])
    return row
frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
a b c sum sum_sq max
0 1 4 2 7 21 4
1 8 9 2 19 149 9
2 10 2 1 13 105 10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).
I think you're looking for rank, since you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
a b c sum sum_sq max sum_order sum_sq_order max_order mean_order
0 1 4 2 7 21 4 1 1 1 1.000000
1 8 9 2 19 149 9 3 3 2 2.666667
2 10 2 1 13 105 10 2 2 3 2.333333
The rank method has some options as well, to specify the behavior in case of identical or NA-values for example.
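If the intermediate rank columns aren't needed, the mean of the three rankings can also be computed in a single step; a small sketch:
frame['mean_order'] = frame[['sum', 'sum_sq', 'max']].rank().mean(axis=1)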
