Count rows that have same value in all columns - python

If I have a dataframe like this:
A  B  C  D  E  F
----------------
1  2  3  4  5  6
1  1  1  1  1  1
0  0  0  0  0  0
1  1  1  1  1  1
How can I get the number of rows that have value 1 in every column?
In this case, 2 rows have 1 in every field.
I know one way, for example if we only take columns A and B:
count = df2.query("A == 1 & B == 1").shape[0]
But I would have to write out the name of every column; is there a fancier approach?
Thanks in advance

Try:
(df == 1).all(axis=1).sum()
Output:
2

For a large data frame with many rows, you may want to try any instead, since it can stop as soon as it detects the first non-matching item:
sum(~df.ne(1).any(axis=1))
2
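For reference, a minimal sketch that rebuilds the sample frame and checks that both approaches give the same count:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1]],
                  columns=list("ABCDEF"))

print((df == 1).all(axis=1).sum())   # 2 -- rows where every column equals 1
print(sum(~df.ne(1).any(axis=1)))    # 2 -- same count, written with any()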

Related

Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID  rev_Q1  rev_Q5  rev_Q4  rev_Q3  rev_Q2  tx_Q3  tx_Q5  tx_Q2  tx_Q1  tx_Q4
 1       1       1       1       1       1      1      1      1      1      1
 2       1       1       1       1       1      1      1      1      1      1
I would like to do the below
a) sort the column names based on quarters (e.g. Q1, Q2, Q3, Q4, Q5 ... Q100 ... Q1000) within each column pattern
b) by column pattern, I mean the keyword before the underscore, which here is rev and tx.
So I tried the below, but it doesn't work, and it also shifts the ID column to the back:
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be as shown below. In reality, there are more than 100 columns with more than 30 patterns like rev, tx, etc. I want my ID column to stay in the first position, as shown below.
ID  rev_Q1  rev_Q2  rev_Q3  rev_Q4  rev_Q5  tx_Q1  tx_Q2  tx_Q3  tx_Q4  tx_Q5
 1       1       1       1       1       1      1      1      1      1      1
 2       1       1       1       1       1      1      1      1      1      1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or use manual sorting with np.lexsort:
import numpy as np

idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
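A quick sanity check of the resulting order on the sample columns (a sketch; the listing below is inferred from the sample data, not quoted from the answer -- 'ID' lands first because it sorts before 'rev' and 'tx' and has no quarter part):
print(list(out.columns))
# ['ID', 'rev_Q1', 'rev_Q2', 'rev_Q3', 'rev_Q4', 'rev_Q5',
#  'tx_Q1', 'tx_Q2', 'tx_Q3', 'tx_Q4', 'tx_Q5']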
Something like:
new_order = list(df.columns)
new_order.remove("ID")               # remove() works in place, so sort afterwards
new_order = ["ID"] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort what is remaining.
The idea is to create a dataframe from the column names, with two columns: one for the variable and another for the quarter number. Finally, sort this dataframe by values, then extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna(0).astype({'Q': int})
         .sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
   ID  rev_Q1  rev_Q2  rev_Q3  rev_Q4  rev_Q5  tx_Q1  tx_Q2  tx_Q3  tx_Q4  tx_Q5
0   1       1       1       1       1       1      1      1      1      1      1
1   2       1       1       1       1       1      1      1      1      1      1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
       .fillna(0).astype({'Q': int})
       .sort_values(by=['V', 'Q']))
      V  Q
0     0  0
1   rev  1
5   rev  2
4   rev  3
3   rev  4
2   rev  5
9    tx  1
8    tx  2
6    tx  3
10   tx  4
7    tx  5

Counting number of consecutive more than 2 occurences

I am beginner, and I really need help on the following:
I need to do something similar to the following, but on a two-dimensional dataframe: Identifying consecutive occurrences of a value
I need to use that answer, but for a two-dimensional dataframe. I need to count at least 2 consecutive ones along the columns dimension. Here is a sample dataframe:
my_df =
   0  1  2
0  1  0  1
1  0  1  0
2  1  1  1
3  0  0  1
4  0  1  0
5  1  1  0
6  1  1  1
7  1  0  1
The output I am looking for is:
   0  1  2
0  3  5  4
Instead of the column 'consecutive', I need a new output called "out_1_df" for the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2
out_2_df = (out_1_df > threshold).astype(int)
I tried the following:
out_1_df = my_df.groupby((my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd

df = pd.DataFrame({0: [1, 0, 1, 0, 0, 1, 1, 1],
                   1: [0, 1, 1, 0, 1, 1, 1, 0],
                   2: [1, 0, 1, 1, 0, 0, 1, 1]})
out_2_df = ((df.diff(axis=0).eq(0) | df.diff(periods=-1, axis=0).eq(0)) & df.eq(1)).sum(axis=0)
>>> out_2_df
0    3
1    5
2    4
dtype: int64
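For reference, here is a sketch of the run-length approach from the linked one-dimensional answer applied column by column (the names out_1_df/out_2_df follow the question's wording; this is an illustration, not the answer above):
import pandas as pd

my_df = pd.DataFrame({0: [1, 0, 1, 0, 0, 1, 1, 1],
                      1: [0, 1, 1, 0, 1, 1, 1, 0],
                      2: [1, 0, 1, 1, 0, 0, 1, 1]})

def run_lengths(col):
    # label each run of equal values, then give every cell the length of its run,
    # keeping only the cells that are 1
    groups = (col != col.shift()).cumsum()
    return col.groupby(groups).transform('size') * col

out_1_df = my_df.apply(run_lengths)
out_2_df = (out_1_df >= 2).astype(int)   # 1 where the cell is part of a run of >= 2 ones
print(out_2_df.sum(axis=0))              # 0    3, 1    5, 2    4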

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 0, 1, 0, 0, 0, 0, 'i'],
                            ['b', 1, 0, 0, 0, 0, 0, 'j'],
                            ['c', 0, 0, 1, 0, 0, 0, 'k'],
                            ['None', 0, 0, 0, 1, 0, 0, 'l'],
                            ['e', 0, 0, 0, 0, 1, 0, 'm'],
                            ['f', 0, 1, 0, 0, 0, 0, 'n'],
                            ['None', 0, 0, 0, 1, 0, 0, 'o'],
                            ['h', 0, 0, 0, 0, 1, 0, 'p']]),
                  columns=[0, 1, 2, 3, 4, 5, 6, 7],
                  index=[0, 1, 2, 3, 4, 5, 6, 7])
I need to add all rows that occur before the 'None' entries and move the aggregated row to a new dataframe that should look like:
Your dataframe's dtypes are messed up: because you build it from a single NumPy array, and an array can only hold one type, all the ints were pushed to strings. We need to convert them back first.
df = df.apply(pd.to_numeric, errors='ignore')   # convert numeric columns back from string
df['newkey'] = df[0].eq('None').cumsum()        # use cumsum to create the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))   # then aggregate
Out[742]:
        0  1  2  3  4  5  6  7
newkey
0       a  1  1  1  0  0  0  i
1       e  0  1  0  0  1  0  m
2       h  0  0  0  0  1  0  p
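To see what the cumsum key is doing here, the intermediate group key for the sample frame would look roughly like this (a sketch inferred from the question's data, not quoted from the answer):
>>> df[0].eq('None').cumsum()
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
Name: 0, dtype: int64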
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
       0  1  2  3  4  5  6  7
0
0.0    a  1  1  1  1  0  0  i
1.0    e  0  1  0  1  1  0  m
2.0    h  0  0  0  0  1  0  p

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
   injury light
1       1     b
2       5     b
3       5     c
4       5     a
5       2     a
6       2     a
9       4     a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
        index
injury      1  2  4  5
light
a           0  2  1  1
b           1  0  0  1
c           0  0  0  1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
        index
injury      1  2  4  5
light
a           0  2  1  1
b           1  0  0  1
c           0  0  0  1
d           0  0  0  0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a', 'b', 'c', 'd']):
    idx2 = (df['light'].isin([v]))
    df2 = df[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury, margins=True)
df
injury  1  2  4  5  All
light
a       0  2  1  1    4
b       1  0  0  1    2
c       0  0  0  1    1
All     1  2  1  3    7
df["All"]
light
a      4
b      2
c      1
All    7
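The crosstab above does not show the missing 'd' row; one way to include values that never occur in the data is to reindex the crosstab against the external list (a sketch, assuming the list ['a', 'b', 'c', 'd'] and the testdf frame from the question):
wanted = ['a', 'b', 'c', 'd']
counts = pd.crosstab(testdf.light, testdf.injury).reindex(wanted, fill_value=0)
print(counts)
# injury  1  2  4  5
# light
# a       0  2  1  1
# b       1  0  0  1
# c       0  0  0  1
# d       0  0  0  0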

Pandas: conditional rolling count

I have a Series that looks like the following:
  col
0   B
1   B
2   A
3   A
4   A
5   B
It's a time series, therefore the index is ordered by time.
For each row, I'd like to count how many times the value has appeared consecutively, i.e.:
Output:
  col  count
0   B      1
1   B      2
2   A      1    # Value does not match previous row => reset counter to 1
3   A      2
4   A      3
5   B      1    # Value does not match previous row => reset counter to 1
I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.
Related:
Counting consecutive events on pandas dataframe by their index
Finding consecutive segments in a pandas data frame
I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, @CodeShaman's solution counts total and not consecutive values).
df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1
  col  count
0   B      1
1   B      2
2   A      1
3   A      2
4   A      3
5   B      1
One-liner:
df['count'] = df.groupby('col').cumcount()
or
df['count'] = df.groupby('col').cumcount() + 1
if you want the counts to begin at 1.
Based on the second answer you linked, assuming s is your series.
df = pd.DataFrame(s)
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby('block').transform(lambda x: range(1, len(x) + 1))
In [88]: df
Out[88]:
  col  block  count
0   B      1      1
1   B      1      2
2   A      2      1
3   A      2      2
4   A      2      3
5   B      3      1
I like the answer by @chrisb but wanted to share my own solution, since some people might find it more readable and easier to use with similar problems...
1) Create a function that uses static variables
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable
2) apply it to your Series after converting to dataframe
df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count) #new column in dataframe
Output of df:
  col  count
0   B      1
1   B      2
2   A      1
3   A      2
4   A      3
5   B      1
If you wish to do the same thing but filter on two columns, you can use this.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
   col_a col_b  count
0      1     B      1
1      1     B      2
2      1     A      1
3      2     A      1
4      2     A      2
5      2     B      1
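A hypothetical call that would reproduce the table above (the input frame is inferred from the output shown, not given in the original answer):
import pandas as pd

df = pd.DataFrame({'col_a': [1, 1, 1, 2, 2, 2],
                   'col_b': ['B', 'B', 'A', 'A', 'A', 'B']})
df = count_consecutive_items_n_cols(df, ['col_a', 'col_b'], 'count')
print(df)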
