I have a list of possible integer numbers:
item_list = [0,1,2,3]
and some of these numbers will not necessarily appear in my dataframe. For example, with:
df = pd.DataFrame({'a': [0, 2, 0, 1, 0, 1, 0]})
executing
df['a'].value_counts()
will yield
0 5
1 2
2 1
Name: a, dtype: int64
but I am interested in the occurrences of every number in item_list = [0,1,2,3], so basically I would like to see something like:
0 5
1 2
2 1
3 0
Name: a, dtype: int64
where the first column holds the values of item_list.
How can I get this result?
You can also use reindex:
df['a'].value_counts().reindex(item_list).fillna(0)
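Note that reindex introduces NaN for the labels missing from the counts, which upcasts the result to float; if you want integer counts back, cast afterwards:
df['a'].value_counts().reindex(item_list).fillna(0).astype(int)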
You can convert values to Categorical:
item_list = [0,1,2,3]
df.a = df.a.astype(pd.CategoricalDtype(categories=item_list))  # the old astype('category', categories=...) signature was removed in newer pandas
print (df['a'].value_counts())
0 5
1 2
2 1
3 0
Name: a, dtype: int64
With reindex and the fill_value parameter:
print (df['a'].value_counts().reindex(item_list, fill_value=0))
0 5
1 2
2 1
3 0
Name: a, dtype: int64
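For completeness, a minimal sketch of the categorical approach on recent pandas (assuming pandas >= 1.0) that does not mutate df:
import pandas as pd

item_list = [0, 1, 2, 3]
df = pd.DataFrame({'a': [0, 2, 0, 1, 0, 1, 0]})

# with sort=False the counts come back in category order,
# including the categories that never occur
cat = pd.CategoricalDtype(categories=item_list)
print(df['a'].astype(cat).value_counts(sort=False))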
How can I iterate over pairs of rows of a Pandas DataFrame?
For example:
content = [(1,2,[1,3]),(3,4,[2,4]),(5,6,[6,9]),(7,8,[9,10])]
df = pd.DataFrame(content, columns=["a", "b", "interval"])
print(df)
output:
a b interval
0 1 2 [1, 3]
1 3 4 [2, 4]
2 5 6 [6, 9]
3 7 8 [9, 10]
Now I would like to do something like
for (indx1, row1), (indx2, row2) in df.?:
    print("row1:\n", row1)
    print("row2:\n", row2)
    print("\n")
which should output
row1:
a 1
b 2
interval [1,3]
Name: 0, dtype: object
row2:
a 3
b 4
interval [2,4]
Name: 1, dtype: object
row1:
a 3
b 4
interval [2,4]
Name: 1, dtype: object
row2:
a 5
b 6
interval [6,9]
Name: 2, dtype: object
row1:
a 5
b 6
interval [6,9]
Name: 2, dtype: object
row2:
a 7
b 8
interval [9,10]
Name: 3, dtype: object
Is there a built-in way to achieve this?
I looked at df.groupby(df.index // 2) and df.itertuples, but neither of these methods seems to do what I want.
Edit:
The overall goal is to get a list of bools indicating whether the intervals in column "interval" overlap. In the above example the list would be
overlaps = [True, False, False]
So one bool for each pair.
Shift the dataframe and concat it back onto the original with axis=1, so that each interval and the next interval end up in the same row:
df_merged = pd.concat([df, df.shift(-1).add_prefix('next_')], axis=1)
df_merged
#Out:
a b interval next_a next_b next_interval
0 1 2 [1, 3] 3.0 4.0 [2, 4]
1 3 4 [2, 4] 5.0 6.0 [6, 9]
2 5 6 [6, 9] 7.0 8.0 [9, 10]
3 7 8 [9, 10] NaN NaN NaN
Then define an intersects function that works with your list representation and apply it to the merged dataframe, skipping the last row, where next_interval is NaN:
def intersects(left, right):
    # assumes intervals are sorted by their start; for arbitrary intervals
    # the full overlap test is left[0] < right[1] and right[0] < left[1]
    return left[1] > right[0]

df_merged[:-1].apply(lambda x: intersects(x.interval, x.next_interval), axis=1)
#Out:
0 True
1 False
2 False
dtype: bool
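The same pairwise test can also be written without the merge by zipping the interval column against itself shifted by one; a short sketch that, like the code above, assumes each interval starts no later than the next one:
iv = df['interval'].tolist()
overlaps = [left[1] > right[0] for left, right in zip(iv, iv[1:])]
# [True, False, False]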
If you want to keep the for loop, combining zip and iterrows is one way:
for (indx1, row1), (indx2, row2) in zip(df[:-1].iterrows(), df[1:].iterrows()):
    print("row1:\n", row1)
    print("row2:\n", row2)
    print("\n")
To access the next row at the same time, start the second iterrows one row later with df[1:].iterrows(), and you get the output the way you want:
row1:
a 1
b 2
interval [1, 3]
Name: 0, dtype: object
row2:
a 3
b 4
interval [2, 4]
Name: 1, dtype: object
row1:
a 3
b 4
interval [2, 4]
Name: 1, dtype: object
row2:
a 5
b 6
interval [6, 9]
Name: 2, dtype: object
row1:
a 5
b 6
interval [6, 9]
Name: 2, dtype: object
row2:
a 7
b 8
interval [9, 10]
Name: 3, dtype: object
But as @RafaelC said, a for loop might not be the best method for your general problem.
To get the output you've shown, use:
for row in df.index[:-1]:
    print('row 1:')
    print(df.iloc[row].squeeze())
    print('row 2:')
    print(df.iloc[row + 1].squeeze())
    print()
You could try iloc indexing.
Example:
for i in range(df.shape[0] - 1):
    idx1, idx2 = i, i + 1
    row1, row2 = df.iloc[idx1], df.iloc[idx2]
    print(row1)
    print(row2)
    print()
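Since the question also mentions df.itertuples, the same zip pairing works there as well, and is typically faster than iterrows (a sketch):
for row1, row2 in zip(df.itertuples(), df[1:].itertuples()):
    print(row1)
    print(row2)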
Using the pandas concat function it is possible to create a Series like this:
In [230]: pd.concat({'One': pd.Series(range(3)), 'Two': pd.Series(range(4))})
Out[230]:
One 0 0
1 1
2 2
Two 0 0
1 1
2 2
3 3
dtype: int64
Is it possible to do the same without using the concat method?
My best approach was:
a = pd.Series(range(3),range(3))
b = pd.Series(range(4),range(4))
pd.Series([a,b],index=['One','Two'])
But it is not the same, it outputs:
One 0 0
1 1
2 2
dtype: int64
Two 0 0
1 1
2 2
3 3
dtype: int64
dtype: object
This should give you an idea of just how useful concat is.
a.index = pd.MultiIndex.from_tuples([('One', v) for v in a.index])
b.index = pd.MultiIndex.from_tuples([('Two', v) for v in b.index])
a.append(b)  # note: Series.append was removed in pandas 2.0; use pd.concat there
One 0 0
1 1
2 2
Two 0 0
1 1
2 2
3 3
dtype: int64
The same thing is achieved by pd.concat([a, b]).
This is the job of the keys argument, in case you want to get the same output using concat, i.e.:
pd.concat([a,b],keys=['One','Two'])
One 0 0
1 1
2 2
Two 0 0
1 1
2 2
3 3
dtype: int64
This works fine too (note that the labels argument of MultiIndex was renamed to codes in pandas 0.24):
data = list(range(3)) + list(range(4))
index = pd.MultiIndex(levels=[['One', 'Two'], [0, 1, 2, 3]],
                      codes=[[0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]])
pd.Series(data, index=index)
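A perhaps more readable way to build the same index by hand is MultiIndex.from_tuples, which spares you the levels/codes encoding (a sketch):
import pandas as pd

tuples = [('One', i) for i in range(3)] + [('Two', i) for i in range(4)]
s = pd.Series(list(range(3)) + list(range(4)),
              index=pd.MultiIndex.from_tuples(tuples))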
I have a pandas dataframe which looks like the following:
0 1
0 2
2 3
1 4
What I want to do is the following: given 2 as input, my code should search for 2 in the dataframe and, wherever it finds it, return the value in the other column. In the above example it would return 0 and 3. I know that I can simply look at each row and check whether any of the elements equals 2, but I was wondering if there is a one-liner for such a problem.
UPDATE: None of the columns are index columns.
Thanks
>>> df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
>>> df
A B
0 0 1
1 0 2
2 2 3
3 1 4
The following pandas syntax is equivalent to the SQL SELECT B FROM df WHERE A = 2
>>> df[df['A'] == 2]['B']
2 3
Name: B, dtype: int64
There's also pandas.DataFrame.query:
>>> df.query('A == 2')['B']
2 3
Name: B, dtype: int64
You may need this:
n_input = 2
df[(df == n_input).any(axis=1)].stack()[lambda x: x != n_input].unique()
# array([0, 3])
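An equivalent but more explicit version searches both columns and concatenates the values from the opposite one (a sketch, assuming the columns are named 'A' and 'B' as above):
n_input = 2
pd.concat([df.loc[df['B'] == n_input, 'A'],
           df.loc[df['A'] == n_input, 'B']])
# 1    0
# 2    3
# dtype: int64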
df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})
t = df.loc[lambda d: d['A'] == 2, 'B']
t
I've got a dataset with a large number of rows. Some of the values are NaN, like this:
In [91]: df
Out[91]:
1 3 1 1 1
1 3 1 1 1
2 3 1 1 1
1 1 NaN NaN NaN
1 3 1 1 1
1 1 1 1 1
And I want to count the number of NaN values in each row, like this:
In [91]: list = <somecode with df>
In [92]: list
Out[91]:
[0,
0,
0,
3,
0,
0]
What is the best and fastest way to do it?
You could first flag each element as NaN or not with isnull() and then take the row-wise sum(axis=1):
In [195]: df.isnull().sum(axis=1)
Out[195]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
And if you want the output as a list, you can call tolist():
In [196]: df.isnull().sum(axis=1).tolist()
Out[196]: [0, 0, 0, 3, 0, 0]
Or use count, which tallies the non-NaN cells per row, like:
In [130]: df.shape[1] - df.count(axis=1)
Out[130]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
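For large frames whose columns are all numeric, dropping down to NumPy can be somewhat faster (a sketch under that assumption; np.isnan does not accept object dtypes):
import numpy as np

np.isnan(df.to_numpy()).sum(axis=1).tolist()
# [0, 0, 0, 3, 0, 0]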
To count NaNs over a specific subset of columns, use
cols = ['col1', 'col2']
df['number_of_NaNs'] = df[cols].isna().sum(axis=1)
or index the columns by position, e.g. to count NaNs in the first 4 columns:
df['number_of_NaNs'] = df.iloc[:, :4].isna().sum(axis=1)
I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works for indexing:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not for assignment, because the chained indexing returns a copy, so the 1 is written to that copy rather than to df:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work, but it is a little ugly: basically we take the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc using this index; we can now select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum's, but more elegant, as suggested in the comments.
EDIT: You have to be a little careful with this one, as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by either of the labels in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it is safer to use .iloc with integer indexing, something like
import numpy as np

ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1  # iloc takes a column position, not the label 'c'
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
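The cumsum trick generalizes neatly to "the first n matches". A small hypothetical helper (first_n_true is my own name, not a pandas API) might look like:
def first_n_true(mask, n):
    # keep only the first n True positions of a boolean Series
    return mask & (mask.cumsum() <= n)

df.loc[first_n_true(mask, 2), 'c'] = 1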