Count number of elements in each column less than x - python

I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.iloc[:, i] < 2.0].count().iloc[i])
Then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433

In [96]:
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10), 'c': np.random.randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0].count()
should work.
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 µs per loop
1000 loops, best of 3: 529 µs per loop
So @DSM's suggestions are correct and much faster than my suggestion.

Method chaining is also possible, since each comparison operator has a corresponding method (e.g. < is lt(), <= is le()):
df.lt(2).sum()
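Since the original goal was a bar plot of these per-column counts, the result can be plotted directly; a small sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

counts = df.lt(2.0).sum()   # number of values below 2.0 in each column
counts.plot.bar()           # one bar per column
plt.show()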
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can combine two boolean DataFrames with boolean operators:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()

Trying to Calculate Percentages on Python Pandas
1 2
0 A 0
1 A 1
2 A 2
3 B 0
4 B 0
5 B 1.5
6 B 0
Desired output: the percentage of A's and B's with a score higher than 0:
A = 66%
B = 25%
1. Create a boolean filter on the second column: (df['2'] > 0)
2. Group it by the first column
3. Aggregate with sum and size (sum counts the values that satisfy the condition)
4. Divide sum by size to get the percentage:
res = (df['2'] > 0).groupby(df['1']).agg(['sum', 'size'])
res['sum'] / res['size']
Out:
1
A 0.666667
B 0.250000
dtype: float64
This can be done in a more compact way with a lambda expression:
df.groupby('1')['2'].agg(lambda x: (x > 0).sum() / x.size)
Out:
1
A 0.666667
B 0.250000
Name: 2, dtype: float64
but I suspect that the first one is more efficient.
Since the mean of a boolean Series is the fraction of True values, the same result can be computed with mean():
In [3]: df['2'].gt(0).groupby(df['1']).mean()
Out[3]:
1
A 0.666667
B 0.250000
Name: 2, dtype: float64
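If the literal output format from the question (A = 66%, B = 25%) is wanted, the fractions above can be formatted afterwards; a small sketch (truncating rather than rounding, to match the 66% shown in the question):
pct = df['2'].gt(0).groupby(df['1']).mean().mul(100).astype(int)
for name, value in pct.items():
    print(f"{name} = {value}%")
# A = 66%
# B = 25%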
Don't mind me...
I'm on a kick where I'm solving everything with np.bincount and pd.factorize
f, u = df['1'].factorize()
pd.Series(
    np.bincount(f, df['2'].values > 0) / np.bincount(f),
    u
)
A 0.666667
B 0.250000
dtype: float64
One-liner version for fun!
(lambda w, g, f, u: pd.Series(g(f, w) / g(f), u))(
    df['2'].values > 0, np.bincount, *pd.factorize(df['1'].values)
)
Naive Timing
%timeit df['2'].gt(0).groupby(df['1']).mean()
%timeit df.groupby('1')['2'].agg(lambda x: (x > 0).sum() / x.size)
%timeit (lambda w, g, f, u: pd.Series(g(f, w) / g(f)))(df['2'].values > 0, np.bincount, *pd.factorize(df['1'].values))
1000 loops, best of 3: 697 µs per loop
1000 loops, best of 3: 1 ms per loop
10000 loops, best of 3: 117 µs per loop

Removing usernames from a dataframe that do not appear a certain number of times?

I am trying to understand the code provided below (which I found online, but do not fully understand). I want to essentially remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves it? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check the documentation on filtration: groupby('userName') splits the DataFrame into one group per user name, filter calls the lambda on each group's sub-DataFrame, and len(x) is that group's row count, so only rows belonging to groups with more than 4 rows are kept (see the small sketch below).
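A minimal illustration, on toy data not taken from the question, of what filter hands to the lambda:
import pandas as pd

toy = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3})
for name, group in toy.groupby('userName'):
    # filter evaluates the lambda once per group; len(group) is that group's row count
    print(name, len(group), len(group) > 4)
# a 5 True   -> all 'a' rows are kept
# b 3 False  -> all 'b' rows are dropped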
A faster solution on bigger DataFrames is transform with boolean indexing: transform('size') broadcasts each group's row count back onto every row, so it can be compared against the threshold and used directly as a boolean mask:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)  # integer code for each name
    c = np.bincount(f)          # number of occurrences of each unique name
    m = c[f] > k                # per-row mask: this row's name occurs more than k times
    return df[m]

pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
Using @jezrael's large data setup:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop
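For completeness, a value_counts/isin variant is also possible, using the question's threshold of 4; a sketch (not part of the timings above, so no performance claim):
counts = df['userName'].value_counts()
keep = counts[counts > 4].index          # names appearing more than 4 times
df[df['userName'].isin(keep)]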

Sum of a column in Pandas DataFrame

I have a Pandas DataFrame.
LeafId pidx pidy count
1 x y 10
1 x y 20
1 x z 30
3 b q 10
1 x y 20
We can see that there are multiple rows with pidx = x and pidy = y.
I want to sum the count column and get dataframe df2 as:
LeafId pidx pidy count
1 x y 50
1 x z 30
3 b q 10
I know one way of doing it:
df2 = df.groupby(['pidx','pidy']).agg({'LeafId':'first', 'count':'sum'}).reset_index()
But I want the most efficient way of doing it for a huge DataFrame (millions of records), which will take the least amount of time.
Is there any better way of doing this?
Also, instead of putting LeafId inside .agg(), can I do the following?
df2 = df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index()
If you need to group by the LeafId, pidx and pidy columns:
df1 = df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum()
print (df1)
LeafId pidx pidy count
0 1 x y 50
1 1 x z 30
2 3 b q 10
I tried some timings:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId': np.random.randint(1000, size=N),
                   'pidx': np.random.choice(L1, N),
                   'pidy': np.random.choice(L2, N),
                   'count': np.random.randint(1000, size=N)})
#print (df)
print (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
print (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
In [261]: %timeit (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
1 loop, best of 3: 544 ms per loop
In [262]: %timeit (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
1 loop, best of 3: 466 ms per loop
Smaller groups (value ranges increased from 1,000 to 10,000):
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId': np.random.randint(10000, size=N),
                   'pidx': np.random.choice(L1, N),
                   'pidy': np.random.choice(L2, N),
                   'count': np.random.randint(10000, size=N)})
print (df)
print (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
print (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
In [264]: %timeit (df.groupby(['LeafId','pidx','pidy'], as_index=False)['count'].sum())
1 loop, best of 3: 933 ms per loop
In [265]: %timeit (df.groupby(['LeafId','pidx','pidy']).agg({'count':'sum'}).reset_index())
1 loop, best of 3: 775 ms per loop
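If the order of the groups in the output does not matter, passing sort=False to groupby may save a bit more time; a hedged sketch, not benchmarked here:
df2 = df.groupby(['LeafId', 'pidx', 'pidy'], as_index=False, sort=False)['count'].sum()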

Python 3 occurrence matching

I started to write this on a whim, for little more than curiosity. I've been stepping through the code in a visualizer and it looks like it iterates the way I'd expect, but it doesn't output what I think it should. Can someone show me what I'm missing? It's just a funny example of how SQL join tables look after processing.
def query(a=[1,2,3,4], b=[3,1,1,2,3,4,5,6]):
    """
    table A      table B      Expected Output   Actual Output
    idx value    idx value    indxA indxB       indxA indxB
    0   1        0   3        0     1           0     1
    1   2        1   1        0     2           0     1
    2   3        2   1        1     3           1     3
    3   4        3   2        2     0           2     0
                 4   3        2     4           2     0
                 5   4        3     5           3     5
                 6   5
                 7   6
    EXAMPLE
    The value at Table A index 0 occurs at Table B indices 1 and 2.
    PROBLEM
    Anywhere there are multiple matches, only the first occurrence prints.
    """
    for idx, itemA in enumerate(a):
        if itemA in b:
            for itemB in b:
                if itemA == itemB:
                    print("{} {}".format(a.index(itemA), b.index(itemB)))

query()
list.index(x) returns the index in the list of the first item whose value is x.
So you should use enumerate(b).
def query(a=[1,2,3,4], b=[3,1,1,2,3,4,5,6]):
    for index_a, value_a in enumerate(a):
        for index_b, value_b in enumerate(b):
            if value_a == value_b:
                print("{} {}".format(index_a, index_b))

query()
For large lists, iterating over every element of one list for every element of the other is slow: it is a quadratic algorithm.
Here is the list-only version. I took the printing out, since it would dominate the run time, and return the results in a list instead:
def query_list(a, b):
    res = []
    for index_a, value_a in enumerate(a):
        for index_b, value_b in enumerate(b):
            if value_a == value_b:
                res.append((index_a, index_b))
    return res
This is an alternative implementation using dictionaries:
def query_dict(a, b):
    indices = {value: index for index, value in enumerate(a)}
    res = []
    for index_b, value_b in enumerate(b):
        if value_b in indices:
            res.append((indices[value_b], index_b))
    return sorted(res)
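As a quick sanity check on the small lists from the question, both versions produce the same pairs; a usage sketch:
a = [1, 2, 3, 4]
b = [3, 1, 1, 2, 3, 4, 5, 6]
print(query_list(a, b) == query_dict(a, b))   # True
print(query_dict(a, b))
# [(0, 1), (0, 2), (1, 3), (2, 0), (2, 4), (3, 5)]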
Generating some example data:
import random
a = list(range(1, 1000))
b = [random.randint(1, 100) for x in range(10000)]
The list version:
%timeit query_list(a, b)
1 loops, best of 3: 1.09 s per loop
is much slower than the dict version:
%timeit query_dict(a, b)
100 loops, best of 3: 11.8 ms per loop
That is roughly a factor of 100.
Using larger example data:
import random
a = list(range(1, 10000))
b = [random.randint(1, 100) for x in range(10000)]
The difference becomes even more pronounced:
%timeit query_list(a, b)
1 loops, best of 3: 11.4 s per loop
%timeit query_dict(a, b)
100 loops, best of 3: 13.7 ms per loop
Going up to a factor of more than 800.

how to compute a new column based on the values of other columns in pandas - python

Let's say my data frame contains these data:
>>> df = pd.DataFrame({'a': ['l1','l2','l1','l2','l1','l2'],
...                    'b': ['1','2','2','1','2','2']})
>>> df
a b
0 l1 1
1 l2 2
2 l1 2
3 l2 1
4 l1 2
5 l2 2
l1 should correspond to 1 whereas l2 should correspond to 2.
I'd like to create a new column 'c' such that, for each row, c = 1 if a = l1 and b = 1 (or a = l2 and b = 2). If a = l1 and b = 2 (or a = l2 and b = 1) then c = 0.
The resulting data frame should look like this:
a b c
0 l1 1 1
1 l2 2 1
2 l1 2 0
3 l2 1 0
4 l1 2 0
5 l2 2 1
My data frame is very large so I'm really looking for the most efficient way to do this using pandas.
df = pd.DataFrame({'a': np.random.choice(['l1', 'l2'], 1000000),
                   'b': np.random.choice(['1', '2'], 1000000)})
A fast solution assuming only two distinct values:
%timeit df['c'] = ((df.a == 'l1') == (df.b == '1')).astype(int)
10 loops, best of 3: 178 ms per loop
#Viktor Kerkes:
%timeit df['c'] = (df.a.str[-1] == df.b).astype(int)
1 loops, best of 3: 412 ms per loop
#user1470788:
%timeit df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
1 loops, best of 3: 363 ms per loop
#herrfz
%timeit df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
1 loops, best of 3: 387 ms per loop
You can also use the string methods.
df['c'] = (df.a.str[-1] == df.b).astype(int)
df['c'] = (df.a.apply(lambda x: x[1:])==df.b).astype(int)
You can just use logical operators. I'm not sure why you're using strings of 1 and 2 rather than ints, but here's a solution. The astype at the end converts it from boolean to 0's and 1's.
df['c'] = (((df['a'] == 'l1')&(df['b']=='1'))|((df['a'] == 'l2')&(df['b']=='2'))).astype(int)
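If there were more than two labels, the same idea generalizes with a mapping; a sketch (the dict is just the 'l1' to '1', 'l2' to '2' correspondence stated in the question):
# map each label in 'a' to the 'b' value it should correspond to, then compare
df['c'] = (df['a'].map({'l1': '1', 'l2': '2'}) == df['b']).astype(int)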
