I have a large dataset, and for each row I want to compute the count, sum, and average of only those values that fall within a specific range.
df = pd.DataFrame({'id0':[10.3,20,30,50,108,110],'id1':[100.5,0,300,570,400,140], 'id2':[-2.6,-3,5,12,44,53], 'id3':[-100.1,4,6,22,12,42]})
id0 id1 id2 id3
0 10.3 100.5 -2.6 -100.1
1 20.0 0.0 -3.0 4.0
2 30.0 300.0 5.0 6.0
3 50.0 570.0 12.0 22.0
4 108.0 400.0 44.0 12.0
5 110.0 140.0 53.0 42.0
For example, I want to count the values between 10 and 100 in each row, which should give:
0 1
1 1
2 1
3 3
4 2
5 2
Name: count_10-100, dtype: int64
Currently I do this by iterating over each row, transposing, and using groupby. But this takes a long time because I have ~500 columns and 500000 rows.
You can apply the conditions with AND between them, and then sum along the row (axis 1):
((df >= 10) & (df <= 100)).sum(axis=1)
Output:
0 1
1 1
2 1
3 3
4 2
5 2
dtype: int64
For sum and mean, you can apply the conditions with where:
df.where((df >= 10) & (df <= 100)).sum(axis=1)
df.where((df >= 10) & (df <= 100)).mean(axis=1)
Credit for this goes to @anky, who posted it first as a comment :)
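For a frame of the size mentioned in the question, the same masking idea can also be applied to the underlying NumPy array so that count, sum, and mean share one mask. The following is only a sketch under the assumption that every column is numeric; the names lo, hi, and result are illustrative and not from the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id0": [10.3, 20, 30, 50, 108, 110],
    "id1": [100.5, 0, 300, 570, 400, 140],
    "id2": [-2.6, -3, 5, 12, 44, 53],
    "id3": [-100.1, 4, 6, 22, 12, 42],
})

lo, hi = 10, 100  # inclusive bounds (assumed names, values from the question)

values = df.to_numpy(dtype=float)
mask = (values >= lo) & (values <= hi)  # True where a cell is in range

counts = mask.sum(axis=1)                        # in-range cells per row
sums = np.where(mask, values, 0.0).sum(axis=1)   # out-of-range cells add 0
# Rows with no in-range value keep NaN instead of dividing by zero.
means = np.divide(sums, counts, out=np.full(len(df), np.nan), where=counts > 0)

result = pd.DataFrame({"count": counts, "sum": sums, "mean": means}, index=df.index)
print(result)
```

Unlike df.where(...).sum(axis=1), which returns 0 for a row with no match, this keeps the count alongside the sum, so an empty row can be told apart from a row that sums to zero.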
The list below summarizes the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).
DataFrame.count returns counts for each column as a Series since the non-null count varies by column.
DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row-count.
DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group.
To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count() where "x" is the column to count.
Code Examples
df = pd.DataFrame({
'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()
df
A B
0 a x
1 a x
2 b NaN
3 b x
4 c NaN
s
0 x
1 x
2 NaN
3 x
4 NaN
Name: B, dtype: object
Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)
len(df)
# 5
df.shape[0]
# 5
len(df.index)
# 5
Of the three methods above, len(df.index) (as mentioned in other answers) is the fastest.
Note
All the methods above are constant time operations as they are simple attribute lookups.
df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols).
Column Count of a DataFrame: df.shape[1], len(df.columns)
df.shape[1]
# 2
len(df.columns)
# 2
Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).
Row Count of a Series:
len(s), s.size, len(s.index)
len(s)
# 5
s.size
# 5
len(s.index)
# 5
s.size and len(s.index) are about the same in terms of speed, but I recommend len(s).
size is an attribute, and it returns the number of elements (= count of rows for any Series). DataFrames also define a size attribute, which returns the same result as df.shape[0] * df.shape[1].
Non-Null Row Count: DataFrame.count and Series.count
The methods described here only count non-null values (meaning NaNs are ignored).
Calling DataFrame.count will return non-NaN counts for each column:
df.count()
A 5
B 3
dtype: int64
For Series, use Series.count to similar effect:
s.count()
# 3
Group-wise Row Count: GroupBy.size
For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.
df.groupby('A').size()
A
a 2
b 2
c 1
dtype: int64
Similarly, for Series, you'll use SeriesGroupBy.size.
s.groupby(df.A).size()
A
a 2
b 2
c 1
Name: B, dtype: int64
In both cases, a Series is returned.
Group-wise Non-Null Row Count: GroupBy.count
Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.
The following methods return the same thing:
df.groupby('A')['B'].size()
df.groupby('A').size()
A
a 2
b 2
c 1
Name: B, dtype: int64
df.groupby('A').count()
B
A
a 2
b 1
c 0
df.groupby('A')['B'].count()
A
a 2
b 1
c 0
Name: B, dtype: int64
There's a neat way to do that with agg and pandas methods. It can be read as "aggregate by row (axis=1), counting where x is greater than or equal to 10 and less than or equal to 100".
df.agg(lambda x : (x.ge(10) & x.le(100)).sum(), axis=1)
Something like this will help you.
df["n_values_in_range"] = df.apply(
func=lambda row: count_values_in_range(row, range_min, range_max), axis=1)
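The snippet above does not show count_values_in_range, range_min, or range_max. One possible definition is sketched below; the helper body and the bounds are assumptions made for illustration, not the answerer's code.

```python
import pandas as pd


def count_values_in_range(row, range_min, range_max):
    """Hypothetical helper: count the entries of `row` in [range_min, range_max]."""
    return int(row.between(range_min, range_max).sum())


df = pd.DataFrame({
    "id0": [10.3, 20, 30, 50, 108, 110],
    "id1": [100.5, 0, 300, 570, 400, 140],
    "id2": [-2.6, -3, 5, 12, 44, 53],
    "id3": [-100.1, 4, 6, 22, 12, 42],
})

range_min, range_max = 10, 100  # assumed bounds, taken from the question

# apply forwards the extra positional arguments to the helper for every row.
df["n_values_in_range"] = df.apply(
    count_values_in_range, axis=1, args=(range_min, range_max)
)
print(df["n_values_in_range"])
```

Note that a row-wise apply calls the Python function once per row, so on 500000 rows it is likely to be much slower than the boolean-mask approach.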
Try this:
df.apply(lambda x: x.between(10, 100), axis=1).sum(axis=1)
Output:
0 1
1 1
2 1
3 3
4 2
5 2
The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
4 B 8 Z10
I want to obtain ratio of A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, its ratio of A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID RatioAB
1 5/6
2 10/11
3 Inf
4 0
Thank you.
One approach is using pivot_table, aggregating with the sum in the case there are multiple occurrences of the same letters (otherwise a simple pivot will do), and evaluating on columns A and B:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If there is maximum one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A','B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
columns='Code',
values='Value')
df1['RatioAB'] = df1['A']/df1['B']
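The most apparent way is via groupby. Note that .iloc[0] raises an IndexError for any UniqueID that has no A row or no B row (IDs 3 and 4 in the sample data), so this only works when every group contains both codes:
df.groupby('UniqueID').apply(lambda g: g.query("Code == 'A'")['Value'].iloc[0] / g.query("Code == 'B'")['Value'].iloc[0])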
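The pivot-based answers return NaN for IDs 3 and 4, while the question asks for Inf and 0. One way to get there, sketched here under the assumption that a missing code should count as 0, is to fill the gaps before dividing. The variable names wide and ratio are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "UniqueID": [1, 1, 1, 2, 2, 2, 3, 4],
    "Code": list("ABCABCAB"),
    "Value": [5, 6, 7, 10, 11, 12, 10, 8],
})

# One row per UniqueID, one column per Code; absent codes become 0 (assumption).
wide = df.pivot_table(
    index="UniqueID", columns="Code", values="Value", aggfunc="sum", fill_value=0
)

# pandas division by zero yields inf rather than raising.
ratio = (wide["A"] / wide["B"]).rename("RatioAB").reset_index()
print(ratio)
```

If a group had neither A nor B, this would produce 0/0 = NaN, which may or may not be the desired result.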
I have a df with values:
A B C D
0 1 2 3 2
1 2 3 3 9
2 5 3 6 6
3 3 6 7
4 6 7
5 2
df.shape is (6, 4).
df.iloc[:,1] pulls out the B column, but len(df.iloc[:,1]) is also 6.
How do I "reshape" df.iloc[:,1]? Which function can I use so that the output is the number of actual values in the column?
My expected output in this case is 3.
You can use last_valid_index. Just note that since your series originally contained NaN values and these are considered float, even after filtering your series will be float. You may wish to convert to int as a separate step.
# first convert dataframe to numeric
df = df.apply(pd.to_numeric, errors='coerce')
# extract column
B = df.iloc[:, 1]
# filter up to and including the last valid value (.loc slices by label, inclusive)
B_filtered = B.loc[:B.last_valid_index()]
print(B_filtered)
0 2.0
1 3.0
2 3.0
3 6.0
Name: B, dtype: float64
You can use a list comprehension like this, assuming the blank cells are empty strings rather than NaN:
len([x for x in df.iloc[:,1] if x != ''])
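If only the number of actual values is needed, Series.count may be the shortest route, since it skips missing entries. The sketch below assumes the blanks in column B are either NaN or empty strings, and the sample values are one reading of the question's table rather than the asker's real data.

```python
import numpy as np
import pandas as pd

# Assumed reading of the question's table: B holds three values, then blanks.
df = pd.DataFrame({
    "A": [1, 2, 5, 3, 6, 2],
    "B": [2, 3, 3, np.nan, np.nan, np.nan],
})

n_values = df.iloc[:, 1].count()       # counts non-null entries only
values_only = df.iloc[:, 1].dropna()   # the values themselves, blanks removed

# If the blanks are empty strings instead, coerce them to NaN first.
raw = pd.Series(["2", "3", "3", "", "", ""])
n_from_strings = pd.to_numeric(raw, errors="coerce").count()

print(n_values, len(values_only), n_from_strings)
```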
Using Python and pandas: for a given data set, how does one find the total number of missing attributes? I have found the number for each column, but I need to sum the columns to find the total. Below is the code I currently use.
def num_missing(x):
    return sum(x.isnull())
print("Missing Values per Column:")
print(data_file1.apply(num_missing))
Consider df -
df
A B C
0 1.0 4 NaN
1 2.0 5 1.0
2 NaN 6 6.0
3 NaN 7 3.0
Column-wise NaN count -
df.isnull().sum(0)
A 2
B 0
C 1
dtype: int64
Row-wise NaN count -
df.isnull().sum(1)
0 1
1 0
2 1
3 1
dtype: int64
df-wide NaN count -
df.isnull().values.sum()
3
Option 1: call .sum() twice, where the second call finds the sum of the intermediate Series.
df = pd.DataFrame(np.ones((5,5)))
df.iloc[2:4, 1:3] = np.nan
df.isnull().sum().sum()
# 4
Option 2: use underlying NumPy array.
np.isnan(df.values).sum()
# 4
Option 2 should be significantly faster (8.5 us vs. 249 us on this sample data).
As noted by @root, np.isnan() works only on numeric data, not object dtypes. pandas.DataFrame.isnull() doesn't have this problem.
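The object-dtype caveat can be seen on a small mixed frame. This is a sketch with made-up column names (num, txt); it counts missing cells through isna and shows that np.isnan rejects the object array.

```python
import numpy as np
import pandas as pd

# Mixed dtypes: to_numpy() on this frame yields an object array.
df = pd.DataFrame({"num": [1.0, np.nan, 3.0], "txt": ["a", None, "c"]})

# isna handles NaN and None in any column type.
total_missing = int(df.isna().to_numpy().sum())

# np.isnan is a numeric ufunc and raises TypeError on object arrays.
try:
    np.isnan(df.to_numpy())
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(total_missing, isnan_failed)
```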
What is the difference between groupby("x").count and groupby("x").size in pandas?
Does size just exclude nil?
size includes NaN values, count does not:
In [46]:
df = pd.DataFrame({'a':[0,0,1,2,2,2], 'b':[1,2,3,4,np.nan,4], 'c':np.random.randn(6)})
df
Out[46]:
a b c
0 0 1 1.067627
1 0 2 0.554691
2 1 3 0.458084
3 2 4 0.426635
4 2 NaN -2.238091
5 2 4 1.256943
In [48]:
print(df.groupby(['a'])['b'].count())
print(df.groupby(['a'])['b'].size())
a
0 2
1 1
2 2
Name: b, dtype: int64
a
0 2
1 1
2 3
dtype: int64
What is the difference between size and count in pandas?
The other answers have pointed out the difference, however, it is not completely accurate to say "size counts NaNs while count does not". While size does indeed count NaNs, this is actually a consequence of the fact that size returns the size (or the length) of the object it is called on. Naturally, this also includes rows/values which are NaN.
So, to summarize, size returns the size of the Series/DataFrame [1],
df = pd.DataFrame({'A': ['x', 'y', np.nan, 'z']})
df
A
0 x
1 y
2 NaN
3 z
df.A.size
# 4
...while count counts the non-NaN values:
df.A.count()
# 3
Notice that size is an attribute (gives the same result as len(df) or len(df.A)). count is a method.
[1] DataFrame.size is also an attribute and returns the number of elements in the DataFrame (rows x columns).
Behaviour with GroupBy - Output Structure
Besides the basic difference, there's also the difference in the structure of the generated output when calling GroupBy.size() vs GroupBy.count().
df = pd.DataFrame({
'A': list('aaabbccc'),
'B': ['x', 'x', np.nan, np.nan,
np.nan, np.nan, 'x', 'x']
})
df
A B
0 a x
1 a x
2 a NaN
3 b NaN
4 b NaN
5 c NaN
6 c x
7 c x
Consider,
df.groupby('A').size()
A
a 3
b 2
c 3
dtype: int64
Versus,
df.groupby('A').count()
B
A
a 2
b 0
c 2
GroupBy.count returns a DataFrame when you call count on all columns, while GroupBy.size returns a Series.
The reason is that size is the same for all columns, so only a single result is returned. Meanwhile, count is computed for each column, as the result depends on how many NaNs each column has.
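To see both numbers side by side, size and count can be requested together through named aggregation. This is a small sketch on the same data; the output column names n_rows and n_non_null are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("aaabbccc"),
    "B": ["x", "x", np.nan, np.nan, np.nan, np.nan, "x", "x"],
})

# 'size' counts every row in the group; 'count' skips NaN values in B.
summary = df.groupby("A").agg(
    n_rows=("B", "size"),
    n_non_null=("B", "count"),
)
print(summary)
```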
Behavior with pivot_table
Another example is how pivot_table treats this data. Suppose we would like to compute the cross tabulation of
df
A B
0 0 1
1 0 1
2 1 2
3 0 2
4 0 0
pd.crosstab(df.A, df.B) # Result we expect, but with `pivot_table`.
B 0 1 2
A
0 1 2 1
1 0 0 1
With pivot_table, you can issue size:
df.pivot_table(index='A', columns='B', aggfunc='size', fill_value=0)
B 0 1 2
A
0 1 2 1
1 0 0 1
But count does not work; an empty DataFrame is returned:
df.pivot_table(index='A', columns='B', aggfunc='count')
Empty DataFrame
Columns: []
Index: [0, 1]
I believe the reason for this is that 'count' must be done on the series that is passed to the values argument, and when nothing is passed, pandas decides to make no assumptions.
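If that explanation is right, giving pivot_table an explicit column to count should make 'count' usable. The sketch below adds a throwaway column n (an assumption for illustration) and compares the result with pd.crosstab.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1, 0, 0], "B": [1, 1, 2, 2, 0]})

# Add a dummy column so that 'count' has something to count.
counted = df.assign(n=1).pivot_table(
    index="A", columns="B", values="n", aggfunc="count", fill_value=0
)

expected = pd.crosstab(df["A"], df["B"])
print(counted)
```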
Just to add a little bit to @EdChum's answer, even if the data has no NA values, the result of count() is more verbose, using the example before:
grouped = df.groupby('a')
grouped.count()
Out[197]:
b c
a
0 2 2
1 1 1
2 2 3
grouped.size()
Out[198]:
a
0 2
1 1
2 3
dtype: int64
When we are dealing with plain DataFrames, the only difference is the treatment of NaN values: count does not include NaN values when counting rows.
With groupby, however, count() works column by column, so to get a single number per group you have to select a column (ideally one with no missing values), whereas size() needs no column selection.
In addition to all the above answers, I would like to point out one more difference which I find significant.
It is tempting to compare DataFrame size and count with the capacity and size of a Java Vector, but the analogy does not hold: pandas does not report preallocated memory through either of them.
The size attribute gives the total number of cells in the DataFrame (rows x columns), NaN cells included, whereas count gives the number of non-null values in each column. For example,
You can see that even though there are 3 rows in DataFrame, its size is 6.
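The example this answer refers to did not survive in the text. The following is a stand-in, not the original: an assumed 3-row, 2-column frame (column names x and y are made up) where size is 6 while count reports the non-null values per column.

```python
import numpy as np
import pandas as pd

# Assumed stand-in data: 3 rows, 2 columns, one missing value in each column.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

n_rows, n_cols = df.shape
n_cells = df.size              # rows * columns, missing cells included
non_null = df.count()          # non-null values per column

print(n_rows, n_cols, n_cells)
print(non_null)
```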
This answer covers size and count difference with respect to DataFrame and not pandas Series. I have not checked what happens with Series.
I have two dataframes, df1 (1 row, 10 columns) and df2 (7 rows, 10 columns). Now I want to add values from those two dataframes:
final = df1[0] + df2[0][0]
print(final)
output:
0 23.8458
Name: 0, dtype: float64
I believe 23.8458 is what I want, but I don't understand what the "0" stands for or what the datatype of "final" is. I just want to keep 23.8458 as a float. How can I do that? Thanks,
What you are printing is a Series with a single element: 0 is its index label and 23.8458 is its value. If you want just the value, print(final[0]) gives you that.
You can see the type with print(type(final)).
Example:
In [42]:
df1 = pd.DataFrame(np.random.randn(1,7))
print(df1)
df2 = pd.DataFrame(np.random.randn(7,10))
print(df2)
0 1 2 3 4 5 6
0 -1.575662 0.725688 -0.442072 0.622824 0.345227 -0.062732 0.197771
0 1 2 3 4 5 6 \
0 -0.658687 0.404817 -0.786715 -0.923070 0.479590 0.598325 -0.495963
1 1.482873 -0.240854 -0.987909 0.085952 -0.054789 -0.576887 -0.940805
2 1.173126 0.190489 -0.809694 -1.867470 0.500751 -0.663346 -1.777718
3 -0.111570 -0.606001 -0.755202 -0.201915 0.933065 -0.833538 2.526979
4 1.537778 -0.739090 -1.813050 0.601448 0.296994 -0.966876 0.459992
5 -0.936997 -0.494562 0.365359 -1.351915 -0.794753 1.552997 -0.684342
6 0.128406 0.016412 0.461390 -2.411903 3.154070 -0.584126 0.136874
7 8 9
0 -0.483917 -0.268557 1.386847
1 0.379854 0.205791 -0.527887
2 -0.307892 -0.915033 0.017231
3 -0.672195 0.110869 1.779655
4 0.241685 1.899335 -0.334976
5 -0.510972 -0.733894 0.615160
6 2.094503 -0.184050 1.328208
In [43]:
final = df1[0] + df2[0][0]
print(final)
0 -2.234349
Name: 0, dtype: float64
In [44]:
print(type(final))
<class 'pandas.core.series.Series'>
In [45]:
final[0]
Out[45]:
-2.2343491912631328
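If the goal is a plain float from the start, selecting single cells by position avoids the intermediate Series. This is a sketch with small made-up frames in place of the random data above.

```python
import pandas as pd

# Made-up stand-ins for df1 (1 row) and df2 (several rows).
df1 = pd.DataFrame([[1.5, 2.0, 3.0]])
df2 = pd.DataFrame([[0.25, 1.0, 2.0], [4.0, 5.0, 6.0]])

# df1[0] is a whole column (a Series), so the sum is a one-element Series.
final = df1[0] + df2[0][0]
as_float = float(final.iloc[0])  # first element by position

# Picking scalars on both sides yields a scalar directly.
direct = float(df1.iloc[0, 0] + df2.iloc[0, 0])

print(type(final).__name__, as_float, direct)
```

final.iloc[0] selects by position, while final[0] looks up the label 0. The two agree here only because the default index starts at 0.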