Multiindex pandas groupby + aggregate, keep full index

Multiindex pandas groupby + aggregate, keep full index - python

I have a two-level hierarchically-indexed sequence of integers.
>> s
id1 id2
1 a 100
b 10
c 9
2 a 2000
3 a 5
b 10
c 15
d 20
...
I want to group by id1, and select the maximum value, but have the full index in the result. I have tried the following:
>> s.groupby(level=0).aggregate(np.max)
id1
1 100
2 2000
3 20
But result is indexed by id1 only. I want my output to look like this:
id1 id2
1 a 100
2 a 2000
3 d 20
A related, but more complicated, question was asked here:
Multiindexed Pandas groupby, ignore a level?
As it states, the answer is kind of a hack.
Does anyone know a better solution? If not, what about the special case where every value of id2 is unique?

One way to select full rows after a groupby is to use groupby/transform to build a boolean mask and then use the mask to select the full rows from s:
In [110]: s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
Out[110]:
id1 id2
1 a 100
2 a 2000
3 d 20
Name: s, dtype: int64
Another way, which is faster in some cases -- such as when there are a lot of groups -- is to merge the max values m into a DataFrame along with the values in s, and then select rows based on equality between m and s:
def using_merge(s):
m = s.groupby(level=0).agg(np.max)
df = s.reset_index(['id2'])
df['m'] = m
result = df.loc[df['s']==df['m']]
del result['m']
result = result.set_index(['id2'], append=True)
return result['s']
Here is an example showing using_merge, while more complicated, may be faster than using_transform:
import numpy as np
import pandas as pd
def using_transform(s):
return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
N = 10**5
id1 = np.random.randint(100, size=N)
id2 = np.random.choice(list('abcd'), size=N)
index = pd.MultiIndex.from_arrays([id1, id2])
ss = pd.Series(np.random.randint(100, size=N), index=index)
ss.index.names = ['id1', 'id2']
ss.name = 's'
Timing these two functions using IPython's %timeit function yields:
In [121]: %timeit using_merge(ss)
100 loops, best of 3: 12.8 ms per loop
In [122]: %timeit using_transform(ss)
10 loops, best of 3: 45 ms per loop

Related

I try to count the same numbers of a column in a DF with pandas [duplicate]

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1

Use value_counts() as #DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]

If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.

df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()

df.apply(pd.value_counts).fillna(0)
value_counts - Returns object containing counts of unique values
apply - count frequency in every column. If you set axis=1, you get frequency in every row
fillna(0) - make output more fancy. Changed NaN to 0

In 0.18.1 groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts() sorted values (in descending order, i.e. largest value first) are returned by default.

Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326

As everyone said, the faster solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df

If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values,return_counts=True)
np.bincount() could be faster if your values are integers.

You can also do this with pandas by broadcasting your columns as categories first, e.g. dtype="category" e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992

Without any libraries, you could do this instead:
def to_frequency_table(data):
frequencytable = {}
for key in data:
if key in frequencytable:
frequencytable[key] += 1
else:
frequencytable[key] = 1
return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}

I believe this should work fine for any DataFrame columns list.
def column_list(x):
column_list_df = []
for col_name in x.columns:
y = col_name, len(x[col_name].unique())
column_list_df.append(y)
return pd.DataFrame(column_list_df)
column_list_df.rename(columns={0: "Feature", 1: "Value_count"})
The function "column_list" checks the columns names and then checks the uniqueness of each column values.

#metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!

The following code creates frequency table for the various values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value "300" appears in the column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]

n_values = data.income.value_counts()
First unique value count
n_at_most_50k = n_values[0]
Second unique value count
n_greater_50k = n_values[1]
n_values
Output:
<=50K 34014
>50K 11208
Name: income, dtype: int64
Output:
n_greater_50k,n_at_most_50k:-
(11208, 34014)

your data:
|category|
cat a
cat b
cat a
solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()

How can I group by and count similar elements in a pandas dataframe? [duplicate]

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1

Use value_counts() as #DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]

If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.

df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()

df.apply(pd.value_counts).fillna(0)
value_counts - Returns object containing counts of unique values
apply - count frequency in every column. If you set axis=1, you get frequency in every row
fillna(0) - make output more fancy. Changed NaN to 0

In 0.18.1 groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts() sorted values (in descending order, i.e. largest value first) are returned by default.

Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326

As everyone said, the faster solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df

If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values,return_counts=True)
np.bincount() could be faster if your values are integers.

You can also do this with pandas by broadcasting your columns as categories first, e.g. dtype="category" e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992

Without any libraries, you could do this instead:
def to_frequency_table(data):
frequencytable = {}
for key in data:
if key in frequencytable:
frequencytable[key] += 1
else:
frequencytable[key] = 1
return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}

I believe this should work fine for any DataFrame columns list.
def column_list(x):
column_list_df = []
for col_name in x.columns:
y = col_name, len(x[col_name].unique())
column_list_df.append(y)
return pd.DataFrame(column_list_df)
column_list_df.rename(columns={0: "Feature", 1: "Value_count"})
The function "column_list" checks the columns names and then checks the uniqueness of each column values.

#metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!

The following code creates frequency table for the various values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value "300" appears in the column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]

n_values = data.income.value_counts()
First unique value count
n_at_most_50k = n_values[0]
Second unique value count
n_greater_50k = n_values[1]
n_values
Output:
<=50K 34014
>50K 11208
Name: income, dtype: int64
Output:
n_greater_50k,n_at_most_50k:-
(11208, 34014)

your data:
|category|
cat a
cat b
cat a
solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()

Count the frequency that a value occurs in a dataframe column

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1

Use value_counts() as #DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]

If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.

df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()

df.apply(pd.value_counts).fillna(0)
value_counts - Returns object containing counts of unique values
apply - count frequency in every column. If you set axis=1, you get frequency in every row
fillna(0) - make output more fancy. Changed NaN to 0

In 0.18.1 groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts() sorted values (in descending order, i.e. largest value first) are returned by default.

Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326

As everyone said, the faster solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df

If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values,return_counts=True)
np.bincount() could be faster if your values are integers.

You can also do this with pandas by broadcasting your columns as categories first, e.g. dtype="category" e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992

Without any libraries, you could do this instead:
def to_frequency_table(data):
frequencytable = {}
for key in data:
if key in frequencytable:
frequencytable[key] += 1
else:
frequencytable[key] = 1
return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}

I believe this should work fine for any DataFrame columns list.
def column_list(x):
column_list_df = []
for col_name in x.columns:
y = col_name, len(x[col_name].unique())
column_list_df.append(y)
return pd.DataFrame(column_list_df)
column_list_df.rename(columns={0: "Feature", 1: "Value_count"})
The function "column_list" checks the columns names and then checks the uniqueness of each column values.

#metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!

The following code creates frequency table for the various values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value "300" appears in the column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]

n_values = data.income.value_counts()
First unique value count
n_at_most_50k = n_values[0]
Second unique value count
n_greater_50k = n_values[1]
n_values
Output:
<=50K 34014
>50K 11208
Name: income, dtype: int64
Output:
n_greater_50k,n_at_most_50k:-
(11208, 34014)

your data:
|category|
cat a
cat b
cat a
solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()

Apply function to sets of columns in pandas, 'looping' over entire data frame column-wise

Here is a test example to show what I am trying to achieve. Here's a toy data frame:
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)
Which gives
Time A_x A_y A_z B_x B_y B_z
1 -0.075509 -0.123527 -0.547239 -0.453707 -0.969796 0.248761 1.369613
2 -0.206369 -0.112098 -1.122609 0.218538 -0.878985 0.566872 -1.048862
3 -0.194552 0.818276 -1.563931 0.097377 1.641384 -0.766217 -1.482096
4 0.502731 0.766515 -0.650482 -0.087203 -0.089075 0.443969 0.354747
5 1.411380 -2.419204 -0.882383 0.005204 -0.204358 -0.999242 -0.395236
6 1.036695 1.115630 0.081825 -1.038442 0.515798 -0.060016 2.669702
7 0.392943 0.226386 0.039879 0.732611 -0.073447 1.164285 1.034357
8 -1.253264 0.389148 0.158289 0.440282 -1.195860 0.872064 0.906377
9 -0.133580 -0.308314 -0.839347 -0.517989 0.652120 0.477232 -0.391767
10 0.623841 0.473552 0.059428 0.726088 -0.593291 -3.186297 -0.846863
What I want to do is simply to calculate the length of the vector for each header (A and B) in this case, for each index, and divide by the Time column. Hence, this function needs to be np.sqrt(A_x^2 + A_y^2 + A_z^2) and the same for B of course. I.e. I am looking to calculate the velocity for each row, but three columns contribute to one velocity result.
I have tried using df.groupby and df.filter to loop-over the columns but I cannot really get it to work, because I am not at all sure how I apply effectively the same function to chunks of the data-frame, all in one go (as apparently one is to avoid looping over rows). I have tried doing
df = df.apply(lambda x: np.sqrt(x.dot(x)), axis=1)
This works of course, but only if the input data frame has the right number of columns (3), if longer then the dot-product is calculated over the entire row, and not in chunks of three columns which is what I want (because this is turns corresponds to the tag coordinates, which are three dimensional).
So this is what I am eventually trying to get with the above example (the below arrays are just filled with random numbers, not the actual velocities which I am trying to calculate - just to show what sort of shape I trying to achieve):
Velocity_A Velocity_B
1 -0.975633 -2.669544
2 0.766405 -0.264904
3 0.425481 -0.429894
4 -0.437316 0.954006
5 1.073352 -1.475964
6 -0.647534 0.937035
7 0.082517 0.438112
8 -0.387111 -1.417930
9 -0.111011 1.068530
10 0.451979 -0.053333
My actual data is 50,000 x 36 (so there are 12 tags with x,y,z coordinates), and I want to calculate the velocity all in one go to avoid iterating (if at all possible). There is also a time column of the same length (50,000x1).
How do you do this?
Thanks, Astrid

A possible start.
Filtering out column names corresponding to a particular vector. For example
In [20]: filter(lambda x: x.startswith("A_"),df.columns)
Out[20]: ['A_x', 'A_y', 'A_z']
Sub selecting these columns from the DataFrame
In [22]: df[filter(lambda x: x.startswith("A_"),df.columns)]
Out[22]:
A_x A_y A_z
1 -0.123527 -0.547239 -0.453707
2 -0.112098 -1.122609 0.218538
3 0.818276 -1.563931 0.097377
4 0.766515 -0.650482 -0.087203
5 -2.419204 -0.882383 0.005204
6 1.115630 0.081825 -1.038442
7 0.226386 0.039879 0.732611
8 0.389148 0.158289 0.440282
9 -0.308314 -0.839347 -0.517989
10 0.473552 0.059428 0.726088
So, using this technique you can get chunks of 3 columns. For example.
column_initials = ["A","B"]
for column_initial in column_initials:
df["Velocity_"+column_initial]=df[filter(lambda x: x.startswith(column_initial+"_"),df.columns)].apply(lambda x: np.sqrt(x.dot(x)), axis=1)/df.Time
In [32]: df[['Velocity_A','Velocity_B']]
Out[32]:
Velocity_A Velocity_B
1 -9.555311 -22.467965
2 -5.568487 -7.177625
3 -9.086257 -12.030091
4 2.007230 1.144208
5 1.824531 0.775006
6 1.472305 2.623467
7 1.954044 3.967796
8 -0.485576 -1.384815
9 -7.736036 -6.722931
10 1.392823 5.369757
I do not get the same answer as yours. But, I borrowed your df.apply(lambda x: np.sqrt(x.dot(x)), axis=1) and assume it is correct.
Hope this helps.

I would do at least a loop over the tag identifier, but don't worry, that's a very fast loop that just determines the filter pattern to get the right columns:
df = pd.DataFrame(np.random.randn(10,7), index=range(1,11), columns='Time A_x A_y A_z B_x B_y B_z'.split())
col_ids = ['A', 'B'] # I guess you can create that one easily
results = pd.DataFrame(index=df.index) # the result container
for id in col_ids:
results['Velocity_'+id] = np.sqrt((df.filter(regex=id+'_')**2).sum(axis=1))/df.Time

Your calculation is more NumPy-ish than Panda-ish, by which I mean the calculation can be expressed somewhat succinctly if you regard your DataFrame as merely a big array, whereas the solution (at least the one I came up with) is more complicated when you try to wrangle the DataFrame with melt, groupby, etc.
The entire calculation can be expressed in essentially one line:
np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
So here is the NumPy way:
import numpy as np
import pandas as pd
import io
content = '''
Time A_x A_y A_z B_x B_y B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796 0.248761 1.369613
-0.206369 -0.112098 -1.122609 0.218538 -0.878985 0.566872 -1.048862
-0.194552 0.818276 -1.563931 0.097377 1.641384 -0.766217 -1.482096
0.502731 0.766515 -0.650482 -0.087203 -0.089075 0.443969 0.354747
1.411380 -2.419204 -0.882383 0.005204 -0.204358 -0.999242 -0.395236
1.036695 1.115630 0.081825 -1.038442 0.515798 -0.060016 2.669702
0.392943 0.226386 0.039879 0.732611 -0.073447 1.164285 1.034357
-1.253264 0.389148 0.158289 0.440282 -1.195860 0.872064 0.906377
-0.133580 -0.308314 -0.839347 -0.517989 0.652120 0.477232 -0.391767
0.623841 0.473552 0.059428 0.726088 -0.593291 -3.186297 -0.846863'''
df = pd.read_table(io.BytesIO(content), sep='\s+', header=True)
arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)
which yields
Velocity_A Velocity_B
0 -9.555311 -22.467965
1 -5.568487 -7.177625
2 -9.086257 -12.030091
3 2.007230 1.144208
4 1.824531 0.775006
5 1.472305 2.623467
6 1.954044 3.967796
7 -0.485576 -1.384815
8 -7.736036 -6.722931
9 1.392823 5.369757
Since your actual DataFrame has shape (50000, 36), choosing a quick method may be important. Here is a benchmark:
import numpy as np
import pandas as pd
import string
N = 12
col_ids = string.letters[:N]
df = pd.DataFrame(
np.random.randn(50000, 3*N+1),
columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids
for coord in list('xyz')])
def using_numpy(df):
arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in col_ids])
return result
def using_loop(df):
results = pd.DataFrame(index=df.index) # the result container
for id in col_ids:
results['Velocity_'+id] = np.sqrt((df.filter(regex=id+'_')**2).sum(axis=1))/df.Time
return results
Using IPython:
In [43]: %timeit using_numpy(df)
10 loops, best of 3: 34.7 ms per loop
In [44]: %timeit using_loop(df)
10 loops, best of 3: 82 ms per loop

One liner...split over many lines for readability:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(
np.random.randn(10,7),
index=range(1,11),
columns='Time A_x A_y A_z B_x B_y B_z'.split()
)
result = df\
.loc[:, df.columns.values!='Time']\
.T\
.groupby(lambda x: x[0])\
.apply(lambda x: np.sqrt((x ** 2).sum()))\
.T\
.apply(lambda x: x / df['Time'])
print result
A B
1 1.404626 1.310639
2 -2.954644 -10.874091
3 3.479836 6.105961
4 3.885530 2.244544
5 0.995012 1.434228
6 11.278208 11.454466
7 -1.209242 -1.281165
8 -5.175911 -5.905070
9 11.889318 16.758958
10 -0.978014 -0.590767
Note: I am a bit frustrated that I needed to thrown in the two transposes. I just couldn't get groupby and apply to play nicely with axis=1. If someone could show me how to do that, I'd be very grateful. The trick here was knowing that when you call groupby(lambda x: f(x)) that x is the value of the index for each row. So groupby(lambda x: x[0]) groups by the first letter of the row index. After doing the transposition, this was A or B.
Ok, no more transposes:
result = df\
.loc[:, df.columns!='Time']\
.groupby(lambda x: x[0], axis=1)\
.apply(lambda x: np.sqrt((x**2).sum(1)))\
.apply(lambda x: x / df['Time'])
print result
A B
1 1.404626 1.310639
2 -2.954644 -10.874091
3 3.479836 6.105961
4 3.885530 2.244544
5 0.995012 1.434228
6 11.278208 11.454466
7 -1.209242 -1.281165
8 -5.175911 -5.905070
9 11.889318 16.758958
10 -0.978014 -0.590767

pandas python how to count the number of records or rows in a dataframe

Obviously new to Pandas. How can i simply count the number of records in a dataframe.
I would have thought some thing as simple as this would do it and i can't seem to even find the answer in searches...probably because it is too simple.
cnt = df.count
print cnt
the above code actually just prints the whole df

To get the number of rows in a dataframe use:
df.shape[0]
(and df.shape[1] to get the number of columns).
As an alternative you can use
len(df)
or
len(df.index)
(and len(df.columns) for the columns)
shape is more versatile and more convenient than len(), especially for interactive work (just needs to be added at the end), but len is a bit faster (see also this answer).
To avoid: count() because it returns the number of non-NA/null observations over requested axis
len(df.index) is faster
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(24).reshape(8, 3),columns=['A', 'B', 'C'])
df['A'][5]=np.nan
df
# Out:
# A B C
# 0 0 1 2
# 1 3 4 5
# 2 6 7 8
# 3 9 10 11
# 4 12 13 14
# 5 NaN 16 17
# 6 18 19 20
# 7 21 22 23
%timeit df.shape[0]
# 100000 loops, best of 3: 4.22 µs per loop
%timeit len(df)
# 100000 loops, best of 3: 2.26 µs per loop
%timeit len(df.index)
# 1000000 loops, best of 3: 1.46 µs per loop
df.__len__ is just a call to len(df.index)
import inspect
print(inspect.getsource(pd.DataFrame.__len__))
# Out:
# def __len__(self):
# """Returns length of info axis, but here we use the index """
# return len(self.index)
Why you should not use count()
df.count()
# Out:
# A 7
# B 8
# C 8

Regards to your question... counting one Field? I decided to make it a question, but I hope it helps...
Say I have the following DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"])
You could count a single column by
df.A.count()
#or
df['A'].count()
both evaluate to 5.
The cool thing (or one of many w.r.t. pandas) is that if you have NA values, count takes that into consideration.
So if I did
df['A'][1::2] = np.NAN
df.count()
The result would be
A 3
B 5

Simply, row_num = df.shape[0] # gives number of rows, here's the example:
import pandas as pd
import numpy as np
In [322]: df = pd.DataFrame(np.random.randn(5,2), columns=["col_1", "col_2"])
In [323]: df
Out[323]:
col_1 col_2
0 -0.894268 1.309041
1 -0.120667 -0.241292
2 0.076168 -1.071099
3 1.387217 0.622877
4 -0.488452 0.317882
In [324]: df.shape
Out[324]: (5, 2)
In [325]: df.shape[0] ## Gives no. of rows/records
Out[325]: 5
In [326]: df.shape[1] ## Gives no. of columns
Out[326]: 2

The Nan example above misses one piece, which makes it less generic. To do this more "generically" use df['column_name'].value_counts()
This will give you the counts of each value in that column.
d=['A','A','A','B','C','C'," " ," "," "," "," ","-1"] # for simplicity
df=pd.DataFrame(d)
df.columns=["col1"]
df["col1"].value_counts()
5
A 3
C 2
-1 1
B 1
dtype: int64
"""len(df) give you 12, so we know the rest must be Nan's of some form, while also having a peek into other invalid entries, especially when you might want to ignore them like -1, 0 , "", also"""

Simple method to get the records count:
df.count()[0]

I used pandas library for this. Here is the code
import pandas as pd
name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_colum_name = "Post test Number"
print(len(data[required_colum_name]))
# this also works -> data["Post test Number"].count()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multiindex pandas groupby + aggregate, keep full index - python

Related

I try to count the same numbers of a column in a DF with pandas [duplicate]

How can I group by and count similar elements in a pandas dataframe? [duplicate]

Count the frequency that a value occurs in a dataframe column

Apply function to sets of columns in pandas, 'looping' over entire data frame column-wise

pandas python how to count the number of records or rows in a dataframe

Categories

Resources