I have this simple dataframe df:
df = pd.DataFrame({'c':[1,1,1,2,2,2,2],'type':['m','n','o','m','m','n','n']})
My goal is to count the values of type for each c, and then add a column with the size of each c group. Starting with:
In [27]: g = df.groupby('c')['type'].value_counts().reset_index(name='t')
In [28]: g
Out[28]:
c type t
0 1 m 1
1 1 n 1
2 1 o 1
3 2 m 2
4 2 n 2
the first problem is solved. Then I can also:
In [29]: a = df.groupby('c').size().reset_index(name='size')
In [30]: a
Out[30]:
c size
0 1 3
1 2 4
How can I add the size column directly to the first dataframe? So far I used map as:
In [31]: a.index = a['c']
In [32]: g['size'] = g['c'].map(a['size'])
In [33]: g
Out[33]:
c type t size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
which works, but is there a more straightforward way to do this?
Use transform to add a column back to the original df from a groupby aggregation; transform returns a Series with its index aligned to the original df:
In [123]:
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
g['size'] = df.groupby('c')['type'].transform('size')
g
Out[123]:
c type t size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
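Note that g has a fresh RangeIndex after reset_index, so the assignment above works because the first len(g) values of the transform result happen to line up with g's rows. A sketch of a more explicit alternative (my addition, not from the answer above) that sidesteps index alignment by merging the sizes on c:
sizes = df.groupby('c').size().reset_index(name='size')
g = df.groupby('c')['type'].value_counts().reset_index(name='t').merge(sizes, on='c')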
Another solution, using transform with len:
df['size'] = df.groupby('c')['type'].transform(len)
print(df)
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
Another solution with Series.map and Series.value_counts:
df['size'] = df['c'].map(df['c'].value_counts())
print(df)
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
You can also build the groupby object once and reuse it (the same index-alignment caveat as above applies when assigning the transform result to the re-indexed frame):
g = df.groupby('c')['type']
df = g.value_counts().reset_index(name='counts')
df['size'] = g.transform('size')
or
g.value_counts().reset_index(name='counts').assign(size=g.transform('size'))
Output:
c type counts size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
Related
I am looking for a way to pick values row-wise from a DataFrame.
Here is what I already have:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,10, (10, 10)))
df.columns = list('ABCDEFGHIJ')
N = 2
idx = np.argsort(df.values, 1)[:, 0:N]
df = pd.concat([pd.DataFrame(df.values.take(idx), index=df.index),
                pd.DataFrame(df.columns[idx], index=df.index)],
               keys=['Value', 'Columns']).sort_index(level=1)
Now I have the index/position of every value, but when I try to get the values from the DataFrame, it only takes values from the first row.
What do I have to change in the code?
df looks like:
A B C D E F G H I J
0 6 9 6 1 1 2 8 7 3 5
1 6 3 5 3 5 8 8 2 8 1
2 7 8 7 2 1 2 9 9 4 9
....
My output should look like:
0 D E
0 1 1
1 J H
1 1 2
You can use np.take_along_axis to take values from the DataFrame row-wise, and np.insert to interleave the values taken with the corresponding column names.
# idx is the same as the one used in the question.
vals = np.take_along_axis(df.values, idx, axis=1)  # row-wise values at idx
cols = df.columns.values[idx]                      # matching column names
indices = np.r_[:len(vals)]                        # same as np.arange(len(vals))
out = np.insert(vals.astype(str), indices, cols, axis=0)  # interleave names above values
index = np.repeat(indices, 2)
df = pd.DataFrame(out, index=index)
0 1
0 D E
0 1 1
1 J H
1 1 2
2 E D
2 1 2
3 E I
3 2 2
4 A D
4 1 1
5 I J
5 1 3
6 E I
6 1 2
7 B H
7 1 3
8 G I
8 1 1
9 E A
9 1 2
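As a quick sanity check (my addition, a sketch against the sample above): the first row of df has its two smallest values, both 1, in columns D and E, which matches the first pair of output rows.
print(cols[0], vals[0])  # expected: ['D' 'E'] [1 1]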
I have a pandas dataframe with roughly 150,000,000 rows in the following format:
df.head()
Out[1]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
I want to aggregate it by ID & TERM, and count the number of rows. Currently I do the following:
df.groupby(['ID', 'TERM']).count().reset_index()
Out[2]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
But this takes roughly two minutes, while the same operation using R's data.table takes roughly 23 seconds. Is there a more efficient way to do this in Python?
For comparison, R data.table:
system.time({ df[,.(.N), .(ID, TERM)] })
#user: 30.32 system: 2.45 elapsed: 22.88
A NumPy solution would be like so -
def groupby_count(df):
    # Factorize TERM and sort rows by ID, then TERM
    unq, t = np.unique(df.TERM, return_inverse=True)
    ids = df.ID.values
    sidx = np.lexsort([t, ids])
    ts = t[sidx]
    idss = ids[sidx]
    # Mark the start of each (ID, TERM) group
    m0 = (idss[1:] != idss[:-1]) | (ts[1:] != ts[:-1])
    m = np.concatenate(([True], m0, [True]))
    ids_out = idss[m[:-1]]
    t_out = unq[ts[m[:-1]]]
    # Group sizes are the gaps between consecutive group starts
    x_out = np.diff(np.flatnonzero(m) + 1)
    out_ar = np.column_stack((ids_out, t_out, x_out))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
A bit simpler version -
def groupby_count_v2(df):
    a = df.values
    # Sort rows so that equal (ID, TERM) pairs end up adjacent
    sidx = np.lexsort(a[:, :2].T)
    b = a[sidx, :2]
    # Mark group starts where either key changes
    m = np.concatenate(([True], (b[1:] != b[:-1]).any(1), [True]))
    out_ar = np.column_stack((b[m[:-1], :2], np.diff(np.flatnonzero(m) + 1)))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
Sample run -
In [332]: df
Out[332]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
In [333]: groupby_count(df)
Out[333]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
Let's randomly shuffle the rows and verify that our solution still works -
In [339]: df1 = df.iloc[np.random.permutation(len(df))]
In [340]: df1
Out[340]:
ID TERM X
7 2 F 1
6 2 B 1
0 1 A 0
3 1 B 0
5 2 A 1
2 1 A 6
1 1 A 4
4 1 B 10
In [341]: groupby_count(df1)
Out[341]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
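For reference (my addition, not part of the original answer): before dropping to NumPy it is worth trying plain pandas groupby with size, which counts rows per group in a single cythonized pass and is typically faster than count here because it does not have to check every column for NaNs -
out = df.groupby(['ID', 'TERM']).size().reset_index(name='X')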
I need to produce a matrix as input for conditional formatting in an automated chart creation process. My colleague has to display the numbers and color the max and min of each row. For his process, a second matrix whose entries indicate the row maxima and minima would be ideal.
So what do I need to deliver?
Let's say I have the following dataframe:
Cat Product Brand1 Brand2 Brand3
A a 6 9 5
A b 11 7 7
A c 9 5 5
B d 7 3 10
B e 5 8 8
B f 10 6 6
C g 8 4 4
C h 6 2 9
C i 4 7 7
From that, I want to generate the following dataframe, where 1 marks the row max and 2 marks the row min:
Cat Product Brand1 Brand2 Brand3
A a 0 1 2
A b 1 2 2
A c 1 2 2
B d 0 2 1
B e 2 1 1
B f 1 2 2
C g 1 2 2
C h 0 2 1
C i 2 1 1
The indicators 1 and 2 could be something else, even letters. The zeros could also be NaN.
How can this be achieved?
You can use numpy.where to replace values, with boolean masks created by eq:
df = df.set_index(['Cat','Product'])
m1 = df.eq(df.max(axis=1), axis=0)
m2 = df.eq(df.min(axis=1), axis=0)
df = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)), index=df.index, columns=df.columns)
df = df.reset_index()
print (df)
Cat Product Brand1 Brand2 Brand3
0 A a 0 1 2
1 A b 1 2 2
2 A c 1 2 2
3 B d 0 2 1
4 B e 2 1 1
5 B f 1 2 2
6 C g 1 2 2
7 C h 0 2 1
8 C i 2 1 1
Another solution:
df = df.set_index(['Cat','Product'])
m1 = df.values == df.values.max(axis=1)[:, None]
m2 = df.values == df.values.min(axis=1)[:, None]
df = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)), index=df.index, columns=df.columns)
df = df.reset_index()
print (df)
Cat Product Brand1 Brand2 Brand3
0 A a 0 1 2
1 A b 1 2 2
2 A c 1 2 2
3 B d 0 2 1
4 B e 2 1 1
5 B f 1 2 2
6 C g 1 2 2
7 C h 0 2 1
8 C i 2 1 1
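A flatter variant (my addition, a sketch rather than part of the answers above) uses np.select, which avoids nesting where calls and extends more easily to additional conditions:
df = df.set_index(['Cat','Product'])
m1 = df.eq(df.max(axis=1), axis=0)
m2 = df.eq(df.min(axis=1), axis=0)
df = pd.DataFrame(np.select([m1, m2], [1, 2], default=0),
                  index=df.index, columns=df.columns).reset_index()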