Pydatatable enumerate rows within each group - python

Given the following datatable:
from datatable import dt, f, by
DT = dt.Frame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
               'B': ['a', 'a', 'b', 'a', 'a', 'a']})
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
According to this thread for pandas, cumcount() or rank() would be options, but neither seems to be defined for pydatatable:
DT = DT[:, f[:].extend({'C': cumcount()}), by(f.A, f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}), by(f.A, f.B)]
a) How can I number the rows within groups?
b) Is there a comprehensive resource with all the currently available functions for pydatatable?

Update:
Datatable now has a cumcount function in dev:
DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]
| A B C
| str32 str32 int64
-- + ----- ----- -----
0 | A a 1
1 | A a 2
2 | A b 1
3 | B a 1
4 | B a 2
5 | B a 3
[6 rows x 3 columns]
Old answer:
This is a hack; in time there should be a built-in way to do a cumulative count. Until then, we can take advantage of itertools or other performant tools within Python while still being very fast:
Step 1: Get the row count for each group of columns A and B and export it to a list:
result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()
Step 2: Use a combination of itertools.chain and a list comprehension to get the cumulative counts:
from itertools import chain
cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])
Step 3: Assign the result back to DT:
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
6 rows × 3 columns
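Putting the three steps together, here is a minimal end-to-end sketch (same frame as in the question; like the hack above, it relies on the rows already being ordered by the grouping columns):
from datatable import dt, by
from itertools import chain

DT = dt.Frame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
               'B': ['a', 'a', 'b', 'a', 'a', 'a']})

# Step 1: group sizes, in group order -> [[2, 1, 3]]
result = DT[:, dt.count(), by("A", "B")][:, 'count'].to_list()

# Step 2: expand each group size n into 1..n -> 1, 2, 1, 1, 2, 3
cumcount = chain.from_iterable([i + 1 for i in range(n)] for n in result[0])

# Step 3: assign back as column C
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)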

Related

create row number by group, using python datatable

If I have a python datatable like this:
from datatable import f, dt
data = dt.Frame(grp=["a","a","b","b","b","b","c"], value=[2,3,1,2,5,9,2])
how do I create a new column that has the row number, by group? That is, what is the equivalent of R data.table's
data[, id:=1:.N, by=.(grp)]
This works, but seems completely ridiculous
import numpy as np
data['id'] = np.concatenate(
    [np.arange(x)
     for x in data[:, dt.count(), dt.by(f.grp)]['count'].to_numpy()])
desired output:
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
Update:
Datatable now has a cumcount function in dev:
data[:, [f.value, dt.cumcount()], 'grp']
| grp value C0
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
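To give the new column the name id from the desired output, the same expression can be written with a dict in j (a small sketch, assuming the same dev build with cumcount):
from datatable import dt, f, by

data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])

# the grouping column grp is carried along automatically by by()
data[:, {"value": f.value, "id": dt.cumcount()}, by("grp")]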
Old Answer:
datatable does not have a cumulative count function; in fact, there is no cumulative function for any aggregation at the moment.
One way to possibly improve the speed is to push the iteration down into numpy, where the loop is done in C and is more efficient. The code is from here, modified for this purpose:
from datatable import dt, f, by
import numpy as np

def create_ranges(indices):
    # indices holds the size of each group, in group order
    cum_length = indices.cumsum()
    ids = np.ones(cum_length[-1], dtype=int)
    ids[0] = 0
    # at each group boundary, step back so the running sum restarts at 0
    ids[cum_length[:-1]] = -1 * indices[:-1] + 1
    return ids.cumsum()
counts = data[:, dt.count(), by('grp', add_columns=False)].to_numpy().ravel()
data[:, f[:].extend({"counts" : create_ranges(counts)})]
| grp value counts
| str32 int32 int64
-- + ----- ----- ------
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
The create_ranges function is wonderful (the logic built on cumsum is nice) and really kicks in as the array size increases.
Of course this has its drawbacks: you are stepping out of datatable into numpy territory and back again, and it banks on the groups being sorted lexically; it won't work if the data is unsorted (it would first have to be sorted on the grouping column).
Preliminary tests show a marked improvement in speed; again it is limited in scope and it would be much easier/better if this was baked into the datatable library.
If you are good with C++, you could consider contributing this function to the library; I and so many others would appreciate your effort.
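To see what create_ranges produces on its own, here is a tiny check using the group sizes from the example (a sketch; it assumes the create_ranges definition above):
import numpy as np

# group sizes for grp = a, b, c in the example frame
print(create_ranges(np.array([2, 4, 1])))
# [0 1 0 1 2 3 0]  -> the 0-based position of each row within its group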
You could have a look at pypolars and see if it helps with your use case. From the h2o benchmarks it looks like a very fast tool.
One approach is to convert with to_pandas, then groupby on the pandas DataFrame and use cumcount:
import datatable as dt
data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])
data["id"] = data.to_pandas().groupby("grp").cumcount()
print(data)
Output
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
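As an aside, the same round-trip also answers the first question at the top of this page (1-based numbering per A/B group); a minimal sketch, assuming the DT frame from that question:
import datatable as dt

DT = dt.Frame(A=['A', 'A', 'A', 'B', 'B', 'B'],
              B=['a', 'a', 'b', 'a', 'a', 'a'])

# cumcount() is 0-based, so add 1 to match the desired output
DT['C'] = DT.to_pandas().groupby(['A', 'B']).cumcount() + 1
print(DT)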

Using head/tail function on pandas dataframe with different number of n for each group

I want to use the head/tail functions, but take a different number of rows for each group, according to an input dictionary.
The function should have two inputs. The first input is a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B"], "value": [0, 1, 2, 3, 4, 5, 6]})
print(df)
group value
0 A 0
1 A 1
2 A 2
3 B 3
4 B 4
5 B 5
6 B 6
The second input is a dict:
slice_per_group = {"A":1,"B":3}
Expected output:
df.groupby('group').head(slice_per_group) #Obviously this doesn't work
group value
0 A 0
3 B 3
4 B 4
5 B 5
Use head on each group separately:
df.groupby('group', group_keys=False).apply(lambda g: g.head(slice_per_group.get(g.name)))
group value
0 A 0
3 B 3
4 B 4
5 B 5
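An apply-free alternative is to build a boolean mask from the within-group position; a sketch under the same inputs (note that groups missing from the dict are dropped by this approach):
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B"],
                   "value": [0, 1, 2, 3, 4, 5, 6]})
slice_per_group = {"A": 1, "B": 3}

# head(n) keeps the first n rows per group, i.e. the rows whose
# within-group position (cumcount) is below that group's limit
mask = df.groupby("group").cumcount() < df["group"].map(slice_per_group)
print(df[mask])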

Pandas Counting Each Column with its Specific Thresholds

If I have the following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold value for each column. For every column, I want to count how many of its values are lower than its threshold, in python pandas.
Desired Output is;
A B C D E
Count 2 2 3 3 4
But I need a general solution, not one for these specific columns, because I have a large dataset and cannot spell out each column name in the code.
Could you please help me with this?
Select all rows except the last by indexing, compare them against the last row with DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and a transpose by DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
Numpy alternative with DataFrame constructor:
arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
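For a self-contained check, here is a small sketch that rebuilds the example frame (assuming the threshold row is labeled 'T', as in the question) and applies the first solution:
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 0, 1, -3, 2, 1],
     "B": [2, 0, 1, 4, 4, 2],
     "C": [0, 0, 3, 2, 1, 2],
     "D": [1, 1, -5, 6, 9, 4],
     "E": [0, -1, 2, 0, -1, 1]},
    index=[1, 2, 3, 4, 5, "T"])

# compare every row except the last against the threshold row, then count
counts = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame("count").T
print(counts)
#        A  B  C  D  E
# count  2  2  3  3  4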

Pandas - Create a symmetric matrix that counts the number of records

I have a dataframe that looks like the one below
ID | Value
1 | A
1 | B
1 | C
2 | B
2 | C
I want to create a symmetric matrix based on Value:
A B C
A 1 1 1
B 1 2 2
C 1 2 2
This basically indicates how many people have both values (v1,v2). I am currently using for loops to scan the dataframe for every combination but was wondering if there was an easier way to do it using pandas.
Use merge for a self (cross) join on the ID column, then crosstab, and DataFrame.rename_axis to remove the index and column names:
df = pd.merge(df, df, on='ID')
df = pd.crosstab(df['Value_x'], df['Value_y']).rename_axis(None).rename_axis(None, axis=1)
print (df)
A B C
A 1 1 1
B 1 2 2
C 1 2 2
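A self-contained version of the same idea, rebuilding the small example frame (a sketch, assuming the column names ID and Value from the question):
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2], "Value": ["A", "B", "C", "B", "C"]})

# the self-join on ID produces every (Value, Value) pair that co-occurs for an ID
pairs = pd.merge(df, df, on="ID")
matrix = pd.crosstab(pairs["Value_x"], pairs["Value_y"]).rename_axis(None).rename_axis(None, axis=1)
print(matrix)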

Determining when a column value changes in pandas dataframe

I am looking to write a quick script that will run through a csv file with two columns and provide me the rows in which the values in column B switch from one value to another:
eg:
dataframe:
# | A | B
--+-----+-----
1 | 2 | 3
2 | 3 | 3
3 | 4 | 4
4 | 5 | 4
5 | 5 | 4
would tell me that the change happened between row 2 and row 3. I know how to get these values using for loops but I was hoping there was a more pythonic way of approaching this problem.
You can create a new column for the difference:
df['C'] = df['B'].diff()
print(df)
   #  A  B  C
0  1  2  3  NaN
1  2  3  3  0
2  3  4  4  1
3  4  5  4  0
4  5  5  4  0
# fillna(0) so the NaN in the first row is not treated as a change
df_filtered = df[df['C'].fillna(0) != 0]
print(df_filtered)
   #  A  B  C
2  3  4  4  1
This gives you the required rows.
You can do the following, which also works for non-numerical values:
>>> import pandas as pd
>>> df = pd.DataFrame({"Status": ["A","A","B","B","C","C","C"]})
>>> df["isStatusChanged"] = df["Status"].shift(1, fill_value=df["Status"].head(1)) != df["Status"]
>>> df
Status isStatusChanged
0 A False
1 A False
2 B True
3 B False
4 C True
5 C False
6 C False
>>>
Note the fill_value could be different depending on your application.
You can use this, it is much faster. Hope it helps!
my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]
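To turn that mask into the positions where the change happens, a small sketch using the frame from the question (the artificial first-row difference is masked out):
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 4, 5, 5], "B": [3, 3, 4, 4, 4]})

# True wherever B differs from the row above
change_mask = df["B"].ne(df["B"].shift())
change_mask.iloc[0] = False  # the first row has nothing above it
print(df.index[change_mask].tolist())  # [2] -> B changes going into the third row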
