If I have a python datatable like this:
from datatable import f, dt
data = dt.Frame(grp=["a","a","b","b","b","b","c"], value=[2,3,1,2,5,9,2])
how do I create a new column that has the row number, by group? That is, what is the equivalent of R data.table's
data[, id:=1:.N, by=.(grp)]
This works, but seems completely ridiculous:
import numpy as np
data['id'] = np.concatenate(
    [np.arange(x)
     for x in data[:, dt.count(), dt.by(f.grp)]['count'].to_numpy()])
desired output:
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
Update:
Datatable now has a cumcount function in dev:
data[:, [f.value, dt.cumcount()], 'grp']
| grp value C0
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
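If you want the new column to be called id instead of the auto-generated C0, a dict in the j position names the expression explicitly. A minimal sketch against a dev build that has dt.cumcount:
from datatable import dt, f, by
# Naming the cumulative count "id" directly avoids the default "C0" column name.
data[:, {'value': f.value, 'id': dt.cumcount()}, by(f.grp)]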
Old Answer:
datatable does not have a cumulative count function; in fact, there is no cumulative function for any aggregation at the moment.
One way to possibly improve the speed is to push the iteration down into numpy, where the for loop runs in C and is more efficient. The code is from here, modified for this purpose:
from datatable import dt, f, by
import numpy as np
def create_ranges(indices):
    cum_length = indices.cumsum()
    ids = np.ones(cum_length[-1], dtype=int)
    ids[0] = 0
    ids[cum_length[:-1]] = -1 * indices[:-1] + 1
    return ids.cumsum()
counts = data[:, dt.count(), by('grp', add_columns=False)].to_numpy().ravel()
data[:, f[:].extend({"counts" : create_ranges(counts)})]
| grp value counts
| str32 int32 int64
-- + ----- ----- ------
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
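To see why create_ranges works, here is a small standalone trace on the group sizes from this example; the intermediate values are written out purely for illustration:
import numpy as np
counts = np.array([2, 4, 1])                    # group sizes for "a", "b", "c"
cum_length = counts.cumsum()                    # [2, 6, 7]
ids = np.ones(cum_length[-1], dtype=int)        # seven 1s
ids[0] = 0                                      # [0, 1, 1, 1, 1, 1, 1]
ids[cum_length[:-1]] = -1 * counts[:-1] + 1     # [0, 1, -1, 1, 1, 1, -3]
print(ids.cumsum())                             # [0, 1, 0, 1, 2, 3, 0] -> resets at each group start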
The create_ranges function is wonderful (the logic built on cumsum is neat) and really pays off as the array size increases.
Of course this has its drawbacks: you are stepping out of datatable into numpy territory and then back into datatable. It also banks on the groups being sorted lexically; it won't work if the data is unsorted (it would first have to be sorted on the grouping column).
Preliminary tests show a marked improvement in speed; still, it is limited in scope, and it would be much easier and better if this were baked into the datatable library.
If you are good with C++, you could consider contributing this function to the library; I and so many others would appreciate your effort.
You could have a look at pypolars and see if it helps with your use case. From the h2o benchmarks it looks like a very fast tool.
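For reference, the same group-wise row number in polars (the successor of pypolars) might look roughly like this; the polars API has changed a lot between versions, so treat it as a sketch rather than a definitive recipe:
import polars as pl
df = pl.DataFrame({"grp": ["a", "a", "b", "b", "b", "b", "c"],
                   "value": [2, 3, 1, 2, 5, 9, 2]})
# Row number 0..n-1 within each group, computed as a window over "grp".
# (pl.len() in newer polars; older releases used pl.count().)
out = df.with_columns(pl.int_range(0, pl.len()).over("grp").alias("id"))
print(out)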
One approach is to convert with to_pandas, groupby on the pandas DataFrame, and use cumcount:
import datatable as dt
data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])
data["id"] = data.to_pandas().groupby("grp").cumcount()
print(data)
Output
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
Related
What is the most efficient way to switch the locations of two columns in python datatable? I wrote the below function that does what I want, but this may not be the best way, especially if my actual table is big. Is it possible to do this in place? Am I missing something obvious?
from datatable import Frame
dat = Frame(a=[1,2,3],b=[4,5,6],c=[7,8,9])
def switch_cols(data, col1, col2):
    data_n = list(data.names)
    data_n[data.colindex(col1)], data_n[data.colindex(col2)] = data_n[data.colindex(col2)], data_n[data.colindex(col1)]
    return data[:, data_n]
dat = switch_cols(dat, "c","a")
| c b a
| int32 int32 int32
-- + ----- ----- -----
0 | 7 4 1
1 | 8 5 2
2 | 9 6 3
[3 rows x 3 columns]
For comparison in R, we can do this
dat = data.table(a=c(1,2,3), b=c(4,5,6), c=c(7,8,9))
switch_cols <- function(data, col1, col2) {
  indexes = which(names(data) %in% c(col1, col2))
  datn = names(data)
  datn[indexes] <- datn[c(indexes[2], indexes[1])]
  return(datn)
}
Then, we can change the order of two columns in-place like this
setcolorder(dat, switch_cols(dat,"a","c"))
Please note that assigning the values to each column is not what I'm after here. Consider this example in R, where I construct a large data.table like this:
dat = data.table(
x = rnorm(10000000),
y = sample(letters, 10000000, replace = T)
)
I make two copies of this data.table, d and e:
e = copy(dat)
d = copy(dat)
I then compare these two in-place operations:
setcolorder (simply re-indexing where in the data.table the two columns sit)
:= (re-assigning the two columns)
microbenchmark::microbenchmark(
list=alist("setcolorder" = setcolorder(d, c("y", "x")),
"`:=`" = e[,`:=`(x=y, y=x)]),
times=1)
Unit: microseconds
expr min lq mean median uq max neval
setcolorder 81.5 81.5 81.5 81.5 81.5 81.5 1
`:=` 53691.1 53691.1 53691.1 53691.1 53691.1 53691.1 1
As expected, setcolorder is the right way to switch column locations in R data.table. I'm looking for a similar approach in python.
I found a method after checking the documentation:
from datatable import Frame,f,update
dat = Frame(a=[1,2,3],b=[4,5,6],c=[7,8,9])
dat[:,update(a = f.c, c = f.a)]
In R, you can do it similarly:
dat[,`:=`(a = c, c = a)]
After some consideration and timings, I'm finding that the best approach is this:
from datatable import Frame
dat = Frame(a=[1,2,3],b=[4,5,6],c=[7,8,9])
| a b c
| int32 int32 int32
-- + ----- ----- -----
0 | 1 4 7
1 | 2 5 8
2 | 3 6 9
[3 rows x 3 columns]
def switch_cols(data, col1, col2):
    return data[:, [col1 if c == col2 else col2 if c == col1 else c for c in data.names]]
switch_cols(dat, "a","c")
| c b a
| int32 int32 int32
-- + ----- ----- -----
0 | 7 4 1
1 | 8 5 2
2 | 9 6 3
[3 rows x 3 columns]
Given the following datatable
DT = dt.Frame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
According to this thread for pandas, cumcount() or rank() would be options, but they do not seem to be defined for pydatatable:
DT = DT[:, f[:].extend({'C': cumcount()}),by(f.A,f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}),by(f.A,f.B)]
a) How can I number the rows within groups?
b) Is there a comprehensive resource with all the currently available functions for pydatatable?
Update:
Datatable now has a cumcount function in dev:
DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]
| A B C
| str32 str32 int64
-- + ----- ----- -----
0 | A a 1
1 | A a 2
2 | A b 1
3 | B a 1
4 | B a 2
5 | B a 3
[6 rows x 3 columns]
Old answer:
This is a hack; in time there should be a built-in way to do a cumulative count, or a way to take advantage of itertools or other performant tools within Python while still being very fast.
Step 1: Get the count for each group of columns A and B and export it to a list:
result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()
Step 2: Use a combination of itertools.chain and a list comprehension to get the cumulative counts:
from itertools import chain
cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])
Step 3: Assign the result back to DT:
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)
A B C
▪▪▪▪ ▪▪▪▪ ▪▪▪▪
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
6 rows × 3 columns
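For convenience, the three steps collected into one runnable snippet (same logic as above, nothing new):
from datatable import dt, by
from itertools import chain
DT = dt.Frame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
               'B': ['a', 'a', 'b', 'a', 'a', 'a']})
# Step 1: group sizes for every (A, B) combination.
result = DT[:, dt.count(), by("A", "B")][:, 'count'].to_list()
# Step 2: expand each group size n into 1..n.
cumcount = chain.from_iterable([i + 1 for i in range(n)] for n in result[0])
# Step 3: attach the generated numbering as column C.
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)
As with the numpy approach in the earlier answer, this assumes the rows are already ordered by the grouping columns.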
I have the following grid with bins defined by x and y, and each grid square given a unique id
mapping = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9], 'x': [1,1,1,2,2,2,3,3,3], 'y': [1,2,3,1,2,3,1,2,3]})
id x y
0 1 1 1
1 2 1 2
2 3 1 3
3 4 2 1
4 5 2 2
5 6 2 3
6 7 3 1
7 8 3 2
8 9 3 3
I also have a new dataframe of observations for which I would like to know what their associated id should be (e.g. which grid square they fall into)
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})
x y
0 1.4 1.9
1 2.7 1.1
My solution is the following function:
import bisect
def get_id(coords, mapping):
    x_val = mapping.x[bisect.bisect_right(mapping.x, coords[0]) - 1]
    y_val = mapping.y[bisect.bisect_right(mapping.y, coords[1]) - 1]
    id = mapping[(mapping.x == x_val) & (mapping.y == y_val)].iloc[0, 0]
    return id
coordinates.apply(get_id, mapping = mapping, axis = 1)
Out[21]:
0 1
1 4
dtype: int64
This works but becomes slow as the coordinates dataframe grows long. I'm sure there is a fast way to do this for a coordinates dataframe with 10^6 + observations. Is there a faster way to do this?
Edit:
To answer @abdurrehman245's question from the comments below:
My current method is to simply round down any data point; this allows me to map it to an ID using the mapping dataframe, which contains the minimum entries (the bins) for any given ID. So x=1.4, y=1.9 rounds down to x=1, y=1, which is mapped to id=1 according to the mapping.
Maybe this cartesian visualisation makes this a little bit more clear:
Y
4 -------------------------
| 3 | 6 | 9 |
| | | |
3 -------------------------
| 2 | 5 | 8 |
| | | |
2 -------------------------
| 1 | 4 | 7 |
| | | |
1 ------------------------- X
1 2 3 4
I would also add that I could not use the floor function as the bins are not necessarily nice integers as in this example.
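One direction to explore: the per-row apply can in principle be vectorised, since numpy's searchsorted reproduces the bisect_right lookup for all rows at once, and a merge against mapping recovers the ids. A rough sketch using the column names from the question (not benchmarked here):
import numpy as np
import pandas as pd
mapping = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                        'x': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                        'y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})
# Sorted bin edges; they do not need to be integers.
x_bins = np.sort(mapping['x'].unique())
y_bins = np.sort(mapping['y'].unique())
# Vectorised equivalent of bisect_right(...) - 1 for every coordinate at once.
binned = pd.DataFrame({
    'x': x_bins[np.searchsorted(x_bins, coordinates['x'], side='right') - 1],
    'y': y_bins[np.searchsorted(y_bins, coordinates['y'], side='right') - 1],
})
# Join the binned coordinates back to the grid to recover the ids.
ids = binned.merge(mapping, on=['x', 'y'], how='left')['id']
print(ids.tolist())   # [1, 4]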
I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my groupby is picking up a 0 and treating it as a value to perform the counts on. Any idea how to get Python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby looks like this
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing excel file)
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 Automated 1
Hybrid 1
Here we only select rows where Method is not equal to 0.
Compare against the result without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 0 1
Automated 1
Hybrid 1
You need the filter.
The filter method returns a subset of the original object. Suppose we want to take only elements that belong to groups with a group sum greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
Source.
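Adapted to the frame in this question (reusing the unfiltered a = df[['Team', 'Method']] from above, with the Team and Method column names assumed from the question), the filter idea would drop the groups whose Method key is 0 before counting; a sketch:
# Each (Team, Method) group holds a single Method value, so .all() simply
# rejects the groups whose key is 0.
kept = a.groupby(['Team', 'Method']).filter(lambda g: (g['Method'] != 0).all())
b = kept.groupby(['Team', 'Method']).agg({'Method': 'count'})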
I am looking to write a quick script that will run through a csv file with two columns and provide me the rows in which the values in column B switch from one value to another:
eg:
dataframe:
# | A | B
--+-----+-----
1 | 2 | 3
2 | 3 | 3
3 | 4 | 4
4 | 5 | 4
5 | 5 | 4
would tell me that the change happened between row 2 and row 3. I know how to get these values using for loops but I was hoping there was a more pythonic way of approaching this problem.
You can create a new column for the difference
> df['C'] = df['B'].diff()
> print df
# A B C
0 1 2 3 NaN
1 2 3 3 0
2 3 4 4 1
3 4 5 4 0
4 5 5 4 0
> df_filtered = df[df['C'].fillna(0) != 0]  # fillna so the leading NaN is not picked up
> print df_filtered
# A B C
2 3 4 4 1
This will give you the required rows.
You can do the following which also works for non numerical values:
>>> import pandas as pd
>>> df = pd.DataFrame({"Status": ["A","A","B","B","C","C","C"]})
>>> df["isStatusChanged"] = df["Status"].shift(1, fill_value=df["Status"].head(1)) != df["Status"]
>>> df
Status isStatusChanged
0 A False
1 A False
2 B True
3 B False
4 C True
5 C False
6 C False
>>>
Note the fill_value could be different depending on your application.
You can use this; it is much faster. Hope it helps!
my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]
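To turn that boolean mask into the actual rows (or positions) where the switch happens, a short sketch on the numeric column B from the question:
import pandas as pd
df = pd.DataFrame({"A": [2, 3, 4, 5, 5], "B": [3, 3, 4, 4, 4]})
changes = df["B"].shift() != df["B"]
changes.iloc[0] = False                 # the first row has nothing before it to change from
print(df[changes])                      # the row(s) where B differs from the previous row
print(df.index[changes].tolist())       # [2] -> the change happens between rows 1 and 2 (0-based)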