What is the most efficient way to switch the locations of two columns in python datatable? I wrote the below function that does what I want, but this may not be the best way, especially if my actual table is big. Is it possible to do this in place? Am I missing something obvious?
from datatable import Frame
dat = Frame(a=[1,2,3],b=[4,5,6],c=[7,8,9])
def switch_cols(data, col1, col2):
    data_n = list(data.names)
    i, j = data.colindex(col1), data.colindex(col2)
    data_n[i], data_n[j] = data_n[j], data_n[i]
    return data[:, data_n]
dat = switch_cols(dat, "c","a")
| c b a
| int32 int32 int32
-- + ----- ----- -----
0 | 7 4 1
1 | 8 5 2
2 | 9 6 3
[3 rows x 3 columns]
For comparison in R, we can do this
dat = data.table(a=c(1,2,3), b=c(4,5,6), c=c(7,8,9))
switch_cols <- function(data, col1, col2) {
  indexes = which(names(data) %in% c(col1, col2))
  datn = names(data)
  datn[indexes] <- datn[c(indexes[2], indexes[1])]
  return(datn)
}
Then, we can change the order of two columns in-place like this
setcolorder(dat, switch_cols(dat,"a","c"))
Please note that assigning the values to each column is not what I'm after here. Consider this example, in R. I construct a large data.table like this:
dat = data.table(
x = rnorm(10000000),
y = sample(letters, 10000000, replace = T)
)
I make two copies of this data.table d and e
e = copy(dat)
d = copy(dat)
I then compare these two in-place operations
setcolorder (simply reindexing where in the data.table two columns are)
:= re-assignment of the two columns
microbenchmark::microbenchmark(
list=alist("setcolorder" = setcolorder(d, c("y", "x")),
"`:=`" = e[,`:=`(x=y, y=x)]),
times=1)
Unit: microseconds
expr min lq mean median uq max neval
setcolorder 81.5 81.5 81.5 81.5 81.5 81.5 1
`:=` 53691.1 53691.1 53691.1 53691.1 53691.1 53691.1 1
As expected, setcolorder is the right way to switch column locations in R data.table. I'm looking for a similar approach in python.
I found a method after checking the documentation:
from datatable import Frame,f,update
dat = Frame(a=[1,2,3],b=[4,5,6],c=[7,8,9])
dat[:,update(a = f.c, c = f.a)]
In R, you can do it similarly:
dat[,`:=`(a = c, c = a)]
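As a sketch (the swap_values name is mine, not part of datatable), the update() call can be wrapped in a small helper that swaps the values of two arbitrary columns in place; note that the column order itself stays unchanged:
from datatable import Frame, f, update

def swap_values(data, col1, col2):
    # hypothetical helper: swap the *values* of two columns in place via
    # an update() expression; the frame's column order is not touched
    data[:, update(**{col1: f[col2], col2: f[col1]})]

dat = Frame(a=[1, 2, 3], b=[4, 5, 6], c=[7, 8, 9])
swap_values(dat, "a", "c")   # dat.names is still ("a", "b", "c")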
After some consideration and timings, I'm finding that the best approach is this:
from datatable import Frame
dat = Frame(a=[1,2,3],b=[4,5,6],c=[7,8,9])
| a b c
| int32 int32 int32
-- + ----- ----- -----
0 | 1 4 7
1 | 2 5 8
2 | 3 6 9
[3 rows x 3 columns]
def switch_cols(data, col1, col2):
    return data[:, [col1 if c == col2 else col2 if c == col1 else c for c in data.names]]
switch_cols(dat, "a","c")
| c b a
| int32 int32 int32
-- + ----- ----- -----
0 | 7 4 1
1 | 8 5 2
2 | 9 6 3
[3 rows x 3 columns]
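For the timings, a rough sketch along these lines (my own, not a rigorous benchmark; exact numbers will vary by machine and datatable version) compares the reorder-by-name selection with value reassignment via update() on a larger frame:
import timeit
import numpy as np
from datatable import Frame, f, update

n = 10_000_000
big = Frame(x=np.random.randn(n), y=np.random.randint(0, 26, size=n))

def reorder(d):
    # builds a new frame that reuses the existing columns
    return d[:, ["y", "x"]]

def reassign(d):
    # rewrites the values of both columns in place
    d[:, update(x=f.y, y=f.x)]

print("reorder :", timeit.timeit(lambda: reorder(big), number=1))
print("reassign:", timeit.timeit(lambda: reassign(big), number=1))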
Related
If I have a python datatable like this:
from datatable import f, dt
data = dt.Frame(grp=["a","a","b","b","b","b","c"], value=[2,3,1,2,5,9,2])
how do I create a new column that has the row number, by group? That is, what is the equivalent of R data.table's
data[, id:=1:.N, by=.(grp)]
This works, but seems completely ridiculous:
import numpy as np

data['id'] = np.concatenate(
    [np.arange(x)
     for x in data[:, dt.count(), dt.by(f.grp)]['count'].to_numpy()])
desired output:
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
Update:
Datatable now has a cumcount function in dev:
data[:, [f.value, dt.cumcount()], 'grp']
| grp value C0
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
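The cumulative count comes back as C0; to give it the desired id name directly, the dict-style selector used elsewhere in this thread should work the same way:
data[:, {'value': f.value, 'id': dt.cumcount()}, 'grp']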
Old Answer:
datatable does not have a cumulative count function; in fact, there is no cumulative function for any aggregation at the moment.
One way to possibly improve the speed is to push the iteration into numpy, where the loop runs in C and is more efficient. The code is from here and modified for this purpose:
from datatable import dt, f, by
import numpy as np
def create_ranges(indices):
    cum_length = indices.cumsum()
    ids = np.ones(cum_length[-1], dtype=int)
    ids[0] = 0
    ids[cum_length[:-1]] = -1 * indices[:-1] + 1
    return ids.cumsum()
counts = data[:, dt.count(), by('grp', add_columns=False)].to_numpy().ravel()
data[:, f[:].extend({"counts" : create_ranges(counts)})]
| grp value counts
| str32 int32 int64
-- + ----- ----- ------
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
The create_ranges function is wonderful (the logic built on cumsum is nice) and really kicks in as the array size increases.
Of course this has its drawbacks: you are stepping out of datatable into numpy territory and back again, and it banks on the data being sorted by the grouping column (lexically); it won't work if the data is unsorted, in which case it would first have to be sorted on the grouping column.
Preliminary tests show a marked improvement in speed; again it is limited in scope and it would be much easier/better if this was baked into the datatable library.
If you are good with C++, you could consider contributing this function to the library; I and so many others would appreciate your effort.
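To make the cumsum trick in create_ranges concrete, here is a small worked illustration (my own addition) using the per-group counts [2, 4, 1] from the frame above:
import numpy as np

indices = np.array([2, 4, 1])                 # group sizes for a, b, c
cum_length = indices.cumsum()                 # [2, 6, 7] -> total length 7
ids = np.ones(cum_length[-1], dtype=int)      # [1, 1, 1, 1, 1, 1, 1]
ids[0] = 0                                    # first row starts at 0
ids[cum_length[:-1]] = -1 * indices[:-1] + 1  # [-1, -3] placed at group starts
print(ids.cumsum())                           # [0 1 0 1 2 3 0]
Cumulatively summing an array of ones, with a negative offset placed at each group boundary, yields the within-group row numbers without any Python-level loop.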
You could have a look at pypolars and see if it helps with your use case. From the h2o benchmarks it looks like a very fast tool.
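Not from the original answer, but as a rough sketch of what the same group-wise row number looks like in polars (the successor to pypolars), assuming a recent release where pl.int_range and pl.len are available:
import polars as pl

df = pl.DataFrame({"grp": ["a", "a", "b", "b", "b", "b", "c"],
                   "value": [2, 3, 1, 2, 5, 9, 2]})
# row number within each group, starting at 0
out = df.with_columns(pl.int_range(pl.len()).over("grp").alias("id"))
print(out)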
One approach is to convert to_pandas, groupby (on the pandas DataFrame) and use cumcount:
import datatable as dt
data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])
data["id"] = data.to_pandas().groupby("grp").cumcount()
print(data)
Output
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
Given the following datatable
DT = dt.Frame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
According to this thread for pandas, cumcount() or rank() would be options, but they do not seem to be defined for pydatatable:
DT = DT[:, f[:].extend({'C': cumcount()}),by(f.A,f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}),by(f.A,f.B)]
a) How can I number the rows within groups?
b) Is there a comprehensive resource with all the currently available functions for pydatatable?
Update:
Datatable now has a cumcount function in dev:
DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]
| A B C
| str32 str32 int64
-- + ----- ----- -----
0 | A a 1
1 | A a 2
2 | A b 1
3 | B a 1
4 | B a 2
5 | B a 3
[6 rows x 3 columns]
Old answer:
This is a hack; in time there should be a built-in way to do cumulative counts, or a way to take advantage of itertools or other performant tools within Python while still being very fast:
Step 1: Get the count per group of columns A and B and export it to a list
result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()
Step 2: Use a combination of itertools.chain and a list comprehension to get the cumulative counts:
from itertools import chain
cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])
Step 3: Assign result back to DT
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)
A B C
▪▪▪▪ ▪▪▪▪ ▪▪▪▪
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
6 rows × 3 columns
Suppose I have a timeseries of values called X.
And I now want to know the first index after which the values of some other series Y will be reached by X. Or put differently, for each index i I want to know the first index j after which the line formed by X from j-1 to j intersects the value of Y at i.
Below is an example set of example X, Y series, showing the resulting values for Z. The length of these series is always the same:
X | Y | Z
2 | 3 | 2
2 | 3 | NaN
4 | 4.5 | 3
5 | 5 | NaN
4 | 5 | NaN
3 | 2 | 6
1 | 2 | NaN
Do pandas or numpy offer something that will assist with this? This function will be run on large datasets so I can't use python loops.
Use numpy broadcasting to compare against the shifted values, then get the index of the first True per row with DataFrame.idxmax, with a small improvement: a NaN column is added so that rows with no True value return NaN, and duplicated values are removed at the end:
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [2, 2, 4, 5, 4, 3, 1],
                   'Y': [3, 3, 4.5, 5, 5, 2, 2]})

a = df['X']
b = df['Y']
a1 = a.values
a2 = a.shift(-1).ffill().values
b1 = b.values[:, None]
arr = ((a1 < b1) & (a2 > b1)) | ((a1 > b1) & (a2 < b1))
df = pd.DataFrame(arr)
df[np.nan] = True
out = df.idxmax(axis=1) + 1
out = out.mask(out.duplicated())
print(out)
0 2.0
1 NaN
2 3.0
3 NaN
4 NaN
5 6.0
6 NaN
dtype: float64
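One caveat worth adding (my note, not part of the original answer): the broadcast builds a len(Y) x len(X) boolean matrix, which may not fit in memory for very long series. A sketch of the same idea processed in row chunks (first_cross and chunk are names of my own choosing):
import numpy as np
import pandas as pd

def first_cross(df, chunk=10_000):
    a1 = df['X'].values
    a2 = df['X'].shift(-1).ffill().values
    y = df['Y'].values
    out = np.full(len(df), np.nan)
    for start in range(0, len(df), chunk):
        # broadcast only a slice of Y against all of X at a time
        b1 = y[start:start + chunk, None]
        arr = ((a1 < b1) & (a2 > b1)) | ((a1 > b1) & (a2 < b1))
        hit = arr.any(axis=1)
        out[start:start + chunk] = np.where(hit, arr.argmax(axis=1) + 1, np.nan)
    res = pd.Series(out, index=df.index)
    return res.mask(res.duplicated())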
I have the following grid with bins defined by x and y, and each grid square given a unique id
mapping = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9], 'x': [1,1,1,2,2,2,3,3,3], 'y': [1,2,3,1,2,3,1,2,3]})
id x y
0 1 1 1
1 2 1 2
2 3 1 3
3 4 2 1
4 5 2 2
5 6 2 3
6 7 3 1
7 8 3 2
8 9 3 3
I also have a new dataframe of observations for which I would like to know what their associated id should be (i.e. which grid square they fall into):
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})
x y
0 1.4 1.9
1 2.7 1.1
My solution is the following function:
import bisect
def get_id(coords, mapping):
    x_val = mapping.x[bisect.bisect_right(mapping.x, coords[0]) - 1]
    y_val = mapping.y[bisect.bisect_right(mapping.y, coords[1]) - 1]
    id = mapping[(mapping.x == x_val) & (mapping.y == y_val)].iloc[0, 0]
    return id

coordinates.apply(get_id, mapping=mapping, axis=1)
Out[21]:
0 1
1 4
dtype: int64
This works but becomes slow as the coordinates dataframe grows long. I'm sure there is a fast way to do this for a coordinates dataframe with 10^6 + observations. Is there a faster way to do this?
Edit:
To answer @abdurrehman245's question from the comments.
My current method is to simply round down any data point; this lets me map it to an id using the mapping dataframe, which contains the minimum entries (the bin edges) for any given id. So x=1.4, y=1.9 rounds down to x=1, y=1, which is mapped to id=1 according to the mapping.
Maybe this cartesian visualisation makes this a little bit more clear:
Y
4 -------------------------
| 3 | 6 | 9 |
| | | |
3 -------------------------
| 2 | 5 | 8 |
| | | |
2 -------------------------
| 1 | 4 | 7 |
| | | |
1 ------------------------- X
1 2 3 4
I would also add that I could not use the floor function as the bins are not necessarily nice integers as in this example.
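A vectorized sketch of the same lookup (my own, not from the thread): numpy.searchsorted reproduces bisect_right for all coordinates at once, assuming the bin edges in mapping are sorted and every (x, y) combination has an id:
import numpy as np
import pandas as pd

mapping = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                        'x': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                        'y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})

x_bins = np.sort(mapping['x'].unique())   # lower bin edges along x
y_bins = np.sort(mapping['y'].unique())   # lower bin edges along y

# searchsorted(..., side='right') - 1 mirrors bisect_right(...) - 1
xi = np.searchsorted(x_bins, coordinates['x'].values, side='right') - 1
yi = np.searchsorted(y_bins, coordinates['y'].values, side='right') - 1

# rebuild the id lookup as a 2-D array indexed by (x bin, y bin)
id_grid = mapping.pivot(index='x', columns='y', values='id').values
coordinates['id'] = id_grid[xi, yi]
print(coordinates)   # ids 1 and 4, matching the apply-based version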
I am looking to write a quick script that will run through a csv file with two columns and provide me the rows in which the values in column B switch from one value to another:
eg:
dataframe:
# | A | B
--+-----+-----
1 | 2 | 3
2 | 3 | 3
3 | 4 | 4
4 | 5 | 4
5 | 5 | 4
would tell me that the change happened between row 2 and row 3. I know how to get these values using for loops but I was hoping there was a more pythonic way of approaching this problem.
You can create a new column for the difference:
>>> df['C'] = df['B'].diff()
>>> print(df)
# A B C
0 1 2 3 NaN
1 2 3 3 0
2 3 4 4 1
3 4 5 4 0
4 5 5 4 0
>>> df_filtered = df[df['C'].fillna(0) != 0]
>>> print(df_filtered)
# A B C
2 3 4 4 1
This will give you the required rows.
You can do the following which also works for non numerical values:
>>> import pandas as pd
>>> df = pd.DataFrame({"Status": ["A","A","B","B","C","C","C"]})
>>> df["isStatusChanged"] = df["Status"].shift(1, fill_value=df["Status"].head(1)) != df["Status"]
>>> df
Status isStatusChanged
0 A False
1 A False
2 B True
3 B False
4 C True
5 C False
6 C False
>>>
Note the fill_value could be different depending on your application.
You can use this; it is much faster. Hope it helps!
my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]
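A usage sketch (my addition): the boolean mask filters the rows where the value switches; note that without a fill_value the shifted series starts with NaN, so the very first row is flagged as a change as well (the fill_value trick from the previous answer avoids that):
import pandas as pd

df = pd.DataFrame({"MyStringColumn": ["A", "A", "B", "B", "C", "C", "C"]})
my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]
# rows 0, 2 and 4; row 0 is flagged because shift() puts NaN in front
print(df[my_column_changes])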