R sorting and grouping problems - python

I have this sample dataset.
I'm trying to filter to the earliest 'started_at' entry for each 'rideable_type'. Here is my code.
test %>% select(started_at, rideable_type) %>%
arrange(rideable_type) %>%
filter(min(started_at)) %>%
group_by(rideable_type)
I'm getting the following error.
Error in `filter()`:
! Problem while computing `..1 = test`.
✖ Input `..1$ride_id` must be a logical vector, not a character.
Backtrace:
1. ... %>% group_by(test$rideable_type)
4. dplyr:::filter.data.frame(., test, min(test$started_at))
Error in filter(., test, min(test$started_at)) :
✖ Input `..1$ride_id` must be a logical vector, not a character.
This is the traceback.
16. stop(fallback)
15. signal_abort(cnd, .file)
14. abort(bullets, call = error_call, parent = skip_internal_condition(e))
13. (function (e) { local_error_context(dots = dots, .index = env_filter$current_expression, mask = mask) ...
12. signalCondition(cnd)
11. signal_abort(cnd, .file)
10. abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
9. dplyr_internal_error("dplyr:::filter_incompatible_type", list(index = 1L, column_name = "ride_id", result = c("98D355D9A9852BE9", "04706CA7F5BD25EE", "42178E850B92597A", "6B93C46E8F5B114C", "466943353EAC8022", "AC1F67BDCDDD5988", "A5BD5A4FD53D5414", ...
8. mask$eval_all_filter(dots, env_filter)
7. withCallingHandlers({ mask$eval_all_filter(dots, env_filter) }, error = function(e) { local_error_context(dots = dots, .index = env_filter$current_expression, ...
6. filter_eval(dots, mask = mask, error_call = error_call)
5. filter_rows(.data, ..., caller_env = caller_env())
4. filter.data.frame(., test, min(test$started_at))
3. filter(., test, min(test$started_at))
2. group_by(., test$rideable_type)
1. select(test, started_at, rideable_type) %>% arrange(test$rideable_type) %>% filter(test, min(test$started_at)) %>% group_by(test$rideable_type)
I'm fairly new to R, but I noticed the error references the ride_id column as part of the problem, even though that column isn't included in the select clause.

Could you please try the code below? filter() expects a logical condition, and min(started_at) on its own isn't one; you also need to group_by() before filtering so the minimum is computed per rideable_type:
test %>% select(started_at, rideable_type) %>%
group_by(rideable_type) %>%
filter(as.POSIXct(started_at)==min(as.POSIXct(started_at)))
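For comparison (an aside, not part of the original answer), the same idea in pandas, assuming test is a DataFrame with the same columns:
import pandas as pd
# parse the timestamps, then keep, within each rideable_type, the rows with the earliest started_at
test['started_at'] = pd.to_datetime(test['started_at'])
earliest = test.loc[
    test.groupby('rideable_type')['started_at'].transform('min') == test['started_at'],
    ['started_at', 'rideable_type']
]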

Related

Running T Test - Statistical Significance

I want to test whether there is a statistically significant difference between the species in the occurrence of the virus, so I tried:
import pandas as pd
import numpy as np

data = {'WNV Present': ["negative","negative","positive","negative","positive","positive","negative","negative","negative"],
'Species': ["Myotis","Myotis","Hoary","Myotis","Myotis","Keens","Myotis","Keens","Keens"]}
my_data = pd.DataFrame(data)
# binarize the WNV Present column
my_data["WNV Present"] = np.where(my_data["WNV Present"] == "positive", 1, 0)
my_data
# binarize the Species column
dum_col3 = pd.get_dummies(my_data["Species"])
dum_col3
dummy_df5 = my_data.join(dum_col3)
dummy_df5.drop(["Species"], axis=1, inplace=True)
dummy_df5
# running t test
from scipy.stats import ttest_ind
set1 = dummy_df5[dummy_df5['WNV Present']==1]
set2 = dummy_df5[dummy_df5['Myotis']==1]
ttest_ind(set1, set2)
My results:
Ttest_indResult(statistic=array([ 3. , 1.36930639, 1.36930639, -2.73861279]), pvalue=array([0.0240082 , 0.21994382, 0.21994382, 0.03379779]))
Why am I receiving multiple p-value results? I tried running this again without binarizing the Species column, but that also doesn't tell me whether there is a significant difference between species.
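A note on the result (not part of the original question): ttest_ind applied to two whole DataFrames computes one test per column (along axis 0), so the four statistics and four p-values correspond to the four columns of dummy_df5. A minimal sketch of a single two-sample comparison, reusing dummy_df5 from above:
from scipy.stats import ttest_ind
# compare WNV presence between two species groups, one numeric column at a time
myotis = dummy_df5.loc[dummy_df5['Myotis'] == 1, 'WNV Present']
keens = dummy_df5.loc[dummy_df5['Keens'] == 1, 'WNV Present']
print(ttest_ind(myotis, keens))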

How to group multiple columns with NA values and discrepancies?

I'm looking for a way to group a data frame on multiple columns that contain missing values.
I want to group together every row that shares a common value in any of the inspected columns, ignoring missing values and discrepancies in the data.
The grouping should be independent of the order in which the missing values appear.
I succeeded in doing this by iteration, but I would like a more efficient way that vectorizes the process. I used R, but I would also like to do it in Python.
As an example, given the data frame
df = data.frame("ID1"=c(NA,NA,"A","A","A","B","C","B","C"), "ID2"=c("D","E",NA,"D","E","F","F",NA,NA))
I want to obtain a final grouping vector as
c(1,1,1,1,1,2,2,2,2)
Where 1 and 2 can be any numbers; they only need to be shared between rows that have a common value in any column.
I hope that's understandable?
The easiest way I found was a double iteration:
df$GrpF = 1:dim(df)[1]
for (i in 1:dim(df)[1]){
for (ID in c("ID1","ID2")){
if (!is.na(df[i,ID])){
df$GrpF[df[ID]==df[i,ID]] = min(df$GrpF[df[ID]==df[i,ID]],na.rm = T)
}
}
}
Where df$GrpF is my final grouping vector. It works well and I don't have any duplicates when I summarise the information.
library(dplyr)
library(plyr)
dfG = df %>% group_by_("GrpF")%>%summarise_all(
function(x){
x1 = unique(x)
paste0(x1[!is.na(x1) & x1 != ""],collapse = "/")
}
)
But when I use my real data (60,000 rows by 4 columns), it takes a lot of time (5 minutes).
I tried a single iteration over the columns, using the dplyr and plyr libraries:
grpData = function(df, colGrp, colData, colReplBy = NA){
a = df %>% group_by_at(colGrp) %>% summarise_at(colData, function(x) { sort(x,na.last=T)[1]}) %>% filter_at(colGrp,all_vars(!is.na(.)))
b = plyr::mapvalues(df[[colGrp]], from=a[[colGrp]], to=a[[colData]])
if (is.na(colReplBy)) {
b[which(is.na(b))] = NA
}else if (colReplBy %in% colnames(df)) {
b[which(is.na(b))] = df[[colReplBy]][which(is.na(b))] #Set old value for missing values
}else {
stop("Col to use as replacement for missing values not present in dataframe")
}
return(b)
}
df$GrpF = 1:dim(df)[1]
for (ID in c("ID1","ID2")){
#Set all same old group same ID
df$IDN = grpData(df,"GrpF",ID)
#Set all same new ID the same old group
df$GrpN = grpData(df,"IDN","GrpF")
#Set all same ID the same new group
df$GrpN = grpData(df,ID,"GrpN")
#Set all same old group the same new group
df$GrpF = grpData(df,"GrpF","GrpN", colReplBy = "GrpF")
}
This does work (it takes 30 seconds on the real data), but I would like a more efficient way of doing it.
Do you have any ideas?
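One observation (an aside, not from the original post): this is a connected-components problem; rows and per-column values can be viewed as nodes of a graph, and the desired groups are its connected components. A hedged Python sketch using networkx on the toy data above:
import pandas as pd
import networkx as nx

df = pd.DataFrame({"ID1": [None, None, "A", "A", "A", "B", "C", "B", "C"],
                   "ID2": ["D", "E", None, "D", "E", "F", "F", None, None]})

G = nx.Graph()
G.add_nodes_from(df.index)                  # one node per row
for col in ["ID1", "ID2"]:
    for row, val in df[col].dropna().items():
        G.add_edge(row, ("val", col, val))  # link each row to its value node (per column)

# number the components, then keep the labels of the row nodes only
comp = {n: i for i, c in enumerate(nx.connected_components(G), 1) for n in c}
print([comp[i] for i in df.index])          # e.g. [1, 1, 1, 1, 1, 2, 2, 2, 2]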

How to write the Python/Pandas equivalent of the following R code?

For a project, I am attempting to convert the following R code to Python, but I am struggling to write equivalent code for the summarize and mutate commands used in R. The code:
users <- users %>%
mutate(coup_start=ifelse(first_coup>DAY,"no","yes")) %>%
group_by(household_key,WEEK_NO,coup_start) %>%
summarize(weekly_spend=sum(SALES_VALUE),
dummy=1) #adding new column dummy
users_before <- filter(users,coup_start=="no")
users_after <- filter(users,coup_start=="yes")
users_before <- users_before %>%
group_by(household_key) %>%
mutate(cum_dummy=cumsum(dummy),
trip=cum_dummy-max(cum_dummy)) %>%
select(-dummy,-cum_dummy)
users_after <- users_after %>%
group_by(household_key) %>%
mutate(trip=cumsum(dummy)-1) %>%
select(-dummy)
I tried the following:
users = transaction_data.merge(coupon_users,on='household_key')
users['coup_start']= np.where((users['first_coup'] > users['DAY_x']), 1, 0)
users['dummy'] = 1
users_before = users[users['coup_start']==0]
users_after = users[users['coup_start']==1]
users_before['cum_dummy'] = users_before.groupby(['household_key'])['dummy'].cumsum()
users_before['trip'] = users_before.groupby(['household_key'])['cum_dummy'].transform(lambda x: x - x.max())
users_after['trip'] = users_after.groupby(['household_key'])['dummy'].transform(lambda x: cumsum(x) - 1)
But I'm encountering multiple issues: the transform(lambda x: cumsum(x) - 1) call is throwing an error, and the two groupby/transform attempts before it show the following warnings:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
I also feel that I did not insert the dummy = 1 correctly in the first place. How can I convert the R mutate/summarize functions to Python?
Edit
I have attempted to use the apply function to perform the cumsum operation.
def thisop(x): return(cumsum(x)-1 )
users_after['trip']=users_after.groupby(['household_key'])['dummy'].apply(thisop)
The error NameError: name 'cumsum' is not defined still persists.
You've changed some variable and value names going from R to Python (e.g. DAY to DAY_x).
The following code should work, taking the variables/values from your R code:
users = (
    users.assign(coup_start = np.where(users.first_coup > users.DAY, 'no', 'yes'))
    .groupby(['household_key','WEEK_NO','coup_start'])
    .agg(weekly_spend=('SALES_VALUE', 'sum'))
    .assign(dummy=1)
    .reset_index()  # bring the group keys back as columns, like R's summarize output
)
users_before = users.query('coup_start=="no"')
users_after = users.query('coup_start=="yes"')
users_before = (
    users_before.assign(
        trip = users_before.groupby('household_key').dummy
        .transform(lambda x: x.cumsum() - x.cumsum().max()))
    .drop(columns='dummy')
)
users_after = (
    users_after.assign(
        trip = users_after.groupby('household_key').dummy
        .transform(lambda x: x.cumsum() - 1))
    .drop(columns='dummy')
)
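A quick toy check of the cumsum trick (hypothetical data, just to show the trip numbering):
import pandas as pd
toy = pd.DataFrame({'household_key': [1, 1, 1, 2, 2], 'dummy': 1})
# trip counts up from 0 within each household: 0, 1, 2, 0, 1
toy['trip'] = toy.groupby('household_key')['dummy'].transform(lambda x: x.cumsum() - 1)
print(toy)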
How about using the same syntax in Python:
from datar.all import f, mutate, if_else, summarize, filter, group_by, select, sum, cumsum, max
users = users >> \
mutate(coup_start=if_else(f.first_coup>f.DAY,"no","yes")) >> \
group_by(f.household_key,f.WEEK_NO,f.coup_start) >> \
summarize(weekly_spend=sum(f.SALES_VALUE),
dummy=1) #adding new column dummy
users_before = filter(users,f.coup_start=="no")
users_after = filter(users,f.coup_start=="yes")
users_before = users_before >> \
group_by(f.household_key) >> \
mutate(cum_dummy=cumsum(f.dummy),
trip=f.cum_dummy-max(f.cum_dummy)) >> \
select(~f.dummy,~f.cum_dummy)
users_after = users_after >> \
group_by(f.household_key) >> \
mutate(trip=cumsum(f.dummy)-1) >> \
select(~f.dummy)
I am the author of the datar package. Feel free to submit issues if you have any questions.

Translating a Python Pandas line to R:

I am following a blog post here and I am getting a little stuck on one part regarding the translation from Python pandas to R…
In the part of the blog:
Tick Bars
The author has the line:
data_tick_grp = data.reset_index().assign(grpId=lambda row: row.index // num_ticks_per_bar)
I understand that data is the data frame.
reset_index - not sure what this is.
assign(grpId = …) - creating a new variable grpId.
lambda row: - not sure what this does.
row.index - is this the same as row_number?
// - is this the same as floor() in R?
num_ticks_per_bar is calculated as:
total_ticks = len(data)
num_ticks_per_bar = total_ticks / num_time_bars
num_ticks_per_bar = round(num_ticks_per_bar, -3) # round to the nearest thousand
Which I understand as:
ticks <- data %>%
filter(symbol == "XBTUSD") %>%
nrow()
ticks_per_bar <- ticks / 288
ticks_per_bar <- plyr::round_any(ticks_per_bar, 1000)
floor(1:nrow(data) / ticks_per_bar)
Can somebody help me translate the Python pandas line into R?
Usually, Pandas best translates to base R:
reset_index - same as resetting row.names for sequential numbering: data.frame(..., row.names = NULL)
assign(grpId = …) - same as assigning a column in place, such as with transform, within, or dplyr's mutate
lambda row - this is required inside assign to reference the data frame, here aliased as row
row.index - the same as the row number (remember Python is 0-indexed, unlike R)
// - integer division, R's %/%; equivalently, wrap ordinary division in as.integer or floor
Altogether, consider the adjustment below to translate the Pandas line:
data_tick_grp = (data.reset_index()
.assign(grpId=lambda row: row.index // num_ticks_per_bar)
)
To R:
data_tick_grp <- transform(data.frame(data, row.names = NULL),
grpId = floor(0:(nrow(data)-1) / num_ticks_per_bar))
Or in tidy format:
data_tick_grp <- data %>%
data.frame(row.names = NULL) %>%
mutate(grpId = floor(0:(nrow(data)-1) / num_ticks_per_bar))
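For reference, a tiny runnable demo (made-up data, not from the blog) of what the original pandas line computes:
import pandas as pd

data = pd.DataFrame({'price': range(10)})
num_ticks_per_bar = 3
# reset_index renumbers rows 0..n-1; integer division buckets every num_ticks_per_bar rows
data_tick_grp = data.reset_index().assign(grpId=lambda row: row.index // num_ticks_per_bar)
print(data_tick_grp['grpId'].tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]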

XGBoost pairwise setup - python

In XGBoost I have tried multiple ways to make pairwise ranking work with groups set, but without success. The following code doesn't work when using set_group, but is fine with set_group commented out for xgbTrain.
import xgboost
import numpy as np
import pandas as pd
from xgboost import DMatrix, train
xgb_params ={
'booster' : 'gbtree',
'eta': 0.1,
'gamma' : 1.0 ,
'min_child_weight' : 0.1,
'objective' : 'rank:pairwise',
'eval_metric' : 'merror',
#'num_class': 3, #
'max_depth' : 6,
'num_round' : 4,
'save_period' : 0
}
n_group=2
n_choice=3
#training dataset
dtrain=np.random.uniform(0,100,[n_group*n_choice,2])
dtarget=np.array([np.random.choice([0,1,2],3,False) for i in range(n_group)]).flatten()
dgroup=np.array([np.repeat(i,3)for i in range(n_group)]).flatten()
xgbTrain = DMatrix(dtrain, label = dtarget)
xgbTrain =xgbTrain.set_group(dgroup)
#watchlist
dtrain_eval=np.random.uniform(0,100,[n_group*n_choice,2])
xgbTrain_eval = DMatrix(dtrain_eval, label = dtarget)
#xgbTrain_eval =xgbTrain_eval .set_group(dgroup)
#test dataset
dtest=np.random.uniform(0,100,[n_group*n_choice,2])
dtestgroup=np.array([np.repeat(i,3)for i in range(n_group)]).flatten()
xgbTest = DMatrix(dtest)
#xgbTest =xgbTest.set_group(dgroup)
evallist = [(xgbTrain_eval, 'eval')]
rankModel = xgboost.train(params=xgb_params,dtrain=xgbTrain )
print(rankModel.predict( xgbTest))
The error returned seems to point to the lack of eval data, but even when specifying the evals as
rankModel = xgboost.train(params=xgb_params,dtrain=xgbTrain,evals=evallist )
the error remains.
Note that num_class is commented out, but intuitively it should have a value of either 3 (here, the number of classes) or 2 (the number of groups in the case of pairwise ranking)?
Any help in pointing out what is wrong?
(XGBoost 0.6)
An error:
Mea culpa, the set_group call is incorrect; set_group modifies the DMatrix in place and returns None, so it should be
xgbTrain.set_group(dgroup)
and not
xgbTrain = xgbTrain.set_group(dgroup)
The solution:
The data passed to set_group should be the count of items in each group, one entry per group.
dgroup=np.array([n_choice for i in range(n_group)]).flatten()
That did it!
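Putting both fixes together, a minimal sketch of a working setup (same toy shapes as above; parameter values are illustrative only):
import numpy as np
import xgboost
from xgboost import DMatrix

n_group, n_choice = 2, 3
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtarget = np.array([np.random.choice([0, 1, 2], 3, False) for i in range(n_group)]).flatten()

xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group([n_choice] * n_group)  # counts per group: [3, 3]; called for its side effect

params = {'objective': 'rank:pairwise', 'eta': 0.1, 'max_depth': 6}
rankModel = xgboost.train(params, xgbTrain, num_boost_round=4)
print(rankModel.predict(DMatrix(np.random.uniform(0, 100, [n_group * n_choice, 2]))))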
