I am following a blog post here and I am getting a little stuck on one part regarding the translation from Python pandas to R…
In the part of the blog:
Tick Bars
The author has the line:
data_tick_grp = data.reset_index().assign(grpId=lambda row: row.index // num_ticks_per_bar)
data - I understand that this is the data frame.
reset_index - not sure what this does.
assign(grpId = …) - creating a new variable grpId.
lambda row: - not sure what this does.
row.index - is this the same as row_number()?
// - is this the same as floor() in R?
num_ticks_per_bar is calculated as:
total_ticks = len(data)
num_ticks_per_bar = total_ticks / num_time_bars
num_ticks_per_bar = round(num_ticks_per_bar, -3) # round to the nearest thousand
Which I understand as:
ticks <- data %>%
filter(symbol == "XBTUSD") %>%
nrow()
ticks_per_bar <- ticks / 288
ticks_per_bar <- plyr::round_any(ticks_per_bar, 1000)
floor(1:nrow(data) / ticks_per_bar)
Can somebody help me translate the Python pandas line into R language?
Usually, pandas best translates to base R:
reset_index is the same as resetting row.names for sequential numbering: data.frame(..., row.names = NULL)
assign(grpId = …) is the same as assigning a column in place, such as with transform, within, or dplyr's mutate
lambda row: this is required inside assign to reference the data frame, here aliased as row
row.index is the same as the row number (remember Python is 0-indexed, unlike R)
// is integer division, which in R can be replicated by wrapping the division with as.integer or floor
Altogether, consider the below adjustment to translate the pandas line:
data_tick_grp = (data.reset_index()
.assign(grpId=lambda row: row.index // num_ticks_per_bar)
)
To R:
data_tick_grp <- transform(data.frame(data, row.names = NULL),
grpId = floor(0:(nrow(data)-1) / num_ticks_per_bar))
Or in tidy format:
data_tick_grp <- data %>%
data.frame(row.names = NULL) %>%
mutate(grpId = floor(0:(nrow(data)-1) / num_ticks_per_bar))
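As a sanity check, this is what the original pandas line produces on a tiny toy frame (the price column and num_ticks_per_bar = 3 below are made-up values, purely for illustration):
import pandas as pd

data = pd.DataFrame({"price": [10, 11, 12, 13, 14, 15, 16]})
num_ticks_per_bar = 3  # toy value for illustration only

data_tick_grp = data.reset_index().assign(grpId=lambda row: row.index // num_ticks_per_bar)
print(data_tick_grp["grpId"].tolist())  # [0, 0, 0, 1, 1, 1, 2]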
I'm looking for a way to group a data frame on multiple columns that contain missing values.
I want to group together every row that shares a common value in any of the inspected columns, ignoring missing values and discrepancies in the data.
The script should be independent of the order in which missing values appear.
I succeeded in doing so by iteration, but I would like a more efficient, vectorized way. I use R, but I would also like to do it in Python.
As an example, if a data frame is
df = data.frame("ID1"=c(NA,NA,"A","A","A","B","C","B","C"), "ID2"=c("D","E",NA,"D","E","F","F",NA,NA))
I want to obtain a final grouping vector as
c(1,1,1,1,1,2,2,2,2)
Where 1 and 2 can be any numbers; they only need to be shared between rows that have a common value in any column.
I hope this is understandable.
The easiest way I found was a double iteration:
df$GrpF = 1:dim(df)[1]
for (i in 1:dim(df)[1]){
  for (ID in c("ID1","ID2")){
    if (!is.na(df[i,ID])){
      df$GrpF[df[ID]==df[i,ID]] = min(df$GrpF[df[ID]==df[i,ID]], na.rm = T)
    }
  }
}
Where df$GrpF is my final grouping vector. It works well and I don't have any duplicates when I summarise the information.
library(dplyr)
library(plyr)
dfG = df %>% group_by_("GrpF") %>% summarise_all(
  function(x){
    x1 = unique(x)
    paste0(x1[!is.na(x1) & x1 != ""], collapse = "/")
  }
)
But when I use my real data (60,000 rows and 4 columns), it takes a lot of time (5 minutes).
I tried a single iteration over the columns using the dplyr and plyr libraries:
grpData = function(df, colGrp, colData, colReplBy = NA){
  a = df %>%
    group_by_at(colGrp) %>%
    summarise_at(colData, function(x) { sort(x, na.last = T)[1] }) %>%
    filter_at(colGrp, all_vars(!is.na(.)))
  b = plyr::mapvalues(df[[colGrp]], from = a[[colGrp]], to = a[[colData]])
  if (is.na(colReplBy)) {
    b[which(is.na(b))] = NA
  } else if (colReplBy %in% colnames(df)) {
    b[which(is.na(b))] = df[[colReplBy]][which(is.na(b))] #Set old value for missing values
  } else {
    stop("Col to use as replacement for missing values not present in dataframe")
  }
  return(b)
}
df$GrpF = 1:dim(df)[1]
for (ID in c("ID1","ID2")){
  #Set all same old group same ID
  df$IDN = grpData(df, "GrpF", ID)
  #Set all same new ID the same old group
  df$GrpN = grpData(df, "IDN", "GrpF")
  #Set all same ID the same new group
  df$GrpN = grpData(df, ID, "GrpN")
  #Set all same old group the same new group
  df$GrpF = grpData(df, "GrpF", "GrpN", colReplBy = "GrpF")
}
This does work (it takes 30 seconds on the real data), but I would like a more efficient way of doing it.
Do you have any ideas?
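Since Python solutions are welcome, here is a minimal sketch of one possible vectorized idea (not from the code above and not benchmarked against the 60,000-row data): treat rows and column values as nodes of a graph and label its connected components. It assumes the networkx package is available and reuses the example data frame from the question.
import pandas as pd
import networkx as nx

df = pd.DataFrame({
    "ID1": [None, None, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", None, "D", "E", "F", "F", None, None],
})

G = nx.Graph()
G.add_nodes_from(df.index)                       # one node per row
for col in ["ID1", "ID2"]:
    sub = df[col].dropna()
    # link each row to a value node, prefixed by the column so columns stay separate
    G.add_edges_from(zip(sub.index, col + ":" + sub))

# every connected component becomes one group
comp = {node: k
        for k, nodes in enumerate(nx.connected_components(G), start=1)
        for node in nodes}
grp = [comp[i] for i in df.index]
print(grp)  # e.g. [1, 1, 1, 1, 1, 2, 2, 2, 2]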
I am trying to convert the following R script to Python. I am able to convert the individual lines of code, but I could not find a way to translate the for loop, as the following lines use the iterator variable i.
date is a datetime type and week is a string type. There are 3 data frames. The length of new_levels is greater than the number of unique levels in df2$week.
I have been stuck on this for loop for some time.
df <- data.frame()
for(i in 1:NROW(df2)){
  temp <- subset(df1, df1$ID == df2$ID[i] & df1$week <= df2$week[i])
  temp_1 <- temp[order(factor(temp$week, levels = new_levels)), ]
  temp <- tail(temp_1, 6)
  temp$bookid <- df2$bookid[i]
  temp$plan <- df2$plan[i]
  temp$date <- df2$date[i]
  if(nrow(temp) > 1){
    for(j in 1:NROW(temp)){
      temp$date[j] <- (temp$date[j] - (NROW(temp)-j)*24*3600*7)
    }
    temp1 <- temp[-nrow(temp), ]
    df <- rbind(df, temp1)
  }
}
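For reference, a rough pandas sketch of the same loop might look like the one below. It assumes df1, df2 and new_levels exist as described above and that date is a datetime column; it has not been tested against the real data.
import pandas as pd

week_order = {w: k for k, w in enumerate(new_levels)}  # rank of each week label

pieces = []
for _, row in df2.iterrows():
    temp = df1[(df1["ID"] == row["ID"]) & (df1["week"] <= row["week"])].copy()
    temp = temp.sort_values("week", key=lambda s: s.map(week_order)).tail(6)
    temp["bookid"] = row["bookid"]
    temp["plan"] = row["plan"]
    temp["date"] = row["date"]
    if len(temp) > 1:
        n = len(temp)
        # shift each date back by whole weeks, leaving the newest row unchanged
        offsets = pd.to_timedelta([(n - j) * 7 for j in range(1, n + 1)], unit="D")
        temp["date"] = temp["date"] - offsets.values
        pieces.append(temp.iloc[:-1])  # drop the last row, like temp[-nrow(temp), ]

df = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()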
Given the iris data, I'd like to add new columns corresponding to all numeric columns found. I can do this by explicitly listing each numeric column:
import datatable as dt
from datatable import fread, f, mean, update

iris_dt = fread("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
iris_dt[:, update(C0_dist_from_mean = dt.abs(f.C0 - mean(f.C0)),
                  C1_dist_from_mean = dt.abs(f.C1 - mean(f.C1)),
                  C2_dist_from_mean = dt.abs(f.C2 - mean(f.C2)),
                  C3_dist_from_mean = dt.abs(f.C3 - mean(f.C3)))]
But that way I have hard-coded the column names. A more robust way is readily available with R's data.table using .SDcols:
library(data.table)
iris = fread("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
cols = names(sapply(iris, class)[sapply(iris, class)=='numeric'])
iris[, paste0(cols,"_dist_from_mean") := lapply(.SD, function(x) {abs(x-mean(x))}),
.SDcols=cols]
Is there a way to take a similar approach with py-datatable today?
I do realize how to get all numeric columns in py-datatable, e.g. like this:
iris_dt[:, f[float]]
but it's the last part that uses .SDcols in R that evades me.
Create a dict comprehension of the new column names and the f expressions, then unpack it in the update method:
from datatable import f, update, abs, mean
aggs = {f"{col}_dist_from_mean" : abs(f[col] - mean(f[col]))
for col in iris_dt[:, f[float]].names}
iris_dt[:, update(**aggs)]
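A quick, illustrative check of the generated keys (assuming the CSV is read without a header, so the float columns are auto-named C0 to C3):
print(list(aggs.keys()))
# ['C0_dist_from_mean', 'C1_dist_from_mean', 'C2_dist_from_mean', 'C3_dist_from_mean']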
UPDATE:
Using the Type properties in v1.1, this is an alternative approach:
aggs = {f"{col}_dist_from_mean" : dt.math.abs(f[col] - f[col].mean())
for col, col_type
in zip(iris_dt.names, iris_dt.types)
if col_type.is_float}
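As before, the dict is then unpacked into update:
iris_dt[:, update(**aggs)]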
You could also chunk the steps:
Create a Frame with the calculated values:
expression = f[float]-f[float].mean()
expression = dt.math.abs(expression)
compute = iris_dt[:, expression]
Rename the column labels for compute:
compute.names = [f"{name}_dist_from_mean" for name in compute.names]
Update iris_dt with compute (note that you could also use a cbind):
iris_dt[:, update(**compute)]
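For completeness, the cbind route mentioned above might look roughly like this (a sketch; it assumes the renamed compute frame from the previous step and no clashing column names):
import datatable as dt

iris_dt = dt.cbind(iris_dt, compute)  # appends the *_dist_from_mean columns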
I am reading an Excel sheet using pandas.read_excel and get the output as a dataframe, but after reading it through pandas I need to apply the following calculation to each of the x and y columns.
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
longitude, latitude = 0, 0
longitude = mapLongitudeStart + x1 * ratiox   # I have taken the single column x1 value
latitude = mapLatitudeStart - (-y1 * ratioy)  # taken from column y1
How do I apply this calculation to every x and y column and row that has values (it should skip the null values)? I also want a new dataframe created from the calculated columns.
Try the below code:
import pandas as pd
import itertools

# ratiox, ratioy, mapLongitudeStart and mapLatitudeStart as defined in the question
df = pd.read_excel('file_path')
dfx = df.loc[:, 'x1'::2]   # every second column starting from x1
dfy = df.loc[:, 'y1'::2]   # every second column starting from y1
li = [dfx.apply(lambda x: mapLongitudeStart + x * ratiox),
      dfy.apply(lambda y: mapLatitudeStart - (-y * ratioy))]
df_new = pd.concat(li, axis=1)
df_new = df_new[list(itertools.chain(*zip(dfx.columns, dfy.columns)))]
print(df_new)
Hope this helps!
I would first recommend reshaping your data into a long format; that way you can get rid of the empty cells naturally. Most pandas functions also work better that way, because you can then use things like group-by operations on all x or y or whatever dimension.
from itertools import chain
import numpy as np
import pandas as pd

## this part is only to have a running example
## here you would load your excel file
D = pd.DataFrame(
    np.random.randn(10, 6),
    columns=list(chain(*[[f"x{i}", f"y{i}"] for i in range(1, 4)]))
)
D["rowid"] = np.arange(len(D))
D = D.melt(id_vars="rowid").dropna()
D["varIndex"] = D.variable.str[1]
D["variable"] = D.variable.str[0]
D = D.set_index(["varIndex", "rowid", "variable"])\
     .unstack("variable")\
     .droplevel(0, axis=1)
So these transformations give you a table with an index for both the original row id (maybe it is a time series or something else) and the variable index, so x1, x2, etc.
Now you can do your calculations by overwriting the previous columns:
## Everything here is a constant
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
# apply the calculations directly to the columns
D.x = mapLongitudeStart + D.x * ratiox
D.y = mapLatitudeStart - (-D.y * ratioy)
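If the original wide layout is needed afterwards, the long table can be pivoted back, for example (a sketch continuing from D above; the flattened column names are an assumption):
D_wide = D.unstack("varIndex")
D_wide.columns = [f"{var}{idx}" for var, idx in D_wide.columns]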
I have two types of tables containing time series. One type contains data referring to population and is stored in files with a particular pattern at the end of the name. The other type contains data regarding resources. Furthermore, I have files for different farms (hundreds). Thus, the content of the folder is:
Farm01_population
Farm01_resources
Farm02_population
Farm02_resources
Farm03_population
Farm03_resources
Farm04_population
Farm04_resources
........
And so on.
I also have to do computations within each file. So far, I've started the task by performing the calculations separately for population and resources.
population_files <- list.files("path",pattern="population.txt$")
resources_files <- list.files("path",pattern="resources.txt$")
for(i in 1:length(population_files)){......}
for(j in 1:length(resources_files)){......}
How could I now merge every pair of tables referring to the same farm, thus obtaining:
Farm01_finaltable
Farm02_finaltable
Farm03_finaltable
Farm04_finaltable
......
And so on.
As the number of farms is very large, I cannot write a specific string as the pattern at the beginning of each file name. What I need to state is that tables sharing the same pattern at the beginning must be merged, whatever this pattern (farm) is.
I am using R but solutions with Python are also welcomed.
Just keep everything in one file:
library(dplyr)
library(tidyr)
library(rex)

file_regex =
  rex(capture(digits),
      "_",
      capture(anything))

catalog =
  data_frame(file = list.files("path")) %>%
  extract(file,
          c("ID", "type"),
          file_regex,
          remove = FALSE)

population =
  catalog %>%
  filter(type == "population") %>%
  group_by(ID) %>%
  do(.$file %>% first %>% read.csv)

resources =
  catalog %>%
  filter(type == "resources") %>%
  group_by(ID) %>%
  do(.$file %>% first %>% read.csv)

together = full_join(population, resources)
Assuming the files are in csv format, consider the following base R and Python 3 (pandas) solutions. Both use regex patterns to find the corresponding population and resources files and then merge them into a final table using a linked Farm ID. Do note that if you need to iterate past 99 files, be sure to adjust the regex digit count {2} to {3} (for Python, do not change the string format placeholder {0}).
R
path = "C:/Path/To/Files"
numberoffiles = 2
for (i in (1:numberoffiles)) {
if (i < 10) { i = paste0('0', i) } else { i = as.character(i) }
filespop <- list.files(path, pattern=sprintf("^[a-zA-Z]*[%s]{2}_population.csv$", i))
dfpop <- read.csv(paste0(path, "/", filespop[[1]]))
filesres <- list.files(path, pattern=sprintf("^[a-zA-Z]*[%s]{2}_resources.csv$", i))
dfres <- read.csv(paste0(path, "/", filesres[[1]]))
farm <- gsub(sprintf("[%s]{2}_population.csv", i), "", filespop[[1]])
mergedf <- merge(dfpop, dfres, by=c('FarmID'), all=TRUE)
write.csv(mergedf, paste0(path, "/", farm,
sprintf("%s_FinalTable_r.csv", i)), row.names=FALSE)
}
Python
import os
import re
import pandas as pd

# CURRENT DIRECTORY OF SCRIPT
cd = os.path.dirname(os.path.abspath(__file__))
numberoffiles = 2

for item in os.listdir(cd):
    for i in range(1, numberoffiles+1):
        i = '0'+str(i) if i < 10 else str(i)

        filepop = re.match("^[a-zA-Z]*[{0}]{{2}}_population.csv$".format(i), item, flags=0)
        fileres = re.match("^[a-zA-Z]*[{0}]{{2}}_resources.csv$".format(i), item, flags=0)

        if filepop:
            dfpop = pd.read_csv(os.path.join(cd, item))

        if fileres:
            dfres = pd.read_csv(os.path.join(cd, item))
            farm = item.replace("{0}_resources.csv".format(i), "")

            mergedf = pd.merge(dfpop, dfres, on=['FarmID'])
            mergedf.to_csv(os.path.join(cd, "{0}{1}_FinalTable_py.csv".format(farm, i)),
                           index=False)