What is the equivalent of .loc with multiple conditions in R?

I wonder if there is an equivalent of .loc in R that lets me combine multiple conditions inside a for loop.
In Python I accomplished this using .loc, as seen in the code below. However, I am unable to reproduce it in R.
for column in df.columns[1:9]:
    for i in range(4, len(df)):
        col = 'AC' + str(column[-1])
        df[col][i] = df['yw_lagged'].loc[(df['id'] == df[column][i]) & (df['yearweek'] == df['yearweek'][i])]
In R, I thought this would work
df[i,col] <- df[df$id == df[column][i] & df$yearweek == df$yearweek[i], "yw_lagged"]
but it doesn't seem to filter in the same way that .loc does.
edit:
structure(list(id = c(1, 2, 6, 7, 1, 2), v1 = c(2, 1, 1, 1, 2,
1), v2 = c(6, 3, 2, 2, 6, 3), v3 = c(7, 6, 5, 3, 7, 6), v4 =
c(NA, 7, 7, 6, NA, 7), v5 = c(NA, 8, 14, 8, NA, 8), v6 = c(NA,
NA,15, 15, NA, NA), v7 = c(NA, NA, 16, 16, NA, NA), v8 = c(NA,
NA,NA, 17, NA, NA), violent = c(1, 0, 1, 0, 0, 0), yw_lagged =
c(NA, NA, NA, NA, 1, 0), yearweek = c(20161, 20161, 20161, 20161,
20162, 20162), AC1 = c(NA, NA, NA, NA, NA, NA), AC2 = c(NA, NA,
NA, NA, NA, NA), AC3 = c(NA, NA, NA, NA, NA, NA), AC4 = c(NA, NA,
NA, NA, NA, NA), AC5 = c(NA, NA, NA, NA, NA, NA), AC6 = c(NA,
NA, NA, NA, NA, NA), AC7 = c(NA, NA, NA, NA, NA, NA), AC8 = c(NA,
NA, NA, NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Picture of expected output (Tried to add some connections using colors)

The sample data is lacking any matches, so I'll override a few rows of yw_lagged:
df$yw_lagged[1:4] <- 1:4
From here,
dplyr
library(dplyr)
df %>%
  group_by(yearweek) %>%
  mutate(across(matches("^v[0-9]+"),
                ~ yw_lagged[match(., id)],
                .names = "AC{sub('^v', '', .col)}")) %>%
  ungroup()
# # A tibble: 6 × 20
# id v1 v2 v3 v4 v5 v6 v7 v8 violent yw_lagged yearweek AC1 AC2 AC3 AC4 AC5 AC6 AC7 AC8
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 6 7 NA NA NA NA NA 1 1 20161 2 3 4 NA NA NA NA NA
# 2 2 1 3 6 7 8 NA NA NA 0 2 20161 1 NA 3 4 NA NA NA NA
# 3 6 1 2 5 7 14 15 16 NA 1 3 20161 1 2 NA 4 NA NA NA NA
# 4 7 1 2 3 6 8 15 16 17 0 4 20161 1 2 NA 3 NA NA NA NA
# 5 1 2 6 7 NA NA NA NA NA 0 1 20162 0 NA NA NA NA NA NA NA
# 6 2 1 3 6 7 8 NA NA NA 0 0 20162 1 NA NA NA NA NA NA NA
data.table
library(data.table)
cols <- grep("v[0-9]+", names(df), value = TRUE)
data.table(df)[, (sub("^v", "AC", cols)) :=
                 lapply(.SD, \(z) yw_lagged[match(z, id)]),
               by = yearweek, .SDcols = cols][]
# id v1 v2 v3 v4 v5 v6 v7 v8 violent yw_lagged yearweek AC1 AC2 AC3 AC4 AC5 AC6 AC7 AC8
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 1 2 6 7 NA NA NA NA NA 1 1 20161 2 3 4 NA NA NA NA NA
# 2: 2 1 3 6 7 8 NA NA NA 0 2 20161 1 NA 3 4 NA NA NA NA
# 3: 6 1 2 5 7 14 15 16 NA 1 3 20161 1 2 NA 4 NA NA NA NA
# 4: 7 1 2 3 6 8 15 16 17 0 4 20161 1 2 NA 3 NA NA NA NA
# 5:     1     2     6     7    NA    NA    NA    NA    NA       0         1    20162     0    NA    NA    NA    NA    NA    NA    NA
# 6:     2     1     3     6     7     8    NA    NA    NA       0         0    20162     1    NA    NA    NA    NA    NA    NA    NA
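For comparison, the same match-and-fill idea can be written back in the question's pandas, replacing the original double loop. This is only a sketch, assuming ids are unique within each yearweek (the helper fill_ac is mine, not part of the answers above):
import re

# for each yearweek group, build an id -> yw_lagged lookup and map every
# v-column through it (a pandas analogue of the grouped match() above)
def fill_ac(g):
    g = g.copy()
    lookup = g.set_index('id')['yw_lagged']  # assumes ids unique per group
    for col in [c for c in g.columns if re.fullmatch(r'v\d+', c)]:
        g['AC' + col[1:]] = g[col].map(lookup)
    return g

df = df.groupby('yearweek', group_keys=False).apply(fill_ac)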

Related

Pandas Divide Dataframe by Another Based on Column Values

I want to divide a pandas dataframe by another based on the column values.
For example, let's say I have:
>>> df = pd.DataFrame({'NAME': [ 'CA', 'CA', 'CA', 'AZ', 'AZ', 'AZ', 'TX', 'TX', 'TX'], 'NUM':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'VALUE': [10, 20, 30, 40, 50, 60, 70, 80, 90]})
>>> df
NAME NUM VALUE
0 CA 1 10
1 CA 2 20
2 CA 3 30
3 AZ 1 40
4 AZ 2 50
5 AZ 3 60
6 TX 1 70
7 TX 2 80
8 TX 3 90
>>> states = pd.DataFrame({'NAME': ['CA', "AZ", "TX"], 'DIVISOR': [10, 5, 1]})
>>> states
NAME DIVISOR
0 CA 10
1 AZ 5
2 TX 1
For each NAME and NUM I want to divide the VALUE column in df by the DIVISOR column of the respective state.
Which would give a result of
>>> result = pd.DataFrame({'NAME': [ 'CA', 'CA', 'CA', 'AZ', 'AZ', 'AZ', 'TX', 'TX', 'TX'], 'NUM':[1, 2, 3, 1, 2, 3, 1, 2, 3], 'VALUE': [1, 2, 3, 8, 10, 12, 70, 80, 90]})
>>> result
NAME NUM VALUE
0 CA 1 1
1 CA 2 2
2 CA 3 3
3 AZ 1 8
4 AZ 2 10
5 AZ 3 12
6 TX 1 70
7 TX 2 80
8 TX 3 90
Let us do map
df['NEW VALUE'] = df['VALUE'].div(df['NAME'].map(states.set_index('NAME')['DIVISOR']))
df
Out[129]:
NAME NUM VALUE NEW VALUE
0 CA 1 10 1.0
1 CA 2 20 2.0
2 CA 3 30 3.0
3 AZ 1 40 8.0
4 AZ 2 50 10.0
5 AZ 3 60 12.0
6 TX 1 70 70.0
7 TX 2 80 80.0
8 TX 3 90 90.0
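To make the one-liner easier to read, the lookup that map uses can be built as a separate step (the same computation, split in two; the name lookup is mine):
# NAME-indexed Series of divisors: CA -> 10, AZ -> 5, TX -> 1
lookup = states.set_index('NAME')['DIVISOR']
# align each row's NAME to its divisor, then divide
df['NEW VALUE'] = df['VALUE'].div(df['NAME'].map(lookup))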
You can use merge as well
result = df.merge(states,on=['NAME'])
result['NEW VALUE'] = result.VALUE/result.DIVISOR
print(result)
NAME NUM VALUE NEW VALUE DIVISOR
0 CA 1 10 1.0 10
1 CA 2 20 2.0 10
2 CA 3 30 3.0 10
3 AZ 1 40 8.0 5
4 AZ 2 50 10.0 5
5 AZ 3 60 12.0 5
6 TX 1 70 70.0 1
7 TX 2 80 80.0 1
8 TX 3 90 90.0 1
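If the carried-along DIVISOR column is not wanted in the result, it can be dropped afterwards:
result = result.drop(columns='DIVISOR')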
I feel like there must be a more eloquent way to accomplish what you are looking for, but this is the route that I usually take.
myresult = df.copy()
for i in range(len(df['NAME'])):
    for j in range(len(states['NAME'])):
        if df['NAME'][i] == states['NAME'][j]:
            myresult['VALUE'][i] = df['VALUE'][i] / states['DIVISOR'][j]
myresult.head()
Out[10]:
NAME NUM VALUE
0 CA 1 1
1 CA 2 2
2 CA 3 3
3 AZ 1 8
4 AZ 2 10
This is a very brute force method. You start by looping through each value in the data frame df, then you loop through each element in the data frame states. Then for each comparison, you look to see if the NAME columns match. If they do, you do the VALUE / DIVISOR.
You may get a SettingWithCopyWarning from the chained assignment (myresult['VALUE'][i] = ...), even though myresult was created with .copy(); it is a warning, not an error.

Pyspark distributed matrix sum non-null values

I'm attempting to convert a pandas "dot matrix nansum" function to pyspark.
The goal is to convert this table into a matrix of non-null column sums:
dan ste bob
t1 na 2 na
t2 2 na 1
t3 2 1 na
t4 1 na 2
t5 na 1 2
t6 2 1 na
t7 1 na 2
For example, when 'dan' is not null (t2, t3, t4, t6, t7) the sum of 'ste' is 2 and 'bob' is 5. When 'ste' is not null, the sum of 'dan' is 4. (I zeroed out the diagonal, but that isn't required.)
dan ste bob
dan 0 2 5
ste 4 0 2
bob 4 1 0
The calculation must remain distributed (no toPandas).
Here's the pandas version which works wonderfully: https://stackoverflow.com/a/46871184/7542835
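No answer is included here, but a minimal PySpark sketch of one distributed approach follows (the data is taken from the question; the per-cell aggregate pattern is my assumption, not the linked pandas answer translated line by line):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
cols = ["dan", "ste", "bob"]
df = spark.createDataFrame(
    [(None, 2, None), (2, None, 1), (2, 1, None), (1, None, 2),
     (None, 1, 2), (2, 1, None), (1, None, 2)],
    cols,
)

# one aggregate per (r, c) cell: the sum of column c over rows where column r
# is not null; when() without otherwise() yields NULL, which sum() ignores
exprs = [
    F.sum(F.when(F.col(r).isNotNull(), F.col(c))).alias(f"{r}_{c}")
    for r in cols
    for c in cols
]
df.agg(*exprs).show()  # one row holding all n*n cells, computed distributed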

Cumulative running percentage within each group and each group sorted in descending order python

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'id': [1, 2, 3, 4, 5, 6] * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
This is the output of df:
id sales state
0 1 847754 CA
1 2 362532 WA
2 3 615849 CO
3 4 376480 AZ
4 5 381286 CA
5 6 411001 WA
6 1 946795 CO
7 2 857435 AZ
8 3 928087 CA
9 4 675593 WA
10 5 371339 CO
11 6 440285 AZ
I am not able to compute the cumulative percentage within each group, with each group sorted in descending order. I want output like this:
id sales state cumsum run_pct
0 2 857435 AZ 857435 0.5121460996296738
1 6 440285 AZ 1297720 0.7751284195436626
2 4 376480 AZ 1674200 1.0
3 3 928087 CA 928087 0.43024216932985404
4 1 847754 CA 1775841 0.8232436013271356
5 5 381286 CA 2157127 1.0
6 1 946795 CO 946795 0.48955704367618535
7 3 615849 CO 1562644 0.807992624547372
8 5 371339 CO 1933983 1.0
9 4 675593 WA 675593 0.46620721731581655
10 6 411001 WA 1086594 0.7498271371847582
11 2 362532 WA 1449126 1.0
One possible solution is to first sort the data, then calculate the cumulative sum, and compute the percentages last.
Sorting with ascending states and descending sales:
df = df.sort_values(['state', 'sales'], ascending=[True, False])
Calculating the cumsum:
df['cumsum'] = df.groupby('state')['sales'].cumsum()
and the percentages:
df['run_pct'] = df.groupby('state')['sales'].apply(lambda x: (x/x.sum()).cumsum())
This will give:
id sales state cumsum run_pct
0 4 846079 AZ 846079 0.608566
1 2 312708 AZ 1158787 0.833491
2 6 231495 AZ 1390282 1.000000
3 3 790291 CA 790291 0.506795
4 1 554631 CA 1344922 0.862467
5 5 214467 CA 1559389 1.000000
6 1 983878 CO 983878 0.388139
7 5 779497 CO 1763375 0.695650
8 3 771486 CO 2534861 1.000000
9 6 794407 WA 794407 0.420899
10 2 587843 WA 1382250 0.732355
11 4 505155 WA 1887405 1.000000
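The last two steps can also be combined with transform, which returns a result aligned to the original index; a minimal equivalent sketch:
# per-row share of its group's total, then a running sum within each group
share = df['sales'] / df.groupby('state')['sales'].transform('sum')
df['run_pct'] = share.groupby(df['state']).cumsum()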

Manipulating data frame in R

I'm trying to munge my data from the first data frame below to the one following it, where the values in columns B and C are combined into column names for the values in D, grouped by the values in A.
Below is a reproducible example.
set.seed(10)
fooDF <- data.frame(A = sample(1:4, 10, replace=TRUE), B = sample(letters[1:4], 10, replace=TRUE), C= sample(letters[1:4], 10, replace=TRUE), D = sample(1:4, 10, replace=TRUE))
fooDF[!duplicated(fooDF),]
A B C D
1 4 c b 2
2 4 d a 2
3 2 a b 4
4 3 c a 1
5 4 a b 3
6 4 b a 2
7 1 b d 2
8 1 a d 4
9 2 b a 3
10 2 d c 2
newdata <- data.frame(A = 1:4)
for(i in 1:nrow(fooDF)){
  col_name <- paste(fooDF$B[i], fooDF$C[i], sep="")
  newdata[newdata$A == fooDF$A[i], col_name] <- fooDF$D[i]
}
This is the format I am trying to get it into:
> newdata
A cb da ab ca ba bd ad dc
1 1 NA NA NA NA NA 2 4 NA
2 2 NA NA 4 NA 3 NA NA 2
3 3 NA NA NA 1 NA NA NA NA
4 4 2 2 3 NA 2 NA NA NA
Right now I am doing it line by line, but that is infeasible for a large csv containing 5 million+ lines. Is there a way to do it faster in R or Python?
In R, this can be done with tidyr
library(tidyr)
fooDF %>%
  unite(BC, B, C, sep="") %>%
  spread(BC, D)
# A ab ad ba bd ca cb da dc
#1 1 NA 4 NA 2 NA NA NA NA
#2 2 4 NA 3 NA NA NA NA 2
#3 3 NA NA NA NA 1 NA NA NA
#4 4 3 NA 2 NA NA 2 2 NA
Or we can do this with dcast
library(data.table)
dcast(setDT(fooDF), A~paste0(B,C), value.var = "D")
# A ab ad ba bd ca cb da dc
#1: 1 NA 4 NA 2 NA NA NA NA
#2: 2 4 NA 3 NA NA NA NA 2
#3: 3 NA NA NA NA 1 NA NA NA
#4: 4 3 NA 2 NA NA 2 2 NA
data
fooDF <- structure(list(A = c(4L, 4L, 2L, 3L, 4L, 4L, 1L, 1L, 2L, 2L),
B = c("c", "d", "a", "c", "a", "b", "b", "a", "b", "d"),
C = c("b", "a", "b", "a", "b", "a", "d", "d", "a", "c"),
D = c(2L, 2L, 4L, 1L, 3L, 2L, 2L, 4L, 3L, 2L)), .Names = c("A",
"B", "C", "D"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))
First paste columns B and C together (into column "z"):
fooDF$z = paste0(fooDF$B,fooDF$C)
A B C D z
1 3 d c 3 dc
2 1 b d 3 bd
3 1 a a 2 aa
4 2 d a 1 da
5 4 d c 1 dc
6 2 d b 2 db
7 4 b d 3 bd
8 2 c d 3 cd
9 1 a b 2 ab
10 4 a b 2 ab
Then I'll remove columns B and C:
fooDF$B = NULL
fooDF$C = NULL
And last do a reshape from long to wide:
finalFooDF = reshape(fooDF, timevar = "z", direction = "wide",idvar = "A")
A D.dc D.bd D.aa D.da D.db D.cd D.ab
1 3 3 NA NA NA NA NA NA
2 1 NA 3 2 NA NA NA 2
4 2 NA NA NA 1 2 3 NA
5 4 1 3 NA NA NA NA 2

Python/R : If 2 columns have same value in multiple rows, add the values in the 3rd column and average the 4th, 5th and 6th column

Input :
0 77 1 2 3 5
0 78 2 4 6 1
0 78 1 2 3 5
3 79 0 4 5 2
3 79 6 8 2 1
3 79 1 2 3 1
Output (sum the 3rd-column values for the identical rows, and take the mean of all the values in the 4th, 5th and 6th columns):
0 77 1.0 2.0 3.0 5.0
0 78 3.0 3.0 4.5 3.0
3 79 7.0 4.6 3.3 1.3
We can use dplyr in R. We group by the first two columns, mutate the 3rd column ('V3') to the sum of that column, and use summarise_each to get the mean of columns 3:6.
library(dplyr)
res <- df1 %>%
  group_by(V1, V2) %>%
  mutate(V3 = sum(V3)) %>%
  summarise_each(funs(round(mean(.), 1)), V3:V6)
as.data.frame(res)
# V1 V2 V3 V4 V5 V6
#1 0 77 1 2.0 3.0 5.0
#2 0 78 3 3.0 4.5 3.0
#3 3 79 7 4.7 3.3 1.3
data
df1 <- structure(list(V1 = c(0L, 0L, 0L, 3L, 3L, 3L), V2 = c(77L, 78L,
78L, 79L, 79L, 79L), V3 = c(1L, 2L, 1L, 0L, 6L, 1L), V4 = c(2L,
4L, 2L, 4L, 8L, 2L), V5 = c(3L, 6L, 3L, 5L, 2L, 3L), V6 = c(5L,
1L, 5L, 2L, 1L, 1L)), .Names = c("V1", "V2", "V3", "V4", "V5",
"V6"), class = "data.frame", row.names = c(NA, -6L))
