Manipulating data frame in R - python

I'm trying to munge my data from the first data frame below into the second one, where the values in columns B and C are combined into column names for the values in D, grouped by the values in A.
Below is a reproducible example.
set.seed(10)
fooDF <- data.frame(A = sample(1:4, 10, replace=TRUE), B = sample(letters[1:4], 10, replace=TRUE), C= sample(letters[1:4], 10, replace=TRUE), D = sample(1:4, 10, replace=TRUE))
fooDF[!duplicated(fooDF),]
A B C D
1 4 c b 2
2 4 d a 2
3 2 a b 4
4 3 c a 1
5 4 a b 3
6 4 b a 2
7 1 b d 2
8 1 a d 4
9 2 b a 3
10 2 d c 2
newdata <- data.frame(A = 1:4)
for (i in 1:nrow(fooDF)) {
  col_name <- paste(fooDF$B[i], fooDF$C[i], sep = "")
  newdata[newdata$A == fooDF$A[i], col_name] <- fooDF$D[i]
}
The format I am trying to get it into:
> newdata
A cb da ab ca ba bd ad dc
1 1 NA NA NA NA NA 2 4 NA
2 2 NA NA 4 NA 3 NA NA 2
3 3 NA NA NA 1 NA NA NA NA
4 4 2 2 3 NA 2 NA NA NA
Right now I am doing it row by row, but that is infeasible for a large CSV containing 5+ million lines. Is there a way to do it faster in R or Python?

In R, this can be done with tidyr
library(tidyr)
fooDF %>%
  unite(BC, B, C, sep = "") %>%
  spread(BC, D)
# A ab ad ba bd ca cb da dc
#1 1 NA 4 NA 2 NA NA NA NA
#2 2 4 NA 3 NA NA NA NA 2
#3 3 NA NA NA NA 1 NA NA NA
#4 4 3 NA 2 NA NA 2 2 NA
Or we can do this with dcast
library(data.table)
dcast(setDT(fooDF), A~paste0(B,C), value.var = "D")
# A ab ad ba bd ca cb da dc
#1: 1 NA 4 NA 2 NA NA NA NA
#2: 2 4 NA 3 NA NA NA NA 2
#3: 3 NA NA NA NA 1 NA NA NA
#4: 4 3 NA 2 NA NA 2 2 NA
data
fooDF <- structure(list(A = c(4L, 4L, 2L, 3L, 4L, 4L, 1L, 1L, 2L, 2L),
B = c("c", "d", "a", "c", "a", "b", "b", "a", "b", "d"),
C = c("b", "a", "b", "a", "b", "a", "d", "d", "a", "c"),
D = c(2L, 2L, 4L, 1L, 3L, 2L, 2L, 4L, 3L, 2L)), .Names = c("A",
"B", "C", "D"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))

First paste columns B and C together (into column "z"):
fooDF$z = paste0(fooDF$B,fooDF$C)
A B C D z
1 3 d c 3 dc
2 1 b d 3 bd
3 1 a a 2 aa
4 2 d a 1 da
5 4 d c 1 dc
6 2 d b 2 db
7 4 b d 3 bd
8 2 c d 3 cd
9 1 a b 2 ab
10 4 a b 2 ab
Then I'll remove columns B and C:
fooDF$B = NULL
fooDF$C = NULL
And lastly, do a reshape from long to wide:
finalFooDF = reshape(fooDF, timevar = "z", direction = "wide",idvar = "A")
A D.dc D.bd D.aa D.da D.db D.cd D.ab
1 3 3 NA NA NA NA NA NA
2 1 NA 3 2 NA NA NA 2
4 2 NA NA NA 1 2 3 NA
5 4 1 3 NA NA NA NA 2
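Since the question also asks about Python, here is a hedged pandas sketch of the same reshape (assuming fooDF has been loaded as a DataFrame with the columns A, B, C, D shown above; if a BC combination repeats within an A group, the first D value is kept):
import pandas as pd

# paste B and C together, then pivot D into one column per BC combination
out = (fooDF.assign(BC=fooDF["B"] + fooDF["C"])
            .pivot_table(index="A", columns="BC", values="D", aggfunc="first")
            .reset_index())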

Related

Create union of two columns in pandas

I have two dataframes with identical columns. However, the 'labels' column can have different labels. All labels are comma-separated strings. I want to take the union of the labels in order to go from this:
df1:
id1 id2 labels language
0 1 1 1 en
1 2 3 en
2 3 4 4 en
3 4 5 en
4 5 6 en
df2:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 5,7 en
3 4 5 en
4 5 6 3 en
to this:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 3 en
I've tried this:
df1['labels'] = df1['labels'].apply(lambda x: set(str(x).split(',')))
df2['labels'] = df2['labels'].apply(lambda x: set(str(x).split(',')))
result = df1.merge(df2, on=['article_id', 'line_number', 'language'], how='outer')
result['labels'] = result[['labels_x', 'labels_y']].apply(lambda x: list(set.union(*x)) if None not in x else set(), axis=1)
result['labels'] = result['labels'].apply(lambda x: ','.join(set(x)))
result = result.drop(['labels_x', 'techniques_y'], axis=1)
but I get a weird df with odd commas in some places, e.g. the ,3:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 ,3 en
How can I properly fix the commas? Any help is appreciated!
Here is a possible solution with pandas.merge:
out = (
    df1.merge(df2, on=["id1", "id2", "language"])
       .assign(labels=lambda x: x.filter(like="label")
                                 .stack().str.split(",")
                                 .explode().drop_duplicates()
                                 .groupby(level=0).agg(",".join))
       .drop(columns=["labels_x", "labels_y"])
       [df1.columns]
)
Output :
print(out)
id1 id2 labels language
0 1 1 1,2 en
1 2 3 NaN en
2 3 4 4,5,7 en
3 4 5 NaN en
4 5 6 3 en
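An alternative, more explicit row-wise sketch (assuming labels are comma-separated strings and empty cells are NaN; the helper union_labels is hypothetical, not part of pandas):
import pandas as pd

def union_labels(a, b):
    # split each cell into a set of label strings, treating NaN/empty cells as the empty set
    sa = set(str(a).split(",")) if pd.notna(a) and str(a) != "" else set()
    sb = set(str(b).split(",")) if pd.notna(b) and str(b) != "" else set()
    merged = sorted(sa | sb)
    return ",".join(merged) if merged else float("nan")

tmp = df1.merge(df2, on=["id1", "id2", "language"], suffixes=("_x", "_y"))
tmp["labels"] = [union_labels(a, b) for a, b in zip(tmp["labels_x"], tmp["labels_y"])]
out = tmp.drop(columns=["labels_x", "labels_y"])[df1.columns]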

What is the equivalent of .loc with multiple conditions in R?

I wonder if there is any equivalent of .loc in R that lets me apply multiple conditions inside a for loop.
In Python I accomplished this using .loc, as seen in the code below. However, I am unable to reproduce this in R.
for column in df.columns[1:9]:
    for i in range(4, len(df)):
        col = 'AC' + str(column[-1])
        df[col][i] = df['yw_lagged'].loc[(df['id'] == df[column][i]) & (df['yearweek'] == df['yearweek'][i])]
In R, I thought this would work
df[i,col] <- df[df$id == df[column][i] & df$yearweek == df$yearweek[i], "yw_lagged"]
but it doesn't seem to filter in the same way as .loc does.
edit:
structure(list(id = c(1, 2, 6, 7, 1, 2), v1 = c(2, 1, 1, 1, 2,
1), v2 = c(6, 3, 2, 2, 6, 3), v3 = c(7, 6, 5, 3, 7, 6), v4 =
c(NA, 7, 7, 6, NA, 7), v5 = c(NA, 8, 14, 8, NA, 8), v6 = c(NA,
NA,15, 15, NA, NA), v7 = c(NA, NA, 16, 16, NA, NA), v8 = c(NA,
NA,NA, 17, NA, NA), violent = c(1, 0, 1, 0, 0, 0), yw_lagged =
c(NA, NA, NA, NA, 1, 0), yearweek = c(20161, 20161, 20161, 20161,
20162, 20162), AC1 = c(NA, NA, NA, NA, NA, NA), AC2 = c(NA, NA,
NA, NA, NA, NA), AC3 = c(NA, NA, NA, NA, NA, NA), AC4 = c(NA, NA,
NA, NA, NA, NA), AC5 = c(NA, NA, NA, NA, NA, NA), AC6 = c(NA,
NA, NA, NA, NA, NA), AC7 = c(NA, NA, NA, NA, NA, NA), AC8 = c(NA,
NA, NA, NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Picture of expected output (Tried to add some connections using colors)
The sample data is lacking any matches, so I'll override a few rows of yw_lagged:
df$yw_lagged[1:4] <- 1:4
From here,
dplyr
library(dplyr)
df %>%
  group_by(yearweek) %>%
  mutate(across(matches("^v[0-9]+"),
                ~ yw_lagged[match(., id)],
                .names = "AC{sub('^v', '', .col)}")) %>%
  ungroup()
# # A tibble: 6 × 20
# id v1 v2 v3 v4 v5 v6 v7 v8 violent yw_lagged yearweek AC1 AC2 AC3 AC4 AC5 AC6 AC7 AC8
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2 6 7 NA NA NA NA NA 1 1 20161 2 3 4 NA NA NA NA NA
# 2 2 1 3 6 7 8 NA NA NA 0 2 20161 1 NA 3 4 NA NA NA NA
# 3 6 1 2 5 7 14 15 16 NA 1 3 20161 1 2 NA 4 NA NA NA NA
# 4 7 1 2 3 6 8 15 16 17 0 4 20161 1 2 NA 3 NA NA NA NA
# 5 1 2 6 7 NA NA NA NA NA 0 1 20162 0 NA NA NA NA NA NA NA
# 6 2 1 3 6 7 8 NA NA NA 0 0 20162 1 NA NA NA NA NA NA NA
data.table
library(data.table)
cols <- grep("v[0-9]+", names(df), value = TRUE)
data.table(df)[, (sub("^v", "AC", cols)) :=
lapply(.SD, \(z) yw_lagged[match(z, id)]),
.SDcols = cols][]
# id v1 v2 v3 v4 v5 v6 v7 v8 violent yw_lagged yearweek AC1 AC2 AC3 AC4 AC5 AC6 AC7 AC8
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 1 2 6 7 NA NA NA NA NA 1 1 20161 2 3 4 NA NA NA NA NA
# 2: 2 1 3 6 7 8 NA NA NA 0 2 20161 1 NA 3 4 NA NA NA NA
# 3: 6 1 2 5 7 14 15 16 NA 1 3 20161 1 2 NA 4 NA NA NA NA
# 4: 7 1 2 3 6 8 15 16 17 0 4 20161 1 2 NA 3 NA NA NA NA
# 5: 1 2 6 7 NA NA NA NA NA 0 1 20162 2 3 4 NA NA NA NA NA
# 6: 2 1 3 6 7 8 NA NA NA 0 0 20162 1 NA 3 4 NA NA NA NA
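For completeness, the lookup the original Python loop performs can also be vectorized on the pandas side. A hedged sketch, assuming the columns from the dput above and that id is unique within each yearweek:
import pandas as pd

vcols = [c for c in df.columns if c.startswith("v") and c[1:].isdigit()]

def add_ac(group):
    # within one yearweek, map each referenced id to that id's yw_lagged value
    lookup = group.set_index("id")["yw_lagged"]
    for c in vcols:
        group["AC" + c[1:]] = group[c].map(lookup)   # NaN where the id is absent that week
    return group

df = df.groupby("yearweek", group_keys=False).apply(add_ac)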

Pandas - how to drop nan rows within a group, but only if there's more than one row

For example, say I have a DataFrame that looks like this:
df1 = pd.DataFrame({
"grp": ["a", "a", "a", "b", "b", "c", "c", "c", "d"],
"col1": ["1", "2", np.nan, "4", "5", np.nan, "6", "7", np.nan]
})
grp col1
0 a 1
1 a 2
2 a NaN
3 b 4
4 b 5
5 c NaN
6 c 6
7 c 7
8 d NaN
For each group in the grp column, I want to drop the rows where col1 is NaN.
The constraint is that I do not want to drop such a row when it is the only row in its group.
I would expect the output DataFrame to look like this.
df2 = pd.DataFrame({
"grp": ["a", "a", "b", "b", "c", "c", "d"],
"col1": ["1", "2", "4", "5", "6", "7", np.nan]
})
# notice the NaN in `grp`=="d"
grp col1
0 a 1
1 a 2
2 b 4
3 b 5
4 c 6
5 c 7
6 d NaN
I managed to come up with a solution, but it's clunky. Is there a more succinct way of solving this? I also don't understand why the values were cast to strings...
df1_grp = df1.groupby("grp")['col1'].apply(np.hstack).to_frame().reset_index()
df1_grp['col1'] = df1_grp['col1'].apply(lambda x: [float(_) for _ in x if _!="nan"] if len(x)>1 else x)
df1_grp.explode('col1')
Use GroupBy.transform with 'all' to test whether every value in a group is NaN, then combine the inverted mask with it using | (bitwise OR):
m = df1['col1'].isna()
m1 = m.groupby(df1["grp"]).transform('all')
df = df1[~m | m1]
print (df)
grp col1
0 a 1
1 a 2
3 b 4
4 b 5
6 c 6
7 c 7
8 d NaN
Or you can filter groups with only missing values:
m = df1['col1'].notna()
m1 = df1['grp'].isin(df1.loc[m, 'grp'])
df = df1[m | ~m1]
print (df)
grp col1
0 a 1
1 a 2
3 b 4
4 b 5
6 c 6
7 c 7
8 d NaN
Because d has only NaN: ffill(), bfill(), and drop duplicates keeping the first.
df1=df1.assign(col1=df1.groupby('grp')['col1'].ffill().bfill()).drop_duplicates(keep='first')
grp col1
0 a 1
1 a 2
3 b 4
4 b 5
5 c 6
7 c 7
8 d NaN
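Another way to express the same rule, as a sketch (per group, drop the NaN rows only when the group also contains at least one non-NaN value):
out = (df1.groupby("grp", group_keys=False)
          .apply(lambda g: g.dropna(subset=["col1"]) if g["col1"].notna().any() else g))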

Merge on columns and rows

I am trying to make a large dataframe using Python. I have a large number of little dataframes with different row and column names, but there is some overlap between the row names and column names. What I was trying to do is start with one of the little dataframes and then add the others one by one.
Each of the specific row-column combinations is unique and in the end there will probably be a lot of NA.
I have tried doing this with merge from pandas, but this results in a much larger dataframe than I need, with row and column names being duplicated instead of merged. If I could find a way to make pandas treat NaN as "not a value" and overwrite it when a new little dataframe is added, I think I would obtain the result I want.
I am also willing to try something that is not using pandas.
For example:
DF1 A B
Y 1 2
Z 0 1
DF2 C D
X 1 2
Z 0 1
Merged: A B C D
Y 1 2 NA NA
Z 0 1 0 1
X NA NA 1 2
And then a new dataframe has to be added:
DF3 C E
Y 0 1
W 1 1
The result should be:
A B C D E
Y 1 2 0 NA 1
Z 0 1 0 1 NA
X NA NA 1 2 NA
W NA NA 1 NA 1
But what happens is:
A B C_x C_y D E
Y 1 2 NA 1 NA 1
Z 0 1 0 0 1 NA
X NA NA 1 1 2 NA
W NA NA 1 1 NA 1
You want to use DataFrame.combine_first, which will align the DataFrames based on index, and will prioritize values in the left DataFrame, while using values in the right DataFrame to fill missing values.
df1.combine_first(df2).combine_first(df3)
Sample data
import pandas as pd
df1 = pd.DataFrame({'A': [1,0], 'B': [2,1]})
df1.index=['Y', 'Z']
df2 = pd.DataFrame({'C': [1,0], 'D': [2,1]})
df2.index=['X', 'Z']
df3 = pd.DataFrame({'C': [0,1], 'E': [1,1]})
df3.index=['Y', 'W']
Code
df1.combine_first(df2).combine_first(df3)
Output:
A B C D E
W NaN NaN 1.0 NaN 1.0
X NaN NaN 1.0 2.0 NaN
Y 1.0 2.0 0.0 NaN 1.0
Z 0.0 1.0 0.0 1.0 NaN
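Since the question mentions a large number of little dataframes, one way to chain them (a sketch, assuming they are collected in a list) is functools.reduce:
from functools import reduce

dfs = [df1, df2, df3]   # extend with the remaining little dataframes
merged = reduce(lambda left, right: left.combine_first(right), dfs)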

Python Pandas Dataframe: replace variable by the frequency count

I have a dataframe which has categorical variables with hundreds of different values.
I'm able to verify the frequency of these levels using the value_counts() function or a groupby statement + reset_index()...
I was trying to replace these hundreds of values with their frequency count (and later on merge levels with low cardinality). I was trying to join two different dataframes (one with the values and the other with the counts), but I'm having issues...
For example, the frequency table would be below, with around 300 records (all unique):
v_catego Time
0 AA 353
1 AAC 136
2 ABB 2
3 ABC 1
4 ACA 13
300 ZZZ 33
original dataframe:
v_catego
0 AA
1 AAC
2 ABB
3 AAC
4 DA
5 AAC
................
where I would like to replace (or add another) variable with the 'Time' value for each instance:
v_catego new_v_catego
0 AA 353
1 AAC 136
2 ABB 2
3 AA 353
4 AAC 136
.................
I know in R there is a simple function that does this. Is there an equivalent in python?
IIUC you can use concat, but first you have to set the same categories in both Series (columns) with add_categories:
print df
v_catego Time
0 AA 353
1 AAC 136
2 ABB 2
3 AA 353
4 AAC 136
print df1
v_catego Time
0 ABC 1
1 ACA 13
#remember old cat in df1
old_cat = df1['v_catego']
#set same categories in both dataframes in column v_catego
df1['v_catego'] = df['v_catego'].cat.add_categories(df1['v_catego'])
df['v_catego'] = df['v_catego'].cat.add_categories(old_cat)
print df.v_catego
0 AA
1 AAC
2 ABB
3 AA
4 AAC
Name: v_catego, dtype: category
Categories (5, object): [AA, AAC, ABB, ABC, ACA]
print df1.v_catego
0 AA
1 AAC
Name: v_catego, dtype: category
Categories (5, object): [AA, AAC, ABB, ABC, ACA]
print pd.concat([df,df1])
v_catego Time
0 AA 353
1 AAC 136
2 ABB 2
3 AA 353
4 AAC 136
0 AA 1
1 AAC 13
EDIT:
I think you can use merge:
print df
v_catego
0 AA
1 AAC
2 ABB
3 AA
4 AAC
5 ABB
6 AA
7 AAC
8 AA
9 AAC
10 AAC
11 ABB
12 AA
13 AAC
14 ABB
15 AA
16 AAC
17 AA
18 AAC
df1 = (df['v_catego'].value_counts()
                     .reset_index(name='count')
                     .rename(columns={'index': 'v_catego'}))
print df1
v_catego count
0 AAC 8
1 AA 7
2 ABB 4
print pd.merge(df,df1,on=['v_catego'], how='left' )
v_catego count
0 AA 7
1 AAC 8
2 ABB 4
3 AA 7
4 AAC 8
5 ABB 4
6 AA 7
7 AAC 8
8 AA 7
9 AAC 8
10 AAC 8
11 ABB 4
12 AA 7
13 AAC 8
14 ABB 4
15 AA 7
16 AAC 8
17 AA 7
18 AAC 8
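A more direct alternative (a sketch, assuming the same single-column df as in the EDIT) is to map the value counts back onto the column, or use a group-wise transform, without building a second dataframe:
# attach the frequency of each category directly
df['count'] = df['v_catego'].map(df['v_catego'].value_counts())
# or equivalently
df['count'] = df.groupby('v_catego')['v_catego'].transform('count')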
