R: comparing two character columns in a big data frame - python

So, I have a really huge data frame with two character columns. The characters are ID values separated by ";". I want to calculate the number of common ID values between these two columns. Here is an example:
  id.x            id.y
1 123;145;156     143;156;234;165
2 134;156;187;675 132;145;156;187
So in this case, the first row has one common value and the second row has two common values.
The table has 60M records, and some of the strings may be more than 1,000 characters long. I tried writing the data to a text file and doing this analysis in Python, but the file size is 30GB. Any idea how to do this in R? (regex, apply, ...)
I can find the common values for one row with this command:
intersect(strsplit(df[1, "id.x"], split=";")[[1]], strsplit(df[1, "id.y"], split=";")[[1]])
Therefore, I wrote a function:
myfun <- function(x, y) {
  length(intersect(strsplit(x, split=";")[[1]], strsplit(y, split=";")[[1]]))
}
which works when I try it on a single call, but when I use it with mapply as below, it prints the id.x values as well, and I only want the numbers in the output:
> mapply(FUN=myfun, df[1:2,]$id.x, df[1:2,]$id.y)
    123;145;156 134;156;187;675
              1               2
So why does it print the first column as well? What is wrong with my command?

mapply() returns an integer vector with a names attribute:
y <- mapply(myfun, df$id.x, df$id.y)
str(y)
Named int [1:2] 1 2
- attr(*, "names")= chr [1:2] "123;145;156" "134;156;187;675"
Drop them with USE.NAMES=FALSE:
mapply(myfun, df$id.x, df$id.y, USE.NAMES=FALSE)
[1] 1 2
And use an index to test the timing on larger and larger subsets of the data:
system.time(y <- mapply(myfun, df[1:1e5,]$id.x, df[1:1e5,]$id.y, USE.NAMES=FALSE))
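Since the question mentions trying the same analysis in Python, here is a minimal pandas sketch of the per-row intersection count (my own illustration, not part of the answer above; the column names come from the example, and for 60M rows you would read the file in chunks rather than all at once):
import pandas as pd

df = pd.DataFrame({
    "id.x": ["123;145;156", "134;156;187;675"],
    "id.y": ["143;156;234;165", "132;145;156;187"],
})

def n_common(x, y):
    # Split each field on ";" and count the shared IDs via set intersection.
    return len(set(x.split(";")) & set(y.split(";")))

df["n_common"] = [n_common(x, y) for x, y in zip(df["id.x"], df["id.y"])]
print(df["n_common"].tolist())  # [1, 2]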

Related

How to loop through multiple variables and filter data to produce multiple dataframes?

I have a dataframe df with about 1000 rows with the following columns:
df$ID<- c("ab11", "ab12" ...) #about 1000 rows
df$ID1<- numbers ranging from 1 to 20k # for all intents and purposes this can be treated as class 'factor'
df$Acol<- #numbers ranging from 1 to 1000
df$Bcol<- #numbers ranging from 0 to 1
The following lists give me 12 values each:
A <- seq(50, 600, by=50)
B <- seq(0.2, 1, by=0.05)
I am trying to do 2 things:
I would like to create dataframes by filtering the original dataset with various combinations of lists A and B. So 144 dataframes.
Once I have those dataframes I would like to combine 3 dataframes at a time and see if the frequency of the IDs match a master dataframe x and if they do, get the combinations information for the matching dataframe.
So for 1, this is my approach:
df_50_0.2<-subset(df, df$Acol>=50 & df$Bcol>=0.2)
I can't write that out 144 times; I need a loop. I tried a nested loop, but that doesn't give me every combination of A and B, so I tried a while loop.
Here is my code:
i <- 50
while (i < 550) {
  for (j in B) {
    assign(paste("df", "_", as.character(i), "_", as.character(j), sep=""), df %>%
      filter(Acol >= i) %>%
      filter(Bcol >= j), envir=.GlobalEnv)
    i <- i + 50
  }
}
That gives me the desired result except that it doesn't split the dataframe according to B. So the output is similar to what I would have if I had just filtered the data with the values of A.
For the second part I need to loop through all possible combinations of three data frames at a time. Here is my code:
df.final <- rbind(df_50_0.2, df_100_0.25, df_150_0.5)
tmp <- subset(table(df.final$ID), !(table(df.final$ID) %in% table(x$ID)))
I would like the above to be in a loop. If tmp has any values, I don't want it as output. If it is empty, that is, a perfect match to the frequency of IDs in the master dataframe x, I want it written out. So something like the following, in a loop? I want all possible combinations checked iteratively, to come up with the combinations that perfectly match the ID frequencies of the master dataframe x:
if (length(tmp) == 0)
  tmp
else
  rm(tmp)
Any help is much appreciated. A python solution is also welcome!
A solution like the one in the following link, modified for two columns, could be helpful: Filter loop to create multiple data frames
This is not a complete answer, but it represents a different way to think about your problem. First we need a reproducible sample. I'm leaving out your first two columns, but it would not be difficult to modify the example to include them.
set.seed(42)
df <- data.frame(A=runif(200) * 1000 + 1, B=runif(200))
Aint <- seq(50, 600, by=50)
Bint <- seq(0.2, 1, by=0.05)
comb <- expand.grid(Aint=Aint, Bint=Bint)
This gives you a data frame of 200 observations and columns A and B. Then we create the intervals (called Aint and Bint so we don't confuse them with df$A and df$B). The next line creates all possible combinations of those intervals - 204 of them (not 144, since Bint has 17 values).
Now we need to define your groups and generate the label for each one:
groups <- apply(comb, 1, function(x) df$A > x[1] & df$B > x[2])
labels <- paste("df", comb[, 1], comb[, 2], sep="_")
groups is a logical matrix, 200 x 204, so each column defines one of your comparisons. The first group is defined as df[groups[, 1], ] and the second as df[groups[, 2], ]. The labels for those comparisons are labels[1] and labels[2].
Since your groups are overlapping, the next step will create multiple copies of your data in each group of three. The number of combinations of 204 groups taken 3 at a time is 1,394,204. The following generates the code to create the different combinations of 3 data frames:
allcombos <- combn(204, 3, simplify=TRUE)
for (i in 1:5) {
  dfno <- allcombos[, i]
  df.final <- rbind(df[groups[, dfno[1]], ], df[groups[, dfno[2]], ], df[groups[, dfno[3]], ])
  lbl <- labels[dfno]
}
The next step involves subset(table(df.final$ID), !(table(df.final$ID) %in% table(x$ID))), but that line does not do what you suggest. It creates a frequency table of the number of times each ID appears in df.final, and a table for the master data frame x. The %in% statement processes each frequency in df.final and checks whether it matches the frequency of ANY ID in the master. So I have not included that logical statement, or code that would write the current df.final to a list.
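Since the asker also welcomed a Python solution, here is a rough structural translation of the same masks-plus-combinations idea (a sketch under assumptions: made-up data, names of my own choosing, and the ID frequency check left as a placeholder):
import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"A": rng.uniform(1, 1001, 200), "B": rng.uniform(0, 1, 200)})

# One boolean mask per (Aint, Bint) combination, instead of 204 named data frames.
combos = itertools.product(range(50, 601, 50), [round(0.2 + 0.05 * k, 2) for k in range(17)])
groups = {f"df_{a}_{b}": (df["A"] > a) & (df["B"] > b) for a, b in combos}

# Iterate over triples of groups; R's rbind becomes pd.concat. Only the first
# 5 triples are shown, mirroring the R loop above.
for (n1, m1), (n2, m2), (n3, m3) in itertools.islice(itertools.combinations(groups.items(), 3), 5):
    df_final = pd.concat([df[m1], df[m2], df[m3]])
    # ... compare df_final's ID frequencies against the master frame x here ...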

Extract values from array type of column in pandas

I am trying to extract the location codes / product codes from a SQL table using pandas. The field is an array type, i.e. it has multiple values as a list within each row. I have to extract values from the strings for the product/location codes.
Here is a sample of the table
df.head()
Target_Type Constraints
45 ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1
45 ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1
45 ti_8894,trad_8894_0.2
Now I want to extract the numeric values of the codes. I also want to ignore the trailing values after the 2nd underscore in the entries, i.e. ignore the _1, _0.2, etc.
Here is a sample of the output I want to achieve. It should be a unique list/df column of all the extracted values -
Target_Type_45_df.head()
Constraints
8188
9258
22420
8894
I have never worked with nested/array type of column before. Any help would be appreciated.
You can use explode to put each list element in its own row under one column, then split out the code:
df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
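One follow-up step (my addition, not part of the answer above) to get the unique numeric codes out of that new column might be:
# Continuing from the snippet above: cast to int and drop duplicates.
codes = df["newConst"].astype(int).drop_duplicates().tolist()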
I would think the following overall strategy would work well (you'll need to debug):
Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
In this function, set my_list = row['Constraints'].
Then do my_list = my_list.split(','). Now you have a list, with no commas.
Next, split with the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
Finally, convert to set: return set(numbers)
The output for each row will be a set - just union all these sets together to get the final result.
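A minimal sketch of that strategy against the sample rows from the question (my own illustration, so treat the exact shape as an assumption):
import pandas as pd

df = pd.DataFrame({
    "Target_Type": [45, 45, 45],
    "Constraints": [
        "ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1",
        "ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1",
        "ti_8894,trad_8894_0.2",
    ],
})

def extract_codes(row):
    # Split the comma-separated string, then take the token after the first
    # underscore in each entry (this also drops the trailing _1, _0.2, etc.).
    return {int(entry.split("_")[1]) for entry in row["Constraints"].split(",")}

# Union the per-row sets into one unique, sorted list of codes.
all_codes = sorted(set().union(*df.apply(extract_codes, axis=1)))
print(all_codes)  # [8188, 8894, 9258, 22420]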

Add 1 where a substring is present in a column

I have several strings concatenated in column rows, separated by '|'.
I need to make columns for each of the strings, so I applied a unique method and now have an array with the desired strings; let's call it A.
I made columns checking whether each string in A is in the concatenated row of column C, with this:
for i in A:
    df[i] = df['C'].str.contains(i)
Now this returns booleans, and I'm trying to turn the booleans into 1 and 0 values. The target is to make columns that tell whether the string from A is in the concatenated strings of C.
So, is there a way to make it return 1 for True and 0 for False? I'm asking also because I couldn't test much: A has 20 strings and C 20 million rows, so I have to let my laptop run this overnight :P
Hope my English is clear, thanks!
Assuming I have understood your question well, you are actually just one function away:
for i in A:
    df[i] = df['C'].str.contains(i).astype(int)
However, if you have already computed the Boolean values, you can do:
df[A] = df[A].astype(int)
or
df[A] = df[A].replace({True: 1, False: 0})
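One detail worth noting, since the column holds concatenated strings (my addition, not part of the answer above): str.contains treats its pattern as a regular expression by default, so if any string in A contains regex metacharacters you may get wrong matches; passing regex=False does a literal substring check instead:
for i in A:
    # regex=False makes this a plain substring test rather than a regex search.
    df[i] = df['C'].str.contains(i, regex=False).astype(int)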

What is the Python syntax for accessing specific data in a function call?

I am not generally a python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3) {
  df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (here, multiply) every question A with every question B:
import numpy as np
import pandas as pd
questions = pd.DataFrame(np.arange(6).reshape(3,2), columns=["question_A", "question_B"])
This gives:
   question_A  question_B
0           0           1
1           2           3
2           4           5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0]*questions['question_B'])
results = questions.apply(compare, axis=1)
That gives us:
   0   1   2
0  0   0   0
1  2   6  10
2  4  12  20
As you pointed out in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
Based on what you've put so far, here are some issues with what you've written, which are understandable given your R programming background:
for i in range(1, 3):
In Python 3.x, this creates a range object, which you can think of as a special type of function (though it is really an object with iteration properties) that generates a sequence of numbers with a given step size (default 1), up to but excluding the stop value. Additionally, you should know that most programming languages index starting at zero, not 1, and this includes Python.
What this range object does here is generate the sequence 1, 2 and that is it.
So i will not run over all the indices of the arrays you are indexing. What I believe you want is something like:
for i in range(3):
Notice how there is only one number here: it is the exclusive maximum of the range, with 0 as the inclusive minimum, so this generates the sequence 0, 1, 2. If you have an array of size 3, this represents all of its possible indices.
This next line is a bit confusing to me, since I'm not familiar with R, but I roughly understand what you were trying to do. If I understand correctly, you are comparing two columns of 3 questions each, each question in column 1 against the questions in column 2, resulting in a 3 x 3 matrix of comparison results which you are trying to store in results. Assuming the sizes are already correct (as in, results is 3x3), I'd like to point out some peculiarities in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1,i] you are indexing by row and column: i-1 is the row and i is the column. So without changing range(1,3), this accesses only the indexes (0,1) and (1,2), and that is it. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do; this may not be the case, however, since I'm not sure what object you are using to call that member function, so I don't know where its documentation exists. Assuming it does return a list of size 3 or a dataframe, you'll need to change the way you assign the data, to this:
results.iloc[i,:] = ...
What is happening here is that iloc(...) takes a row positional argument and a slice positional argument; you are assigning all the columns of the results matrix at that row to the values returned by compare. With the for statement changed, this will iterate over all indices in the dataframe.
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you iterate over each column in the first row of questions, comparing it to the second column of questions across all rows.
I believe what you will want to do is change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does: for each i in 0, 1, 2, take the question at column 0 and compare it to every row in column 1. If your questions dataframe is actually organized as 2 columns and 3 rows this should work; otherwise you will need to change how you create questions as well.
In all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i,:] = liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
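To sanity-check that loop structure without the real client, here is a self-contained sketch that swaps liteClient.compare for a stand-in similarity function (difflib's SequenceMatcher ratio, purely illustrative; the frame contents are made up):
import difflib
import pandas as pd

def compare(a, b):
    # Stand-in for liteClient.compare: any function of two strings works here.
    return difflib.SequenceMatcher(None, a, b).ratio()

questions = pd.DataFrame({
    "yr1questions": ["How old are you?", "Where do you live?", "What is your job?"],
    "yr2questions": ["What is your age?", "Where is your home?", "What do you do?"],
})
results = pd.DataFrame(index=range(3), columns=range(3), dtype=float)

for i in range(3):
    # Row i holds question i from column 0 compared to every question in column 1.
    results.iloc[i, :] = [compare(questions.iloc[i, 0], q) for q in questions["yr2questions"]]
print(results)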

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has a r rows, with c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of certain element (and std. dev, variance, etc.) and then determine if the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets by:
loop_iter = len(A) // max(A['SEQ'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to do it by explicitly splitting on the keys, right? So perhaps I have TWO questions: 1) how to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) how to match a DF and do summary statistics, etc., on the DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its simplest is that you have a pd.Series of values (i.e. a["key"], which let's just call keys) which correspond to rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.
    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass

# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values - those of b["key"] - always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, the returned Series form the rows of summary; and if the applied function returns a DataFrame, the result is a multi-index DataFrame. This is a core pattern in Pandas, and there's a whole lot to explore here.
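As a concrete illustration of that pattern against the question's sample data (a minimal sketch; a real descriptive_func would add std. dev, variance, etc., and the lo/hi names are my own):
import pandas as pd

a = pd.DataFrame({"SEQ": [1, 2, 3, 4], "VAL": [28, 15, 13, 11],
                  "KEY": ["keyA", "keyB", "keyC", "keyD"]})
b = pd.DataFrame({"SEQ": [1, 2, 3, 4, 1, 2, 3, 4],
                  "VAL": [28, 15, 13, 11, 12, 23, 21, 15],
                  "KEY": ["keyA", "keyB", "keyC", "keyD"] * 2})

def descriptive_func(df):
    # Per-key summary of B: min and max of VAL across all rows with this key.
    return pd.Series({"lo": df["VAL"].min(), "hi": df["VAL"].max()})

summary = b[b["KEY"].isin(set(a["KEY"]))].groupby("KEY").apply(descriptive_func)

# Join the per-key ranges back onto A and test whether each VAL falls in range.
a = a.join(summary, on="KEY")
a["in_range"] = a["VAL"].between(a["lo"], a["hi"])
print(a)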
