Add 1 where a substring is present in a column - python

I have several strings concatenated in column rows, separated by '|'.
I need to make columns for each of the strings, so I applied a unique method and now have an array with the desired strings; let's call it A.
I made columns checking whether each string in A is in the concatenated row of column C with this:
for i in A:
    df[i] = df['C'].str.contains(i)
This returns booleans, and now I'm trying to turn the booleans into 1 and 0 values. The target is to make columns that tell whether each string in A is in the concatenated strings of C.
So, is there a way to make it return 1 for True and 0 for False? I'm asking also because I couldn't test much: A has 20 strings and C has 20 million rows, so I have to let my laptop run this overnight :P
I hope my English is clear, thanks!

Assuming I have understood your question well, you are actually just one function away:
for i in A:
    df[i] = df['C'].str.contains(i).astype(int)
However, if you have already computed the Boolean values you can do:
df[A] = df[A].astype(int)
or
df[A] = df[A].replace({True: 1, False: 0})
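Since the strings in C are '|'-delimited, pandas also has Series.str.get_dummies, which builds every 0/1 indicator column in a single vectorized pass; over 20 million rows this should be noticeably faster than 20 separate str.contains scans. A minimal sketch, assuming C holds values like 'foo|bar':
import pandas as pd

df = pd.DataFrame({'C': ['foo|bar', 'bar', 'foo|baz']})
# one 0/1 column per distinct '|'-separated token, built in one pass
df = df.join(df['C'].str.get_dummies(sep='|'))
This also sidesteps a subtle issue with the loop above: str.contains treats its argument as a regular expression by default, so any regex metacharacters in A would need escaping.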


Create a list of list from pandas dataframe in python with group by

I am a beginner in Python. I am just wondering: I have this data in a pandas dataframe
type  value
A     1
A     4
A     3
B     6
B     2
B     2
where each type has three values (they can be different or the same),
and what I want is to create a list of lists (or anything similar; I don't know the exact name, which is why I'm stuck searching for similar questions), grouped by the type column.
The expected output is exactly a list of lists like this:
[[1,4,3],[6,2,2]]
df.groupby('type').agg(pd.Series.tolist)['value'].tolist()
or simply:
df.groupby('type').agg(list)['value'].tolist()
Edit:
It depends on which number you want to sort on. In this case, it will sort on the first element of each list, which is 1.
If you want to reverse, you can simply say:
k = df.groupby('type').agg(list)['value'].tolist()
k.sort(reverse=True)
It will give you [[6,2,2], [1,4,3]]. If you want to sort on any other value of the lists, you have to pass key=some_function accordingly.
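For instance, a minimal sketch of sorting on the largest value inside each sublist instead of the first (key=max here is just one illustrative choice of function):
k = df.groupby('type').agg(list)['value'].tolist()
k.sort(key=max, reverse=True)  # order sublists by their maximum value, descending
# [[6, 2, 2], [1, 4, 3]]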

What is the Python syntax for accessing specific data in a function call?

I am not generally a Python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.
I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.
I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments. So far my code looks like this:
for i in range(1, 3):
    results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.
By comparison, here is my working R code, which uses a slightly different function and only one dataframe.
for (i in 1:3){
    df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
}
Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.
Bob
Let's try to compare (multiply) every question A with every question B:
import numpy as np
import pandas as pd
questions = pd.DataFrame(np.arange(6).reshape(3,2), columns=["question_A", "question_B"])
This gives:
   question_A  question_B
0           0           1
1           2           3
2           4           5
Then let's define a compare function:
def compare(row):
    return pd.Series(row[0]*questions['question_B'])

results = questions.apply(compare, axis=1)
That gives us:
   0   1   2
0  0   0   0
1  2   6  10
2  4  12  20
As you pointed out in the comments, here is a version comparing only two strings at a time:
def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
Based on what you've posted so far, here are some issues with what you've written which are understandable given your R programming background:
for i in range(1, 3):
In Python 3.x, this creates a range object, which you can think of as a special type of function (though it is really an object with iteration properties) that generates a sequence of numbers with a certain step size (default 1), excluding the upper bound. Additionally, you need to know that most programming languages, Python included, index starting at zero, not 1.
What this range object does here is generate the sequence 1, 2 and that is it.
So i will not cover all the indices of the arrays you are using it to index. What I believe you want is something like:
for i in range(3):
Notice how there is only one number here: it is the exclusive maximum of the range, with 0 as the inclusive minimum, so this generates the sequence 0, 1, 2. If you have an array of size 3, this covers all possible indices for that array.
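To make the difference concrete:
list(range(1, 3))  # [1, 2]
list(range(3))     # [0, 1, 2]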
This next line is a bit confusing to me since I'm not familiar with R, but I sort of understand what you were trying to do. If I understand correctly, you are trying to compare two columns of 3 questions each, comparing each question in column 1 to the questions in column 2, and storing the resulting 3 x 3 matrix of comparison results in results. Assuming the sizes are already correct (as in, results is 3x3), I'd like to explain some peculiarities I see in this code.
results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
With results.iloc[i-1,i] you are indexing by row and column: i-1 is the row and i is the column. So without changing range(1,3), the only indexes accessed are (0,1) and (1,2), and that is it. I believe liteClient.compare(...) is supposed to return either a 1x3 dataframe or a list of size 3, based on what you were trying to do inside of it; that may not be the case, however, since I'm not sure what object you are using to call that member function, so I don't know where its documentation lives. Assuming it does return a list of size 3 or a dataframe, you'll need to change the way you assign the data, via this:
results.iloc[i,:] = ...
What is happening here is that iloc takes a row positional argument and a slice positional argument; you are assigning all the columns in the results matrix at that row to the values returned by compare. Combined with the for statement change, this will iterate over all indices in the dataframe.
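A tiny sketch of that row-wise assignment, with made-up values purely for illustration:
import pandas as pd

results = pd.DataFrame(0.0, index=range(3), columns=range(3))
results.iloc[1, :] = [0.2, 0.5, 0.9]  # overwrites every column of row 1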
liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])
As this line currently stands, you are iterating over each column in the first row of questions and comparing each one to all rows of the second column of questions.
I believe what you will want to do is change this to:
liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
What this does is, for each i (0, 1, 2) at column 0, compare that question to every row in column 1. If your questions dataframe is actually organized as 2 columns and 3 rows this should work; otherwise you will need to change how you create questions as well.
In all, I believe the fixed program should look something like:
for i in range(3):
    results.iloc[i,:] = liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
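Putting it all together, here is a self-contained sketch; since I don't have access to liteClient, difflib.SequenceMatcher stands in for liteClient.compare purely as a placeholder similarity function, and the column names are made up for the example:
import difflib
import pandas as pd

questions = pd.DataFrame({
    'yr1questions': ['How old are you?', 'Where do you live?', 'Do you smoke?'],
    'yr2questions': ['What is your age?', 'Where do you reside?', 'Are you a smoker?'],
})
results = pd.DataFrame(index=range(3), columns=range(3), dtype=float)

def compare(a, b):
    # placeholder for liteClient.compare: ratio of matching characters
    return difflib.SequenceMatcher(None, a, b).ratio()

for i in range(3):
    results.iloc[i, :] = [compare(questions.iloc[i, 0], q)
                          for q in questions['yr2questions']]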

Loop through column until blank

I am trying to loop through the columns of a pandas DataFrame until a column is blank or does not contain the term 'Stock'. If a column name contains a date, I want the word 'Check' to be printed.
I am using:
print(df)
  Stock  15/12/2015  15/11/2015  15/10/2015
0    AA          10          11          11
1    BB          20          10           8
2    CC          30          33          26
3    DD          40          80          60
I have tried the below (which is wrong):
column = df
while column != ("") or 'Stock':
    print ('Check'),
    column += 1
print ("")
There are a few problems in your code. First of all, you've screwed up the indentation, so it's not even valid code.
Second, your comparison is broken because it doesn't mean what you probably expect. column != ("") or 'Stock' will always be true: first it compares column with (""), and if they're unequal the expression is True; otherwise it evaluates to 'Stock', which in a boolean context is considered true. What you probably should have written instead is column != "" and column != "Stock", or possibly column not in ("", "Stock").
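To see the precedence in action:
("Stock" != "") or 'Stock'  # True  -- the comparison alone is already true
("" != "") or 'Stock'       # 'Stock', which is truthy, so the loop still runs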
Then I'm not sure you're looping the right way or using column the right way either. Is it correct to step to the next column with column += 1? I don't know pandas, but it seems odd. Also, comparing it to a string may be incorrect.
Your code really needs improvement; you should follow @skyking's advice. I'd like to add that you may want to transpose your dataframe and keep the dates as a variable.
Anyway, let me rephrase what you are looking for, to make sure I got it right: you want to iterate over the columns of your df and, for every column whose name is a date, print('Check'); otherwise nothing happens. Please let us know if this is wrong.
To achieve that, here is a possible approach. You can iterate over the column names and attempt to convert each string to a date, for instance using pd.to_datetime. If successful, it prints a message.
for name in df.columns:
    print(name)  # comment this line after testing
    try:
        pd.to_datetime(name)
    except ValueError:
        pass  # or do something in case the column name is not a date
    else:
        print('Check')
This outputs
Stock
15/12/2015
Check
15/11/2015
Check
15/10/2015
Check
You can see that Check was only printed when the column name was, at least, coercible into a date.
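A vectorized variant of the same idea, as a sketch: pd.to_datetime with errors='coerce' turns anything that is not a date into NaT instead of raising, and dayfirst=True matches the 15/12/2015 format:
import pandas as pd

parsed = pd.to_datetime(df.columns, errors='coerce', dayfirst=True)
for name, date in zip(df.columns, parsed):
    print(name)
    if pd.notna(date):
        print('Check')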

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has r rows and c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev., variance, etc.) and then determine whether the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
  SEQ VAL   KEY
0   1  28  keyA
1   2  15  keyB
2   3  13  keyC
3   4  11  keyD
4   1  12  keyA
5   2  23  keyB
6   3  21  keyC
7   4  15  keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets with:
loop_iter = len(A) // max(A['SEQ'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1) how to split on keys iteratively (i.e. accessing a DF one row at a time), and 2) how to match a DF and do summary statistics, etc., on the rows of a DF that match on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its simplest is that you have a pd.Series of values (i.e. a["key"], which we'll just call keys) that correspond to rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.

    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass
# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(); under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary is made up of the returned Series as rows; and if the applied function returns a DataFrame, the result is a multi-index DataFrame. This is a core pattern in Pandas, and there's a whole, whole lot to explore here.
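As a concrete sketch using the sample df from the question (the min/max range check mirrors what you described; treating the same df as both A and B is just for illustration):
import pandas as pd

df = pd.DataFrame([[1,2,3,4,1,2,3,4],
                   [28,15,13,11,12,23,21,15],
                   ['keyA','keyB','keyC','keyD','keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ', 'VAL', 'KEY']
df['VAL'] = df['VAL'].astype(int)

def descriptive_func(g):
    # summary statistics for all rows sharing one key
    return pd.Series({'min': g['VAL'].min(), 'max': g['VAL'].max()})

summary = df.groupby('KEY').apply(descriptive_func)
# for each row of A, check whether its VAL falls inside its key's range in B
in_range = df.apply(
    lambda r: summary.loc[r['KEY'], 'min'] <= r['VAL'] <= summary.loc[r['KEY'], 'max'],
    axis=1)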

R: comparing two character columns in a big data frame

So, I have a really huge data frame with two columns of characters. The characters are ID values separated by ";". I want to calculate the number of common ID values between these two columns. Here is an example:
             id.x            id.y
1     123;145;156 143;156;234;165
2 134;156;187;675 132;145;156;187
so in this case, the first row has one common value and the second row has two common values.
The table is 60M records, and some of the strings may be more than 1000 characters long. I tried writing the data to a text file and doing the analysis in Python, but the file size is 30GB. Any idea how to do this in R? (regex, apply, ...)
I can find the common values with this command:
intersect(strsplit(df[1,"id.x"], split=";")[[1]], strsplit(df[1,"id.y"], split=";")[[1]])
Therefore, I wrote a function:
myfun <- function(x, y) {
    length(intersect(strsplit(x, split=";")[[1]], strsplit(y, split=";")[[1]]))
}
which works when I try it in a single call, but when I use it with mapply as below, it prints all the columns, and I only want the numbers in the output:
> mapply(FUN=myfun, df[1:2,]$id.x, df[1:2,]$id.y)
123;145;156 134;156;187;675
1 2
So why does it print the first column as well? What is wrong with my command?
mapply returns an integer vector with a names attribute:
y <- mapply(myfun, df$id.x, df$id.y)
str(y)
Named int [1:2] 1 2
- attr(*, "names")= chr [1:2] "123;145;156" "134;156;187;675"
Drop them with USE.NAMES:
mapply(myfun, df$id.x, df$id.y, USE.NAMES=FALSE)
[1] 1 2
Then use an index and test the time on larger and larger subsets of the data:
system.time(y <- mapply(myfun, df[1:1e5,]$id.x, df[1:1e5,]$id.y, USE.NAMES=FALSE))
