Pandas groupby count with conditions - python

Example Data
Given the following data frame:
| feature | gene | target | pos |
|---------|------|--------|-----|
| 1_1_1   | NRAS | AATTGG | 60  |
| 1_1_1   | NRAS | TTGGCC | 6   |
| 1_1_1   | NRAS | AATTGG | 20  |
| 1_1_1   | KRAS | GGGGTT | 0   |
| 1_1_1   | KRAS | GGGGTT | 0   |
| 1_1_1   | KRAS | GGGGTT | 0   |
| 1_1_2   | NRAS | CCTTAA | 2   |
| 1_1_2   | NRAS | GGAATT | 8   |
| 1_1_2   | NRAS | AATTGG | 60  |
The problem
For each feature, I would like to count how many targets appear in each gene, with the following rules:
a. If a target appears at only one position (pos column) within a gene, it gets a count of 1 for every time it is seen.
b. If the same target appears at multiple positions within a gene, each occurrence gets a count of (count at that position / total positions found).
c. Summarize the total counts per gene per feature.
What I've done so far
matches.groupby(["FeatureID", "gene"]).size().reset_index()
matches['multi_mapped'] = np.where(matches.groupby(["FeatureID", "gene", "target"]).pos.transform('nunique') > 1, "T", '')
Which gives me a dataframe where targets that appear at more than one position are flagged as true. Now I just need to figure out how to normalize the counts.
Desired output
| feature | gene | count |
|---------|------|-------|
| 1_1_1   | NRAS | 2     |
| 1_1_1   | KRAS | 1     |
| 1_1_2   | NRAS | 3     |
So in the example above for 1_1_1 NRAS, where AATTGG is found at both position 60 and position 20, each occurrence gets a count of .5. TTGGCC was found once at one position, so it gets a count of 1. That makes a total count of 2.
If, for 1_1_1 NRAS, TTGGCC had been found 3 times at the same position, each of those would get a count of 1, for a total of 3 + .5 + .5 = 4.
The solution needs to check for the same target appearing at different positions and then adjust the counts accordingly, and that is the part I'm having a difficult time with. My ultimate goal is to choose the gene with the highest count per group.

It's not really clear to me why the count on the first row should be 2. Could you try playing around with this:
import pandas as pd

feature = ["1_1_1"] * 6 + ["1_1_2"] * 3
gene = ["NRAS"] * 3 + ["KRAS"] * 3 + ["NRAS"] * 3
target = ["AATTGG", "TTGGCC", "AATTGG"] + ["GGGGTT"] * 3 + ["CCTTAA", "GGGGTT", "AATTGG"]
pos = [60, 6, 20, 0, 0, 0, 2, 8, 60]

df = pd.DataFrame({"feature": feature,
                   "gene": gene,
                   "target": target,
                   "pos": pos})

# count unique (target, pos) pairs per feature/gene
df.groupby(["feature", "gene"])\
  .apply(lambda x: len(x.drop_duplicates(["target", "pos"])))
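On the sample data above this counts the unique (target, pos) pairs per feature/gene, which gives roughly:
feature  gene
1_1_1    KRAS    1
         NRAS    3
1_1_2    NRAS    3
dtype: int64
i.e. KRAS comes out as 1 and both NRAS groups as 3, which is why the expected 2 for 1_1_1/NRAS is unclear to me.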

Okay, I figured it out. If there is a more efficient way to do this, I'm all ears!
# flag targets that are multi-mapped and add flag as new column
matches['multi_mapped'] = np.where(matches.groupby(["FeatureID", "gene", "target"]).pos.transform('nunique') > 1, "T", '')
# separate multi and non multi mapped reads using flag
non = matches[matches["multi_mapped"] != "T"]\
.drop("multi_mapped", axis=1)
multi = matches[matches["multi_mapped"] == "T"]\
.drop("multi_mapped", axis=1)
# add counts to non multi mapped reads
non = non.groupby(["FeatureID", "gene", "target"])\
.count().reset_index().rename(columns={"pos":"count"})
# add counts to multi-mapped reads with normalization
multi["count"] = multi.groupby(["FeatureID", "gene", "target"])\
.transform(lambda x: 1/x.count())
multi.drop("pos", axis=1, inplace=True)
# join the multi and non back together
counts = pd.concat([multi, non], axis=0)
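For reference, a more compact sketch of the same idea, under the assumption that each occurrence of a target should contribute 1 divided by the number of distinct positions that target maps to within a FeatureID/gene (matches is the dataframe from the question):
# weight = 1 / number of distinct positions for this (FeatureID, gene, target)
matches["weight"] = 1 / matches.groupby(["FeatureID", "gene", "target"])["pos"].transform("nunique")
# total normalized count per feature/gene
counts = matches.groupby(["FeatureID", "gene"], as_index=False)["weight"].sum()
# ultimate goal: the gene with the highest count per feature
top_gene = counts.loc[counts.groupby("FeatureID")["weight"].idxmax()]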

Related

Get word counts in strings of words in a list using python

I am starting with a pandas data frame where the first column is made up of comment strings and the other columns are features for single words. For each row I would like to get a count of how many times each word appears in that row's comments cell. I have the list of words (the feature columns) in a list called "wordList" and am trying something like this, but I'm having trouble getting it to work and getting the counts back into the data frame:
def word_count(comments):
    for word in wordList:
        return comment.count(word)

df.comments.apply(word_count)
What I have:
comments | hello | this | is | the | comments | blah |
--------------------------------------------------------------------------
this is the 1st | | | | | |
comments here | | | | | |
--------------------------------------------------------------------------
the 2nd comment | | | | | |
is is here this | | | | | |
What I want:
comments | hello | this | is | the | comments | blah |
--------------------------------------------------------------------------
this is the 1st | 0 | 1 | 2 | 1 | 1 | 0
comments is here| | | | | |
--------------------------------------------------------------------------
the 2nd comment | 0 | 1 | 2 | 2 | 0 | 0
is is here the | | | | | |
1. Convert your comments column into a list of words and explode it.
2. Apply get_dummies; that will tabulate the frequency of occurrence.
3. Reindex with the list of words you wish to check.
4. Aggregate the frequencies and join the result back to the comments column.
Code below:
g = (pd.get_dummies(df1.coments.str.split(r'\s').explode())
       .reindex(columns=['hello', 'this', 'is', 'the', 'comments', 'blah'])
       .fillna(0).astype(int))
pd.DataFrame(df1.iloc[:, 0]).join(g.groupby(level=0).sum())
coments hello this is the comments blah
0 this is the 1st comments here 0 1 1 1 1 0
1 the 2nd comment is is here this 0 1 2 1 0 0
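This works because explode keeps the original row index (each word carries the index of the comment it came from), so groupby(level=0) collapses the dummy rows back into one row per comment before the join.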
You can use str.extractall to extract (only) the words in your word list, then use value_counts:
pattern = '|'.join(word_list)
(df.comments.str.extractall(rf'\b({pattern})\b')[0]
.groupby(level=0).value_counts()
.unstack(fill_value=0)
.reindex(word_list, axis=1, fill_value=0)
)
Output (note that this also has a column named comments as in the original dataframe)
0 hello this is the comments blah
0 0 1 1 1 1 0
1 0 1 2 1 0 0
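This also leans on the index: str.extractall returns one row per match with a MultiIndex whose first level is the original row number, which is what groupby(level=0) aggregates over.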

Pandas group data until first occurrence of a pattern

I have a dataframe that contains accidents of cars, they can be 'L' for light or 'S' for strong:
|car_id|type_acc|datetime_acc|
------------------------------
| 1 | L | 2020-01-01 |
| 1 | L | 2020-01-05 |
| 1 | S | 2020-01-07 |
| 1 | L | 2020-01-09 |
| 2 | L | 2020-01-04 |
| 2 | L | 2020-01-10 |
| 2 | L | 2020-01-12 |
I would like to get a column that counts until the first 'S' and then divides the number of days between the max and min 'L' by the number of occurrences, so the output is:
|car_id|freq_acc|
-----------------
| 1 | 2 | # 4 days (from 1 to 5) / 2 number of 'L' before first 'S'
| 2 | 8 | # 8 days(from 4 to 12) and no 'S'
Is such a thing possible? Thanks
You could use np.ptp to compute the max-min difference:
# find first by S by car_id
df['eq_s'] = df.groupby('car_id')['type_acc'].transform(lambda x: x.eq('S').cumsum())
# compute stats based on previous computation, keeping only the first group
groups = df[df['eq_s'].eq(0)].groupby(['car_id']).agg({'datetime_acc': np.ptp}).reset_index()
# rename
res = groups.rename(columns={'datetime_acc': 'freq_acc'})
print(res)
Output
car_id freq_acc
0 1 4 days
1 2 8 days
Before anything, make sure that datetime_acc is a datetime column by doing:
df['datetime_acc'] = pd.to_datetime(df['datetime_acc'])
The first step:
# find first by S by car_id
df['eq_s'] = df.groupby('car_id')['type_acc'].transform(lambda x: x.eq('S').cumsum())
will create a new column where the only values of interest are the 0s, i.e. the rows before the first S. In the second step we keep only those rows and perform a standard groupby:
# compute stats based on previous computation, keeping only the first group
groups = df[df['eq_s'].eq(0)].groupby(['car_id']).agg({'datetime_acc': np.ptp}).reset_index()
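For completeness, a minimal end-to-end sketch of the steps above on the example data (the frame construction below is mine, copied from the question's table):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "car_id": [1, 1, 1, 1, 2, 2, 2],
    "type_acc": ["L", "L", "S", "L", "L", "L", "L"],
    "datetime_acc": ["2020-01-01", "2020-01-05", "2020-01-07",
                     "2020-01-09", "2020-01-04", "2020-01-10", "2020-01-12"],
})
df["datetime_acc"] = pd.to_datetime(df["datetime_acc"])

# rows before each car's first 'S' have eq_s == 0
df["eq_s"] = df.groupby("car_id")["type_acc"].transform(lambda x: x.eq("S").cumsum())

# max-min date difference over those rows only
res = (df[df["eq_s"].eq(0)]
       .groupby("car_id")
       .agg({"datetime_acc": np.ptp})
       .reset_index()
       .rename(columns={"datetime_acc": "freq_acc"}))
print(res)  # freq_acc: 4 days for car 1, 8 days for car 2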

Calculate lag difference in group

I am trying to solve a problem using just SQL (I am able to do this when combining SQL and Python).
Basically what I want to do is calculate score changes per candidate, where a score is computed by joining a score lookup table and then summing the individual event scores. If a candidate fails, they are required to retake the events. Here is an example output:
| brandi_id | retest | total_score |
|-----------|--------|-------------|
| 1 | true | 128 |
| 1 | false | 234 |
| 2 | true | 200 |
| 2 | false | 230 |
| 3 | false | 265 |
What I want is to calculate a score change only for those candidates who took a retest, where the score change is just the difference in total_score for retest = true minus retest = false:
| brandi_id | difference |
|-----------|------------|
| 1 | 106 |
| 2 | 30 |
This is the SQL that I am using (with this, I still need to use Python):
select e.brandi_id, e.retest, sum(sl.scaled_score) as total_score
from event as e
left join apf_score_lookup as sl
on sl.asmnt_code = e.asmnt_code
and sl.raw_score = e.score
where e.asmnt_code in ('APFPS','APFSU','APF2M')
group by e.brandi_id, e.retest
order by e.brandi_id;
I think the solution involves using LAG and PARTITION but I cannot get it. Thanks!
If someone does the retest only once, then you can use a join:
select tc.*, tr.score, (tc.score - tr.score) as diff
from t tc join
t tr
on tc.brandi_id = tr.brandi_id and
tc.retest = 'true' and tr.retest = 'false';
You don't describe your table layout. If the results are from the query in your question, you can just plug that in as a CTE.

How to identify the first occurrence of a condition in a Python data frame and perform a calculation on it?

I am really struggling to come up with the logic for this. I have a data set with a column called Col, as shown below. I am using Python and Pandas.
I want to add a new column called "STATUS". The logic is:
a. When Col == 0, I will Buy. But this Buy will happen only when Col == 0 is the first value in the data set or comes after a Sell in the Status column. There cannot be two Buy values without a Sell in between.
b. When Col <= -8, I will Sell. But this will happen only if there is a Buy preceding it in the Status column. There cannot be two Sells without a Buy in between them.
I have provided an example of the output I want. Any help is really appreciated.
Here the raw data is in the Col column and the output I want is in the Status column:
+-------+--------+
| Col | Status |
+-------+--------+
| 0 | Buy |
| -1.41 | 0 |
| 0 | 0 |
| -7.37 | 0 |
| -8.78 | Sell |
| -11.6 | 0 |
| 0 | Buy |
| -5 | 0 |
| -6.1 | 0 |
| -8 | Sell |
| -11 | 0 |
| 0 | Buy |
| 0 | 0 |
| -9 | Sell |
+-------+--------+
Took me some time.
It relies on the following property: the last order you can see from now, even if you chose not to send it, is always the last decision that you took. (Otherwise it would have been sent.)
# +1 where the raw value is 0 (potential Buy), -1 where it is <= -8 (potential Sell), 0 otherwise
df['order'] = (df['order'] == 0).astype(int) - (df['order'] <= -8).astype(int)
# keep only the rows that carry a potential order
orders_no_filter = df.loc[df['order'] != 0, 'order']
# a potential order is only valid if it differs from the previous potential order
possible = (orders_no_filter != orders_no_filter.shift(1))
# zero out the invalid orders
df['order'] = df['order'] * possible.reindex(df.index, fill_value=0)
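To get the literal Buy/Sell labels from the question, here is a small self-contained sketch: the Col values are copied from the example, while the signal/Status names and the final mapping are my own additions.
import pandas as pd

df = pd.DataFrame({"Col": [0, -1.41, 0, -7.37, -8.78, -11.6, 0,
                           -5, -6.1, -8, -11, 0, 0, -9]})

# same logic as above, applied directly to Col
signal = (df["Col"] == 0).astype(int) - (df["Col"] <= -8).astype(int)
candidates = signal[signal != 0]
possible = candidates != candidates.shift(1)
signal = signal * possible.reindex(df.index, fill_value=False)

# map +1/-1/0 to the labels used in the desired output
df["Status"] = signal.map({1: "Buy", -1: "Sell", 0: 0})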

PySpark - Split/Filter DataFrame by column's values

I have a DataFrame similar to this example:
Timestamp | Word | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
... | ... | ...
and I want to split this data frame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:
DF1
Timestamp | Word | Count
30/12/2015 | example_1 | 3
DF2
Timestamp | Word | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
DF3
Timestamp | Word | Count
27/12/2015 | example_3 | 7
Is there a way to do this with PySpark (1.6)?
It won't be efficient but you can map with filter over the list of unique values:
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]
Post Spark 2.0
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
In addition to what zero323 said, I would add
word.persist()
before the creation of the dfs, so the "words" dataframe won't need to be recomputed each time you run an action on one of your "dfs".
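For what it's worth, in the snippet above words ends up as a plain Python list after collect(), so what usually benefits from caching is the source DataFrame itself. A rough sketch (Spark 2.x API; converting each split with toPandas() is just one way to feed the plots):
# cache the source DataFrame so each filtered split does not recompute it from scratch
df.persist()

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]

# e.g. pull each split to the driver as a pandas DataFrame for plotting
pandas_frames = {w: split.toPandas() for w, split in zip(words, dfs)}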
