Update 5:
This feature has been released as part of pandas 0.20.1 (on my birthday :] )
Update 4:
PR has been merged!
Update 3:
The PR has moved here
Update 2:
It seems like this question may have contributed to re-opening the PR for IntervalIndex in pandas.
Update:
I no longer have this problem: I'm now querying for overlapping ranges from A and B, not for points from B that fall within ranges in A, which is a full interval-tree problem. I won't delete the question, though, because I think it's still valid and I don't have a good answer.
Problem statement
I have two dataframes.
In dataframe A, two of the integer columns taken together represent an interval.
In dataframe B, one integer column represents a position.
I'd like to do a sort of join, such that points are assigned to each interval they fall within.
Intervals are rarely but occasionally overlapping. If a point falls within that overlap, it should be assigned to both intervals. About half of points won't fall within an interval, but nearly every interval will have at least one point within its range.
What I've been thinking
I was initially going to dump my data out of pandas and use intervaltree, banyan, or maybe bx-python, but then I came across this gist. It turns out that the ideas shoyer has in there never made it into pandas, but it got me thinking: it might be possible to do this within pandas, and since I want this code to be as fast as Python can possibly go, I'd rather not dump my data out of pandas until the very end. I also get the feeling that this is possible with bins and pandas' cut function, but I'm a total newbie to pandas, so I could use some guidance. Thanks!
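For what it's worth, one way to stay inside pandas is an IntervalIndex built from the two bound columns, checking each point against it. This is a minimal sketch with made-up column names (`start`, `end`, `pos`) and a naive O(points × intervals) loop, not a tuned implementation; note a point in an overlap is assigned to both intervals, and unmatched points simply produce no rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: intervals in A (start/end), points in B (pos)
a = pd.DataFrame({'start': [1, 5, 8], 'end': [4, 10, 12]})
b = pd.DataFrame({'pos': [2, 9, 30]})

iv = pd.IntervalIndex.from_arrays(a['start'], a['end'], closed='both')

# For each point, collect every interval that contains it
pairs = [(bi, ai)
         for bi, pos in b['pos'].items()
         for ai in np.flatnonzero(iv.contains(pos))]
joined = pd.DataFrame(pairs, columns=['point_row', 'interval_row'])
# point 9 matches both [5, 10] and [8, 12]; point 30 matches nothing
```

For many rows, an interval-tree library (or pyranges, as in the answer below) will be far faster than this per-point scan.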
Notes
Potentially related? Pandas DataFrame groupby overlapping intervals of variable length
This feature was released as part of pandas 0.20.1
Answer using pyranges, which is basically pandas sprinkled with bioinformatics sugar.
Setup:
import numpy as np
np.random.seed(0)
import pyranges as pr
a = pr.random(int(1e6))
# +--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand |
# | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------|
# | chr1 | 8830650 | 8830750 | + |
# | chr1 | 9564361 | 9564461 | + |
# | chr1 | 44977425 | 44977525 | + |
# | chr1 | 239741543 | 239741643 | + |
# | ... | ... | ... | ... |
# | chrY | 29437476 | 29437576 | - |
# | chrY | 49995298 | 49995398 | - |
# | chrY | 50840129 | 50840229 | - |
# | chrY | 38069647 | 38069747 | - |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
b = pr.random(int(1e6), length=1)
# +--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand |
# | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------|
# | chr1 | 52110394 | 52110395 | + |
# | chr1 | 122640219 | 122640220 | + |
# | chr1 | 162690565 | 162690566 | + |
# | chr1 | 117198743 | 117198744 | + |
# | ... | ... | ... | ... |
# | chrY | 45169886 | 45169887 | - |
# | chrY | 38863683 | 38863684 | - |
# | chrY | 28592193 | 28592194 | - |
# | chrY | 29441949 | 29441950 | - |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:
result = a.join(b, strandedness="same")
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand | Start_b | End_b | Strand_b |
# | (category) | (int32) | (int32) | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------+-----------+-----------+--------------|
# | chr1 | 227348436 | 227348536 | + | 227348516 | 227348517 | + |
# | chr1 | 18901135 | 18901235 | + | 18901191 | 18901192 | + |
# | chr1 | 230131576 | 230131676 | + | 230131636 | 230131637 | + |
# | chr1 | 84829850 | 84829950 | + | 84829903 | 84829904 | + |
# | ... | ... | ... | ... | ... | ... | ... |
# | chrY | 44139791 | 44139891 | - | 44139821 | 44139822 | - |
# | chrY | 51689785 | 51689885 | - | 51689859 | 51689860 | - |
# | chrY | 45379140 | 45379240 | - | 45379215 | 45379216 | - |
# | chrY | 37469479 | 37469579 | - | 37469576 | 37469577 | - |
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 16,153 rows and 7 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
df = result.df
# Chromosome Start End Strand Start_b End_b Strand_b
# 0 chr1 227348436 227348536 + 227348516 227348517 +
# 1 chr1 18901135 18901235 + 18901191 18901192 +
# 2 chr1 230131576 230131676 + 230131636 230131637 +
# 3 chr1 84829850 84829950 + 84829903 84829904 +
# 4 chr1 189088140 189088240 + 189088163 189088164 +
# ... ... ... ... ... ... ... ...
# 16148 chrY 38968068 38968168 - 38968124 38968125 -
# 16149 chrY 44139791 44139891 - 44139821 44139822 -
# 16150 chrY 51689785 51689885 - 51689859 51689860 -
# 16151 chrY 45379140 45379240 - 45379215 45379216 -
# 16152 chrY 37469479 37469579 - 37469576 37469577 -
#
# [16153 rows x 7 columns]
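Once the joined result is back in pandas, per-interval aggregations are plain groupbys. A small sketch with a toy stand-in for `result.df` (same column names as above), counting how many points landed in each interval:

```python
import pandas as pd

# Toy stand-in for the joined frame `result.df` above
df = pd.DataFrame({'Chromosome': ['chr1', 'chr1', 'chr1'],
                   'Start': [100, 100, 500],
                   'End': [200, 200, 600],
                   'Start_b': [150, 160, 550]})

# Number of points that landed in each interval
counts = df.groupby(['Chromosome', 'Start', 'End']).size()
```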
Related
I have this massive dataset and I need to subset the data by using criteria. This is for illustration:
| Group | Name | Value |
|--------------------|-------------|-----------------|
| A | Bill| 256 |
| A | Jack| 268 |
| A | Melissa| 489 |
| B | Amanda | 787 |
| B | Eric| 485 |
| C | Matt| 1236 |
| C | Lisa| 1485 |
| D | Ben | 785 |
| D | Andrew| 985 |
| D | Cathy| 1025 |
| D | Suzanne| 1256 |
| D | Jim| 1520 |
I know how to handle this problem manually, such as:
import pandas as pd
df=pd.read_csv('Test.csv')
A = df[df.Group == "A"].to_numpy()
B = df[df.Group == "B"].to_numpy()
C = df[df.Group == "C"].to_numpy()
D = df[df.Group == "D"].to_numpy()
But considering the size of the data, it will take a lot of time if I handle it in this way.
With that in mind, I would like to know if it is possible to build an iteration with an IF statement that looks at the values in the "Group" column (table above). I was thinking of an IF statement that checks whether the first value is the same as the one below it; if so, group them and create a new array/dataframe.
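For reference, a single `groupby` pass builds every per-group array at once instead of scanning the frame once per group. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'C', 'C'],
                   'Name': ['Bill', 'Jack', 'Amanda', 'Matt', 'Lisa'],
                   'Value': [256, 268, 787, 1236, 1485]})

# One pass over the data: group label -> NumPy array of that group's rows
arrays = {g: sub.to_numpy() for g, sub in df.groupby('Group')}
```

`arrays['A']` then holds group A's rows without any explicit IF logic.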
I have two pandas datasets
old:
| alpha | beta | zeta | id | rand | numb|
| ------ | ------------------ | ------------| ------ | --- -| ----|
| 1 | LA | bev | A100 | D | 100 |
| 1 | LA | malib | C150 | Z | 150 |
| 2 | NY | queens | B200 | N | 200 |
| 2 | NY | queens | B200 | N | 200 |
| 3 | Chic | lincpark | E300 | T | 300 |
| 3 | NY | Bronx | F300 | M | 300 |
new:
| alpha | beta | zeta | id | numb |
| ------ | ------------------ | ---------------| ------| -----|
| 1 | LA | Hwood | Q | Q400 |
| 2 | NY | queens | B | B200 |
| 3 | Chic | lincpark | D | D300 |
(Columns and data don't mean anything in particular, just an example).
I want to merge datasets in a way such that
IF old.alpha, old.beta, and old.zeta = their corresponding new columns and If old.id = new.numb, you only keep the entry from the old table. (in this case the row 2 on the old with queens would be kept as opposed to row 2 on new with queens)
Note that rows 3 and 4 on old are the same, but we still keep both. If there were 2 duplicates of these rows in new we consider them as 1-1 corresponding. If maybe there were 3 duplicates on new of rows 3 and 4 on old, then 2 are considered copies (and we don't add them, but we would add the third when we merge them)
IF old.alpha, old.beta, and old.zeta = their corresponding new columns and If old.numb is contained inside new.numb, you only keep the entry from the old table. (in this case the row 5 on the old with lincpark would be kept as opposed to row 3 on new with lincpark, because 300 is contained in new.numb)
Otherwise add the new data as new data, keeping the new table's id and numb, and having null for any extra columns that the old table has (new's row 1 with hollywood)
I have tried various merging methods along with the drop_duplicates method. The problem with the latter is that I attempted to drop duplicates having the same alpha, beta, and zeta, but often they were deleted from the same data source, because the rows were exactly the same.
This is what ultimately needs to be shown when merging. 2 of the rows in new were duplicates, one was something to be added.
| alpha | beta | zeta | id | rand | numb|
| ------ | ------------------ | ------------| ------ | --- -| ----|
| 1 | LA | bev | A100 | D | 100 |
| 1 | LA | malib | C150 | Z | 150 |
| 2 | NY | queens | B200 | N | 200 |
| 2 | NY | queens | B200 | N | 200 |
| 3 | Chic | lincpark | E300 | T | 300 |
| 3 | NY | Bronx | F300 | M | 300 |
| 1 | LA | Hwood | Q | | Q400|
We can merge two DataFrames in several ways. The most common way in Python is the merge operation in pandas.
Assuming df1 is your new table and df2 is the old one, follow the merge with your IF conditions.
import pandas
dfinal = df1.merge(df2, on="alpha", how = 'inner')
For merging on columns of different dataframes, you may specify the left and right column names explicitly, especially when the same column goes by two different names, say 'idold' and 'idnew'.
dfinal = df1.merge(df2, how='inner', left_on='alpha', right_on='id')
If you want to be even more specific, you may read the documentation of the pandas merge operation.
Also specify the IF conditions and perform the merge operations row-wise, then drop the remaining columns in a temporary dataframe and add values to it according to your conditions.
I understand the answer is a little complex, but so is your question. Cheers :)
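One concrete way to express "keep old rows, and add only genuinely new ones" is a left merge with `indicator=True` followed by an anti-join filter. A sketch on made-up data, deliberately ignoring the id/numb containment rules from the question, which would need extra logic on top:

```python
import pandas as pd

old = pd.DataFrame({'alpha': [1, 2], 'beta': ['LA', 'NY'],
                    'zeta': ['bev', 'queens'], 'id': ['A100', 'B200'],
                    'rand': ['D', 'N'], 'numb': [100, 200]})
new = pd.DataFrame({'alpha': [1, 2], 'beta': ['LA', 'NY'],
                    'zeta': ['Hwood', 'queens'], 'id': ['Q', 'B'],
                    'numb': ['Q400', 'B200']})

keys = ['alpha', 'beta', 'zeta']
# _merge == 'left_only' marks rows of `new` with no key match in `old`
extra = (new.merge(old[keys].drop_duplicates(), on=keys,
                   how='left', indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns='_merge'))
merged = pd.concat([old, extra], ignore_index=True)
# old's rows survive; only the Hwood row is added, with rand left as NaN
```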
I have 2 views as below:
experiments:
select * from experiments;
+--------+--------------------+-----------------+
| exp_id | exp_properties | value |
+--------+--------------------+-----------------+
| 1 | indicator:chemical | phenolphthalein |
| 1 | base | NaOH |
| 1 | acid | HCl |
| 1 | exp_type | titration |
| 1 | indicator:color | faint_pink |
+--------+--------------------+-----------------+
calculations:
select * from calculations;
+--------+------------------------+--------------+
| exp_id | exp_report | value |
+--------+------------------------+--------------+
| 1 | molarity:base | 0.500000000 |
| 1 | volume:acid:in_ML | 23.120000000 |
| 1 | volume:base:in_ML | 5.430000000 |
| 1 | moles:H | 0.012500000 |
| 1 | moles:OH | 0.012500000 |
| 1 | molarity:acid | 0.250000000 |
+--------+------------------------+--------------+
I managed to pivot each of these views individually as below:
experiments_pivot:
+-------+--------------------+------+------+-----------+----------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|
+-------+--------------------+------+------+-----------+----------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink |
+-------+--------------------+------+------+-----------+----------------+
calculations_pivot:
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
|exp_id | molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
| 1 | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
My question is how to get these two pivot results as a single row? Desired result is as below:
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
Database Used: Mysql
Important note: each of these views can have a growing number of rows, hence I considered "dynamic pivoting" for each view individually.
For reference -- Below is a prepared statement I used to pivot experiments in MySQL(and a similar statement to pivot the other view as well):
set @sql = NULL;
SELECT
  GROUP_CONCAT(DISTINCT
    CONCAT(
      'MAX(IF(exp_properties = ''',
      exp_properties,
      ''', value, NULL)) AS ',
      CONCAT('`', exp_properties, '`')
    )
  ) INTO @sql
FROM experiments;
set @sql = CONCAT(
  'SELECT exp_id, ',
  @sql,
  ' FROM experiments GROUP BY exp_id'
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
I have two dataframes.
First dataframe: df_json
+------------+-----------------+-----------+------------+
| chromosome | ensembl_id | gene_end | gene_start |
+------------+-----------------+-----------+------------+
| 7 | ENSG00000122543 | 5886362 | 5879827 |
| 12 | ENSG00000111325 | 122980043 | 122974580 |
| 17 | ENSG00000181396 | 82418637 | 82389223 |
| 6 | ENSG00000119900 | 71308950 | 71288803 |
| 9 | ENSG00000106809 | 92404696 | 92383967 |
+------------+-----------------+-----------+------------+
Second dataframe: df
+------------+-----------------+-----------+------------+
| rs_id | variant | gene_id | chromosome |
+------------+-----------------+-----------+------------+
| rs13184706 | 5:43888254:C:T | 43888254| 5 |
| rs58824264 | 5:43888493:C:T | 43888493| 5 |
+------------+-----------------+-----------+------------+
I want to iterate through df_json and for each row in df_json, select the rows from df whose gene_id is in range (gene_start, gene_end) and df['chromosome'] == df_json['chromosome']. Also, I need to create a new column in the resulting dataframe which has the ensembl_id from df_json.
I am able to achieve the same using the code below but it is very slow. I need a faster way to do this as I need to execute this on millions of rows.
result_df = []
for row in df_json.itertuples():
    gene_end, gene_start = row[3], row[4]
    gene = df.loc[(df['gene_id'].between(gene_start, gene_end, inclusive=True)) & (df['chromosome'] == row[1])]
    gene['ensembl_id'] = row[2]
    result_df.append(gene)
print(result_df[0])
You should avoid iterating pandas dataframe rows where possible, as this is inefficient and less readable.
You can implement your logic using pd.DataFrame.merge and pd.Series.between. I have changed the data in your example to make it more interesting.
import pandas as pd
df_json = pd.DataFrame({'chromosome': [7, 12, 17, 6, 9],
'ensembl_id': ['ENSG00000122543', 'ENSG00000111325', 'ENSG00000181396',
'ENSG00000119900', 'ENSG00000106809'],
'gene_end': [5886362, 122980043, 82418637, 71308950, 92404696],
'gene_start': [5879827, 122974580, 82389223, 71288803, 92383967]})
df = pd.DataFrame({'rs_id': ['rs13184706', 'rs58824264'],
'variant': ['5:43888254:C:T', '5:43888493:C:T'],
'gene_id': [5880000, 43888493],
'chromosome': [7, 9]})
res = df_json.merge(df, how='left', on='chromosome')
res = res[res['gene_id'].between(res['gene_start'], res['gene_end'])]
print(res)
# chromosome ensembl_id gene_end gene_start gene_id rs_id \
# 0 7 ENSG00000122543 5886362 5879827 5880000.0 rs13184706
# variant
# 0 5:43888254:C:T
Use pyranges for large datasets. It is very efficient and fast:
import pyranges as pr
c = """Chromosome ensembl_id End Start
7 ENSG00000122543 5886362 5879827
12 ENSG00000111325 122980043 122974580
17 ENSG00000181396 82418637 82389223
5 MadeUp 43889000 43888253
6 ENSG00000119900 71308950 71288803
9 ENSG00000106809 92404696 92383967"""
c2 = """rs_id variant Start End Chromosome
rs13184706 5:43888254:C:T 43888254 43888256 5
rs58824264 5:43888493:C:T 43888493 43888494 5"""
gr = pr.from_string(c)
gr2 = pr.from_string(c2)
j = gr.join(gr2)
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# | Chromosome | ensembl_id | End | Start | rs_id | variant | Start_b | End_b |
# | (category) | (object) | (int32) | (int32) | (object) | (object) | (int32) | (int32) |
# |--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------|
# | 5 | MadeUp | 43889000 | 43888253 | rs13184706 | 5:43888254:C:T | 43888254 | 43888256 |
# | 5 | MadeUp | 43889000 | 43888253 | rs58824264 | 5:43888493:C:T | 43888493 | 43888494 |
# +--------------+--------------+-----------+-----------+------------+----------------+-----------+-----------+
# Unstranded PyRanges object has 2 rows and 8 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df # as pandas df
Basically what I want is to switch every number above, below, to the left, and to the right of my number 1, in each iteration. Something like this (consider `#` the number 1 and a blank the number 0):
---------------------
| | | | | |
---------------------
| | | | | |
---------------------
| | | # | | |
---------------------
| | | | | |
---------------------
| | | | | |
---------------------
---------------------
| | | | | |
---------------------
| | | # | | |
---------------------
| | # | # | # | |
---------------------
| | | # | | |
---------------------
| | | | | |
---------------------
---------------------
| | | # | | |
---------------------
| | # | # | # | |
---------------------
| # | # | # | # | # |
---------------------
| | # | # | # | |
---------------------
| | | # | | |
---------------------
---------------------
| | # | # | # | |
---------------------
| # | # | # | # | # |
---------------------
| # | # | # | # | # |
---------------------
| # | # | # | # | # |
---------------------
| | # | # | # | |
---------------------
---------------------
| # | # | # | # | # |
---------------------
| # | # | # | # | # |
---------------------
| # | # | # | # | # |
---------------------
| # | # | # | # | # |
---------------------
| # | # | # | # | |
---------------------
Everything goes well until I can't switch the number zero in position (4,4).
I can't seem to find what is wrong. I know the code is sloppy, but what am I doing that is stopping the last number from changing?
linhas = 5
colunas = 5
matrix = []
for i in range(linhas):
    linha1 = []
    for j in range(colunas):
        linha1.append(0)
    matrix.append(linha1)
pl = []
cont = 0
matrix[2][2] = 1
while len(pl) != colunas*linhas:
    cont += 1
    pl = []
    print(matrix)
    for i in range(len(matrix)):
        for j in range(len(matrix[i])):
            if matrix[i][j] == 1:
                lista = [i, j]
                pl.append(lista)
    for l in range(len(pl)):
        i = pl[l][0]
        j = pl[l][1]
        # this is wrong, now corrected below
        # if (j>=0 and j+1<=len(matrix)-1) and (i>=0 and i+1<=len(matrix)-1):
        #     if matrix[i][j-1]==0:
        #         matrix[i][j-1]=1
        #     if matrix[i][j+1]==0:
        #         matrix[i][j+1]=1
        #     if matrix[i-1][j]==0:
        #         matrix[i-1][j]=1
        #     if matrix[i+1][j]==0:
        #         matrix[i+1][j]=1
        # correction: remove the commented block above and substitute this:
        if j-1 >= 0:
            if matrix[i][j-1] == 0:
                matrix[i][j-1] = 1
        if j+1 < colunas:
            if matrix[i][j+1] == 0:
                matrix[i][j+1] = 1
        if i-1 >= 0:
            if matrix[i-1][j] == 0:
                matrix[i-1][j] = 1
        if i+1 < linhas:
            if matrix[i+1][j] == 0:
                matrix[i+1][j] = 1
j is a column index; i is a row index.
j >= 0 is true for all valid column indices. (But note that matrix[i][j-1] could be matrix[i][-1] if j is 0, which wraps around to the last column. That may not be what you want to happen...)
j+1 <= len(matrix)-1 means the column index has to be one less than the maximum possible column index, at most.
So (j >= 0 and j+1 <= len(matrix)-1) means j cannot be in the last column.
Similarly, (i >= 0 and i+1 <= len(matrix)-1) means i cannot be in the last row.
To flip matrix[4][4] from 0 to 1, (i, j) must be either (4, 3) or (3, 4), but both of these tuples are rejected by the if-condition, since (4, 3) is in the last row and (3, 4) is in the last column.
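One way to avoid the four per-direction checks entirely is a small helper that yields only in-bounds neighbors. A sketch, with `linhas`/`colunas` standing in for the grid dimensions from the question:

```python
linhas = colunas = 5

def neighbors(i, j):
    # Yield only the orthogonal neighbors that lie inside the grid
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < linhas and 0 <= nj < colunas:
            yield ni, nj
```

The corner cell (4, 4) then gets exactly its two valid neighbors, (3, 4) and (4, 3), and the original loop body reduces to `for ni, nj in neighbors(i, j): matrix[ni][nj] = 1`.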