Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have already tried.
I am trying to join (merge) two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and GREATER THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods are taking an extremely long time: one uses the pandasql module, while the other loops through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This approach doesn't need any SQL queries. I've also included a section to immediately identify any redundant genes that don't have any SNPs falling within their range. It makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly, it does run well enough that I can actually get some answers. I'd still like to know if anyone has tips to make it more efficient, though.
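One further idea, sketched below under the assumption that the gene intervals within a chromosome do not overlap (overlapping intervals would make get_indexer raise and need a different lookup), is to build a pd.IntervalIndex of the gene regions per chromosome and look all SNP positions up in one vectorised call:

all_dfs = []
for chromosome, chr_snps in snp_df.groupby('chromosome'):
    chr_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # One closed interval per gene; closed='both' mirrors the >= / <= checks above.
    intervals = pd.IntervalIndex.from_arrays(chr_genes['chr_start'],
                                             chr_genes['chr_stop'],
                                             closed='both')
    # For each SNP position, the position of the containing interval, or -1 if none.
    pos = intervals.get_indexer(chr_snps['BP'])
    hit = pos != -1
    matched = chr_snps.loc[hit, ['SNP']].copy()
    matched['feature_id'] = chr_genes['feature_id'].values[pos[hit]]
    all_dfs.append(matched)
all_genic_snps = pd.concat(all_dfs, ignore_index=True)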
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
I have many different tables that all have different column names, and each refers to an outcome such as glucose, insulin, leptin etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example (ignore that the function makes little sense). The code below works, but instead of copy-pasting final_report["outcome"] = over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})

ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]

final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()

final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}

for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
I have the following DataFrame:
open high low close volume
0 62.8571 63.9285 62.7714 63.5642 82641944.0
1 63.6642 64.9285 63.5014 64.5114 88379522.0
2 61.7014 63.6857 61.4428 63.2757 112681030.0
3 62.5928 63.6399 62.0285 62.8085 113921367.0
4 63.4357 64.0499 62.6028 63.0505 110727309.0
.. .. .. .. .. ..
And currently I have the following code to generate a "bool" (0, 1, -1) Series depending on multiple conditions, selecting rows 2 by 2 (in other cases I will need 3 or 4 rows in each iteration/calculation):
def check_pattern(data):
    engulfed_bar_range = data.iloc[-2]['close'] - data.iloc[-2]['open']
    if abs(engulfed_bar_range) >= params:
        if engulfed_bar_range > 0:
            return -1*((data.iloc[-1]['open'] > data.iloc[-2]['close']) and
                       (data.iloc[-1]['close'] < data.iloc[-2]['open']))
        else:
            return +1*((data.iloc[-1]['open'] < data.iloc[-2]['close']) and
                       (data.iloc[-1]['close'] > data.iloc[-2]['open']))
    return False

res = []
for index in range(1, len(all_data)):
    data = all_data.iloc[index-1:index+1]
    res.append(check_pattern(data))
s = pd.Series(res)
Is there any better/easier/better-performing way of doing this? In some other cases similar to this, where I only needed the data of one column of the DataFrame, I have used df.rolling(..), but in this case, where I need data from several columns, I don't know how to do it. Maybe there is some numpy function I can use? Or pd.eval? (I have tried, but I haven't been able to get what I want.)
Thanks so much in advance for your help.
Graphical explanation of what I'm looking for in the df:
I want a pd.Series with +1 when there is a Bullish Engulfing pattern and -1 when there is a Bearish Engulfing pattern, and 0 if there is no pattern at those indexes.
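For reference, a minimal vectorised sketch of the same two-row comparisons using shift(), assuming params is the same minimum body size used in check_pattern above (the result is aligned to all_data.index, with 0 for the first row, rather than being a shorter list starting from 0):

prev_open = all_data['open'].shift(1)
prev_close = all_data['close'].shift(1)
body = prev_close - prev_open  # engulfed_bar_range of the previous row

big_enough = body.abs() >= params
bearish = (body > 0) & (all_data['open'] > prev_close) & (all_data['close'] < prev_open)
bullish = (body < 0) & (all_data['open'] < prev_close) & (all_data['close'] > prev_open)

s = pd.Series(0, index=all_data.index)
s[big_enough & bearish] = -1
s[big_enough & bullish] = 1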
I have two sets of csv data. One contains two columns (time and a boolean flag), and the other contains some info I would like to display visually with some graphing functions. The data is sampled at different frequencies, so the number of rows may not match between the datasets. How do I plot individual graphs for each range of data where the boolean is true?
Here is what the contact data looks like:
INDEX | TIME | CONTACT
0 | 240:18:59:31.750 | 0
1 | 240:18:59:32.000 | 0
2 | 240:18:59:32.250 | 0
........
1421 | 240:19:05:27.000 | 1
1422 | 240:19:05:27.250 | 1
The other (Vehicle) data isn't really important, but it contains values like Weight, Speed (MPH), Pedal Position etc.
I have many separate large excel files, and because the shapes do not match I am unsure how to slice the data using the time flags, so I made the function below to create the ranges. I am thinking this can be done in an easier manner.
Here is the working code (with output below). In short, is there an easier way to do this?
def determineContactSlices(data):
    contactStart = None
    contactEnd = None
    slices = pd.DataFrame([])
    for index, row in data.iterrows():
        if row['CONTACT'] == 1:
            # begin slice
            if contactStart is None:
                contactStart = index
                continue
            else:
                # still valid, move onto next
                continue
        elif row['CONTACT'] == 0:
            if contactStart is not None:
                contactEnd = index - 1
                # create slice and add the df to list
                slice = data[contactStart:contactEnd]
                print(slice)
                slices = slices.append(slice)
                # then reset everything
                slice = None
                contactStart = None
                contactEnd = None
                continue
            else:
                # move onto next row
                continue
    return slices
Output: ([15542 rows x 2 columns])
Index Time CONTACT
1421 240:19:05:27.000 1
1422 240:19:05:27.250 1
1423 240:19:05:27.500 1
1424 240:19:05:27.750 1
1425 240:19:05:28.000 1
1426 240:19:05:28.250 1
... ...
56815 240:22:56:15.500 1
56816 240:22:56:15.750 1
56817 240:22:56:16.000 1
56818 240:22:56:16.250 1
56819 240:22:56:16.500 1
With this output I intend to loop through each time slice and display the Vehicle Data in subplots.
Any help or guidance would be much appreciated (:
UPDATE:
I believe I can just do filteredData = vehicleData[contactData['CONTACT'] == 1], but then I am faced with how to graph each range individually when there is a disconnect. For example, if there are 7 connections at various times and lengths, I would like to have 7 individual plots.
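A minimal sketch of one way to split those connections apart, assuming contactData is the contact dataframe referenced above: label each contiguous run of the CONTACT flag with a shift/cumsum trick and group on it, giving one chunk per connection to plot.

# Label each contiguous run of identical CONTACT values.
runs = (contactData['CONTACT'] != contactData['CONTACT'].shift()).cumsum()

# Keep only the CONTACT == 1 rows and iterate over the runs; each chunk is
# one uninterrupted contact window whose index range can be used to pull
# the matching rows of vehicle data for its own subplot.
for run_id, chunk in contactData[contactData['CONTACT'] == 1].groupby(runs):
    print(run_id, chunk.index.min(), chunk.index.max())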
I think what you are trying to do is relatively simple, although I am not sure if I understand the output that you want or what you want to do with it after you have it. For example:
contact_df = data[data['CONTACT'] == 1]
non_contact_df = data[data['CONTACT'] == 0]
If this isn't helpful, please provide some additional details as to what the output should look like and what you plan to do with it after it is created.
Old question but why not:
sliceStart_index = df[ df["date"]=="2012-12-28" ].index.tolist()[0]
sliceEnd_index = df[ df["date"]=="2013-01-10" ].index.tolist()[0]
this_is_your_slice = df.iloc[sliceStart_index : sliceEnd_index]
The first two lines actually get you a list of indexes where the condition is met; I just chose the first ones as an example.
Suppose I have the following DataFrames:
Containers:
Key ContainerCode Quantity
1 P-A1-2097-05-B01 0
2 P-A1-1073-13-B04 0
3 P-A1-2024-09-H05 0
5 P-A1-2018-08-C05 0
6 P-A1-2089-03-C08 0
7 P-A1-3033-16-H07 0
8 P-A1-3035-18-C02 0
9 P-A1-4008-09-G01 0
Inventory:
Key SKU ContainerCode Quantity
1 22-3-1 P-A1-4008-09-G01 1
2 2132-12 P-A1-3033-16-H07 55
3 222-12 P-A1-4008-09-G01 3
4 4561-3 P-A1-3083-12-H01 126
How do I update the Quantity values in Containers to reflect the number of units in each container based on the information in Inventory? Note that multiple SKUs can reside in a single ContainerCode, so we need to add to the quantity, rather than just replace it, and there may be multiple entries in Containers for a particular ContainerCode.
What are the possible ways to accomplish this, and what are their relative pros and cons?
EDIT
The following code seems to serve as a good test case:
import itertools
import pandas as pd
import numpy as np
inventory = pd.DataFrame({'Container Code': ['A1','A2','A2','A4'],
                          'Quantity': [10,87,2,44],
                          'SKU': ['123-456','234-567','345-678','456-567']})

containers = pd.DataFrame({'Container Code': ['A1','A2','A3','A4'],
                           'Quantity': [2,0,8,4],
                           'Path Order': [1,2,3,4]})
summedInventory = inventory.groupby('Container Code')['Quantity'].sum()
print('Containers Data Frame')
print(containers)
print('\nInventory Data Frame')
print(inventory)
print('\nSummed Inventory List')
print(summedInventory)
print('\n')
newContainers = containers.drop('Quantity', axis=1) \
    .join(inventory.groupby('Container Code').sum(), on='Container Code')
print(newContainers)
This seems to produce the desired output.
I also tried using a regular merge:
pd.merge(containers.drop('Quantity', axis=1), summedInventory,
         how='inner', left_on='Container Code', right_index=True)
But that produces an 'IndexError: list index out of range'
Any ideas?
I hope I got your scenario correctly. I think you can use:
containers.drop('Quantity', axis=1) \
    .join(inventory.groupby('ContainerCode').sum(), on='ContainerCode')
I'm first dropping Quantity from containers because you don't need it - we'll recreate it from inventory.
Then we group inventory by the container code, to sum the quantity relevant to each container.
We then perform the join between the two, and each ContainerCode existing in containers receives the summed quantity from inventory.
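As for the pros and cons, a hedged alternative sketch that avoids the join entirely, using Series.map on the summed inventory (assuming the ContainerCode column names from the original frames); containers with no matching inventory fall back to 0:

# Total units per container, then look each container's code up against it.
summed = inventory.groupby('ContainerCode')['Quantity'].sum()
containers['Quantity'] = containers['ContainerCode'].map(summed).fillna(0).astype(int)

Since map is a plain index lookup, it never changes the number of rows in containers, even when a ContainerCode appears more than once there.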
I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with the date just before that of the left dataframe. It's probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically, I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
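For what it's worth, newer versions of pandas ship a built-in for exactly this kind of "most recent earlier record" join. A minimal sketch with pd.merge_asof, assuming the date columns are parsed as datetimes and both frames are sorted by date:

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# merge_asof requires both frames to be sorted by the 'on' key.
# direction='backward' (the default) takes the last df2 row whose date
# is <= the df1 date, matched within each teacher via by='teacher'.
result = pd.merge_asof(df1.sort_values('date'), df2.sort_values('date'),
                       on='date', by='teacher', direction='backward')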
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
This is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right-hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
Seems like the quickest way to do this is using sqlite via pysqldf:
def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e

    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                           AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                     AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)

    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])