How to make Slices from a Dataframe where Column Equals a Value - python

I have two sets of CSV data. One contains two columns (time and a boolean flag); the other contains vehicle data that I'd like to display with some graphing functions I have. The data sets are sampled at different frequencies, so their row counts may not match. How do I plot an individual graph for each range of data where the boolean is true?
Here is what the contact data looks like:
INDEX | TIME | CONTACT
0 | 240:18:59:31.750 | 0
1 | 240:18:59:32.000 | 0
2 | 240:18:59:32.250 | 0
........
1421 | 240:19:05:27.000 | 1
1422 | 240:19:05:27.250 | 1
The other (Vehicle) data isn't really important, but it contains values like Weight, Speed (MPH), Pedal Position, etc.
I have many separate large Excel files, and because the shapes do not match I am unsure how to slice the data using the time flags. I wrote the function below to create the ranges, but I suspect this can be done more simply.
Here is the working code (with output below). In short, is there an easier way to do this?
def determineContactSlices(data):
    contactStart = None
    contactEnd = None
    slices = pd.DataFrame([])
    for index, row in data.iterrows():
        if row['CONTACT'] == 1:
            # begin slice
            if contactStart is None:
                contactStart = index
                continue
            else:
                # still valid, move onto next
                continue
        elif row['CONTACT'] == 0:
            if contactStart is not None:
                contactEnd = index - 1
                # create slice and add the df to list
                slice = data[contactStart:contactEnd]
                print(slice)
                slices = slices.append(slice)
                # then reset everything
                slice = None
                contactStart = None
                contactEnd = None
                continue
            else:
                # move onto next row
                continue
    return slices
Output: ([15542 rows x 2 columns])
Index Time CONTACT
1421 240:19:05:27.000 1
1422 240:19:05:27.250 1
1423 240:19:05:27.500 1
1424 240:19:05:27.750 1
1425 240:19:05:28.000 1
1426 240:19:05:28.250 1
... ...
56815 240:22:56:15.500 1
56816 240:22:56:15.750 1
56817 240:22:56:16.000 1
56818 240:22:56:16.250 1
56819 240:22:56:16.500 1
With this output I intend to loop through each time slice and display the Vehicle Data in subplots.
Any help or guidance would be much appreciated (:
UPDATE:
I believe I can just do filteredData = vehicleData[contactData['CONTACT'] == 1], but then I am not sure how to graph each stretch individually when the contact is interrupted. For example, if there are 7 connections at various times and of various lengths, I would like 7 individual plots.

I think what you are trying to do is relatively simple, although I am not sure if I understand the output that you want or what you want to do with it after you have it. For example:
contact_df = data[data['CONTACT'] == 1]
non_contact_df = data[data['CONTACT'] == 0]
If this isn't helpful, please provide some additional details as to what the output should look like and what you plan to do with it after it is created.
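One way to get a separate frame per contact period, sketched under the assumption that the flag column is named CONTACT as in the question (the helper name contact_segments is made up for illustration): label each run of identical flag values with a running counter, then group the contact rows by that label.

```python
import pandas as pd


def contact_segments(contact_df, flag_col="CONTACT"):
    """Return one DataFrame per uninterrupted run of rows where flag_col == 1.

    The function name and the flag_col parameter are illustrative, not part of pandas.
    """
    in_contact = contact_df[flag_col] == 1
    # A new run starts wherever the flag differs from the previous row,
    # so the cumulative sum of those change points gives a run label per row.
    run_id = (in_contact != in_contact.shift()).cumsum()
    # Keep only the contact rows and split them by run label.
    grouped = contact_df[in_contact].groupby(run_id[in_contact])
    return [segment for _, segment in grouped]
```

Each element of the returned list can then drive one subplot. To pull the matching vehicle rows you would still need to select them by each segment's first and last TIME, since the two files are sampled at different rates.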

Old question but why not:
sliceStart_index = df[ df["date"]=="2012-12-28" ].index.tolist()[0]
sliceEnd_index = df[ df["date"]=="2013-01-10" ].index.tolist()[0]
this_is_your_slice = df.iloc[sliceStart_index : sliceEnd_index]
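The first two lines each build a list of the index labels where the condition is met; I just took the first match from each as an example. Note that iloc is positional, so this assumes the DataFrame has the default 0..n-1 index.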

Related

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

I'm learning Python; I began last week, haven't otherwise coded for about 20 years, and was never that advanced to begin with. I got the hello world thing down. Now I'm trying to backtest FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while working through my Lynda videos.
I'm getting a funky error, and I'm also wondering if there are blatantly more efficient ways to loop through columns of Excel data than the way I am.
The spreadsheet being read is simple: 56 FX pairs down column A, and 8 columns across whose headers are dates; the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calculated vs. the prior period) and calculates period-over-period % returns for each pair, identifies which one is the maximum, and then "goes long" that highest performer, whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7: it works when I limit the loop to 7 columns but not 8. When I use 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows (56 instead of 55), so I think I'm missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)
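Since indices are numbered from 0, the 8th column has index 7, and row index 55 (the 56th row) is your last row. Your loop also reads df.iat[rpos, c+1], so with c < 8 the final pass (c = 7) asks for index 8, which does not exist; the column loop has to stop one short of the last column.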
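To avoid hard-coding 55 and 8, the loop bounds can be taken from the data itself. The following is only a sketch under assumptions: the function name backtest is made up, column 0 is assumed to hold the pair names with every other column holding closing prices, and it simply picks the largest return in each period (the question's loop starts its running maximum at 0, so it behaves differently when every return is negative).

```python
import pandas as pd


def backtest(prices, start_value=100.0):
    """Go long each period's best performer and hold it for the next period.

    Hypothetical helper: assumes column 0 holds pair names and the remaining
    columns hold closing prices, one column per period.
    """
    names = prices.iloc[:, 0].to_numpy()
    closes = prices.iloc[:, 1:].to_numpy(dtype=float)
    n_periods = closes.shape[1]
    value = start_value
    picks = []
    # Column c needs a neighbour on both sides (c - 1 and c + 1),
    # so it runs from 1 up to, but not including, the last column.
    for c in range(1, n_periods - 1):
        returns = closes[:, c] / closes[:, c - 1] - 1
        best = int(returns.argmax())  # row position of the best performer
        value *= closes[best, c + 1] / closes[best, c]
        picks.append(names[best])
    return value, picks
```

Because the bounds come from closes.shape, adding rows or columns to the spreadsheet does not require touching the loop.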

How to perform multiple boolean conditions in a whole DataFrame (row by row)?

I have the following DataFrame:
open high low close volume
0 62.8571 63.9285 62.7714 63.5642 82641944.0
1 63.6642 64.9285 63.5014 64.5114 88379522.0
2 61.7014 63.6857 61.4428 63.2757 112681030.0
3 62.5928 63.6399 62.0285 62.8085 113921367.0
4 63.4357 64.0499 62.6028 63.0505 110727309.0
.. .. .. .. .. ..
Currently I have the following code to generate a "bool" (0, 1, -1) Series depending on multiple conditions (taking the rows 2 at a time; in other cases I will need 3 or 4 rows in each iteration/calculation):
def check_pattern(data):
    engulfed_bar_range = data.iloc[-2]['close'] - data.iloc[-2]['open']
    if abs(engulfed_bar_range) >= params:
        if engulfed_bar_range > 0:
            return -1*((data.iloc[-1]['open'] > data.iloc[-2]['close']) and \
                       (data.iloc[-1]['close'] < data.iloc[-2]['open']))
        else:
            return +1*((data.iloc[-1]['open'] < data.iloc[-2]['close']) and \
                       (data.iloc[-1]['close'] > data.iloc[-2]['open']))
    return False
res = []
for index in range(1, len(all_data)):
    data = all_data.iloc[index-1:index+1]
    res.append(check_pattern(data))
s = pd.Series(res)
Is there a better, easier, or faster way of doing this? In other similar cases, where I only needed the data of one column of the DataFrame, I have used df.rolling(..), but in this case I need data from several columns and I don't know how to do it. Maybe there is some NumPy function I can use? Or pd.eval? (I have tried, but I haven't been able to get what I want.)
Thanks so much in advance for your help.
Graphical explanation of what I'm looking for in the df:
I want a pd.Series with +1 where there is a Bullish Engulfing pattern and -1 where there is a Bearish Engulfing pattern, and 0 if there is no pattern at that index.
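One possible vectorized approach, given as a sketch rather than a definitive answer: shift the open and close columns by one row so that each row can be compared with the previous bar, then combine the conditions as boolean Series. The function name engulfing_signal and the parameter min_range (standing in for params in the question) are assumptions made for illustration.

```python
import pandas as pd


def engulfing_signal(df, min_range=0.0):
    """Return +1 (bullish engulfing), -1 (bearish engulfing) or 0 per row.

    Hypothetical helper mirroring the row-by-row check in the question;
    min_range plays the role of the question's `params`.
    """
    prev_open = df["open"].shift(1)
    prev_close = df["close"].shift(1)
    prev_range = prev_close - prev_open
    # Comparisons against NaN are False, so the first row never matches.
    big_enough = prev_range.abs() >= min_range

    bearish = (big_enough & (prev_range > 0)
               & (df["open"] > prev_close) & (df["close"] < prev_open))
    bullish = (big_enough & (prev_range <= 0)
               & (df["open"] < prev_close) & (df["close"] > prev_open))

    signal = pd.Series(0, index=df.index, dtype="int64")
    signal[bearish] = -1
    signal[bullish] = 1
    return signal
```

Unlike the loop, which produces len(all_data) - 1 values, this returns one value per row, with 0 in the first row because it has no previous bar. Patterns spanning 3 or 4 rows could be handled the same way with shift(2) and shift(3).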

Pandas merge - combination of and and or conditions [duplicate]

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods take an extremely long time. One uses the pandasql module, while the other loops through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1] # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner')
merged_df=merged_df[(merged_df.BP>=merged_df.chr_start) & (merged_df.BP<=merged_df.chr_stop)][['SNP','feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
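With millions of SNPs, merging on chromosome alone first builds every SNP/gene pair per chromosome, which may not fit in memory. A possible alternative, sketched here under the assumption that chr_start <= chr_stop for every gene (the function name snps_in_genes is made up for illustration): sort the SNP positions once per chromosome and use np.searchsorted to find, for each gene, the block of SNPs that falls inside it.

```python
import numpy as np
import pandas as pd


def snps_in_genes(snp_df, gene_df):
    """Pair each SNP with every gene whose [chr_start, chr_stop] range contains it.

    Hypothetical helper; assumes chr_start <= chr_stop in every gene row.
    """
    pieces = []
    for chrom, snps in snp_df.groupby("chromosome"):
        genes = gene_df[gene_df["chromosome"] == chrom]
        if genes.empty:
            continue
        snps = snps.sort_values("BP")
        bp = snps["BP"].to_numpy()
        names = snps["SNP"].to_numpy()
        # For each gene: first SNP position >= chr_start, and one past the
        # last SNP position <= chr_stop, both found by binary search.
        lo = np.searchsorted(bp, genes["chr_start"].to_numpy(), side="left")
        hi = np.searchsorted(bp, genes["chr_stop"].to_numpy(), side="right")
        counts = hi - lo
        if counts.sum() == 0:
            continue
        idx = np.concatenate([np.arange(l, h) for l, h in zip(lo, hi)])
        pieces.append(pd.DataFrame({
            "SNP": names[idx],
            "feature_id": np.repeat(genes["feature_id"].to_numpy(), counts),
        }))
    if not pieces:
        return pd.DataFrame(columns=["SNP", "feature_id"])
    return pd.concat(pieces, ignore_index=True)
```

Only the matching pairs are ever materialized, so memory use scales with the size of the result rather than with the number of SNPs times the number of genes.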

pandas speed up creation of columns from column of lists

I have an original dataset with information stored as a list of dicts in a single column (this is a MongoDB extract). This is the column:
[{u'domain_id': ObjectId('A'), u'p': 1},
{u'domain_id': ObjectId('B'), u'p': 2},
{u'domain_id': ObjectId('B'), u'p': 3},
...
{u'domain_id': ObjectId('CG'), u'p': 101}]
I'm only interested in the first 10 dicts ('p' values from 1 to 10). The output DataFrame should look like this:
index | A | ... | B
------------------------
0 | 1 | ... | 2
1 | Nan | ... | Nan
2 | Nan | ... | 3
That is, for each line of my original DataFrame, I create a column for each domain_id and associate it with the corresponding 'p' value. The same domain_id can appear with several 'p' values; in that case I only keep the first one (the smallest 'p').
Here is my current code, which may be easier to understand:
first = True
for i in df.index[:]: # for each line of original Dataframe
    temp_list = df["positions"][i] # this is the column with the list of dict inside
    col_list = []
    data_list = []
    for j in range(10): # get the first 10 values
        try:
            if temp_list[j]["domain_id"] not in col_list: # check if domain_id already exist
                col_list.append(temp_list[j]["domain_id"])
                data_list.append(temp_list[j]["p"])
        except IndexError as e:
            print(e)
    df_temp = pd.DataFrame([np.transpose(data_list)],columns = col_list) # create a temporary DataFrame for this line of the original DataFrame
    if first:
        df_kw = df_temp
        first = False
    else:
        # pass
        df_kw = pd.concat([df_kw,df_temp], axis=0, ignore_index=True) # concat all the temporary DataFrame : now I have my output Dataframe, with the same number of lines as my original DataFrame.
This all works fine, but it is very slow, as I have 15k lines and end up with 10k columns.
I'm sure (or at least I very much hope) that there is a simpler and faster solution; any advice will be much appreciated.
I found a decent solution: the slow part is the concatenation, so it is far more efficient to create the DataFrame first and then update the values.
Create the DataFrame:
col_list = []
for i in df.index[:]:
    temp_list = df["positions"][i]
    for j in range(10):
        try:
            # if temp_list[j]["domain_id"] not in col_list:
            col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print(e)
df_total = pd.DataFrame(index=df.index, columns=set(col_list))
Update the values:
for i in df.index[:]:
    temp_list = df["positions"][i]
    col_list = []
    for j in range(10):
        try:
            if temp_list[j]["domain_id"] not in col_list: # avoid overwriting values
                df_total.loc[i, temp_list[j]["domain_id"]] = temp_list[j]["p"]
                col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:
            print(e)
Creating a 15k x 6k DataFrame took about 6 seconds on my computer, and filling it took 27 seconds.
I killed the former solution after it had been running for more than an hour, so this is much faster.
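A further variation that may be worth trying, offered as a sketch and not timed against the data above: build one plain dict per row and hand the whole list to the DataFrame constructor in a single call, so no DataFrame is modified cell by cell. The function name positions_to_frame and the limit parameter are made up for illustration.

```python
import pandas as pd


def positions_to_frame(positions, limit=10):
    """Turn a Series of lists of {'domain_id': ..., 'p': ...} dicts into a wide frame.

    Hypothetical helper: one column per domain_id, keeping the first 'p'
    seen for each domain_id among the first `limit` entries of each row.
    """
    rows = []
    for entries in positions:
        row = {}
        for entry in entries[:limit]:
            # setdefault keeps the first (smallest) 'p' for a repeated domain_id.
            row.setdefault(entry["domain_id"], entry["p"])
        rows.append(row)
    # Missing domain_ids become NaN automatically.
    return pd.DataFrame(rows, index=positions.index)
```

It would be called as positions_to_frame(df["positions"]). Slicing with entries[:limit] also removes the need to catch IndexError for rows with fewer than 10 entries.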

How can I count coherent values that are less than a specific number?

I need to handle some hourly weather data from CSV files with 8,760 values per column. For example, I need to plot a histogram of the longest coherent (consecutive) calms, meaning periods where the wind speed stays below 3 m/s.
I have already created a histogram of the wind speed distribution, but this one is much harder. I need some kind of routine that counts the consecutive hours below 3 m/s, groups them together, and plots the result at the end.
My idea is to ask of every value "less than 3?"; if yes, start a new calm and continue until the answer is no, then finish the calm, and so on. In the end there should be a lot of calms, from one hour up to approx. 48 hours. The output is a histogram of these calms sorted by frequency.
I didn't expect somebody to write the code for me, sorry if it seems like that. I just asked for an idea, but I think I almost got it.
Here is my code so far. It should create a vector for every calm and put it into a dictionary. It works, but every key is filled with the same vector and I'm not sure how to fix this. (The vector itself is fine: it starts at <= 3 and counts until >= 3.)
#read column v_wind
saved_column = df.v_wind
fig, ax = plt.subplots()
#collecting vectors in empty dictionary
# array range 100
vector_coll = {}
a = np.array(range(100))
#for loop create vector
#set calm to zero
#i = calm vectors
#b = empty array
calm = 0
i = -1
b = []
for t in range(0, 8760, 1):
    if df.v_wind[t] <= 3:
        if calm == 0:
            b = []
            b = np.append(b, [df.v_wind[t]])
            calm = 1
        else:
            b = np.append(b, [df.v_wind[t]])
    else:
        calm = False
        calm = 0
        i = i + 1
        for i in np.array(range(100)):
            vector_coll[str(a[i])] = b
#print(vector_coll.keys())
#print(vector_coll['1'])
for i in vector_coll.keys():
    if vector_coll[i] == []:
        print('empty')
    else:
        print('full')
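If only the length of each calm is needed for the histogram, the dictionary of vectors can be skipped. A minimal sketch, assuming the wind speeds are in a pandas Series and that "calm" means strictly below the threshold as in the description (the code above uses <= 3); the function name calm_lengths is made up for illustration.

```python
import pandas as pd


def calm_lengths(wind, threshold=3.0):
    """Return the length in samples of each uninterrupted run below threshold.

    Hypothetical helper; `wind` is a pandas Series of wind speeds.
    """
    is_calm = wind < threshold
    # A new run starts wherever the calm flag differs from the previous sample.
    run_id = (is_calm != is_calm.shift()).cumsum()
    # Summing the boolean flag per run gives the run length for calm runs
    # and 0 for the windy runs in between.
    lengths = is_calm.groupby(run_id).sum()
    return lengths[lengths > 0].astype(int).tolist()
```

The resulting list could then be passed to plt.hist, for example plt.hist(calm_lengths(df.v_wind), bins=range(1, 50)), to show how often each calm duration occurs.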
