I receive time-series data from a broker and want to implement condition monitoring on it. I want to analyze the data in a window of size 10, and the window size must always stay the same. When the 11th data point arrives, I need to check its value against two thresholds calculated from the 10 values in the window. If the 11th point is an outlier, I must discard it; if it is within range, I must delete the first element and append the 11th point to the end of the list, so the window size stays the same. The code below is simplified. A new data point arrives every second.
temp_list = []
window_size = 10
if len(temp_list) <= window_size :
    temp_list.append(data)
    if len(temp_list) == 10:
        avg = statistics.mean(temp_list)
        std = statistics.stdev(temp_list)
        u_thresh = avg + 3*std
        l_thresh = avg - 3*std
        temp_list.append(data)
        if temp_list[window_size] < l_thresh or temp_list[window_size] > u_thresh:
            temp_list.pop(-1)
        else:
            temp_list.pop(0)
            temp_list.append(data)
With this code the list does not get updated: the 11th data point is stored and then no new data is added. I don't know how to implement this correctly. Sorry if it is a simple question; I am still not very comfortable with Python lists. Thank you for any hint or help.
As your code currently stands, when you intend to keep the new data point you append it twice. You can simplify the code to make it clearer and more straightforward.
## First set up your imports and initial variables
import statistics

temp_list = []
window_size = 10
## Then loop over the incoming data
while True:
    data = get_data()  ## placeholder name: generate/get your data here
    ## If there are fewer than 10 data points, add the new one to the list
    if len(temp_list) < window_size:
        temp_list.append(data)
    ## If already at 10, check whether the new point is within the needed range
    else:
        avg = statistics.mean(temp_list)
        std = statistics.stdev(temp_list)
        u_thresh = avg + 3*std
        l_thresh = avg - 3*std
        ## If within range, remove the first element and add the point to the end
        if l_thresh <= data <= u_thresh:
            temp_list.pop(0)
            temp_list.append(data)
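The same window logic can also be wrapped in a small function so it can be called once per incoming value. The following is only a sketch: the function name `update_window`, its parameters, and the sample readings are made up for illustration, and `collections.deque` is used instead of a list simply because removing the oldest element from a deque is cheap.

```python
import statistics
from collections import deque


def update_window(window, value, size=10, k=3.0):
    """Try to add `value` to `window`; return True if it was accepted."""
    # Fill phase: accept everything until the window holds `size` points.
    if len(window) < size:
        window.append(value)
        return True
    avg = statistics.mean(window)
    std = statistics.stdev(window)
    if avg - k * std <= value <= avg + k * std:
        window.popleft()      # drop the oldest point
        window.append(value)  # keep the new one, size stays the same
        return True
    return False  # outlier: the window is left unchanged


# Made-up readings: ten normal values, one spike, one normal value.
window = deque()
results = [update_window(window, r)
           for r in [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 50, 11]]
```

In this run the spike of 50 is rejected and the final 11 replaces the oldest reading, so the window still holds exactly 10 values.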
I'm learning Python; I just began last week, haven't otherwise coded for about 20 years, and was never that advanced to begin with. I got the hello world thing down. Now I'm trying to backtest FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while working through my Lynda videos.
I'm getting a funky error, and I'm also wondering if there are blatantly more efficient ways to loop through columns of Excel data than the way I am.
The spreadsheet being read is simple: 56 FX pairs down column A, and 8 columns across where the column headers are dates, and the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calculated vs. the prior period) and calculates period/period % returns for each pair, identifying which is the 'maximum value', and then "goes long" that highest performer, whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7: it works when I limit the loop to 7 columns but not 8. When I use 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55, so I think I'm missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)
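Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7. Likewise, row index 55 (the 56th row) will be your last row. Note that the loop body also reads df.iat[rpos,c+1], so with 8 columns (indices 0 to 7) the last usable value of c is 6: once c reaches 7, c+1 is 8, which is exactly the index the IndexError complains about. If 56 rows also fails, check len(df): header=1 tells pandas to take the column names from the second spreadsheet row, which can leave one data row fewer than you expect.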
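One way to avoid hard-coding the limits is to derive them from the size of the data. The sketch below shows the idea on a plain list of rows rather than the actual spreadsheet; the function name `backtest`, the row layout (pair name first, then prices), and the sample prices are all assumptions for illustration. Unlike the original loop, it picks the highest return in a column even when every return is negative.

```python
def backtest(rows, start_value=100.0):
    """rows: lists of [pair_name, price_0, price_1, ...] (hypothetical layout)."""
    n_cols = len(rows[0])
    value = start_value
    picks = []
    # Column c is compared with c - 1, and c + 1 is read afterwards,
    # so c runs from 2 up to and including n_cols - 2.
    for c in range(2, n_cols - 1):
        best = max(rows, key=lambda row: row[c] / row[c - 1] - 1)
        value *= best[c + 1] / best[c]  # next-period performance of the pick
        picks.append(best[0])
    return value, picks


# Made-up prices for two pairs over four dates.
rows = [
    ["AAA", 1.0, 1.1, 1.21, 1.21],
    ["BBB", 1.0, 1.0, 1.5, 3.0],
]
final_value, picks = backtest(rows)
```

With a DataFrame the same bounds could be taken from df.shape, which returns the number of rows and columns.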
I have a project where I'm trying to create a program that will take a csv data set from www.transtats.gov which is a data set for airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = [] #create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list: #if the dest is not already in the list
                dest_list.append(row['DEST']) #append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS': #for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1: #if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
                            flight_count += 1 #add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?
I'm not sure I understand you correctly, but I think you should use a dict for this purpose, where the key is the 'FL_NUM' and the value is the total delay.
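A minimal sketch of that dict idea follows. The column names come from the question; the function name `total_delays` and the inline sample rows are invented so the example can run without the real file. It reads the data in a single pass, which also matters because a csv.DictReader is exhausted after one loop over it, so a second `for row in reader:` would see no rows.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample standing in for the real CSV file.
SAMPLE = """ORIGIN,DEST,FL_NUM,CANCELLED,DEP_DELAY
BOS,JFK,100,0.00,15.00
BOS,JFK,100,0.00,30.00
BOS,ORD,200,0.00,5.00
BOS,ORD,200,1.00,
LAX,BOS,300,0.00,90.00
"""


def total_delays(csv_file, origin="BOS"):
    """Map each flight number leaving `origin` to its summed departure delay."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        # Skip other origins and cancelled flights.
        if row["ORIGIN"] != origin or float(row["CANCELLED"]) >= 1:
            continue
        delay = float(row["DEP_DELAY"] or 0)  # empty cell counts as no delay
        if delay > 0:  # ignore early departures
            totals[row["FL_NUM"]] += delay
    return dict(totals)


totals = total_delays(io.StringIO(SAMPLE))
worst = max(totals, key=totals.get)  # flight number with the largest total
```

With the dict in place there are no two lists to compare: max() with key=totals.get returns the flight number directly.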
In general I want to eliminate loops in Python code. For files that aren't massive I'll typically read through a data file once and build up some dicts that I can analyze at the end. The code below isn't tested because I don't have the original data, but it follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number I would capture them as a tuple and use that as the key in my dict.
import csv
from collections import defaultdict

flight_delays = defaultdict(list) # look this up if you aren't familiar
with open('826766072_T_ONTIME.csv') as csv_infile:
    reader = csv.DictReader(csv_infile)
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            # csv values are strings, so convert before comparing;
            # keep only flights that were not cancelled and have a delay value
            if float(row['CANCELLED']) < 1 and row['DEP_DELAY']:
                flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
                flight_delays[flight].append(float(row['DEP_DELAY']))
# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
A very simple solution would be to add two new variables:
max_delay = 0
delay_flight = 0
# Replace `if float(row['DEP_DELAY']) >= 0:` with:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM'] # save the flight number (or the row number) for reference
I have a few hundred thousand groups through which I want to iterate this particular lag operation. Below is a sample where Buy_Ord_No is the group by variable:
I would like to generate Lag_Exec_Qty and Exec_Qty. What I am basically doing here is initially setting Exec_Qty equal to 0 when Buy_Act_Type = 1 or Buy_Act_Type = 4. Then, I take the lagged value of Exec_Qty as Lag_Exec_Qty. In the same row, I sum Trd_Qty and Lag_Exec_Qty to get the updated Exec_Qty.
This is the code that I currently have:
for b in buy:
    temp=buy_sorted_file[buy_sorted_file["Buy_Ord_No"]==b]
    temp=temp.sort_values(["Buy_Ord_No","Buy_Ord_Txn_Time"], ascending=[True, True]).reset_index(drop=True)
    for index in range(len(temp.index)):
        if(int(temp["Buy_Act_Type"].iloc[index])==1 or int(temp["Buy_Act_Type"].iloc[index])==4):
            temp["Exec_Qty"].iloc[index]=0
            temp["Lag_Exec_Qty"].iloc[index]=0
        else:
            temp["Lag_Exec_Qty"].iloc[index]=temp["Exec_Qty"].iloc[index-1]
            temp["Exec_Qty"].iloc[index]=temp["Trd_Qty"].iloc[index]+temp["Lag_Exec_Qty"].iloc[index]
    if (len(buy_sorted_exec_file.index) == 0):
        buy_sorted_exec_file = temp.copy()
    else:
        buy_sorted_exec_file = pd.concat([temp,buy_sorted_exec_file]).reset_index(drop=True)
buy_sorted_file= buy_sorted_exec_file.sort_values(["Buy_Ord_Txn_Time", "Buy_Ord_Limit_Pr"],ascending=[True, True]).reset_index(drop=True)
The code takes a really long time to run. Is there any way to speed this process up?
You should be able to do, without any loops:
temp['Lag_Exec_Qty'] = temp['Exec_Qty'].shift(1)
temp['Exec_Qty'] = temp['Trd_Qty'] + temp['Lag_Exec_Qty']
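A single shift assumes Exec_Qty is already filled in. Because each Exec_Qty in the question depends on the previous row's result, the recurrence amounts to a running sum of Trd_Qty that restarts at every row with Buy_Act_Type 1 or 4. The sketch below expresses that with groupby and cumsum across all orders at once. It is an interpretation of the question, not a tested drop-in: the function name `add_exec_qty` and the sample frame are made up, and a first row of an order that is not type 1 or 4 is assumed to start from a lag of 0.

```python
import pandas as pd


def add_exec_qty(df):
    """Running Exec_Qty per order, restarting at action types 1 and 4."""
    df = df.sort_values(["Buy_Ord_No", "Buy_Ord_Txn_Time"]).reset_index(drop=True)
    reset = df["Buy_Act_Type"].isin([1, 4])
    # Number the blocks inside each order: every reset row opens a new block.
    block = reset.astype(int).groupby(df["Buy_Ord_No"]).cumsum()
    # Reset rows contribute 0, so each block's running sum starts from 0.
    qty = df["Trd_Qty"].where(~reset, 0)
    df["Exec_Qty"] = qty.groupby([df["Buy_Ord_No"], block]).cumsum()
    # Exec_Qty minus this row's quantity is the previous Exec_Qty (0 on reset rows).
    df["Lag_Exec_Qty"] = df["Exec_Qty"] - qty
    return df


# Made-up sample with two orders.
sample = pd.DataFrame({
    "Buy_Ord_No": [1, 1, 1, 1, 2, 2],
    "Buy_Ord_Txn_Time": [1, 2, 3, 4, 1, 2],
    "Buy_Act_Type": [1, 2, 2, 4, 1, 3],
    "Trd_Qty": [0, 5, 3, 0, 0, 7],
})
result = add_exec_qty(sample)
```

Since nothing here loops over orders in Python, this approach should scale much better to a few hundred thousand groups than filtering the frame once per order.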
I need help with writing code for a work project. I have written a script that uses pandas to read an excel file. I have a while-loop written to iterate through each row and append latitude/longitude data from the excel file onto a map (Folium, Open Street Map)
The issue I've run into has to do with the GPS data. I download a CSV file with vehicle coordinates. On some of the vehicles I'm tracking, the GPS loses signal for whatever reason and doesn't come back online for hundreds of miles. This causes issues when I'm using line plots to track the vehicle movement on the map. I end up getting long straight lines running across cities, since Folium is trying to connect the last GPS coordinate before the vehicle went offline with the next GPS coordinate available once the vehicle is back online, which could be hundreds of miles away. I think that every time the script finds a gap in the GPS coordinates, I could have a new loop generated that starts a completely new line plot and appends it to the existing map. This way I should still see the entire vehicle route on the map, but without the long lines trying to connect broken points together.
My idea is to have my script calculate the absolute value difference between each iteration of longitude data. If the difference between each point is greater than 0.01, I want my program to end the loop and to start a new loop. This new loop would then need to have new variables init. I will not know how many new loops would need to be created since there's no way to predict how many times the GPS will go offline/online in each vehicle.
https://gist.github.com/tapanojum/81460dd89cb079296fee0c48a3d625a7
import folium
import pandas as pd
# Pulls CSV file from this location and adds headers to the columns
df = pd.read_csv('Example.CSV',names=['Longitude', 'Latitude',])
lat = (df.Latitude / 10 ** 7) # Converting Lat/Lon into decimal degrees
lon = (df.Longitude / 10 ** 7)
zoom_start = 17 # Zoom level and starting location when map is opened
mapa = folium.Map(location=[lat[1], lon[1]], zoom_start=zoom_start)
i = 0
j = (lat[i] - lat[i - 1])
location = []
while i < len(lat):
    if abs(j) < 0.01:
        location.append((lat[i], lon[i]))
        i += 1
    else:
        break
# This section is where additional loops would ideally be generated
# Line plot settings
c1 = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5)
c1.add_to(mapa)
mapa.save(outfile="Example.html")
Here's pseudocode for how I want to accomplish this.
1) Python reads csv
2) Converts Long/Lat into decimal degrees
3) Init location1
4) Runs while loop to append coords
5) If abs(j) >= 0.01, break loop
6) Init location(2,3,...)
7) Generates new while i < len(lat): loop using location(2,3,...)
8) Repeats steps 5-7 while i < len(lat) (repeat as many times as there are instances of abs(j) >= 0.01)
9) Creates (c1, c2, c3,...) = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5) for each variable of location
10) Creates c1.add_to(mapa) for each c1,c2,c3... listed above
11) mapa.save
Any help would be tremendously appreciated!
UPDATE:
Working Solution
import folium
import pandas as pd
# Pulls CSV file from this location and adds headers to the columns
df = pd.read_csv('EXAMPLE.CSV',names=['Longitude', 'Latitude'])
lat = (df.Latitude / 10 ** 7) # Converting Lat/Lon into decimal degrees
lon = (df.Longitude / 10 ** 7)
zoom_start = 17 # Zoom level and starting location when map is opened
mapa = folium.Map(location=[lat[1], lon[1]], zoom_start=zoom_start)
i = 1
location = []
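while i < (len(lat)-1):
    location.append((lat[i], lon[i]))
    i += 1
    j = (lat[i] - lat[i - 1])
    if abs(j) > 0.01:
        c1 = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5)
        c1.add_to(mapa)
        location = []
# Also draw the points collected after the last gap, which the loop leaves in location
if location:
    folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5).add_to(mapa)
mapa.save(outfile="Example.html")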
Your while loop looks wonky. You only set j once, outside the loop. Also, I think you want a list of line segments. Did you want something like this?
i = 0
locations = []
while i < len(lat):
    segment = [] # start a new segment
    # add points to the current segment until all are
    # consumed or a disconnect is detected
    while i < len(lat):
        segment.append((lat[i], lon[i]))
        i += 1
        # only look at the next point if there is one
        if i < len(lat) and abs(lat[i] - lat[i - 1]) > 0.01:
            break
    locations.append(segment)
When this is done, locations will be a list of segments, e.g.:
[ segment0, segment1, ..... ]
Each segment will be a list of points, e.g.:
[ (lat,lon), (lat,lon), ..... ]
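The segment idea can be tried out without folium or the CSV file. The sketch below is a self-contained version working on a plain list of (lat, lon) tuples; the function name `split_segments` and the sample points are invented, and it assumes that a jump in either latitude or longitude counts as a gap (the question only checks one coordinate).

```python
def split_segments(points, max_gap=0.01):
    """Split (lat, lon) points into runs with no jump larger than max_gap."""
    segments = []
    current = []
    for point in points:
        if current:
            prev = current[-1]
            # Assumption: a jump in either coordinate counts as a gap.
            if (abs(point[0] - prev[0]) > max_gap
                    or abs(point[1] - prev[1]) > max_gap):
                segments.append(current)  # close the segment before the gap
                current = []
        current.append(point)
    if current:
        segments.append(current)  # keep the final segment too
    return segments


# Made-up track with two gaps.
points = [(1.000, 2.000), (1.001, 2.001), (1.500, 2.001),
          (1.501, 2.002), (1.501, 3.000)]
segments = split_segments(points)
```

Each entry in segments can then be drawn as its own line, so no line is ever stretched across a gap.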
I need to handle some hourly weather data from CSV files with 8,760 values per column. For example, I need to plot a histogram of the longest continuous calms in wind speed, meaning periods below 3 m/s.
I have already created a histogram of the wind speed distribution, but this one is much harder. I need some kind of loop that counts the consecutive hours below 3 m/s, groups them into calms, and plots them at the end.
My idea is to apply a loop that asks of every value "less than 3?"; if yes, it creates a new calm and continues until the answer is no, then finishes the calm, and so on. In the end there should be many calms, lasting from one hour to approx. 48 hours. The output is a histogram of these calms sorted by frequency.
I didn't expect somebody to write the code for me; sorry if it seems like that. I just asked for an idea, but I think I almost got it.
Here is my code so far. It should create a vector for every calm and put it into a dictionary. It works, but every key is filled with the same vector and I'm not sure how to fix this. (The vector itself is fine: it starts at a value <= 3 and counts until the value exceeds 3.)
#read column v_wind
saved_column = df.v_wind
fig, ax = plt.subplots()
#collecting vectors in empty dictionary
# array range 100
vector_coll = {}
a = np.array(range(100))
#for loop create vector
#set calm to zero
#i = calm vectors
#b = empty array
calm = 0
i = -1
b = []
for t in range(0, 8760, 1):
    if df.v_wind[t] <= 3:
        if calm == 0:
            b = []
            b = np.append(b, [df.v_wind[t]])
            calm = 1
        else:
            b = np.append(b, [df.v_wind[t]])
    else:
        calm = False
        calm = 0
        i = i + 1
        for i in np.array(range(100)):
            vector_coll[str(a[i])] = b
#print(vector_coll.keys())
#print(vector_coll['1'])
for i in vector_coll.keys():
    if vector_coll[i] == []:
        print('empty')
    else:
        print('full')
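For the histogram itself only the length of each calm is needed, not the wind values inside it. One possible way to get those lengths is itertools.groupby, which groups consecutive items that share the same key. The sketch below is an illustration under assumptions: the function name `calm_lengths` and the sample speeds are made up, and it follows the code above in treating values <= 3 as calm.

```python
from collections import Counter
from itertools import groupby


def calm_lengths(speeds, threshold=3.0):
    """Length in hours of every run of consecutive values <= threshold."""
    return [
        sum(1 for _ in run)  # count the hours in this run
        for is_calm, run in groupby(speeds, key=lambda v: v <= threshold)
        if is_calm  # keep only the calm runs
    ]


# Made-up hourly wind speeds in m/s.
speeds = [5.0, 2.0, 1.5, 4.0, 2.5, 2.9, 3.0, 6.0, 1.0]
lengths = calm_lengths(speeds)
frequency = Counter(lengths)  # how often each calm length occurs
```

The lengths list could then be passed to a histogram call such as ax.hist(lengths), and with the real data calm_lengths(df.v_wind) should work the same way because a pandas Series can be iterated like a list.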