Need help writing code that will automatically write more code? - python

I need help with writing code for a work project. I have written a script that uses pandas to read an Excel file. I have a while-loop that iterates through each row and appends the latitude/longitude data from the file onto a map (Folium, OpenStreetMap).
The issue I've run into has to do with the GPS data. I download a CSV file with vehicle coordinates. On some of the vehicles I'm tracking, the GPS loses signal for whatever reason and doesn't come back online for hundreds of miles. This causes issues when I'm using line plots to track the vehicle movement on the map. I end up getting long straight lines running across cities, since Folium tries to connect the last GPS coordinate before the vehicle went offline with the next coordinate available once the vehicle is back online, which could be hundreds of miles away as shown here. I think that every time the script finds a gap in the GPS coordinates, it could start a completely new line plot and append it to the existing map. This way I should still see the entire vehicle route on the map, but without the long lines trying to connect broken points together.
My idea is to have my script calculate the absolute difference between each successive longitude value. If the difference between two points is greater than 0.01, I want my program to end the current loop and start a new one. This new loop would then need new variables initialized. I will not know in advance how many new loops would need to be created, since there's no way to predict how many times the GPS will go offline/online for each vehicle.
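In pandas terms, the gap test I'm describing would look something like this (a rough, untested sketch using the same column names and scaling as my script below):

import pandas as pd

df = pd.read_csv('Example.CSV', names=['Longitude', 'Latitude'])
lon = df.Longitude / 10 ** 7
# absolute difference between consecutive longitude readings; any
# value over 0.01 marks a spot where the GPS dropped out
gaps = lon.diff().abs() > 0.01

What I can't figure out is how to turn each run between those gaps into its own line plot without hand-writing an unknown number of loops.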
https://gist.github.com/tapanojum/81460dd89cb079296fee0c48a3d625a7
import folium
import pandas as pd

# Pulls CSV file from this location and adds headers to the columns
df = pd.read_csv('Example.CSV', names=['Longitude', 'Latitude'])

lat = (df.Latitude / 10 ** 7)  # Converting Lat/Lon into decimal degrees
lon = (df.Longitude / 10 ** 7)

zoom_start = 17  # Zoom level and starting location when map is opened
mapa = folium.Map(location=[lat[1], lon[1]], zoom_start=zoom_start)

i = 0
j = (lat[i] - lat[i - 1])
location = []
while i < len(lat):
    if abs(j) < 0.01:
        location.append((lat[i], lon[i]))
        i += 1
    else:
        break

# This section is where additional loops would ideally be generated

# Line plot settings
c1 = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5)
c1.add_to(mapa)

mapa.save(outfile="Example.html")
Here's pseudocode for how I want to accomplish this.
1) Python reads csv
2) Converts Long/Lat into decimal degrees
3) Init location1
4) Runs while loop to append coords
5) If abs(j) >= 0.01, break loop
6) Init location(2,3,...)
7) Generates new while i < len(lat): loop using location(2,3,...)
8) Repeats steps 5-7 while i < len(lat) (repeat as many times as there are instances of abs(j) >= 0.01)
9) Creates (c1, c2, c3,...) = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5) for each location variable
10) Creates c1.add_to(mapa) for each c1, c2, c3... listed above
11) mapa.save
Any help would be tremendously appreciated!
UPDATE:
Working Solution
import folium
import pandas as pd

# Pulls CSV file from this location and adds headers to the columns
df = pd.read_csv('EXAMPLE.CSV', names=['Longitude', 'Latitude'])

lat = (df.Latitude / 10 ** 7)  # Converting Lat/Lon into decimal degrees
lon = (df.Longitude / 10 ** 7)

zoom_start = 17  # Zoom level and starting location when map is opened
mapa = folium.Map(location=[lat[1], lon[1]], zoom_start=zoom_start)

i = 1
location = []
while i < (len(lat) - 1):
    location.append((lat[i], lon[i]))
    i += 1
    j = (lat[i] - lat[i - 1])
    if abs(j) > 0.01:
        # Gap detected: draw the segment collected so far and start a new one
        c1 = folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5)
        c1.add_to(mapa)
        location = []

# Draw whatever remains after the last gap, otherwise the final
# segment never makes it onto the map
if location:
    folium.MultiPolyLine(locations=[location], color='blue', weight=1.5, opacity=0.5).add_to(mapa)

mapa.save(outfile="Example.html")

Your while loop looks wonky. You only set j once, outside the loop. Also, I think you want a list of line segments. Did you want something like this:
i = 0
segment = 0
locations = []
while i < len(lat):
    locations.append([])  # start a new segment
    # add points to the current segment until all are
    # consumed or a disconnect is detected
    while i < len(lat):
        locations[segment].append((lat[i], lon[i]))
        i += 1
        if i < len(lat):  # guard the lookback so the last point doesn't overrun
            j = (lat[i] - lat[i - 1])
            if abs(j) > 0.01:
                break
    segment += 1
When this is done, locations will be a list of segments, e.g.:
[ segment0, segment1, ..... ]
and each segment will be a list of points, e.g.:
[ (lat, lon), (lat, lon), ..... ]
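You can then hand the segments straight to folium; MultiPolyLine already accepts a list of lines, so no extra loop variables are needed:

# each inner list is drawn as its own line, so no connector is drawn
# across the GPS gaps
c = folium.MultiPolyLine(locations=locations, color='blue', weight=1.5, opacity=0.5)
c.add_to(mapa)
mapa.save(outfile="Example.html")

As an aside, since the data is already in pandas, the split can also be done without explicit loops at all; a sketch, untested against your data:

points = pd.DataFrame({'lat': lat, 'lon': lon})
# cumsum() over the gap flags gives every row a segment id
points['segment'] = (points.lat.diff().abs() > 0.01).cumsum()
locations = [list(zip(g.lat, g.lon)) for _, g in points.groupby('segment')]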

Related

How to remove an element and append python list conditionally?

I receive timeseries data from a broker and want to implement condition monitoring on it. I want to analyze the data in a window of size 10, and the window size must always stay the same. When the 11th value comes, I need to check it against two thresholds calculated from the 10 values inside the window. If the 11th value is an outlier, I must drop it; if it is within range, I must delete the first element and append the new value at the end, so the window size stays the same. The code below is simplified; data arrives every second.
temp_list = []
window_size = 10

if len(temp_list) <= window_size:
    temp_list.append(data)
if len(temp_list) == 10:
    avg = statistics.mean(temp_list)
    std = statistics.stdev(temp_list)
    u_thresh = avg + 3*std
    l_thresh = avg - 3*std
    temp_list.append(data)
    if temp_list[window_size] < l_thresh or temp_list[window_size] > u_thresh:
        temp_list.pop(-1)
    else:
        temp_list.pop(0)
        temp_list.append(data)
With this code the list does not get updated and 11th data is stored and then no new data. I don't know how to correctly implement it. Sorry, if it is a simple question. I am still not very comfortable with python list. Thank you for your hint/help.
As your code currently stands, when you plan to keep the last data point you actually add it twice. You can simplify your code to make it a bit more clear and straightforward.
# First set up your initial variables
temp_list = []
window_size = 10
Then -
while True:
    data = ...  # generate/get the next reading here
    # If fewer than 10 data points so far, just add them to the list
    if len(temp_list) < window_size:
        temp_list.append(data)
    # If already at 10, check whether the new point is within range
    else:
        avg = statistics.mean(temp_list)
        std = statistics.stdev(temp_list)
        u_thresh = avg + 3*std
        l_thresh = avg - 3*std
        # If within range, drop the first element and append the new point
        if l_thresh <= data <= u_thresh:
            temp_list.pop(0)
            temp_list.append(data)
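As a side note, collections.deque with maxlen handles the drop-the-oldest bookkeeping for you; here is a sketch of the same logic, where get_data() is a hypothetical stand-in for however you receive each reading:

import statistics
from collections import deque

window = deque(maxlen=10)  # old values fall off the left automatically
while True:
    data = get_data()  # hypothetical source of one reading per second
    if len(window) < window.maxlen:
        window.append(data)
        continue
    avg = statistics.mean(window)
    std = statistics.stdev(window)
    # keep the point only if it sits inside avg +/- 3*std
    if avg - 3 * std <= data <= avg + 3 * std:
        window.append(data)  # appending to a full deque drops window[0]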

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

Learning Python, just began last week; haven't otherwise coded for about 20 years and was never that advanced to begin with. I got the hello world thing down. Now I'm trying to back-test FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while on my Lynda vids.
I'm getting a funky error, and also wondering if there are blatantly more efficient ways to loop through columns of Excel data the way I am.
The spreadsheet being read is simple ... 56 FX pairs down column A, and 8 columns over, where the column headers are dates and the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that a return % can be calc'd vs the prior period) and calcs out period/period % returns for each pair, identifying which is the 'maximum value', and then "goes long" that highest performer ... whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7 ... it works when I limit the loop to 7 columns but not 8. When I use 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55; I think I'm missing the bottom row.
Here's my code:
# set up imports
import pandas as pd

# import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)

# define counters for loops
o = 1  # observation counter
c = 3  # column counter
r = 0  # active row counter for sorting through for max

# define identifiers for the portfolio
rpos = 0  # static row, for identifying which currency pair is in column 0 of that row
p = 100   # portfolio size starts at $100

# define the stuff we are evaluating for
pair = df.iat[r, 0]    # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0       # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0  # a second version of above, for comparison to prior return

# runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8):  # manually limiting this to 5 columns left to right
    while (r < 55):  # manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r, c]) / (df.iat[r, c - 1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn:  # if it's a higher return, it must be the "max" to that point
            pair = df.iat[r, 0]               # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc  # sets pair_pct_rtn as the new max
            rpos = r                          # identifies the max pair's ROW for this column observation, so far
        r = r + 1  # adds to r in order to jump down and calc the next row
    print('in obs #', o, ', ', pair, 'did best at', pair_pct_rtn, '.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * (1 + ((df.iat[rpos, c + 1]) / (df.iat[rpos, c]) - 1))
    print('then the subsequent period it did: ', (df.iat[rpos, c + 1]) / (df.iat[rpos, c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1  # adds to c in order to move to the next period to the right

print(p)
Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7, so once c reaches 7 the portfolio step df.iat[rpos, c + 1] asks for column index 8, which doesn't exist in an 8-column frame; that is the IndexError you're seeing. Likewise, row index 55 (the 56th row) is your last row, so stopping the inner loop at r < 55 skips the bottom row.
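To make those limits automatic rather than hard-coded, df.shape gives the row and column counts; a sketch of the loop skeleton under that change (the elided bodies are exactly your existing logic):

n_rows, n_cols = df.shape   # picks up the 56 rows and 8 columns automatically
c = 3
while c < n_cols - 1:       # stop one column early, because the loop body
    r = 0                   # also reads df.iat[rpos, c + 1] for the PnL step
    while r < n_rows:       # covers every data row, including the bottom one
        ...                 # same max-return logic as before
        r = r + 1
    ...                     # same portfolio update and resets as before
    c = c + 1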

Filtering CSV data using python and storing different values in array

I am trying to filter a CSV file where I need to store the prices of different commodities that are > 1000 in different arrays. I am able to get one commodity's values perfectly, but the other commodity arrays are just duplicates of the first commodity.
The CSV file looks like the figure below:
CODE
import matplotlib.pyplot as plt
import csv
import pandas as pd
import numpy as np

# csv file name
filename = "CommodityPrice.csv"

# Lists for prices above 1000
gold_price_above_1000 = []
palladiun_price_above_1000 = []
gold_futr_price_above_1000 = []
cocoa_future_price_above_1000 = []

df = pd.read_csv(filename)
commodity = df["Commodity"]
price = df['Price']

for gold_price in price:
    if (gold_price <= 1000):
        break
    else:
        for gold in commodity:
            if ('Gold' == gold):
                gold_price_above_1000.append(gold_price)
                break

for palladiun_price in price:
    if (palladiun_price <= 1000):
        break
    else:
        for palladiun in commodity:
            if ('Palladiun' == palladiun):
                palladiun_price_above_1000.append(palladiun_price)
                break

for gold_futr_price in price:
    if (gold_futr_price <= 1000):
        break
    else:
        for gold_futr in commodity:
            if ('Gold Futr' == gold_futr):
                gold_futr_price_above_1000.append(gold_futr_price)
                break

for cocoa_future_price in price:
    if (cocoa_future_price <= 1000):
        break
    else:
        for cocoa_future in commodity:
            if ('Cocoa Future' == cocoa_future):
                cocoa_future_price_above_1000.append(cocoa_future_price)
                break

print(gold_price_above_1000)
print(palladiun_price_above_1000)
print(gold_futr_price_above_1000)
print(cocoa_future_price_above_1000)

plt.ylim(1000, 3000)
plt.plot(gold_price_above_1000)
plt.plot(palladiun_price_above_1000)
plt.plot(gold_futr_price_above_1000)
plt.plot(cocoa_future_price_above_1000)
plt.title('Commodity Price(>=1000)')
y = np.array(gold_price_above_1000)
plt.ylabel("Price")
plt.show()
print("SUCCESS")
Here is my question in detail:
Please use pandas and matplotlib to sort out the data in the csv and store the sorted data in charts, as follows.
Figure 1, upper chart: take all the products in the csv with Price >= 1000, mark their prices in April and May, and draw them as a line graph. The year must be removed from the dates in the output. Each line's label name must be marked and displayed. The titles of the chart, x-axis, and y-axis must be given. The y-axis range falls within 1000~3000, and the color of the lines is not specified.
Figure 1, lower chart: from all the products in the csv with Price >= 1000, mark their Change % in April and May and draw them as a dotted line graph. The markers must use a dot style other than '.' and 'o', and the lines must be something other than a solid line. The year must be removed from the dates in the output, each line's label name must be marked and displayed, the titles of the chart, x-axis, and y-axis must be given, a grid must be added, the y-axis range falls within -15 to +15, and the color of the lines is not specified.
The upper and lower charts of Figure 2 use 1000 > Price >= 500 instead. The other conditions are basically the same as in Figure 1, except that the points and lines of the lower chart in Figure 2 must use different styles from Figure 1.
The two charts of Figure 1 must be displayed in the same window, as must those of Figure 2.
All of your blocks of code are doing the exact same thing. Changing the name of the iterator variable doesn't change what it does.
for gold_price in price:
for palladiun_price in price:
for gold_futr_price in price:
for cocoa_future_price in price:
Each of these goes through the exact same data. You haven't subset the rows for a specific commodity.
Using the break statement in those loops doesn't make sense either; it should be a pass.
So, for every price above 1000, you iterate through your entire Commodity column and append that price to the list every time you see the given commodity name.
Read how to index and select data in pandas.
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
gold_price_above_1000 = df[(df.Commodity=='Gold') & (df.Price>1000)]
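The same selection works for the rest; a dict comprehension keeps it to one block (commodity spellings copied from your code):

above_1000 = {name: df[(df.Commodity == name) & (df.Price > 1000)]
              for name in ['Gold', 'Palladiun', 'Gold Futr', 'Cocoa Future']}
print(above_1000['Gold'].Price)  # just the gold prices over 1000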

How to improve efficiency in while loop by pandas

I am new to Python. In my job I deal with masses of data, so I began to study Python to improve my efficiency.
The first small trial is this: finding the nearest distance between two coordinates.
I have two files, one named "book.csv" and the other named "macro.csv".
book.csv has three columns: BookName, Longitude, Latitude; macro.csv has three columns: MacroName, Longitude, Latitude.
The purpose of the trial is to find the nearest Macro to each Book. I used pandas to do this, and I now get the right result, but the efficiency is a little low: with 1500 books and 200 macros it takes about 15 seconds.
Please help me see whether I can improve the efficiency. Thanks. The following is my trial code:
# import pandas lib
from pandas import Series, DataFrame
import pandas as pd
# import geopy lib, to calculate the distance between two points
import geopy.distance

# def func to calculate the distance; input: two coordinates (Lat, Lon); returns metres
def dist(coord1, coord2):
    return geopy.distance.vincenty(coord1, coord2).m

# def func to find the nearest result, including MacroName and distance
def find_nearest_macro(df_macro, df_book):
    # Get column content from dataframe into series
    # Macro
    s_macro_name = df_macro["MacroName"]
    s_macro_Lat = df_macro["Latitude"]
    s_macro_Lon = df_macro["Longitude"]
    # Book
    s_book_name = df_book["BookName"]
    s_book_Lat = df_book["Latitude"]
    s_book_Lon = df_book["Longitude"]
    # empty lists, used to append the nearest results
    nearest_macro = []
    nearest_dist = []
    # Loop through each book
    ibook = 0
    while ibook < len(s_book_name):
        # Get the coordinate of the current book
        book_coord = (s_book_Lat[ibook], s_book_Lon[ibook])
        # Give an initial value to the result: the distance from the
        # current book (not book 0) to the first macro
        nearest_macro_name = s_macro_name[0]
        nearest_macro_dist = dist(book_coord, (s_macro_Lat[0], s_macro_Lon[0]))
        # Loop through each remaining Macro, resetting the loop variable
        imacro = 1
        while imacro < len(s_macro_name):
            # Get the coordinate of the current Macro
            macro_coord = (s_macro_Lat[imacro], s_macro_Lon[imacro])
            # Calculate the distance between book and macro
            tempd = dist(book_coord, macro_coord)
            # if this distance is closer
            if tempd < nearest_macro_dist:
                # Update the result
                nearest_macro_dist = tempd
                nearest_macro_name = s_macro_name[imacro]
            # Increment the loop variable
            imacro = imacro + 1
        # append the nearest result for this book
        nearest_macro.append(nearest_macro_name)
        nearest_dist.append(nearest_macro_dist)
        # Increment the loop variable
        ibook = ibook + 1
    # return nearest macro name and distance (a tuple can return 2 results)
    return (nearest_macro, nearest_dist)

# Assign the filenames:
file_macro = '.\\TestFile\\Macro.csv'
file_book = '.\\TestFile\\Book.csv'

# read content from csv into dataframes
df_macro = pd.read_csv(file_macro)
df_book = pd.read_csv(file_book)

# find the nearest macro name and distance
t_nearest_result = find_nearest_macro(df_macro, df_book)

# create new Series, converting the lists to Series
s_nearest_marco_name = Series(t_nearest_result[0])
s_nearest_macro_dist = Series(t_nearest_result[1])

# insert the new Series into the dataframe
df_book["NearestMacro"] = s_nearest_marco_name
df_book["NearestDist"] = s_nearest_macro_dist

print(df_book.head())

# write the new df_book to a new csv file
df_book.to_csv('.\\TestFile\\nearest.csv')
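One way to cut the runtime is to drop the inner Python loop and compute every book-to-macro distance in a single NumPy broadcast. A sketch (the function name is mine, and it uses the haversine great-circle formula rather than geopy's Vincenty, so the distances will differ very slightly):

import numpy as np

def find_nearest_macro_fast(df_macro, df_book, earth_radius_m=6371000.0):
    # radians once up front; books down the rows, macros across the columns
    b_lat = np.radians(df_book["Latitude"].values)[:, None]
    b_lon = np.radians(df_book["Longitude"].values)[:, None]
    m_lat = np.radians(df_macro["Latitude"].values)[None, :]
    m_lon = np.radians(df_macro["Longitude"].values)[None, :]
    # haversine distance for every (book, macro) pair at once
    a = (np.sin((m_lat - b_lat) / 2) ** 2
         + np.cos(b_lat) * np.cos(m_lat) * np.sin((m_lon - b_lon) / 2) ** 2)
    d = 2 * earth_radius_m * np.arcsin(np.sqrt(a))
    nearest = d.argmin(axis=1)  # index of the closest macro per book
    return df_macro["MacroName"].values[nearest], d.min(axis=1)

The two returned arrays can be assigned to df_book["NearestMacro"] and df_book["NearestDist"] and written to csv exactly as before.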

Loop through netcdf files and run calculations - Python or R

This is my first time using netCDF and I'm trying to wrap my head around working with it.
I have multiple version 3 netcdf files (NOAA NARR air.2m daily averages for an entire year). Each file spans a year between 1979 - 2012. They are 349 x 277 grids with approximately a 32km resolution. Data was downloaded from here.
The dimension is time (hours since 1/1/1800) and my variable of interest is air. I need to calculate accumulated days with a temperature < 0. For example
Day 1 = +4 degrees, accumulated days = 0
Day 2 = -1 degrees, accumulated days = 1
Day 3 = -2 degrees, accumulated days = 2
Day 4 = -4 degrees, accumulated days = 3
Day 5 = +2 degrees, accumulated days = 0
Day 6 = -3 degrees, accumulated days = 1
I need to store this data in a new netCDF file. I am familiar with Python and somewhat with R. What is the best way to loop through each day, check the previous day's value, and based on that output a value to a new netCDF file with the exact same dimensions and variable.... or perhaps just add another variable to the original netCDF file with the output I'm looking for?
Is it best to leave all the files separate or combine them? I combined them with ncrcat and it worked fine, but the file is 2.3gb.
Thanks for the input.
My current progress in python:
import numpy
import netCDF4

# Change my working DIR
f = netCDF4.Dataset('air7912.nc', 'r')
for a in f.variables:
    print(a)
#output =
lat
long
x
y
Lambert_Conformal
time
time_bnds
air
f.variables['air'][1, 1, 1]
#Output
298.37473
To help me understand this better: what type of data structure am I working with? Is 'air' the key in the above example, and are [1, 1, 1] also keys used to get the value 298.37473? How can I then loop through [1, 1, 1]?
You can use the very nice MFDataset feature in netCDF4 to treat a bunch of files as one aggregated file, without the need to use ncrcat. So your code would look like this:
from pylab import *
import netCDF4

f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
f.variables.keys()
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny, nx), dtype=int)
for i in range(ntimes):
    cold_days += atemp[i, :, :].data - 273.15 < 0
pcolormesh(cold_days)
colorbar()
And here's one way to write the file (there might be easier ways):
# create NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc', 'w', clobber=True)
nco.createDimension('x', nx)
nco.createDimension('y', ny)

cold_days_v = nco.createVariable('cold_days', 'i4', ('y', 'x'))
cold_days_v.units = 'days'
cold_days_v.long_name = 'total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'

lono = nco.createVariable('lon', 'f4', ('y', 'x'))
lato = nco.createVariable('lat', 'f4', ('y', 'x'))
xo = nco.createVariable('x', 'f4', ('x'))
yo = nco.createVariable('y', 'f4', ('y'))
lco = nco.createVariable('Lambert_Conformal', 'i4')

# copy all the variable attributes from the original file
for var in ['lon', 'lat', 'x', 'y', 'Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var], att, getattr(f.variables[var], att))

# copy variable data for lon, lat, x and y
lono[:] = f.variables['lon'][:]
lato[:] = f.variables['lat'][:]
xo[:] = f.variables['x'][:]
yo[:] = f.variables['y'][:]

# write the cold_days data
cold_days_v[:, :] = cold_days

# copy global attributes from the original file
for att in f.ncattrs():
    setattr(nco, att, getattr(f, att))
nco.Conventions = 'CF-1.6'

nco.close()
If I try looking at the resulting file in the Unidata NetCDF-Java Tools-UI GUI, it seems to be okay.
Also note that here I just downloaded two of the datasets for testing, so I used
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
as an example. For all the data, you could use
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.????.nc')
or
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.*.nc')
Here is an R solution.
infiles <- list.files("data", pattern = "nc", full.names = TRUE, include.dirs = TRUE)
outfile <- "data/air.colddays.nc"
library(raster)
r <- raster::stack(infiles)
r <- sum((r - 273.15) < 0)
plot(r)
I know this is rather late for this thread from 2013, but I just want to point out that the accepted solution doesn't answer the exact question posed. The question asks for the length of each continuous period of temperatures below zero (note that in the question the counter resets if the temperature exceeds zero), which can be important for climate applications (e.g. for farming), whereas the accepted solution only gives the total number of days in a year that the temperature is below zero. If the total is really what mkmitchell wants (it has been accepted as the answer), then it can be done from the command line with cdo, without having to worry about netCDF input/output:
cdo timsum -lec,273.15 in.nc out.nc
so a looped script would be:
files=`ls *.nc`  # pick up all the netcdf files in a directory
for file in $files ; do
    # I use 273.15 since from the question it seems T is in Kelvin
    cdo timsum -lec,273.15 $file ${file%???}_numdays.nc
done
If you then want the total over the whole period, you can cat the _numdays files instead, which are much smaller:
cdo cat *_numdays.nc total.nc
cdo timsum total.nc total_below_zero.nc
But again, the question seems to want accumulated days per event, which is different, but not provided by the accepted answer.
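For reference, the per-event counter the question describes can also be done in a few lines of NumPy once the air array is loaded (a sketch; the variable names are mine):

import numpy as np

def accumulated_cold_days(air_kelvin):
    # air_kelvin has shape (time, y, x); the streak counter resets to
    # zero whenever a grid cell goes back above freezing
    cold = air_kelvin < 273.15
    acc = np.zeros(cold.shape, dtype=int)
    run = np.zeros(cold.shape[1:], dtype=int)
    for t in range(cold.shape[0]):
        run = np.where(cold[t], run + 1, 0)  # extend the run or reset it
        acc[t] = run
    return acc  # acc[t] is the current streak length on day t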
