Lists/DataFrames - Running a function over all values in Python

I am stuck at the moment and don't really know how to solve this problem.
I want to apply this calculation to a list/dataframe:
The equation itself is not really the problem for me; I can easily solve it by hand, but that won't do with the amount of data I have.
v: the value to be approximated
vi: the known values (in my case temperatures)
di: the distance to the approximated point
So basically this is for calculating/approximating a new temperature value for a position a certain distance away from the corners of the square:
import pandas as pd
import numpy as np
import xarray as xr
import math
filepath = r'F:\Data\data.nc' # just the path to the file
obj= xr.open_dataset(filepath)
# This is where I get the coordinates for each of the corners of the square
# from the netcdf4 file
lat = 9.7398
lon = 51.2695
xlat = obj['XLAT'].values
xlon = obj['XLON'].values
p_1 = [xlat[0,0], xlon[0,0]]
p_2 = [xlat[0,1], xlon[0,1]]
p_3 = [xlat[1,0], xlon[1,0]]
p_4 = [xlat[1,1], xlon[1,1]]
p_rect = [p_1, p_2, p_3, p_4]
p_orig = [lat, lon]
#=================================================
# Calculates the distance between the points
# d = sqrt((x2-x1)^2 + (y2-y1)^2)
#=================================================
distance = []
for coord in p_rect:
    distance.append(math.sqrt(math.pow(coord[0]-p_orig[0], 2) + math.pow(coord[1]-p_orig[1], 2)))
# To get the values for the key 'WS', for example:
a = obj['WS'].values[:,0,0,0] # Array of floats for the first values
b = obj['WS'].values[:,0,0,1] # Array of floats for the second values
c = obj['WS'].values[:,0,1,0] # Array of floats for the third values
d = obj['WS'].values[:,0,1,1] # Array of floats for the fourth values
From then on, I have no idea how I should continue. Should I do:
df = pd.DataFrame()
df['a'] = a
df['b'] = b
df['c'] = c
df['d'] = d
and then somehow work with DataFrames and drop a, b, c, d after I have the needed values? Or should I do it with lists first and add only the result to the dataframe? I am a bit lost.
The only thing I have come up with so far is how it would look if I did it manually:
for i starting at 0 and ending when the end of the list is reached (a, b, c, d all have the same length):

        1/a[i]^2 * distance[0] + 1/b[i]^2 * distance[1] + 1/c[i]^2 * distance[2] + 1/d[i]^2 * distance[3]
    v = -----------------------------------------------------------------------------------------------------
                                1/a[i]^2 + 1/b[i]^2 + 1/c[i]^2 + 1/d[i]^2
This is the first time I have had such a (at least for me) complex calculation to do on a list/dataframe. I hope you can help me solve this problem or at least nudge me in the right direction.
PS: here is the link to the file:
LINK TO FILE

Simply vectorize your calculations. With data frames you can run arithmetic operations directly on whole columns, as if they were scalars, to generate another column, df['v']. The code below assumes distance is a list of four scalars; also remember that in Python ^ does not mean power, use ** instead.
df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})

df['v'] = (1/df['a']**2 * distance[0] +
           1/df['b']**2 * distance[1] +
           1/df['c']**2 * distance[2] +
           1/df['d']**2 * distance[3]) / (1/df['a']**2 +
                                          1/df['b']**2 +
                                          1/df['c']**2 +
                                          1/df['d']**2)
Or use the functional form with the pandas Series binary operators. The code below follows the order of operations (parentheses --> exponents --> multiplication/division --> addition/subtraction):
df['v'] = (df['a'].pow(2).pow(-1).mul(distance[0]) +
           df['b'].pow(2).pow(-1).mul(distance[1]) +
           df['c'].pow(2).pow(-1).mul(distance[2]) +
           df['d'].pow(2).pow(-1).mul(distance[3])) / (df['a'].pow(2).pow(-1) +
                                                       df['b'].pow(2).pow(-1) +
                                                       df['c'].pow(2).pow(-1) +
                                                       df['d'].pow(2).pow(-1))
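If you prefer to stay with the arrays you already pulled out of the file, the same vectorized formula works without building a DataFrame first. A minimal sketch, assuming a, b, c and d are the 1-D NumPy arrays from above and distance is the four-element list computed earlier:
w_a, w_b, w_c, w_d = 1/a**2, 1/b**2, 1/c**2, 1/d**2
v = (w_a*distance[0] + w_b*distance[1] + w_c*distance[2] + w_d*distance[3]) / (w_a + w_b + w_c + w_d)
df = pd.DataFrame({'v': v})   # keep only the result column if a, b, c, d are not needed later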


For loop keeps returning empty arrays

I need some help with a for loop I have been trying to run. This is the code I have:
cal_points = []
cal_stars = np.genfromtxt('M67_Calibration_Star_List.csv', delimiter=',', names=True)
radii = 0.00023
for star in range(len(cal_stars)):
    ra_l = cal_stars[star][1] - radii; ra_u = cal_stars[star][1] + radii
    dec_l = cal_stars[star][2] - radii; dec_u = cal_stars[star][2] + radii
    for i in range(len(M67_catalogue)):
        if ra_l <= M67_catalogue[i]['RA'] <= ra_u and dec_l <= M67_catalogue[i]['DEC'] <= dec_u:
            cal_points = cal_points + [star]
cal_points.sort()
print(len(cal_points))
print(cal_points)
This keeps returning len(cal_points) as 0 and cal_points as []
These are the headers in the csv file, with a few of the row entries. Please tell me where I'm going wrong.
Since you are trying to match a (small) catalogue of calibration stars with a catalogue of stars in M67, within a given radius(*), you may as well use astropy. Astropy can do all the matching for you, and takes into account the effect of latitudinal distance "shrinking" on a sphere.
Here's some example code that creates two random DataFrames with calibration and catalogue positions, converts them to appropriate Astropy SkyCoords and matches the two sets of positions. It then uses the result to find the stars in the corresponding DataFrames, and concatenates the results into a single DataFrame, including the relevant other information from the catalogue, such as the magnitude.
import pandas as pd
from numpy.random import default_rng
from astropy.coordinates import SkyCoord
import astropy.units as u

rng = default_rng()
n = 30
cal_stars = pd.DataFrame({'RAJ2000': 132 + rng.random(n),
                          'DECJ2000': 11 + rng.random(n),
                          'VTmag': 11 + rng.random(n)})
n = 200
M67_catalogue = pd.DataFrame({'RA': 132 + rng.random(n),
                              'DEC': 11 + rng.random(n),
                              'VTmag': 11 + rng.random(n)})
# Create coordinate arrays, using the relevant columns
# from the DataFrame
cal_stars_sc = SkyCoord(cal_stars['RAJ2000'] * u.deg,
                        cal_stars['DECJ2000'] * u.deg)
M67_catalogue_sc = SkyCoord(M67_catalogue['RA'] * u.deg,
                            M67_catalogue['DEC'] * u.deg)
# Slightly larger radius in this example;
# 0.00023 is too precise for the random coordinates used here
sep = 0.023 * u.deg
# `idxm67` are the indices into the M67_catalogue_sc SkyCoord
# that have a counterpart within `sep` in `cal_stars_sc`.
# Similarly for `idxcal`
# Note that an index (and thus a coordinate) may appear multiple times:
# a single source may be within `sep` distance to several sources in the
# other catalogue
idxm67, idxcal, dist, _ = cal_stars_sc.search_around_sky(M67_catalogue_sc, sep)
# We need to use `.iloc`, since `SkyCoord` follows standard (NumPy) indexing
# Thus we need to ignore any index that the Pandas DataFrame may have
df1 = cal_stars.iloc[idxcal, :]
df2 = M67_catalogue.iloc[idxm67, :]
df2.columns = ['M67' + name for name in df2.columns]
# We also want to reset both DataFrame indices, because these were copied above when using iloc
# Resetting them will make sure df1 and df2 have the same indices
# and are compatible to be concatenated.
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# axis=1 means to concatenate along the columns.
df = pd.concat([df1, df2], axis=1)
# Add the found distances to the final DataFrame
df['dist'] = dist
print(df)
(*) I assume you want a radius, given the variable name, but the search in your code is within a rectangular region.
Here's the short version, without comments and the creation of random data. It should be plug and play, provided M67_catalogue is actually a DataFrame (not a NumPy array). Note that the second half, the creation of a matched DataFrame, is a bonus. cal_stars.iloc[idxcal, :] after using search_around_sky is enough to get your result.
import pandas as pd
from astropy.coordinates import SkyCoord
import astropy.units as u
cal_stars = pd.read_csv('M67_Calibration_Star_List.csv')
radius = 0.00023
cal_stars_sc = SkyCoord(cal_stars['RAJ2000'] * u.deg,
                        cal_stars['DECJ2000'] * u.deg)
M67_catalogue_sc = SkyCoord(M67_catalogue['RA'] * u.deg,
                            M67_catalogue['DEC'] * u.deg)
idxm67, idxcal, dist, _ = cal_stars_sc.search_around_sky(M67_catalogue_sc, radius * u.deg)
df1 = cal_stars.iloc[idxcal, :]
df2 = M67_catalogue.iloc[idxm67, :]
df2.columns = ['M67' + name for name in df2.columns]
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df = pd.concat([df1, df2], axis=1)
df['dist'] = dist
print(df)
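As the footnote notes, your original loop actually searches a rectangular box rather than a circle. If that is really what you want, the same match can also be done without astropy by vectorizing the inner loop with pandas boolean masks. A sketch, assuming M67_catalogue is a DataFrame with 'RA'/'DEC' columns and cal_stars has 'RAJ2000'/'DECJ2000' columns:
radii = 0.00023
cal_points = []
for idx, star in cal_stars.iterrows():
    # One boolean mask over the whole catalogue replaces the inner Python loop
    in_box = (M67_catalogue['RA'].between(star['RAJ2000'] - radii, star['RAJ2000'] + radii) &
              M67_catalogue['DEC'].between(star['DECJ2000'] - radii, star['DECJ2000'] + radii))
    if in_box.any():
        cal_points.append(idx)   # this calibration star has at least one catalogue match
print(len(cal_points), cal_points)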

Slicing in a for loop

I want to run two for loops in which I calculate annualized returns of a hypothetical trading strategy based on moving average crossovers. It's pretty simple: go long as soon as the "faster" MA crosses the "slower" one; otherwise move to cash.
My data looks like this:
My Code:
rets = {}
ann_rets = {}

# Nested loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        # Calculate cumulative return
        rets[short, long] = (aapl[short, long][-1] - aapl[short, long][1]) / aapl[short, long][1]
        # Calculate annualized return
        ann_rets[short, long] = ((1 + rets[short, long]) ** (12 / D)) - 1
The error message i get is the following:
TypeError: list indices must be integers or slices, not tuple
EDIT:
Using a dictionary works fine. The screenshot below shows where I'm stuck at the moment.
I want to have three final columns: (SMA_1,SMA_2,Ann_rets)
SMA_1: First Moving average e.g. 20
SMA_2: Second Moving average e.g. 50
Ann_rets: annualized return which is calculated in the loop above
I have tried to understand your question; hope this helps. I simplified your output ann_rets to illustrate reformatting it into the expected output format.
rets = {}
ann_rets = {}

# Nested loop to calculate returns
for short in range(20, 200):
    for long in range(short + 1, 200):
        # Calculate cumulative return
        rets[short, long] = (aapl[short, long][-1] - aapl[short, long][1]) / aapl[short, long][1]
        # Calculate annualized return
        ann_rets[short, long] = ((1 + rets[short, long]) ** (12 / D)) - 1

# Reformat
Example data:
ann_rets = {(1, 2): 0.1, (3, 4): 0.2, (5, 6): 0.3}
df1 = pd.DataFrame(list(ann_rets.values()))
df2 = pd.DataFrame(list(ann_rets.keys()))
df = pd.concat([df2, df1], axis=1)
df.columns = ['SMA_1', 'SMA_2', 'Ann_rets']
print(df)
Which yields:
   SMA_1  SMA_2  Ann_rets
0      1      2       0.1
1      3      4       0.2
2      5      6       0.3
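As a side note, a dict keyed by tuples also converts directly to a pandas Series with a MultiIndex, which can be flattened into the same three columns in one go. A sketch using the same example data:
s = pd.Series(ann_rets)   # the (SMA_1, SMA_2) tuples become a MultiIndex
df = s.rename_axis(['SMA_1', 'SMA_2']).reset_index(name='Ann_rets')
print(df)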
You're trying to access the index of a list with a tuple here: rets[short, long].
Try instead using a dictionary. So change
rets = []
ann_rets = []
to
rets = {}
ann_rets = {}
A double index like rets[short, long] will work for NumPy arrays and Pandas dataframes (like, presumably, your aapl variable), but not for a regular Python list. Use rets[short][long] instead. (Which also means you would need to change the initialization of rets at the top of your code.)
To explain briefly the actual error message: a tuple is more or less defined by the separating comma, that is, Python sees short,long and turns that into a tuple (short, long), which is then used inside the list index. Which, of course, fails, and throws this error message.
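A minimal illustration of the difference (hypothetical values, just to show the indexing):
rets_dict = {}
rets_dict[20, 50] = 0.12            # dict: the tuple (20, 50) is a single key
rets_list = [[0.0] * 200 for _ in range(200)]
rets_list[20][50] = 0.12            # nested list: index one level at a time
# rets_list[20, 50] raises TypeError: list indices must be integers or slices, not tuple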

How to add some calculation in columns of the dataframe in python

I am reading an Excel sheet with pandas.read_excel and I get the output as a dataframe, but I want to add calculations after reading it. I need to do the following calculation on each of the x and y columns.
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

longitude = 0
latitude = 0
longitude = (mapLongitudeStart + x1 * ratiox)    # I have taken the value from the single column x1
latitude = (mapLatitudeStart - (-y1 * ratioy))   # taken from column y1
How do I apply this calculation to every row of each x and y column that has values (it should skip the null values)? And I want a new dataframe created from the results of the calculation on those columns.
Try the below code:
import pandas as pd
import itertools

df = pd.read_excel('file_path')
dfx = df.loc[:, 'x1'::2]   # .ix is deprecated; label-based slicing also works with .loc
dfy = df.loc[:, 'y1'::2]
li = [dfx.apply(lambda x: mapLongitudeStart + x * ratiox),
      dfy.apply(lambda y: mapLatitudeStart - (-y * ratioy))]
df_new = pd.concat(li, axis=1)
df_new = df_new[list(itertools.chain(*zip(dfx.columns, dfy.columns)))]
print(df_new)
Hope this helps!
I would first recommend reshaping your data into a long format; that way you can get rid of the empty cells naturally. Also, most pandas functions work better that way, because then you can use things like group-by operations on all x or y or whatever dimension you need.
from itertools import chain
import numpy as np
import pandas as pd

## This part is only to have a running example;
## here you would load your Excel file
D = pd.DataFrame(
    np.random.randn(10, 6),
    columns=list(chain(*[[f"x{i}", f"y{i}"] for i in range(1, 4)]))
)
D["rowid"] = np.arange(len(D))   # pd.np is deprecated; use numpy directly
D = D.melt(id_vars="rowid").dropna()
D["varIndex"] = D.variable.str[1]
D["variable"] = D.variable.str[0]
D = D.set_index(["varIndex", "rowid", "variable"])\
     .unstack("variable")\
     .droplevel(0, axis=1)
So these transformations give you a table with an index both for the original row id (maybe it is a time series or something else) and for the variable index, so x1 or x2, etc.
Now you can do your calculations, for example by overwriting the previous columns:
## Everything here is a constant
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
# Apply the calculations directly to the columns
D.x = mapLongitudeStart + D.x * ratiox
D.y = mapLatitudeStart - (-D.y * ratioy)
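If you want the result back in the original wide layout (one x1/y1/x2/y2... column per variable), the reshaping can be undone; a sketch, continuing from the D built above:
wide = D.unstack("varIndex")                                  # varIndex back into the columns
wide.columns = [f"{var}{idx}" for var, idx in wide.columns]   # e.g. x1, x2, x3, y1, y2, y3
print(wide.head())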

How to multiprocess finding closest geographic point in two pandas dataframes?

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool
from geopy import distance   # distance.great_circle(...).feet comes from geopy's distance module

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol

def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points, which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]), callback=return_columns_cb)
    for col in return_columns:
        mdf = mdf.append(col)
    # I unzip my points back to longitude and latitude here in the final
    # dataframe so I can write to csv without tuples
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf

def multiprocess_combine_yield():
    '''do stuff to get the dictionary below, with each field name as key and
    all the files for that field as values'''
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields, and the loop below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
I guess what I need help with is this: I envision something like using a pool to imap or apply_async on each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to process the distance function in parallel. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points, and you then multiply all of that by 30 fields, so I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB, and the distance, though, because of what I do later on when selecting which rows to use from the 'mdf' dataframe.
Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply, I am now using cKDTree from scipy.spatial to find the closest point, and then using a vectorized haversine distance to calculate the true distances on each of these matched points. Much, much quicker. Here are the basics of the code below:
import re
from itertools import zip_longest

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are done this way due to the formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA

def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    main_coords = np.array(list(zip(mdf.Longitude, mdf.Latitude)))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
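For reference, the wiring for that last step might look like this (a sketch, assuming yield_by_field is the dictionary described earlier; the __main__ guard matters on platforms that spawn worker processes):
from multiprocessing import Pool

if __name__ == "__main__":
    # yield_by_field maps each field name to its tuple of files, as above
    with Pool() as p:
        results = p.map(combine_yield, yield_by_field.values())
    # results is a list with one combined dataframe per field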

How to find which input value in a loop yielded the output min?

I am trying to solve a minimum value problem. I could obtain the min values from two loops, but what I really need is also the exact input values that correspond to the output min.
from __future__ import division
from numpy import *

b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24)  # gram
c = 3.0*10**8

Mler = open("olasiM.txt", "w+")
data = zeros(0, 'float')
for A in range(1, 25):
    M2 = zeros(0, 'float')
    print 'A=', A
    for Z in range(1, A+1):
        SEMF = mu*c**2*(b1*A + b2*A**(2./3.) - b3*Z + b4*A*((1./2.) - (Z/A))**2 + (b5*Z**2)/(A**(1./3.)))
        SEMF = array(SEMF)
        M2 = hstack((M2, SEMF))
    minm2 = min(M2)
    data = hstack((data, minm2))
    data = hstack((data, A))
datalist = data.tolist()
for i in range(len(datalist)):
    Mler.write(str(datalist[i]) + '\n')
Mler.close()
Here, what I want is to see the min value of the SEMF and the corresponding A, Z values. For example, it should be A=1, Z=1 and SEMF = some number. I also don't know how to write these A and Z values to the document.
The big advantage of numpy over plain Python lists is vectorized operations. Unfortunately your code doesn't use them at all. For example, the whole inner loop that has Z as index can easily be vectorized; instead you are computing the single elements using Python floats and then stacking them one by one into the numpy array M2.
So I'd refactor that part of the code like this:
import numpy as np
# ...
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
Here the SEMF array should be exactly what you'd obtain as the final M2 array. Now you can find the minimum and stack that value into your data array:
min_val = SEMF.min()
data = np.hstack((data, min_val))
data = np.hstack((data, A))
If you also want to keep track of which value of Z gave you the minimum, you can use the argmin method:
min_val, min_pos = SEMF.min(), SEMF.argmin()   # min_pos is the index into Zs; the Z value itself is Zs[min_pos]
data = np.hstack((data, np.array([min_val, min_pos, A])))
The final code should look like:
from __future__ import division
import numpy as np

b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24)  # gram
c = 3.0*10**8

data = np.zeros(0, 'float')
for A in range(1, 25):
    Zs = np.arange(1, A+1, dtype=float)
    SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
    min_val, min_pos = SEMF.min(), SEMF.argmin()
    data = np.hstack((data, np.array([min_val, min_pos, A])))
datalist = data.tolist()
with open("olasiM.txt", "w+") as mler:
    for i in range(len(datalist)):
        mler.write(str(datalist[i]) + '\n')
Note that numpy provides functions to save/load arrays to/from files, like savetxt, so I suggest using those instead of writing the values manually.
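For example, collecting the results as rows of (min_SEMF, Z, A) and writing them in one call with savetxt might look like this (a sketch):
rows = []
for A in range(1, 25):
    Zs = np.arange(1, A+1, dtype=float)
    SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
    rows.append([SEMF.min(), Zs[SEMF.argmin()], A])   # store the actual Z value, not just its index
np.savetxt("olasiM.txt", np.array(rows), header="min_SEMF Z A")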
Probably some numpy expert could also vectorize the operations over the As. Unfortunately my numpy knowledge isn't that advanced, and I don't know how to handle the fact that we'd have a variable number of elements per A due to the range(1, A+1) thing...
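One way to handle it (a sketch, not part of the original answer): build a full 2-D grid of A and Z values, mask out the combinations where Z > A so they can never win, and take the minimum along the Z axis.
A_vals = np.arange(1, 25, dtype=float)
Z_vals = np.arange(1, 25, dtype=float)
A_grid, Z_grid = np.meshgrid(A_vals, Z_vals, indexing='ij')   # shape (24, 24)
SEMF = mu*c**2 * (b1*A_grid + b2*A_grid**(2./3.) - b3*Z_grid
                  + b4*A_grid*((1./2.) - (Z_grid/A_grid))**2
                  + (b5*Z_grid**2)/(A_grid**(1./3.)))
SEMF = np.where(Z_grid <= A_grid, SEMF, np.inf)   # invalid Z > A entries can never be the minimum
min_vals = SEMF.min(axis=1)                       # one minimum per A
min_Zs = Z_vals[SEMF.argmin(axis=1)]              # the Z that produced each minimum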
