For loop keeps returning empty arrays - python

I need some help with a for loop I have been trying to run. This is the code I have:
import numpy as np

cal_points = []
cal_stars = np.genfromtxt('M67_Calibration_Star_List.csv', delimiter=',', names=True)
radii = 0.00023

for star in range(len(cal_stars)):
    ra_l = cal_stars[star][1] - radii; ra_u = cal_stars[star][1] + radii
    dec_l = cal_stars[star][2] - radii; dec_u = cal_stars[star][2] + radii
    for i in range(len(M67_catalogue)):
        if ra_l <= M67_catalogue[i]['RA'] <= ra_u and dec_l <= M67_catalogue[i]['DEC'] <= dec_u:
            cal_points = cal_points + [star]

cal_points.sort()
print(len(cal_points))
print(cal_points)
This keeps returning len(cal_points) as 0 and cal_points as []
(The headers in the CSV file, together with a few of the row entries, were shown in an attached image.)
Please tell me where I'm going wrong.

Since you are trying to match a (small) catalogue of calibration stars with a catalogue of stars in M67, within a given radius(*), you may as well use Astropy. Astropy can do all the matching for you, and it takes into account that separations in RA "shrink" with declination on a sphere.
Here's some example code that creates two random DataFrames with calibration and catalogue positions, converts them to Astropy SkyCoord objects, and matches the two sets of positions. It then uses the result to look up the matched stars in the corresponding DataFrames and concatenates everything into a single DataFrame, including the other relevant information from the catalogue, such as the magnitude.
import pandas as pd
import astropy.units as u
from astropy.coordinates import SkyCoord
from numpy.random import default_rng

rng = default_rng()

n = 30
cal_stars = pd.DataFrame({'RAJ2000': 132 + rng.random(n),
                          'DECJ2000': 11 + rng.random(n),
                          'VTmag': 11 + rng.random(n)})
n = 200
M67_catalogue = pd.DataFrame({'RA': 132 + rng.random(n),
                              'DEC': 11 + rng.random(n),
                              'VTmag': 11 + rng.random(n)})

# Create coordinate arrays, using the relevant columns
# from the DataFrames
cal_stars_sc = SkyCoord(cal_stars['RAJ2000'] * u.deg,
                        cal_stars['DECJ2000'] * u.deg)
M67_catalogue_sc = SkyCoord(M67_catalogue['RA'] * u.deg,
                            M67_catalogue['DEC'] * u.deg)

# Slightly larger radius in this example;
# 0.00023 is too small for the random coordinates used here
sep = 0.023 * u.deg

# `idxm67` are the indices into the M67_catalogue_sc SkyCoord
# that have a counterpart within `sep` in `cal_stars_sc`.
# Similarly for `idxcal`.
# Note that an index (and thus a coordinate) may appear multiple times:
# a single source may be within `sep` of several sources in the
# other catalogue
idxm67, idxcal, dist, _ = cal_stars_sc.search_around_sky(M67_catalogue_sc, sep)

# We need to use `.iloc`, since `SkyCoord` follows standard (NumPy) indexing.
# Thus we need to ignore any index that the Pandas DataFrame may have
df1 = cal_stars.iloc[idxcal, :]
df2 = M67_catalogue.iloc[idxm67, :]
df2.columns = ['M67' + name for name in df2.columns]

# We also want to reset both DataFrame indices, because these were copied
# above when using .iloc.
# Resetting them makes sure df1 and df2 have the same indices
# and can be concatenated.
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

# axis=1 means to concatenate along the columns
df = pd.concat([df1, df2], axis=1)
# Add the found distances to the final DataFrame
df['dist'] = dist
print(df)
(*) I assume you want a radius, given the variable name, but the search in your code is within a rectangular region.
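If a rectangular box is actually what you want, that part of your original loop can also be done without the inner Python loop, using plain NumPy comparisons. A minimal sketch, assuming cal_stars is the structured array from genfromtxt with RA/Dec fields named RAJ2000 and DECJ2000 (adjust these to your actual CSV headers) and that M67_catalogue has 'RA' and 'DEC' fields:
import numpy as np

radius = 0.00023
cal_points = []
for star in range(len(cal_stars)):
    # assumed field names; use whatever your CSV header actually contains
    ra = cal_stars['RAJ2000'][star]
    dec = cal_stars['DECJ2000'][star]
    in_box = ((np.abs(M67_catalogue['RA'] - ra) <= radius) &
              (np.abs(M67_catalogue['DEC'] - dec) <= radius))
    if in_box.any():  # append each calibration star at most once
        cal_points.append(star)
print(len(cal_points), cal_points)
Note that, unlike the original code, this appends each calibration star only once, even if several catalogue entries fall inside its box.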
Here's the short version, without the comments and the creation of random data. It should be plug and play, provided M67_catalogue is actually a DataFrame, not a NumPy array (see the conversion one-liner after the code). Note that the second half, the creation of a matched DataFrame, is a bonus: cal_stars.iloc[idxcal, :] after using search_around_sky is enough to get your result.
import pandas as pd
from astropy.coordinates import SkyCoord
import astropy.units as u

cal_stars = pd.read_csv('M67_Calibration_Star_List.csv')
radius = 0.00023

cal_stars_sc = SkyCoord(cal_stars['RAJ2000'] * u.deg,
                        cal_stars['DECJ2000'] * u.deg)
M67_catalogue_sc = SkyCoord(M67_catalogue['RA'] * u.deg,
                            M67_catalogue['DEC'] * u.deg)
idxm67, idxcal, dist, _ = cal_stars_sc.search_around_sky(M67_catalogue_sc, radius * u.deg)

df1 = cal_stars.iloc[idxcal, :]
df2 = M67_catalogue.iloc[idxm67, :]
df2.columns = ['M67' + name for name in df2.columns]
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df = pd.concat([df1, df2], axis=1)
df['dist'] = dist
print(df)
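If M67_catalogue is currently a structured NumPy array (for example straight from np.genfromtxt), converting it to a DataFrame first should be a one-liner:
# the structured array's field names become the DataFrame's column names
M67_catalogue = pd.DataFrame(M67_catalogue)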

Related

Different results from interpolation if (the same data) is done with a time index

I get different results from interpolation when the same data is interpolated with a time index. How can that be?
The pandas docs say:
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods
are wrappers around the respective SciPy implementations of similar names.
These use the actual numerical values of the index. For more information
on their behavior, see the SciPy documentation and SciPy tutorial.
The methods in interpolate(method=...) where I noticed this strange behaviour are (among others):
['krogh', 'spline', 'pchip', 'akima', 'cubicspline']
Reproducible sample (with comparison):
import numpy as np
import pandas as pd
from math import isclose

# inputs:
no_timeindex = False        # reset both DataFrames' indices to numerical indices, for comparison
no_timeindex_for_B = True   # reset only the index of the second approach (B) to a numerical index; the other one keeps its datetime index, for comparison
holes = True                # create a datetime index that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17, instead of a perfectly regular one
o_ = 2                      # order parameter for interpolation
method_ = 'cubicspline'
# ------------------+
n = np.nan
arr = [n,n,10000000000 ,10,10,10000,10,10, 10,40,4,4,9,4,4,n,n,n,4,4,4,4,4,4,18,400000000,4,4,4,n,n,n,n,n,n,n,4,4,4,5,6000000000,4,5,4,5,4,3,n,n,n,n,n,n,n,n,n,n,n,n,n,4,n,n,n,n,n,n,n,n,n,n,n,n,n,n,2,n,n,n,10,1000000000,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,1,n,n,n,n,n,n,n,n,n]
# --------------------------------------------------------------------------------+
# first approach, with a datetime index
df = pd.DataFrame(arr)  # create dataframe from array
if holes:  # create a datetime index that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17
    ix = pd.date_range("01.01.2000", periods=len(df) + (2 + 5), freq="T")[2:]
    to_drop = [ix[6], ix[7], ix[12], ix[14], ix[17]]
    ix = ix.drop(to_drop)
    df.index = ix
else:  # create a perfectly regular datetime index without any holes
    ix = pd.date_range("01.01.2000", periods=len(df) + 2, freq="T")[2:]
    df.index = ix
# if wanted, drop the time index and replace it with integer indices
if no_timeindex == True:
    df.reset_index(inplace=True, drop=True)
df = df.interpolate(method=method_, order=o_, limit_area='inside')  # interpolate
df.index = ix  # set the index equal to the second approach, for comparing later
A = df.copy(deep=True)  # create a copy, to compare with the second approach later
# ------------------------------+
# second approach, with a numerical index instead of a time index
df = pd.DataFrame(arr)  # create dataframe from array
if holes:  # create a datetime index that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17
    ix = pd.date_range("01.01.2000", periods=len(df) + (2 + 5), freq="T")[2:]
    to_drop = [ix[6], ix[7], ix[12], ix[14], ix[17]]
    ix = ix.drop(to_drop)
    df.index = ix
else:  # create a perfectly regular datetime index without any holes
    ix = pd.date_range("01.01.2000", periods=len(df) + 2, freq="T")[2:]
    df.index = ix
# if wanted, drop the time index and replace it with integer indices
if no_timeindex == True or no_timeindex_for_B == True:
    df.reset_index(inplace=True, drop=True)
df = df.interpolate(method=method_, order=o_, limit_area='inside')  # interpolate
df.index = ix  # set the index equal to the first approach, for comparing later
B = df.copy(deep=True)  # create a copy, to compare with the first approach later
# --------------------------------------------------------------------------------+
# compare:
if A.equals(B) == False:
    # if the values aren't equal, count the ones that aren't
    i = 0
    for x, y in zip(A[A.columns[0]], B[B.columns[0]]):
        if x != y and not (np.isnan(x) and np.isnan(y)):
            print(x, " ?= ", y, " ", (x == y), abs(x - y))
            i += 1
    # if there are no different values, ...
    if i == 0:
        print(" both are the same. ")
    else:  # if there are different values, ...
        # count those different values that are NOT almost the same
        not_almost = 0
        for x, y in zip(A[A.columns[0]], B[B.columns[0]]):
            if not (np.isnan(x) and np.isnan(y)):
                if isclose(x, y, abs_tol=0.000001) == False:
                    not_almost += 1
        # if all values are almost the same, ...
        if not_almost == 0:
            print(" both are not, but almost the same. ")
        else:
            print(" both are definitely not the same. ")
else:
    print(" both are the same. ")
This shouldn't be the case, since the pandas docs state otherwise. Why does it happen anyway?
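For reference, a much smaller made-up example (five values, method 'pchip') already shows the same effect:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 9.0, 16.0, 25.0])

# positional integer index
print(s.interpolate(method='pchip'))

# the same data, but with an unevenly spaced DatetimeIndex
s_t = s.copy()
s_t.index = pd.to_datetime(['2000-01-01 00:00', '2000-01-01 00:01',
                            '2000-01-01 00:10', '2000-01-01 00:11',
                            '2000-01-01 00:12'])
print(s_t.interpolate(method='pchip'))  # the interpolated value differs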

vaex: shift column by n steps

I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you shift a column by n rows. Is there something similar in vaex?
I could not find anything comparable in the vaex documentation.
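For clarity, this is the pandas behaviour I would like to replicate (a tiny made-up example):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
# pandas shifts the column down by n rows and pads the start with NaN
df['x_lag2'] = df['x'].shift(2)
print(df)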
No, we do not support that yet (https://github.com/vaexio/vaex/issues/660). Because vaex is extensible (see http://docs.vaex.io/en/latest/tutorial.html#Adding-DataFrame-accessors), I thought I would give you the solution in the form of a DataFrame accessor:
import vaex
import numpy as np

@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    def shift(self, column, n, inplace=False):
        # make a copy without the column
        df = self.df.copy().drop(column)
        # make a copy with just the column
        df_column = self.df[[column]]
        # slice off the head and tail
        df_head = df_column[-n:]
        df_tail = df_column[:-n]
        # stitch them together
        df_shifted = df_head.concat(df_tail)
        # and join (based on row number)
        return df.join(df_shifted, inplace=inplace)

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df['shifted_y'] = df.y
df2 = df.mytool.shift('shifted_y', 2)
df2
It generates a single-column dataframe, slices it up, concatenates the pieces, and joins the result back, all without a single memory copy.
I am assuming here a cyclic shift/rotate.
The function needs to be modified slightly in order to work in the latest release (vaex 4.0.0ax); see this thread.
Code by Maarten should be updated as follows:
import vaex
import numpy as np

@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    # mytool.shift is the analogue of pandas shift(): it returns the shifted
    # column under the specified new name, which can then be joined back to
    # the initial df
    def shift(self, column, new_column, n, cyclic=True):
        df = self.df.copy().drop(column)
        df_column = self.df[[column]]
        if cyclic:
            df_head = df_column[-n:]
        else:
            df_head = vaex.from_dict({column: np.ma.filled(np.ma.masked_all(n, dtype=float), 0)})
        df_tail = df_column[:-n]
        df_shifted = df_head.concat(df_tail)
        df_shifted.rename(column, new_column)
        return df_shifted

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df2 = df.join(df.mytool.shift('y', 'shifted_y', 2))
df2

Lists/DataFrames - Running a function over all values in Python

I am stuck at the moment and don't really know how to solve this problem.
I want to apply this calculation (written out towards the end of the question) to a list/DataFrame.
The equation itself is not really the problem for me; I am able to solve it easily by hand, but that wouldn't do with the amount of data I have.
v: value to be approximated
vi: known values (in my case temperatures)
di: distance to the approximated point
So basically this is for calculating/approximating a new temperature value for a position a certain distance away from the corners of the square:
import pandas as pd
import numpy as np
import xarray as xr
import math

filepath = r'F:\Data\data.nc'  # just the path to the file
obj = xr.open_dataset(filepath)

# This is where I get the coordinates for each of the corners of the square
# from the netCDF4 file
lat = 9.7398
lon = 51.2695
xlat = obj['XLAT'].values
xlon = obj['XLON'].values
p_1 = [xlat[0, 0], xlon[0, 0]]
p_2 = [xlat[0, 1], xlon[0, 1]]
p_3 = [xlat[1, 0], xlon[1, 0]]
p_4 = [xlat[1, 1], xlon[1, 1]]
p_rect = [p_1, p_2, p_3, p_4]
p_orig = [lat, lon]

# =================================================
# Calculates the distance between the points
# d = sqrt((x2-x1)^2 + (y2-y1)^2)
# =================================================
distance = []
for coord in p_rect:
    distance.append(math.sqrt(math.pow(coord[0] - p_orig[0], 2) + math.pow(coord[1] - p_orig[1], 2)))

# to get the values for the key 'WS', for example:
a = obj['WS'].values[:, 0, 0, 0]  # array of floats for the first corner
b = obj['WS'].values[:, 0, 0, 1]  # array of floats for the second corner
c = obj['WS'].values[:, 0, 1, 0]  # array of floats for the third corner
d = obj['WS'].values[:, 0, 1, 1]  # array of floats for the fourth corner
From then on, I have no idea how I should continue. Should I do:
df = pd.DataFrame()
df['a'] = a
df['b'] = b
df['c'] = c
df['d'] = d
and then somehow work with the DataFrame and drop a, b, c, d once I have the needed values? Or should I do it with lists first and only add the result to the DataFrame? I am a bit lost.
The only thing I have come up with so far is how it would look if I did it manually:
for i starting at 0 and ending at the end of the lists (a, b, c and d all have the same length):

        1/a[i]^2 * distance[0] + 1/b[i]^2 * distance[1] + 1/c[i]^2 * distance[2] + 1/d[i]^2 * distance[3]
v = -----------------------------------------------------------------------------------------------------
                             1/a[i]^2 + 1/b[i]^2 + 1/c[i]^2 + 1/d[i]^2
This is the first time I have had such a (at least for me) complex calculation to do on a list/DataFrame. I hope you can help me solve this problem or at least nudge me in the right direction.
PS: here is the link to the file:
LINK TO FILE
Simply vectorize your calculations. With DataFrames you can run arithmetic operations directly on whole columns, as if they were scalars, to generate another column, df['v']. The code below assumes distance is a list of four scalars; also remember that in Python ^ does not mean power, use ** instead.
df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})

df['v'] = (1/df['a']**2 * distance[0] +
           1/df['b']**2 * distance[1] +
           1/df['c']**2 * distance[2] +
           1/df['d']**2 * distance[3]) / (1/df['a']**2 +
                                          1/df['b']**2 +
                                          1/df['c']**2 +
                                          1/df['d']**2)
Or use the functional form with the Pandas Series binary operators. The code below follows the order of operations (parentheses --> exponents --> multiplication/division --> addition/subtraction):
df['v'] = (df['a'].pow(2).pow(-1).mul(distance[0]) +
           df['b'].pow(2).pow(-1).mul(distance[1]) +
           df['c'].pow(2).pow(-1).mul(distance[2]) +
           df['d'].pow(2).pow(-1).mul(distance[3])) / (df['a'].pow(2).pow(-1) +
                                                       df['b'].pow(2).pow(-1) +
                                                       df['c'].pow(2).pow(-1) +
                                                       df['d'].pow(2).pow(-1))

How to add some calculation to columns of a dataframe in Python

I am reading an Excel sheet using pandas.read_excel and I get the output in a DataFrame, but after reading it with pandas I need to apply the following calculation to each of the x and y columns.
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

longitude, latitude = 0, 0
longitude = mapLongitudeStart + x1 * ratiox    # I have taken the value of the single column x1
latitude = mapLatitudeStart - (-y1 * ratioy)   # taken from column y1
How do I apply this calculation to every x and y column and to every row that has values (it should skip the null values)? And I want a new DataFrame created from the results of the calculation.
Try the below code:
import pandas as pd
import itertools

df = pd.read_excel('file_path')
dfx = df.loc[:, 'x1'::2]
dfy = df.loc[:, 'y1'::2]
li = [dfx.apply(lambda x: mapLongitudeStart + x * ratiox),
      dfy.apply(lambda y: mapLatitudeStart - (-y * ratioy))]
df_new = pd.concat(li, axis=1)
df_new = df_new[list(itertools.chain(*zip(dfx.columns, dfy.columns)))]
print(df_new)
Hope this helps!
I would first recommend reshaping your data into a long format; that way you can get rid of the empty cells naturally. Also, most pandas functions work better that way, because then you can use things like group-by operations on all x or y or whatever dimension.
from itertools import chain
import numpy as np
import pandas as pd

## this part is only to have a running example;
## here you would load your Excel file
D = pd.DataFrame(
    np.random.randn(10, 6),
    columns=list(chain(*[[f"x{i}", f"y{i}"] for i in range(1, 4)]))
)
D["rowid"] = np.arange(len(D))
D = D.melt(id_vars="rowid").dropna()
D["varIndex"] = D.variable.str[1]
D["variable"] = D.variable.str[0]
D = D.set_index(["varIndex", "rowid", "variable"])\
     .unstack("variable")\
     .droplevel(0, axis=1)
So these transformations give you a table with an index both for the original row id (maybe it is a time series or something else) and for the variable index, i.e. x1, x2, etc.
Now you can do your calculations, either by overwriting the previous columns:
## Everything here is a constant
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

# apply the calculations directly to the columns
D.x = mapLongitudeStart + D.x * ratiox
D.y = mapLatitudeStart - (-D.y * ratioy)
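Or, if you prefer to keep the original x and y values, store the results in new columns instead; a minimal sketch (the column names longitude and latitude are just an example):
# same constants as above; write the results into new columns
D["longitude"] = mapLongitudeStart + D.x * ratiox
D["latitude"] = mapLatitudeStart - (-D.y * ratioy)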

Python Pandas Panel counting value occurrence

I have a large dataset stored as a pandas Panel. I would like to count the occurrence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first DataFrame
dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)

#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18', '2014-10-20', freq='H')
df2 = pd.DataFrame(index=dates2)
n2 = len(dates2)
df2.loc[:, 'a'] = np.random.uniform(3, 10, n2)
df2.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n2)

#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)

#%% I want to count the number of values < 1.0 for all datasets in the panel,
## only for minor axis b, not minor axis a, stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this pandas Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me, guys, but I managed to figure out a surprisingly easy solution after many hours of trying. I thought I should share it in case someone else is looking for something similar.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)
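If you want the counts stored separately for each dataset (as asked for in the question), a small extension of this is to collect them in a dictionary keyed by item name; a sketch along the same lines (note that pd.Panel only exists in older pandas versions):
counts = {}
for dataset in P:
    b_values = P.loc[dataset, :, 'b']
    counts[dataset] = int((b_values < 1.0).sum())  # one count per item in the Panel
print(counts)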
