I get different results from interpolation when the same data has a datetime index instead of a numerical index. How can that be?
The pandas docs say:
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods
are wrappers around the respective SciPy implementations of similar names.
These use the actual numerical values of the index. For more information
on their behavior, see the SciPy documentation and SciPy tutorial.
The methods of interpolate(method=...) where I noticed this strange behavior are (among others):
['krogh', 'spline', 'pchip', 'akima', 'cubicspline']
Reproducible sample (with comparison):
import numpy as np , pandas as pd
from math import isclose
# inputs:
no_timeindex = False # reset both dataframes indices to numerical indices # for comparison.
no_timeindex_for_B = True # reset only dataframe indices of the first approach to numerical indices, the other one stays datetime, for comparison.
holes = True # create date-timeindex that skips the timestamps, that would normally be at location 6,7,12, 14, 17, instead of a perfectly frequent one.
o_ = 2 # order parameter for interpolation.
method_ = 'cubicspline'
#------------------+
n = np.nan
arr = [n,n,10000000000 ,10,10,10000,10,10, 10,40,4,4,9,4,4,n,n,n,4,4,4,4,4,4,18,400000000,4,4,4,n,n,n,n,n,n,n,4,4,4,5,6000000000,4,5,4,5,4,3,n,n,n,n,n,n,n,n,n,n,n,n,n,4,n,n,n,n,n,n,n,n,n,n,n,n,n,n,2,n,n,n,10,1000000000,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,1,n,n,n,n,n,n,n,n,n]
#--------------------------------------------------------------------------------+
df = pd.DataFrame(arr) # create dataframe from array.
if holes: # create a datetime index that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17.
    ix = pd.date_range("01.01.2000", periods=len(df) + (2 + 5), freq="min")[2:]
    to_drop = [ix[6], ix[7], ix[12], ix[14], ix[17]]
    ix = ix.drop(to_drop)
    df.index = ix
else: # create a perfectly regular datetime index without any holes.
    ix = pd.date_range("01.01.2000", periods=len(df) + 2, freq="min")[2:]
    df.index = ix
# if wanted, drop the datetime index and use an integer index instead
if no_timeindex:
    df.reset_index(inplace=True, drop=True)
df = df.interpolate(method=method_, order=o_, limit_area='inside') # interpolate.
df.index = ix # set index equal to the second approach, for comparing later.
A = df.copy(deep=True) # create a copy, to compare the result with the second approach later.
#------------------------------+
# second approach with numerical index instead of index-wise
df = pd.DataFrame(arr) # create dataframe from array.
if holes: # create a datetime index that skips the timestamps that would normally be at locations 6, 7, 12, 14, 17.
    ix = pd.date_range("01.01.2000", periods=len(df) + (2 + 5), freq="min")[2:]
    to_drop = [ix[6], ix[7], ix[12], ix[14], ix[17]]
    ix = ix.drop(to_drop)
    df.index = ix
else: # create a perfectly regular datetime index without any holes.
    ix = pd.date_range("01.01.2000", periods=len(df) + 2, freq="min")[2:]
    df.index = ix
# if wanted, drop the datetime index and use an integer index instead
if no_timeindex or no_timeindex_for_B:
    df.reset_index(inplace=True, drop=True)
df = df.interpolate(method=method_, order=o_, limit_area='inside') # interpolate.
df.index = ix # set index equal to the first approach, for comparing later.
B = df.copy(deep=True) # create a copy, to compare the result with the first approach.
#--------------------------------------------------------------------------------+
# compare:
if not A.equals(B):
    # if the frames aren't equal, print and count the values that differ.
    i = 0
    for x, y in zip(A[A.columns[0]], B[B.columns[0]]):
        if x != y and not (np.isnan(x) and np.isnan(y)):
            print(x, " ?= ", y, " ", (x == y), abs(x - y))
            i += 1
    # if there are no differing values, ...
    if i == 0:
        print(" both are the same. ")
    else: # if there are differing values, ...
        # count the differing values that are NOT almost the same.
        not_almost = 0
        for x, y in zip(A[A.columns[0]], B[B.columns[0]]):
            if not (np.isnan(x) and np.isnan(y)):
                if not isclose(x, y, abs_tol=0.000001):
                    not_almost += 1
        # if all values are almost the same, ...
        if not_almost == 0:
            print(" both are not, but almost the same. ")
        else:
            print(" both are definitely not the same. ")
else:
    print(" both are the same. ")
This shouldn't be the case, since the pandas docs state otherwise. Why does it happen anyway?
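One way to see why the index matters is a toy series (an assumption for illustration, not the data above) with the simpler 'linear' and 'index' methods instead of the SciPy-backed ones. 'linear' treats the points as equally spaced, while 'index' uses the index values as x-coordinates. According to the docs quoted above, the SciPy wrappers also use the numerical index values, so replacing a datetime index that has holes with 0..n-1 changes the x-coordinates and can therefore change the interpolated values.

```python
import numpy as np
import pandas as pd

# Toy series with an irregular index: the gap between 1 and 10 is a "hole".
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index and assumes equally spaced points.
by_position = s.interpolate(method="linear")

# 'index' uses the index values as x-coordinates.
by_index = s.interpolate(method="index")

# After resetting to a regular 0..n-1 index, both views coincide.
regular = s.reset_index(drop=True)
regular_by_index = regular.interpolate(method="index")

print(by_position.iloc[1], by_index.iloc[1], regular_by_index.iloc[1])
```

With a perfectly regular index the two views agree up to scaling, which would be consistent with the differences appearing only when holes is True.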
I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you shift a column by n rows. Is there something similar in vaex?
I could not find anything comparable in the vaex documentation.
No, we do not support that yet (https://github.com/vaexio/vaex/issues/660). Because vaex is extensible (see http://docs.vaex.io/en/latest/tutorial.html#Adding-DataFrame-accessors) I thought I would give you the solution in the form of that:
import vaex
import numpy as np
@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    def shift(self, column, n, inplace=False):
        # make a copy without column
        df = self.df.copy().drop(column)
        # make a copy with just the column
        df_column = self.df[[column]]
        # slice off the head and tail
        df_head = df_column[-n:]
        df_tail = df_column[:-n]
        # stitch them together
        df_shifted = df_head.concat(df_tail)
        # and join (based on row number)
        return df.join(df_shifted, inplace=inplace)

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df['shifted_y'] = df.y
df2 = df.mytool.shift('shifted_y', 2)
df2
It generates a single-column DataFrame, slices that up, concatenates the pieces and joins the result back. All without a single memory copy.
I am assuming here a cyclic shift/rotate.
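The head/tail stitching above can be sketched in plain NumPy to show what "cyclic" means here. This is only an illustration of the semantics on an in-memory array, not of how vaex evaluates it; the variable names are made up for the example.

```python
import numpy as np

y = np.arange(10) ** 2   # same toy column as in the answer
n = 2                    # shift by two rows

head = y[-n:]            # the last n values wrap around to the front
tail = y[:-n]            # everything else moves down by n rows
shifted = np.concatenate([head, tail])

# For a 1-D array this is the same rotation that np.roll performs.
same_as_roll = np.array_equal(shifted, np.roll(y, n))
print(shifted, same_as_roll)
```

Note that pandas' shift(n) is not cyclic: it fills the vacated rows with NaN by default instead of wrapping values around.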
The function needs to be modified slightly in order to work in the latest release (vaex 4.0.0ax), see this thread.
Code by Maarten should be updated as follows:
import vaex
import numpy as np
@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    # mytool.shift is the analog of pandas' shift(), but it returns the shifted
    # column under the specified name so it can be joined to the initial df
    def shift(self, column, new_column, n, cyclic=True):
        df_column = self.df[[column]]
        if cyclic:
            # wrap the last n values around to the front
            df_head = df_column[-n:]
        else:
            # fill the first n rows with zeros instead of wrapping around
            df_head = vaex.from_dict({column: np.ma.filled(np.ma.masked_all(n, dtype=float), 0)})
        df_tail = df_column[:-n]
        df_shifted = df_head.concat(df_tail)
        df_shifted.rename(column, new_column)
        return df_shifted

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df2 = df.join(df.mytool.shift('y', 'shifted_y', 2))
df2
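For the non-cyclic branch, the padding expression from the code above can be checked in plain NumPy. A minimal sketch: the masked-array expression produces n zeros, so the first n rows of the shifted column are 0. The NaN variant shown next to it is an assumption about what one might prefer in order to mimic pandas' default fill; it is not part of the answer above.

```python
import numpy as np

y = np.arange(10, dtype=float) ** 2
n = 2

# Same padding expression as in the non-cyclic branch: an all-masked
# float array whose masked entries are filled with 0, i.e. n zeros.
pad = np.ma.filled(np.ma.masked_all(n, dtype=float), 0)
shifted_zero = np.concatenate([pad, y[:-n]])

# Hypothetical alternative: pad with NaN so padded rows are
# distinguishable from real zeros in the data.
shifted_nan = np.concatenate([np.full(n, np.nan), y[:-n]])

print(shifted_zero)
print(shifted_nan)
```

Padding with zeros is indistinguishable from a genuine value of 0 (here y[0] is also 0), which matters if the shifted column feeds a model as a lagged feature.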
I am stuck at the moment and don't really know how to solve this problem.
I want to apply this calculation to a list/dataframe:
The equation itself is not really the problem for me, I am able to easily solve it manually, but that wouldn't do with the amount of data I have.
v : value to be approximated
vi: known values (in my case Temperatures)
di: distance to the approximated point
So basically this is for calculating/approximating a new temperature value for a position a certain distance away from the corners of the square:
import pandas as pd
import numpy as np
import xarray as xr
import math
filepath = r'F:\Data\data.nc' # just the path to the file
obj= xr.open_dataset(filepath)
# This is where I get the coordinates for each of the corners of the square
# from the netcdf4 file
lat = 9.7398
lon = 51.2695
xlat = obj['XLAT'].values
xlon = obj['XLON'].values
p_1 = [xlat[0,0], xlon[0,0]]
p_2 = [xlat[0,1], xlon[0,1]]
p_3 = [xlat[1,0], xlon[1,0]]
p_4 = [xlat[1,1], xlon[1,1]]
p_rect = [p_1, p_2, p_3, p_4]
p_orig = [lat, lon]
#=================================================
# Calculates the distance between the points
# d = sqrt((x2-x1)^2 + (y2-y1)^2))
#=================================================
distance = []
for coord in p_rect:
    distance.append(math.sqrt(math.pow(coord[0]-p_orig[0],2)+math.pow(coord[1]-p_orig[1],2)))
# to get the values for the key 'WS', for example:
a = obj['WS'].values[:,0,0,0] # Array of floats for the first values
b = obj['WS'].values[:,0,0,1] # Array of floats for the second values
c = obj['WS'].values[:,0,1,0] # Array of floats for the third values
d = obj['WS'].values[:,0,1,1] # Array of floats for the fourth values
From then on, I have no idea how I should continue, should I do:
df = pd.DataFrame()
df['a'] = a
df['b'] = b
df['c'] = c
df['d'] = d
Then somehow work with DataFrames, and drop abcd after I got the needed values or should I do it with lists first, then add only the result to the dataframe. I am a bit lost.
The only thing I have come up with so far is what it would look like if I did it manually:
for i starting at 0 and ending when the end of the lists is reached (a, b, c and d have the same length):
    1/a[i]^2*distance[0] + 1/b[i]^2*distance[1] + 1/c[i]^2*distance[2] + 1/d[i]^2*distance[3]
v = ------------------------------------------------------------------------------------------
    1/a[i]^2 + 1/b[i]^2 + 1/c[i]^2 + 1/d[i]^2
This is the first time I had such a (at least for me) complex calculation on a list/dataframe. I hope you can help me solve this problem or at least nudge me in the right direction.
PS: here is the link to the file:
LINK TO FILE
Simply vectorize your calculations. With data frames you can run whole arithmetic operations directly on columns as if they were scalars to generate another column, df['v']. The code below assumes distance is a list of four scalars. Remember that in Python ^ does not mean power; use ** instead.
df = pd.DataFrame({'a':a, 'b':b, 'c':c, 'd':d})
df['v'] = (1/df['a']**2 * distance[0] +
1/df['b']**2 * distance[1] +
1/df['c']**2 * distance[2] +
1/df['d']**2 * distance[3]) / (1/df['a']**2 +
1/df['b']**2 +
1/df['c']**2 +
1/df['d']**2)
Or the functional form using Pandas Series binary operators. Below follows the order of operations (Parentheses --> Exponential --> Multiplication/Division --> Addition/Subtraction):
df['v'] = (df['a'].pow(2).pow(-1).mul(distance[0]) +
df['b'].pow(2).pow(-1).mul(distance[1]) +
df['c'].pow(2).pow(-1).mul(distance[2]) +
df['d'].pow(2).pow(-1).mul(distance[3])) / (df['a'].pow(2).pow(-1) +
df['b'].pow(2).pow(-1) +
df['c'].pow(2).pow(-1) +
df['d'].pow(2).pow(-1))
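Both versions above reproduce the formula exactly as written in the question, where the inverse squares of the values weight the distances. Given the symbol definitions in the question (vi known values, di distances), the intended calculation may instead be standard inverse-distance weighting, v = sum(vi / di^2) / sum(1 / di^2). That is an assumption, not something the question confirms. A minimal NumPy sketch of that variant, with a hypothetical helper name and made-up numbers:

```python
import numpy as np

def inverse_distance_estimate(values, distances, power=2):
    """Hypothetical helper: weight each corner value by 1 / distance**power.

    values:    array of shape (n_times, n_corners)
    distances: array of shape (n_corners,), assumed to be non-zero
    """
    values = np.asarray(values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    weights = 1.0 / distances ** power
    # Broadcasting applies the same corner weights to every time step.
    return (values * weights).sum(axis=1) / weights.sum()

# Made-up corner values for two time steps (columns play the role of a, b, c, d).
values = np.array([[10.0, 10.0, 10.0, 10.0],
                   [1.0, 2.0, 3.0, 4.0]])

v_equal = inverse_distance_estimate(values, [1.0, 1.0, 1.0, 1.0])
v_near_first = inverse_distance_estimate(values, [1.0, 2.0, 2.0, 2.0])
print(v_equal, v_near_first)
```

With equal distances the estimate reduces to the plain mean of the four corners, and a corner that is closer pulls the estimate towards its own value. With the arrays from the question, values could be built with np.column_stack([a, b, c, d]).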
I am reading an Excel sheet with pandas.read_excel and get the output as a DataFrame. After reading it, I need to apply the following calculation to each of the x and y columns.
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
longitude = 0
latitude = 0
longitude = mapLongitudeStart + x1 * ratiox  # computed here for the single column x1
latitude = mapLatitudeStart - (-y1 * ratioy)  # computed here for the single column y1
How do I apply this calculation to every row of every x and y column that has values, skipping the null values? I want a new DataFrame created from the calculated columns.
Try the below code:
import pandas as pd
import itertools
df = pd.read_excel('file_path')
# .ix was removed from pandas; .loc supports the same label-based slicing
dfx = df.loc[:, 'x1'::2]
dfy = df.loc[:, 'y1'::2]
li = [dfx.apply(lambda x: mapLongitudeStart + x * ratiox), dfy.apply(lambda y: mapLatitudeStart - (-y * ratioy))]
df_new=pd.concat(li,axis=1)
df_new = df_new[list(itertools.chain(*zip(dfx.columns,dfy.columns)))]
print(df_new)
Hope this helps!
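The same idea can be sketched without apply, since arithmetic on DataFrame columns is vectorized and NaN cells simply stay NaN. The tiny frame below is made up for illustration, and the assumption that the columns are named x1, y1, x2, y2, ... comes from the question; selecting them by regex is one option among several.

```python
import numpy as np
import pandas as pd

# Constants taken from the question.
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

# Made-up stand-in for the Excel sheet; NaN marks empty cells.
df = pd.DataFrame({
    "x1": [0.0, 720.0], "y1": [0.0, 1184.0],
    "x2": [360.0, np.nan], "y2": [592.0, np.nan],
})

# Pick the x and y columns by name pattern.
x_cols = df.filter(regex=r"^x\d+$").columns
y_cols = df.filter(regex=r"^y\d+$").columns

# Vectorized arithmetic: empty (NaN) cells propagate as NaN.
out = df.copy()
out[x_cols] = mapLongitudeStart + df[x_cols] * ratiox
out[y_cols] = mapLatitudeStart - (-df[y_cols] * ratioy)
print(out)
```

The result keeps the original column order, so no re-interleaving step is needed.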
I would first recommend reshaping your data into a long format; that way you get rid of the empty cells naturally. Most pandas functions also work better that way, because you can then use things like group-by operations on all x or y or whatever dimension.
from itertools import chain
import numpy as np
import pandas as pd
## this part is only to have a running example
## here you would load your excel file
D = pd.DataFrame(
    np.random.randn(10, 6),
    columns=list(chain(*[[f"x{i}", f"y{i}"] for i in range(1, 4)]))
)
D["rowid"] = np.arange(len(D))
D = D.melt(id_vars="rowid").dropna()
D["varIndex"] = D.variable.str[1]
D["variable"] = D.variable.str[0]
D = D.set_index(["varIndex", "rowid", "variable"])\
    .unstack("variable")\
    .droplevel(0, axis=1)
So these transformations will give you a table where you have an index both for the original row id (maybe it is a time series or something else), and the variable index so x1 or x2 etc.
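To make the effect of the reshaping concrete, here is a small sketch on a made-up 2x4 frame with one empty x2/y2 pair. It follows the same melt/unstack steps, with two small deviations that are my own choices: str[1:] instead of str[1] so that two-digit suffixes such as x10 would also work, and selecting the "value" column before unstacking instead of dropping a column level afterwards.

```python
import numpy as np
import pandas as pd

# Made-up wide table; the second row has no x2/y2 entry.
wide = pd.DataFrame({
    "x1": [1.0, 2.0], "y1": [10.0, 20.0],
    "x2": [3.0, np.nan], "y2": [30.0, np.nan],
})
wide["rowid"] = np.arange(len(wide))

# Long format: one row per (rowid, variable); empty cells are dropped.
long_df = wide.melt(id_vars="rowid").dropna()
long_df["varIndex"] = long_df["variable"].str[1:]  # "1", "2", ...
long_df["variable"] = long_df["variable"].str[0]   # "x" or "y"

# Back to one row per (varIndex, rowid) with x and y side by side.
tidy = long_df.set_index(["varIndex", "rowid", "variable"])["value"].unstack("variable")
print(tidy)
```

The empty pair never appears in the result, so later calculations on the x and y columns do not have to special-case missing values.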
Now you can do your calculations, for example by overwriting the previous columns:
## Everything here is a constant
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
# apply the calculations directly to the columns
D["x"] = mapLongitudeStart + D["x"] * ratiox
D["y"] = mapLatitudeStart - (-D["y"] * ratioy)
I have a large dataset stored as a pandas Panel. I would like to count the occurrences of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
#%% Creating the first Dataframe
dates1 = pd.date_range('2014-10-19','2014-10-20',freq='H')
df1 = pd.DataFrame(index = dates1)
n1 = len(dates1)
df1.loc[:,'a'] = np.random.uniform(3,10,n1)
df1.loc[:,'b'] = np.random.uniform(0.9,1.2,n1)
#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18','2014-10-20',freq='H')
df2 = pd.DataFrame(index = dates2)
n2 = len(dates2)
df2.loc[:,'a'] = np.random.uniform(3,10,n2)
df2.loc[:,'b'] = np.random.uniform(0.9,1.2,n2)
#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)
#%% I want to count the number of values < 1.0 for all datasets in the panel
## Only for minor axis b, not minor axis a, stored separately for each dataset
for dataset in P:
    P.loc[dataset,:,'b'] # I need to count the number of values < 1.0 in this pandas Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me guys, but I managed to figure out a surprisingly easy solution after many hours of attempting. I thought I should share it in case someone else is looking for a similar solution.
for dataset in P:
    abc = P.loc[dataset,:,'b']
    abc_low = sum(i < 1.0 for i in abc)
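Since pd.Panel was removed in pandas 0.25, the same per-dataset count can be sketched with a plain dict of DataFrames. This is an alternative under that assumption, not the approach used above, and the small frames with fixed values are made up so the counts are predictable.

```python
import pandas as pd

# Made-up stand-ins for the two datasets, keyed like the panel items.
datasets = {
    "First_dataset": pd.DataFrame({"a": [3.0, 5.0, 7.0], "b": [0.95, 1.10, 0.99]}),
    "Second dataset": pd.DataFrame({"a": [4.0, 6.0], "b": [1.05, 0.90]}),
}

# A boolean comparison followed by sum() counts the True entries,
# so no Python-level loop over the individual values is needed.
counts = {name: int((frame["b"] < 1.0).sum()) for name, frame in datasets.items()}
print(counts)
```

Keeping the result in a dict stores one count per dataset, whereas the loop above overwrites abc_low on every iteration.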