How to add some calculation in columns of the dataframe in Python

I am reading an Excel sheet with pandas.read_excel and get the output as a DataFrame. After reading it, I need to apply the following calculation to each x and y column.
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
longitude, latitude = 0, 0
longitude = mapLongitudeStart + x1 * ratiox  # computed here for a single value from column x1
latitude = mapLatitudeStart - (-y1 * ratioy)  # computed here for a single value from column y1
How can I apply this calculation to every row of every x and y column that has a value, skipping the null values? I want a new DataFrame built from the calculated columns.

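Try the code below (ratiox, ratioy, mapLongitudeStart and mapLatitudeStart are the constants from the question):
import pandas as pd
import itertools
df = pd.read_excel('file_path')
# .ix was removed in pandas 1.0; .loc accepts the same label slice with a step
dfx = df.loc[:, 'x1'::2]  # every second column starting at x1: x1, x2, ...
dfy = df.loc[:, 'y1'::2]  # every second column starting at y1: y1, y2, ...
# NaN cells stay NaN, so empty cells are not turned into coordinates
li = [dfx.apply(lambda x: mapLongitudeStart + x * ratiox),
      dfy.apply(lambda y: mapLatitudeStart - (-y * ratioy))]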
df_new=pd.concat(li,axis=1)
df_new = df_new[list(itertools.chain(*zip(dfx.columns,dfy.columns)))]
print(df_new)
Hope this helps!
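A minimal, self-contained sketch of the same wide-format calculation, assuming the columns are named x1, y1, x2, y2 and so on. The small DataFrame is made-up illustration data standing in for the Excel sheet, and selecting columns by regex is just one possible way to pick the x and y columns. Empty cells stay NaN because arithmetic on NaN yields NaN.

```python
import numpy as np
import pandas as pd

# Constants from the question
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

# Hypothetical stand-in for pd.read_excel(...); NaN marks empty cells
df = pd.DataFrame({
    "x1": [0.0, 720.0], "y1": [0.0, 1184.0],
    "x2": [360.0, np.nan], "y2": [592.0, np.nan],
})

# Pick the x and y columns by name pattern instead of by position
x_cols = df.filter(regex=r"^x\d+$").columns
y_cols = df.filter(regex=r"^y\d+$").columns

df_new = df.copy()
df_new[x_cols] = mapLongitudeStart + df[x_cols] * ratiox
df_new[y_cols] = mapLatitudeStart - (-df[y_cols] * ratioy)
print(df_new)
```

The column order of df_new matches the input, so no re-interleaving step is needed.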

I would first recommend reshaping your data into a long format; that way you get rid of the empty cells naturally. Most pandas functions also work better that way, because you can then use things like group-by operations on all x or y values, or on whatever dimension you need.
from itertools import chain
import numpy as np
import pandas as pd
## this part is only to have a running example
## here you would load your excel file
D = pd.DataFrame(
    np.random.randn(10, 6),
    columns=list(chain.from_iterable([f"x{i}", f"y{i}"] for i in range(1, 4)))
)
D["rowid"] = np.arange(len(D))
D = D.melt(id_vars="rowid").dropna()
D["varIndex"] = D.variable.str[1:]  # numeric suffix, e.g. "1" for x1
D["variable"] = D.variable.str[0]   # leading letter, "x" or "y"
D = D.set_index(["varIndex", "rowid", "variable"])\
    .unstack("variable")\
    .droplevel(0, axis=1)
These transformations give you a table indexed by both the original row id (maybe it is a time series or something else) and the variable index (the 1 in x1, the 2 in x2, and so on), with one x column and one y column.
Now you can do your calculations, for example by overwriting the previous columns:
## Everything here is a constant
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293
# apply the calculations directly to the columns
D["x"] = mapLongitudeStart + D["x"] * ratiox
D["y"] = mapLatitudeStart - (-D["y"] * ratioy)
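As a possible alternative to the melt/unstack chain, pandas also has pd.wide_to_long, which does the same reshape in one call when the columns follow a stub-plus-number pattern. The sketch below assumes column names x1, y1, x2, y2; the data and the names wide, long_df, longitude and latitude are hypothetical illustrations.

```python
import numpy as np
import pandas as pd

# Constants from the question
ratiox = (73.77481944859028 - 73.7709567323327) / 720
ratioy = (18.567453940477293 - 18.56167674097576) / 1184
mapLongitudeStart = 73.7709567323327
mapLatitudeStart = 18.567453940477293

# Hypothetical wide table; NaN marks empty cells
wide = pd.DataFrame({
    "x1": [0.0, 720.0], "y1": [0.0, 1184.0],
    "x2": [360.0, np.nan], "y2": [592.0, np.nan],
})
wide["rowid"] = np.arange(len(wide))

# One row per (rowid, varIndex) pair, with plain "x" and "y" columns;
# pairs where both x and y are empty are dropped
long_df = pd.wide_to_long(wide, stubnames=["x", "y"], i="rowid", j="varIndex")
long_df = long_df.dropna(how="all")

long_df["longitude"] = mapLongitudeStart + long_df["x"] * ratiox
long_df["latitude"] = mapLatitudeStart - (-long_df["y"] * ratioy)
print(long_df)
```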

Related

Subset a DataFrame

If I have this data frame:
df = pd.DataFrame(
{"A":[45,67,12,78,92,65,89,12,34,78],
"B":["h","b","f","d","e","t","y","p","w","q"],
"C":[True,False,False,True,False,True,True,True,True,True]})
How can I select 50% of the rows, so that column "C" is True in 90% of the selected rows and False in 10% of them?
First, create a DataFrame with 1000 rows:
import pandas as pd
df = pd.DataFrame(
{"A":[45,67,12,78,92,65,89,12,34,78],
"B":["h","b","f","d","e","t","y","p","w","q"],
"C":[True,False,False,True,False,True,True,True,True,True]})
df = pd.concat([df]*100)
print(df)
Second, compute true_row_num and false_row_num:
row_num, _ = df.shape
true_row_num = int(row_num * 0.5 * 0.9)
false_row_num = int(row_num * 0.5 * 0.1)
print(true_row_num, false_row_num)
Third, randomly sample true_df and false_df separately:
true_df = df[df["C"]].sample(true_row_num)
false_df = df[~df["C"]].sample(false_row_num)
new_df = pd.concat([true_df, false_df])
new_df = new_df.sample(frac=1.0).reset_index(drop=True) # shuffle
print(new_df["C"].value_counts())
I think it works if you calculate the needed group sizes up front and then sample randomly per group. Something like this (DataFrame.append was removed in pandas 2.0, so this uses pd.concat):
new = pd.concat([df.query('C == True').sample(int(0.5 * len(df) * 0.9)),
                 df.query('C == False').sample(int(0.5 * len(df) * 0.1))])
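The two answers can be folded into one small helper. This is only a sketch: the function name sample_with_ratio and its parameters are made up for illustration, and it assumes each group has at least as many rows as are requested from it (otherwise sample raises an error).

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [45, 67, 12, 78, 92, 65, 89, 12, 34, 78],
     "B": ["h", "b", "f", "d", "e", "t", "y", "p", "w", "q"],
     "C": [True, False, False, True, False, True, True, True, True, True]})
df = pd.concat([df] * 100, ignore_index=True)  # 1000 rows: 700 True, 300 False


def sample_with_ratio(frame, column, frac=0.5, true_share=0.9, seed=0):
    """Sample frac of the rows so that true_share of them have column == True."""
    n_total = int(len(frame) * frac)
    n_true = round(n_total * true_share)
    n_false = n_total - n_true
    true_part = frame[frame[column]].sample(n_true, random_state=seed)
    false_part = frame[~frame[column]].sample(n_false, random_state=seed)
    # Shuffle so the True and False rows are not grouped together
    mixed = pd.concat([true_part, false_part]).sample(frac=1.0, random_state=seed)
    return mixed.reset_index(drop=True)


picked = sample_with_ratio(df, "C")
print(picked["C"].value_counts())
```

Passing a seed makes the sample reproducible, which the original snippets do not do.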

pandas cumsum on lag-differenced dataframe

Say I have a pd.DataFrame() that I differenced with .diff(5), which works like "new number at idx i = (number at idx i) - (number at idx i-5)"
import pandas as pd
import random
example_df = pd.DataFrame(data=random.sample(range(1, 100), 20), columns=["number"])
df_diff = example_df.diff(5)
Now I want to undo this operation using the first 5 entries of example_df, and using df_diff.
If I had done .diff(1), I would simply use .cumsum(). But how can I make it sum up only every 5th value?
My desired output is a df with the following values:
example_df[0]
example_df[1]
example_df[2]
example_df[3]
example_df[4]
df_diff[5] + example_df[0]
df_diff[6] + example_df[1]
df_diff[7] + example_df[2]
df_diff[8] + example_df[3]
...
You could shift the original column by 5, add it to the differences, and fill the leading NaNs:
df_diff["shifted"] = example_df["number"].shift(5)
df_diff["undone"] = df_diff["number"] + df_diff["shifted"]
df_diff["undone"] = df_diff["undone"].fillna(example_df["number"])
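The shift-based snippet relies on every original value still being available. A sketch that uses only the first 5 original entries, as the question asks: seed the first 5 positions of the differenced column with the original values, then take a cumulative sum within each group of positions that share the same remainder modulo 5. The names lag, seeded, groups and restored are illustrative, and a default integer index is assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
example_df = pd.DataFrame({"number": rng.integers(1, 100, size=20)})

lag = 5
df_diff = example_df.diff(lag)

# Replace the leading NaNs with the first `lag` original values
seeded = df_diff["number"].copy()
seeded.iloc[:lag] = example_df["number"].iloc[:lag].to_numpy()

# Positions 0, 5, 10, ... form one chain; 1, 6, 11, ... the next; and so on
groups = np.arange(len(seeded)) % lag
restored = seeded.groupby(groups).cumsum()

print(restored.tolist())
print(example_df["number"].tolist())
```

With lag = 1 there is a single group, and this reduces to the plain .cumsum() mentioned in the question.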

For loop keeps returning empty arrays

I need some help with a for loop I have been trying to run. This is the code I have-
cal_points = []
cal_stars = np.genfromtxt('M67_Calibration_Star_List.csv', delimiter = ',', names = True)
radii = 0.00023
for star in range(len(cal_stars)):
    ra_l = cal_stars[star][1] - radii; ra_u = cal_stars[star][1] + radii
    dec_l = cal_stars[star][2] - radii; dec_u = cal_stars[star][2] + radii
    for i in range(len(M67_catalogue)):
        if ra_l <= M67_catalogue[i]['RA'] <= ra_u and dec_l <= M67_catalogue[i]['DEC'] <= dec_u:
            cal_points = cal_points + [star]
cal_points.sort()
print(len(cal_points))
print(cal_points)
This keeps returning len(cal_points) as 0 and cal_points as []
These are headers in the csv file with a few of the row entries
Please tell me where I'm going wrong
Since you are trying to match a (small) catalogue of calibration stars with a catalogue of stars in M67, within a given radius(*), you may as well use astropy. Astropy can do all the matching for you, and takes into account the effect of latitudinal distance "shrinking" on a sphere.
Here's some example code that creates two random DataFrames with calibration and catalogue positions, converts them to appropriate Astropy SkyCoords and matches the two sets of positions. It then uses the result to find the stars in the corresponding DataFrames, and concatenates the results into a single DataFrame, including the relevant other information from the catalogue, such as the magnitude.
import pandas as pd
import astropy.units as u
from astropy.coordinates import SkyCoord
from numpy.random import default_rng
rng = default_rng()
n = 30
cal_stars = pd.DataFrame({'RAJ2000': 132 + rng.random(n),
                          'DECJ2000': 11 + rng.random(n),
                          'VTmag': 11 + rng.random(n)})
n = 200
M67_catalogue = pd.DataFrame({'RA': 132 + rng.random(n),
                              'DEC': 11 + rng.random(n),
                              'VTmag': 11 + rng.random(n)})
# Create coordinate arrays, using the relevant columns
# from the DataFrame; .to_numpy() passes plain arrays so the
# unit ends up on an astropy Quantity, not on a pandas Series
cal_stars_sc = SkyCoord(cal_stars['RAJ2000'].to_numpy() * u.deg,
                        cal_stars['DECJ2000'].to_numpy() * u.deg)
M67_catalogue_sc = SkyCoord(M67_catalogue['RA'].to_numpy() * u.deg,
                            M67_catalogue['DEC'].to_numpy() * u.deg)
# Slightly larger radius in this example;
# 0.00023 is too precise for the random coordinates used here
sep = 0.023 * u.deg
# `idxm67` are the indices into the M67_catalogue_sc SkyCoord
# that have a counterpart within `sep` in `cal_stars_sc`.
# Similarly for `idxcal`
# Note that an index (and thus a coordinate) may appear multiple times:
# a single source may be within `sep` distance to several sources in the
# other catalogue
idxm67, idxcal, dist, _ = cal_stars_sc.search_around_sky(M67_catalogue_sc, sep)
# We need to use `.iloc`, since `SkyCoord` follows standard (NumPy) indexing
# Thus we need to ignore any index that the Pandas DataFrame may have
df1 = cal_stars.iloc[idxcal, :]
df2 = M67_catalogue.iloc[idxm67, :]
df2.columns = ['M67' + name for name in df2.columns]
# We also want to reset both DataFrame indices, because these were copied above when using iloc
# Resetting them will make sure df1 and df2 have the same indices
# and are compatible to be concatenated.
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# axis=1 means to concatenate along the columns.
df = pd.concat([df1, df2], axis=1)
# Add the found distances to the final DataFrame
df['dist'] = dist
print(df)
(*) I assume you want a radius, given the variable name, but the search in your code is within a rectangular region.
Here's the short version, without comments and without the creation of random data. It should be plug and play, provided M67_catalogue is actually a DataFrame (not a NumPy array). Note that the second half, the creation of a matched DataFrame, is a bonus. cal_stars.iloc[idxcal, :] after using search_around_sky is enough to get your result.
import pandas as pd
from astropy.coordinates import SkyCoord
import astropy.units as u
cal_stars = pd.read_csv('M67_Calibration_Star_List.csv')
radius = 0.00023
cal_stars_sc = SkyCoord(cal_stars['RAJ2000'].to_numpy() * u.deg,
                        cal_stars['DECJ2000'].to_numpy() * u.deg)
M67_catalogue_sc = SkyCoord(M67_catalogue['RA'].to_numpy() * u.deg,
                            M67_catalogue['DEC'].to_numpy() * u.deg)
idxm67, idxcal, dist, _ = cal_stars_sc.search_around_sky(M67_catalogue_sc, radius * u.deg)
df1 = cal_stars.iloc[idxcal, :]
df2 = M67_catalogue.iloc[idxm67, :]
df2.columns = ['M67' + name for name in df2.columns]
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df = pd.concat([df1, df2], axis=1)
df['dist'] = dist
print(df)
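For comparison, the rectangular box test from the question's double loop can also be written with NumPy broadcasting, without astropy. This is only a sketch of the box search, not an on-sky separation; the coordinate arrays are made-up values and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical calibration stars and catalogue positions, in degrees
cal_ra = np.array([132.1000, 132.5000, 132.9000])
cal_dec = np.array([11.2000, 11.5000, 11.8000])
cat_ra = np.array([132.1001, 132.5000, 132.7000])
cat_dec = np.array([11.2001, 11.6000, 11.8000])

half_width = 0.00023  # same value as `radii` in the question

# Compare every calibration star (rows) with every catalogue entry (columns)
ra_ok = np.abs(cal_ra[:, None] - cat_ra[None, :]) <= half_width
dec_ok = np.abs(cal_dec[:, None] - cat_dec[None, :]) <= half_width

# Indices of calibration stars with at least one catalogue entry in the box
matched = np.flatnonzero((ra_ok & dec_ok).any(axis=1))
print(matched)
```

Unlike the loop, each calibration star appears at most once in matched, even if several catalogue entries fall inside its box.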

How can I apply a function to each row in a pandas dataframe?

I am pretty new to coding so this may be simple, but none of the answers I've found so far have provided information in a way I can understand.
I'd like to take a column of data and apply a function (a x e^bx) where a > 0 and b < 0. The (x) in this case would be the float value in each row of my data.
See what I have so far, but I'm not sure where to go from here....
def plot_data():
    # read the file
    data = pd.read_excel(FILENAME)
    # convert to pandas dataframe
    df = pd.DataFrame(data, columns=['FP Signal'])
    # add a blank column to store the normalized data
    headers = ['FP Signal', 'Normalized']
    df = df.reindex(columns=headers)
    df.plot(subplots=True, layout=(1, 2))
    df['Normalized'] = df.apply(normalize(['FP Signal']), axis=1)
    print(df['Normalized'])
    # show the plot
    plt.show()
# normalization formula (exponential) = a x e ^bx where a > 0, b < 0
def normalize(x):
    x = A * E ** (B * x)
    return x
I can get this image to show, but not the 'normalized' data...
thanks for any help!
Your code is almost correct.
# normalization formula (exponential) = a x e ^bx where a > 0, b < 0
def normalize(x):
    x = A * E ** (B * x)
    return x
def plot_data():
    # read the file
    data = pd.read_excel(FILENAME)
    # convert to pandas dataframe
    df = pd.DataFrame(data, columns=['FP Signal'])
    # add a blank column to store the normalized data
    headers = ['FP Signal', 'Normalized']
    df = df.reindex(columns=headers)
    df['Normalized'] = df['FP Signal'].apply(lambda x: normalize(x))
    print(df['Normalized'])
    df.plot(subplots=True, layout=(1, 2))
    # show the plot
    plt.show()
I changed the apply call to df['FP Signal'].apply(lambda x: normalize(x)).
It operates only on the values in df['FP Signal'], because you don't need the entire row. The lambda receives each value as x and passes it to normalize.
You can also write df['FP Signal'].apply(normalize), which is more direct and simpler. Using the lambda is just my personal preference; many may disagree.
One small addition: call df.plot(subplots=True, layout=(1, 2)) after you change the DataFrame. If you plot before changing it, the plot won't show the change. df.plot does the actual plotting and plt.show only displays it, so df.plot must come after you have finished processing your data.
You can use pandas.Series.map to apply a function to each value of a column:
s = pd.Series(['cat', 'dog', 'rabbit'])
s.map(lambda x: x.upper())
0 CAT
1 DOG
2 RABBIT
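Because the formula is plain arithmetic, it can also be applied to the whole column at once with NumPy instead of calling a Python function per value. A minimal sketch; the values of A and B and the sample data are made up for illustration (the question only states a > 0 and b < 0).

```python
import numpy as np
import pandas as pd

A = 2.0   # hypothetical value, a > 0
B = -0.5  # hypothetical value, b < 0

# Hypothetical stand-in for the 'FP Signal' column read from Excel
df = pd.DataFrame({"FP Signal": [0.0, 1.0, 2.0]})

# a * e^(b * x), computed for every row in one vectorized expression
df["Normalized"] = A * np.exp(B * df["FP Signal"])
print(df)
```

On large columns this is typically faster than apply or map, since the loop runs inside NumPy.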

Lists/DataFrames - Running a function over all values in Python

I am stuck at the moment and don't really know how to solve this problem.
I want to apply this calculation to a list/dataframe:
The equation itself is not really the problem for me, I am able to easily solve it manually, but that wouldn't do with the amount of data I have.
v : value to be approximated
vi: known values (in my case Temperatures)
di: distance to the approximated point
So basically this is for calculating/approximating a new temperature value for a position a certain distance away from the corners of the square:
import pandas as pd
import numpy as np
import xarray as xr
import math
filepath = r'F:\Data\data.nc' # just the path to the file
obj= xr.open_dataset(filepath)
# This is where I get the coordinates for each of the corners of the square
# from the netcdf4 file
lat = 9.7398
lon = 51.2695
xlat = obj['XLAT'].values
xlon = obj['XLON'].values
p_1 = [xlat[0,0], xlon[0,0]]
p_2 = [xlat[0,1], xlon[0,1]]
p_3 = [xlat[1,0], xlon[1,0]]
p_4 = [xlat[1,1], xlon[1,1]]
p_rect = [p_1, p_2, p_3, p_4]
p_orig = [lat, lon]
#=================================================
# Calculates the distance between the points
# d = sqrt((x2-x1)^2 + (y2-y1)^2))
#=================================================
distance = []
for coord in p_rect:
    distance.append(math.sqrt(math.pow(coord[0]-p_orig[0],2)+math.pow(coord[1]-p_orig[1],2)))
# to get the values for they key['WS'] for example:
a = obj['WS'].values[:,0,0,0] # Array of floats for the first values
b = obj['WS'].values[:,0,0,1] # Array of floats for the second values
c = obj['WS'].values[:,0,1,0] # Array of floats for the third values
d = obj['WS'].values[:,0,1,1] # Array of floats for the fourth values
From then on, I have no idea how I should continue, should I do:
df = pd.DataFrame()
df['a'] = a
df['b'] = b
df['c'] = c
df['d'] = d
Then somehow work with DataFrames, and drop abcd after I got the needed values or should I do it with lists first, then add only the result to the dataframe. I am a bit lost.
The only thing I came up with so far is what it would look like if I did it manually:
For i from 0 to the end of the lists (a, b, c and d have the same length):
    1/a[i]^2*distance[0] + 1/b[i]^2*distance[1] + 1/c[i]^2*distance[2] + 1/d[i]^2*distance[3]
v = ------------------------------------------------------------------------------------------
                    1/a[i]^2 + 1/b[i]^2 + 1/c[i]^2 + 1/d[i]^2
This is the first time I had such a (at least for me) complex calculation on a list/dataframe. I hope you can help me solve this problem or at least nudge me in the right direction.
PS: here is the link to the file:
LINK TO FILE
Simply vectorize your calculations. With DataFrames you can run arithmetic operations directly on whole columns as if they were scalars to generate another column, df['v']. The code below assumes distance is a list of four scalars. Remember that in Python ^ does not mean power; use ** instead.
df = pd.DataFrame({'a':a, 'b':b, 'c':c, 'd':d})
df['v'] = (1/df['a']**2 * distance[0] +
1/df['b']**2 * distance[1] +
1/df['c']**2 * distance[2] +
1/df['d']**2 * distance[3]) / (1/df['a']**2 +
1/df['b']**2 +
1/df['c']**2 +
1/df['d']**2)
Or the functional form using Pandas Series binary operators. Below follows the order of operations (Parentheses --> Exponential --> Multiplication/Division --> Addition/Subtraction):
df['v'] = (df['a'].pow(2).pow(-1).mul(distance[0]) +
df['b'].pow(2).pow(-1).mul(distance[1]) +
df['c'].pow(2).pow(-1).mul(distance[2]) +
df['d'].pow(2).pow(-1).mul(distance[3])) / (df['a'].pow(2).pow(-1) +
df['b'].pow(2).pow(-1) +
df['c'].pow(2).pow(-1) +
df['d'].pow(2).pow(-1))
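Both snippets above follow the manual expression from the question, which divides by the squared values. The question's equation is not reproduced here, so the following is an assumption: if the intended formula is the usual inverse distance weighting, built from the definitions given (v approximated, vi known values, di distances), the weights are 1/di^2 and they multiply the values. A minimal NumPy sketch with made-up numbers; values, distance and weights are illustrative names.

```python
import numpy as np

# Hypothetical data: one row per time step, one column per corner (a, b, c, d)
values = np.array([[10.0, 12.0, 14.0, 16.0],
                   [5.0, 5.0, 5.0, 5.0]])

# Hypothetical distances from the target point to the four corners
distance = np.array([1.0, 2.0, 2.0, 4.0])

# Assumed inverse distance weighting: v = sum(vi / di^2) / sum(1 / di^2)
weights = 1.0 / distance**2
v = (values * weights).sum(axis=1) / weights.sum()
print(v)
```

Closer corners get larger weights, and a row where all four corners share the same value returns that value unchanged.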
