I am trying to apply the argrelextrema function to a dataframe df, but I am unable to apply it correctly. Below is my code:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

np.random.seed(42)

def maxloc(data):
    loc_opt_ind = argrelextrema(df.values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data

values = np.random.rand(23000)
df = pd.DataFrame({'value': values})
np.all(maxloc(df).loc_max)
It gives me an error at the line loc_max[loc_opt_ind] = 1:
IndexError: too many indices for array
A Pandas DataFrame is two-dimensional, so df.values is two-dimensional even when it has only one column. As a result, loc_opt_ind contains both row and column indices (a tuple of two arrays; just print loc_opt_ind to see), which can't be used to index the one-dimensional loc_max. You probably want to use either df['value'].values (a Series's .values, which is one-dimensional) or np.squeeze(df.values) as input. Note that argrelextrema still returns a tuple in that case, just a one-element one, so you may need loc_opt_ind[0] (np.where has similar behaviour).
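For reference, a minimal sketch of the corrected function along those lines, assuming the column is still named 'value' and that the function should use its data argument rather than the global df:

import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

def maxloc(data):
    # operate on the 1-D column values, not the 2-D data.values
    loc_opt_ind = argrelextrema(data['value'].values, np.greater)
    loc_max = np.zeros(len(data))
    loc_max[loc_opt_ind] = 1
    data['loc_max'] = loc_max
    return data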
I need to return the function abspec2(v) for each row of the array, and sum the returned functions over the rows (not sum each value of the array). Apologies if my code is unclear; I'm quite new to this. Please comment if you require clarification.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\adamf\OneDrive\Desktop\hitran_water.csv')
print(df)
arr = df.to_numpy()
Here I am setting my variables, with i as the row index. I need to be able to calculate and return the function below, abspec2(v), for each individual row.
nu0 = arr[i,1]
s = arr[i,2]
gamma = arr[i,3]

def abspec2(v):
    a = s/np.pi
    b = gamma/(gamma**2 + (v - nu0)**2)
    spec = a*b
    return spec

plt.plot(V, abspec2(V))
plt.show()
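A minimal sketch of one way to get the summed spectrum over all rows, assuming nu0, s, and gamma live in columns 1, 2, and 3 of arr (as in the snippet above) and that V is the array of points you want to evaluate at:

import numpy as np

def abspec_row(v, nu0, s, gamma):
    # Lorentzian contribution of a single row of parameters
    return (s/np.pi) * gamma/(gamma**2 + (v - nu0)**2)

def total_spectrum(v, arr):
    # evaluate the function for every row and sum the returned curves
    total = np.zeros_like(v, dtype=float)
    for row in arr:
        total += abspec_row(v, nu0=row[1], s=row[2], gamma=row[3])
    return total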
First of all, I'm a beginner and I'm having an issue with functions and returning values. Afterwards, I need to do some matrix operations to take the minimum value of the right column. However, since I cannot return these values (I could not figure out why), I'm not able to do any operations on them. The problem is that every time I try to use return, it gives me only the first or the last row of the matrix. If you can help, I'd really appreciate it. Thanks.
import numpy as np
import pandas as pd

df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]

def minreg():
    for k in range(2,16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k, x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = (np.mean(np.abs(x - x_pred)/np.abs(x))*100)
        m = np.array([k, mape_value])
        return m

print(minreg())
The return m statement terminates the function and returns m, so the function exits during the first iteration of the loop. Firstly, you need to call return after your loop ends. Secondly, you need to append each m value generated in the loop to a list, so that all of them are stored and returned.
import numpy as np
import pandas as pd

df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]

def minreg():
    m_arr = []
    for k in range(2,16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k, x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = (np.mean(np.abs(x - x_pred)/np.abs(x))*100)
        m_arr.append(np.array([k, mape_value]))
    return m_arr

print(minreg())
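As a follow-up to the stated goal of taking the minimum value of the right column (the MAPE), a small sketch using the returned list:

result = np.array(minreg())              # rows of [k, mape_value]
best_k, best_mape = result[np.argmin(result[:, 1])]
print(best_k, best_mape)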
I have numpy arrays which are each around 2000 elements long, but not every element has a value; some are blank. As you can see at the end of the code, I've stacked them into one array called match. How would I remove a row from match if it is missing an element? For example, if a particular ID is missing the magnitude, the entire row should be removed. I'm only interested in keeping the rows that have data for all of the elements.
from astropy.table import Table
import numpy as np

data = '/home/myname/datable.fits'
data = Table.read(data, format="fits")

ID = np.array(data['ID']).astype(str)
redshift = np.array(data['z']).astype(float)
radius = np.array(data['r']).astype(float)
mag = np.array(data['MAG']).astype(float)

match = np.stack((ID, redshift, radius, mag), axis=1)
Here you can use numpy.isnan, which returns True for missing values and False for existing values. Note that numpy.isnan can only be applied to NumPy arrays of a native numeric dtype (such as np.float64).
Your requirement can be achieved as follows (considering data is your numpy array):
import numpy as np
data = np.array(some_array) # set data as your numpy array
key_col = np.array(data[:,0], dtype=np.float64) # If you want to filter based on column 0
filtered_data = data[~np.isnan(key_col)] # ~ is the logical not here
For better flexibility, consider using pandas. Hope this helps!
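Applied to the stacked match array from the question, a minimal sketch along the same lines (assuming redshift, radius, and mag sit in columns 1 to 3, and that the blanks survive stacking as values that convert to NaN):

import numpy as np

numeric = match[:, 1:4].astype(np.float64)   # convert the numeric columns back to floats
keep = ~np.isnan(numeric).any(axis=1)        # keep only rows where every value is present
match_clean = match[keep]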
I have a pandas dataframe from which I wish to construct some matrices using numpy arrays. These matrices will be constructed based on variables in the dataframe, and I would like to create these via a loop over a list of the dataframe variables. I would also like the numpy arrays to be named based on the variable, so that I can easily reference them.
Below is code to try to illustrate my problem. I create a dataframe with two categorical variables and an identifier. I then create a list 'vars' with the variable names I'd like to loop over. I show that my code runs outside the loop (although the object created is pandas not numpy). The commented piece at the end does not work, but shows my attempt at including the variable string in the loop.
import pandas as pd
import numpy as np
import random

mult_cat = []  # multiple categories
bin_cat = []   # binary categories
id = []

for i in range(0,10):
    x = random.randint(0,4)
    y = random.randint(0,1)
    z = i+1
    mult_cat.append(x)
    bin_cat.append(y)
    id.append(z)

data_2 = {'ID': id,
          'mult_cat': mult_cat,
          'bin_cat': bin_cat}
df = pd.DataFrame(data_2, columns=['ID', 'mult_cat', 'bin_cat'])

vars = ['mult_cat', 'bin_cat']

twice_mult_cat = 2*df.mult_cat
print(mult_cat)
print(twice_mult_cat)

"""
for var in vars:
    twice_var = 2*df.var
    print(twice_var)
"""
I believe there are at least two issues here.
1) I am simply multiplying the pandas array, so the resulting object is not a numpy array.
2) The issue of naming, which is, I think, the more important issue here.
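A minimal sketch of one way to handle both points, assuming the goal is a numpy array per variable that can be looked up by name: store the arrays in a dictionary keyed by the variable string, and access the column with df[var].to_numpy() instead of df.var attribute access.

arrays = {}
for var in vars:
    # df[var] accepts a string column name; .to_numpy() returns a numpy array
    arrays[var] = 2 * df[var].to_numpy()

print(arrays['mult_cat'])
print(arrays['bin_cat'])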
I can compare two Pandas series for exact equality using pandas.Series.equals. Is there a corresponding function or parameter that will check if the elements are equal to some ε of precision?
You can use numpy.allclose:
numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Returns True if two arrays are element-wise equal within a tolerance.
The tolerance values are positive, typically very small numbers. The
relative difference (rtol * abs(b)) and the absolute difference atol
are added together to compare against the absolute difference between
a and b.
numpy works well with pandas.Series objects, so if you have two of them - s1 and s2, you can simply do:
np.allclose(s1, s2, atol=...)
Where atol is your tolerance value.
NumPy works well with pandas Series. However, one has to be careful with the order of the indices (or of the columns and indices for a pandas DataFrame).
For example
series_1 = pd.Series(data=[0,1], index=['a','b'])
series_2 = pd.Series(data=[1,0], index=['b','a'])
np.allclose(series_1,series_2)
will return False
A workaround is to use the index of one pandas series
np.allclose(series_1, series_2.loc[series_1.index])
If you want to avoid numpy, there is another way: use assert_series_equal.
import pandas as pd
s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])
from pandas.testing import assert_series_equal
assert_series_equal(s1,s2)
raises an AssertionError. So use the check_less_precise flag:
assert_series_equal(s1, s2, check_less_precise=True)  # No assertion error
This doesn't raise an AssertionError, because check_less_precise by default only compares 3 digits after the decimal point.
See the docs here.
It's not good to rely on asserts, but if you want to avoid numpy, this is a way.
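Note that check_less_precise has since been deprecated in newer pandas releases in favour of explicit tolerance arguments; a minimal sketch of the equivalent call, assuming pandas >= 1.1:

import pandas as pd
from pandas.testing import assert_series_equal

s1 = pd.Series([1.333333, 1.666666])
s2 = pd.Series([1.333, 1.666])

# pass an absolute tolerance instead of the deprecated check_less_precise flag
assert_series_equal(s1, s2, check_exact=False, atol=1e-3)  # No assertion error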
Note: I'm posting this mostly because I came to this thread via a Google search of something similar and it seemed too long for a comment. It is not necessarily the best solution, nor strictly "ε of precision"-based, but it is an alternative using scaling and rounding if you want to compare vectors (i.e. rows) rather than scalars for a DataFrame (rather than a Series) without looping through explicitly:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# X and X2 are the two DataFrames being compared
Xcomb = pd.concat((X, X2), axis=0, ignore_index=True)

# scale
scaler = MinMaxScaler()
scaler.fit(Xcomb)
Xscl = scaler.transform(Xcomb)

# round
df_scl = pd.DataFrame(np.round(Xscl, decimals=8), columns=X.columns)

# post-processing: rows shared by both frames collapse into duplicates
n_uniq = df_scl.drop_duplicates().shape[0]
n_dup = X.shape[0] + X2.shape[0] - n_uniq
print(f"Number of shared rows: {n_dup}")