Subset fails on np.meshgrid-generated dataframe [duplicate] - python

This question already has answers here:
Working with floating point NumPy arrays for comparison and related operations
(1 answer)
What is the best way to compare floats for almost-equality in Python?
(18 answers)
Pandas Dataframe Comparison and Floating Point Precision
(1 answer)
Closed 19 days ago.
I generate a dataframe of lon/lat values like this:
import numpy as np
import pandas as pd

a = np.arange(89.7664, 89.7789, 1e-4)
b = np.arange(20.6897, 20.7050, 1e-4)
temp_arr = np.array(np.meshgrid(a, b)).T.reshape(-1, 2)
np_df = pd.DataFrame(temp_arr, columns=['lon', 'lat'])
and it creates the dataframe I want.
When I try to subset the first lon value,
len(np_df[np_df['lon'] == 89.7664])
it returns 153. But when I try to subset one of the last lon values,
len(np_df[np_df['lon'] == 89.7788])
it returns 0.
I wonder what is wrong here. Thank you.

Use numpy.isclose to compare floats within a tolerance:
len(np_df[np.isclose(np_df['lon'], 89.7788)])
If that still doesn't work, multiplying by 10000 and casting to integers should help:
len(np_df[np_df['lon'].mul(10000).astype(int).eq(897788)])
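The root cause is that np.arange builds the array by repeatedly adding the step, so floating-point error accumulates and the last element is only approximately 89.7788. A quick diagnostic sketch (not part of the original answer):
import numpy as np

a = np.arange(89.7664, 89.7789, 1e-4)
print(a[-1])                       # something like 89.7787999999..., not exactly 89.7788
print(a[-1] == 89.7788)            # False, which is why the exact-equality subset is empty
print(np.isclose(a[-1], 89.7788))  # True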


Need to plot Pairplot for a dataframe that has duplicate indices [duplicate]

This question already has answers here:
dataframe to long format
(2 answers)
Reshape wide to long in pandas
(2 answers)
Closed 9 months ago.
I have a dataframe df of shape (310, 7) and need to plot a pairplot for it. But I'm getting the error "ValueError: cannot reindex from a duplicate axis" when I do it the regular way:
sns.pairplot(df,hue='Class')
ValueError: cannot reindex from a duplicate axis
The data is of this form:
P_incidence P_tilt L_angle S_slope P_radius S_Degree Class
0 38.505273 16.964297 35.112814 21.540976 127.632875 7.986683 Normal
1 54.920858 18.968430 51.601455 35.952428 125.846646 2.001642 Normal
2 44.362490 8.945435 46.902096 35.417055 129.220682 4.994195 Normal
3 48.318931 17.452121 48.000000 30.866809 128.980308 -0.910941 Normal
4 45.701789 10.659859 42.577846 35.041929 130.178314 -3.388910 Normal
I tried removing the duplicates using:
df.loc[df['L_angle'].duplicated(), 'L_angle'] = ''
But this method converts the column to object dtype, and I'm not able to undo that.
The expected output plot is as follows:
[expected]
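Since the error complains about a duplicate axis (the index labels) rather than duplicate values, resetting the index before plotting is usually enough. A minimal sketch, assuming the df and Class column from the question:
import seaborn as sns

sns.pairplot(df.reset_index(drop=True), hue='Class')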

Join 2 conditions using & operator [duplicate]

This question already has answers here:
Pandas: Filtering multiple conditions
(4 answers)
Closed 10 months ago.
I have two queries in pandas and need to join them together:
b.loc[b['Speed']=='100.0']
b.loc[b['Month']=='2022-01']
I need to join them using & but I'm getting an "unsupported operand type" error.
You are comparing columns against str values while their dtypes are float64 and period[M] respectively, as you mentioned in your comment.
Match each comparison to the correct data type. Try this:
b.loc[(b['Speed'] == 100.0) & (b['Month'] == pd.Period('2022-01'))]
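Note the parentheses around each condition: & binds more tightly than ==, so without them Python would evaluate 100.0 & b['Month'] first, which is one way to get an unsupported-operand error. A self-contained sketch with a hypothetical frame mirroring the question's columns:
import pandas as pd

# hypothetical data standing in for the question's frame
b = pd.DataFrame({
    'Speed': [100.0, 95.5, 100.0],
    'Month': pd.PeriodIndex(['2022-01', '2022-01', '2022-02'], freq='M'),
})

# parentheses are required around each condition
print(b.loc[(b['Speed'] == 100.0) & (b['Month'] == pd.Period('2022-01'))])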

Subtle differences in data calculations from array vs list [duplicate]

This question already has answers here:
Different std in pandas vs numpy
(2 answers)
Why is pandas.Series.std() different from numpy.std()?
(1 answer)
Different result for std between pandas and numpy
(1 answer)
Difference between numpy var() and pandas var()
(1 answer)
Closed 2 years ago.
As you can see in the code below, I calculate the variance of the 'open' column in two different ways, the only difference being that one version grabs the underlying values array rather than the column itself. Why would this lead to different variance calculations?
apple_prices = pd.read_csv('apple_prices.csv')
print(apple_prices['open'].values.var())
#prints 102.22564310059172
print(apple_prices['open'].var())
#prints 103.82291877403847
The reason for the difference is that pandas.Series.var has a default ddof (delta degrees of freedom) of 1, while numpy.ndarray.var has a default ddof of 0. Setting it manually produces the same result:
import pandas as pd
import numpy as np
np.random.seed(0)
x = pd.Series(np.random.rand(100))
print(x.var(ddof=1))
# 0.08395738934787107
print(x.values.var(ddof=1))
# 0.08395738934787107
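Concretely, variance with a given ddof divides the sum of squared deviations by N - ddof, so ddof=1 is the Bessel-corrected sample variance. A small hand check:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size
ss = ((x - x.mean()) ** 2).sum()   # sum of squared deviations

print(ss / n, np.var(x, ddof=0))        # 1.25 1.25 (population variance)
print(ss / (n - 1), np.var(x, ddof=1))  # 1.666... 1.666... (sample variance)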
See the documentation at:
pandas.Series.var
numpy.var

Getting the top N values and their coordinates from a 2D Numpy array [duplicate]

This question already has answers here:
Efficient way to take the minimum/maximum n values and indices from a matrix using NumPy
(3 answers)
Closed 2 years ago.
I have a 2D numpy array "bigrams" of shape (851, 851) with float values. I want to get the top ten values from this array along with their coordinates.
I know that np.amax(bigrams) can return the single highest value, so that's basically what I want, but for the top ten.
As a numpy noob, I wrote some code that loops to get the top value per row and then uses np.where() to get the coordinates, but I feel there must be a smarter way to solve this.
You can flatten the array and use argsort:
# flat indices of the 10 largest values, in ascending order
idxs = np.argsort(bigrams.ravel())[-10:]
# convert flat indices back to row/column coordinates
rows, cols = idxs // 851, idxs % 851
print(bigrams[rows, cols])
An alternative is a partial sort with argpartition, which avoids fully sorting the whole array:
# flat indices of the 10 largest values, in no particular order
partition = np.argpartition(bigrams.ravel(), -10)[-10:]
max_ten = bigrams[partition // 851, partition % 851]
You will get the top ten values and their coordinates, but they won't be sorted. You can sort this smaller array of ten values later if you want.
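A combined sketch using np.unravel_index, so the shape isn't hard-coded (the random bigrams below is a stand-in for the real array):
import numpy as np

rng = np.random.default_rng(0)
bigrams = rng.random((851, 851))  # stand-in for the real data

# partial sort for the 10 largest values, then order them descending
flat = np.argpartition(bigrams.ravel(), -10)[-10:]
flat = flat[np.argsort(bigrams.ravel()[flat])][::-1]

# map flat indices back to 2D coordinates for any shape
rows, cols = np.unravel_index(flat, bigrams.shape)
print(list(zip(rows, cols, bigrams[rows, cols])))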

How can I convert back? [duplicate]

This question already has answers here:
Pandas reverse of diff()
(6 answers)
Closed 4 years ago.
I converted my time series into a stationary time series with differencing:
data['consumption_diff'] = data.consumption - data.consumption.shift(1)
How can I convert consumption_diff back into consumption?
You can use numpy's r_ object, which concatenates and flattens arrays, together with cumsum(), which cumulatively sums values:
import numpy as np
undiffed = np.r_[data.consumption.iloc[0], data.consumption_diff.iloc[1:]].cumsum()
That is how you can undo differencing on time-series data, which is helpful if you've made predictions for future dates that you need to un-difference. However, you already have the undiffed values: data.consumption is your original, undifferenced data.
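A quick round-trip check on a toy series (values made up for illustration):
import numpy as np
import pandas as pd

data = pd.DataFrame({'consumption': [10.0, 12.0, 11.0, 15.0]})
data['consumption_diff'] = data.consumption - data.consumption.shift(1)

# first original value, then the diffs, cumulatively summed
undiffed = np.r_[data.consumption.iloc[0], data.consumption_diff.iloc[1:]].cumsum()
print(np.allclose(undiffed, data.consumption))  # True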
