Subtle differences in data calculations from array vs list [duplicate]

As you can see in the code below, I calculate the variance of the 'open' column in two different ways. The only difference is that in the second version I grab the underlying values (a NumPy array) rather than the pandas column itself. Why does this lead to different variance calculations?
import pandas as pd

apple_prices = pd.read_csv('apple_prices.csv')
print(apple_prices['open'].values.var())
#prints 102.22564310059172
print(apple_prices['open'].var())
#prints 103.82291877403847

The difference arises because pandas.Series.var has a default ddof (delta degrees of freedom) of 1, dividing by N - 1 (the unbiased sample estimator), while numpy.ndarray.var has a default ddof of 0, dividing by N. Setting ddof explicitly produces the same result:
import pandas as pd
import numpy as np
np.random.seed(0)
x = pd.Series(np.random.rand(100))
print(x.var(ddof=1))
# 0.08395738934787107
print(x.values.var(ddof=1))
# 0.08395738934787107
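The comparison also works the other way around; a minimal sketch (reusing the same random series) that sets ddof=0 on the pandas side to match NumPy's default:
import numpy as np
import pandas as pd

np.random.seed(0)
x = pd.Series(np.random.rand(100))

# With ddof=0 both divide by N (population variance), so they agree
print(np.isclose(x.var(ddof=0), x.values.var()))  # True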
See the documentation at:
pandas.Series.var
numpy.var

Subset fails on np.meshgrid-generated dataframe [duplicate]

I generate a dataframe of lon/lat values like this:
import numpy as np
import pandas as pd

a = np.arange(89.7664, 89.7789, 1e-4)
b = np.arange(20.6897, 20.7050, 1e-4)
temp_arr = np.array(np.meshgrid(a, b)).T.reshape(-1, 2)
np_df = pd.DataFrame(temp_arr, columns=['lon', 'lat'])
and it creates the dataframe I want.
When I try to subset on the first lon value,
len(np_df[np_df['lon'] == 89.7664])
it returns 153. But when I try to subset on one of the last lon values,
len(np_df[np_df['lon'] == 89.7788])
it returns 0.
I wonder what is wrong here. Thank you
The equality fails because np.arange builds its values through repeated floating-point arithmetic, so the later grid points are only approximately equal to the decimal literals you expect. Use numpy.isclose to compare floats within a tolerance; since the grid spacing is 1e-4, pass an explicit atol so only one grid point can match (the default rtol of 1e-5 is about 9e-4 at this magnitude, wider than the spacing):
len(np_df[np.isclose(np_df['lon'], 89.7788, rtol=0, atol=5e-5)])
If that still does not work, scaling by 10000, rounding, and casting to integers should help (rounding first avoids truncating a value like 897787.9999... down to 897787):
len(np_df[np_df['lon'].mul(10000).round().astype(int).eq(897788)])
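To see the underlying issue, a small sketch (reusing the arange call from the question) shows that no generated value is exactly equal to the literal 89.7788, while a tolerant comparison finds it:
import numpy as np

a = np.arange(89.7664, 89.7789, 1e-4)

# Repeated float arithmetic means the nominal grid point 89.7788 is never hit exactly
print((a == 89.7788).sum())                             # 0
print(np.isclose(a, 89.7788, rtol=0, atol=5e-5).sum())  # 1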

Find the maximum number of consecutive times a value in an array is less than a threshold [duplicate]

I have the following numpy array
array([0.66594665, 0.33003433, nan, 0.42567293, 0.48161913, 0.30000838, 0.13639367, 0.84300475, 0.19029748, nan])
I would like to find the number of consecutive times the values in the array are less than 0.5. Is there a way to do this without using a for loop? In this example, the answer is 4 for the following sub-sequence: 0.42567293, 0.48161913, 0.30000838, 0.13639367
The sketch below uses a boolean mask and run-length detection via np.diff to find the longest run of values below 0.5 without a Python for loop:
import numpy as np

arr = np.array([0.66594665, 0.33003433, np.nan, 0.42567293, 0.48161913,
                0.30000838, 0.13639367, 0.84300475, 0.19029748, np.nan])

# Boolean mask: True where the value is below the threshold (NaN compares False)
below = arr < 0.5

# Pad with False on both sides so every run has a detectable start and end
padded = np.concatenate(([False], below, [False]))
edges = np.flatnonzero(np.diff(padded.astype(int)))

# Even positions in edges are run starts, odd positions are run ends
run_lengths = edges[1::2] - edges[::2]
print(run_lengths.max())  # 4
Note that simply taking len(np.where(arr < 0.5)[0]) gives 6, the total number of values below 0.5, not the length of the longest consecutive run that the question asks for.

Get nth and mth elements of a numpy array [duplicate]

A very basic question, but I cannot find a similar question here or by googling.
tmp = np.array([1,2,3,4,5])
I can extract 2 with tmp[1], and 2 to 4 with tmp[1:4].
Suppose I want to extract 2 AND 4. What is the easiest way to do that?
You can use .take() with a list of indices:
import numpy as np
tmp = np.array([1, 2, 3, 4, 5]).take([1, 3])
# array([2, 4])
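For completeness, NumPy's integer-array ("fancy") indexing does the same thing directly:
import numpy as np

tmp = np.array([1, 2, 3, 4, 5])
print(tmp[[1, 3]])  # [2 4]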

How to change the values in the remainder of a row in a numpy array once a certain condition is met? [duplicate]

I have a 2d numpy array of the form:
array = [[0,0,0,1,0], [0,1,0,0,0], [1,0,0,0,0]]
I'd like to go to each of the rows, iterate over the entries until the value 1 is found, then replace every subsequent value in that row to a 1. The output would then look like:
array = [[0,0,0,1,1], [0,1,1,1,1], [1,1,1,1,1]]
My actual data set is very large, so I was wondering if there is a specialized numpy function that does something like this, or if there's an obvious way to do it that I'm missing.
Thanks!
You can use numpy.apply_along_axis with a helper function:
import numpy as np

array = np.array([[0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0]])

def myfunc(l):
    # Walk to the first 1, then fill the rest of the row with 1s
    # (the bounds check leaves rows without any 1 unchanged)
    i = 0
    while i < len(l) and l[i] != 1:
        i += 1
    return [0] * i + [1] * (len(l) - i)

print(np.apply_along_axis(myfunc, 1, array))
# [[0 0 0 1 1]
#  [0 1 1 1 1]
#  [1 1 1 1 1]]
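Since the actual dataset is very large, a vectorized alternative worth knowing (a sketch using np.maximum.accumulate, which takes a running maximum along each row) avoids the per-row Python function entirely:
import numpy as np

array = np.array([[0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0]])

# The running maximum turns every entry after the first 1 into a 1
print(np.maximum.accumulate(array, axis=1))
# [[0 0 0 1 1]
#  [0 1 1 1 1]
#  [1 1 1 1 1]]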

How can I convert back? [duplicate]

I converted my time series into a stationary time series by differencing:
data['consumption_diff'] = data.consumption-data.consumption.shift(1)
How can I convert consumption_diff back into consumption?
You can use numpy's r_ object, which concatenates and flattens arrays, together with cumsum(), which computes a cumulative sum:
import numpy as np

undiffed = np.r_[data.consumption.iloc[0], data.consumption_diff.iloc[1:]].cumsum()
This seeds the series with the first original value (the first diff is NaN from the shift, so it is skipped) and then accumulates the differences. That is how you can undiff time series data, which is helpful if you have predicted future differences that need to be undiffed. Note, however, that in this case you already have the undiffed values: data.consumption is your original undiffed data.
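A quick round-trip check on toy data (the consumption numbers below are made up for illustration) confirms the reconstruction:
import numpy as np
import pandas as pd

data = pd.DataFrame({'consumption': [10.0, 12.0, 11.0, 15.0]})
data['consumption_diff'] = data.consumption - data.consumption.shift(1)

# Seed with the first original value, then cumulatively sum the diffs
undiffed = np.r_[data.consumption.iloc[0], data.consumption_diff.iloc[1:]].cumsum()
print(np.allclose(undiffed, data.consumption))  # True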
