pandas: smallest X for a defined probability - python

The data is financial data, with OHLC values in columns, e.g.
Open High Low Close
Date
2013-10-20 1.36825 1.38315 1.36502 1.38029
2013-10-27 1.38072 1.38167 1.34793 1.34858
2013-11-03 1.34874 1.35466 1.32941 1.33664
2013-11-10 1.33549 1.35045 1.33439 1.34950
....
I am looking for the answer to the following question:
What is the smallest number X for which (at least) N% of the numbers in a large data set are equal to or bigger than X?
And for our data with N = 60, using the High column, the question would be: what is the smallest number X for which (at least) 60% of the High column values are equal to or bigger than X?
I know how to calculate the standard deviation, mean and the rest with pandas, but my understanding of statistics is rather poor, which keeps me from proceeding further. Please also point me to theoretical papers/tutorials if you know of any.
Thank you.

For the sake of completeness, even though the question was essentially resolved in the comment by @haki above, suppose your data is in the DataFrame data. If you were looking for the high price for which 25% of observed high prices are lower, you would use
data['High'].quantile(q=0.25)
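For the original question (N = 60 on the High column), note that if roughly 40% of the values lie below X then roughly 60% are at or above it, so the 40th percentile is the quantity to ask for. A minimal sketch, using the default linear interpolation; pass interpolation='lower' if X must be an actual observed value:
x = data['High'].quantile(q=0.40)  # ~60% of High values are >= x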

How to calculate relative frequency of an event from a dataframe?

I have a dataframe with temperature data for a certain period. With this data, I want to calculate the relative frequency of the month of August being warmer than 20° as well as of January being colder than 2°. I have already managed to extract these two columns into a separate dataframe, to get the count of each temperature value, and to use the normalize option to get the frequency of each value in percent (see code).
df_temp1[df_temp1.aug >=20]
df_temp1[df_temp1.jan <= 2]
df_temp1['aug'].value_counts()
df_temp1['jan'].value_counts()
df_temp1['aug'].value_counts(normalize=True)*100
df_temp1['jan'].value_counts(normalize=True)*100
What I haven't managed is to calculate the relative frequency for aug >= 20 and jan <= 2 individually, as well as for aug >= 20 AND jan <= 2, and for aug >= 20 OR jan <= 2.
Maybe someone could help me with this problem. Thanks.
I would try something like this:
proportion_of_augusts_above_20 = (df_temp1['aug'] >= 20).mean()
proportion_of_januaries_below_2 = (df_temp1['jan'] <= 2).mean()
This calculates it in two steps. First, df_temp1['aug'] >= 20 creates a boolean Series, with True for values of 20 or above and False otherwise.
Then, mean() treats True and False as 1 and 0, so the average is the proportion of months that fulfil the criterion, i.e. the percentage divided by 100.
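For the combined conditions the question asks about, the same trick works on boolean masks joined with & (AND) and | (OR) -- a sketch, assuming the same df_temp1 columns:
aug_hot = df_temp1['aug'] >= 20
jan_cold = df_temp1['jan'] <= 2
(aug_hot & jan_cold).mean() * 100  # percent of rows where both hold
(aug_hot | jan_cold).mean() * 100  # percent of rows where at least one holds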
As an aside, I would recommend posting your data in a question, which allows people answering to check whether their solution works.

Python Pandas DataFrame Describe Gives Wrong Results?

While I was looking at the results of the describe() method, I noticed something very strange. The data is the House Prices data from Kaggle. Below you can see the code and the result for the "Condition2" feature:
train.groupby(train["Condition2"].fillna('None'))["SalePrice"].describe()
On the other hand, when I look at the data in Excel, the quantiles do not match.
So, while 33% of the data points have a SalePrice of 85K, how can the 25% quantile be 95.5K? It is really weird, or maybe I'm missing something. Could anybody explain this?
Quartiles divide the data into four equal groups, so with a small sample like this (n = 6) the value of the 25% quantile isn't necessarily going to be one of your observed values. There are different methods of calculating quantile values. describe() uses the linear method, described in the docs as
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
If you switch to the lower method it produces different results.
>>> feedr = train_df[train_df.Condition2 == 'Feedr']
>>> feedr.SalePrice.quantile(.50)
127500.0
>>> feedr.SalePrice.quantile(.50, interpolation='lower')
127000
>>> feedr.SalePrice.quantile(.25, interpolation='lower')
85000
>>> feedr.SalePrice.quantile(.25)
95500.0

groupby and pandas in python

I have a file (score.csv) like this:
I need to solve the problem in Python using pandas and groupby.
I have one CSV file whose data I need to group and analyse on the basis of the following points:
1. Group the data by series_id.
2. Find the highest score and the percentage.
3. Find the user_ids whose scores are closest to 50% (the middle cohort), in comparison to point 2, for the first test_id.
4. Find the scores of these users for the rest of the test_ids.
5. Normalise those scores by the topper's score.
The idea is to find the performance of students in each test.
Structure:
A test series (series_id) has multiple tests (test_id), each mapped to users (user_id) and their scores.
For each series_id, find the first test (the lowest test_id for that series) and the users with scores between 40 and 60 in that first test only.
(The analysis will then be done on the users found in the previous step for the other tests. That is, I have found the users who score around 50 marks and now want to track their journey through the other tests.)
Pick those user_ids and find their scores for the other tests as well. Along with this, find the highest score in each test in order to compute the ratio marks_obtained / highest marks in that test. Basically, we want to normalise the scores with respect to the highest scorer to understand the journey of these users.
You can chain groupby with aggregation methods on the column of interest:
df.groupby('series_id')['marks_gained'].max()
df.groupby('series_id')['marks_gained'].mean()
You can then find the median and define your 50% cohort by equal distance to that.
df.groupby('series_id')['marks_gained'].median()
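If you want a fuller sketch of the workflow described in the question (assuming the file score.csv and columns series_id, test_id, user_id and marks_gained; adjust the names to your data), something along these lines should work:
import pandas as pd

df = pd.read_csv('score.csv')

# first (lowest) test_id within each series
first_test = df.groupby('series_id')['test_id'].transform('min')

# users who scored between 40 and 60 in that first test
cohort = (df[(df['test_id'] == first_test) & df['marks_gained'].between(40, 60)]
          [['series_id', 'user_id']].drop_duplicates())

# normalise every score by the top score in its test, then keep only the cohort
df['normalised'] = df['marks_gained'] / df.groupby(['series_id', 'test_id'])['marks_gained'].transform('max')
journey = df.merge(cohort, on=['series_id', 'user_id'])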
It's usually better to provide data in a format that is easily reproducible, and to be completely explicit about each of your requests (for example, one has to guess what you mean by percentage: the percentile of marks?).

Generalized Data Quality Checks on Datasets

I am pulling in a handful of different datasets daily, performing a few simple data quality checks, and then shooting off emails if a dataset fails the checks.
My checks are as plain as checking for duplicates in the dataset, as well as checking that the number of rows and columns in a dataset hasn't changed -- see below.
assert df.shape == (1016545, 8)
assert len(df) - len(df.drop_duplicates()) == 0
Since these datasets are updated daily and may change the number of rows, is there a better way to check instead of hardcoding the specific number?
For instance, one dataset might have only 400 rows, and another might have 2 million.
Could I instead check that the count is within 'one standard deviation' of yesterday's number of rows? But in that case, I would need to start collecting previous days' counts in a separate table, and that could get ugly.
Right now, for tables that change daily, I'm doing the following rudimentary check:
assert df.shape[0] <= 1016545 + 100
assert df.shape[0] >= 1016545 - 100
But obviously this is not sustainable.
Any suggestions are much appreciated.
Yes, you would need to store some previous information, but since you don't seem to need it to be perfectly statistically accurate, I think you can cheat a little. If you keep the average number of records from previous samples, the previously calculated deviation, and the number of samples taken, you can get reasonably close to what you are looking for by taking the weighted average of the previous deviation and the current deviation.
For example:
Suppose the average count has been 1016545 with a deviation of 85, captured over 10 samples, and today's count is 1016612. The difference from the mean is 1016612 - 1016545 = 67, and the weighted average of the previous deviation and the current deviation is (85*10 + 67)/11 ≈ 83.
This way you only store a handful of variables per dataset instead of every historical record count, though it also means the result is not a true standard deviation.
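A minimal sketch of that running check, where avg_count, avg_dev and n_samples are illustrative names for the three numbers you would persist per dataset, and the 3x tolerance is an arbitrary choice to tune:
def check_row_count(today_count, avg_count, avg_dev, n_samples, tolerance=3):
    # flag the dataset if today's count drifts too far from its running average
    deviation = abs(today_count - avg_count)
    ok = deviation <= tolerance * max(avg_dev, 1)

    # update the running stats with the weighted average described above
    new_n = n_samples + 1
    new_avg_dev = (avg_dev * n_samples + deviation) / new_n
    new_avg_count = (avg_count * n_samples + today_count) / new_n
    return ok, new_avg_count, new_avg_dev, new_n

# with the numbers from the example:
ok, *stats = check_row_count(1016612, avg_count=1016545, avg_dev=85, n_samples=10)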
As for storage, you could store your data in a database or a json file or any number of other locations -- I won't go into detail for that since it's not clear what environment you are working in or what resources you have available.
Hope that helps!

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the first column as a date and the second column as the data.
As you can see, those interspersed points of similar values that look like lines are likely instrument quirks and should be removed. I've tried rolling mean, rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = gauge.rolling(5, center=True).mean()  # gauge.diff()
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
"Nearby" for element gauge[i] could be a pair like gauge[i-1] and gauge[i+1], but since some points only have neighbours on one side, you can instead ask for at least two elements whose index (date) distance is at most 2. So, at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: distance(gauge[i], gauge[ix]) < D.
D - you can decide this based on how close you expect those real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
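A sketch of that filter in pandas/NumPy, with the window of +/- 2 positions, N = 2 required neighbours, and the distance D left as knobs to tune (all names and defaults are illustrative):
import numpy as np

def keep_points_with_neighbours(series, n_required=2, window=2, max_dist=5.0):
    # keep a point only if at least n_required of its +/- window neighbours
    # lie within max_dist of it in value
    values = series.to_numpy()
    keep = np.zeros(len(values), dtype=bool)
    for i in range(len(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbours = np.delete(values[lo:hi], i - lo)  # drop the point itself
        keep[i] = np.count_nonzero(np.abs(neighbours - values[i]) < max_dist) >= n_required
    return series[keep]

# e.g. cleaned = keep_points_with_neighbours(gauge['Gauge'], max_dist=5.0)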
