Computing a chi square statistic from scratch using numpy/pandas, matrix computations - python

I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".
I feel that the approach I've taken might have room for improvement, so I was wondering how that might be done.
Here's the code:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)
col_sums = dt.apply(sum)
row_sums = dt.apply(sum, axis=1)
l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))
expected_workers = pd.DataFrame(
np.array(l).reshape((3, 4)),
columns=observed_workers.columns,
index=observed_workers.index,
)
chi_squared_stat = (
((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)
This returns the correct value, but I am probably overlooking a nicer approach that uses particular numpy / pandas methods.

With numpy/scipy:
import io

from numpy import genfromtxt, outer
from scipy.stats.contingency import margins

csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""

# skip the header row and the row-label column, keeping only the counts
observed = genfromtxt(io.StringIO(csv), delimiter=',', skip_header=1, usecols=range(1, 5))

# row and column totals
row_sums, col_sums = margins(observed)

# expected counts under independence: outer product of the margins over the grand total
expected = outer(row_sums, col_sums) / observed.sum()

chi_squared_stat = ((observed - expected) ** 2 / expected).sum()
print(chi_squared_stat)
With pandas:
import io

import pandas as pd

csv = """\
work_group,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
df = pd.read_csv(io.StringIO(csv))

# long format: one row per (work_group, group) cell
df_melt = df.melt(id_vars='work_group', var_name='group', value_name='observed')

# row and column totals broadcast back onto each cell
df_melt['col_sum'] = df_melt.groupby('group')['observed'].transform('sum')
df_melt['row_sum'] = df_melt.groupby('work_group')['observed'].transform('sum')
total = df_melt['observed'].sum()

# expected counts and the chi-squared statistic
df_melt['expected'] = df_melt['col_sum'] * df_melt['row_sum'] / total
chi_squared_stat = ((df_melt['observed'] - df_melt['expected']) ** 2 / df_melt['expected']).sum()
print(chi_squared_stat)
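For a cross-check, scipy can compute the statistic directly from the observed table via scipy.stats.chi2_contingency, which also returns the p-value, degrees of freedom and the expected counts. A minimal sketch on the same data (correction=False only affects 2x2 tables, but is made explicit here):
import io

import pandas as pd
from scipy.stats import chi2_contingency

csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed = pd.read_csv(io.StringIO(csv), index_col=0)

# statistic, p-value, degrees of freedom and the expected table in one call
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)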

Related

Generating and Storing Samples of an Exponential Distribution with a name for each sample using a loop

I've got a weird question for a class project. Assuming X ~ Exp(Lambda), Lambda=1.6, I have to generate 100 samples of X, with the indices corresponding to the sample size of each generated sample (S1, S2 ... S100). I've worked out a simple loop which generates the required samples in an array, but I am not able to rename the array.
First attempt:
import numpy as np
import matplotlib.pyplot as plt

samples = []
for i in range(1, 101):
    samples.append(np.random.exponential(scale=1/1.6, size=i))
Second attempt:
import numpy as np
import matplotlib.pyplot as plt

for i in range(1, 101):
    samples = np.random.exponential(scale=1/1.2, size=i)
    col = f'samples {i}'
    df_samples[col] = exponential_sample
df_samples = pd.DataFrame(samples)
An example of how I would like to visualize the data:
# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
df2 = pd.DataFrame(index=['x1', 'x2'])
for i in range(1, 51):
    exponential_sample = np.random.exponential(1/rate, sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample
# Taking a peek at the samples
df2
But instead of having a fixed size = 2, I would like to have sample size = i. This way, I will be able to generate 1 row for the first column (S1), 2 rows for the second column (S2), and so on until I reach 100 rows for the 100th column (S100).
You cannot easily stick vectors of different lengths into a DataFrame, so your mock-up code will not work, but you can concat one vector at a time:
df = pd.DataFrame()
for i in range(100, 10100, 100):
    tmp = pd.DataFrame({f'S{i}': np.random.exponential(scale=1/1.2, size=i)})
    df = pd.concat([df, tmp], axis=1)
Use a dict instead maybe?
samples = {}
for i in range(100, 10100, 100):
    samples[i] = np.random.exponential(scale=1/1.2, size=i)
Then you can convert it into a pandas DataFrame if you like, as sketched below.
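Note that pd.DataFrame(samples) would fail here because the arrays have different lengths; a minimal sketch of one way around that, wrapping each array in a Series so that shorter columns are padded with NaN (the S-prefixed column names are just illustrative):
import numpy as np
import pandas as pd

samples = {i: np.random.exponential(scale=1/1.2, size=i) for i in range(1, 101)}

# each Series keeps its own length; missing positions become NaN when the columns are aligned
df_samples = pd.DataFrame({f'S{i}': pd.Series(v) for i, v in samples.items()})
print(df_samples.shape)  # (100, 100)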

Display summary statistics in barplot using ggplot/plotnine

In the following simplified example, I wish to display the sum of each stacked barplot (3 for A and 7 for B), yet my code displays all the values, not the summary statistics. What am I doing wrong? Thank you in advance.
import io
import pandas as pd
import plotnine as p9
data_string = """V1,V2,value
A,a,1
A,b,2
B,a,3
B,b,4"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=",")
p9.ggplot(df, p9.aes(x='V1', y='value', fill = 'V2')) + \
p9.geom_bar(stat = 'sum') + \
p9.stat_summary(p9.aes(label ='stat(y)'), fun_y = sum, geom = "text")
The issue is the grouping of your data. As you have a global fill aesthetic, your data gets grouped by the categories of V2, so stat_summary computes the sum per group of V2. To solve this, make fill a local aesthetic of geom_bar or geom_col.
import io
import pandas as pd
import plotnine as p9
data_string = """V1,V2,value
A,a,1
A,b,2
B,a,3
B,b,4"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=",")
p9.ggplot(df, p9.aes(x='V1', y='value')) + \
p9.geom_col(p9.aes(fill = 'V2')) + \
p9.stat_summary(p9.aes(label ='stat(y)'), fun_y = sum, geom = "text")
Another option would be to override the global grouping by setting group=1 in stat_summary:
p9.stat_summary(p9.aes(label ='stat(y)', group = 1), fun_y = sum, geom = "text")

How to apply euclidean distance to dataframe. Calculate each row

Please help me, I have a problem. It's been about 2 weeks and I still don't get it.
So, I want to use "apply" on a dataframe that I got from the Alpha Vantage API.
I want to apply the Euclidean distance to each row of the dataframe.
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.neighbors import KNeighborsRegressor
from alpha_vantage.timeseries import TimeSeries
from services.KEY import getApiKey
ts = TimeSeries(key=getApiKey(), output_format='pandas')
And in my picture I got this:
My chart (sorry, I can't post the image because of my reputation)
In my code:
stock, meta_data = ts.get_daily_adjusted(symbol, outputsize='full')
stock = stock.sort_values('date')

open = stock['1. open'].values
low = stock['3. low'].values
high = stock['2. high'].values
close = stock['4. close'].values
sorted_date = stock.index.get_level_values(level='date')

stock_numpy_format = np.stack((sorted_date, open, low, high, close), axis=1)
df = pd.DataFrame(stock_numpy_format, columns=['date', 'open', 'low', 'high', 'close'])

df = df[df['open'] > 0]
df = df[(df['date'] >= "2016-01-01") & (df['date'] <= "2018-12-31")]
df = df.reset_index(drop=True)

df['close_next'] = df['close'].shift(-1)
df['daily_return'] = df['close'].pct_change(1)
df['daily_return'].fillna(0, inplace=True)

# select the two feature columns (note the double brackets) and normalise them
stock_numeric_close_dailyreturn = df[['close', 'daily_return']]
stock_normalized = (stock_numeric_close_dailyreturn - stock_numeric_close_dailyreturn.mean()) / stock_numeric_close_dailyreturn.std()

# date_normalized is the normalised row for the reference date, defined as in the target code below
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis=1)
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = df.loc[int(second_smallest)]["date"]
And I want my chart to look like this:
The chart that I want (image not included)
And the code from that picture:
distance_columns = ['Close', 'DailyReturn']
stock_numeric = stock[distance_columns]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()
stock_normalized.fillna(0, inplace = True)
date_normalized = stock_normalized[stock["Date"] == "2016-06-29"]
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis = 1)
distance_frame = pandas.DataFrame(data = {"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = stock.loc[int(second_smallest)]["Date"]
I tried to figure it out: "apply" behaves differently on the DataFrame built from the pandas API output than on the one loaded with pandas' CSV reader.
Is there an alternative that gives the same output in both formats (pandas and csv)?
Thank you!
NB: sorry if my English is bad.
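No answer was recorded for this question, but as a rough illustration of the row-wise Euclidean-distance pattern it describes, here is a minimal, self-contained sketch on made-up data (the column names, dates and reference date are placeholders, not the Alpha Vantage output):
import numpy as np
import pandas as pd
from scipy.spatial import distance

# toy stand-in for the close / daily_return table
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2016-01-01', periods=5),
    'close': rng.normal(size=5),
    'daily_return': rng.normal(size=5),
})

stock_numeric = df[['close', 'daily_return']]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()

# reference row: the normalised values on a chosen date
reference = stock_normalized[df['date'] == '2016-01-03'].iloc[0]

# distance of every row to the reference row
dists = stock_normalized.apply(lambda row: distance.euclidean(row, reference), axis=1)
print(df.assign(dist=dists).sort_values('dist'))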

Pyspark: What is the Fastest way to Calculate Cosine Similarity against a Column of Vectors

Beginner Pyspark question here! I have a dataframe of ~2M rows of already vectorized text (via w2v; 300 dimensions). What is the most efficient way to calculate the cosine distance for each row against a new single vector input?
My current methodology uses a udf and takes a couple of minutes, far too long for the webapp I'd like to create.
Create a sample df:
import numpy as np
import pandas as pd
from pyspark.sql.functions import *

column = []
num_rows = 10000  # change to 2000000 to really slow your computer down!
for x in range(num_rows):
    sample = np.random.uniform(low=-1, high=1, size=(300,)).tolist()
    column.append(sample)
index = range(num_rows)
df_pd = pd.DataFrame([index, column]).T
#df_pd = pd.concat([df.T[x] for x in df.T], ignore_index=True)
df_pd.head()
df = spark.createDataFrame(df_pd).withColumnRenamed('0', 'Index').withColumnRenamed('1', 'Vectors')
df.show()
Create a sample input (which I create as a spark df in order to transform through my existing pipeline):
new_input = np.random.uniform(low=-1, high=1, size=(300,)).tolist()
df_pd_new = pd.DataFrame([[new_input]])
df_new = spark.createDataFrame(df_pd_new, ['Input_Vector'])
df_new.show()
Calculate cosine distance or similarity between Vectors and new_input:
from pyspark.sql.types import FloatType

value = df_new.select('Input_Vector').collect()[0][0]

def cos_sim(vec):
    # guard against zero-norm vectors; rows where this returns None are dropped below
    if (np.linalg.norm(value) * np.linalg.norm(vec)) != 0:
        dot_value = np.dot(value, vec) / (np.linalg.norm(value) * np.linalg.norm(vec))
        return dot_value.tolist()

cos_sim_udf = udf(cos_sim, FloatType())
#df_all_cos = df_all.withColumn('cos_dis', dot_product_udf('w2v')).dropna(subset='cos_dis')
df_cos = df.withColumn('cos_dis', cos_sim_udf('Vectors')).dropna(subset='cos_dis')
df_cos.show()
And finally let's pull out the max 5 indices for fun:
max_values = df_cos.select('Index', 'cos_dis').orderBy('cos_dis', ascending=False).limit(5).collect()
top_indicies = []
for x in max_values:
    top_indicies.append(x[0])
print(top_indicies)
No pyspark function for cosine distance exists (which would be ideal), so I'm not sure how to speed this up. Any ideas greatly appreciated!
You could try using pandas_udf instead of udf:
# other imports
from pyspark.sql.functions import pandas_udf

# make sure arrow is actually used
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")

def cos_sim2(vec: pd.Series) -> pd.Series:
    # the query vector's norm only needs to be computed once per batch
    value_norm = np.linalg.norm(value)
    cs_value = vec.apply(lambda v: np.dot(value, v) / (np.linalg.norm(v) * value_norm))
    return cs_value.replace(np.inf, np.nan)

cos_sim_udf = pandas_udf(cos_sim2, FloatType())
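The vectorised UDF then drops into the same place as the original one; a small sketch reusing the df and column names from the question:
df_cos = df.withColumn('cos_dis', cos_sim_udf('Vectors')).dropna(subset='cos_dis')
df_cos.show()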

How do I use the output of pandas.ewm.cov?

How is one intended to use the output of the pandas ewm.cov function? I would presume there are functions that let you use it directly, in the form returned, for multiplication, but nothing I try seems to work.
For example, take a minimal use case: stock X and Y return time series in DF1, from which we estimate an EWMA covariance matrix. To get the variance estimate for a portfolio with positions A and B (given in DF2), I need to compute $x^T C x$ at each date, but I can't find a way to do this without writing a for loop.
# Python 3.6, pandas 0.20
import pandas as pd
import numpy as np
np.random.seed(100)
DF1 = pd.DataFrame(dict(X = np.random.normal(size = 100), Y = np.random.normal(size = 100)))
DF2 = pd.DataFrame(dict(A = np.random.normal(size = 100), B = np.random.normal(size = 100)))
COV = DF1.ewm(10).cov()
print(DF1)
print(COV)
# All of the following are invalid
print(COV.dot(DF2))
print(DF2.dot(COV))
print(COV.multiply(DF2))
The best I can figure out is this ugly piece of code
COV.reset_index().rename(columns = dict(level_0 = "index", level_1 = "variable"), inplace = True)
DF2m = pd.melt(DF2.reset_index(), id_vars = "index").sort_values("index")
MDF = pd.merge(COV, DF2m, on=["index", "variable"])
VAR = MDF.groupby("index").apply(lambda x: np.dot(np.dot(x["value"], np.matrix([x["X"], x["Y"]])), x["value"])[0,0])
I hold out hope that there is a nice way to do this...
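No answer is recorded here, but one loop-free possibility is to treat the MultiIndexed ewm().cov() output as a stack of per-date 2x2 matrices and contract it with the position vectors using einsum. A minimal sketch, assuming a pandas version in which DataFrame.ewm().cov() returns a MultiIndexed frame ordered by date and then by variable:
import numpy as np
import pandas as pd

np.random.seed(100)
DF1 = pd.DataFrame(dict(X=np.random.normal(size=100), Y=np.random.normal(size=100)))
DF2 = pd.DataFrame(dict(A=np.random.normal(size=100), B=np.random.normal(size=100)))

COV = DF1.ewm(10).cov()

# one 2x2 covariance matrix per date, positions as row vectors
cov_stack = COV.values.reshape(len(DF1), 2, 2)
weights = DF2.values

# x^T C x for every date in one shot (the first entries may be NaN before the window fills)
VAR = pd.Series(np.einsum('ij,ijk,ik->i', weights, cov_stack, weights), index=DF1.index)
print(VAR.head())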
