Running T Test - Statistical Significant - python

If I want to calculate the statistically significant difference between the different species when looking at the occurrence of the virus, I tried:
data = {'WNV Present': ["negative","negative","positive","negative","positive","positive","negative","negative","negative" ],
'Species': ["Myotis","Myotis","Hoary","Myotis","Myotis","Keens","Myotis","Keens","Keens"]}
my_data = pd.DataFrame(data)
# binarized the WNV Present Column
my_data["WNV Present"] = np.where(my_data["WNV Present"] == "positive", 1, 0)
my_data
# Binarize the Species Column
dum_col3 = pd.get_dummies(my_data["Species"])
dum_col3
dummy_df5 = my_data.join(dum_col2)
dummy_df5.drop(["Species"], axis=1, inplace=True)
dummy_df5
#running t test
from scipy.stats import ttest_ind
set1 = dummy_df5[dummy_df5['WNV Present']==1]
set2 = dummy_df5[dummy_df5['Myotis']==1]
stats.ttest_ind(set1, set2)
My results:
Ttest_indResult(statistic=array([ 3. , 1.36930639, 1.36930639, -2.73861279]), pvalue=array([0.0240082 , 0.21994382, 0.21994382, 0.03379779]))
Why am I receiving various P value results? I tried running this again without binarizing the Species column but that also doesnt tell me if there is a significant difference between species.

Related

Join Pandas DataFrames on Fuzzy/Approximate Matches for Multiple Columns

I have two Pandas DataFrames that look like this. Trying to join the two data sets on 'Name','Longitude', and 'Latitude' but using a fuzzy/approximate match. Is there a way to join these together using a combination of the 'Name' strings being, for example, at least an 80% match and the 'Latitude' and 'Longitude' columns being the nearest value or within like 0.001 of each other? I tried using pd.merge_asof but couldn't figure out how to make it work. Thank you for the help!
import pandas as pd
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
merge_asof won't work here since it can only merge on a single numeric column, such as datetimelike, integer, or float (see doc).
Here you can compute the (euclidean) distance between the coordinates of df1 and df2 and pickup the best match:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
# Replacing 'Latitude' and 'Longitude' columns with a 'Coord' Tuple
df1['Coord'] = df1[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df1.drop(columns=['Latitude', 'Longitude'], inplace=True)
df2['Coord'] = df2[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df2.drop(columns=['Latitude', 'Longitude'], inplace=True)
# Creating a distance matrix between df1['Coord'] and df2['Coord']
distances_df1_df2 = cdist(df1['Coord'].to_list(), df2['Coord'].to_list())
# Creating df1['Price'] column from df2 and the distance matrix
for i in df1.index:
# you can replace the following lines with a loop over distances_df1_df2[i]
# and reject names that are too far from each other
min_dist = np.amin(distances_df1_df2[i])
if min_dist > 0.001:
continue
closest_match = np.argmin(distances_df1_df2[i])
# df1.loc[i, 'df2_Name'] = df2.loc[closest_match, 'Name'] # keep track of the merged row
df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']
print(df1)
Output:
Name Rating Coord Price
0 Game Time Bar 4.5 (42.3734, -71.1204) $$
1 Sports Grill 4.6 (42.3739, -71.1214) $
2 Sports Grill 2 4.3 (42.3839, -71.1315) $$
Edit: your requirement on 'Name' ("at least an 80% match") isn't really appropriate. Take a look at fuzzywuzzy to get a sense of how string distances can be measured.

How to show top 10 ranking with different magnitude

I want to create an overall ranking (but in my true data the features have not the same magnitude at all).
So for example if the top 10 in feature 6 looks like 10^6, 9^6...2^6, values in feature 1 are like 10^2,9^2...2^2.
Hence the overall ranking would be the same ranking as in feature 6, as it is influenced by the magnitude and the given weight is insignificant for infuencing the ranking.
I want to create a new column (or a new dataframe) for overall ranking.
A column that take into account the ranking for each features (hence eliminating the values).
In a second step, rank the countrues with the given different weight for each features, in order to plot the overall ranking of the 10 features.
Also it would be great if I could vizualise the result with matplotlib even though it is a dictionary in each column.
This is the dataframe I have:
import pandas as pd
import numpy as np
data = np.random.randint(100,size=(12,10))
countries = [
'Country1',
'Country2',
'Country3',
'Country4',
'Country5',
'Country6',
'Country7',
'Country8',
'Country9',
'Country10',
'Country11',
'Country12',
]
feature_names_weights = {
'feature1' :1.0,
'feature2' :4.0,
'feature3' :1.0,
'feature4' :7.0,
'feature5' :1.0,
'feature6' :1.0,
'feature7' :8.0,
'feature8' :1.0,
'feature9' :9.0,
'feature10' :1.0,
}
feature_names = list(feature_names_weights.keys())
df = pd.DataFrame(data=data, index=countries, columns=feature_names)
data_etude_copy = df
data_sorted_by_feature = {}
country_scores = (pd.DataFrame(data=np.zeros(len(countries)),index=countries))[0]
for feature in feature_names:
#Adds to each country's score and multiplies by weight factor for each feature
for country in countries:
country_scores[country] += data_etude_copy[feature][country]*(feature_names_weights[feature])
#Sorts the countries by feature (your code in loop form)
data_sorted_by_feature[feature] = data_etude_copy.sort_values(by=[feature], ascending=False).head(10)
data_sorted_by_feature[feature].drop(data_sorted_by_feature[feature].loc[:,data_sorted_by_feature[feature].columns!=feature], inplace=True, axis = 1)
#sort country total scores
ranked_countries = country_scores.sort_values(ascending=False).head(10)
##Put everything into one DataFrame
#Create empty DataFrame
empty_data=np.empty((10,10),str)
outputDF = pd.DataFrame(data=empty_data,columns=((feature_names)))
#Add entries for all features
for feature in feature_names:
for index in range(10):
country = list(data_sorted_by_feature[feature].index)[index]
outputDF[feature][index] = f'{country}: {data_sorted_by_feature[feature][feature][country]}'
#Add column for overall country score
#Print DataFrame
outputDF
The features in my dataframe have not the data normalized, just "ranked".
Expected output would be something like a sum of the normalized rankings with their corresponding weight:

IN Pandas Python Dataframe I need to set values of a column based on the values of other columns

I Have a dataframe with 3 columns namely cuid ,type , errorreason. Now the error reason is empty and I have to fill it with the following logic-
1.) If cuid is unique and type is 'COL' then errorreason is 'NO ERROR'( ALL UNIQUE VALUES ARE 'NO ERROR')
2.) If cuid is not unique , and type is 'COL' AND 'ROT' , then error is errorreason is 'AD'
3.) If cuid is not unique , and type is 'COL' AND 'TOT' , then error is errorreason is 'RE'
4.) Any other case , except the above mentioned , errorreason is 'Unidentified'
I have already seperated the unique and non unique values , so first point is done. Kinda stuck on the next points . I was trying to group by the non unique values and then apply a function. Kinda stuck here.
This is a quite long solution, but I inserted explanations for each step so that they are clear to you. At the end you obtain your desired output
import numpy as np
import pandas as pd
# sample data
df = pd.DataFrame({
'cuid': [100814, 100814, 100815, 100815, 100816],
'type': ['col', 'rot', 'col', 'tot', 'col']
})
# define function for concatenating 'type' within the same 'cuid'
def str_cat(x):
return x.str.cat(sep=', ')
# create a lookup dataset that we will merge later on
df_lookup = df.groupby('cuid').agg({
'cuid': 'count',
'type': str_cat
}).rename(columns={'cuid': 'unique'})
# create the variable 'error_reason' on this lookup dataset thanks to a case when like statement using np.select
df_lookup['error_reason'] = np.select(
[
(df_lookup['cuid'] == 1) & (df_lookup['type'] == 'col'),
(df_lookup['cuid'] > 1) & (df_lookup['type'].str.contains('col')) & (df_lookup['type'].str.contains('rot')),
(df_lookup['cuid'] > 1) & (df_lookup['type'].str.contains('col')) & (df_lookup['type'].str.contains('tot'))
],
[
'NO ERROR',
'AD',
'RE'
],
default = 'Unidentified'
)
# merge the two datasets
df.merge(df_lookup.drop(columns=['type', 'unique']), on='cuid')
Output
cuid type error_reason
0 100814 col AD
1 100814 rot AD
2 100815 col RE
3 100815 tot RE
4 100816 col NO ERROR
Try to use this:
df.groupby('CUID',as_index=False)['TYPE'].aggregate(lambda x: list(x))
I have not tested this solution so let me know if it does not work.

Python - Kmeans - Add the centroids as a new column

Assume I have the following dataframe. How can I create a new column "new_col" containing the centroids? I can only create the column with the labs, not with the centroids.
Here is my code.
from sklearn import preprocessing
from sklearn.cluster import KMeans
numbers = pd.DataFrame(list(range(1,1000)), columns = ['num'])
kmean_model = KMeans(n_clusters=5)
kmean_model.fit(numbers[['num']])
kmean_model.cluster_centers_
array([[699. ],
[297. ],
[497.5],
[899.5],
[ 99. ]])
numbers['new_col'] = kmean_model.predict(numbers[['num']])
It is simple. Just use .labels_ as follows.
numbers['new_col'] = kmean_model.labels_
Edit. Sorry my mistake.
Make dictionary whose key is label and value is centers, and replace the new_col using the dictionary. See the following.
label_center_dict = {k:v for k, v in zip(kmean_model.labels_, kmean_model.cluster_centers_)}
numbers['new_col'] = kmean_model.labels_
numbers['new_col'].replace(label_center_dict, inplace = True)

How to calculate the mean and standard deviation of multiple dataframes at one go?

I've several hundreds of pandas dataframes and And the number of rows are not exactly the same in all the dataframes like some have 600 but other have 540 only.
So what i want to do is like, i have two samples of exactly the same numbers of dataframes and i want to read all the dataframes(around 2000) from both the samples. So that's how thee data looks like and i can read the files like this:
5113.440 1 0.25846 0.10166 27.96867 0.94852 -0.25846 268.29305 5113.434129
5074.760 3 0.68155 0.16566 120.18771 3.02654 -0.68155 101.02457 5074.745627
5083.340 2 0.74771 0.13267 105.59355 2.15700 -0.74771 157.52406 5083.337081
5088.150 1 0.28689 0.12986 39.65747 2.43339 -0.28689 164.40787 5088.141849
5090.780 1 0.61464 0.14479 94.72901 2.78712 -0.61464 132.25865 5090.773443
#first Sample
path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
df = df.sort_values('stlines', ascending=False)
df = df.drop_duplicates('wave')
df = df.reset_index(drop=True)
lst.append(df)
#second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
df1 = pd.read_table(path_to_files1+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
df1 = df1.sort_values('stlines', ascending=False)
df1 = df1.drop_duplicates('wave')
df1 = df1.reset_index(drop=True)
lst1.append(df1)
Now the data is stored in lists and as the number of rows in all the dataframes are not same so i cant subtract them directly.
So how can i subtract them correctly?? And after that i want to take average(mean) of the residual to make a dataframe?
You shouldn't use apply. Just use Boolean making:
mask = df['waves'].between(lower_outlier, upper_outlier)
df[mask].plot(x='waves', y='stlines')
One solution that comes into mind is writing a function that finds outliers based on upper and lower bounds and then slices the data frames based on outliers index e.g.
df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})
def outlier(value, upper, lower):
"""
Find outliers based on upper and lower bound
"""
# Check if input value is within bounds
in_bounds = (value <= upper) and (value >= lower)
return in_bounds
# Function finds outliers in wave column of DF1
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))
# Return DF2 without values at outlier index
df2[outlier_index]
# Return DF1 without values at outlier index
df1[outlier_index]

Categories