How to decide which statistical test is relevant for my data - python

This might not be the best platform to ask this, but I thought I would try.
I want to perform a statistical test on my data in order to assess its significance. My data is the following:
I have an online validation survey which asks participants to view a video and make a selection. For each annotation, they also indicate their confidence in their answer.
I have analysed these results in Python using pandas dataframes and what I want is the following:
I want to see if there is a correlation between a video which has a high agreement count (i.e. a high count of participants making the same selection) and a high confidence value. So for each video, I have a selection and a confidence value for each participant. I have grouped these together to get the agreement count, shown in an example below:
Index  Video_#  Selected Joint.1  Agreement Count
33     5        Head              24
9      2        Head              21
58     9        Hip_centre        17
128    16       Hip_centre        14
Here's also an example of the data before it is grouped together:
Index  Video_#  Selected Joint.1  Confidence Value
0      33       Left_elbow        4
1      26       Left_shoulder     4
2      23       Right_foot        3
3      16       Left_hip          2
Is there a statistical test I can perform to find a correlation between agreement count and confidence values, e.g. Spearman's or Pearson's correlation? I just can't seem to wrap my head around how to use them in this context. I'm working in Python.
Thanks in advance for the help!!
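One way to set this up (a sketch, assuming the column names from the tables above and hypothetical example values): aggregate per video, then correlate the agreement count with the mean confidence. Spearman's rank correlation is the safer default here since it assumes no linearity:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-response rows mirroring the question's layout:
# one row per participant, with the selected joint and a confidence value.
responses = pd.DataFrame({
    'Video_#': [5, 5, 5, 2, 2, 9, 9, 16],
    'Selected Joint.1': ['Head', 'Head', 'Head', 'Head',
                         'Hip_centre', 'Hip_centre', 'Hip_centre', 'Hip_centre'],
    'Confidence Value': [4, 4, 3, 4, 2, 3, 3, 2],
})

# Agreement count: size of the largest same-selection group per video.
agreement = (responses
             .groupby(['Video_#', 'Selected Joint.1']).size()
             .groupby('Video_#').max())

# Mean confidence per video (aligned with `agreement` on Video_#).
confidence = responses.groupby('Video_#')['Confidence Value'].mean()

# Spearman's rank correlation: monotonic association, no linearity assumption.
rho, p_value = spearmanr(agreement, confidence)
print(rho, p_value)
```

With real data you may also want to weigh whether mean confidence should be computed over all participants or only over those in the majority selection; that is a modelling choice the question leaves open.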

Change line plot color according to a feature

Suppose I have the following data frame.
ID  Some Reading  Diagnosis             Date
1   10            Stable condition      1/2020
1   20            Possible Failure      3/2020
1   5             Stable Condition      5/2020
2   90            Maintenance Required  1/2020
2   150           Imminent Fault        3/2020
2   200           Complete Failure      4/2020
I wish to do a time series line plot of the 2 machines according to the Some Reading factor. However, I wish to give each line a different color based on the Diagnosis variable. For example, Stable condition would be a greener shade, while Complete Failure would be black. How can I change a line's color like this?
This is the sort of thing I currently have, where each color is a machine:
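One way to approach this (a sketch with an assumed diagnosis-to-color palette, not the only option): plot each machine's line segment by segment, coloring each segment by the diagnosis at its end point:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Some Reading': [10, 20, 5, 90, 150, 200],
    'Diagnosis': ['Stable condition', 'Possible Failure', 'Stable condition',
                  'Maintenance Required', 'Imminent Fault', 'Complete Failure'],
    'Date': pd.to_datetime(['1/2020', '3/2020', '5/2020',
                            '1/2020', '3/2020', '4/2020'], format='%m/%Y'),
})

# Assumed palette: greener = healthier, black = complete failure.
palette = {
    'Stable condition': 'green',
    'Possible Failure': 'orange',
    'Maintenance Required': 'gold',
    'Imminent Fault': 'red',
    'Complete Failure': 'black',
}

fig, ax = plt.subplots()
for machine_id, grp in df.groupby('ID'):
    grp = grp.sort_values('Date').reset_index(drop=True)
    # Draw each consecutive segment in the color of its end point's diagnosis.
    for i in range(len(grp) - 1):
        a, b = grp.iloc[i], grp.iloc[i + 1]
        ax.plot([a['Date'], b['Date']], [a['Some Reading'], b['Some Reading']],
                color=palette[b['Diagnosis']])
ax.set_ylabel('Some Reading')
fig.savefig('readings.png')
```

A segment-by-segment loop is the simplest route; for large data, matplotlib's LineCollection would do the same thing more efficiently.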

pandas data frame, apply t-test on rows simultaneously grouping by column names (have duplicates!)

I have a data frame with particular readouts as indexes (different types of measurements for a given sample), each column is a sample for which these readouts were taken. I also have a treatment group assigned as the column name for each sample. You can see the example below.
What I need to do: for a given readout (row), group samples by treatment (column name) and perform Welch's t-test on each group (each treatment). The t-test must be a comparison against one fixed treatment (the control treatment). I do not care about tracking the sample ids (they were required before; I dropped them on purpose), and I'm not going to do paired tests.
For example here, for readout1 I need to compare treatment1 vs treatment3, treatment2 vs treatment3 (it's ok if I'll also get treatment3 vs treatment3).
Example of data frame:
frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1', 'treatment1', 'treatment1',
                              'treatment2', 'treatment2', 'treatment2',
                              'treatment3', 'treatment3', 'treatment3'])
frame
Out[757]:
treatment1 treatment1 ... treatment3 treatment3
readout1 0 1 ... 7 8
readout2 9 10 ... 16 17
readout3 18 19 ... 25 26
[3 rows x 9 columns]
I've been fighting with this for several days now. I tried to unstack/stack the data, transpose the data frame, group by index, remove NaNs and apply a lambda. I tried other strategies but none worked. I will appreciate any help.
thank you!
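One possible approach (a sketch, not a definitive answer): duplicate column labels can be selected with a boolean mask on frame.columns, so each row can be split into treatment groups directly and fed to scipy's Welch t-test. treatment3 is taken as the control, per the example:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1'] * 3 + ['treatment2'] * 3
                             + ['treatment3'] * 3)

control = 'treatment3'  # the fixed control treatment from the question

results = {}
for readout, row in frame.iterrows():
    # Boolean masks on the (duplicated) column labels pick out each group.
    control_vals = row[frame.columns == control].astype(float)
    for treatment in frame.columns.unique():
        if treatment == control:
            continue
        vals = row[frame.columns == treatment].astype(float)
        # Welch's t-test: equal_var=False allows unequal group variances.
        t, p = ttest_ind(vals, control_vals, equal_var=False)
        results[(readout, treatment)] = (t, p)

for key, (t, p) in sorted(results.items()):
    print(key, t, p)
```

This sidesteps the stack/transpose gymnastics entirely; with many readouts you may want to collect `results` into a DataFrame for downstream use.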

How to cluster data based on a subset of attributes (4 attributes)?

I have a pandas DataFrame that holds the data for some objects, among which the position of some parts of the object (Left, Top, Right, Bottom).
For example:
ObjectID  Left  Right  Top  Bottom
1         0     0      0    0
2         20    15     5    5
3         3     2      0    0
How can I cluster the objects based on these 4 attributes?
Is there a clustering algorithm/technique that you would recommend?
Almost all clustering algorithms are multivariate and can be used here. So your question is too broad.
It may be worth looking at appropriate distance measures first.
No recommendation would be sound to give, because we don't know how your data is distributed.
Depending upon the data type and final objective, you can try k-means, k-modes or k-prototypes. If your data is a mix of categorical and continuous variables, you can try the Partitioning Around Medoids (PAM) algorithm. However, as stated earlier by another user, can you give more information about the type of data and its variance?
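To make the suggestions above concrete, here is a minimal k-means sketch on the example frame. The features are standardized first so no single attribute dominates the Euclidean distance; n_clusters=2 is an arbitrary choice for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'ObjectID': [1, 2, 3],
    'Left': [0, 20, 3],
    'Right': [0, 15, 2],
    'Top': [0, 5, 0],
    'Bottom': [0, 5, 0],
})

# Cluster on the four position columns only, leaving ObjectID out.
features = df[['Left', 'Right', 'Top', 'Bottom']]
X = StandardScaler().fit_transform(features)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
df['cluster'] = km.fit_predict(X)
print(df)
```

On real data, picking the number of clusters (elbow method, silhouette score) matters more than the algorithm choice for a simple 4-dimensional numeric case like this.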

Create pandas timeseries from list of dictionaries with many keys

I have multiple timeseries that are outputs of various algorithms. These algorithms can have various parameters and they produce timeseries as a result:
timestamp1 = 1
value1 = 5
timestamp2 = 2
value2 = 8
timestamp3 = 3
value3 = 4
timestamp4 = 4
value4 = 12
resultsOfAlgorithms = [
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '200',
        'result-of-algorithm': [[timestamp1, value1], [timestamp2, value2]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp1, value1], [timestamp3, value3]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    },
    {
        'algorithm': 'delta',
        'param-a': '12',
        'param-b': '50',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    }
]
I would like to be able to filter the timeseries by algorithm and parameters, and plot the filtered timeseries to see how given parameters affect the output. To do that, I need to know all the occurring values for a given parameter, and then be able to select the timeseries with the desired parameters. E.g. I would like to plot all results of the minmax algorithm with param-b == 30. There are 2 results that were produced with the minmax algorithm and param-b == 30, so I would like a plot with 2 timeseries in it.
Is this possible with pandas or is this out of pandas functionality? How could this be implemented?
Edit:
Searching the internet some more, I think I am looking for a way to use hierarchical indexing. Also, the timeseries should stay separated: each result is an individual timeseries and should not be merged with other results. I need to filter the results of the algorithms by the parameters used, and the result of the filter should still be a list of timeseries.
Edit 2:
There are multiple sub-problems:
Find all existing values for each parameter (the user does not know all the values, since parameters can be auto-generated by the system)
The user selects some of the values for filtering
One way this could be provided by the user is a dictionary (but more user-friendly ideas are welcome):
filter = {
    'param-b': [30, 50],
    'algorithm': 'minmax'
}
Timeseries from resultsOfAlgorithms[1:3] (the 2nd and 3rd results) are given as the result of filtering, since these results were produced by the minmax algorithm with param-b equal to 30. Thus in this case:
[
    [[timestamp1, value1], [timestamp3, value3]],
    [[timestamp2, value2], [timestamp4, value4]]
]
The result of filtering will return multiple time series, which I want to plot and compare.
The user wants to try various filters to see how they affect results
I am doing all this in a Jupyter notebook, and I would like to let the user try various filters with the least hassle possible.
Timestamps between results are not necessarily shared. E.g. all timeseries might occur between 1 pm and 3 pm and have roughly the same number of values, but neither the timestamps nor the number of values are identical.
So there are two options here: one is to clean up the dict first and then convert it easily to a DataFrame; the second is to convert it to a DataFrame first and then clean up the column containing nested lists. For the first solution, you can restructure the dict like this:
import pandas as pd
from collections import defaultdict

data = defaultdict(list)
for roa in resultsOfAlgorithms:
    for i in range(len(roa['result-of-algorithm'])):
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(roa['result-of-algorithm'][i][0])
        data['value'].append(roa['result-of-algorithm'][i][1])

df = pd.DataFrame(data)
In [31]: df
Out[31]:
algorithm param-a param-b time value
0 minmax 12 200 1 5
1 minmax 12 200 2 8
2 minmax 12 30 1 5
3 minmax 12 30 3 4
4 minmax 12 30 2 8
5 minmax 12 30 4 12
6 delta 12 50 2 8
7 delta 12 50 4 12
And from here you can do whatever analysis you need with it, whether it's plotting or making the time column the index or grouping and aggregating, and so on. You can compare this to making a dataframe first in this link:
Splitting a List inside a Pandas DataFrame
Where they basically did the same thing, with splitting a column of lists into multiple rows. I think fixing the dictionary will be easier though, depending on how representative your fairly simple example is of the real data.
Edit: If you wanted to turn this into a multi-index, you can add one more line:
df_mi = df.set_index(['algorithm', 'param-a', 'param-b'])
In [25]: df_mi
Out[25]:
time value
algorithm param-a param-b
minmax 12 200 1 5
200 2 8
30 1 5
30 3 4
30 2 8
30 4 12
delta 12 50 2 8
50 4 12
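Building on the long-form frame, the filtering the question asks for could look like the sketch below. Note it assumes an extra run column identifying which algorithm invocation each point came from; something like it is needed to keep apart results that share the same algorithm and parameters (like the two minmax / param-b == 30 runs), since nothing else distinguishes them in long form:

```python
import pandas as pd

# Long-form frame like the one built above, plus an assumed `run` column.
df = pd.DataFrame({
    'algorithm': ['minmax'] * 6 + ['delta'] * 2,
    'param-a': ['12'] * 8,
    'param-b': ['200', '200', '30', '30', '30', '30', '50', '50'],
    'time': [1, 2, 1, 3, 2, 4, 2, 4],
    'value': [5, 8, 5, 4, 8, 12, 8, 12],
    'run': [0, 0, 1, 1, 2, 2, 3, 3],
})

# Filter: minmax results with param-b == 30, kept as separate series.
mask = (df['algorithm'] == 'minmax') & (df['param-b'] == '30')
series_list = [grp.set_index('time')['value']
               for _, grp in df[mask].groupby('run')]

# Each element is one timeseries, ready to plot individually.
print(len(series_list))
```

Each element of `series_list` can then be passed to `.plot()` on its own, which keeps the timeseries separated the way the edits ask for.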

How to plot a graph of the correlation coefficient between each attribute of a dataset and the target attribute using Python

I am new to Python and I need to plot the correlation coefficient of each attribute against the target variable. I have an input dataset with a huge number of values; a sample is provided below. We need to predict whether a particular consumer will leave the company or not, so the Result column is the target variable.
SALARY  DUE  RENT    CALLSPERDAY  CALL DURATION  RESULT
238790  7    109354  0            6              YES
56004   0    204611  28           15             NO
671672  27   371953  0            4              NO
786035  1    421999  19           11             YES
89684   2    503335  25           8              NO
904285  3    522554  0            13             YES
12072   4    307649  4            11             NO
23621   19   389157  0            4              YES
34769   11   291214  1            13             YES
945835  23   515777  0            5              NO
Here, as you can see, the Result column is a string whereas the rest of the columns are integers. Similar to Result, I also have a few other columns (not shown in the sample) that have string values, so I need to handle columns with both string and integer values.
Using dictionary I have assigned a value to each of the columns which has string value.
Example: Result column has Yes or No. Hence assigned value as below:
D = {'NO': 0, 'YES': 1}
and using a lambda function, I looped through each column of the dataset and replaced NO with 0 and YES with 1.
I tried to calculate the correlation coefficient using the formula:
pearsonr(S.SALARY,targetVarible)
Where S is the dataframe which holds all values.
Similarly, I will loop through all the columns of the dataset and calculate the correlation coefficient of each column against the target variable.
Is this an efficient way of calculating correlation coefficients?
I ask because I am getting a value like the one below, and e-25 seems too small:
(0.088327739664096655, 1.1787456108540725e-25)
Is there any other way to calculate it? Would you suggest any other way to encode string values so that they can be treated as integers when compared with the integer columns (other than the dictionaries and lambdas I used)?
Also, I need to plot a bar graph using the same code. I am planning to use matplotlib (from matplotlib import pyplot as plt). Would you suggest any other function to plot the bar graph? Mostly I'm using existing functions from sklearn, numpy and pandas.
It would be great if someone could help me. Thanks.
As mentioned in the comments, you can use df.corr() to get the correlation matrix of your data (all columns must be numeric, which your dictionary encoding takes care of). Assuming the name of your DataFrame is df, you can plot the correlations with the target as a bar chart:
df_corr = df.corr()
df_corr[['RESULT']].plot(kind='bar')
Pandas DataFrames have a plot function that uses matplotlib. You can learn more about it here: http://pandas.pydata.org/pandas-docs/stable/visualization.html
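As a side note, scipy's pearsonr returns a (coefficient, p-value) tuple, so the 1.17e-25 in the question is a p-value, not a tiny correlation. A compact way to get every column's correlation with the target and bar-plot it (a sketch using a few of the sample rows from the question):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# A few sample rows from the question, with RESULT encoded numerically.
df = pd.DataFrame({
    'SALARY': [238790, 56004, 671672, 786035, 89684],
    'DUE': [7, 0, 27, 1, 2],
    'CALLSPERDAY': [0, 28, 0, 19, 25],
    'RESULT': ['YES', 'NO', 'NO', 'YES', 'NO'],
})
df['RESULT'] = df['RESULT'].map({'NO': 0, 'YES': 1})

# Pearson correlation of every column with the target, target itself dropped.
corr_with_target = df.corr()['RESULT'].drop('RESULT')

ax = corr_with_target.plot(kind='bar')
ax.set_ylabel('Pearson r with RESULT')
plt.tight_layout()
plt.savefig('corr_bar.png')
```

This avoids the explicit loop over columns entirely; df.corr() computes the whole matrix in one call.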
