Python correlation matrix for categorical data - python

I have some data for a charity which contains the amount someone donated and some information about the person who donated like below.
gender  age  country  donation_amount
F       25   UK       15
F       65   France   80
M       55   Germany  54
F       41   UK       3
M       74   France   99
I would like to find out which columns are most strongly correlated with the donation amount so I can investigate them further, e.g. if certain countries donate a lot compared to others it would be good to target them. This is easy to do with the pandas corr() method; however, it doesn't work with categorical data such as gender, only with numerical data such as age.
Does anyone know a way I can do this?
I have read about using pandas.get_dummies() to convert categorical variables into dummy/indicator variables. The problem is that I have quite a lot of columns, and a couple of them have over 40 different demographic categories, so this gets very big incredibly quickly and hard to interpret (the way I have been doing it, at least!).
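For what it's worth, one way to keep the get_dummies() approach manageable is to look only at the column of correlations against donation_amount rather than the full matrix. A minimal sketch using the sample data above (the column names are assumed from the table):

import pandas as pd

df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'age': [25, 65, 55, 41, 74],
    'country': ['UK', 'France', 'Germany', 'UK', 'France'],
    'donation_amount': [15, 80, 54, 3, 99],
})

# one-hot encode the categorical columns, then keep only the
# correlations with donation_amount instead of the full matrix
encoded = pd.get_dummies(df, columns=['gender', 'country'])
corr_with_target = encoded.corr()['donation_amount'].drop('donation_amount')
print(corr_with_target.sort_values(ascending=False))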
I also found an article saying you can use spearmanr, but I have read elsewhere that you shouldn't use spearmanr for categorical data. The pandas corr(method='spearman') option still doesn't work on categorical data either.
(Python: Rank order correlation for categorical data)
This is my first post so apologies if I haven't explained myself very well! Please let me know and I will correct anything if needed.

Related

Classify variables in Nominal/ordinal/interval/binary in case user inputs not provided?

If there are no predefined column types (nominal/interval) stored, and some variables are encoded as 1, 2, 3, ... in place of the actual categories (e.g. good, better, bad, ...), they may automatically be classified as interval variables even though they are actually nominal variables that have been encoded.
Is there any way to identify such variables?
I thought of using cardinality, but the threshold becomes an issue there; please suggest some other solution.
I'm fine with a Python solution, but if someone can give an idea for SAS that would also be helpful :)
As a data analyst, it's your call whether to treat a categorical column as nominal or ordinal (depending on the data).
If nominal data --> use dummy variables (one-hot encoding).
If ordinal data --> use the map() function for label encoding.
If nominal data with high cardinality --> encode according to frequency count (say there are 30 different categories in a column and 1,000 rows; 3 categories have a high frequency count, so keep those as 3 separate categories, while the other 27 have very low counts, so put all of them into a single category, i.e. there will be only 4 categories, not 30). A sketch of these encodings follows below.
Apart from object-type (string) columns, frequency counts play a very important role in identifying categorical variables among the numeric columns.
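A minimal sketch of the three encodings above, using a hypothetical DataFrame with a 'quality' column (ordinal) and a 'city' column (nominal, potentially high cardinality); the category names and the frequency threshold are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    'quality': ['good', 'better', 'bad', 'good', 'bad'],      # ordinal
    'city': ['London', 'Paris', 'London', 'Oslo', 'London'],  # nominal
})

# ordinal data --> map() for label encoding
df['quality'] = df['quality'].map({'bad': 0, 'good': 1, 'better': 2})

# nominal data with high cardinality --> lump infrequent categories together first
counts = df['city'].value_counts()
rare = counts[counts < 2].index                 # threshold of 2 is arbitrary here
df['city'] = df['city'].where(~df['city'].isin(rare), 'other')

# nominal data --> dummy variables / one-hot encoding
df = pd.get_dummies(df, columns=['city'])
print(df)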

Python longitudinal data, filling NaN values

I have a dataset on health indicators, with columns such as 'Country', 'Year', 'GDP', and 'Life expectancy'. The data covers the years 2000-2015.
So, there is data for many health indicators for each country for each of the years from 2000-2015.
Many of the variables have missing (NaN) data for specific years/countries.
So, for instance, how would I replace NaN values with the mean value specific to the given country over its year range, for all countries?
Additionally, since this is longitudinal data, it would be great to maintain the general trend over time within each country's 16 years of data. Is there a way to replace NaN data for each country, accounting for the general trend for that country/variable over time?
If you guys could explain both methods, that would be phenomenal.
link to data: https://www.kaggle.com/kumarajarshi/life-expectancy-who
Thanks,
D
You probably want to look into the pd.DataFrame.interpolate() method. It has different methods for filling NaNs in a time series and for filling in missing values more generally.
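As a rough sketch of both ideas, assuming the Kaggle file has been loaded into a DataFrame with 'Country' and 'Year' columns plus numeric indicator columns (the file name and exact column names below are assumptions; adjust them to the actual dataset):

import pandas as pd

df = pd.read_csv('life_expectancy.csv')    # hypothetical file name for the Kaggle data
num_cols = ['GDP', 'Life expectancy']      # adjust to the exact column names in the file

# Option 1: fill NaNs with each country's own mean over its 2000-2015 rows
df[num_cols] = df.groupby('Country')[num_cols].transform(lambda s: s.fillna(s.mean()))

# Option 2 (alternative): interpolate within each country, preserving its trend over time
df = df.sort_values(['Country', 'Year'])
df[num_cols] = df.groupby('Country')[num_cols].transform(
    lambda s: s.interpolate(limit_direction='both'))

Option 2 corresponds to the interpolate() suggestion above; doing it per country keeps values from one country's series from bleeding into the next.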

How to decide which statistical test is relevant for my data

This might not be the best platform to ask this, but I thought I would try.
I want to perform a statistical test of my data in order to validate their significance. My data is the following:
I have an online validation survey which asks participants to view a video and make a selection. Also for each annotation, they indicate their confidence in their answer.
I have analysed these results in Python using pandas dataframes and what I want is the following:
I want to see if there is a correlation between a video which has a high agreement count (i.e. a high count of participants making the same selection) and a high confidence value. So for each video, I have a selection and a confidence value for each participant. I have grouped these together to get the agreement count, shown in an example below:
Index  Video_#  Selected Joint.1  Agreement Count
33     5        Head              24
9      2        Head              21
58     9        Hip_centre        17
128    16       Hip_centre        14
Here's also an example of the data before it is grouped together:
Index  Video_#  Selected Joint.1  Confidence Value
0      33       Left_elbow        4
1      26       Left_shoulder     4
2      23       Right_foot        3
3      16       Left_hip          2
Is there a statistical test I can perform to find a correlation between agreement count and confidence values, e.g. Spearman's or Pearson's correlation? I just can't seem to wrap my head around how to use them in this context. I'm working in Python.
Thanks in advance for the help!!
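For reference, a minimal sketch of how Spearman's correlation could be applied here, assuming the ungrouped results live in a DataFrame called responses with the columns shown above; each video is first aggregated to its agreement count and mean confidence:

from scipy.stats import spearmanr

# per-video agreement count (votes for the most common selection) and mean confidence,
# assuming a DataFrame `responses` shaped like the ungrouped table above
per_video = responses.groupby('Video_#').agg(
    agreement_count=('Selected Joint.1', lambda s: s.value_counts().iloc[0]),
    mean_confidence=('Confidence Value', 'mean'),
)

rho, p_value = spearmanr(per_video['agreement_count'], per_video['mean_confidence'])
print(rho, p_value)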

implementing hot deck imputation in python

I have a dataset that contains both numeric and categorical data, like this:
subject_id hour_measure heart rate blood_pressure urine color
3 4 60
4 2 70 60 red
6 1 30 yellow
I tried various methods to handle the missing data, such as the following code:
import numpy as np
# fill numeric columns with the mean and other columns with the mode
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df[cols] = df[cols].fillna(df[cols].apply(f))
df = df.fillna(method='ffill')
but these techniques didn't give me the result I want. I would like to try hot-deck imputation; I already understand the concept of the hot-deck technique, and it seems a suitable way to handle both numeric and categorical data.
If you are using your data as input for machine learning, you can convert the columns containing text to numbers (e.g. with a lookup table, or by converting the colors to their corresponding RGB values).
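For instance, a minimal sketch of the lookup-table idea, assuming a DataFrame df with the 'urine color' column from the question (the numeric codes are arbitrary):

# hypothetical lookup table mapping each color to a numeric code
color_lut = {'red': 0, 'yellow': 1}
df['urine color'] = df['urine color'].map(color_lut)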
Regarding the second part of your question : could you be more specific about what results you are expecting and what your current code produces?
The hot-deck method is defined in the literature as a method that replaces missing values with randomly selected values from the dataset at hand. So I tried a hot-deck approach to handle the missing data, with the following code:
import random
import pandas as pd
def hotdeck_imputation(data):
    # replace each missing value with a randomly chosen observed value from the same column
    for c in data.columns:
        data.loc[:, c] = [random.choice(data[c].dropna().values) if pd.isna(i) else i
                          for i in data[c]]
    return data
I hope it helps with your problem.

Python conditional filtering in csv file

Please help! I have tried different things/packages to write a program that takes four inputs and returns the writing score statistics for the group defined by that combination of inputs from a CSV file. This is my first project, so I would appreciate any insights/hints/tips!
Here is the csv sample (has 200 rows total):
id   gender  ses     schtyp  prog      write
70   male    low     public  general   52
121  female  middle  public  vocation  68
86   male    high    public  general   33
141  male    high    public  vocation  63
172  male    middle  public  academic  47
113  male    middle  public  academic  44
50   male    middle  public  general   59
11   male    middle  public  academic  34
84   male    middle  public  general   57
48   male    middle  public  academic  57
75   male    middle  public  vocation  60
60   male    middle  public  academic  57
Here is what I have so far:
import csv
import numpy
csv_file_object=csv.reader(open('scores.csv', 'rU')) #reads file
header=csv_file_object.next() #skips header
data=[] #loads data into array for processing
for row in csv_file_object:
    data.append(row)
data=numpy.array(data)
#asks for inputs
gender=raw_input('Enter gender [male/female]: ')
schtyp=raw_input('Enter school type [public/private]: ')
ses=raw_input('Enter socioeconomic status [low/middle/high]: ')
prog=raw_input('Enter program status [general/vocation/academic]: ')
#makes them lower case and strings
prog=str(prog.lower())
gender=str(gender.lower())
schtyp=str(schtyp.lower())
ses=str(ses.lower())
What I am missing is how to filter and get stats for only a specific group. For example, say I input male, public, middle, and academic; I'd want to get the average writing score for that subset. I tried the groupby function from pandas, but that only gets you stats for broad groups (such as public vs private). I also tried a pandas DataFrame, but that only gets me filtering on one input, and I'm not sure how to get the writing scores. Any hints would be greatly appreciated!
Agreeing with Ramon, Pandas is definitely the way to go, and has extraordinary filtering/sub-setting capability once you get used to it. But it can be tough to first wrap your head around (or at least it was for me!), so I dug up some examples of the sub-setting you need from some of my old code. The variable itu below is a Pandas DataFrame with data on various countries over time.
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US
Look at pandas. I think it will shorten your CSV parsing work and give you the subsetting functionality you're asking for...
import pandas as pd
data = pd.read_csv('fileName.txt', delim_whitespace=True)
#get all of the male students
data[data['gender'] == 'male']
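Building on that, a rough sketch of filtering on all four inputs at once and summarising the write scores (assuming the same scores.csv and the lowercased gender, schtyp, ses, and prog variables from the question):

import pandas as pd

data = pd.read_csv('scores.csv', delim_whitespace=True)

# keep only the rows matching all four inputs at the same time
subset = data[(data['gender'] == gender) &
              (data['schtyp'] == schtyp) &
              (data['ses'] == ses) &
              (data['prog'] == prog)]

print(subset['write'].mean())      # average writing score for that group
print(subset['write'].describe())  # fuller statistics (count, mean, std, quartiles, ...)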
