Multiple linear regression using binary, non-binary variables - python

I'm hoping to get some feedback on the most appropriate way to approach this. I have a df that contains revenue data and various related variables, and I'm hoping to determine which variables predict revenue. The variables are a mix of binary and non-binary, though.
I'll display an example df below and talk through my thinking:
import pandas as pd
d = ({
'Date' : ['01/01/18','01/01/18','01/01/18','01/01/18','02/01/18','02/01/18','02/01/18','02/01/18'],
'Country' : ['US','US','US','MX','US','US','MX','MX'],
'State' : ['CA','AZ','FL','BC','CA','CA','BC','BC'],
'Town' : ['LA','PO','MI','TJ','LA','SF','EN','TJ'],
'Occurences' : [1,5,3,4,2,5,10,2],
'Time Started' : ['12:03:00 PM','02:17:00 AM','13:20:00 PM','01:25:00 AM','08:30:00 AM','12:31:00 AM','08:35:00 AM','02:45:00 AM'],
'Medium' : [1,2,1,2,1,1,1,2],
'Revenue' : [100000,40000,500000,8000,10000,300000,80000,1000],
})
df = pd.DataFrame(data=d)
Out:
       Date Country State Town  Occurences Time Started  Medium  Revenue
0  01/01/18      US    CA   LA           1  12:03:00 PM       1   100000
1  01/01/18      US    AZ   PO           5  02:17:00 AM       2    40000
2  01/01/18      US    FL   MI           3  13:20:00 PM       1   500000
3  01/01/18      MX    BC   TJ           4  01:25:00 AM       2     8000
4  02/01/18      US    CA   LA           2  08:30:00 AM       1    10000
5  02/01/18      US    CA   SF           5  12:31:00 AM       1   300000
6  02/01/18      MX    BC   EN          10  08:35:00 AM       1    80000
7  02/01/18      MX    BC   TJ           2  02:45:00 AM       2     1000
So the specific variables that influence revenue are Medium, Time Started, and Occurrences. I also have location groups that can be used, such as Country, State, and Town.
Would a multiple linear regression be appropriate here? Should I standardise the independent variables somehow? Medium will always be either 1 or 2, but should I group Time Started and Occurrences? Times fall within a 20-hour window (8 AM - 4 AM), while occurrences fall between 1 and 10. Should these variables be assigned to dummy variables?

Some ideas: you could recode Medium as a 0/1 indicator, subtract the earliest starting time from all Time Started values and convert the result to hours, then standardize all three variables in some way and follow up with a multiple linear regression.
Before going into that kind of model, though, you could try plotting each variable against revenue and against each other, and see if there are any interesting patterns.
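To make those ideas concrete, here is a minimal sketch of the preprocessing and an OLS fit with statsmodels, assuming the example df above. The derived column names (Medium_ind, Hours, the _z columns) are made up for illustration, and malformed times would need proper cleaning first.
import pandas as pd
import statsmodels.api as sm

# Recode Medium (1/2) as a 0/1 indicator variable.
df['Medium_ind'] = (df['Medium'] == 2).astype(int)

# Parse the start times; malformed entries (e.g. '13:20:00 PM') become NaT.
t = pd.to_datetime(df['Time Started'], format='%I:%M:%S %p', errors='coerce')

# Hours elapsed since the earliest start time in the data.
df['Hours'] = (t - t.min()).dt.total_seconds() / 3600

# Standardise the continuous predictors as z-scores.
for col in ['Hours', 'Occurences']:
    df[col + '_z'] = (df[col] - df[col].mean()) / df[col].std()

# Multiple linear regression: Revenue on Medium, start time, and occurrences.
X = sm.add_constant(df[['Medium_ind', 'Hours_z', 'Occurences_z']])
model = sm.OLS(df['Revenue'], X, missing='drop').fit()
print(model.summary())
With only a handful of rows the coefficients won't mean much, but on the full data the summary table shows which predictors are associated with revenue.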

Related

How to count text event type and transform it into country-year data using pandas?

I am trying to convert a dataframe where each row is a specific event and each column has information about that event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in that year. In this data set, each event is an occurrence of terrorism, and I want to count the number of events where the "target" is a government building. One of the columns is called "targettype" or "targettype_txt", and there are 5 different entries in this column I want to count (government building, military, police, diplomatic building, etc.). The targettype is also coded as a number if that is easier (i.e. there is another column where gov't building is 2, military installation is 4, etc.).
FYI, this data set covers 16 countries in West Africa over the years 2000-2020, with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
 eventID   iyear  country_txt  nkill  nwounded  nhostages  targettype_txt
 10000102  2000   Nigeria      3      10        0          government building
 10000103  2000   Mali         1      3         15         military installation
 10000103  2000   Nigeria      15     0         0          government building
 10000103  2001   Benin        1      0         0          police
 10000103  2001   Nigeria      1      3         15         private business
 ...
And I would like it to look like this:
 country_txt  iyear  total_nkill  total_nwounded  total_nhostages  total public_target
 Nigeria      2000   200          300             300              15
 Nigeria      2001   250          450             15               17
I was able to get the totals for nkill, nwound, and nhostkid using this super simple line:
df2 = cdf.groupby(['country', 'country_txt', 'iyear'])[['nkill', 'nwound', 'nhostkid']].sum()
But this is a little different because I want to only count certain entries and sum up the total number of times they occur. Any thoughts or suggestions are really appreciated!
Try:
cdf['CountCondition'] = ((cdf['targettype_txt'] == 'government building') |
                         (cdf['targettype_txt'] == 'military installation') |
                         (cdf['targettype_txt'] == 'police'))
df2 = cdf[cdf['CountCondition']].groupby(['country', 'country_txt', 'iyear', 'CountCondition']).count()
You create a new column 'CountCondition' which simply marks whether the condition in the statement holds (True or False). Then you count the number of rows where CountCondition is True in each group. Hope this makes sense.
It is possible to combine all of this into one statement and not create an additional column, but the statement gets quite convoluted and harder to follow:
df2 = cdf[(cdf['targettype_txt']=='government building') |
(cdf['targettype_txt']=='military installation') |
(cdf['targettype_txt']=='police')].groupby(['country','country_txt', 'iyear']).count()
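As an aside, here is a minimal sketch that builds the whole country-year table in one pass: sum the numeric columns and count the matching target types by summing a boolean column. It assumes the column names used in the question's code (nkill, nwound, nhostkid, targettype_txt); the list of target types is illustrative and should be adjusted to the five categories you actually want.
import pandas as pd

# Target types that should count towards the conditional count.
public_targets = ['government building', 'military installation', 'police']

df2 = (
    cdf.assign(is_public_target=cdf['targettype_txt'].isin(public_targets))
       .groupby(['country_txt', 'iyear'], as_index=False)
       .agg(total_nkill=('nkill', 'sum'),
            total_nwounded=('nwound', 'sum'),
            total_nhostages=('nhostkid', 'sum'),
            total_public_target=('is_public_target', 'sum'))
)
print(df2.head())
Summing the boolean is_public_target column gives the number of matching events per country and year, so no separate count step is needed.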

Selecting values from dataframe based on multiple column values

I have a dataframe in this format:
   ageClass sex nationality  treatment                                 unique_id  netTime  clockTime
0        20   M         KEN  Treatment  354658649da56c20c72b6689d2b7e1b8cc334ac9     7661       7661
1        20   M         KEN  Treatment  1da607e762ac07eba6f9b5a717e9ff196d987242     7737       7737
2        20   M         KEN    Control  1de4a95cef28c290ba5790217288f510afc3b26b     7747       7747
3        30   M         KEN    Control  12215d93d2cb5b0234991a64d097955338a73dd3     7750       7750
4        30   M         KEN  Treatment  5375986567be20b49067956e989884908fb807f6     8163       8163
5        20   M         ETH  Treatment  613be609b3f4a38834c2bc35bffbdb6c47418666     7811       7811
6        20   M         KEN    Control  70fb3284d112dc27a5cad7f705b38bc91f56ecad     7853       7853
7        30   M         AUS    Control  0ea5d606a83eb68c89da0a98543b815e383835e3     7902       7902
8        20   M         BRA    Control  ecdd57df778ad901b41e79dd2713a23cb8f13860     7923       7923
9        20   M         ETH    Control  ae7fe893268d86b7a1bdb4489f9a0798797c718c     7927       7927
The objective is to determine which age class benefitted most from being in the treatment group, as measured by clockTime.
That means I need to somehow group all members of each age class under both the treatment and control conditions and take the average of their clockTimes.
Then I need to take the difference of the average clockTimes for those subgroups and compare all of these against one another.
Where I am stuck is with filtering the dataframe based on multiple columns simultaneously. I tried using groupby() as follows:
df.groupby(['ageClass','treatment'])['clockTime'].mean()
However, I was not able to then calculate the difference in the mean times from the resulting series.
How should I move forward?
You can pivot the table of group means you produced:
df2 = (df.groupby(['ageClass', 'treatment'])[['clockTime']].mean()
         .reset_index()
         .pivot(columns='ageClass', values='clockTime', index='treatment'))
ageClass 20 30
treatment
Control 7862.500000 7826.0
Treatment 7736.333333 8163.0
Then it's easy to find a difference
df2['diff'] = df2[20] - df2[30]
treatment
Control 36.500000
Treatment -426.666667
Name: diff, dtype: float64
From the groupby you've already done, you can group by index level 0, i.e. 'ageClass', and then use diff to find the difference between the averages of the treatment and control groups for each 'ageClass'. Since diff subtracts the previous row from the current one (and "Control" sorts before "Treatment" alphabetically), the result is Treatment minus Control; appending "-Control" to the treatment label makes this a bit clearer.
s = df.groupby(['ageClass','treatment'])['clockTime'].mean()
out = s.groupby(level=0).diff().dropna().reset_index()
out = out.assign(treatment=out['treatment']+'-Control')
Output:
ageClass treatment clockTime
0 20 Treatment-Control -126.166667
1 30 Treatment-Control 337.000000
From your problem description, I would prescribe ranking; differences between groups won't tell you who benefited the most.
s=df.groupby(['ageClass','treatment'])['clockTime'].agg('mean').reset_index()
s['rank']=s.groupby('ageClass')['clockTime'].rank()
ageClass treatment clockTime rank
0 20 Control 7862.500000 2.0
1 20 Treatment 7736.333333 1.0
2 30 Control 7826.000000 1.0
3 30 Treatment 8163.000000 2.0

How to access components of seasonal_decompose from statsmodels

I have two time series stored in the data frames london and scotland, of the same length and with the same columns. One column is Date, which spans 2009 to 2019 at a daily frequency; the other is Yearly_cost. They look like this:
Date Yearly_cost
0 2009-01-01 230
1 2009-01-02 460
2 2009-01-03 260
3 2009-01-04 250
4 2009-01-05 320
5 2009-01-06 430
I wish to compare the Euclidean distance between only the seasonal components of Yearly_cost in the two time series. I have decomposed them using seasonal_decompose() from statsmodels; however, I want to take only the seasonal component from the resulting object:
result = <statsmodels.tsa.seasonal.DecomposeResult at 0x2b5d7d2add8>
Is it possible to take this and turn it into a time series in a new_df?
Any help would be appreciated. Thanks
I have worked this out. To obtain the seasonal component, you simply use
new_df = result.seasonal
This gives you only the seasonal result.
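Building on that, here is a minimal sketch of how you might pull the seasonal component from both frames and compare them. The column names Date and Yearly_cost come from the example above, and period=7 is only an assumed seasonal cycle; use whatever period your data actually has.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_series(frame, period=7):
    # Return the seasonal component of Yearly_cost as a Series indexed by Date.
    s = frame.set_index('Date')['Yearly_cost']
    s.index = pd.to_datetime(s.index)
    result = seasonal_decompose(s, model='additive', period=period)
    return result.seasonal

london_seasonal = seasonal_series(london)
scotland_seasonal = seasonal_series(scotland)

# Euclidean distance between the two seasonal components
# (assumes the two series cover the same dates in the same order).
distance = np.linalg.norm(london_seasonal.values - scotland_seasonal.values)
print(distance)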

Cleaning up & filling in categorical variables for Data Science analysis

I'm taking on my very first machine learning problem, and I'm struggling with cleaning my categorical features in my dataset. My goal is to build a rock climbing recommendation system.
PROBLEM 1:
I have three related columns that contain erroneous information. The sample data below shows what it looks like now and what I want it to look like.
If you group by location_name, there are different location_id numbers and countries associated with that one name. However, there is a clear winner/clear majority for each of these discrepancies. I have a data set of 2 million entries, and the mode of location_id and location_country GIVEN the location_name overwhelmingly points to one answer (example: "300" and "USA" for clear_creek).
Using pandas/python, how do I group my dataset by the location_name, compute the mode of location_id & location_country based on that location name, and then replace the entire id & country columns with these mode calculations based on location_name to clean up my data?
I've played around with groupby, replace, and duplicated, but I think ultimately I will need to create a function that does this, and I honestly have no idea where to start (I apologize in advance for my coding naivety). I know there's got to be a solution, I just need to be pointed in the right direction.
PROBLEM 2:
Also, does anyone have suggestions for filling in the NaN values in my location_name (42,012 of 2 million) and location_country (46,890 of 2 million) columns? Is it best to keep them as an unknown value? I feel like filling in these features based on frequency would introduce a horrible bias into my data set.
data = {'index': [1,2,3,4,5,6,7,8,9],
'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
'clear_creek', 'clear_creek', 'clear_creek',
'clear_creek', 'clear_creek'],
'location_id': [100,100,0,100,300,625,300,300,300],
'location_country': ['GRC', 'GRC', 'ESP', 'GRC', 'USA', 'IRE',
'USA', 'USA', 'USA']}
df = pd.DataFrame.from_dict(data)
***looking for it to return:
improved_data = {'index': [1,2,3,4,5,6,7,8,9],
'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
'clear_creek', 'clear_creek', 'clear_creek',
'clear_creek', 'clear_creek'],
'location_id': [100,100,100,100,300,300,300,300,300],
'location_country': ['GRC', 'GRC', 'GRC', 'GRC', 'USA', 'USA',
'USA', 'USA', 'USA']}
new_df = pd.DataFrame.from_dict(improved_data)
We can use .agg in combination with pd.Series.mode and cast that back to your dataframe with map:
m1 = df.groupby('location_name')['location_id'].agg(pd.Series.mode)
m2 = df.groupby('location_name')['location_country'].agg(pd.Series.mode)
df['location_id'] = df['location_name'].map(m1)
df['location_country'] = df['location_name'].map(m2)
print(df)
index location_name location_id location_country
0 1 kalaymous 100 GRC
1 2 kalaymous 100 GRC
2 3 kalaymous 100 GRC
3 4 kalaymous 100 GRC
4 5 clear_creek 300 USA
5 6 clear_creek 300 USA
6 7 clear_creek 300 USA
7 8 clear_creek 300 USA
8 9 clear_creek 300 USA
You can use transform, taking the first mode of each group with .iat[0]:
df = (df[['location_name']]
      .join(df.groupby('location_name').transform(lambda x: x.mode().iat[0]))
      .reindex(df.columns, axis=1))
print(df)
index location_name location_id location_country
0 1 kalaymous 100 GRC
1 1 kalaymous 100 GRC
2 1 kalaymous 100 GRC
3 1 kalaymous 100 GRC
4 5 clear_creek 300 USA
5 5 clear_creek 300 USA
6 5 clear_creek 300 USA
7 5 clear_creek 300 USA
8 5 clear_creek 300 USA
As Erfan mentions, it would be helpful to see your expected output for the first question.
For the second, pandas has a fillna method you can use to fill in the NaN values. For example, to fill them with 'UNKNOWN_LOCATION' you could do the following:
df.fillna('UNKNOWN_LOCATION')
See a potential solution for the first question:
df.groupby('location_name')[['location_id', 'location_country']].apply(lambda x: x.mode())

Boxplot: Extract outliers and tag them as either '0' or '1'

I'm trying to extract outliers from my dataset and tag them accordingly.
Sample Data
   Doctor Name  Hospital Assigned        Region  Claims  Illness Claimed
1  Albert       Some hospital Center     R-1     20      Sepsis
2  Simon        Another hospital Center  R-2     21      Pneumonia
3  Alvin        ...                      ...     ...     ...
4  Robert
5  Benedict
6  Cruz
So I'm trying to group every Doctor that Claimed a certain Illness in a certain Region and trying to find outliers among them.
   Doctor Name  Hospital Assigned        Region  Claims  Illness Claimed  is_outlier
1  Albert       Some hospital Center     R-1     20      Sepsis           1
2  Simon        Another hospital Center  R-2     21      Pneumonia        0
3  Alvin        ...                      ...     ...     ...
4  Robert
5  Benedict
6  Cruz
I can do this in Power BI. But being fairly new to Python, I can't seem to figure this out.
This is what I'm trying to achieve; the algorithm goes like this:
Read data
Group data by Illness
Group by Region
get IQR based on Claims Count
if claims count > Q3 + 1.5 * IQR
then tag it as outlier = 1
else
not an outlier = 0
Export data
Any ideas?
Assuming you use pandas for data analysis (and you should!), you can use the DataFrame boxplot method to produce a plot similar to yours:
import pandas as pd
import numpy as np

df.boxplot(column=['b'], whis=[10, 90], vert=False,
           flierprops=dict(markerfacecolor='g', marker='D'))
Or, if you want to mark them as 0/1 as you requested, use the DataFrame quantile() method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html):
df.assign(outlier=df[df>=df.quantile(.9)].any(axis=1)).astype(np.int8)
a b outlier
0 1 1 0
1 2 10 0
2 3 100 1
3 4 100 1
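For completeness, here is a minimal sketch that follows the per-group IQR recipe from the question: group by illness and region, compute Q1 and Q3 of the claim counts, and flag anything above Q3 + 1.5 * IQR. The column names come from the sample data; the file names are hypothetical.
import pandas as pd

def tag_outliers(group):
    # Flag claims above Q3 + 1.5 * IQR within one (Illness, Region) group.
    q1 = group['Claims'].quantile(0.25)
    q3 = group['Claims'].quantile(0.75)
    iqr = q3 - q1
    group['is_outlier'] = (group['Claims'] > q3 + 1.5 * iqr).astype(int)
    return group

df = pd.read_csv('claims.csv')  # read data (hypothetical file name)
df = (df.groupby(['Illness Claimed', 'Region'], group_keys=False)
        .apply(tag_outliers))   # tag outliers per group
df.to_csv('claims_tagged.csv', index=False)  # export data (hypothetical file name)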
