I'm trying to analyze the data of a food ordering application. The data consist of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it, and I want to study which variables affect it the most.
An example of rows in the data is the following:
order id | branch id | date     | time placed | day     | period    | items id | no. items | total no. items | total delivery time | total time in seconds
113113   | 31        | 2/2/2021 | 13:32:24    | Tuesday | afternoon | 571      | 4         | 11              | 00:46:19            | 2805
113113   | 31        | 2/2/2021 | 13:32:24    | Tuesday | afternoon | 573      | 4         | 11              | 00:46:19            | 2805
I want to study the effects of all the variables on the total time, even items id and branch id. Does a certain item affect the time? Do the day and the period of the day affect it as well?
I used linear regression to get the correlation between the total time and the numerical variables, and tried a one-way ANOVA for some of the categorical variables, but I didn't like the results. Is there a way to analyze all the variables together without encoding the categorical ones?
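For reference, a minimal sketch of roughly what I tried, assuming the data is in a pandas DataFrame and using scipy; the file and column names are placeholders that follow the example table above:

```python
import pandas as pd
from scipy import stats

# Placeholder file name; columns follow the example table above.
df = pd.read_csv("orders.csv")

# Correlation between the numerical variables and the total time in seconds.
numeric_cols = ["no_items", "total_no_items", "total_time_seconds"]
print(df[numeric_cols].corr()["total_time_seconds"])

# One-way ANOVA: does the period of the day affect the total time?
groups = [g["total_time_seconds"].values for _, g in df.groupby("period")]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```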
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algorithms like regression love numbers. ML algorithms like classification love labels (non-numbers). You can certainly convert labeled data to 'numbered' data. One example is to code ['red','green','blue'] as [1,2,3], which produces weird things like 'red' being lower than 'blue', and averaging a 'red' and a 'blue' giving you a 'green'. Another, more subtle example can happen when you code ['low', 'medium', 'high'] as [1,2,3]. In the latter case the ordering might make sense; however, subtle inconsistencies can appear when 'medium' is not in the middle of 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, they aren't using large, medium, and small to do their analysis; they're converting those categories to numbers. I think. Maybe someone can confirm this for me.
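To make that concrete, a tiny sketch of the coding issue above, assuming pandas; the colour data is invented purely for illustration:

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red", "blue"])

# Naive integer coding imposes an order that isn't really there:
# 'red' < 'green' < 'blue', and the average of a 'red' (1) and a 'blue' (3) is a 'green' (2).
integer_coded = colors.map({"red": 1, "green": 2, "blue": 3})
print(integer_coded.mean())  # 2.0, i.e. "green", which is meaningless

# One-hot (dummy) coding avoids the artificial ordering:
one_hot = pd.get_dummies(colors, prefix="color")
print(one_hot.head())
```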
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation. Let's say you find that multiple projects starting on the second Monday of the month all get finished much faster than all other projects. That seems like pure coincidence rather than causation, or there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades rather than full-blown new undertakings, so the volume of work is less and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished faster. Tell me if I am wrong. I'm always open to feedback.
You can see a demand problem in the following picture.
My question relates to how one can/should handle fixed holidays in an LSTM model; as seen here, they contain no demand and therefore cause sudden, strong one-day deviations from the average. I am specifically not referring to the change in trend between December and January.
An ARIMA model, for example, can handle such days well.
After hours of searching the internet, all I could find was material on how to deal with a change in trend. However, that is not the case here: the trend remains the same and is only suspended for one day. I hope there is someone here who has a paper or an approach for this kind of problem.
Since the holidays have predefined dates, why not change the value of the data at those specific dates to another value that wouldn't disturb the learning much, maybe the previous one or the one after? Or you could simply remove the holiday rows from your data, and the sequence would then be unharmed by their drastic effect.
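A minimal sketch of both options, assuming the demand sits in a pandas Series indexed by date; the dates and data here are invented placeholders:

```python
import numpy as np
import pandas as pd

# Placeholder daily demand series indexed by date (replace with your real data).
idx = pd.date_range("2020-01-01", periods=730, freq="D")
demand = pd.Series(np.random.default_rng(0).poisson(100, size=len(idx)),
                   index=idx, dtype=float)

# Fixed holidays with zero demand (placeholder dates).
holidays = pd.to_datetime(["2020-12-25", "2021-01-01", "2021-12-25"])

# Option 1: replace the holiday values with the previous day's value.
smoothed = demand.copy()
smoothed[smoothed.index.isin(holidays)] = np.nan
smoothed = smoothed.ffill()

# Option 2: drop the holiday rows entirely before building the LSTM input windows.
trimmed = demand[~demand.index.isin(holidays)]
```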
I have two different Excel files. One of them contains time series data (268943 accident time rows), as below.
The other file contains the values of 14 workers, measured daily from 8:00 to 17:00 over 4 months (all data merged in one file).
I am trying to understand the correlation between the accident times and the values (hourly from 8:00 to 17:00, daily from Monday to Friday, and monthly).
Which statistical method fits (normalized auto-correlation or cross-correlation), and how can I do that?
Generally, in similar questions, the correlation analysis is performed between two time-series-based values, but I think this is a little bit different. Also, the time scales here are different.
Thanks in advance.
I think the accident times and the blood sugar levels are not coming from the same source, so I think it is not possible to draw a correlation between these two separate datasets. If you are willing to assume that the blood sugar levels of the 14 workers reflect those of the workers in the accident dataset, that is a different story. But what if those who had accidents had a significantly different blood sugar profile than the rest, and what if your tiny dataset of 14 workers does not contain such examples? I think the best you can do is to graph the blood sugar levels of your 14-worker dataset, analyze the accident dataset separately in the same way, and try to see visually whether there is any correlation.
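A minimal sketch of that separate, visual comparison, assuming pandas and matplotlib; the file and column names are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical inputs: `accidents` has one timestamp per accident,
# `readings` has a timestamp and a blood_sugar column per worker measurement.
accidents = pd.read_excel("accidents.xlsx", parse_dates=["accident_time"])
readings = pd.read_excel("workers.xlsx", parse_dates=["measured_at"])

# Accident counts per hour of day vs. mean blood sugar per hour of day.
acc_by_hour = accidents["accident_time"].dt.hour.value_counts().sort_index()
sugar_by_hour = readings.groupby(readings["measured_at"].dt.hour)["blood_sugar"].mean()

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
acc_by_hour.plot(kind="bar", ax=ax1, title="Accidents per hour of day")
sugar_by_hour.plot(kind="bar", ax=ax2, title="Mean blood sugar per hour of day")
plt.tight_layout()
plt.show()
```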
I am very inexperienced when it comes to machine learning, but I would like to learn, and in order to improve my skills I am currently trying to apply the things I have learned to one of my own research datasets.
I have a dataset with 77 rows and 308 columns. Every row corresponds to a sample. 305 of the 308 columns give information about concentrations, one column tells whether the sample belongs to group A, B, C or D, one column tells whether it is an X or Y sample, and one column tells whether the output is successful or not. I would like to determine which concentrations significantly impact the output, taking into account the variation between the groups and sample types. I have tried multiple things (feature selection, classification, etc.), but so far I do not get the desired output.
My question is therefore whether people have suggestions/tips/ideas about how I could tackle this problem, taking into account that the dataset is relatively small and that only 15 out of the 77 samples have 'not successful' as output.
Calculate the correlation of each feature with the output and sort it. After sorting, take the top 10-15 features.
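A minimal sketch of that ranking, assuming pandas and that the output has been coded as 1 = successful, 0 = not successful; the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical file: 305 concentration columns plus "group", "sample_type" and "output".
df = pd.read_csv("samples.csv")

ranking = (
    df.drop(columns=["group", "sample_type", "output"])  # keep only the concentrations
      .corrwith(df["output"].astype(float))              # correlation with the outcome
      .abs()
      .sort_values(ascending=False)
)
print(ranking.head(15))  # top 15 concentrations by absolute correlation with the output
```

With only 77 samples and 305 features, a ranking like this can easily pick up noise, so it is worth checking the selected features with cross-validation or a permutation test.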
I'm working on a model for a department store that uses data from previous purchases to predict the customer's probability of buying today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the purchase history of the customers in Q1, Q2 and Q3 2017 to predict the probability of buying in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a snapshot of how customers acted last year to predict what they will do at the end of this year (which is unknown).
Data is separated by quarter; a co-worker suggested this is not correct, because by splitting each indicator into 4 I'm unintentionally giving it greater weight, when there should only be one indicator per category.
Alternative:
Another approach that was suggested to me was to use two indicators per category, e.g. 'bought_in_category_A' and 'days_since_bought_A'. To me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what happens if the customer never bought A? I cannot use a 0, since that would imply that customers who never bought are close to customers who just bought a few days ago.
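A minimal sketch of how those two indicators could be built, assuming the raw purchases sit in a long table with customer_id, category and purchase_date columns (all names are invented):

```python
import pandas as pd

purchases = pd.read_csv("purchases.csv", parse_dates=["purchase_date"])
cutoff = pd.Timestamp("2017-09-30")  # end of Q3 2017, the observation window
history = purchases[purchases["purchase_date"] <= cutoff]

# Last purchase date per customer and category (NaT if never bought).
last_buy = history.groupby(["customer_id", "category"])["purchase_date"].max().unstack()

features = pd.DataFrame(index=last_buy.index)
for cat in ["A", "B", "C"]:
    features[f"bought_in_category_{cat}"] = last_buy[cat].notna().astype(int)
    # Leave "never bought" as NaN (or cap it at a large value) instead of 0,
    # so that "never" is not confused with "bought very recently".
    features[f"days_since_bought_{cat}"] = (cutoff - last_buy[cat]).dt.days
print(features.head())
```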
Questions:
Is this structure ok or would you structure the data in another way?
Is it ok to use information from last year in this case?
Is it ok to 'split' a categorical variable into several binary variables? Does this affect the importance given to that variable?
Unfortunately, you need a different approach in order to achieve predictive analysis.
- For example, the products' properties are unknown here (color, taste, size, seasonality, ...).
- There is no information about the customers (age, gender, living area, etc.).
- You need more "transactional" information (when, why, how did they buy, etc.).
- What is the products' "lifecycle"? Does it have to do with fashion?
- What branch are you in? (Retail, bulk, finance, clothing, ...)
- Meanwhile, have you done any campaign? How will this be measured?
I would first (if applicable) concentrate on the categories' relations and behaviour for each quarter: for example, whether n2 decreases when n1 decreases, whether Q1 is lower than Q2, or Q1/2016 vs Q2/2017.
I think you should, first of all, work this out with a business analyst in order to find out the right "rules" and approach.
I do not think you can get a concrete answer with these generic, assumed data.
Usually you need data from at least the 3-5 most recent years to do some decent predictive analysis, depending, of course, on the nature of your product.
Hope this helped a bit.
;-)
-mwk
I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).
Then we get them to answer six short questions with 5 possible values: much worse, worse, same, better, much better.
Of course, over time we will not get the same participants: some will drop out and some new ones will sign up. So I'm trying to decide how best to build a DB and code (hoping to use Python, NumPy?) to allow for ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big.
I.e.: demographic info (UID, North/South, residential/commercial), then the answers to the 6 questions for Q1.
Same for Q2 and so on; then I need to be able to slice, dice and average the values of the quarterly answers by the various demographics to see trends over time.
The averaging, grouping and so forth is modestly complicated by having different participants each quarter.
Any pointers to design patterns for this sort of DB and analysis? Is this a sparse matrix?
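To make the slicing concrete, a toy sketch of what I mean, assuming a long/tidy layout (pandas here rather than raw NumPy; all values are invented):

```python
import pandas as pd

# One row per participant, question and quarter.
responses = pd.DataFrame({
    "uid":      [1, 1, 2, 2, 3],
    "quarter":  ["2023Q1", "2023Q1", "2023Q1", "2023Q2", "2023Q2"],
    "question": ["q1", "q2", "q1", "q1", "q1"],
    "answer":   [4, 3, 5, 2, 3],   # 1 = much worse ... 5 = much better
})
demographics = pd.DataFrame({
    "uid":    [1, 2, 3],
    "region": ["North", "South", "North"],
    "zoning": ["residential", "commercial", "residential"],
})

# Join once, then slice/dice/average by any demographic and quarter;
# differing participants per quarter fall out of the grouping naturally.
merged = responses.merge(demographics, on="uid")
trend = merged.groupby(["quarter", "region", "question"])["answer"].mean().unstack("question")
print(trend)
```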
Regarding the survey analysis portion of your question, I would strongly recommend looking at the survey package in R (which includes a number of useful vignettes, including "A survey analysis example"). You can read about it in detail on the webpage "survey analysis in R". In particular, you may want to have a look at the page entitled database-backed survey objects which covers the subject of dealing with very large survey data.
You can integrate this analysis into Python with RPy2 as needed.
This is a Data Warehouse. Small, but a data warehouse.
You have a Star Schema.
You have Facts:
- The response values are the measures.
You have Dimensions:
- Time period. This has many attributes (year, quarter, month, day, week, etc.). This dimension allows you to accumulate unlimited responses to your survey.
- Question. This has some attributes. Typically your questions belong to categories or product lines or focus areas or anything else. You can have lots of question "category" columns in this dimension.
- Participant. Each participant has unique attributes and a reference to a demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.
But get Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.
Also, since you're doing Data Warehousing, look at all the [Data Warehousing] questions on Stack Overflow, and read every Data Warehousing blog you can find.
There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.
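To make the star schema concrete, a toy sketch with pandas DataFrames standing in for the warehouse tables; every name and value below is invented:

```python
import pandas as pd

# Dimensions.
dim_period = pd.DataFrame({"period_id": [1, 2], "year": [2023, 2023], "quarter": ["Q1", "Q2"]})
dim_question = pd.DataFrame({"question_id": [10, 11], "category": ["service", "pricing"]})
dim_participant = pd.DataFrame({"participant_id": [100, 101],
                                "region": ["North", "South"],
                                "zoning": ["residential", "commercial"]})

# Fact table: one row per answer, with foreign keys into the dimensions.
fact_response = pd.DataFrame({
    "period_id":      [1, 1, 2],
    "question_id":    [10, 11, 10],
    "participant_id": [100, 100, 101],
    "response_value": [4, 3, 5],   # the measure
})

# Analysis = join the fact to whichever dimensions you need, then aggregate.
joined = (fact_response
          .merge(dim_period, on="period_id")
          .merge(dim_participant, on="participant_id"))
print(joined.groupby(["quarter", "region"])["response_value"].mean())
```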
On the analysis: if your six questions have been posed in a way that would lead you to believe the answers will be correlated, consider conducting a factor analysis on the raw scores first. Often comparing the factors across regions or customer types has more statistical power than comparing across questions alone. Also, the factor scores are more likely to be normally distributed (they are a weighted sum of 6 observations), while the six questions alone would not be. This allows you to apply t-tests based on the normal distribution when comparing factor scores.
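A minimal sketch of that first step, assuming scikit-learn and randomly generated scores in place of the real answers (coded 1-5; see the caveat about ordinal coding below):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Stand-in for the real (n_respondents x 6) matrix of answers coded 1..5.
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(700, 6)).astype(float)

fa = FactorAnalysis(n_components=2, random_state=0)
factor_scores = fa.fit_transform(scores)   # one row of factor scores per respondent
print(fa.components_)                      # loadings of each question on each factor
```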
One watchout, though. If you assign numeric values to answers - 1 = much worse, 2 = worse, etc. you are implying that the distance between much worse and worse is the same as the distance between worse and same. This is generally not true - you might really have to screw up to get a vote of "much worse" while just being a passive screw up might get you a "worse" score. So the assignment of cardinal (numerics) to ordinal (ordering) has a bias of its own.
The unequal number of participants per quarter isn't a problem - there are statistical t-tests that deal with unequal sample sizes.
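For the last point, a small sketch of a t-test that does not assume equal group sizes or variances (Welch's test via scipy; the scores are simulated placeholders):

```python
import numpy as np
from scipy import stats

# Simulated factor scores for two quarters with different participant counts.
rng = np.random.default_rng(1)
q1_scores = rng.normal(0.0, 1.0, size=310)
q2_scores = rng.normal(0.2, 1.0, size=265)

# Welch's t-test handles unequal sample sizes and unequal variances.
t_stat, p_value = stats.ttest_ind(q1_scores, q2_scores, equal_var=False)
print(t_stat, p_value)
```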