I have two sets of Likert-style data on a scale from 0 to 100, where 0 is "strongly disagree" and 100 is "strongly agree". The first set consists of answers from a sample of 500 users. The second set also consists of numerical answers from the same sample of 500 users. The data sets are related in this way: the i-th user in the first set has matched with the i-th user in the second set on numerous occasions on a particular gaming platform (e.g., a party on PlayStation Network), for i = 1, ..., 500. The question asked of each user is: "Do you like dogs?" Here's an example of how the data looks:
user_1_data = [100, 60, 98, 50, 0, ..., 20, 100]
user_2_data = [50, 75, 12, ..., 100, 20]
where user_1_data[0] is the user who matched with user_2_data[0], and their responses to the question "Do you like dogs?" are 100 and 50 respectively, and so on until i = 500.
I managed to plot the actual data as the probability distribution below, where the x-axis is the rating from 0 to 100 and the y-axis is the probability of picking that particular rating.
Although the distributions look similar, I need some sort of test to show whether there is a significant difference between them. Ultimately I'd like to answer the question: does a similar distribution of answers imply that the users will play together on different occasions?
Please feel free to edit this question for formatting and to make it easier to understand.
This is a statistics question. Please use statistical terms and mathematical language if possible. I am new to data science and would love to learn how to answer my own questions in the future.
I code in Python.
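One possible starting point (a minimal sketch, not a definitive recipe) is to treat the two samples as paired: the Wilcoxon signed-rank test asks whether matched users differ systematically, the two-sample Kolmogorov-Smirnov test compares the shapes of the two distributions, and Spearman's rank correlation asks whether matched users tend to give similar ratings. All three are available in scipy; the variable names below reuse those from the example above and are assumed to hold the full 500 paired responses.

# Candidate checks for the paired dog-rating data; user_1_data and user_2_data
# are assumed to be the full length-500 lists described above.
from scipy import stats

# Paired test: H0 is that the median of the pairwise differences is zero,
# i.e. matched users do not differ systematically in their rating.
w_stat, w_p = stats.wilcoxon(user_1_data, user_2_data)

# Distribution-shape test: H0 is that both samples come from the same distribution.
ks_stat, ks_p = stats.ks_2samp(user_1_data, user_2_data)

# Association between matched pairs: do users who rate dogs highly tend to be
# matched with users who also rate dogs highly?
rho, rho_p = stats.spearmanr(user_1_data, user_2_data)

print(f"Wilcoxon signed-rank: stat={w_stat:.1f}, p={w_p:.3f}")
print(f"Two-sample KS: stat={ks_stat:.2f}, p={ks_p:.3f}")
print(f"Spearman: rho={rho:.2f}, p={rho_p:.3f}")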
Related
I have a data set. Some information about customers is kept in this data set, and its columns contain numbers.
These data contain information about the behavior of the customers in the month before leaving the system. So we know for certain that these customers left the system within a month.
An up-to-date customer behavior data set is also available. But since these customers are current, we do not know whether they will leave the system or not.
Both data sets contain the same features.
In fact, I would like to find, for each customer in the second data set, a probability of leaving the system, using what I learned from the first data set: that is, how similar they are to the customers in the first data set.
I tried many methods for this, but they did not work well, because I cannot create a data set of customers who will not leave the system. For this reason, the classic algorithms in the sklearn library (classifiers or regressions) couldn't solve my problem in practice, because I can't determine the content of the Y column precisely.
I am not sure if this question can be asked here.
But what method should I follow for such a problem? Is there an algorithm that can solve this? What kind of research should I do? Which keywords should I search for to analyze the problem?
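For searching, the usual keywords for this setting are "positive-unlabeled (PU) learning", "one-class classification", and "novelty detection". Below is a minimal sketch of the one-class direction, assuming churned_df (the first data set) and current_df (the up-to-date data set) are numeric pandas DataFrames with identical columns; the model and names are illustrative assumptions, not a definitive solution.

# Fit a one-class style model on the known churners and score how "churn-like"
# each current customer looks; higher scores mean more similar to the churn set.
# churned_df and current_df are assumed numeric DataFrames with the same columns.
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def churn_similarity_scores(churned_df: pd.DataFrame, current_df: pd.DataFrame) -> pd.Series:
    scaler = StandardScaler().fit(churned_df)
    model = IsolationForest(random_state=0).fit(scaler.transform(churned_df))
    # score_samples is higher for points that resemble the training (churn) data
    scores = model.score_samples(scaler.transform(current_df))
    return pd.Series(scores, index=current_df.index, name="churn_similarity")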
I am very inexperienced when it comes to machine learning, but I would like to learn, and in order to improve my skills I am currently trying to apply what I have learned to one of my own research data sets.
I have a dataset with 77 rows and 308 columns. Every row corresponds to a sample. 305 of the 308 columns give information about concentrations; one column tells whether the sample belongs to group A, B, C, or D; one column tells whether it is an X or Y sample; and one column tells whether the output is successful or not. I would like to determine which concentrations significantly impact the output, taking into account the variation between the groups and sample types. I have tried multiple things (feature selection, classification, etc.) but so far I do not get the desired output.
My question is therefore whether people have suggestions/tips/ideas about how I could tackle this problem, taking into account that the dataset is relatively small and that only 15 out of the 77 samples have 'not successful' as output.
Calculate the correlation of each feature with the output and sort the results. After sorting, take the top 10-15 features.
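As a concrete sketch of that suggestion, assuming the data sits in a pandas DataFrame named df, the outcome column is called "successful" and coded 0/1, and the group and sample-type columns are named "group" and "sample_type" (all of these names are assumptions):

# Rank the concentration columns by the absolute value of their correlation with
# the outcome, then keep the top 15. Column and DataFrame names are illustrative.
import pandas as pd

concentration_cols = [c for c in df.columns
                      if c not in ("group", "sample_type", "successful")]
corr_with_outcome = df[concentration_cols].corrwith(df["successful"]).abs()
top_features = corr_with_outcome.sort_values(ascending=False).head(15)
print(top_features)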
I have data regarding users' visits and posts in a discussion forum over a one-week period, and this data contains the timestamp of each activity. Based on this forum data, I tried to predict another behavior of the users (let's call it behavior X). Initial results of the regression model show that users' forum activity seems to be associated with their X behavior. Besides the cumulative features avg_visits_per_day and total_posts_whole_week, I also have features for each day (0 < a < 8): {a}_visits and {a}_posts.
Thus, I have 16 features in total, and the regression model built with these 16 features gives promising results, so it would make sense to generate more features. However, I do not know whether there is a useful feature-extraction strategy for such time-series data. I am using sklearn but did not see a method for this purpose. Any ideas or recommendations?
There are lots of options, and it's difficult to suggest which ones are more useful for predicting the unknown "X behavior". However, you could:
Manually create features representing information that's clearly available in the raw data but not present in your current feature set at all. For example, if you have not only dates but also times of activity logged, you can construct additional features for the first/last/average time of visiting within each day (maybe converted to a categorical morning/day/evening/night), the average time between visits, and so on. Day-of-week information could probably be useful as well.
Manually create relative features from the existing set: say, the visits/posts ratio for each day, the number of days since the last post, the longest period without visits, etc. (see the sketch after this list).
Use additional information if it's available: the user's browser, OS, screen resolution, post length, keywords present in their posts, the subforum a post belongs to, whether it's a new post or a follow-up, ... Once again, it's hard to tell beforehand what will be relevant.
Do automated feature extraction with a package like tsfresh or the (less automated) hctsa.
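As a small illustration of the manual relative features mentioned above, here is a sketch assuming the per-day counts live in a DataFrame df with columns named 1_visits ... 7_visits and 1_posts ... 7_posts, following the {a}_visits / {a}_posts convention from the question:

# Derive a few relative/temporal features from the per-day columns. The DataFrame
# name `df` and the exact column names are assumptions based on the question.
import numpy as np
import pandas as pd

days = range(1, 8)
visits = df[[f"{a}_visits" for a in days]]
posts = df[[f"{a}_posts" for a in days]]

# posts-per-visit ratio for each day (0 visits -> ratio of 0)
for a in days:
    df[f"{a}_posts_per_visit"] = np.where(df[f"{a}_visits"] > 0,
                                          df[f"{a}_posts"] / df[f"{a}_visits"], 0.0)

# day of the last visit within the week (0 if the user never visited)
day_index = pd.Series(list(days), index=visits.columns)
df["last_active_day"] = (visits > 0).mul(day_index).max(axis=1)

def longest_idle_streak(row):
    """Longest consecutive run of days with zero visits."""
    streak = best = 0
    for v in row:
        streak = streak + 1 if v == 0 else 0
        best = max(best, streak)
    return best

df["max_idle_streak"] = visits.apply(longest_idle_streak, axis=1)

# variability of activity across the week
df["visits_std"] = visits.std(axis=1)
df["posts_std"] = posts.std(axis=1)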
I have a pandas DataFrame whose index is unique user identifiers, whose columns correspond to unique events, and whose values are 1 (attended), 0 (did not attend), or NaN (wasn't invited / not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events, and most users were invited to at most a few tens of them.
I created some extra columns to measure "success", which I define as just the % attended relative to invites:
my_data['invited'] = my_data.count(axis=1)                        # number of non-NaN entries = number of invites
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']    # the row sum now includes 'invited', so subtract it back out
my_data['success'] = my_data['attended'] / my_data['invited']     # fraction of invites attended
Assume the following is true: the success data should be normally distributed with mean 0.80 and s.d. 0.10. When I look at the histogram of my_data['success'], it is not normal but skewed left. Whether this is true in reality is not important; I just want to solve the technical problem I pose below.
So this is my problem: there are some events which I don't think are "good" in the sense that they are making the success data diverge from normal. I'd like to do "feature selection" on my events to pick a subset of them which makes the distribution of my_data['success'] as close to normal as possible in the sense of "convergence in distribution".
I looked at the scikit-learn "feature selection" methods here and the "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.
Constraints: I need to keep at least half the original events.
Any help would be greatly appreciated. Please share as many details as you can; I am very new to these libraries and would love to see how to do this with my DataFrame.
Thanks!
EDIT: After looking some more at the scikit-learn feature selection approaches, recursive feature elimination seems like it might make sense here too, but I'm not sure how to set it up with my "accuracy" metric being "close to normally distributed with mean..."
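For reference, one possible way to frame this in code (a sketch under stated assumptions, not an established scikit-learn recipe) is greedy backward elimination over the event columns: at each step, drop the column whose removal brings the recomputed success rate closest, in one-sample Kolmogorov-Smirnov distance, to the target Normal(0.80, 0.10), and stop once half the events remain or no removal helps. The layout of my_data follows the code above; the scoring and stopping rules are assumptions.

# Greedy backward elimination of event columns, scoring candidate subsets by how
# close the recomputed success rate is to Normal(0.80, 0.10) under the one-sample
# Kolmogorov-Smirnov statistic.
from scipy import stats

def success_from(events_df):
    """Recompute % attended relative to invites from the raw 1/0/NaN event matrix."""
    invited = events_df.count(axis=1)
    attended = events_df.sum(axis=1)
    return (attended / invited).dropna()

def ks_to_target(success_series, mean=0.80, sd=0.10):
    """One-sample KS statistic of the success values against N(mean, sd)."""
    return stats.kstest(success_series, "norm", args=(mean, sd)).statistic

events = my_data.drop(columns=["invited", "attended", "success"])
keep = list(events.columns)
min_keep = len(keep) // 2                      # constraint: retain at least half the events

while len(keep) > min_keep:
    current = ks_to_target(success_from(events[keep]))
    # score every single-column removal and pick the best one
    candidates = {c: ks_to_target(success_from(events[[k for k in keep if k != c]]))
                  for c in keep}
    best_col, best_ks = min(candidates.items(), key=lambda kv: kv[1])
    if best_ks >= current:                     # no removal improves normality any further
        break
    keep.remove(best_col)

selected_events = events[keep]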
Keep in mind that feature selection selects features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So I am not sure feature selection is what you want: do I understand correctly that you want to remove the samples that cause the skew in your distribution?
Also, what about feature scaling, e.g., standardization, so that your data is rescaled to mean = 0 and sd = 1?
The equation is simply z = (x - mean) / sd
To apply it to your DataFrame, you can simply do
my_data['success'] = (my_data['success'] - my_data['success'].mean()) / my_data['success'].std()   # z-score: (x - mean) / sd
However, don't forget to keep the mean and SD parameters so you can transform your test data the same way. Alternatively, you could use the StandardScaler from scikit-learn.
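A short sketch of the StandardScaler route, which keeps the fitted mean and SD inside the scaler object for reuse on later data (the column name follows the code above, and the 'success' column is assumed to contain no NaNs):

# Standardize the success column with scikit-learn; the fitted scaler stores the
# mean and SD so the same transformation can be applied to future data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
my_data['success_z'] = scaler.fit_transform(my_data[['success']]).ravel()

# later, for new data with the same column:
# new_data['success_z'] = scaler.transform(new_data[['success']]).ravel()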
I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).
Then we get them to answer six short questions with 5 possible values: much worse, worse, same, better, much better.
Of course, over time we will not get the same participants: some will drop out and some new ones will sign up. So I'm trying to decide how best to build a database and code (I hope to use Python, maybe NumPy) to allow ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big.
I.e.: the demographic record is UID, North/South, residential/commercial, followed by the answers to the six questions for Q1.
The same for Q2 and so on. Then I need to be able to slice, dice, and average the values of the quarterly answers by the various demographics to see trends over time.
The averaging, grouping, and so forth is modestly complicated by having different participants each quarter.
Any pointers to design patterns for this sort of DB and its analysis? Is this a sparse matrix?
Regarding the survey analysis portion of your question, I would strongly recommend looking at the survey package in R (which includes a number of useful vignettes, including "A survey analysis example"). You can read about it in detail on the webpage "survey analysis in R". In particular, you may want to have a look at the page entitled database-backed survey objects which covers the subject of dealing with very large survey data.
You can integrate this analysis into Python with RPy2 as needed.
This is a Data Warehouse. Small, but a data warehouse.
You have a Star Schema.
You have Facts:
response values are the measures
You have Dimensions:
time period. This has many attributes (year, quarter, month, day, week, etc.). This dimension allows you to accumulate unlimited responses to your survey.
question. This has some attributes. Typically your questions belong to categories, product lines, focus areas, or anything else. You can have many question "category" columns in this dimension.
participant. Each participant has unique attributes and a reference to a demographic category. Your demographic category can, very simply, enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.
Get Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns. http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.
Also, since you're doing data warehousing, look at all the [Data Warehousing] questions on Stack Overflow, and read every data warehousing blog you can find.
There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.
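To make that shape concrete in Python terms (the question mentions Python/NumPy), the fact and dimension tables can be prototyped as pandas DataFrames and joined for the slicing and averaging described in the question. All table names, column names, and example values below are illustrative, not prescribed.

# A toy star-schema prototype in pandas: a fact table of responses plus
# participant and period dimensions. All names and values are illustrative.
import pandas as pd

participants = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "region": ["North", "South", "North"],
    "segment": ["residential", "commercial", "residential"],
})
periods = pd.DataFrame({
    "period_id": [1, 2],
    "quarter": ["Q1", "Q2"],
})
responses = pd.DataFrame({   # fact table: one row per participant/question/period
    "participant_id": [1, 1, 2, 3, 3],
    "period_id":      [1, 2, 1, 1, 2],
    "question_id":    [1, 1, 1, 1, 1],
    "value":          [3, 4, 2, 5, 4],   # e.g. 1 = much worse ... 5 = much better
})

# Slice and dice: average response per question by region and quarter.
joined = (responses
          .merge(participants, on="participant_id")
          .merge(periods, on="period_id"))
trend = joined.groupby(["region", "quarter", "question_id"])["value"].mean()
print(trend)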
On the analysis side, if your six questions have been posed in a way that would lead you to believe the answers will be correlated, consider conducting a factor analysis on the raw scores first. Often, comparing the factors across regions or customer types has more statistical power than comparing across questions alone. Also, the factor scores are more likely to be normally distributed (each is a weighted sum of 6 observations) while the six questions alone would not be. This allows you to apply t-tests based on the normal distribution when comparing factor scores.
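For example, if the coded responses are in an (n_participants, 6) numeric array, scikit-learn's FactorAnalysis can extract such factor scores; the array name answers and the choice of two factors below are assumptions for illustration.

# Extract factor scores from the six coded question responses. `answers` is
# assumed to be an (n_participants, 6) numeric array; n_components=2 is an
# arbitrary illustrative choice.
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=2, random_state=0)
factor_scores = fa.fit_transform(answers)   # shape: (n_participants, 2)
print(fa.components_)                       # loadings of each question on each factor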
One thing to watch out for, though: if you assign numeric values to answers (1 = much worse, 2 = worse, etc.), you are implying that the distance between "much worse" and "worse" is the same as the distance between "worse" and "same". This is generally not true: you might really have to screw up to get a vote of "much worse", while just being a passive screw-up might get you a "worse" score. So the assignment of cardinal values (numbers) to ordinal categories (orderings) has a bias of its own.
The unequal number of participants per quarter isn't a problem - there are statistical t-tests that deal with unequal sample sizes.
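For instance, Welch's t-test in scipy does not assume equal group sizes or variances; q1_scores and q2_scores below are placeholders for the factor scores of the two groups being compared.

# Compare factor scores between two groups (e.g. two quarters or two regions)
# with Welch's t-test, which tolerates unequal sample sizes and variances.
from scipy import stats

t_stat, p_value = stats.ttest_ind(q1_scores, q2_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")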