How to treat missing data involving multiple datasets - Python

I'm developing a model used to predict the probability of a client changing telephone companies based on their daily usage. My dataset has information from two weeks (14 days).
Each row of my dataset contains:
User ID, day (a number from 1 to 14), and a list of 15 more values.
The problem is that some clients don't use their telephones every day, so each client has a variable number of rows (from 1 to 14) depending on the days on which they used their telephone. We therefore have missing client-day combinations.
Removing the affected clients is not an option, since the dataset is small and that would hurt the predictive methods.
What kind of treatment could I apply to these missing day values for each client?
I've tried building a new dataset with only one entry per client: a new value quantifies the number of days of telephone usage, and the remaining values are the means of the values found across that client's days in the original dataset. This shrinks the dataset, so we end up with the same problem as simply removing the missing values.
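For reference, this is roughly what that per-client aggregation looks like in pandas (a minimal sketch; the columns calls and minutes are made-up stand-ins for the 15 daily values):

```python
import pandas as pd

# Toy example of the long format described above: one row per observed (user_id, day)
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "day":     [1, 2, 5, 3, 7],
    "calls":   [4, 6, 2, 10, 8],               # stand-ins for the 15 daily values
    "minutes": [12.0, 30.5, 5.0, 60.0, 41.0],
})

value_cols = [c for c in df.columns if c not in ("user_id", "day")]
agg = {c: "mean" for c in value_cols}           # average each value over the observed days
agg["day"] = "nunique"                          # number of distinct days with phone usage

per_client = (df.groupby("user_id")
                .agg(agg)
                .rename(columns={"day": "n_days_used"})
                .reset_index())
print(per_client)
```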
I've thought about filling in values for the missing days of each client (using interpolation methods), but that would distort the results, since the dataset would then look as if every client used their phone every day, and that would affect the predictive model.
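The interpolation variant I was considering would look something like this (again just a sketch on the same toy frame: reindex each client to the full 1-14 day range and interpolate the gaps linearly, padding at the ends):

```python
# Hypothetical reindex-and-interpolate approach, shown only to illustrate the idea;
# it makes every client look like a daily user, which is exactly the concern above
filled = (df.groupby("user_id")
            .apply(lambda g: g.drop(columns="user_id")
                              .set_index("day")
                              .reindex(pd.Index(range(1, 15), name="day"))
                              .interpolate(method="linear", limit_direction="both"))
            .reset_index())
print(filled.head(14))
```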

Related

How can I find a similarity value of a data point relative to a specific group of data?

I have a data set. It holds some information about customers, and its columns contain numbers.
These data describe the behaviour of customers during the month before they left the system, so we know for certain that these customers left the system within a month.
An up-to-date customer behaviour data set is also available, but since these customers are current, we do not know whether they will leave the system or not.
Both data sets contain the same features.
In fact, I would like to find, for each customer in the second data set, a probability of leaving the system using what I learned from the first data set, i.e. how similar they are to the customers in the first data set.
I tried many methods for this, but none of them worked well, because I cannot create a data set of customers who will not leave the system. For this reason, the algorithms in the classic sklearn library (classifier or regression algorithms) couldn't solve my problem in real life, because I can't determine the content of the Y column precisely.
I am not sure if this question can be asked here.
But what method should I follow for such a problem? Is there an algorithm that can solve it? What kind of research should I do? Which keywords should I search for to find a solution to this problem?
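One way to turn "similarity to the churn group" into a number is a nearest-neighbour distance score; below is a minimal sketch with sklearn, where X_churn and X_current are invented stand-ins for the two data sets. Note that this produces a relative similarity score, not a calibrated probability of leaving.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Invented data: X_churn are the customers known to have left,
# X_current are the up-to-date customers we want to score
rng = np.random.default_rng(0)
X_churn = rng.normal(loc=1.0, size=(200, 5))
X_current = rng.normal(loc=0.3, size=(50, 5))

# Put both sets on the same scale before measuring distances
scaler = StandardScaler().fit(X_churn)
churn_scaled = scaler.transform(X_churn)
current_scaled = scaler.transform(X_current)

# Average distance to the k nearest churned customers: smaller distance = more similar
nn = NearestNeighbors(n_neighbors=10).fit(churn_scaled)
dist, _ = nn.kneighbors(current_scaled)
similarity = 1.0 / (1.0 + dist.mean(axis=1))   # map distances into (0, 1]

print(similarity[:5])
```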

Machine learning application to column mapping

I have a dataframe with a large number of rows (several hundred thousand) and several columns that show the industry classification of a company, while the eighth column is the output and shows the company type, e.g. Corporate, Bank, Asset Manager, Government, etc.
Unfortunately the industry classification is not consistent 100% of the time and is not finite, i.e. there are too many permutations of the industry classification columns for them all to be mapped manually in one pass. If I mapped, say, 1k rows with the correct Output column, how can I use machine learning in Python to predict the Output column from that training sample? Please see the attached image, which should make this clearer.
[Image: part of the dataset]
You are trying to predict the company type based on a couple of columns? That is not really possible on its own; there are a lot of companies working on exactly that. The best you can do is to collect a lot of data from different sources, match them, and then try sklearn, probably starting with a decision tree classifier.
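A minimal sketch of that suggestion, with invented classification columns standing in for the real dataframe: a small hand-mapped sample is used to fit the tree, which then predicts Output for unmapped rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Invented stand-in for the ~1k manually mapped rows; the real frame would have
# the industry classification columns plus the hand-filled Output column
labelled = pd.DataFrame({
    "classification_1": ["Banking", "Banking", "Public",     "Funds",         "Industrial", "Public"] * 50,
    "classification_2": ["Retail",  "Invest",  "Central",    "Equity",        "Auto",       "Local"] * 50,
    "Output":           ["Bank",    "Bank",    "Government", "Asset Manager", "Corporate",  "Government"] * 50,
})
feature_cols = ["classification_1", "classification_2"]

# Trees need numeric input, so encode the categorical columns;
# categories unseen at prediction time are mapped to -1 instead of raising
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X = enc.fit_transform(labelled[feature_cols].astype(str))
y = labelled["Output"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# The fitted tree can then predict Output for the unmapped rows
new_rows = pd.DataFrame({"classification_1": ["Banking"], "classification_2": ["Invest"]})
print(clf.predict(enc.transform(new_rows[feature_cols].astype(str))))
```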

How to deal with high-cardinality categorical features and convert them to numeric for a predictive machine learning model?

I have two columns with high-cardinality categorical values: one column (area_id) has 21,878 unique values and the other (page_entry) has 800 unique values. I am building a predictive ML model to predict the hits on a webpage.
column information:
area_id: all the locations that were visited during the session (location codes for the different areas of the webpage).
page_entry: describes the landing page of the session.
How can I convert these two columns to numeric, apart from one-hot encoding?
thank you.
One approach could be to group your categorical levels into smaller buckets using business rules. In your case, for the feature area_id you could simply group values by their geographical location, say all area_ids from a single district (or any other level of aggregation) get replaced by a single id. Similarly, for page_entry you could group similar pages based on attributes such as the nature of the web page: sports, travel, etc. In this way you can significantly reduce the number of dimensions of your variables.
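A rough sketch of that bucketing idea in pandas, with an invented area_id-to-district mapping and the URL section used to group page_entry:

```python
import pandas as pd

# Invented session data with the two high-cardinality columns
df = pd.DataFrame({
    "area_id":    ["A1023", "A1024", "B2001", "B2002", "A1023"],
    "page_entry": ["/sports/live", "/sports/scores", "/travel/deals", "/news/local", "/travel/faq"],
    "hits":       [12, 7, 3, 9, 5],
})

# Business-rule grouping: map each area_id to a coarser district code
# (the mapping itself would come from domain knowledge; here it is made up)
area_to_district = {"A1023": "district_A", "A1024": "district_A",
                    "B2001": "district_B", "B2002": "district_B"}
df["district"] = df["area_id"].map(area_to_district)

# Group landing pages by their section, e.g. the first part of the URL path
df["page_section"] = df["page_entry"].str.split("/").str[1]

# The coarser columns now have few enough levels to one-hot encode cheaply
encoded = pd.get_dummies(df[["district", "page_section"]])
print(encoded.head())
```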

Determine which factors are significant, using machine learning

I am very inexperienced when it comes to machine learning, but I would like to learn, and to improve my skills I am currently trying to apply what I have learned to one of my own research data sets.
I have a dataset with 77 rows and 308 columns. Every row corresponds to a sample. 305 of the 308 columns give information about concentrations; one column tells whether the sample belongs to group A, B, C, or D; one column tells whether it is an X or Y sample; and one column tells whether the output is successful or not. I would like to determine which concentrations significantly impact the output, taking into account the variation between the groups and sample types. I have tried multiple things (feature selection, classification, etc.) but so far I do not get the desired output.
My question is therefore whether people have suggestions/tips/ideas about how I could tackle this problem, taking into account that the dataset is relatively small and that only 15 out of the 77 samples have 'not successful' as the output.
Calculate the correlation of each feature with the output and sort it. After sorting, take the top 10-15 features.
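As a sketch of that screening step, assuming a 0/1 "successful" column and the concentration columns (invented stand-in data below):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the 77 x 308 dataset: concentration columns plus a 0/1 outcome
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(77, 305)),
                  columns=[f"conc_{i}" for i in range(305)])
df["successful"] = rng.integers(0, 2, size=77)

# Correlation of each concentration with the binary outcome, sorted by strength
corrs = (df.drop(columns="successful")
           .corrwith(df["successful"])
           .abs()
           .sort_values(ascending=False))

print(corrs.head(15))   # the 15 concentrations most correlated with the output
```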

Test differently binned data sets

I am trying to test how a periodic data set behaves with respect to the same data set folded with the period (that is, the average profile). More specifically, I want to test whether the single profiles are consistent with the average one.
I am reading about a number of tests available in Python, especially the two-sample Kolmogorov-Smirnov statistic and the chi-square test.
However, my data are real data, and of course they are binned.
Therefore, as often happens, my data have gaps. This means that very often the single profiles have fewer bins than the "model" (the folded/average profile).
This means that I can't use those tests directly (because the two arrays have different numbers of elements), but I probably need to:
1) apply some transformation, or any other operation, that allows me to compare the distributions;
2) alternatively, convert the average profile into a continuous model, which would be a nice solution;
3) or proceed with different statistical instruments that I am not aware of.
But I don't know how to move on in either of the first two cases, so I would need help finding a way for (1) or (2) (perhaps both!), or a hint about the third option.
EDIT: the data are a light curve, that is, photon counts versus time.
The data come from a periodic astronomical source, i.e. they repeat their pattern (profile) every period. I can fold the data with the period and obtain an average profile, and I want to use this averaged profile as a model against which to test each single profile.
Thanks!
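One way to make option (2) above concrete is to interpolate the folded profile onto the bins actually present in a single profile and then run a chi-square test; here is a minimal sketch with scipy, using invented phase/count arrays in place of the real light curve. Gaps are handled automatically, because the model is only evaluated at the bins that exist in that profile.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import chisquare

# Invented arrays: bin centres and photon counts of one single profile (with gaps) ...
phase = np.array([0.05, 0.15, 0.25, 0.45, 0.55, 0.75, 0.95])
counts = np.array([12, 18, 25, 30, 22, 15, 11])

# ... and of the folded (average) profile, defined on the full phase grid
avg_phase = np.linspace(0.05, 0.95, 10)
avg_counts = np.array([13, 17, 24, 28, 31, 27, 21, 16, 12, 11])

# Option (2): turn the average profile into a continuous model by interpolation,
# then evaluate it only at the bins that exist in the single profile
model = interp1d(avg_phase, avg_counts, kind="cubic")
expected = model(phase)

# Rescale the expected counts so both sets have the same total,
# since scipy's chisquare requires sum(f_obs) == sum(f_exp)
expected *= counts.sum() / expected.sum()

chi2, p_value = chisquare(f_obs=counts, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```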
