Gathering data for machine learning (finance data) - python

So, a larger macro question here: I'm working on a machine learning model to predict the long-term performance of stocks using NLP of financial reports and other data from each yearly financial report. What I'm wondering about is what to do with differently sized data.
For example, I can only get data 27 years back for one company but 100 years back for another. So I guess my question is: how can I set it up so that it trains on each company as a single instance, when the ideal epoch size is not going to be constant because the amount of data per company varies?
The one thing I thought of is to standardize it: give every company, say, 300 years of data and fill in the years that don't exist with an impossible value that the model can learn represents "no data". That way, if I set the epoch size to 300, it will see one company at a time / use a sliding window over one company.
So just wondering if this is a good solution or if there are better solutions out there. Thanks!
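A minimal sketch of that padding idea, assuming each company's history is an array of per-year feature vectors and using an arbitrary sentinel of -999 (both the feature count and the sentinel are assumptions, not something the post specifies):

```python
import numpy as np

MAX_YEARS = 300     # assumed fixed history length
N_FEATURES = 8      # assumed number of per-year features
SENTINEL = -999.0   # assumed "no data" marker for the model to learn

def pad_history(history):
    """Pad a (n_years, n_features) array up to MAX_YEARS with the sentinel value."""
    history = np.asarray(history, dtype=float)
    padded = np.full((MAX_YEARS, N_FEATURES), SENTINEL)
    padded[-len(history):] = history  # keep the most recent years aligned at the end
    return padded

# toy example: one company with 27 years of data, another with 100
company_a = np.random.rand(27, N_FEATURES)
company_b = np.random.rand(100, N_FEATURES)
batch = np.stack([pad_history(company_a), pad_history(company_b)])
print(batch.shape)  # (2, 300, 8)
```

If you end up with a recurrent model in something like Keras, a masking layer over the sentinel value (rather than hoping the network learns the marker itself) is the more common way to handle the padded years.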

Related

Analyzing a Dataset in Jupyter Notebooks / Python

I have a dataset that I am trying to analyze for a project.
The first step of the project is basically to model the data, and I am running into some issues. The data is on house sales within the past 5 years and, for each buyer, covers: cost of house, income, age, year purchased, years in loan, years at current job, and whether or not the house was foreclosed on (YES or NO).
The goal is to train a model to make predictions using machine learning, but I am stuck on part 1 - describing the data. I am using Jupyter notebooks to analyze the data and trying to put together a linear or multilinear regression model, and I am failing. When I throw together a scatter plot, my data is all over the chart with no way to really "group" the data at the intersection points and fit a prediction line. This makes it difficult to figure out what is actually happening; perhaps the variables I am comparing are not correlated in any way.
The problem also comes in with the YES or NO data. I was thinking this might need to be converted into 0s and 1s, but then my linear regression model would have an incredible weight on both ends of the spectrum. Perhaps regression is not the best choice?
I'm just struggling to figure out what to do and how to do it. I am kind of new to data analysis, so perhaps I am thinking of this all wrong. If anyone has any insight it would be much appreciated.
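If the YES/NO column is the target, one common route (an assumption here, not something the post settles on) is to encode it as 0/1 and treat the problem as classification rather than linear regression, e.g. with scikit-learn's LogisticRegression. A minimal sketch, with hypothetical file and column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical file and column names; replace with the real ones in the dataset
df = pd.read_csv("house_sales.csv")
df["foreclosed"] = df["foreclosed"].map({"YES": 1, "NO": 0})

X = df[["cost_of_house", "income", "age", "years_in_loan", "years_at_job"]]
y = df["foreclosed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # classification accuracy, not R^2
```

Scatter plots of individual features against a 0/1 target will always look like two horizontal bands, which is likely why the "prediction line" approach feels off.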

Machine learning application to column mapping

I have a dataframe with a large number of rows (several hundred thousand) and several columns that show the industry classification for a company, while the eighth column is the output and shows the company type, e.g. Corporate, Bank, Asset Manager, Government, etc.
Unfortunately the industry classification is not consistent 100% of the time and is not finite, i.e. there are too many permutations of the industry classification columns to map them all manually once. If I mapped, say, 1k rows with the correct Output column, how can I employ machine learning with Python to predict the Output column based on my trained sample data? Please see the attached image, which will make it clearer.
[Image: part of the dataset]
You are trying to predict the company type based on a couple of columns? That is not possible; there are a lot of companies working on that. The best you can do is to collect a lot of data from different sources, match them, and then try sklearn, probably a decision tree classifier to start.
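A minimal sketch of that suggestion, assuming the ~1k manually mapped rows sit in a dataframe with hypothetical columns ind_1 … ind_7 (the industry classification columns) and company_type (the Output column):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

labelled = pd.read_csv("labelled_sample.csv")      # the ~1k manually mapped rows (hypothetical file)
industry_cols = [f"ind_{i}" for i in range(1, 8)]  # hypothetical column names

# one-hot encode the categorical industry columns, then fit a decision tree
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(labelled[industry_cols], labelled["company_type"])

# predict the Output column for the rest of the dataframe
unlabelled = pd.read_csv("full_dataset.csv")       # hypothetical file
unlabelled["company_type_pred"] = model.predict(unlabelled[industry_cols])
```

handle_unknown="ignore" matters here because the unlabelled rows will contain industry-classification combinations that were not in the 1k-row sample.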

Predict Sales as Counterfactual for Experiment

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for the total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model: I'm currently considering 2 options (static time frame and rolling window) - let me know what you think.
1. Static: use the same period last year as the dependent variable to build a specific model for this particular 3-month time frame. Data from the 12 months before are used to generate features.
2. Rolling window: use a rolling window logic of 3 months, dynamically creating dependent time frames and features. Not yet sure what the benefit of that would be. It uses more recent data for the model creation but feels less specific, because it uses any 3-month period in a year as the dependent variable. Not sure what the theory says for this particular example. Experiences, thoughts?
Features: I'm currently building features per customer from one year of pre-period data, e.g. sales in individual months, sales 90/180/365 days prior, max/min/avg per customer, # of months with sales, tenure, etc. This takes quite a lot of time - any libraries/packages you would recommend for this?
Modelling technique: I'm currently considering GLM, XGBoost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns within the 3-month window. As a first stab, a single number, the total predicted sales of the customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.
Thanks,
F
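A rough sketch of the kind of per-customer feature aggregation described under "Features" above, assuming a transaction-level dataframe with hypothetical columns customer_id, date, and sales (the cutoff date is also made up):

```python
import pandas as pd

tx = pd.read_csv("transactions.csv", parse_dates=["date"])  # hypothetical transaction-level data
cutoff = pd.Timestamp("2020-01-01")                          # assumed start of the 3-month target window
pre = tx[tx["date"] < cutoff].copy()
pre["days_before"] = (cutoff - pre["date"]).dt.days

def window_sales(days):
    """Total pre-period sales per customer within the last `days` days."""
    return (pre[pre["days_before"] <= days]
            .groupby("customer_id")["sales"].sum()
            .rename(f"sales_{days}d"))

features = pd.concat(
    [window_sales(d) for d in (90, 180, 365)]
    + [
        pre.groupby("customer_id")["sales"].agg(["max", "min", "mean"]),
        pre.groupby("customer_id")["date"].agg(
            months_with_sales=lambda s: s.dt.to_period("M").nunique(),
            tenure_days=lambda s: (cutoff - s.min()).days,
        ),
    ],
    axis=1,
).fillna(0)
```

For more automated generation of this kind of feature, packages like tsfresh or featuretools are worth a look, though they still need a clear customer/time structure to work from.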

How to extract useful features from time-series data (e.g., users' daily activities in a forum)

I have data regarding users' visits and postings in a discussion forum over a 1-week period, and this data contains the timestamp of each activity. Based on this forum data, I tried to predict another behavior of the users (let's say behavior X). Initial results of the regression model show that users' forum activity seems to be associated with their X behavior. Besides the cumulative features avg_visits_per_day and total_posts_whole_week, I also have features for each day (0 < a < 8): {a}_visits and {a}_posts.
Thus, I have 16 features in total, and the regression model built with these 16 features gives promising results. So it would make sense to generate more features. However, I do not know whether there is any useful feature-extraction strategy for such time-series data. I am using sklearn but did not see a method for this purpose. Any ideas or recommendations?
There are lots of options, and it's difficult to suggest which ones are more useful for predicting the unknown "X behaviour". However, you could:
* Manually create features representing information that's clearly available in the raw data but not present in your current feature set at all. For example, if you have not only dates but also times of activity logged, you can construct additional features for the first/last/average time of visiting within each day (maybe converted to a categorical morning/day/evening/night), the average time between visits, and so on. Day-of-week information could probably be useful as well.
* Manually create relative features from the existing set: say, the visits/posts ratio for each day, the number of days since the last post, the longest period without visits, etc.
* Use additional information if it's available: the user's browser, OS, screen resolution, post length, keywords present in their posts, the subforum a post belongs to, new post or follow-up, ... - once again, it's hard to tell beforehand what will be relevant.
* Do automated feature extraction with a package like tsfresh or (less automated) hctsa.
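A minimal sketch of the "relative features" idea in the second bullet, assuming a per-user activity log with hypothetical columns user_id, timestamp, and action ("visit" or "post"):

```python
import pandas as pd

log = pd.read_csv("forum_activity.csv", parse_dates=["timestamp"])  # hypothetical file
end_of_week = log["timestamp"].max()

def user_features(g):
    visits = g.loc[g["action"] == "visit", "timestamp"]
    posts = g.loc[g["action"] == "post", "timestamp"]
    gaps = visits.sort_values().diff().dropna()
    return pd.Series({
        "visit_post_ratio": len(visits) / max(len(posts), 1),
        "days_since_last_post": (end_of_week - posts.max()).days if len(posts) else 7,
        "longest_gap_hours": gaps.max().total_seconds() / 3600 if len(gaps) else 7 * 24,
        "avg_visit_hour": visits.dt.hour.mean() if len(visits) else None,
    })

features = log.groupby("user_id").apply(user_features)
```

tsfresh takes a similar long-format table (an id column, a sort column, and value columns) and generates a large battery of such features automatically.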

How to forecast in Python using machine learning, from a given set of geographical data?

I was analyzing some geographical data and attempting to predict/forecast the next occurrence of an event with respect to time and its geographical position. The data was in the following format (with sample data):
Timestamp Latitude Longitude Event
13307266 102.86400972 70.64039541 "Event A"
13311695 102.8082912 70.47394645 "Event A"
13314940 102.82240522 70.6308513 "Event A"
13318949 102.83402128 70.64103035 "Event A"
13334397 102.84726242 70.66790352 "Event A"
The first step was classifying it into 100 zones, which reduces the dimensionality and complexity.
Timestamp Zone
13307266 47
13311695 65
13314940 51
13318949 46
13334397 26
The next step was to do time-series analysis. I got stuck here for 2 months; I read a lot of literature and figured these were my options:
* ARIMA (auto-regression method)
* Machine Learning
I wanted to use machine learning to forecast with Python but couldn't really figure out how. Specifically, are there any Python libraries or open-source code specific to this use case which I can build upon?
EDIT 1:
To clarify, the data is loosely dependent on past data but, over a period of time, is uniformly distributed.
The best way to visualize the data would be to imagine N agents controlled by an algorithm which allots them the task of picking resources from grid cells. Resources are a function of the socioeconomic structure of society and are also strongly dependent on geography. It's in the interest of the "algorithm" to be able to predict demand both zone- and time-wise.
P.S.:
For auto-regressive models like ARIMA, Python already has a library: http://pypi.python.org/pypi/statsmodels .
Without example data or existing code I can't offer you anything concrete.
However, often it's helpful to re-phrase your problem in the nomenclature of the field you want to explore. In ML terms:
* Your problem's features: how your inputs are specified. Timestamp is continuous, geographic zone is discrete.
* Your problem's target label: an event, or more precisely whether or not a given event has occurred.
* Your problem is supervised: target labels for previous data are available. You have previous instances of (timestamp, geographic zone) to event mappings.
* The target label is discrete, so this is a classification problem (as opposed to a regression problem, where the output is continuous).
So I'd say you have a supervised classification problem. As an aside, you may want to do some sort of time regularisation first; I'm guessing there are going to be patterns in the events depending on what time of the day, day of the month, or month of the year it is, and you may want to represent this as an additional feature.
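A minimal sketch of deriving such time features, assuming the Timestamp column holds seconds since some epoch (an assumption; the sample values above are too small to be real Unix timestamps, so adjust the unit/origin to whatever the data actually uses):

```python
import pandas as pd

# hypothetical dataframe built from the (Timestamp, Zone) data above
df = pd.DataFrame({"timestamp": [13307266, 13311695, 13314940, 13318949, 13334397],
                   "zone": [47, 65, 51, 46, 26]})

# assumed to be seconds since an epoch; change unit/origin as needed
dt = pd.to_datetime(df["timestamp"], unit="s")
df["hour_of_day"] = dt.dt.hour
df["day_of_week"] = dt.dt.dayofweek
df["day_of_month"] = dt.dt.day
df["month"] = dt.dt.month
```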
Taking a look at one of the popular Python ML libraries available, scikit-learn, here:
http://scikit-learn.org/stable/supervised_learning.html
and consulting a recent posting on a cheatsheet for scikit-learn by one of the contributors:
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Your first good bet would be to try Support Vector Machines (SVM), and if that fails maybe give k Nearest Neighbours (kNN) a shot as well. Note that using an ensemble classifier is usually superior to using just one instance of a given SVM/kNN.
How exactly to apply SVM/kNN with time as a feature may require more research, since AFAIK (and others will probably correct me) SVM/kNN require bounded inputs with a mean of zero (or normalised to have a mean of zero). With some random Googling you may be able to find certain SVM kernels, for example a Fourier kernel, that can transform a time-series feature for you:
SVM Kernels for Time Series Analysis
http://www.stefan-rueping.de/publications/rueping-2001-a.pdf
scikit-learn handily allows you to specify a custom kernel for an SVM. See:
http://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#example-svm-plot-custom-kernel-py
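As a toy illustration of that callable-kernel API (this is not the Fourier kernel from the paper above; the kernel form, its parameters, and the labels are made up purely for the sketch):

```python
import numpy as np
from sklearn.svm import SVC

DAY = 24 * 3600.0  # assumed period of interest, in seconds

def periodic_time_kernel(X, Y):
    """Toy kernel: daily-periodic similarity on the time column times an RBF on the zone column.
    Assumes column 0 is the timestamp (seconds) and column 1 is the zone id."""
    t_diff = X[:, 0:1] - Y[:, 0:1].T
    z_diff = X[:, 1:2] - Y[:, 1:2].T
    time_sim = np.exp(-np.sin(np.pi * t_diff / DAY) ** 2)
    zone_sim = np.exp(-(z_diff ** 2) / 50.0)
    return time_sim * zone_sim  # shape (n_samples_X, n_samples_Y), as SVC expects

# toy (timestamp, zone) rows and made-up event labels
X = np.array([[13307266, 47], [13311695, 65], [13314940, 51], [13318949, 46]], dtype=float)
y = np.array([0, 1, 0, 1])

clf = SVC(kernel=periodic_time_kernel).fit(X, y)
print(clf.predict(X))
```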
With your knowledge of ML nomenclature, and example data in hand, you may want to consider posting the question to Cross Validated, the statistics Stack Exchange.
EDIT 1: Thinking about this problem more, you really need to understand whether your features and corresponding labels are independent and identically distributed (IID) or not. For example, what if you were modelling how forest fires spread over time? It's clear that the likelihood that a given zone catches fire is contingent on whether or not its neighbours are on fire. AFAIK SVM and kNN assume the data is IID. At this point I'm starting to get out of my depth, but I think you should at least give several ML methods a shot and see what happens! Remember to cross-validate! (scikit-learn does this for you).
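A minimal sketch of that comparison with feature scaling and cross-validation, using scikit-learn pipelines (the feature matrix is assumed to be the timestamp-derived features plus the zone id from earlier; random toy data is used here just so the snippet runs):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: engineered features (e.g. hour_of_day, day_of_week, zone); y: event labels
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.integers(0, 2, size=200)

for name, model in [
    ("SVM", make_pipeline(StandardScaler(), SVC())),
    ("kNN", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(name, scores.mean())
```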
