I am trying to run a logistic regression model on a very large dataset with 2.3 billion observations in Python. I need a standard regression output. Statsmodels with parquet seemed promising:
https://www.statsmodels.org/v0.13.0/large_data.html
import pyarrow as pa
import pyarrow.parquet as pq
import statsmodels.formula.api as smf

class DataSet(dict):
    def __init__(self, path):
        self.parquet = pq.ParquetFile(path)

    def __getitem__(self, key):
        try:
            return self.parquet.read([key]).to_pandas()[key]
        except:
            raise KeyError

LargeData = DataSet('LargeData.parquet')

res = smf.ols('Profit ~ Sugar + Power + Women', data=LargeData).fit()
However, the documentation also says: "Additionally, you can add code to this example DataSet object to return only a subset of the rows until you have built a good model. Then, you can refit your final model on more data."
This is what I tried all day and could not get to work. I am not very familiar with Python classes or with how to iterate row-group-wise through a parquet file.
I am sure it's only a few lines of code; could anyone help me out?
P.S.: Ideally, of course, I need the combination of the distributed model and the subsetted data. But I would already be happy to get the subsetting to work without running out of memory. Thanks!
You can read a "subset" of your parquet file by using row groups:
import random

row_group = random.randrange(self.parquet.num_row_groups)
return self.parquet.read_row_group(row_group).to_pandas()
I'm not sure how exactly you'd do that in your case, maybe by selecting an arbitrary / random set of available row groups.
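To make the iteration concrete, here is a minimal sketch of how the pieces could fit together: a DataSet variant whose column lookups read only one randomly chosen row group instead of the whole file. This is not part of the statsmodels example; the class name RowGroupDataSet and the choice to pin a single row group per instance are assumptions for illustration.
import random

import pyarrow.parquet as pq


class RowGroupDataSet(dict):
    """Serves columns from one randomly chosen row group instead of the full file."""

    def __init__(self, path):
        self.parquet = pq.ParquetFile(path)
        # Pick the row group once so every column comes from the same rows.
        self.row_group = random.randrange(self.parquet.num_row_groups)

    def __getitem__(self, key):
        try:
            # Read only the requested column from the chosen row group.
            table = self.parquet.read_row_group(self.row_group, columns=[key])
            return table.to_pandas()[key]
        except Exception:
            raise KeyError(key)
Used in place of the original DataSet, smf.ols(...).fit() would then only pull the columns it needs from that single row group. Looping i over range(self.parquet.num_row_groups) and calling read_row_group(i) is the building block you could use to feed the distributed model row group by row group.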
References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM.
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I'm importing pandas is for this line: pd.options.mode.chained_assignment = None
This helps with dask erroring when I'm using dask.distributed.
So, I have a 21 GB CSV file that I am reading using dask and a Jupyter notebook...
I've tried to read it from my MySQL database table; however, the kernel eventually crashes.
I've tried multiple combinations of using my local network of workers, threads, and available memory, available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas); however, even with chunking, the kernel still crashes...
I can now load the CSV with dask and apply a few transformations, such as setting the index and adding the columns (names) that fbprophet requires... but I am still not able to compute the dataframe with df.compute(), and I think this is why I am receiving the error I am with fbprophet. After I have added the columns y and ds with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected that it may be trying to load the data twice; however, this still failed, so most likely this wasn't the case / isn't causing the problem.
Alongside this, fbprophet is also mentioned in the documentation of dask for applying machine learning to dataframes; however, I really don't understand why this isn't working... I've also tried modin with ray, and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this error when assigning the client, reading the CSV file, and applying operations/transformations to the dataframe; however, the allotted size is larger than the CSV file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling, of course; did not find anything :-/
- Asking a Discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE); every step in the pipeline will be delayed.
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First, write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator:
- it might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
- this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, this requires __name__ == "__main__"):
- the output of the pipeline (a forecast by fbprophet) is stored in the variable result, which is delayed
- when this output is computed, it will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"
    generate_csv(num_rows_data, csv_fname)
    client = Client()  # modify this as required
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()
    client.close()
    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981
I am new to Python and machine learning. I have this data file on which I want to apply binary classification, but I am unable to guess its format and load it in Python. Can someone help me out here?
In the dataset, the first column is the class, and there are 100 features. I am using pandas IO to load it and tried read_csv, but it's not working! And it's definitely not JSON. (I have used only these formats till now, so pardon me in advance if it is some well-known format!)
You can try sklearn.datasets.load_svmlight_file to read the file.
Here's an example from the documentation link on how to use the method:
from sklearn.externals.joblib import Memory
from sklearn.datasets import load_svmlight_file

mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("mysvmlightfile")
    return data[0], data[1]

X, y = get_data()
It's a plain text file. Looking at the first row, it appears to be in libsvm format.
See this for a reference.
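For context, a line in the svmlight/libsvm format stores the label first and then sparse index:value feature pairs, which matches the "class in the first column, 100 features" description. Here is a minimal sketch of loading such a file; the file name is a placeholder, and n_features=100 is taken from the question:
from sklearn.datasets import load_svmlight_file

# A typical svmlight/libsvm line looks like:
#   1 3:0.25 17:1.0 99:0.5
# i.e. <label> <feature_index>:<value> ...
X, y = load_svmlight_file("your_data_file", n_features=100)
print(X.shape, y.shape)  # X is a sparse matrix, y is a dense array of labels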
I am trying to load a .mat file for the Street View House Numbers (SVHN) dataset http://ufldl.stanford.edu/housenumbers/ in Python with the following code:
import h5py
labels_file = './sv/train/digitStruct.mat'
f = h5py.File(labels_file)
struct= f.values()
names = struct[1].values()
print(names[1][1].value)
I get [<HDF5 object reference>], but I need the actual string.
To get an idea of the data layout, you could run
h5dump ./sv/train/digitStruct.mat
but there are also h5py methods like visit or visititems.
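For instance, here is a minimal sketch of inspecting the layout with visititems (the file path is the one from the question):
import h5py

# Walk every group and dataset in the file and print its name and object type.
with h5py.File('./sv/train/digitStruct.mat', 'r') as f:
    f.visititems(lambda name, obj: print(name, obj))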
A good reference that can help you and that seems to have already addressed a very similar problem (if not the same) recently is the following SO post:
h5py, access data in Datasets in SVHN
For example the snippet:
import h5py
import numpy

def get_name(index, hdf5_data):
    name = hdf5_data['/digitStruct/name']
    print(''.join([chr(v[0]) for v in hdf5_data[name[index][0]].value]))

labels_file = 'train/digitStruct.mat'
f = h5py.File(labels_file)
for j in range(33402):
    get_name(j, f)
will print the name of the files. I get for example:
7459.png
7460.png
7461.png
7462.png
7463.png
7464.png
7465.png
You can generalize from here.
In the scikit-learn Python library there are many datasets that can be accessed easily by the following commands:
for example to load the iris dataset:
iris=datasets.load_iris()
And we can now assign data and target/label variables as follows:
X=iris.data # assigns feature dataset to X
Y=iris.target # assigns labels to Y
My question is how to create my own data dictionary from my own data, whether in CSV, XML, or any other format, into something similar to the above, so the data can be called easily and the features/labels are easily accessed.
Is this possible? Someone help me!!
By the way, I am using the Spyder (Anaconda) platform by Continuum.
Thanks!
I see at least two (easy) solutions to your problem.
First, you can store your data in whichever structure you like.
# Storing in a list
my_list = []
my_list.append(iris.data)
my_list[0] # your data
# Storing in a dictionary
my_dict = {}
my_dict["data"] = iris.data
my_dict["data"] # your data
Or, you can create your own class:
class MyStructure:
    def __init__(self, data, target):
        self.data = data
        self.target = target

my_class = MyStructure(iris.data, iris.target)
my_class.data  # your data
Hope it helps
If ALL you want to do is read data from CSV files and have it organized, I would recommend simply using either pandas or numpy's genfromtxt function.
mydata=numpy.genfromtxt(filepath,*params)
If the CSV is formatted regularly, you can extract for example the names of each column by specifying:
mydata=numpy.genfromtxt(filepath,unpack=True,names=True,delimiter=',')
then you can access any column's data you want by simply typing its name/header:
mydata['your header']
(Pandas also has a similar convenient way of grabbing data in an organized manner from CSV or similar files.)
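For example, here is a minimal pandas sketch; the file name and the "label" column name are assumptions about your CSV:
import pandas as pd

# Assumes a CSV with a header row and a column named "label" holding the target.
df = pd.read_csv("my_data.csv")
X = df.drop(columns=["label"]).values  # feature matrix, like iris.data
y = df["label"].values                 # targets, like iris.target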
However, if you want to do it the long way and learn:
Simply put, you want to write a class for the data that you are using, complete with its own access, modify, read, #dosomething functions. Instead of code for this, I think you would benefit more from reading, for example, the iris class, or an introduction to a simple class from any beginner's guide to object-oriented programming.
To do what you want, for an object MyData, you could have for example
read(#file) function that reads from a given file of some expected format and returns some specified structure. For reading from csv files, you can simply use numpy's loadtxt method.
modify(#some attribute)
etc.
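As a rough sketch of that idea (the CSV layout, with the label in the first column and a single header row, is an assumption):
import numpy as np

class MyData:
    """Loads a CSV and exposes .data and .target like the sklearn datasets."""
    def __init__(self, filepath):
        raw = np.loadtxt(filepath, delimiter=",", skiprows=1)  # skip the header row
        self.target = raw[:, 0]   # first column assumed to be the label
        self.data = raw[:, 1:]    # remaining columns are the features

my_data = MyData("my_data.csv")  # hypothetical file
X, y = my_data.data, my_data.target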
I've written a Python script to take a large file (a matrix of ~50k rows x ~500 cols) and use it as a dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but the file load takes ~45 seconds, and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file-loading code:
import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return (train_id_list, train_val_array)
This returns the column-9 labels as a list and the column 12-to-end metadata as a numpy array to train the random forest.
I am going to train many different forms of my model with the same data, so I just want to load the file once and have it available to feed into my random forest function. I want the file to be an object, I think (I am fairly new to Python).
If I understand you correctly, the data set does not change but the model parameters do change and you are changing the parameters after each run.
I would put the file load script in one file, and run this in the python interpreter. Then the file will load and be saved in memory with whatever variable you use.
Then you can import another file with your model code, and run that with the training data as argument.
If all your model changes can be determined as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.
If you need to change the model code between runs, save with a new filename and import that one, run again and send the source data to that one.
If you don't want to save each model modification with a new filename, you might be able to use the reload functionality depending on python version, but it is not recommended (see Proper way to reload a python module from the console)
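To make that workflow concrete, here is a minimal sketch under stated assumptions: the module name my_model, its function train_model(...), and the loader module load_data are hypothetical placeholders, not code from the question.
import importlib

import my_model                          # hypothetical module with the training code
from load_data import load_train_data    # hypothetical module holding the loader above

# Load once; the arrays stay in memory for the whole interpreter session.
train_ids, train_vals = load_train_data("train_data.tsv")

# Train as many variants as you like without re-reading the file.
model_a = my_model.train_model(train_ids, train_vals, n_estimators=100)
model_b = my_model.train_model(train_ids, train_vals, n_estimators=500)

# If you edit my_model.py between runs, pick up the changes like this
# (Python 3; as noted above, use reload with care):
importlib.reload(my_model)
model_c = my_model.train_model(train_ids, train_vals, n_estimators=500)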
Simplest way would be to cache the results, like so:
_train_data_cache = {}

def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
Try learning about Python data serialization. You would basically store the large file's parsed contents as a Python-specific serialized binary object using Python's marshal module. This would drastically speed up IO for the file; see these benchmarks for performance variations. However, if these random forest models are all trained at the same time, then you could just train them against the dataset you already have in memory and then release the training data after completion.
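As a hedged illustration of that idea, here is a minimal sketch that caches the parsed data with pickle (chosen here because it handles numpy arrays; the cache file name is just a placeholder):
import os
import pickle

def load_train_data_serialized(train_file, cache_file="train_data.pkl"):
    # Load from the pickled cache if present; otherwise parse the text file and cache it.
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as fh:
            return pickle.load(fh)
    data = load_train_data(train_file)  # the slow text parse from the question
    with open(cache_file, "wb") as fh:
        pickle.dump(data, fh)
    return data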
Load your data in ipython.
my_data = open("data.txt")
Write your code in a Python script, say example.py, which uses this data. At the top of the script example.py, add these lines:
import sys
args = sys.argv
data = args[1]
...
Now run the Python script in IPython:
%run example.py $my_data
Now, when running your Python script, you don't need to load the data multiple times.