I am trying to load a dataset from a Hugging Face organization, but I am getting the following error:
ValueError: Couldn't cast string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 686
to
{'text': Value(dtype='string', id=None)}
because column names don't match
I used the following lines of code to load the dataset:
from datasets import load_dataset
dataset = load_dataset("datasetFile", use_auth_token=True)
Please note the datasets library version is 2.0.0; I changed it to 1.18.2 but it did not work.
Is there any way to fix this error?
I solved this error by streaming the dataset.
from datasets import load_dataset
dataset = load_dataset("datasetFile", use_auth_token=True, streaming=True)
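Note that with streaming=True the result is an IterableDataset, so it is consumed by iteration rather than indexing. A minimal usage sketch, keeping the placeholder repo name from the question and assuming it has a "train" split:
from datasets import load_dataset

dataset = load_dataset("datasetFile", use_auth_token=True, streaming=True)
# streamed splits are iterated lazily; no cast to a fixed schema happens up front
for example in dataset["train"]:
    print(example)
    break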
According to https://github.com/huggingface/datasets/issues/3700#issuecomment-1035400186, you actually want to use load_from_disk:
from datasets import load_from_disk
dataset = load_from_disk("datasetFile")
Related
I am trying to run a logistic regression model on a very large dataset with 2.3 billion observations in Python. I need a standard regression output. Statsmodels with parquet seemed promising:
https://www.statsmodels.org/v0.13.0/large_data.html
import pyarrow as pa
import pyarrow.parquet as pq
import statsmodels.formula.api as smf
class DataSet(dict):
    def __init__(self, path):
        self.parquet = pq.ParquetFile(path)

    def __getitem__(self, key):
        try:
            return self.parquet.read([key]).to_pandas()[key]
        except:
            raise KeyError

LargeData = DataSet('LargeData.parquet')

res = smf.ols('Profit ~ Sugar + Power + Women', data=LargeData).fit()
However, the statsmodels page says: "Additionally, you can add code to this example DataSet object to return only a subset of the rows until you have built a good model. Then, you can refit your final model on more data."
This is what I tried all day and could not get to work. I am not super familiar with Python classes or with how to iterate row-group-wise through a parquet file.
I am sure it's only a few lines of code; could anyone help me out?
P.S.: Ideally, of course, I need the combination of the distributed model and the subsetted data. But I would already be happy to get the subsetting to work without running out of memory. Thanks!
You can read a "subset" of your parquet file by using row groups:
import random

# inside DataSet.__getitem__: read one randomly chosen row group instead of the whole file
row_group = random.randrange(self.parquet.num_row_groups)
return self.parquet.read_row_group(row_group).to_pandas()[key]
I'm not sure how exactly you'd do that in your case, maybe by selecting an arbitrary / random set of available row groups.
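For what it's worth, here is a minimal sketch of how the two pieces could fit together: a DataSet-style wrapper whose column lookups read only one randomly chosen row group. The class name SampledDataSet and the choice to fix the row group once in __init__ (so every column comes from the same rows) are my additions, not part of the statsmodels example:
import random

import pyarrow.parquet as pq
import statsmodels.formula.api as smf

class SampledDataSet(dict):
    def __init__(self, path):
        self.parquet = pq.ParquetFile(path)
        # pick one row group up front so all columns are read from the same rows
        self.row_group = random.randrange(self.parquet.num_row_groups)

    def __getitem__(self, key):
        try:
            # read only the requested column from the chosen row group
            table = self.parquet.read_row_group(self.row_group, columns=[key])
            return table.to_pandas()[key]
        except Exception:
            raise KeyError(key)

sample = SampledDataSet('LargeData.parquet')
res = smf.ols('Profit ~ Sugar + Power + Women', data=sample).fit()
print(res.summary())
Once that fits in memory, you could refit on more (or all) row groups for the final model, as the statsmodels page suggests.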
I'm trying to follow this notebook but I get stuck at loading my SQuAD dataset.
import transformers
from datasets import load_dataset, load_metric
dataset = load_dataset('json', data_files={'train': 'squad/nl_squad_train_clean.json', 'test': 'squad/nl_squad_train_clean.json'}, field='data')
This gives the following error: ArrowInvalid: JSON parse error: Column(/paragraphs/[]/qas/[]/answers/[]) changed from object to array in row 0.
Does anyone know how to fix this? If needed I can post the complete stack trace.
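In case it helps, one workaround I would try is flattening the nested SQuAD layout (data -> paragraphs -> qas -> answers) into one record per question before handing it to datasets, so Arrow never has to infer a schema for the nested answers field. This is only a sketch: the helper name flatten_squad is mine, and it assumes the file follows the standard SQuAD v1-style structure.
import json
from datasets import Dataset

def flatten_squad(path):
    # one flat record per question; "answers" becomes two parallel lists
    with open(path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    records = []
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                records.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    return records

train_dataset = Dataset.from_list(flatten_squad("squad/nl_squad_train_clean.json"))
Dataset.from_list needs a fairly recent datasets release; on older versions, Dataset.from_pandas on a flattened DataFrame works similarly.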
I'm trying to build a recommendation system using TensorFlow recommenders (https://www.tensorflow.org/recommenders/examples/quickstart)
In their quickstart they load the data like this:
ratings = tfds.load('movielens/100k-ratings', split="train")
I have a .csv file; how do I put it in the same format as the data they're passing in?
I would also like to use .map on the tf.data dataset, e.g.:
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})
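One common way to do this (a sketch, assuming the CSV has userId and itemId columns matching the mapping above; adjust the names and path to your file) is to read it with pandas and build a dataset of per-row dictionaries, which mirrors what tfds.load returns:
import pandas as pd
import tensorflow as tf

df = pd.read_csv("ratings.csv")  # hypothetical path and column names

# from_tensor_slices on a dict of columns yields one {"column": value, ...} dict per row,
# the same element structure as the movielens dataset in the quickstart
ratings = tf.data.Dataset.from_tensor_slices(dict(df))

# the .map call from the question then works unchanged
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})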
I created a simple fbprophet model with the airpassengers data:
import pandas as pd
import pickle
from fbprophet import Prophet
import sys
df = pd.read_csv("airline-passengers.csv")
# preprocess columns as fbprophet expects it
df.rename(columns={"Month": "ds", "Passengers": "y"}, inplace=True)
df["ds"] = pd.to_datetime(df["ds"])
m = Prophet()
m.fit(df)
However, when I save the object m:
with open("p_model", "wb") as f:
    pickle.dump(m, f)
the file takes more than 1 MB on my hard drive. The object m itself seems to be rather small, as sys.getsizeof(m) returns 56.
Why is the pickle file so large? Is there a suitable alternative for saving the object for later reuse? Thanks in advance.
I think that it pickles training data also, so try not to save model.history and it should be fine.
Here is nice discussion: https://github.com/facebook/prophet/issues/1159
Thanks to the link from @Kohelet, I found the solution: it was the stan_backend attribute:
m.stan_backend = None
This reduced the file size on disk to around 18 KB.
I am still wondering why this is not visible when invoking sys.getsizeof(m).
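As a side note, sys.getsizeof only reports the shallow size of the object itself, not the objects its attributes reference, which is why it shows 56 bytes no matter what is attached to m. A rough way to see which attributes actually dominate the pickle (attribute names depend on the fbprophet version) is to pickle them one by one:
import pickle
import sys

print(sys.getsizeof(m))  # shallow size only; history, stan_backend, etc. are not counted

# pickle each attribute separately to see where the megabytes come from
for name, value in vars(m).items():
    try:
        print(name, len(pickle.dumps(value)))
    except Exception as exc:
        print(name, "not picklable:", exc)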
I wanted to load a CSV file with a target column and 25 feature columns. I loaded it via pd.read_csv() as a pandas DataFrame:
import pandas as pd
import tensorflow as tf
data = pd.read_csv("./data.csv")
data = data.astype('float64')
data.shape #returns (6500, 26)
y_train = data.pop('target')
y_train.shape #returns (6500,)
Then I used the standard TensorFlow 2.0 procedure to read values from a pandas DataFrame:
dataset = tf.data.Dataset.from_tensor_slices((data.values, y_train.values))
As the docs say, I have to load features and targets separately from the TensorSliceDataset. But as soon as I run the for loop, it freezes for a couple of seconds and then the kernel suddenly dies without any specific reason.
for feat, targ in dataset.take(1):
    print('Features: {}, Target: {}'.format(feat, targ))
I have tried to run the code without the for loop but the same thing happens with:
tf.constant(data['feature-1'])
I don't know what is causing this problem. I have also reinstalled pandas.