I want to create sequences from my dataset. However, TensorFlow only provides the function:
tf.parse_single_example()
I tried to work around this by using tf.py_func and something like this:
dataset.map(lambda x: tf.py_func(_parse_tf_record, [x, sequence_length], ...))

def _parse_tf_record(x, sequence_length):
    for sequence_id in range(0, sequence_length):
        filename = x
        # files only contain one record
        for record in tf.python_io.tf_record_iterator(filename, options):
            ...
            tf.parse_single_example(record, ...)
            ...
            break  # only one sample per file
So for every map call I read sequence_length files. However, this cannot be done in parallel, since tf.py_func does not allow for it.
A TensorFlow Example is a single conceptual unit and it should be independent from the other examples (so that batching and shuffling work properly).
If you want more data to be grouped together, you should write it as a single example.
To make things easier there is tf.train.SequenceExample, which works with tf.parse_single_sequence_example. It has a context part that is common to all entries in the sequence and a sequence part that is repeated for every step. This is commonly used when working with recurrent networks (LSTM and alike), but you can use it whenever it makes sense in your context.
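For illustration, here is a minimal sketch (TF 1.x API; the feature names "label" and "tokens" are made up for this example) of how a SequenceExample could be built and parsed back:

import tensorflow as tf

def make_sequence_example(label, token_ids):
    ex = tf.train.SequenceExample()
    # context: features shared by the whole sequence
    ex.context.feature["label"].int64_list.value.append(label)
    # feature_lists: one entry per step of the sequence
    tokens = ex.feature_lists.feature_list["tokens"]
    for t in token_ids:
        tokens.feature.add().int64_list.value.append(t)
    return ex

def parse_fn(serialized):
    context_features = {"label": tf.FixedLenFeature([], dtype=tf.int64)}
    sequence_features = {"tokens": tf.FixedLenSequenceFeature([], dtype=tf.int64)}
    context, sequence = tf.parse_single_sequence_example(
        serialized,
        context_features=context_features,
        sequence_features=sequence_features)
    return context["label"], sequence["tokens"]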
Related
I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
print (dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop based on this recommendation, which gave me the error that the BatchDataset was not subscriptable.
Then I tried to use range and the len function on the dict_slices, so that i would be an integer index and not a slice, which gave me the following error (as I understand it, because dict_slices is still an array, and each iteration yields one vector of the array, not one index of the vector):
Refer here for the solution. To summarize, we need to use as_numpy_iterator:
example = list(dict_slices.as_numpy_iterator())
example[0]['age']
BatchDataset is a tf.data.Dataset instance that has been batched by calling its .batch(..) method. You cannot "index" a TensorFlow Dataset or call the len function on it. I suggest iterating through it like you did in the first code snippet.
However, in your dataset you are using .to_dict('list'), which means that each key in your dictionary is mapped to a list as its value. Basically you have "columns" for every key rather than rows; is this what you want? It would make printing line by line (as shown in the table-printing example you linked) a lot more difficult, since you do not have different features in a row. It is also different from the example in the official TensorFlow code, where a datapoint consists of multiple features, not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values)+['target']
# batch = 1, because otherwise you will get multiple dict_slice/target pairs in one iteration below!
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1)
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
print(*dict_slice.numpy(), target.numpy(), sep='\t')
This needs a bit of formatting, because column widths are not equal.
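One possible tweak (a sketch, reusing the same columns and dict_slices as above) is to pad each value to a fixed width instead of separating with tabs:

col_width = max(len(c) for c in columns) + 2
print(''.join(c.ljust(col_width) for c in columns))
for dict_slice, target in dict_slices.take(1):
    # batch size is 1, so take the single row out of each batch
    row = list(dict_slice.numpy()[0]) + [target.numpy()[0]]
    print(''.join(str(v).ljust(col_width) for v in row))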
I have a large dataset with a defined column structure, for which I have built a script/pipeline that, roughly, first ingests the data (formatting, cleaning, etc.) and second transforms values and creates a new column with these transformed values (the final result), more or less like this:
1. Imports the CSV into a pandas DataFrame, fills NaNs, cleans some values in some columns, homogenizes text, names, etc.
1.1. Creates a new column (cleaned names)
2. Transforms/converts values in another column via lookups in dictionaries, groupbys, etc.
2.1. Creates one new column (transformed values)
My script is divided into two files (~150 lines of code each) and is composed of many method calls: .where, .replace, .map, .apply, etc. Given that pandas allows for method chaining and is very flexible, the dataset can be processed without defining any function (except a few for df.apply(func)). My code gets the CSV into a DataFrame and starts processing it with the mentioned methods .where, .replace, .map, .apply, etc. without using any standalone functions or the .pipe method. My project tree looks like:
/project
table.csv
ingest.py (outputs a clean intermediate_table.csv)
transform.py (reads the previous intermediate_table.csv and outputs a final_table.csv)
final_table.csv
The thing is, I need to send this code over to other people who will run my script on more datasets, so I will need to comment and test it. Given the above, here are my questions in terms of code structure.
Should I have a function for each of the steps above?
If so, with what granularity?
E.g., should I have multiple functions like below?
df = pd.read_csv('file.csv')

def uppercase_column_A(dataframe, col): ...
def clean_column(dataframe, col): ...
def calculate_mean_here(dataframe, col): ...
def transform_values_there(dataframe, col): ...

(df
    .pipe(uppercase_column_A)
    .pipe(clean_column)
    .pipe(calculate_mean_here)
    .pipe(transform_values_there)
    .pipe(etc)
)
Or, maybe, just two big functions?
df = pd.read_csv('file.csv')

def ingest(df): ...  # returns intermediate_df
def transform(intermediate_df): ...

(df
    .pipe(ingest)
    .pipe(transform)
)
Do I actually need to use .pipe at all?
Should I use classes? Separate the code into modules?
I know the question is broad but I think common practices are important as well as the code itself. In academia (my background), this does not matter much as there is not a 'production' side. So, in general, what would be a recommended industry-way of building data pipelines in terms of code/structure?
In my experience, using smaller functions is better for maintenance, since errors are easier to trace the fewer levels of abstraction there are (which having two big functions works against); a small sketch of this style appears after the folder layout below.
My personal suggestion:
Add as many comments as you can. Above functions, above variable names, below a function call, etc...
Be as descriptive as you can with naming: calculate_mean_of_columns instead of calc_mean_cols, for example. Avoid, as much as you can, using abbreviations (even standard abbreviations in the DS community) like df or cols.
I'd structure my folders differently, honestly. My typical pipelines have had a consistent structure like this:
/project
/code
code_to_transform_dataframe.py
/data
datetimestamp_filename.csv
/output
datetimestamp_output.csv
You can use this as a framework for your own use case; it is the structure that has worked for me at a couple of different companies.
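As a rough sketch of the "many small, well-named functions" style chained with .pipe (the column names and file names below are placeholders, not taken from your project):

import pandas as pd

def fill_missing_values(dataframe, column_name, fill_value=""):
    # return a copy with NaNs in the given column replaced by fill_value
    result = dataframe.copy()
    result[column_name] = result[column_name].fillna(fill_value)
    return result

def uppercase_column(dataframe, column_name):
    # return a copy with the given column upper-cased
    result = dataframe.copy()
    result[column_name] = result[column_name].str.upper()
    return result

dataframe = pd.read_csv("table.csv")
cleaned_dataframe = (
    dataframe
    .pipe(fill_missing_values, column_name="name")
    .pipe(uppercase_column, column_name="name")
)
cleaned_dataframe.to_csv("intermediate_table.csv", index=False)

Each step is small enough to test on its own, and the chain reads like a description of the pipeline.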
I have an HDF5 dataset and I'm using a framework which creates multiple processes to read from it (PyTorch's DataLoader, but the specific framework shouldn't matter). I'm indexing the first dimension of a 3D float array randomly, and to debug what was going on, I have been summing the slice from the indexing. Every once in a while, the summed slice comes out as nan or as an extremely small value (a value that shouldn't appear in my data). If I perform the same indexing twice in a row, the value comes out correct the other time (either the first or the second read might be the wrong one). For example, below are some of the values I get during indexing, where the left is expected to match the right, but sometimes the value comes out wrong:
21.2162 21.2162
89.9759 6.5469e-33
35.7114 35.7114
35.2934 35.2934
56.8512 56.8512
42.2215 42.2215
11.5307 nan
19.2904 19.2904
25.4261 25.4261
This comes from indexing one right after the other:
print(dataset[index].sum(), end=' ')
print(dataset[index].sum())
The problem does not seem to arise when I only use a single process to index the dataset. The dataset is only being read from (no writing). Does anyone know why this might be happening and if there's a way to prevent it?
I encountered the very same issue, and after spending a day trying to marry the PyTorch DataParallel loader wrapper with HDF5 via h5py, I discovered that it is crucial to open the h5py.File inside the new process, rather than having it opened in the main process and hoping it gets inherited by the underlying multiprocessing implementation.
Since PyTorch seems to adopt a lazy way of initializing workers, this means that the actual file opening has to happen inside the __getitem__ function of the Dataset wrapper.
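A rough sketch of that idea (the class name and data key are illustrative, not from the question): the h5py.File handle is created lazily on the first __getitem__ call, so each DataLoader worker process ends up with its own handle.

import h5py
from torch.utils.data import Dataset

class LazyH5Dataset(Dataset):
    def __init__(self, hdf5_path, data_key='data'):
        self.hdf5_path = hdf5_path
        self.data_key = data_key
        self._db = None  # deliberately not opened here, in the main process

    def __len__(self):
        # open the file briefly just to read the length
        with h5py.File(self.hdf5_path, 'r') as db:
            return len(db[self.data_key])

    def __getitem__(self, idx):
        if self._db is None:
            # the first call happens inside the worker process
            self._db = h5py.File(self.hdf5_path, 'r')
        return self._db[self.data_key][idx]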
According to this answer, here is modified code that works well.
The modification is that we open and close the h5py.File object inside the __len__ and __getitem__ functions.
The sample code is here:
import h5py
from torch.utils.data import Dataset

class DeephomographyDataset(Dataset):

    def __init__(self, hdf5file, imgs_key='images', labels_key='labels',
                 transform=None):
        self.hdf5file = hdf5file
        self.imgs_key = imgs_key
        self.labels_key = labels_key
        self.transform = transform

    def __len__(self):
        # return len(self.db[self.labels_key])
        # open the file only for the duration of the call
        with h5py.File(self.hdf5file, 'r') as db:
            lens = len(db[self.labels_key])
        return lens

    def __getitem__(self, idx):
        # open the file inside the calling (worker) process for every item
        with h5py.File(self.hdf5file, 'r') as db:
            image = db[self.imgs_key][idx]
            label = db[self.labels_key][idx]
        sample = {'images': image, 'labels': label}
        if self.transform:
            sample = self.transform(sample)
        return sample
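Example usage (the file name and worker count are placeholders): because the file is opened inside __getitem__, this Dataset can be used with multiple workers.

from torch.utils.data import DataLoader

dataset = DeephomographyDataset('data.hdf5')
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
for batch in loader:
    images, labels = batch['images'], batch['labels']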
I have a quite complex Apache PySpark pipeline which performs several transformations on a (very large) set of text files. The intended output of my pipeline consists of the different intermediate stages of the pipeline. What is the best way (i.e., more efficient, but also more "sparkling" in the sense of better fitting the Spark programming model and style) to do this?
Right now, my code looks like the following:
# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)
# first checkpoint: the `first_serialization` function serializes
# the data into properly formatted string.
rdd.map(first_serialization).saveAsTextFile("ckpt1")
# here, I have to read again from the previously saved checkpoint
# using a `first_deserialization` function that deserializes what has
# been serialized by the `first_serialization` function, and then performs
# other transformations.
rdd = ctx.textFile("ckpt1").map(...).map(...)
and so on. I would like to get rid of the serialization methods and of the multiple save/read cycles -- by the way, does it impact efficiency? I assume it does.
Any hint?
Thanks in advance.
This seems obviously simple, because it is, but I would recommend writing out the intermediate stages while continuing to reuse the existing RDD (side bar: use Datasets/DataFrames instead of RDDs to get more performance) and continuing to process, writing out intermediate results as you go.
There's no need to pay the penalty of reading from disk/network when you already have the data processed (ideally even cached!) for further usage.
Example using your own code:
# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)
# first checkpoint: the `first_serialization` function serializes
# the data into properly formatted string.
string_rdd = rdd.map(first_serialization)
string_rdd.saveAsTextFile("ckpt1")
# reuse the existing RDD after writing out the intermediate results
# `rdd` here is the same variable we used to create `string_rdd` above;
# alternatively, you may want to continue from `string_rdd` instead of the original rdd.
rdd = rdd.map(...).map(...)
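Following the side bar about DataFrames, here is a rough sketch of the same idea with the DataFrame API (the paths and column transformations are placeholders): cache the intermediate result, write it out, and keep transforming the in-memory data without re-reading it from disk.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local").appName("MyPipeline").getOrCreate()

df = spark.read.text("input/*.txt")

# first stage of transformations; cache it because we both save it and reuse it
stage1 = df.withColumn("value", F.lower(F.col("value"))).cache()
stage1.write.mode("overwrite").text("ckpt1")

# continue from the cached stage1 instead of re-reading "ckpt1" from disk
stage2 = stage1.withColumn("length", F.length(F.col("value")))
stage2.write.mode("overwrite").parquet("ckpt2")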
I am looking for a way to speed up my code. I managed to speed up most parts of my code, reducing runtime to about 10 hours, but it's still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize it.
An example:
text = pd.read_csv(os.path.join(dir,"text.csv"),chunksize = 5000)
new_text = [np.array(chunk)[:,2] for chunk in text]
new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours to execute and is the main bottleneck of my program. Edit: I realized that I wasn't very clear on what the main issue was; the flattening is the part that takes the most time.
Also this part of my program takes a long time:
train_dict = dict(izip(text,labels))
result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]
The code above first zips the text documents with their corresponding labels (this is a machine learning task, with train_dict being the training set). Earlier in the program I generated predictions on a test set. There are duplicates between my train and test set, so I need to find those duplicates. Therefore, I need to iterate over my test set row by row (2 million rows in total); when I find a duplicate I don't want to use the predicted label, but the label from the duplicate in train_dict. I assign the result of this iteration to the variable result in the above code.
I have heard there are various libraries in Python that could speed up parts of your code, but I don't know which of them could do the job, and right now I don't have the time to investigate this, which is why I need someone to point me in the right direction. Is there a way to speed up the code snippets above?
Edit 2
I have investigated again, and it is definitely a memory issue. I tried to read the file row by row, and after a while the speed declined dramatically; furthermore, my RAM usage is nearly 100% and Python's disk usage increased sharply. How can I decrease the memory footprint? Or should I find a way to make sure that I don't hold everything in memory?
Edit 3
As memory is the main issue, I'll give an outline of part of my program. I have dropped the predictions for the time being, which reduced the complexity significantly; instead I insert a standard sample for every non-duplicate in my test set.
import numpy as np
import pandas as pd
import itertools
import os
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
train_2 = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
sample = list(np.array(pd.read_csv(os.path.join(dir,"Samples.csv"))[:,2]))#this file is only 70mb
sample = sample[1]
test_set = [np.array(chunk)[:,2] for chunk in test]
test_set = list(itertools.chain.from_iterable(test_set))
train_set = [np.array(chunk)[:,2] for chunk in train]
train_set = list(itertools.chain.from_iterable(train_set))
labels = [np.array(chunk)[:,3] for chunk in train_2]
labels = list(itertools.chain.from_iterable(labels))
"""zipping train and labels"""
train_dict = dict(izip(train,labels))
"""finding duplicates"""
results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]
Although this isn't my entire program, this is the part of my code that needs optimization. As you can see, I am only using three important modules in this part: pandas, numpy and itertools. The memory issues arise when flattening train_set and test_set. The only thing I am doing is reading in the files, getting the necessary parts, zipping the train documents with the corresponding labels into a dictionary, and then searching for duplicates.
Edit 4
As requested, I'll give an explanation of my data sets. My Train.csv contains 4 columns. The first column contains IDs for every sample, the second column contains titles, and the third column contains text body samples (varying from 100-700 words). The fourth column contains category labels. Test.csv contains only the IDs, titles and text bodies. The columns are separated by commas.
Could you please post a dummy sample data set of a half dozen rows or so?
I can't quite see what your code is doing and I'm not a Pandas expert, but I think we can greatly speed up this code. It reads all the data into memory and then keeps re-copying the data to various places.
By writing "lazy" code we should be able to avoid all the re-copying. The ideal would be to read one line, transform it as we want, and store it into its final destination. Also this code uses indexing when it should be just iterating over values; we can pick up some speed there too.
Is the code you posted your actual code, or something you made just to post here? It appears to contain some mistakes so I am not sure what it actually does. In particular, train and labels would appear to contain identical data.
I'll check back and see if you have posted sample data. If so I can probably write "lazy" code for you that will have less re-copying of data and will be faster.
EDIT: Based on your new information, here's my dummy data:
id,title,body,category_labels
0,greeting,hello,noun
1,affirm,yes,verb
2,deny,no,verb
Here is the code that reads the above:
def get_train_data(training_file):
with open(training_file, "rt") as f:
next(f) # throw away "headers" in first line
for line in f:
lst = line.rstrip('\n').split(',')
# lst contains: id,title,body,category_labels
yield (lst[1],lst[2])
train_dict = dict(get_train_data("data.csv"))
And here is a faster way to build results:
results = [train_dict.get(x, sample) for x in test]
Instead of repeatedly indexing test to find the next item, we just iterate over the values in test. The dict.get() method handles the if x in train_dict test we need.
You can try Cython. It supports numpy and can give you a nice speedup.
Here is an introduction and explanation of what needs to be done
http://www.youtube.com/watch?v=Iw9-GckD-gQ
If the order of your rows is not important, you can use sets to find the elements that appear in both the train set and the test set (the intersection, trainset & testset) and add them to your "result" first, and after that use the set difference (testset - trainset) to add the elements that are in your test set but not in the train set. This saves checking, for every sample, whether it is in the train set. A rough sketch follows.
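A sketch of that idea, reusing the variable names from the question's code (and plain zip instead of izip); note that the original row order is lost, which is the stated assumption.

train_dict = dict(zip(train_set, labels))

test_unique = set(test_set)
duplicates = test_unique & set(train_set)     # texts present in both sets
new_samples = test_unique - set(train_set)    # texts only in the test set

results = {}
for text in duplicates:
    results[text] = train_dict[text]          # reuse the known training label
for text in new_samples:
    results[text] = sample                    # fall back to the standard sample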