tf.data.* has dataset classes. There is a TextLineDataset, but what I need is one for multiline text (between start/end tokens). Is there a way to use a different line-break delimiter for tf.data.TextLineDataset?
I am an experienced developer but a Python neophyte: I can read it, but my writing is limited. I am adapting an existing TensorFlow NMT tutorial to my own dataset. Most TFRecord tutorials involve JPEGs or other structured data.
You can try two options:
Write a generator and use Dataset.from_generator: in the generator, read your file line by line, accumulate lines into the current example, and yield the example when you encounter your custom delimiter (see the sketch after this list).
Parse your file first, create a tf.train.SequenceExample spanning multiple lines, and then store your dataset as a TFRecordDataset (the more cumbersome option, in my opinion).
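A minimal sketch of the generator option under the TF 2.x API (the file name "corpus.txt" and end token "</s>" are made-up placeholders; substitute your own):

    import tensorflow as tf

    def multiline_records(path="corpus.txt", end_token="</s>"):
        lines = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                lines.append(line)
                if line == end_token:
                    # One complete multiline example between delimiters.
                    yield "\n".join(lines)
                    lines = []

    dataset = tf.data.Dataset.from_generator(
        multiline_records,
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
    )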
I'm looking at some machine learning/forecasting code using Keras, and the input data sets are stored in npz files instead of the usual csv format.
Why would the authors go with this format instead of csv? What advantages does it have?
It depends on the expected usage. If a file is expected to have broad use cases, including direct access from ordinary client machines, then CSV is fine because it can be loaded directly into Excel or LibreOffice Calc, which are widely deployed. But it is just a good old text file with no indexes or any additional features.
On the other hand, if a file is only expected to be used by data scientists or, generally speaking, numpy-aware users, then npz is a much better choice because of the additional features (compression, lazy loading, etc.).
Long story short, you trade a larger audience for richer features.
From https://kite.com/python/docs/numpy.lib.npyio.NpzFile
A dictionary-like object with lazy-loading of files in the zipped archive provided on construction.
So, it is a zipped archive (smaller on disk than CSV, and more than one file can be stored in it), and files are loaded from disk only when needed (with CSV, when you only need one column, you still have to read and parse the whole file).
=> advantages are: performance and more features
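For illustration, a small sketch of the round trip (the array names "features"/"labels" and the file name are arbitrary):

    import numpy as np

    features = np.random.rand(1000, 20)
    labels = np.random.randint(0, 2, size=1000)

    # savez_compressed writes one zipped archive holding both arrays.
    np.savez_compressed("data.npz", features=features, labels=labels)

    # np.load returns the dictionary-like NpzFile; each member is read
    # from disk only when its key is accessed (lazy loading).
    with np.load("data.npz") as data:
        y = data["labels"]   # only this member is decompressed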
What are some popular methods to find and replace concatenated words such as:
brokenleg -> (broken,leg)
The method should run over thousands of lines without knowing in advance whether they contain concatenated words.
I'm using the spaCy library for most of my string handling, so the best method would be one that works well alongside spaCy.
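For what it's worth, here is a hedged sketch of the simplest dictionary-based approach (not a spaCy feature; a frequency-weighted segmenter such as the wordninja package is the more common production choice):

    def split_concatenated(token, vocab):
        """Return (left, right) if token splits into two known words."""
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if left in vocab and right in vocab:
                return (left, right)
        return None

    vocab = {"broken", "leg", "arm"}              # toy vocabulary
    print(split_concatenated("brokenleg", vocab))  # ('broken', 'leg')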
I would like to use TensorFlow to process XML strings that are proper TFRecords. I'm curious to understand how to structure code that parses each TFRecord. There is a set of input rules and data-type mappings that are applied to each record to produce an output TFRecord.
Example input TFRecord:
<PLANT><COMMON>Shooting Star</COMMON><BOTANICAL>Dodecatheon</BOTANICAL><ZONE>Annual</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$8.60</PRICE><EXTREF><REF1><ID>608</ID><TYPE>LOOKUP</TYPE></REF1><REF2><ID>703</ID><TYPE>STD</TYPE></REF2></EXTREF><AVAILABILITY>051399</AVAILABILITY></PLANT>
The rules show what needs to be parsed and how it needs to be formatted. E.g. find the COMMON, PRICE, EXTREF>REF2>ID and AVAILABILITY elements and export their values as a TFRecord.
Example output TFRecord:
Shooting Star,8.60,703,51399
How do I add this logic to a graph so when it executes it produces the output TFRecord? My initial thoughts are that I need to translate the mapping logic into a series of tf.ops...
I believe this link will be very helpful to you. It specifies the exact format that the TFRecord needs, and it provides the code to turn your own dataset into a TFRecord file.
However, that link does not mention XML files; it only covers how to create a tf_example and turn it into a TFRecord. This link goes a step further back and shows how to turn an XML file into a tf_example. Note that it will need some modification to fit your needs, because it uses the Oxford Pet Dataset.
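As a hedged sketch of the overall shape (not the code from either link; the feature names and helper functions are made up), you could parse each <PLANT> string with the standard library and serialize the four target values as a tf.train.Example:

    import xml.etree.ElementTree as ET
    import tensorflow as tf

    plant_xml = ("<PLANT><COMMON>Shooting Star</COMMON>"
                 "<BOTANICAL>Dodecatheon</BOTANICAL><ZONE>Annual</ZONE>"
                 "<LIGHT>Mostly Shady</LIGHT><PRICE>$8.60</PRICE>"
                 "<EXTREF><REF1><ID>608</ID><TYPE>LOOKUP</TYPE></REF1>"
                 "<REF2><ID>703</ID><TYPE>STD</TYPE></REF2></EXTREF>"
                 "<AVAILABILITY>051399</AVAILABILITY></PLANT>")

    def plant_to_example(xml_string):
        root = ET.fromstring(xml_string)

        def text(path):
            node = root.find(path)
            return (node.text or "") if node is not None else ""

        def bytes_feature(value):
            return tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[value.encode("utf-8")]))

        return tf.train.Example(features=tf.train.Features(feature={
            "common":       bytes_feature(text("COMMON")),
            "price":        bytes_feature(text("PRICE").lstrip("$")),
            "ref2_id":      bytes_feature(text("EXTREF/REF2/ID")),
            "availability": bytes_feature(text("AVAILABILITY")),
        }))

    with tf.io.TFRecordWriter("plants.tfrecord") as writer:
        writer.write(plant_to_example(plant_xml).SerializeToString())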
I have a list of PDF files with different numbers of pages and layouts.
Each file contains information that I need to extract, but the problem is that the information is wrapped in different types of phrases and syntax.
I need to know whether I have to build a machine learning model to do this and, if so, which algorithms and techniques suit my case.
NB: I have a huge dataset of PDF files I can use to train a model.
If you want to do this in Python, PyPDF2 seems to be the way to go. You should be able to read the PDFs and extract the text data you want. Automate the Boring Stuff has examples of using PyPDF2.
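A minimal sketch, assuming a recent PyPDF2 (3.x, where the reader class is PdfReader) and a hypothetical file name:

    from PyPDF2 import PdfReader

    reader = PdfReader("report.pdf")   # hypothetical file name
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    print(text[:500])                  # inspect the first few hundred characters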
I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence-to-sequence learning. I know I need a 3D tensor, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.
As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the tf.train.SequenceExample format. Using TFRecords for this provides the following advantages:
You can split your data into many files and load each on a different GPU.
You can use TensorFlow utilities for loading the data (for example, using queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protocol buffers (protobuf) as the way to format your data; the protobuf documentation can be found here. The basic idea is that you define a format for your data (in this case tf.train.SequenceExample), save it to a TFRecord, and load it using the same data definition. Code for this pattern can be found at this ipython notebook.
As my answer is mostly summarizing the WildML blog post on this topic, I suggest you check that out, again found here.
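On the dimension question: TensorFlow's recurrent layers conventionally take [batch, time, features], so the data above is best shaped N x 20 x 8 rather than 8 x 20 x N. As a hedged sketch of the SequenceExample pattern (the feature-list name "inputs" and the file name are made up), one instance could be serialized like this:

    import tensorflow as tf

    def make_sequence_example(sequence):
        # sequence: 20 timesteps, each a list of 8 floats
        inputs = tf.train.FeatureList(feature=[
            tf.train.Feature(float_list=tf.train.FloatList(value=step))
            for step in sequence
        ])
        feature_lists = tf.train.FeatureLists(feature_list={"inputs": inputs})
        return tf.train.SequenceExample(feature_lists=feature_lists)

    example = make_sequence_example([[0.0] * 8 for _ in range(20)])
    with tf.io.TFRecordWriter("sequences.tfrecord") as writer:
        writer.write(example.SerializeToString())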