Processing TensorFlow Records that are XML (text) - python

I would like to use TensorFlow to process XML strings that are proper TFRecords, and I'd like to understand how to structure code that parses each TFRecord. A set of input rules and data type mappings is applied to each input record to produce an output TFRecord.
Example input TFRecord:
<PLANT><COMMON>Shooting Star</COMMON><BOTANICAL>Dodecatheon</BOTANICAL><ZONE>Annual</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$8.60</PRICE><EXTREF><REF1><ID>608</ID><TYPE>LOOKUP</TYPE></REF1><REF2><ID>703</ID><TYPE>STD</TYPE></REF2></EXTREF><AVAILABILITY>051399</AVAILABILITY></PLANT>
The rules show what needs to be parsed and how it needs to be formatted. E.g. find the COMMON, PRICE, EXTREF>REF2>ID and AVAILABILITY elements and export their values as a TFRecord.
Example output TFRecord:
Shooting Star,8.60,703,51399
How do I add this logic to a graph so when it executes it produces the output TFRecord? My initial thoughts are that I need to translate the mapping logic into a series of tf.ops...

I believe this link will be very helpful to you. It specifies the exact format that the TFRecord needs, and it provides the code to turn your own dataset into a TFRecord file.
However, that link does not mention XML files. It only covers how to create a tf_example and turn it into a TFRecord. This link goes a step back and shows how to turn an XML file into a tf_example. Note that it will need some modification to fit your needs because it uses the Oxford Pet Dataset.
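For the extraction step itself, a minimal sketch using only Python's standard library might look like the following. It assumes the sample record is well-formed (REF1 and REF2 are closed), and the rule table `RULES` and the function `apply_rules` are hypothetical names invented for illustration, not part of TensorFlow.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample record (closing REF1/REF2 tags are assumed).
SAMPLE = (
    "<PLANT><COMMON>Shooting Star</COMMON><BOTANICAL>Dodecatheon</BOTANICAL>"
    "<ZONE>Annual</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$8.60</PRICE>"
    "<EXTREF><REF1><ID>608</ID><TYPE>LOOKUP</TYPE></REF1>"
    "<REF2><ID>703</ID><TYPE>STD</TYPE></REF2></EXTREF>"
    "<AVAILABILITY>051399</AVAILABILITY></PLANT>"
)

# Hypothetical rule table: (element path, converter applied to the element text).
RULES = [
    ("COMMON", str),
    ("PRICE", lambda s: s.lstrip("$")),       # drop the currency sign
    ("EXTREF/REF2/ID", str),
    ("AVAILABILITY", lambda s: str(int(s))),  # integer mapping drops the leading zero
]


def apply_rules(xml_text, rules=RULES):
    """Extract the configured elements and join the converted values with commas."""
    root = ET.fromstring(xml_text)
    return ",".join(convert(root.findtext(path)) for path, convert in rules)


print(apply_rules(SAMPLE))
```

If the parsing has to run inside a tf.data pipeline, one option is to wrap a function like this with tf.py_function; otherwise it can run as a plain preprocessing step before the values are written to a tf_example. A missing element would make `findtext` return None, which this sketch does not handle.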

Related

Read documents with OCR in Python

I need to read structured data (PDFs and Excel spreadsheets) into a central location, from which other files can read the data to produce a constant and consistent output.
The first challenge is to read data from the input files; the second is to store the recognized data in a central location that feeds the output files.
Any insight into OCR in Python, or into alternative text and variable recognition algorithms, would be highly appreciated.

Problem with Loading and Preprocessing Data Using tf.data.Dataset

I'm trying to load .npy files using the map method of tf.data.Dataset. The reasoning behind this is that, in the future, I will be loading data not only from .npy files but also from .dat files encoded in a medical imaging format. I can do this with the Keras Sequence class, but I want to be able to prefetch my data, since it will take a lot of time to load huge amounts of files from remote hard drives. Even if I'm loading them from the same hard drive, it will take ages (seconds) to generate masks, load data, then feed it to the network. Right now, I'm working with the simplest version of the problem that I can handle. I have been working on this for a couple of weeks (on and off for a year) to no avail. The code is given below, and the error is:
TypeError: expected str, bytes or os.PathLike object, not Tensor
I do know the reason behind the error, but I could not convert the tensor to something numpy can read. I could do this if it weren't an online operation, by taking the output of the dataset with list(file) and then running a for loop. I have researched it, and every single book goes with the same irrelevant CSV example. I just need to load and preprocess data using the prefetch method, so that when my network is doing reconstruction, I can fetch the next batch.
The file argument that is passed to the np_read function is a tensor, but I expect it to be a string. The tensor has the string in it, but I cannot extract it.
Tensor("file:0", shape=(None,), dtype=string)
When I convert file to list I get this error.
OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did not convert this function. Try decorating it directly with @tf.function.
I expect to have a string that points to the location of the files, I do get these locations if I fetch the batch without mapping it. Then print using the list method.
array([b'C:\\Datasets\\MRI_Data\\Recon_v4\\Train\\e14080s3_P18944.7_144.npy', b'C:\\Datasets\\MRI_Data\\Recon_v4\\Train\\e13993s4_P16896.7_77.npy', b'C:\\Datasets\\MRI_Data\\Recon_v4\\Train\\e13992s4_P08704.7_65.npy'], dtype=object)>]
Thank you kindly.
import numpy as np
import tensorflow as tf

pattern = "C:/Datasets/MRI_Data/Recon_v4/Train/*.npy"
filepath_dataset = tf.data.Dataset.list_files(file_pattern=pattern)

@tf.function
def np_read(file):
    # `file` is a symbolic string Tensor here, so np.load raises the TypeError above
    loadedFile = np.load(file)
    return loadedFile

dataset = filepath_dataset.repeat(2).batch(3)
dataset = dataset.map(np_read)
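One common way around this kind of error is to let TensorFlow call the numpy loader through tf.numpy_function, which hands the file name to the Python function as a bytes object rather than a symbolic tensor. The following is a sketch under assumptions: the helper names `load_npy` and `build_dataset` are hypothetical, the arrays are assumed to be convertible to float32 and to share one shape so that they can be batched, and tf.data.AUTOTUNE assumes a reasonably recent TensorFlow 2.x.

```python
import numpy as np


def load_npy(path):
    """Load one .npy file; tf.numpy_function passes the file name as bytes."""
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    # float32 is an assumption; it must match the Tout given to tf.numpy_function.
    return np.load(path).astype(np.float32)


def build_dataset(pattern, batch_size=3):
    """Build a prefetching dataset of loaded arrays from a file pattern."""
    import tensorflow as tf  # imported here so load_npy stays usable on its own

    files = tf.data.Dataset.list_files(pattern)
    # Map over single file names first, then batch the loaded arrays.
    arrays = files.map(
        lambda f: tf.numpy_function(load_npy, [f], tf.float32),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return arrays.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Note that the mapping happens before batching, so the loader receives one file name at a time instead of a batch of three. Calling `build_dataset(pattern)` with the pattern from the question would then yield batches of arrays while the next ones are loaded in the background.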

Multi-line text dataset in Tensorflow

tf.data.* has dataset classes. There is a TextLineDataset, but what I need is one for multiline text (between start/end tokens). Is there a way to use a different line-break delimiter for tf.data.TextLineDataset?
I am an experienced developer, but a Python neophyte. I can read it, but my writing is limited. I am bending an existing TensorFlow NMT tutorial to my own dataset. Most TFRecord tutorials involve jpgs or other structured data.
You can try two options:
1) Write a generator and then use Dataset.from_generator: in your generator you can read your file line by line, append each line to the current example, and yield the example when you encounter your custom delimiter.
2) First parse your file, create a tf.train.SequenceExample with multiple lines for each example, and then store your dataset as a TFRecord file to be read back with TFRecordDataset (the more cumbersome option, in my opinion).
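The first option above could be sketched roughly as follows. The delimiter `<END>` and the function name `iter_examples` are assumptions made for illustration; substitute whatever token actually closes an example in your data.

```python
def iter_examples(lines, end_token="<END>"):
    """Group consecutive lines into one multi-line example per end token."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if line == end_token:
            if buffer:
                yield "\n".join(buffer)
            buffer = []
        else:
            buffer.append(line)
    if buffer:  # trailing example that has no closing token
        yield "\n".join(buffer)


sample_lines = ["first line\n", "second line\n", "<END>\n", "another example\n", "<END>\n"]
for example in iter_examples(sample_lines):
    print(repr(example))
```

Because an open file is itself an iterable of lines, the same generator can be passed a file object. In recent TensorFlow 2.x versions it could then be handed to tf.data.Dataset.from_generator with an output_signature of tf.TensorSpec(shape=(), dtype=tf.string).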

Representing time sequence input/output in tensorflow

I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence-to-sequence learning. I know I need a 3D tensor, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.
As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the tf.train.SequenceExample format. Using TFRecords for this provides the following advantages:
You can split your data into many files and load each of them on a different GPU.
You can use TensorFlow utilities for loading the data (for example, using queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protobuf (protocol buffers) to format your data; the documentation can be found here. The basic idea is that you define a format for your data (in this case tf.train.SequenceExample), save it to a TFRecord, and load it using the same data definition. Code for this pattern can be found at this ipython notebook.
As my answer is mostly summarizing the WildML blog post on this topic, I suggest you check that out, again found here.
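On the question of which dimension is which: Keras recurrent layers such as tf.keras.layers.LSTM expect input shaped (batch, time steps, features) by default, so an 8x20xN array would need its axes reordered to N x 20 x 8. A small numpy sketch, with a made-up value for N and placeholder data:

```python
import numpy as np

n_instances = 4  # hypothetical N, chosen only for illustration

# Data laid out as in the question: 8 features x 20 time steps x N instances.
data = np.arange(8 * 20 * n_instances, dtype=np.float32).reshape(8, 20, n_instances)

# Reorder to (instances, time steps, features), i.e. batch-major layout.
batch_major = np.transpose(data, (2, 1, 0))

print(batch_major.shape)  # (4, 20, 8)
```

After the transpose, `batch_major[i, t, f]` holds the same value as `data[f, t, i]`, so instance i is a 20-step sequence of 8-entry vectors.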

Python graphing from csv

I have extracted 6 months of email metadata and saved it as a csv file. The csv now only contains two columns (from and to email addresses). I want to build a graph where the vertices are the people I communicated with and who communicated with me, and the edges represent communication links, labeled with the number of communications I had. What is the best approach for going about this?
One approach is to use Linked Data principles (although not advisable if you are short on time and don't have a background in Linked Data). Here's a possible approach:
Depict each entity as a URI
Use an existing ontology (such as foaf) to describe the data
Transform the data into Resource Description Framework (RDF)
Use an RDF visualization tool.
Since RDF is inherently a graph, you will be able to visualize your data as well as extend it.
If you are unfamiliar with Linked Data, a way to view the graphs is to use Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). This approach is much simpler but lacks the benefits of semantic interoperability, provided you care about them in the first place.
Cytoscape might be able to import your data in that format and build a network from it.
http://www.cytoscape.org/
Your question (while mentioning Python) does not say what part or how much you want to do with Python. I will assume Python is a tool you know but that the main goal is to get the data visualized. In that case:
1) Use the Gephi network analysis tool. Some tools can use your CSV file as-is, and Gephi is one of them. In your case the edge weights need to be preserved (= number of emails exchanged between two email addresses), which can be done using the "mixed" variation of Gephi's CSV format.
2) Another option is to pre-process your CSV file (e.g. using Python), calculate the edge weights (the number of emails between every two email addresses) and save the result in any format you like. It can then be visualized in network analysis tools (such as Gephi) or directly in Python (e.g. using https://graph-tool.skewed.de).
Here's an example of an email network analysis project (though their graph does not show weights).
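The pre-processing in option 2 can be sketched with the standard library alone. This assumes the CSV has no header row, that the first two columns are the from and to addresses, and that edges are undirected (a mail from A to B counts toward the same edge as one from B to A); the function name `count_edges` and the sample addresses are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_edges(csv_file):
    """Count emails per unordered pair of addresses."""
    weights = Counter()
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        sender, recipient = (cell.strip().lower() for cell in row[:2])
        # Sort the pair so both directions map to the same edge.
        weights[tuple(sorted((sender, recipient)))] += 1
    return weights


sample = io.StringIO(
    "me@example.com,alice@example.com\n"
    "alice@example.com,me@example.com\n"
    "me@example.com,bob@example.com\n"
)
for (a, b), weight in sorted(count_edges(sample).items()):
    print(f"{a},{b},{weight}")
```

The printed lines form a weighted edge list (source, target, weight) that could be saved to a file and loaded into a network analysis tool. For directed edges, drop the `sorted` call and keep the pair in from/to order.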
