Tensorflow transform each element of a string tensor - python

I have a tensor of strings. Some example strings are as follows.
com.abc.display,com.abc.backend,com.xyz.forte,blah
com.pqr,npr.goog
I want to do some preprocessing which splits the CSV into its parts, then splits each part at the dots, and then creates multiple strings where one string is a prefix of another. Also, all blahs have to be dropped.
For example, given the first string com.abc.display,com.abc.backend,com.xyz.forte, it is transformed into an array/list of the following strings.
['com', 'com.abc', 'com.abc.display', 'com.abc.backend', 'com.xyz', 'com.xyz.forte']
The resulting list has no duplicates (that is why the prefixed strings for com.abc.backend didn't show up as those were already included - com and com.abc).
I wrote the following python function that would do the above given a single CSV string example.
def expand_meta(meta):
    expanded_subparts = []
    meta_parts = set([x for x in meta.split(',') if x != 'blah'])
    for part in meta_parts:
        subparts = part.split('.')
        for i in range(len(subparts) + 1):
            expanded = '.'.join(subparts[:i])
            if expanded:
                expanded_subparts.append(expanded)
    return list(set(expanded_subparts))
Calling this method on the first example
expand_meta('com.abc.display,com.abc.backend,com.xyz.forte,blah')
returns
['com.abc.display',
'com.abc',
'com.xyz',
'com.xyz.forte',
'com.abc.backend',
'com']
I know that tensorflow has this map_fn method. I was hoping to use that to transform each element of the tensor. However, I am getting the following error.
File "mypreprocess.py", line 152, in expand_meta
meta_parts = set([x for x in meta.split(',') if x != 'blah'])
AttributeError: 'Tensor' object has no attribute 'split'
So, it seems like I can't use a regular python function with map_fn since it expects the elements to be tensors. How can I do what I intend to do here?
(My Tensorflow version is 1.11.0)

I think this does what you want:
import tensorflow as tf

# Function to process a single string
def make_splits(s):
    s = tf.convert_to_tensor(s)
    # Split by comma
    split1 = tf.strings.split([s], ',').values
    # Remove blahs
    split1 = tf.boolean_mask(split1, tf.not_equal(split1, 'blah'))
    # Split by period
    split2 = tf.string_split(split1, '.')
    # Get dense split tensor
    split2_dense = tf.sparse.to_dense(split2, default_value='')
    # Accumulated concatenations
    concats = tf.scan(lambda a, b: tf.string_join([a, b], '.'),
                      tf.transpose(split2_dense))
    # Get relevant concatenations
    out = tf.gather_nd(tf.transpose(concats), split2.indices)
    # Remove duplicates
    return tf.unique(out)[0]
# Test
with tf.Graph().as_default(), tf.Session() as sess:
    # Individual examples
    print(make_splits('com.abc.display,com.abc.backend,com.xyz.forte,blah').eval())
    # [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
    #  b'com.xyz.forte']
    print(make_splits('com.pqr,npr.goog').eval())
    # [b'com' b'com.pqr' b'npr' b'npr.goog']
    # Apply to multiple strings with a loop
    data = tf.constant([
        'com.abc.display,com.abc.backend,com.xyz.forte,blah',
        'com.pqr,npr.goog'])
    ta = tf.TensorArray(size=data.shape[0], dtype=tf.string,
                        infer_shape=False, element_shape=[None])
    _, ta = tf.while_loop(
        lambda i, ta: i < tf.shape(data)[0],
        lambda i, ta: (i + 1, ta.write(i, make_splits(data[i]))),
        [0, ta])
    out = ta.concat()
    print(out.eval())
    # [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
    #  b'com.xyz.forte' b'com' b'com.pqr' b'npr' b'npr.goog']
I'm not sure if you want the total results concatenated like that, or maybe you want to apply tf.unique to the global result, but in any case the idea is the same.
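If you do want a single, globally deduplicated list instead, a minimal sketch (reusing the out tensor and the session from the test block above) could be:

# Sketch: deduplicate across all input strings rather than per string.
global_unique = tf.unique(out)[0]
print(global_unique.eval())
# [b'com' b'com.abc' b'com.abc.display' b'com.abc.backend' b'com.xyz'
#  b'com.xyz.forte' b'com.pqr' b'npr' b'npr.goog']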

Related

InputFunction error on Snakemake when returning a list of file with a lambda function

I'm writing a Snakemake rule which takes input values from a parsed YAML and returns the files associated with that group label as a list, but I am getting an odd error.
I've got the function printing its output just before it returns, so it does seem to be returning a list:
['/SAN/vyplab/alb_projects/data/muscle/analysis/feature_counts/Ctrl1_featureCounts_results.txt', '/SAN/vyplab/alb_projects/data/muscle/analysis/feature_counts/Ctrl4_featureCounts_results.txt', '/SAN/vyplab/alb_projects/data/muscle/analysis/feature_counts/Ctrl2_featureCounts_results.txt', '/SAN/vyplab/alb_projects/data/muscle/analysis/feature_counts/Ctrl3_featureCounts_results.txt']
However, I am getting an "AttributeError", which is unexpected, especially since I adapted this directly from a previous pipeline that worked perfectly well with this function.
InputFunctionException in line 26 of /SAN/vyplab/alb_projects/pipelines/rna_seq_snakemake/rules/deseq2_featureCounts.smk:
AttributeError: 'str' object has no attribute 'list'
Wildcards:
bse=control
contrast=ContrastvControl
The rule looks like this; I'm omitting the shell and params sections as I don't think they're necessary for debugging.
rule run_standard_deseq:
    input:
        base_group = lambda wildcards: featurecounts_files_from_contrast(wildcards.bse),
        contrast_group = lambda wildcards: featurecounts_files_from_contrast(wildcards.contrast)
    output:
        os.path.join(DESEQ2_DIR, "{bse}_{contrast}" + "normed_counts.csv.gz")
Implementation of the helper function
def featurecounts_files_from_contrast(grp):
    """
    given a contrast name or list of groups return a list of the files in that group
    """
    # reading in the samples
    samples = pd.read_csv(config['sampleCSVpath'])
    # there should be a column which allows you to exclude samples
    samples2 = samples.loc[samples.exclude_sample_downstream_analysis != 1]
    # read in the comparisons and make a dictionary of comparisons, comparisons needs to be in the config file
    compare_dict = load_comparisons()
    # go through the values of the dictionary and break when we find the right groups in that contrast
    grps, comparison_column = return_sample_names_group(grp)
    # take the sample names corresponding to those groups
    if comparison_column == "":
        return([""])
    grp_samples = list(set(list(samples2[samples2[comparison_column].isin(grps)].sample_name)))
    feature_counts_outdir = get_output_dir(config["project_top_level"], config["feature_counts_output_folder"])
    fc_suffix = "_featureCounts_results.txt"
    # build a list with the full path from those sample names
    fc_files = [os.path.join(feature_counts_outdir, x + fc_suffix)
                for x in grp_samples]
    fc_files = list(set(fc_files))
    print(fc_files)
    return(fc_files)
The print statement shows the correct files, so I assumed this function would work.

Convert a tensorflow tf.data.Dataset FlatMapDataset to TensorSliceDataset

I want to pass a list of tf.Strings to the .map(_parse_function) function.
def _parse_function(self, img_path):
    img_str = tf.read_file(img_path)
    img_decode = tf.image.decode_jpeg(img_str, channels=3)
    img_decode = tf.divide(tf.cast(img_decode, tf.float32), 255)
    return img_decode
When the tf.data.Dataset is of type TensorSliceDataset,
dataset_from_slices = tf.data.Dataset.from_tensor_slices((tensor_with_filenames))
I can simply do
dataset_from_slices.map(_parse_function), which works.
However, dataset_from_generator = tf.data.Dataset.from_generator(...) returns a Dataset which is an instance of FlatMapDataset type and dataset_from_generator.map(_parse_function) gives the following error:
InvalidArgumentError: Input filename tensor must be scalar, but had shape: [32]
If I change the first line to:
img_str = tf.read_file(img_path[0])
that also works but then I only get the first image, which is not what I am looking for. Any suggestions?
It sounds like the elements of your dataset_from_generator are batched. The simplest remedy is to use tf.contrib.data.unbatch() to convert them back into individual elements:
# Each element is a vector of strings.
dataset_from_generator = tf.data.Dataset.from_generator(...)
# Converts each vector of strings into multiple individual elements.
dataset = dataset_from_generator.apply(tf.contrib.data.unbatch())
dataset = dataset.map(_parse_function)
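As an aside (not part of the original answer): in newer TensorFlow versions, where tf.contrib has been removed, the same operation is available as a method on the dataset itself, so the equivalent would be roughly:

# Newer TF (1.15+/2.x): unbatch is a Dataset method instead of a contrib transform.
dataset = dataset_from_generator.unbatch()
dataset = dataset.map(_parse_function)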

List and Numpy array in Python

I was trying to one-hot encode data.
The data is a list with vocabulary_size = 17005207 elements.
To one-hot encode it, I build a list inputs using num_labels = 100.
The following code:
inputs = []
for i in range(vocabulary_size):
    inputs.append(np.arange(num_labels) == data[i]).astype(np.float32)
throws an error:
AttributeError: 'NoneType' object has no attribute 'astype'
I tried passing dtype=np.float32 inside the append call, but that was also erroneous.
When I try this:
inputs = []
for i in range(vocabulary_size):
    inputs.append(np.arange(num_labels) == data[i])
inputs = np.array(inputs, dtype=np.float32)
I get the correct answer: a one-hot encoded input array of shape vocabulary_size x num_labels.
Is there an alternative one-line solution without using NumPy?
Solved: Can it be done directly using the NumPy array (inputs) with the list (data)?
Info about data: data = np.ndarray(len(words), dtype=np.int32)
Reformat function:
def reformat(data):
    num_labels = vocabulary_size
    print (type(data))
    data = (np.arange(num_labels) == data[:,None]).astype(np.int32)
    return data
    print (data, len(data))
    return data
New question: the dimension of data is (vocabulary_size,). How can I convert data, using ravel or reshape, into a shape of (1, vocabulary_size)?
I'm not sure whether I've understood correctly what you're asking for, but if what you want is a one-liner, you could transform your already working code into this:
inputs = np.array([np.arange(num_labels) == data[i] for i in range(vocabulary_size)], dtype=np.float32)
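If data is already a NumPy integer array (as the reformat function above suggests), the loop can be skipped entirely with broadcasting; this is a sketch under that assumption, with the last line addressing the reshape question:

import numpy as np

# Broadcasting compares every label against the full label range at once.
data_arr = np.asarray(data)                      # shape (vocabulary_size,)
inputs = (np.arange(num_labels) == data_arr[:, None]).astype(np.float32)
# inputs now has shape (vocabulary_size, num_labels)

# Turning a (vocabulary_size,) array into shape (1, vocabulary_size):
data_row = data_arr.reshape(1, -1)               # or data_arr[None, :]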

NaiveBayes model training with separate training set and data using pyspark

So, I am trying to train a naive Bayes classifier. I went through a lot of trouble preprocessing the data and I have now produced two RDDs:
Training set: composed of a set of sparse vectors;
Labels: a corresponding list of labels (0, 1) for every vector.
I need to run something like this:
# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)
but "training" is a dataset derived from running:
def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)
based on the documentation for python here. My question is, given that I don't want to load the data from a txt file and that I have already created the training set in the form of records mapped to sparse-vectors (RDD) and a corresponding labelled list, how can I run naive-bayes?
Here is part of my code:
# Function
def featurize(tokens_kv, dictionary):
    """
    :param tokens_kv: list of tuples of the form (word, tf-idf score)
    :param dictionary: list of n words
    :return: sparse_vector of size n
    """
    # MUST sort tokens_kv by key
    tokens_kv = collections.OrderedDict(sorted(tokens_kv.items()))
    vector_size = len(dictionary)
    non_zero_indexes = []
    index_tfidf_values = []
    for key, value in tokens_kv.iteritems():
        index = 0
        for word in dictionary:
            if key == word:
                non_zero_indexes.append(index)
                index_tfidf_values.append(value)
            index += 1
    print non_zero_indexes
    print index_tfidf_values
    return SparseVector(vector_size, non_zero_indexes, index_tfidf_values)

# Feature Extraction
Training_Set_Vectors = (TFsIDFs_Vector_Weights_RDDs
                        .map(lambda (tokens): featurize(tokens, Dictionary_BV.value))
                        .cache())
... and labels is just a list of 1s and 0s. I understand that I may need to use LabeledPoint somehow, but I am confused as to how; the RDD is not a list, while labels is a list. I am hoping for something as simple as a way to create LabeledPoint objects[i] by combining sparse-vectors[i] with the corresponding labels[i]... any ideas?
I was able to solve this by first collecting the SparseVectors RDD, effectively converting it to a list. Then I ran a function that constructed a list of LabeledPoint objects:
def final_form_4_training(SVs, labels):
    """
    :param SVs: List of Sparse vectors.
    :param labels: List of labels
    :return: list of labeledpoint objects
    """
    to_train = []
    for i in range(len(labels)):
        to_train.append(LabeledPoint(labels[i], SVs[i]))
    return to_train

# Feature Extraction
Training_Set_Vectors = (TFsIDFs_Vector_Weights_RDDs
                        .map(lambda (tokens): featurize(tokens, Dictionary_BV.value))
                        .collect())

raw_input("Generate the LabeledPoint parameter... ")
labelled_training_set = sc.parallelize(final_form_4_training(Training_Set_Vectors, training_labels))

raw_input("Train the model... ")
model = NaiveBayes.train(labelled_training_set, 1.0)
However, this assumes that the RDDs maintain their order (which I am not messing with) throughout the processing pipeline. I also dislike the part where I had to collect everything on the master. Any better ideas?
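One direction worth considering (a hedged sketch, not something verified on this pipeline): keep the vectors as an RDD instead of collecting them, parallelize the labels, and zip the two RDDs together. Note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition, so the partitioning may need to be controlled explicitly:

# Sketch: avoid collect() by zipping two equally partitioned RDDs.
# Training_Set_Vectors here is the RDD version (use .cache() instead of .collect()).
labels_rdd = sc.parallelize(training_labels, Training_Set_Vectors.getNumPartitions())
labelled_training_set = (Training_Set_Vectors
                         .zip(labels_rdd)
                         .map(lambda pair: LabeledPoint(pair[1], pair[0])))
model = NaiveBayes.train(labelled_training_set, 1.0)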

Python script for transforming and sorting columns in ascending order, decimal cases

I wrote a script in Python that removes tabs/blank spaces between two columns of strings (x, y coordinates), separates the columns with a comma, and lists the maximum and minimum values of each column (two values each for the x and y coordinates). E.g.:
100000.00 60000.00
200000.00 63000.00
300000.00 62000.00
400000.00 61000.00
500000.00 64000.00
became:
100000.00,60000.00
200000.00,63000.00
300000.00,62000.00
400000.00,61000.00
500000.00,64000.00
100000.00 500000.00 60000.00 64000.00
This is the code I used:
import string

input = open(r'C:\coordinates.txt', 'r')
output = open(r'C:\coordinates_new.txt', 'wb')

s = input.readline()
while s <> '':
    s = input.readline()
    liste = s.split()
    x = liste[0]
    y = liste[1]
    output.write(str(x) + ',' + str(y))
    output.write('\n')
    s = input.readline()

input.close()
output.close()
I need to change the above code to also transform the coordinates from two decimal to one decimal values and each of the two new columns to be sorted in ascending order based on the values of the x coordinate (left column).
I started by writing the following, but not only is it not sorting the values, it is placing the y coordinates on the left and the x on the right. In addition, I don't know how to transform the decimals, since the values are strings and the only function I know of is %f, which needs floats. Any suggestions to improve the code below?
import string

input = open(r'C:\coordinates.txt', 'r')
output = open(r'C:\coordinates_sorted.txt', 'wb')

s = input.readline()
while s <> '':
    s = input.readline()
    liste = string.split(s)
    x = liste[0]
    y = liste[1]
    output.write(str(x) + ',' + str(y))
    output.write('\n')
    sorted(s, key=lambda x: x[o])
    s = input.readline()

input.close()
output.close()
thanks!
First, try to format your code according to PEP8—it'll be easier to read. (I've done the cleanup in your post already).
Second, Tim is right in that you should try to learn how to write your code as (idiomatic) Python not just as if translated directly from its C equivalent.
As a starting point, I'll post your 2nd snippet here, refactored as idiomatic Python:
# there is no need to import the `string` module; `.strip()` is a built-in
# method of strings (i.e. objects of type `str`).

# read in the data as a list of pairs of raw (i.e. unparsed) coordinates in
# string form:
with open(r'C:\coordinates.txt') as in_file:
    coords_raw = [line.strip().split() for line in in_file.readlines()]

# convert the raw list into a list of pairs (2-tuples) containing the parsed
# (i.e. float not string) data:
coord_pairs = [(float(x_raw), float(y_raw)) for x_raw, y_raw in coords_raw]

coord_pairs.sort()  # you want to sort the entire data set, not just values on
                    # individual lines as in your original snippet

# build a list of all x and y values we have (this could be done in one line
# using some `zip()` hackery, but I'd like to keep it readable (for you at
# least)):
all_xs = [x for x, y in coord_pairs]
all_ys = [y for x, y in coord_pairs]

# compute min and max:
x_min, x_max = min(all_xs), max(all_xs)
y_min, y_max = min(all_ys), max(all_ys)

# NOTE: the above section performs well for small data sets; for large ones, you
# should combine the 4 lines in a single for loop so as to NOT have to read
# everything to memory and iterate over the data 6 times.

# write everything out
with open(r'C:\coordinates_sorted.txt', 'wb') as out_file:
    # here, we're doing 3 things in one line:
    #   * iterate over all coordinate pairs and convert the pairs to the string
    #     form
    #   * join the string forms with a newline character
    #   * write the result of the join+iterate expression to the file
    out_file.write('\n'.join('%f,%f' % (x, y) for x, y in coord_pairs))
    out_file.write('\n\n')
    out_file.write('%f %f %f %f' % (x_min, x_max, y_min, y_max))
with open(...) as <var_name> gives you guaranteed closing of the file handle as with try-finally; also, it's shorter than open(...) and .close() on separate lines. Also, with can be used for other purposes, but is commonly used for dealing with files. I suggest you look up how to use try-finally as well as with/context managers in Python, in addition to everything else you might have learned here.
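For reference, a with block like the one above is roughly equivalent to this try/finally pattern (a minimal sketch):

out_file = open(r'C:\coordinates_sorted.txt', 'wb')
try:
    out_file.write('some data\n')
finally:
    out_file.close()  # runs even if the write raises an exception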
Your code looks more like C than like Python; it is quite unidiomatic. I suggest you read the Python tutorial to find some inspiration. For example, iterating using a while loop is usually the wrong approach. The string module is deprecated for the most part, <> should be !=, you don't need to call str() on an object that's already a string...
Then, there are some errors. For example, sorted() returns a sorted version of the iterable you're passing - you need to assign that to something, or the result will be discarded. But you're calling it on a string, anyway, which won't give you the desired result. You also wrote x[o] where you clearly meant x[0].
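To make the sorted() point concrete, a minimal illustration:

# sorted() returns a new list; it must be assigned to be of any use:
values = sorted(values, key=lambda pair: pair[0])
# or, for lists, sort in place:
values.sort(key=lambda pair: pair[0])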
You should be using something like this (assuming Python 2):
with open(r'C:\coordinates.txt') as infile:
    values = []
    for line in infile:
        values.append(map(float, line.split()))
values.sort()
with open(r'C:\coordinates_sorted.txt', 'w') as outfile:
    for value in values:
        outfile.write("{:.1f},{:.1f}\n".format(*value))
