BatchDataset not subscriptable when trying to format Python dictionary as table - python

I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
    print(dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop, based on this recommendation
This gave me the error that the BatchDataset was not subscriptable.
Then I tried to use the range and len functions on dict_slices, so that i would be an integer index and not a slice.
This gave me the following error (as I understand it, because dict_slices is still an array, and each iteration is one vector of the array, not one index of the vector):

Refer here for the solution. To summarize, we need to use as_numpy_iterator:
example = list(dict_slices.as_numpy_iterator())
features, labels = example[0]  # each element is a (features dict, target array) pair
features['age']
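Because the dataset was built from a (dictionary, target) tuple, each batch is a pair of a column dictionary and a target array. To print such a batch as a table, the columns first have to be transposed into rows. A minimal sketch, using plain Python lists as a stand-in for the NumPy arrays; the helper name `batch_to_rows` and the sample values are made up for illustration:

```python
def batch_to_rows(features, labels, label_name="target"):
    """Turn a dict of equal-length columns plus a label column into row dicts."""
    names = list(features)
    rows = []
    # zip(*columns) walks the columns in parallel, yielding one tuple per row
    for values, label in zip(zip(*(features[n] for n in names)), labels):
        row = dict(zip(names, values))
        row[label_name] = label
        rows.append(row)
    return rows


# Hypothetical batch of three examples; real batches hold NumPy arrays.
features = {"age": [63, 67, 37], "sex": [1, 1, 0]}
labels = [0, 1, 0]

for row in batch_to_rows(features, labels):
    print(row)
```

The same function should work on the arrays from `as_numpy_iterator`, since it only relies on iterating over each column.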

BatchDataset is a tf.data.Dataset instance that has been batched by calling its .batch(...) method. You cannot index a TensorFlow Dataset, and calling len on it may also fail, depending on the TensorFlow version. I suggest iterating through it as you did in the first code snippet.
However, in your dataset you are using .to_dict('list'), which means that each key in your dictionary is mapped to a list of values. Basically, you have "columns" for every key and not rows. Is this what you want? This makes printing line by line (as in the table printing example you linked) a lot more difficult, since you do not have the different features together in a row. It is also different from the example in the official TensorFlow code, where a datapoint consists of multiple features, and not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values) + ['target']
# batch(1) so that each iteration below yields exactly one row and its target
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1)
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
    # each batch has a leading dimension of size 1, so take element 0
    print(*dict_slice.numpy()[0], target.numpy()[0], sep='\t')
This needs a bit of formatting, because column widths are not equal.
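One way to get equal column widths is to compute the widest cell in each column and pad every cell to that width. The sketch below is plain Python and does not depend on TensorFlow; the function name `format_table`, the column names, and the values are hypothetical:

```python
def format_table(columns, rows):
    """Return a text table whose columns are padded to a common width."""
    # Convert the header and every value to text so widths can be measured
    cells = [[str(c) for c in columns]] + [[str(v) for v in row] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(columns))]
    return "\n".join(
        "  ".join(cell.rjust(width) for cell, width in zip(line, widths))
        for line in cells
    )


# Hypothetical column names and values, for illustration only.
columns = ["age", "chol", "target"]
rows = [[63, 233, 0], [67, 286, 1]]
print(format_table(columns, rows))
```

Right-justifying suits numeric data; `ljust` could be swapped in for text columns.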

Related

How to create a Series of matrices in Python (with pandas and Gurobi)

I'm doing a linear optimization in Gurobi and trying to make my decision variables in a Series of matrices, using this code:
schedule = pd.Series(index = Weekdays)
for day in Weekdays:
    schedule[day] = m.addVars(Blocks, Departments, vtype=GRB.BINARY)
But it keeps throwing an error "cannot set using a list-like indexer with a different length than the value." How do I get around this to make a list of matrices?
If anyone comes across this, I figured out that the addVars method allows you to directly enter all three dimensions and uses a dictionary to reference. Therefore, you can simplify by writing:
schedule = m.addVars(Weekdays, Blocks, Departments, vtype=GRB.BINARY)
To reference a variable, all you need to do is write:
schedule[weekday, block, department]
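The indexing pattern is the same as for a plain dictionary keyed by (weekday, block, department) tuples. The sketch below uses an ordinary `dict` as a stand-in for the object returned by `addVars` and does not call Gurobi at all; the contents of the index sets are made up:

```python
from itertools import product

# Hypothetical index sets, for illustration only.
Weekdays = ["Mon", "Tue"]
Blocks = [1, 2, 3]
Departments = ["ER", "ICU"]

# Stand-in for the decision variables: one entry per index combination.
schedule = {key: 0 for key in product(Weekdays, Blocks, Departments)}

# A three-part key selects a single entry, as in schedule[weekday, block, department].
schedule["Mon", 2, "ICU"] = 1

# Recover the per-day "matrix" by filtering on the first key component.
monday = {(b, d): v for (w, b, d), v in schedule.items() if w == "Mon"}
print(monday)
```

This shows why a separate Series of matrices is not needed: the first key component already plays the role of the Series index.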

Data presentation difference in python

Hopefully a fairly simple answer to my issue.
When I run the following code:
print (data_1.iloc[1])
I get a nice, vertical presentation of the data, with each column value header, and its value presented on separate rows. This is very useful when looking at 2 sets of data, and trying to find discrepancies.
However, when I write the code as:
print (data_1.loc[data_1["Name"].isin(["John"])])
I get all the information arrayed across the screen, with the column header in 1 row, and the values in another row.
My question is:
Is there any way of using the second code, and getting the same vertical presentation of the data?
The difference is that data_1.iloc[1] returns a pandas Series whereas data_1.loc[data_1["Name"].isin(["John"])] returns a DataFrame. Pandas has different representations for these two data types (i.e. they print differently).
The reason iloc[1] gives you a Series is that you indexed it using a scalar. If you do data_1.iloc[[1]] you'll see you get a DataFrame instead. Conversely, data_1["Name"].isin(["John"]) returns a boolean mask, so selecting with it always gives a DataFrame, even when only one row matches. If you want a Series instead, you can take the first matching row with something like
print(data_1.loc[data_1["Name"].isin(["John"])].iloc[0])
but only if you're sure you're getting one element back.
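The difference between the two return types can be checked on a small frame. This is a sketch with made-up data; it also shows transposing with `.T`, which keeps a DataFrame but lays each matching row out vertically, so it still works when more than one row matches:

```python
import pandas as pd

# Hypothetical data, for illustration only.
data_1 = pd.DataFrame(
    {"Name": ["Ann", "John"], "Age": [31, 45], "City": ["Oslo", "Leeds"]}
)

# Boolean-mask selection returns a DataFrame (prints horizontally)
matches = data_1.loc[data_1["Name"].isin(["John"])]

# Taking one row by position returns a Series (prints vertically);
# this assumes at least one row matched
first_match = matches.iloc[0]
print(first_match)

# Transposed DataFrame: one column per matching row
print(matches.T)
```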

Is there a tensor equivalent to python's list.count()

I'm attempting to do all my input pipeline work in tensorflow. This includes transforming the examples into the types required by the classifier.
I just learned I can't iterate over a string tensor like I would do with a standard python list. My specific question is "is there a tf function for testing the existence of a constant value within a tensor?" Of course there may be a better way to do this (I'm new to tf and python).
# creating a unique list of tokens (python)
a_global = []
a = [...]
for token in a:
    if a_global.count(token) == 0:
        a_global.append(token)
I'm indexing string tokens so I can essentially convert them into integers using the token's position within the list as its new value. That snippet will not work when "a" is a tensor, so I'm trying tf.map_fn() instead, but I don't know how to replicate the IF statement predicate. Can someone point me in the right direction?
tf ver 1.8
If you don't need gradients for this operation (which I guess you don't for preprocessing stuff), the easiest could be to use tf.py_func. It essentially is able to wrap numpy code snippets into TensorFlow ops.
If that doesn't work for you, look at this post to count occurrences. Then you could use tf.cond to replicate the if statement.
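As a rough illustration of the kind of plain-Python function that could be wrapped this way, the sketch below builds the token-to-integer mapping described in the question. It does not call TensorFlow; the function names `build_vocab` and `encode` and the sample tokens are hypothetical. A dictionary replaces the `list.count` test, so each membership check no longer scans the whole list:

```python
def build_vocab(tokens):
    """Map each distinct token to its position in first-seen order."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            # The next free index equals the number of tokens seen so far
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its integer index."""
    return [vocab[token] for token in tokens]


tokens = ["red", "blue", "red", "green", "blue"]  # made-up input
vocab = build_vocab(tokens)
print(encode(tokens, vocab))
```

The indices match what the original loop would produce, because both assign positions in order of first appearance.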

Creating two lists from one randomly

I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This then returns a list within a list. Each list then contains around 140k bits of data. numericalData[][].
From this list, I wish to create Testing and Training Data. For my testing data, I want to have 30% of my read data numericalData, so I use this following bit of code;
testingAmount = len(numericalData0[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData;
testingData.append(np.random.choice(numericalData[x], testingAmount) )
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is that my trainingData needs to hold the other 70% of the data, but I'm unsure how to do this. I've tried to compare each element in my testingData, and if both elements aren't equal, then add it to my trainingData. This resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data, and then save that new column to my trainingData, but that didn't work either.
I've only been working with python for the past week so I'm a bit lost on what to try now.
You can use random.shuffle and split the list after that. A toy example:
import random
data = list(range(1, 11))  # random.shuffle needs a mutable sequence, not a range object
random.shuffle(data)
training = data[:5]
testing = data[5:]
To get more information, read the docs.
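For column-oriented data like the list of lists in the question, the same idea can be applied to row indices, so that every column is split at the same rows and the testing and training sets do not overlap. (By default, np.random.choice samples with replacement, and calling it once per column picks different rows for each column.) A minimal sketch; the function name `split_columns`, its parameters, and the sample data are made up:

```python
import random


def split_columns(columns, test_percent, seed=None):
    """Split column-oriented data into (training, testing) by shuffling row indices once."""
    n_rows = len(columns[0])
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = n_rows * test_percent // 100
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Apply the same row selection to every column so rows stay aligned
    training = [[col[i] for i in train_idx] for col in columns]
    testing = [[col[i] for i in test_idx] for col in columns]
    return training, testing


# Made-up data: 2 columns of 10 rows each.
numerical_data = [list(range(10)), list(range(100, 110))]
training, testing = split_columns(numerical_data, test_percent=30, seed=0)
print(len(training[0]), len(testing[0]))
```

Passing a fixed `seed` makes the split repeatable, which helps when comparing runs.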

arcgis python index of layer within a dataframe

I am new to python and I have searched everywhere but I can not find an explicit method to get the index of a layer in the map, and more specifically in a given dataframe.
I have been able to list the layers using ListLayers function.
I have been using the code below, which does not work. I did not really expect it to work, but I have tried other things without success, so I decided to ask the group. Thanks.
import arcpy
mxd = arcpy.mapping.MapDocument(r"D:\PythonTest\Data\MyMap.mxd")
df = arcpy.mapping.ListDataFrames(mxd, "MTM8")[0]
listlayer = arcpy.mapping.ListLayers(mxd, "", df)
for lyr in listlayer:
    print lyr.index(lyr)
The ListLayers function returns a list of ArcGIS layer objects. In your expression you are searching for the text name of the object and not the object itself. This is why Python will report that "lyr" is not in the list.
I'm also trying to work out the correct method for this. What I have done thus far is to create two lists:
a list of layer objects (Layerlist)
a list of layer object names (retrievelist)
I can then pull out the layer object from Layerlist by searching for the text name in retrievelist, retrieving its index, and then indexing Layerlist with that position. It seems "unclean", but since both lists are in exactly the same order, it works.
I'd be interested in any cleaner solutions to this.
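One possibly cleaner option is Python's built-in enumerate, which yields each item's position while iterating, so no second list is needed. The sketch below uses a small stand-in class instead of real arcpy layer objects (the class, the helper `layer_indices`, and the layer names are hypothetical). It assumes only that each layer exposes a `name` attribute and that the result of ListLayers can be iterated like any other list:

```python
class FakeLayer(object):
    """Stand-in for an arcpy layer object; only the name attribute is modelled."""

    def __init__(self, name):
        self.name = name


def layer_indices(layers):
    """Return a dict mapping each layer name to its position in the list."""
    return {layer.name: index for index, layer in enumerate(layers)}


listlayer = [FakeLayer("Roads"), FakeLayer("Rivers"), FakeLayer("Parcels")]
positions = layer_indices(listlayer)
print(positions["Rivers"])
```

If two layers share a name, the dictionary keeps only the last position, so a plain loop over `enumerate(listlayer)` may be safer in that case.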
