Tensorflow: Compilation using SSE and AVX [duplicate] - python

This is the message received from running a script to check if Tensorflow is working:
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I noticed that it mentions SSE4.2 and AVX.
What are SSE4.2 and AVX?
How do SSE4.2 and AVX improve CPU computations for TensorFlow tasks?
How can I make TensorFlow compile with these two instruction sets?

I just ran into this same problem. It seems Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support; adding --copt=-msse4.2 was enough. In the end, I successfully built with
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
without getting any warning or errors.
Probably the best choice for any system is:
bazel build -c opt --copt=-march=native --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
(Update: the build scripts may be eating -march=native, possibly because it contains an =.)
-mfpmath=both only works with gcc, not clang. -mfpmath=sse is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387, so changing that will help for 32-bit. (But if you want high-performance for number crunching, you should build 64-bit binaries.)
I'm not sure what optimization level TensorFlow builds with by default (-O2 or -O3). gcc -O3 enables full optimization including auto-vectorization, but that can sometimes make code slower.
What this does: --copt for bazel build passes an option directly to gcc for compiling C and C++ files (but not linking, so you need a different option for cross-file link-time optimization).
x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.
-march=native enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2 redundant. (Also, -mavx2 already enables -mavx and -msse4.2, so Yaroslav's command should have been fine). Also if you're using a CPU that doesn't support one of these options (like FMA), using -mfma would make a binary that faults with illegal instructions.
TensorFlow's ./configure defaults to enabling -march=native, so using that should avoid needing to specify compiler options manually.
-march=native enables -mtune=native, so it optimizes for your CPU for things like which sequence of AVX instructions is best for unaligned loads.
This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST instead of -march=native.)
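If you're curious which flags -march=native actually turns on for your machine, one way to check with gcc (an inspection trick, not part of the TensorFlow build; the output format varies by gcc version) is:
gcc -march=native -Q --help=target | grep enabled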

Let's start with an explanation of why you see these warnings in the first place.
Most probably you did not install TF from source and instead used something like pip install tensorflow. That means you installed pre-built (by someone else) binaries which were not optimized for your architecture. These warnings tell you exactly that: something is available on your architecture, but it will not be used because the binary was not compiled for it. Here is the relevant part of the documentation.
TensorFlow checks on startup whether it has been compiled with the
optimizations available on the CPU. If the optimizations are not
included, TensorFlow will emit warnings, e.g. AVX, AVX2, and FMA
instructions not included.
The good thing is that most probably you just want to learn/experiment with TF, so everything will work properly and you should not worry about it.
What are SSE4.2 and AVX?
Wikipedia has good explanations of SSE4.2 and AVX. This knowledge is not required to be good at machine learning. You may think of them as additional instructions that let a computer apply a single instruction to multiple data points at once, to perform operations which are naturally parallelized (for example, adding two arrays).
Both SSE and AVX are implementations of the abstract idea of SIMD (Single Instruction, Multiple Data), which is
a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.
This is enough to answer your next question.
How do SSE4.2 and AVX improve CPU computations for TF tasks?
They allow more efficient computation of various vector (matrix/tensor) operations. You can read more in these slides.
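As a rough illustration of the data-parallel idea (a sketch using NumPy, which is not TensorFlow itself but dispatches to SIMD-capable routines under the hood):
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# One vectorized statement: a single operation applied across many data
# points, which SIMD hardware can execute several elements at a time.
c = a + b

# The scalar equivalent touches one element per iteration.
d = np.empty_like(a)
for i in range(len(a)):
    d[i] = a[i] + b[i]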
How can I make TensorFlow compile with these two instruction sets?
You need to have a binary which was compiled to take advantage of these instructions. The easiest way is to compile it yourself. As Mike and Yaroslav suggested, you can use the following bazel command
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

Let me answer your 3rd question first:
If you want to run a self-compiled version within a conda env, you can. These are the general instructions I run to get TensorFlow installed on my system with the additional instruction sets. Note: This build was for an AMD A10-7850 (check your CPU for which instructions are supported; it may differ), running Ubuntu 16.04 LTS. I use Python 3.5 within my conda env. Credit goes to the TensorFlow source install page and the answers provided above.
git clone https://github.com/tensorflow/tensorflow
# Install Bazel
# https://bazel.build/versions/master/docs/install.html
sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel
# Create your virtual env with conda.
source activate YOUR_ENV
pip install six numpy wheel packaging appdirs
# Follow the configure instructions at:
# https://www.tensorflow.org/install/install_sources
# Build like below. Note: Check which instructions your CPU
# supports. Also, if resources are limited, consider adding the
# --local_resources 2048,.5,1.0 flag. This limits how much RAM and how
# many local resources are used, but increases compile time.
bazel build -c opt --copt=-mavx --copt=-msse4.1 --copt=-msse4.2 -k //tensorflow/tools/pip_package:build_pip_package
# Create the wheel like so:
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# Inside your conda env:
pip install /tmp/tensorflow_pkg/NAME_OF_WHEEL.whl
# Then install the rest of your stack
pip install keras jupyter etc. etc.
As to your 2nd question:
A self-compiled version with optimizations is well worth the effort in my opinion. On my particular setup, calculations that used to take 560-600 seconds now take only about 300 seconds! Although the exact numbers will vary, I think you can expect roughly a 35-50% speed increase in general on your particular setup.
Lastly your 1st question:
A lot of answers have been provided above already. To summarize: AVX, SSE4.1, SSE4.2, and FMA are different kinds of extended instruction sets on x86 CPUs. Many contain optimized instructions for processing matrix or vector operations.
I will highlight my own misconception to hopefully save you some time: It's not that SSE4.2 is a newer version of instructions superseding SSE4.1. SSE4 = SSE4.1 (a set of 47 instructions) + SSE4.2 (a set of 7 instructions).
In the context of TensorFlow compilation, if your computer supports AVX2 and AVX, and SSE4.1 and SSE4.2, you should put in the optimizing flags for all of them. Don't do what I did and just go with SSE4.2, thinking it's newer and should supersede SSE4.1. That's clearly WRONG! I had to recompile because of that, which cost me a good 40 minutes.
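Since the advice above depends on knowing which instruction sets your CPU supports, here is a minimal sketch for checking on Linux (it just greps the flags field of /proc/cpuinfo; on macOS use sysctl -a | grep machdep.cpu instead):
# check_cpu_flags.py - Linux only
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())

for feature in ("sse4_1", "sse4_2", "avx", "avx2", "fma"):
    print(feature, "supported" if feature in flags else "NOT supported")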

These are SIMD vector processing instruction sets.
Using vector instructions is faster for many tasks; machine learning is such a task.
Quoting the tensorflow installation docs:
To be compatible with as wide a range of machines as possible, TensorFlow defaults to only using SSE4.1 SIMD instructions on x86 machines. Most modern PCs and Macs support more advanced instructions, so if you're building a binary that you'll only be running on your own machine, you can enable these by using --copt=-march=native in your bazel build command.

Thanks to all these replies plus some trial and error, I managed to install it on a Mac with clang. Just sharing my solution in case it is useful to someone.
Follow the instructions on Documentation - Installing TensorFlow from Sources
When prompted for
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]
then copy-paste this string:
-mavx -mavx2 -mfma -msse4.2
(The default option caused errors, and so did some of the other flags. I got no errors with the flags above. BTW, I replied n to all the other questions.)
After installing, I verified a ~2x to 2.5x speedup when training deep models compared with an installation based on the default wheels - Installing TensorFlow on macOS.
Hope it helps.

I have recently installed it from source, and below are all the steps needed to install it from source with the mentioned instruction sets available.
Other answers already describe why those messages are shown. My answer gives a step-by-step on how to install, which may help people struggling with the actual installation as I did.
Install Bazel
Download it from one of their available releases, for example 0.5.2.
Extract it, go into the directory and compile it: bash ./compile.sh.
Copy the executable to /usr/local/bin: sudo cp ./output/bazel /usr/local/bin
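To confirm the copy worked, running bazel version should now print the release you just built.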
Install Tensorflow
Clone tensorflow: git clone https://github.com/tensorflow/tensorflow.git
Go to the cloned directory to configure it: ./configure
It will prompt you with several questions; below I have suggested a response to each of them. You can, of course, choose your own responses as you prefer:
Using python library path: /usr/local/lib/python2.7/dist-packages
Do you wish to build TensorFlow with MKL support? [y/N] y
MKL support will be enabled for TensorFlow
Do you wish to download MKL LIB from the web? [Y/n] Y
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? [Y/n] n
jemalloc disabled
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N] N
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] N
No XLA JIT support will be enabled for TensorFlow
Do you wish to build TensorFlow with VERBS support? [y/N] N
No VERBS support will be enabled for TensorFlow
Do you wish to build TensorFlow with OpenCL support? [y/N] N
No OpenCL support will be enabled for TensorFlow
Do you wish to build TensorFlow with CUDA support? [y/N] N
No CUDA support will be enabled for TensorFlow
The pip package. To build it, you have to specify which instruction sets you want (you know, the ones TensorFlow informed you are missing).
Build pip script: bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.1 --copt=-msse4.2 -k //tensorflow/tools/pip_package:build_pip_package
Build pip package: bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
Install Tensorflow pip package you just built: sudo pip install /tmp/tensorflow_pkg/tensorflow-1.2.1-cp27-cp27mu-linux_x86_64.whl
Now next time you start up Tensorflow it will not complain anymore about missing instructions.
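As a quick sanity check (TF 1.x API, matching the wheel built above), importing and running a trivial graph should now produce no instruction-set warnings:
python -c "import tensorflow as tf; print(tf.Session().run(tf.constant('Hello, optimized TF')))"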

This is the simplest method. Only one step.
It has a significant impact on speed. In my case, the time taken for a training step almost halved.
Refer to: custom builds of tensorflow

I put together a small Bash script for Mac (it can easily be ported to Linux) to retrieve all CPU features and apply some of them to build TF. I'm on TF master and use it fairly often (a couple of times a month).
https://gist.github.com/venik/9ba962c8b301b0e21f99884cbd35082f

To compile TensorFlow with SSE4.2 and AVX, you can use directly
bazel build --config=mkl \
  --config="opt" \
  --copt="-march=broadwell" \
  --copt="-O3" \
  //tensorflow/tools/pip_package:build_pip_package
Source:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/Dockerfile.devel-cpu-mkl

2.0 COMPATIBLE SOLUTION:
Execute the below commands in Terminal (Linux/MacOS) or in Command Prompt (Windows) to install Tensorflow 2.0 using Bazel:
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
#The repo defaults to the master development branch. You can also checkout a release branch to build:
git checkout r2.0
#Configure the Build => Use the Below line for Windows Machine
python ./configure.py
#Configure the Build => Use the Below line for Linux/MacOS Machine
./configure
#This script prompts you for the location of TensorFlow dependencies and asks for additional build configuration options.
#Build Tensorflow package
#CPU support
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
#GPU support
bazel build --config=opt --config=cuda --define=no_tensorflow_py_deps=true //tensorflow/tools/pip_package:build_pip_package
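After the build finishes, you still have to generate and install the wheel, just as in the answers above (the exact .whl filename depends on your version and platform):
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/NAME_OF_WHEEL.whl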

When building TensorFlow from source, you'll run the configure script. One of the questions that the configure script asks is as follows:
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]
The configure script will attach the flag(s) you specify to the bazel command that builds the TensorFlow pip package. Broadly speaking, you can respond to this prompt in one of two ways:
If you are building TensorFlow on the same CPU type as the one on which you'll run TensorFlow, then you should accept the default (-march=native). This option will optimize the generated code for your machine's CPU type.
If you are building TensorFlow on one CPU type but will run TensorFlow on a different CPU type, then consider supplying a more specific optimization flag as described in the gcc documentation.
After configuring TensorFlow as described in the preceding bulleted list, you should be able to build TensorFlow fully optimized for the target CPU just by adding the --config=opt flag to any bazel command you are running.

To hide those warnings, you could do this before your actual code:
import os
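# '2' filters out INFO and WARNING messages; '3' would filter ERROR too.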
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf

Related

GPU processing - cuDF install problem (O/S or hardware issue?)

My aim is to explore GPU acceleration for tabular data with 10,000 to 10M+ records. I am most familiar with Pandas, so cuDF seems like a good place to start.
I'm finding mixed results re: whether cuDF will run on my system (Windows 7 Pro 64-bit, i7-6820HQ, 32GB RAM, NVIDIA Quadro M2000M 4GB). There is also an onboard graphics card.
Per the GitHub page (https://github.com/rapidsai/cudf):
CUDA/GPU Requirements
CUDA 10.0+ (YES - I have v10.1.120)
NVIDIA driver 410.48+ (YES - I have 432.06)
Pascal architecture or better (NO - Maxwell)
I have heard that Pascal architecture is preferred/optimal as opposed to a requirement, but maybe that was for older versions of cuDF? Just this morning I heard it will run on Win 64, though performance benefits may also be reduced. Nonetheless, I'm interested in giving it a shot.
When I install from the conda prompt (python 3.6 env) using the recommended command for my CUDA version:
conda install -c rapidsai -c nvidia -c numba -c conda-forge cudf=0.13 python=3.6 cudatoolkit=10.1
I get:
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from
current channels:
cudf=0.13
Current channels:
https://conda.anaconda.org/rapidsai/win-64
https://conda.anaconda.org/rapidsai/noarch
https://conda.anaconda.org/nvidia/win-64
https://conda.anaconda.org/nvidia/noarch
https://conda.anaconda.org/numba/win-64
https://conda.anaconda.org/numba/noarch
https://conda.anaconda.org/conda-forge/win
https://conda.anaconda.org/conda-forge/noa
https://repo.anaconda.com/pkgs/main/win-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/win-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/msys2/win-6
https://repo.anaconda.com/pkgs/msys2/noarc
To search for alternate channels that may provide the conda package
you're looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
When I go to anaconda.org and search for cuDF (or RAPIDS), all I find are Linux installs.
I attended an Anaconda-sponsored webinar earlier today where the speaker said it'll run in Win-64, though this older post suggests maybe I need to build from source:
Package not found error while installing CuSpatial or CuDf library
I'm not ready to attempt a build from source. Am I just wasting my time? Recommendations appreciated (for either resolving cuDF with my system or alternative packages).
cuDF maintainer here.
Currently, neither cuDF nor any other RAPIDS library is supported in a native Windows environment. There's an issue tracking Windows support here: https://github.com/rapidsai/cudf/issues/28.
In general, native Windows support is not a priority for us, especially given the push towards GPU support in WSL2 that is currently in open beta.
Apparently there is some news regarding this. Here one can find the guide for using NVIDIA CUDA on Windows Subsystem for Linux.
Getting started with running CUDA on WSL requires you to complete
these steps in order:
1. Installing the latest builds from the Microsoft Windows Insider Program
2. Installing the NVIDIA preview driver for WSL 2
3. Installing WSL 2
Important note regarding the installation of the latest builds from the Microsoft Windows Insider Program
Ensure that you install Build version 20145 or higher.
You can check your build version number by running winver via the Windows Run command. (Source)
Hopefully next year a version of Windows that meets the Build version 20145 or higher requirement will be released, and then one won't need to run an "Insider Program" build.
Source for Windows 10 release information.
Here one will be able to follow all the updates regarding the Support for Windows.

Update Tensorflow binary in virtual environment in PyCharm to use AVX2

My question is related to this one here, but I am using PyCharm and I set up my virtual environment with Python interpreter according to this guide, page 5.
When I run my tensorflow code, I get the warning:
Your CPU supports instructions that this TensorFlow binary was not
compiled to use: AVX2
I could ignore it, but since my model fitting is quite slow, I would like to take advantage of it. However, I do not know how to update the TensorFlow binary in this PyCharm virtual environment setup to make use of AVX2.
Anaconda/conda as package management tool:
Assuming that you have installed anaconda/conda on your machine, if not follow this - https://docs.anaconda.com/anaconda/install/windows/
conda create --name tensorflow_optimized python=3.7
conda activate tensorflow_optimized
# you need intel's tensorflow version that's optimized to use SSE4.1 SSE4.2 AVX AVX2 FMA
conda install tensorflow-mkl -c anaconda
#run this to check if the installed version is using MKL,
#which in turns uses all the optimizations that your system provide.
python -c "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"
# you should see something like this as the output.
2020-07-14 19:19:43.059486: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
pip3 as package management tool:
py -m venv tensorflow_optimized
.\tensorflow_optimized\Scripts\activate
#once the env is activated, you need intel's tensorflow version
#that's optimized to use SSE4.1 SSE4.2 AVX AVX2 FMA
pip install intel-tensorflow
#run this to check if the installed version is using MKL,
#which in turns uses all the optimizations that your system provide.
py -c "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"
# you should see something like this as the output.
2020-07-14 19:19:43.059486: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
Once you have this, you can use this env in PyCharm.
Before that, run where python on Windows or which python on Linux/Mac while the env is activated; it should give you the path to the interpreter. Then, in PyCharm:
Go to Preferences -> Project: your project name -> Project Interpreter -> click on the settings symbol -> click on Add.
Select System Interpreter -> click on ... -> this opens a popup window asking for the location of the python interpreter.
In the location path, paste the path printed by where python -> click OK.
Now you should see all the packages installed in that env.
Next time you want to select that interpreter for your project, click on the lower right corner where it says python3/python2 (your interpreter name) and select the one you need.
I'd suggest installing Anaconda as your default package manager, as it makes your dev life easier with respect to Python on a Windows machine, but you can make do with pip as well.
If your CPU utilization during training stays under 100% most of the time, you should not even bother getting a different TF binary.
You might not see much, if any, benefit from using AVX2 (or AVX512 for that matter), depending on the workload you are running.
AVX2 is a set of CPU vector instructions operating on 256-bit registers. Chances are, you can get at most a 2x benefit compared to 128-bit streaming instructions. Deep learning models are very much memory-bandwidth bound and would see little, if any, benefit from switching to larger register sizes. An easy way to check: see how long your CPU utilization stays at 100% during training. If most of the time it is under 100%, you are probably memory- (or otherwise) bound already. If your training runs on a GPU and the CPU is used only for data preprocessing and occasional operations, the benefit will be even less noticeable.
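A minimal sketch of that utilization check, assuming psutil is installed (run it in a separate terminal while training is underway):
import psutil

# Sample overall CPU utilization once per second for one minute.
samples = [psutil.cpu_percent(interval=1.0) for _ in range(60)]
print("mean utilization: %.1f%%" % (sum(samples) / len(samples)))
print("seconds pegged above 95%%: %d" % sum(s > 95 for s in samples))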
Back to answering your question: the best way to update the TF binary to get the most out of the latest CPU architecture, CUDA version, Python version, etc. is to build TensorFlow from source, which might take a few hours of your time. That is the official and most robust way of solving your issue.
If you would be satisfied with using better CPU instructions, you can try installing different third-party binaries from wherever you can find them. Installing conda and pointing the PyCharm interpreter to the conda installation would be one of the options.

Tensorflow OMP: Error #15 when training

I am training my neural network using tensorflow on CentOS HPC. However I got this error at start of the training process:
OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
The code is for instance segmentation and it worked fine for many people, but fails in my case.
Why does it occur? How can I solve it?
I had a similar issue on macOS with the same error message (see this question) and found the following reasons:
Problem:
I had a conda environment where Numpy, SciPy and TensorFlow were installed.
Conda is using Intel(R) MKL Optimizations, see docs:
Anaconda has packaged MKL-powered binary versions of some of the most popular numerical/scientific Python libraries into MKL Optimizations for improved performance.
The Intel MKL functions (e.g. FFT, LAPACK, BLAS) are threaded with the OpenMP technology.
But on macOS you do not need MKL, because the Accelerate Framework comes with its own optimization algorithms and already uses OpenMP. That is the reason for the error message: OMP Error #15: ...
Workaround:
You should install all packages without MKL support:
conda install nomkl
and then use
conda install numpy scipy pandas tensorflow
followed by
conda remove mkl mkl-service
For more information see conda MKL Optimizations.
I solved this problem by asking an HPC server expert. This may be useful for Compute Canada system users.
Why does it occur?
This error is due to a conflict between a pre-built TensorFlow Python wheel (which is specific to the Compute Canada system) and the conda environment.
Quote: "conda is always a bit problematic because it downloads precompiled binaries, mileage may vary..."
How to solve it?
As @abccd pointed out, "The best thing to do is to ensure that only a single OpenMP runtime is linked into the process". However, I haven't figured out how to ensure that.
So I uninstalled conda and installed everything in the module system using pip install. Then the network works fine.
I solved it, as suggested by the message, by adding:
import os
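# Unsafe workaround (see the OMP hint above): allow duplicate OpenMP runtimes.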
os.environ['KMP_DUPLICATE_LIB_OK']='True'
Simply downgrading my version of TensorFlow using Anaconda did it for me.

Tensorflow Object Detection API on Windows

Tensorflow recently released their new object detection API. Is there any way to run this on Windows? The directions appear to be for Linux.
Yes, you can run the Tensorflow Object Detection API on Windows. Unfortunately it is a bit tricky and the official documentation does not reflect that appropriately. I used the following procedure:
Install Tensorflow natively on Windows with Anaconda + CUDA + cuDNN. Note that TF 1.5 is now built against CUDA 9.0, so make sure you download the appropriate versions.
Then you clone the repository and build the Protobuf files as described in the tutorial, but beware, there is a bug in Windows Protobuf 3.5, so make sure you use version 3.4.
cd [TF-models]\research
protoc.exe object_detection/protos/*.proto --python_out=.
Finally, you need to build and install the packages with
cd [TF-models]\research\slim
python setup.py install
cd [TF-models]\research
python setup.py install
If you get the exception error: could not create 'BUILD': Cannot create a file when that file already exists here, delete the BUILD file inside first; it will be re-created automatically.
And make the built packages available on your Python path, or simply copy the directories slim and object_detection to your [Anaconda3]/Lib/site-packages directory.
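A quick way to verify the packages are importable (assuming the standard package layout installed by the setup.py steps above) is to run this from a fresh prompt:
python -c "import object_detection"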
To see everything put together, check out our Music Object Detector, which was trained on Windows and Linux.
We don't officially support the Tensorflow Object Detection API, but some external users have gotten it to work.
Our dependencies are pillow, lxml, jupyter, matplotlib and protobuf compiler. You can download a version of the protobuf compiler here. The remaining dependencies can be installed with pip.
As I said on the other post, you can use your local GPU in windows, as Tensorflow supports GPU on python.
And here is an example.
Unfortunately, TensorFlow does not support tensorflow-serving on Windows. Also, as you said, Nvidia-Docker is not supported on Windows, and Bash on Windows has no support for GPU either. So I think this is the only easy way to go for now.
The tutorial below was built specifically for using the TensorFlow Object Detection API on Windows. I've successfully used it many times:
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10

Illegal instruction: 4 when importing python plugins

I tried to install the hoomd_script molecular dynamics software on my iMac (an iMac from before 2009; the system is OS X El Capitan v10.11.3). I compiled it successfully, but when I import hoomd_script in Python 2.7.12, Python crashes completely and I get the error:
Illegal instruction: 4.
I have installed all the prerequisites packages (including boost, sphinx, git, mpich2, numpy, cmake, pkg-config, sqlite) using conda.
I ran python -vc 'hoomd_script' to test, and the result is here. I tried reinstalling all the packages, including conda, and recompiling hoomd, but nothing changed. I wonder how I can fix this. Thanks!
As stated on the HOOMD-blue web page, the conda builds require a CPU capable of AVX instructions (2011 or newer). The illegal instruction results because you are trying to execute an instruction that your processor does not support.
Compiling hoomd from a clean build directory on your system should result in a binary that your system can execute. Note that conda-provided prerequisite libraries are difficult to work with: I recommend using MacPorts or Homebrew.
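A minimal sketch of that clean out-of-source build (the generic CMake flow; the HOOMD-specific CMake options are in its install docs, so treat this as a placeholder):
cd <hoomd source directory>
mkdir build && cd build
cmake ..       # add HOOMD-specific options here as needed
make -j4
sudo make install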
