How to compile TensorFlow-GPU on both Linux (WSL2) and Windows
This article records some key procedures for compiling TensorFlow-GPU on Linux (WSL2) and on Windows. Thanks to the convenience of MiniConda, we can abstract the compiling process into a number of steps that are almost independent of the operating system (platform). Therefore, this article is not a rehash of the official or third-party compiling guides available online, but a new integration based on Conda tools that gets rid of the tedious and annoying steps in those tutorials and further simplifies the whole process. To reduce ambiguity, user names in this article are replaced by `<username>` or `<wsl_username>`; e.g., a user folder `C:\Users\bob` becomes `C:\Users\<username>`.
NOTICE: To reflect how lightly the compiling process depends on the platform, the concrete details of each abstracted procedure on the different platforms are given side by side. This is quite different from the usual tutorials, which present all the procedures for one platform in succession without breaking them up. The content of this article may therefore feel uncomfortable to read, because readers need to keep an eye on the platform switches at all times. However, we still organize the text in this way to reflect the light platform dependence of the compiling process, and to reveal the core concepts and procedures of compilation.
Preamble
This preface is not very important and can be skipped.
Scientific research in machine learning and deep learning has always been limited by hardware and software providers. On the open-source framework side, we often cannot get the newest stable features because the platforms we use may not be supported by contributors, which undoubtedly slows the progress of scientific research. I always hope that compiling and using open-source frameworks can become more and more convenient, so that each of us researchers can put more energy into the research itself rather than into the tools, which in turn promotes the development of the community, speeds up support for new features and new platforms, and creates a win-win cycle of positive feedback.
Compiling comes down to nothing but sources, environment, and operations. The environment is the most difficult part to control, and it may be the key to a successful compilation. My environment may differ from the one in the official tutorials, and from those of this article's readers. So the real purpose of this article is to abstract and document the key steps of the compiling process, illustrating a complete compilation in one specific environment. This article does not guarantee, nor does it need to guarantee, that the process will succeed in other environments: we are supposed to see the essence through the phenomena, grasp the core principles of compiling, and solve the problems arising in our own environments.
Therefore, this article divides the compiling procedure into preparing, compiling, and troubleshooting. We cannot guarantee that all operations are ablation-tested, so some of them may be redundant. We hope readers can go through these processes and compile successfully in their own environments.
Preparing
First, we make some statements about the crucial environment components and tools and their versions. We try to keep these components' versions as new as possible, but not necessarily the latest, because the latest versions may incur compatibility issues. For example (see here): before 2022/12/18, the latest cuDNN 8.7.0.84 did not support CUDA 12.0, so we can only use CUDA 11.8 to stay compatible with cuDNN 8.7.0.84. Additionally (see here), starting with TensorFlow 2.11, CUDA builds are no longer supported on Windows, so on the Win11 platform we can only build and use TensorFlow-GPU 2.10.1.
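When pinning versions like this, a quick way to compare version strings in the shell is `sort -V`. The helper below (`version_lt` is a hypothetical name of ours, not part of any official tutorial) sketches how a setup script could guard against an incompatible CUDA toolkit:

```shell
# version_lt A B: succeed if version A sorts strictly before version B (GNU `sort -V` ordering)
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# cudnn 8.7.0.84 does not support CUDA 12.0, so refuse anything >= 12.0
cuda_version="11.8.0"
if version_lt "$cuda_version" "12.0"; then
    echo "CUDA $cuda_version is compatible with cudnn 8.7.0.84"
else
    echo "CUDA $cuda_version is too new for cudnn 8.7.0.84" >&2
fi
```

The same check works for any of the version constraints listed below by swapping in the relevant bound.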
The crucial environment statements are as follows.

On Linux (WSL2):

- shell: bash
- gcc tool chain: 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
- cuda with toolkits: 11.8.0
- cudnn: 8.7.0.84
- bazel: 5.2
- Python: 3.10
- TensorFlow: 2.11

On Windows:

- shell: PowerShell 7.3
- msvc tool chain: MSVC v143 (14.34.31933)
- windows 11 sdk: 10.0.22621.0
- cuda with toolkits: 11.8.0
- cudnn: 8.7.0.84
- bazel: 5.2
- Python: 3.10
- TensorFlow: 2.10
Then, we can install and configure the essential environment components and tools, including but not limited to `conda`, `cuda`, `cudnn`, and `bazel`. We assume that all the following operations are performed in a conda environment named `compile`.
Install and configure conda (Miniconda)
With the help of conda (MiniConda), we can easily compile TensorFlow-GPU on both WSL2 and Win11 in a very similar way, so the first step is to install and use conda on these two platforms.
Create a conda env named `compile`:

```bash
conda clean --all  # cleaning is a good way to reduce the chance of encountering weird bugs
conda create -n compile python=3.10  # Python version per the environment statements above
```
We can also refer to Miniconda — conda documentation for more details.
Install and configure cuda by conda
The full official tutorials can be found in the CUDA Quick Start Guide or the CUDA Toolkit Documentation.
On Linux (WSL2), we can refer to cuda-installation-guide-linux.
Install the gcc tool set:

```bash
sudo apt install build-essential
```
Install cuda from NVIDIA's channel:

```bash
conda activate compile
conda install cuda -c nvidia/label/cuda-11.8.0
```

Add the cuda components to `$PATH` and `$LD_LIBRARY_PATH`. Here we make full use of conda's per-environment variable configuration mechanism:

```bash
cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d  # `-p` creates multi-level directories
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_activate.sh
touch ./etc/conda/deactivate.d/env_deactivate.sh
```

Edit the above two scripts, e.g.:

```bash
code ./etc/conda
```
Edit `$CONDA_PREFIX/etc/conda/activate.d/env_activate.sh` as follows:

```bash
# For cuda-sample tests
export CUDA_PATH_CONDA_BACKUP="${CUDA_PATH:-}"
export CUDA_PATH=$CONDA_PREFIX
export LD_LIBRARY_PATH_CONDA_BACKUP="${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Set TF_CUDA_PATHS
# See line 570 of **\third_party\gpus\find_cuda_config.py:
# find_cuda_config() uses $TF_CUDA_PATHS as the base_paths to find all cuda and cudnn components
export TF_CUDA_PATHS_CONDA_BACKUP="${TF_CUDA_PATHS:-}"
export TF_CUDA_PATHS=$CONDA_PREFIX
```

NOTICE: There is no need to add the cuda binaries to `$PATH`, since they are located in `$CONDA_PREFIX/bin`, which is added to `$PATH` automatically by `conda activate`.

Edit `$CONDA_PREFIX/etc/conda/deactivate.d/env_deactivate.sh` as follows:

```bash
export CUDA_PATH=${CUDA_PATH_CONDA_BACKUP:-}
unset CUDA_PATH_CONDA_BACKUP
if [ -z "$CUDA_PATH" ]; then
    unset CUDA_PATH
fi
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH_CONDA_BACKUP:-}
unset LD_LIBRARY_PATH_CONDA_BACKUP
if [ -z "$LD_LIBRARY_PATH" ]; then
    unset LD_LIBRARY_PATH
fi
export TF_CUDA_PATHS=${TF_CUDA_PATHS_CONDA_BACKUP:-}
unset TF_CUDA_PATHS_CONDA_BACKUP
if [ -z "$TF_CUDA_PATHS" ]; then
    unset TF_CUDA_PATHS
fi
```

Re-activate the conda env:

```bash
conda activate compile
```
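The activate.d/deactivate.d hooks above all follow one backup-and-restore pattern. The self-contained sketch below demonstrates that pattern with a dummy variable `MY_VAR` and hypothetical helper names of ours (`demo_activate`/`demo_deactivate`); it is an illustration, not part of the real hooks:

```shell
# activate.d logic: back up the old value, then override it with the env-specific one
demo_activate() {
    export MY_VAR_CONDA_BACKUP="${MY_VAR:-}"
    export MY_VAR="/opt/conda/envs/compile"   # stand-in for $CONDA_PREFIX
}

# deactivate.d logic: restore the old value, and unset it entirely if it was empty
demo_deactivate() {
    export MY_VAR="${MY_VAR_CONDA_BACKUP:-}"
    unset MY_VAR_CONDA_BACKUP
    if [ -z "$MY_VAR" ]; then
        unset MY_VAR
    fi
}

export MY_VAR="original"
demo_activate
echo "inside env: $MY_VAR"        # prints the overridden value
demo_deactivate
echo "after deactivate: $MY_VAR"  # back to "original"
```

Because the backup variable records whether the original value was empty, deactivating the env leaves the shell exactly as it was before activation.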
Verify the installation:

```bash
cd $HOME
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make clean && make
./deviceQuery   # check if `Result = PASS`
cd ../bandwidthTest/
make clean && make
./bandwidthTest # check if `Result = PASS`
```

If something like `Result = PASS` is shown, the test has passed.

NOTICE: MLNX_OFED is skipped, since WSL2 may not support it. (NVMe devices are attached to Windows and mounted into WSL2, rather than attached to WSL2 directly.)
On Windows, we can refer to the CUDA Installation Guide for Microsoft Windows (nvidia.com).
Install the MSVC tools:

```powershell
winget search "visual studio"
winget install --id Microsoft.VisualStudio.2022.Community
```

Open `Visual Studio Installer` > Modify > choose to install `Desktop development with C++`. The necessary components are the `MSVC build tools` and the `Windows SDK`. Check that `"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\xx.xx.xxxxx"` exists.

Install cuda from NVIDIA's channel:
```powershell
conda activate compile
conda install cuda -c nvidia/label/cuda-11.8.0
```

Add the cuda components to `$PATH`. Here we make full use of conda's per-environment variable configuration mechanism:

```powershell
cd $Env:CONDA_PREFIX
New-Item .\etc\conda\activate.d -ItemType Directory
New-Item .\etc\conda\deactivate.d -ItemType Directory
New-Item .\etc\conda\activate.d\env_activate.ps1 -ItemType File
New-Item .\etc\conda\deactivate.d\env_deactivate.ps1 -ItemType File
```

Edit the above two scripts, e.g.:

```powershell
code $Env:CONDA_PREFIX\etc\conda
```
Edit `$Env:CONDA_PREFIX\etc\conda\activate.d\env_activate.ps1` as follows:

```powershell
[Environment]::SetEnvironmentVariable('CUDA_PATH_CONDA_BACK',"${Env:CUDA_PATH}")
[Environment]::SetEnvironmentVariable('CUDA_PATH',"${Env:CONDA_PREFIX}")
```

NOTICE: There is no need to add the cuda binaries to `$PATH`, since they are located in `$CONDA_PREFIX/bin`, which is added to `$PATH` automatically by `conda activate`.

Edit `$Env:CONDA_PREFIX\etc\conda\deactivate.d\env_deactivate.ps1` as follows:

```powershell
[Environment]::SetEnvironmentVariable('CUDA_PATH',"${Env:CUDA_PATH_CONDA_BACK}")
[Environment]::SetEnvironmentVariable('CUDA_PATH_CONDA_BACK',"")
```

Re-activate the conda env:

```powershell
conda activate compile
```
Verify the installation. It may be difficult to verify it by building and running `cuda-samples`, since on Windows these samples are designed for MSVC, and the above environment configuration is not directly suitable for the MSVC tools. Instead, we can use `nvcc -V` as a simple verification:

```powershell
$ nvcc -V # check if nvcc is recognized
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```
Install and configure cudnn in conda

The full official tutorials can be found in Installation Guide :: NVIDIA Deep Learning cuDNN Documentation or NVIDIA Deep Learning cuDNN Documentation. Here we choose downloading compressed packages to install cudnn into conda.
On Linux (WSL2), install `zlib` as in Installing Zlib on Linux:

```bash
sudo apt install zlib1g
```
Install cudnn into the conda env (i.e., copy the cudnn components into the conda env). We can refer to Installing cuDNN on Linux. To download cuDNN, make sure we have registered for the NVIDIA Developer Program. Then, go to the NVIDIA cuDNN home page to download the `tar` file named `cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz`, and move it to the `$HOME` directory in any way. (Downloading this cudnn `tar` file requires logging in to an NVIDIA account in the browser, so we cannot simply fetch it with tools such as `wget`.)

```bash
conda activate compile
cd $HOME
tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz
cp cudnn-*-archive/include/* $CONDA_PREFIX/include/
cp -P cudnn-*-archive/lib/* $CONDA_PREFIX/lib/
sudo chmod a+r $CONDA_PREFIX/include/cudnn*.h $CONDA_PREFIX/lib/libcudnn*
```

Add the cudnn components to `$PATH` and `$LD_LIBRARY_PATH`, again using conda's per-environment variable configuration mechanism.

Open `$CONDA_PREFIX/etc/conda/activate.d/env_activate.sh` and check that the following exists:

```bash
export CUDA_PATH_CONDA_BACKUP="${CUDA_PATH:-}"
export CUDA_PATH=$CONDA_PREFIX
export LD_LIBRARY_PATH_CONDA_BACKUP="${LD_LIBRARY_PATH:-}"
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Open `$CONDA_PREFIX/etc/conda/deactivate.d/env_deactivate.sh` and check that the following exists:

```bash
export CUDA_PATH=${CUDA_PATH_CONDA_BACKUP:-}
unset CUDA_PATH_CONDA_BACKUP
if [ -z "$CUDA_PATH" ]; then
    unset CUDA_PATH
fi
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH_CONDA_BACKUP:-}
unset LD_LIBRARY_PATH_CONDA_BACKUP
if [ -z "$LD_LIBRARY_PATH" ]; then
    unset LD_LIBRARY_PATH
fi
```

Re-activate the conda env:

```bash
conda activate compile
```
Verify the installation. Since the above installation method does not install `cudnn_samples`, we can get the code from a community mirror:

```bash
conda activate compile
cd $HOME
git clone https://github.com/johnpzh/cudnn_samples_v8.git
cd cudnn_samples_v8/mnistCUDNN/
make clean && make
./mnistCUDNN  # check if `Test passed!`
```
On Windows, install `zlib-wapi` as in Installing Zlib on Windows. However, an easier way is:

```powershell
conda activate compile
conda install zlib-wapi -c conda-forge
```
Install cudnn into the conda env (i.e., copy the cudnn components into the conda env). We can refer to Installing cuDNN on Windows. To download cuDNN, make sure we have registered for the NVIDIA Developer Program. Then, go to the NVIDIA cuDNN home page to download the `zip` file named `cudnn-windows-x86_64-8.x.x.x_cudaX.Y-archive.zip`, and unzip it to a path of your choice `$cudnn_path`, such as `"$cudnn_path=E:\Nvidia\Cudnn\cudnn-windows-x86_64-8.7.0.84_cuda11-archive"`.

```powershell
conda activate compile
$cudnn_path="E:\Nvidia\Cudnn\cudnn-windows-x86_64-8.7.0.84_cuda11-archive"
Copy-Item $cudnn_path\bin\* $Env:CONDA_PREFIX\bin\
Copy-Item $cudnn_path\include\* $Env:CONDA_PREFIX\include\
Copy-Item $cudnn_path\lib\x64\* $Env:CONDA_PREFIX\Lib\x64\
```

Add the cudnn components to `$PATH`, again using conda's per-environment variable configuration mechanism.

Open `$Env:CONDA_PREFIX\etc\conda\activate.d\env_activate.ps1` and check that the following exists:

```powershell
[Environment]::SetEnvironmentVariable('CUDA_PATH_CONDA_BACK',"${Env:CUDA_PATH}")
[Environment]::SetEnvironmentVariable('CUDA_PATH',"${Env:CONDA_PREFIX}")
```

Open `$Env:CONDA_PREFIX\etc\conda\deactivate.d\env_deactivate.ps1` and check that the following exists:

```powershell
[Environment]::SetEnvironmentVariable('CUDA_PATH',"${Env:CUDA_PATH_CONDA_BACK}")
[Environment]::SetEnvironmentVariable('CUDA_PATH_CONDA_BACK',"")
```

Re-activate the conda env:

```powershell
conda activate compile
```

Verify the installation. It may be difficult to verify it by building and running `cudnn_samples_vX` on Windows directly, so here we do not run the samples and just check whether `cudnn.h` has been copied into the conda env's directory:

```powershell
Test-Path $Env:CONDA_PREFIX\include\cudnn.h
# Check if `True`
```
Install and configure bazel by conda

We can refer to the official tutorials, Installing Bazel on Ubuntu or Installing Bazel on Windows, to install bazel. But the easiest way is through conda.
On Linux (WSL2):

```bash
conda install bazel perl bash patch unzip
```

Then, check whether the corresponding hook scripts were added to the env:

```bash
code $CONDA_PREFIX/etc/conda
```

Check whether `$CONDA_PREFIX/etc/conda/activate.d/openjdk_activate.sh` exists with the following content; if it does not exist, add it manually:

```bash
export CONDA_BACKUP_JAVA_HOME="${JAVA_HOME:-}"
```

Check whether `$CONDA_PREFIX/etc/conda/deactivate.d/openjdk_deactivate.sh` exists with the following content; if it does not exist, add it manually:

```bash
export JAVA_HOME="${CONDA_BACKUP_JAVA_HOME}"
```

Then, we can check bazel's installation:

```bash
conda activate compile
bazel --version
```
On Windows:

```powershell
conda install bazel m2-perl m2-bash m2-patch m2-unzip
```

Then, add the following content to `$Env:CONDA_PREFIX\etc\conda\activate.d\env_activate.ps1`:

```powershell
[Environment]::SetEnvironmentVariable('JAVA_HOME_CONDA_BACKUP',"${Env:JAVA_HOME}")
```

And then, add the following content to `$Env:CONDA_PREFIX\etc\conda\deactivate.d\env_deactivate.ps1`:

```powershell
[Environment]::SetEnvironmentVariable('JAVA_HOME',"${Env:JAVA_HOME_CONDA_BACKUP}")
```

Then, we can check bazel's installation:

```powershell
conda activate compile
bazel --version
```
Compiling
Assuming we have gotten the TensorFlow source and fully configured the corresponding environment and tools, we now work in the conda environment `compile` to compile TensorFlow.

On Linux (WSL2):
```bash
cd $HOME
cd tensorflow  # enter the TensorFlow source directory obtained beforehand
```
Then we will encounter a command-line interaction (a configuration session) to configure bazel's building behavior, as in the following sample:

```bash
python ./configure.py
```
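The prompts of `configure.py` can also be pre-answered through environment variables that the script reads. A sketch, under the assumption that the variable names below match the `configure.py` in your checkout (the compute capability value is only an example and must match your own GPU):

```shell
# Pre-seed answers for ./configure.py (verify these names in your checkout's configure.py)
export TF_NEED_CUDA=1                     # build with CUDA support
export TF_CUDA_VERSION=11.8               # CUDA toolkit version
export TF_CUDNN_VERSION=8                 # cuDNN major version
export TF_CUDA_COMPUTE_CAPABILITIES=8.6   # example value; match your GPU
```

With `TF_CUDA_PATHS` already exported by the `activate.d` hook configured earlier, the configuration session should locate the conda-installed CUDA and cuDNN automatically.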
Then, build TensorFlow:

```bash
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
```

Then, build the package:

```bash
./bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg/cuda11.8
```

At last, install the package:

```bash
pip install ~/tensorflow_pkg/cuda11.8/tensorflow-<version>-<tags>.whl --force-reinstall
```
On Windows:

```powershell
# Run powershell with conda as administrator
```

Then we will encounter a command-line interaction (a configuration session) to configure bazel's building behavior, as in the following sample:

```powershell
python ./configure.py
```

Then, build TensorFlow:

```powershell
bazel build --config=opt --define=no_tensorflow_py_deps=true //tensorflow/tools/pip_package:build_pip_package
```

Then, build the package:

```powershell
./bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/Documents/Repository/tensorflow_pkg/cuda11.8
```

At last, install the package:

```powershell
pip install ~/Documents/Repository/tensorflow_pkg/cuda11.8/tensorflow-<version>-<tags>.whl --force-reinstall
```
Troubleshooting
We also partition the debugging methods according to platform.
On Linux (WSL2):

`...FreeImage.h: No such file or directory`, when verifying the cudnn installation on WSL2.

Error messages in detail:

```
test.c:1:10: fatal error: FreeImage.h: No such file or directory
    1 | #include "FreeImage.h"
      |          ^~~~~~~~~~~~~
compilation terminated.
```

Debug method (refer to here):

```bash
sudo apt install libfreeimage-dev
```
`.../lib/libstdc++.so.6: version 'GLIBCXX_3.4.30' not found`, when compiling tensorflow.

Error messages in detail:

```
ERROR: /home/<wsl_username>/tensorflow/tensorflow/core/transforms/BUILD:62:18: TdGenerate tensorflow/core/transforms/utils/pdll/PDLLUtils.h.inc failed: (Exit 1): mlir-pdll failed: error executing command bazel-out/k8-opt-exec-50AE0418/bin/external/llvm-project/mlir/mlir-pdll '-x=cpp' tensorflow/core/transforms/utils/pdll/utils.pdll -I ./ -I bazel-out/k8-opt/bin/./ -I ... (remaining 5 arguments skipped)
bazel-out/k8-opt-exec-50AE0418/bin/external/llvm-project/mlir/mlir-pdll: /home/<wsl_username>/.conda/envs/compile/lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by bazel-out/k8-opt-exec-50AE0418/bin/external/llvm-project/mlir/mlir-pdll)
```

Debug method (refer to here):

```bash
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX_3.4.30  # check if this shows nothing
rm $CONDA_PREFIX/lib/libstdc++.so.6
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.30  # check if this shows GLIBCXX_3.4.30
ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6 $CONDA_PREFIX/lib/libstdc++.so.6
```
`.../libtinfo.so.6: no version information available (required by /usr/bin/bash)`

Error messages in detail:

```
ERROR: An error occurred during the fetch of repository 'local_config_cuda':
   Traceback (most recent call last):
      File "/home/<wsl_username>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1406, column 38, in _cuda_autoconf_impl
         _create_local_cuda_repository(repository_ctx)
      File "/home/<wsl_username>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1244, column 56, in _create_local_cuda_repository
         host_compiler_includes + _cuda_include_path(
      File "/home/<wsl_username>/tensorflow/third_party/gpus/cuda_configure.bzl", line 363, column 36, in _cuda_include_path
         inc_entries.append(realpath(repository_ctx, target_dir))
      File "/home/<wsl_username>/tensorflow/third_party/remote_config/common.bzl", line 290, column 19, in realpath
         return execute(repository_ctx, [bash_bin, "-c", "realpath \"%s\"" % path]).stdout.strip()
      File "/home/<wsl_username>/tensorflow/third_party/remote_config/common.bzl", line 230, column 13, in execute
         fail(
Error in fail: Repository command failed
/usr/bin/bash: /home/<wsl_username>/.conda/envs/compile/lib/libtinfo.so.6: no version information available (required by /usr/bin/bash)
ERROR: /home/<wsl_username>/tensorflow/WORKSPACE:15:14: fetching cuda_configure rule //external:local_config_cuda: Traceback (most recent call last):
      File "/home/<wsl_username>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1406, column 38, in _cuda_autoconf_impl
         _create_local_cuda_repository(repository_ctx)
      File "/home/<wsl_username>/tensorflow/third_party/gpus/cuda_configure.bzl", line 1244, column 56, in _create_local_cuda_repository
         host_compiler_includes + _cuda_include_path(
      File "/home/<wsl_username>/tensorflow/third_party/gpus/cuda_configure.bzl", line 363, column 36, in _cuda_include_path
         inc_entries.append(realpath(repository_ctx, target_dir))
      File "/home/<wsl_username>/tensorflow/third_party/remote_config/common.bzl", line 290, column 19, in realpath
         return execute(repository_ctx, [bash_bin, "-c", "realpath \"%s\"" % path]).stdout.strip()
      File "/home/<wsl_username>/tensorflow/third_party/remote_config/common.bzl", line 230, column 13, in execute
         fail(
Error in fail: Repository command failed
/usr/bin/bash: /home/<wsl_username>/.conda/envs/compile/lib/libtinfo.so.6: no version information available (required by /usr/bin/bash)
INFO: Found applicable config definition build:cuda in file /home/<wsl_username>/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
ERROR: @local_config_cuda//:enable_cuda :: Error loading option @local_config_cuda//:enable_cuda: Repository command failed
/usr/bin/bash: /home/<wsl_username>/.conda/envs/compile/lib/libtinfo.so.6: no version information available (required by /usr/bin/bash)
```

Debug method (refer to here):

```bash
conda install -c conda-forge ncurses
```
`InternalError: libdevice not found at ./libdevice.10.bc...`

Error messages in detail:

```
# ...
InternalError: libdevice not found at ./libdevice.10.bc
# ...
Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
# ...
For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
# ...
```

Debug method (refer to here):

Open `$CONDA_PREFIX/etc/conda/activate.d/env_activate.sh` and add the following contents:

```bash
export XLA_FLAGS_CONDA_BACKUP="${XLA_FLAGS:-}"
export XLA_FLAGS="--xla_gpu_cuda_data_dir='$CONDA_PREFIX'"
```

Open `$CONDA_PREFIX/etc/conda/deactivate.d/env_deactivate.sh` and add the following contents:

```bash
export XLA_FLAGS=${XLA_FLAGS_CONDA_BACKUP:-}
unset XLA_FLAGS_CONDA_BACKUP
if [ -z "$XLA_FLAGS" ]; then
    unset XLA_FLAGS
fi
```
`error while loading shared libraries: libXXX.so.X.X`

Error messages in detail:

```
# ...
error while loading shared libraries:
# ...
libxxx.so.x.x: cannot open shared object file:
# ...
No such file or directory
# ...
```

These bugs all involve shared libraries, have many strange triggers, and are very difficult to reproduce in another environment; that is, they are not universal in a general sense. Therefore, users need to find a solution that works for their own specific situation, including but not limited to the system version and the software (tool) versions. One possibly useful method:

Open `$CONDA_PREFIX/etc/conda/activate.d/env_activate.sh` and add the following contents:

```bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/stubs${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```
Some other errors:

- Error messages in detail: errors indicating that some environment components are missing.
- Some possibly useful methods:
  - Run `conda clean --all` to clean conda, then redo the above operations.
  - Check `tensorflow/tensorflow/tools/pip_package/setup.py` to see whether the versions of `REQUIRED_PACKAGES` are compatible.
On Windows:

`Setting up VC environment variables failed`

Error messages in detail:

```
Setting up VC environment variables failed, WINDOWSSDKDIR is not set by the following command:
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\VCVARSALL.BAT" amd64 -vcvars_ver=14.33.31629
```

Debug method (refer to here): just do not forget to install the corresponding `Windows SDK` when installing the MSVC tools.
`InternalError: libdevice not found at ./libdevice.10.bc...`

Error messages in detail:

```
# ...
InternalError: libdevice not found at ./libdevice.10.bc
# ...
Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
# ...
For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
# ...
```

Debug method (refer to here):

Open `$CONDA_PREFIX\etc\conda\activate.d\env_activate.ps1` and add the following contents:

```powershell
[Environment]::SetEnvironmentVariable('XLA_FLAGS_CONDA_BACK',"${Env:XLA_FLAGS}")
[Environment]::SetEnvironmentVariable('XLA_FLAGS',"--xla_gpu_cuda_data_dir='${Env:CONDA_PREFIX}'")
```

Open `$CONDA_PREFIX\etc\conda\deactivate.d\env_deactivate.ps1` and add the following contents:

```powershell
[Environment]::SetEnvironmentVariable('XLA_FLAGS',"${Env:XLA_FLAGS_CONDA_BACK}")
[Environment]::SetEnvironmentVariable('XLA_FLAGS_CONDA_BACK',"")
```
VC version error:

Error messages in detail: errors indicating that the VC version is not compatible.

Debug method (refer to here):

Open `$CONDA_PREFIX\etc\conda\activate.d\env_activate.ps1` and add the following contents:

```powershell
[Environment]::SetEnvironmentVariable('BAZEL_VC_CONDA_BACK',"${Env:BAZEL_VC}")
# Assume that VC is in "C:\Program Files\Microsoft Visual Studio\2022\Community\VC"
[Environment]::SetEnvironmentVariable('BAZEL_VC',"C:\Program Files\Microsoft Visual Studio\2022\Community\VC")
[Environment]::SetEnvironmentVariable('BAZEL_VC_FULL_VERSION_CONDA_BACK',"${Env:BAZEL_VC_FULL_VERSION}")
# Assume that a suitable VC version is "14.34.31933"
[Environment]::SetEnvironmentVariable('BAZEL_VC_FULL_VERSION',"14.34.31933")
```

Open `$CONDA_PREFIX\etc\conda\deactivate.d\env_deactivate.ps1` and add the following contents:

```powershell
[Environment]::SetEnvironmentVariable('BAZEL_VC',"${Env:BAZEL_VC_CONDA_BACK}")
[Environment]::SetEnvironmentVariable('BAZEL_VC_CONDA_BACK',"")
[Environment]::SetEnvironmentVariable('BAZEL_VC_FULL_VERSION',"${Env:BAZEL_VC_FULL_VERSION_CONDA_BACK}")
[Environment]::SetEnvironmentVariable('BAZEL_VC_FULL_VERSION_CONDA_BACK',"")
```
Some other errors:

- Error messages in detail: errors indicating that some environment components are missing.
- Some possibly useful methods:
  - Run `conda clean --all` to clean conda, then redo the above operations.
  - Check `tensorflow/tensorflow/tools/pip_package/setup.py` to see whether the versions of `REQUIRED_PACKAGES` are compatible.
Post-Installation (Checking and Usage)

Check whether TensorFlow is built with CUDA, GPU, ROCm, and XLA support:

```python
import tensorflow as tf
print(tf.test.is_built_with_cuda())         # True or False
print(tf.test.is_built_with_gpu_support())  # True or False
print(tf.test.is_built_with_rocm())         # True or False
print(tf.test.is_built_with_xla())          # True or False
```

Enable oneDNN:

```python
import os
os.environ['TF_ENABLE_ONEDNN_OPTS'] = "1"
# see `oneDNN` in https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md
## To disable:
# os.environ['TF_ENABLE_ONEDNN_OPTS'] = "0"
import tensorflow as tf
```

Select a device and set memory growth:

```python
# see https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth
physical_devices = tf.config.list_physical_devices(device_type='GPU')
print(physical_devices)
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)
# see https://www.tensorflow.org/api_docs/python/tf/device
devices = tf.config.list_logical_devices(device_type='GPU')
with tf.device(devices[0].name):
    ...
```
Tips and References
Cuda:
Cudnn:
Tips:
- DirectML Plugin for TensorFlow 2 | Microsoft Learn
- NUMA Error running Tensorflow on Jetson Tx2 - Jetson & Embedded Systems / Jetson TX2 - NVIDIA Developer Forums
- Cuda on WSL2 for Deep Learning — First Impressions and Benchmarks | by Michael Phi | Towards Data Science
- libtinfo.so.6: no version information available message using conda environment - Stack Overflow
- Common protobuf version problems (Adenialzz's CSDN blog)
- tensorflow/setup.py at master · tensorflow/tensorflow · GitHub
- linux - Anaconda libstdc++.so.6: version 'GLIBCXX_3.4.20' not found - Stack Overflow
- Solving the libstdc++.so.6: version 'GLIBCXX_3.4.30' not found problem (CSDN blog)
- A summary of methods for solving /usr/lib/libstdc++.so.6: version 'GLIBCXX_3.4.21' not found (jack_ooneil's CSDN blog)
- tensorflow-directml-plugin/BUILD.md at main · microsoft/tensorflow-directml-plugin · GitHub