I want to do cloud machine learning.
Let’s try this on Amazon Web Services and see what’s awful.
I don’t want to do anything fancy here, just process a few gigabytes of MP3 data. My data is stored in the AARNET owncloud server. It’s all quite simple, but the algorithm is just too slow without a GPU and I don’t have a GPU machine I can leave running. I’ve developed it in keras v1.2.2, which depends on tensorflow 1.0.
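For reference, those pins as a requirements.txt sketch (the keras pin is from my setup; the exact tensorflow-gpu point release is my assumption of the matching release):

```
keras==1.2.2
tensorflow-gpu==1.0.1
```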
I was trying to use google for this, but I got lost in working out their big-data-optimised algorithms and then discovered they weren’t even going to save me any money over Amazon, so I may as well just take the easy route and do some amazon thing. Gimme a fancy computer with no fuss please, Amazon. Let me run my tensorflow.
Howto guide from bitfusion, and the Keras run-through.
If you want to upload or config locally, you should probably get the AWS CLI.
You will need to set a password to use X11 GUIs.
Attempt 1: Ubuntu 14.04
I will use the elderly and unloved Ubuntu NVIDIA images, since they support owncloud.
First we fire up
tmux to persist jobs between network implosions.
Now, install some necessary things:
Great. That all works
Oh, that segfaults. So perhaps they don’t support Owncloud. Bugger it, I’ll download my data manually. Let me at the actual calculations.
Huh, a 401 error. Hmm.
rsync from my laptop. While that’s happening, I’ll upgrade Tensorflow.
Oh no, turns out the shipped NVIDIA libs are too old for Tensorflow 1.0. (i.e. version 7.5 instead of the required 8.0). GOTO NVIDIA’s CUDA page, and embark upon a complicated install procedure. Oh wait, I need to register first.
<much downloading drivers and running mysterious install scripts omitted, after which it seems to claim to work.>
Oh, it’s missing
ffmpeg. How about I fix that with some completely unverified packages from some guy on the internet? I could have compiled it myself, I guess?
Now I run my code.
Well, that bit kinda worked, except that now my tensorflow instance can’t see the video drivers at all. There’s no error, it just doesn’t see the GPU.
So I’m paying money for no reason; this calculation in fact goes slightly faster on my laptop, for which I only pay the cost of electricity.
Bugger it, I’ll try to use the NVIDIA-supported AMI. That will be sweet, right?
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   32C    P0    35W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Summary: This turned out to be a terrible idea. The machine ultimately doesn’t include the updated GPU libraries I need for Tensorflow 1.0, I still had to join NVIDIA’s developer program and download gigabytes of crap, and then I broke it. If you are going to do that, you may as well just go to Ubuntu 16.04 and at least have modern libraries. Or Amazon Linux; see below.
Attempt 2: Amazon Linux AMI
Firstly, we need
tmux for persistent jobs.
tmux doesn’t work on Amazon Linux? Maybe they have no users with persistent remote jobs?
Uhhh. OK, well I’ll ignore that for now and install ffmpeg to analyse those MP3s.
The forums recommend downloading some guy’s ffmpeg builds. (extra debugging info here) Or maybe you can install it from a repo?
ARGH my session just froze and I can’t resume it because I have no tmux. Bugger this for a game of soldiers.
Attempt 3: Ubuntu 16.04
We start with a recent Ubuntu AMI. Unfortunately I’m not allowed to run that on GPU instances.
Attempt 4: The original Ubuntu 14.04 image but I’ll take a deep breath and do the GPU driver install properly
Back to the elderly and unloved Ubuntu NVIDIA images.
Maybe I can do a cheeky upgrade?
No, that’s too terrifying.
OK, careful probing reveals that the Amazon G2 instances have NVIDIA GRID K520 GPUs. NVIDIA doesn’t list them on their main overview page, but careful searching will turn up a link to a driver numbered
367.57, so I’m probably looking for a driver number like that. And “compute capability” 3.0, I learnt from internet forums.
This is getting silly.
Hmm, maybe I can hope my code is Tensorflow 0.12 compatible?
sudo apt-get update
sudo apt-get install python-pip python-dev python-virtualenv virtualenvwrapper
sudo apt-get install python3-pip python3-dev python3-virtualenv
virtualenv --system-site-packages ~/lg_virtualenv/ --python=`which python3`
source ~/lg_virtualenv/bin/activate
~/lg_virtualenv/bin/pip install --upgrade pip  # or weird install errors
~/lg_virtualenv/bin/pip install audioread librosa jupyter  # I think this will be fine for my app?
jupyter notebook --port=9188 workbooks
Oh crap. Turns out the version of scipy in this virtualenv is arbitrarily broken and won’t import:
What? OK, that looks like some obsolete version of scipy.
AAAAAAAAND now tensorflow is broken, because the
scipy upgrade broke
numpy, and I get
RuntimeError: module compiled against API version 0xa but this version of numpy is 0x9.
OK, let’s see if I can get my virtualenv to use everything compiled from the parent distro, which will require me to work out how to set up jupyter to use a virtualenv kernel:
NB that still breaks scipy whose elderly version on this computer (0.13.1) seems to be bad at virtualenv.
OK, how about I forcibly inject my code into the system python install? If not, I’ll recompile Tensorflow.
Yes, that works. So there is some stupid interaction between jupyter and scipy and virtualenv.
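For anyone hitting the same thing, here’s a quick way to see which interpreter is running and which copies of the libraries would actually be loaded (a diagnostic sketch, standard library only):

```python
import sys
import importlib.util

# Which Python is this, and where would numpy / scipy be imported from?
print("interpreter:", sys.executable)
for mod in ("numpy", "scipy"):
    spec = importlib.util.find_spec(mod)
    print(mod, "->", spec.origin if spec else "not importable from here")
```

Run it once in the virtualenv shell and once inside a Jupyter kernel; if the paths disagree, that’s your stupid interaction right there.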
I don’t care, this is day 4 of my attempt to boot up a GPU and get some work done. What is the filthiest most stupid possible solution which will make it clear to my advisor that I’m not spending all day masturbating to NASCAR?
OK, I repeat the magic
ffmpeg incantation from before:
Now I can run my code! Is it Tensorflow 0.12 compatible? I fix my dependencies to keras 1.2.0 and give it a go:
Ah, so I do need Keras 1.2.2 unless I want to spend time working out why my code breaks on the older version.
This is what my Tensors should look like:
And this is what they actually look like
Something stupid has happened to the batch versus normal dimensions.
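To make the failure concrete, here’s a sketch in plain Python (shape numbers entirely made up) of the kind of mismatch I mean:

```python
# Hypothetical shapes: suppose the model expects (batch, timesteps, features)
# but the older Keras hands tensors around as (timesteps, batch, features).
expected = (32, 128, 20)   # (batch, timesteps, features) -- made-up numbers
actual = (128, 32, 20)     # batch and timesteps swapped

# The dimensions are a permutation of each other, so nothing fails loudly...
assert sorted(actual) == sorted(expected)
# ...but the batch axis is in the wrong place, and downstream layers choke.
assert actual != expected
```

The total element count is identical, which is why nothing complains until a layer that actually cares about axis order falls over.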
OK, I don’t care, I’m not a software guy. Time to recompile tensorflow.
sudo apt-get install libcupti-dev
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install bazel
git clone https://github.com/tensorflow/tensorflow
pushd tensorflow
git checkout r1.0
./configure  # NB CUDA compute capability 3.0
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
Uhhhhhh turns out that crashes after not finding the right CUDA stuff. But where is the CUDA stuff on this system? Who knows? It’s not documented.
OK, how about I reinstall all the cuDNN and CUDA nonsense?
Attempt 5: Ubuntu 16.04 I found on the internet somewhere plus complete reinstall of everything ever
Since the P2 Amazon instances have Tesla K80 GPUs, which are better documented and possibly better supported, I ditch everything. I search for
hvm backed Ubuntu 16.04 images in the Amazon Community AMI marketplace.
Eventually I find one which looks legit, but honestly, who knows? It could be full of spyware. Because of the clunky AWS EC2 design I can’t even easily link to it here, so let’s pass over that in silence and let future explorers make their own malware mistakes. Anyway,
hvm AMIs are allowed to access the GPU, so I grab a
p2.xlarge instance and whack Ubuntu 16.04.2 on it.
Now! Boot time!
The P2 instances are probably worth compiling for so you can use all their sweet hardware to full advantage, so at least we’ll feel good about recompiling and wasting yet more time.
Here is the walkthrough I’ll follow when I need to do this. I mostly follow that one, but their advice about
bazel versions is outdated. Here is an alternative version, and the basic Ubuntu, non-AMI NVIDIA driver version. But wait! They are all somewhat altered by the new NVIDIA Drivers PPA for Ubuntu.
Which damn driver? Let’s try to reverse engineer it from the unix driver page, or the search page.
375.39 seems to be the goods.
NB I also have to download the cuDNN libraries from developer.nvidia.com separately and upload them again.
wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64.deb
sudo dpkg -i libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
sudo apt-get update
sudo apt-get install cuda
sudo apt-get install libcupti-dev
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt-get --no-install-recommends install nvidia-375
sudo apt-get update && sudo apt-get install bazel
sudo apt-get install python3-pip python3-dev python3-virtualenv
sudo apt-get install ffmpeg owncloud-client-cmd  # finally.
sudo pip3 install jupyter librosa pydot_ng audioread numpy scipy seaborn keras==1.2.2
I put this stuff in. For the ./configure step, I need to know that cuDNN ended up in /usr/lib/x86_64-linux-gnu, CUDA in /usr/local/cuda, and the Python library path in /usr/local/lib/python3.5/dist-packages. The compute capability of the K80 is 3.7; if you want to use the G2 instances as well, it might run if you also generate the 3.0 version, although I haven’t tested that.
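Those discoveries, written down as the answers I feed to ./configure (a sketch: these are the environment variables the TF-1.0-era configure script reads, with the paths found above; adjust to taste):

```shell
export TF_NEED_CUDA=1
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export CUDNN_INSTALL_PATH=/usr/lib/x86_64-linux-gnu
export TF_CUDA_COMPUTE_CAPABILITIES=3.7,3.0  # K80 is 3.7; 3.0 added for the G2s, untested
export PYTHON_BIN_PATH=$(which python3)
```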
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package  # go and make a coffee, or perhaps a 3 course meal, because this takes an hour
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
sudo pip3 install ~/tensorflow_pkg/tensorflow-1.0.1-cp35-cp35m-linux_x86_64.whl
AAAAAH IT RUNS!
My total bill for this awful experience was
37.28 USD, and
approximately 32 hours of work, including the 10-odd hours I pissed against the wall trying google.
Now, hopefully my algorithm does something interesting.
Addendum: I couldn’t make owncloud authenticate and I’m bored of that, so I uploaded the results into an S3 bucket.
The magic policy commands for that are
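I won’t swear to my exact setup, but the shape of the thing is an IAM policy along these lines (a sketch; the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-results-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-results-bucket/*"
    }
  ]
}
```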
You can use this to do file sync via such commands as
However, AFAICT this can never actually delete files so it’s annoying. I will probably need to manage that with git-annex or rclone. See synchronising files.