DL4S provides a high-level API for many accelerated operations common in neural networks and deep learning.
It furthermore has automatic differentiation built in, which allows you to create and train neural networks without manually
implementing backpropagation and without needing a special Swift toolchain.
Features include implementations for many basic binary and unary operators,
broadcasting, matrix operations, convolutional and recurrent neural networks,
commonly used optimizers, second derivatives and much more.
DL4S provides implementations for common network architectures, such as VGG, AlexNet, ResNet and Transformers.
While its primary purpose is deep learning and optimization, DL4S can also be used as a library for vectorized mathematical operations, similar to NumPy.
DL4S can be accelerated with Intel's Math Kernel Library (MKL), Integrated Performance Primitives (IPP) and OpenMP (see the installation instructions).
On Apple devices, DL4S uses vectorized functions provided by the built-in Accelerate framework by default.
If no acceleration library is available, a fallback implementation is used.
Compiling with MKL/IPP:
# After adding the APT repository as described in the installation instructions
sudo apt-get install intel-mkl-64bit-2019.5-075 intel-ipp-64bit-2019.5-075 libiomp-dev
export MKLROOT=/opt/intel/mkl
export IPPROOT=/opt/intel/ipp
export LD_LIBRARY_PATH=${MKLROOT}/lib/intel64:${IPPROOT}/lib/intel64:${LD_LIBRARY_PATH}
swift build -c release \
    -Xswiftc -DMKL_ENABLE \
    -Xlinker -L${MKLROOT}/lib/intel64 \
    -Xlinker -L${IPPROOT}/lib/intel64
TensorBoard Support
DL4S-Tensorboard provides a summary writer that can write TensorBoard-compatible logs.
LLDB Extension
DL4S includes an LLDB Python script that provides custom descriptions for tensors (util/debugger_support/tensor.py).
To use the enhanced summaries, execute command script import /path/to/DL4S/util/debugger_support/tensor.py
directly in LLDB, or add the command to your ~/.lldbinit file.
Then you can use the print or frame variable commands to print human-readable descriptions of tensors.
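For example, with the script imported, printing any tensor variable in a debugging session yields the formatted description (the variable names here are placeholders):

(lldb) frame variable myTensor
(lldb) print myTensor

Both commands pick up the custom summaries registered by tensor.py.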
Features
Layers
Core:
Convolution
Transposed Convolution
Dense/Linear/Fully Connected
LSTM
Gated Recurrent Unit (GRU)
Vanilla RNN
Embedding
Multi-head Attention
Transformer Block
Pooling:
Max Pooling
Average Pooling
Adaptive Max Pooling
Adaptive Average Pooling
Norm:
Batch Norm
Layer Norm
Utility:
Bidirectional RNNs
Sequential
Lambda
Dropout
Activation:
Relu
LeakyRelu
Gelu
Tanh
Sigmoid
Softmax
Log Softmax
Swish
Mish
LiSHT
Transformer:
Positional Encoding
Scaled Dot Product Attention
Multihead Attention
Pointwise Feed Forward
Transformer Encoder Block
Transformer Decoder Block
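As a minimal sketch of how these layers are used, each layer is generic over element type and compute device, and can be applied like a function (assuming layers are callable in the same way as the Sequential models in the examples below; the sizes here are illustrative):

import DL4S

// A single fully connected layer operating on Float tensors on the CPU
let layer = Dense<Float, CPU>(inputSize: 4, outputSize: 2)

// Apply the layer to a batch of two 4-element vectors
let input = Tensor<Float, CPU>([[1, 2, 3, 4], [5, 6, 7, 8]])
let output = layer(input) // shape: [2, 2]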
Optimizers
SGD
Momentum
Adam
AMSGrad
AdaGrad
AdaDelta
RMSProp
Losses
Binary Cross-Entropy
Categorical Cross-Entropy
Negative Log Likelihood (NLL Loss)
MSE
L1 & L2 regularization
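For example, the categorical NLL loss that is also used in the MNIST examples below takes class indices and log-probabilities (a sketch; the shapes and values are illustrative):

import DL4S

// Log-probabilities for a batch of two samples over three classes
let logProbs = Tensor<Float, CPU>([[-0.2, -1.8, -2.5], [-2.3, -0.4, -1.6]])
// True class indices for the two samples
let labels = Tensor<Int32, CPU>([0, 1])
let loss = categoricalNegativeLogLikelihood(expected: labels, actual: logProbs)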
Tensor Operations
The behavior of broadcasting operations is consistent with NumPy rules (see the sketch after this list).
broadcast-add
broadcast-sub
broadcast-mul
broadcast-div
matmul
neg
exp
pow
log
sqrt
sin
cos
tan
tanh
sum
max
relu
leaky relu
gelu
elu
elementwise min
elementwise max
reduce sum
reduce max
scatter
gather
conv2d
transposed conv2d
max pool
avg pool
subscript
subscript range
transpose
axis permute
reverse
im2col
col2im
stack / concat
swish activation
mish activation
lisht activation
diagonal matrix generation
diagonal extraction
band matrix generation
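A minimal broadcasting sketch (assuming the standard arithmetic operators perform the broadcast-add listed above):

import DL4S

let matrix = Tensor<Float, CPU>([[1, 2], [3, 4], [5, 6]]) // shape: [3, 2]
let row = Tensor<Float, CPU>([10, 20])                    // shape: [2]

// As in NumPy, the row vector is broadcast along the leading axis
let sum = matrix + row // [[11, 22], [13, 24], [15, 26]]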
Engines
CPU (Accelerate framework for Apple Devices)
CPU (Intel Math Kernel Library and Integrated Performance Primitives)
CPU (Generic)
GPU (ArrayFire: OpenCL, CUDA)
For an experimental, early-stage GPU accelerated version, check out the feature/arrayfire branch.
Architectures
Default implementations are provided for the following architectures:
ResNet18
VGG (11, 13, 16, 19)
AlexNet
Transformer
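As an illustration, instantiating one of these architectures might look like the following; the initializer parameters shown here are hypothetical, so the actual signature should be taken from the API documentation:

import DL4S

// Hypothetical initializer: consult the DL4S documentation for the real signature
let classifier = ResNet18<Float, CPU>(inputShape: [1, 28, 28], classes: 10)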
Examples
Some high-level examples have been implemented in other repositories:
Arithmetic & Differentiation
DL4S provides a high-level interface to many vectorized operations on tensors.
let a = Tensor<Float, CPU>([[1,2],[3,4],[5,6]], requiresGradient: true)
let prod = a.transposed().matrixMultiplied(with: a)
let s = prod.reduceSum()
let l = log(s)
print(l) // 5.1873856
When a tensor is marked to require a gradient, a compute graph is captured.
The graph stores all operations that use that tensor directly or indirectly as an operand.
It is then possible to backpropagate through that graph using the gradients(of:) function:
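A sketch continuing the arithmetic example above (the printed gradient values are approximate):

// Backpropagate l through the captured graph to obtain dl/da
let grad = l.gradients(of: [a])[0]
print(grad)
// approximately [[0.0335, 0.0335], [0.0782, 0.0782], [0.1229, 0.1229]]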
Second Derivatives
The operations used during backpropagation are themselves differentiable.
Therefore, second derivatives can be computed by computing the gradient of the gradient.
When higher-order derivatives are required, the compute graph of the backwards pass has to be retained explicitly.
let t = Tensor<Float, CPU>([1,2,3,4], requiresGradient: true)
let result = t * t * t
print(result) // [1, 8, 27, 64]
let grad = result.gradients(of: [t], retainBackwardsGraph: true)[0] // retain the graph so the gradient itself is differentiable
print(grad) // [3, 12, 27, 48]
let secondGrad = grad.gradients(of: [t], retainBackwardsGraph: true)[0]
print(secondGrad) // [6, 12, 18, 24]
let thirdGrad = secondGrad.gradients(of: [t])[0]
print(thirdGrad) // [6, 6, 6, 6]
Convolutional Networks
Example for MNIST classification
// Input must be of shape [batchSize, 1, 28, 28]
var model = Sequential {
    Convolution2D<Float, CPU>(inputChannels: 1, outputChannels: 6, kernelSize: (5, 5))
    Relu<Float, CPU>()
    MaxPool2D<Float, CPU>(windowSize: 2, stride: 2)
    Convolution2D<Float, CPU>(inputChannels: 6, outputChannels: 16, kernelSize: (5, 5))
    Relu<Float, CPU>()
    MaxPool2D<Float, CPU>(windowSize: 2, stride: 2)
    Flatten<Float, CPU>()
    Dense<Float, CPU>(inputSize: 256, outputSize: 120)
    Relu<Float, CPU>()
    Dense<Float, CPU>(inputSize: 120, outputSize: 10)
    LogSoftmax<Float, CPU>()
}
var optimizer = Adam(model: model, learningRate: 0.001)
// Single iteration of minibatch gradient descent
let batch: Tensor<Float, CPU> = ... // shape: [batchSize, 1, 28, 28]
let y_true: Tensor<Int32, CPU> = ... // shape: [batchSize]
// Use optimizer.model rather than model: the optimizer holds the updated parameters
let pred = optimizer.model(batch)
let loss = categoricalNegativeLogLikelihood(expected: y_true, actual: pred)
let gradients = loss.gradients(of: optimizer.model.parameters)
optimizer.update(along: gradients)
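The step above can be repeated over many minibatches to train the model; the epoch count and the batches collection here are hypothetical placeholders:

// Hypothetical placeholder: fill with real (input, label) minibatches
let batches: [(Tensor<Float, CPU>, Tensor<Int32, CPU>)] = []

for epoch in 1 ... 10 {
    for (batch, y_true) in batches {
        let pred = optimizer.model(batch)
        let loss = categoricalNegativeLogLikelihood(expected: y_true, actual: pred)
        let gradients = loss.gradients(of: optimizer.model.parameters)
        optimizer.update(along: gradients)
    }
    print("finished epoch \(epoch)")
}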
Recurrent Networks
Example for MNIST classification
The Gated Recurrent Unit scans the image from top to bottom and uses the final hidden state for classification.
let model = Sequential {
    GRU<Float, CPU>(inputSize: 28, hiddenSize: 128, direction: .forward)
    // Keep only the final hidden state (first element of the GRU outputs)
    Lambda<GRU<Float, CPU>.Outputs, Tensor<Float, CPU>, Float, CPU> { inputs in
        inputs.0
    }
    Dense<Float, CPU>(inputSize: 128, outputSize: 10)
    LogSoftmax<Float, CPU>()
}
var optimizer = Adam(model: model, learningRate: 0.001)
let batch: Tensor<Float, CPU> = ... // shape: [batchSize, 28, 28]
let y_true: Tensor<Int32, CPU> = ... // shape: [batchSize]
let x = batch.permuted(to: 1, 0, 2) // Swap the batch and sequence axes so the sequence axis comes first
let pred = optimizer.model(x)
let loss = categoricalNegativeLogLikelihood(expected: y_true, actual: pred)
let gradients = loss.gradients(of: optimizer.model.parameters)
optimizer.update(along: gradients)
Installation
iOS / tvOS / macOS
In Xcode, add a Swift package dependency, enter https://github.com/palle-k/DL4S.git into the Package URL field and click "Next".
Note: Installation via CocoaPods is no longer supported for newer versions.
Swift Package
Add the dependency to your Package.swift file, then add DL4S as a dependency to your target.
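A sketch of the corresponding Package.swift entries; the version requirement and the package and target names are assumptions:

// swift-tools-version:5.1
import PackageDescription

let package = Package(
    name: "MyProject", // hypothetical package name
    dependencies: [
        // The version requirement is an assumption; use the current release
        .package(url: "https://github.com/palle-k/DL4S.git", from: "4.0.0")
    ],
    targets: [
        .target(name: "MyProject", dependencies: ["DL4S"]) // hypothetical target
    ]
)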