Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates, bug fixes, content, best practices, and more.

When deploying a neural network, it's useful to think about how the network could be made to run faster or take less space. A more efficient network can make better predictions in a limited time budget, react more quickly to unexpected input, or fit into constrained deployment environments. Sparsity is one optimization technique that holds the promise of meeting these goals. If there are zeros in the network, then you don't need to store or operate on them. The benefits of sparsity only seem straightforward, though; there have long been three challenges to realizing the promised gains.

Acceleration - Fine-grained, unstructured weight sparsity lacks structure and cannot use the vector and matrix instructions available in efficient hardware to accelerate common network operations. Standard sparse formats are inefficient for all but high sparsities.

Accuracy - To achieve a useful speedup with fine-grained, unstructured sparsity, the network must be made sparse, which often causes accuracy loss. Alternate pruning methods that attempt to make acceleration easier, such as coarse-grained pruning that removes blocks of weights, channels, or entire layers, can run into accuracy trouble even sooner, which limits the potential performance benefit.

Workflow - Much of the current research in network pruning serves as useful existence proofs: it has been shown that network A can achieve sparsity X. The trouble comes when you try to apply sparsity X to network B. It may not work due to differences in the network, task, optimizer, or any hyperparameter.

In this post, we discuss how the NVIDIA Ampere Architecture addresses these challenges. Today, NVIDIA is releasing TensorRT version 8.0, which introduces support for the Sparse Tensor Cores available on NVIDIA Ampere Architecture GPUs. TensorRT is an SDK for high-performance deep learning inference; it includes an optimizer and runtime that minimize latency and maximize throughput in production. Using a simple training workflow and deploying with TensorRT 8.0, Sparse Tensor Cores can eliminate unnecessary calculations in neural networks, resulting in over 30% performance/watt gain compared to dense networks.

Sparse Tensor Cores accelerate 2:4 fine-grained structured sparsity

The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores. Sparse Tensor Cores accelerate a 2:4 sparsity pattern: in each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50% that is fine-grained; there are no vector or block structures pruned together. Such a regular pattern is easy to compress and has a low metadata overhead (Figure 1).

Figure 1. A 2:4 structured sparse matrix W, and its compressed representation
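To make the pattern concrete, here is a minimal NumPy sketch (not from the original post; the helper names prune_2_to_4 and is_2_to_4 are illustrative). It zeroes the two smallest-magnitude values in every contiguous group of four and then verifies that the result is 2:4 structured with 50% sparsity.

```python
import numpy as np

def prune_2_to_4(w):
    """Return a copy of w pruned to the 2:4 pattern: in every contiguous
    group of four values (row-major order), the two smallest-magnitude
    values are set to zero. Assumes w.size is divisible by 4."""
    out = w.copy()
    groups = out.reshape(-1, 4)                       # each row is one group of four
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest magnitudes per group
    np.put_along_axis(groups, drop, 0.0, axis=1)      # zero them in place
    return out

def is_2_to_4(w):
    """True if every contiguous group of four values has at most two nonzeros."""
    return bool((np.count_nonzero(w.reshape(-1, 4), axis=1) <= 2).all())

w = np.random.randn(8, 16).astype(np.float32)       # small dense "weight" matrix
w_pruned = prune_2_to_4(w)
print(is_2_to_4(w_pruned))                           # True
print(np.count_nonzero(w_pruned) / w_pruned.size)    # 0.5, i.e. 50% sparsity
```

Keeping the two largest-magnitude values per group mirrors the general idea of preserving the most important weights, but the exact pruning criterion used in a real training workflow may differ.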
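On the deployment side, the TensorRT builder must be allowed to pick sparse kernel implementations. The snippet below is a rough sketch assuming the TensorRT 8.x Python API and an ONNX export of the pruned network (the file name and the FP16 precision choice are placeholders): the SPARSE_WEIGHTS builder flag lets TensorRT select Sparse Tensor Core tactics for layers whose weights follow the 2:4 pattern.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_sparse_engine(onnx_path="pruned_model.onnx"):
    """Build a serialized TensorRT engine with sparse (2:4) tactics enabled."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    # Allow Sparse Tensor Core kernels for layers whose weights are 2:4 sparse.
    config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
    config.set_flag(trt.BuilderFlag.FP16)  # sparse kernels target FP16/INT8 Tensor Cores

    return builder.build_serialized_network(network, config)
```

If you build engines with the trtexec command-line tool instead, its --sparsity=enable option requests the same behavior.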