Abstract:
Dataset distillation is a method for reducing dataset sizes by learning a small number of representative synthetic samples. This has several benefits, such as speeding up model training, reducing energy consumption, and reducing required storage space. These benefits are especially crucial in settings like federated learning, where initial overhead costs are justified by the speedup they enable. However, current dataset distillation methods have two limitations: 1) each synthetic sample is assigned a single ‘hard’ label, and 2) distillation can only be applied to image data. We propose to simultaneously distill both images and their labels, thus assigning each synthetic sample a ‘soft’ label (a distribution of labels). Our algorithm increases accuracy by 2-4% for several image classification tasks. Using ‘soft’ labels also enables distilled datasets to consist of fewer samples than there are classes, as each sample can encode information for multiple classes. For example, training a LeNet model with 10 distilled images (one per class) results in over 96% accuracy on MNIST, and almost 92% accuracy when trained on just 5 distilled images. We also extend the dataset distillation algorithm to distill text data. We demonstrate that text distillation outperforms other methods across multiple datasets. For example, models attain almost their original accuracy on the IMDB sentiment analysis task using just 20 distilled sentences. Our code can be found at https://github.com/ilia10000/dataset-distillation.
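To make the soft-label idea concrete, below is a minimal sketch of distilling both synthetic samples and their label distributions. It assumes a linear classifier on flattened MNIST-like inputs and a single inner gradient step; all names and hyperparameters (soft_label_distill, distill_steps, lr_inner, lr_outer, n_distilled) are illustrative and are not taken from the authors' repository linked above.

```python
# Hypothetical sketch: jointly learn synthetic inputs and soft labels so that a
# model trained on them performs well on real data. Not the paper's implementation.
import torch
import torch.nn.functional as F

def soft_label_distill(real_loader, n_distilled=10, n_classes=10, dim=784,
                       distill_steps=400, lr_inner=0.02, lr_outer=0.01,
                       device="cpu"):
    # Learnable synthetic images and learnable soft-label logits.
    x_syn = torch.randn(n_distilled, dim, device=device, requires_grad=True)
    y_syn = torch.randn(n_distilled, n_classes, device=device, requires_grad=True)
    outer_opt = torch.optim.Adam([x_syn, y_syn], lr=lr_outer)

    real_iter = iter(real_loader)
    for _ in range(distill_steps):
        # Freshly initialized linear model each outer step.
        w = torch.randn(dim, n_classes, device=device) * 0.01
        b = torch.zeros(n_classes, device=device)
        w.requires_grad_(True)
        b.requires_grad_(True)

        # Inner step: one SGD update on the synthetic data with soft labels.
        logits = x_syn @ w + b
        soft_targets = F.softmax(y_syn, dim=1)
        inner_loss = torch.sum(-soft_targets * F.log_softmax(logits, dim=1), dim=1).mean()
        gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
        w1, b1 = w - lr_inner * gw, b - lr_inner * gb

        # Outer step: evaluate the one-step-trained model on real data and
        # backpropagate through the inner update into x_syn and y_syn.
        try:
            x_real, y_real = next(real_iter)
        except StopIteration:
            real_iter = iter(real_loader)
            x_real, y_real = next(real_iter)
        x_real = x_real.view(x_real.size(0), -1).to(device)
        y_real = y_real.to(device)
        outer_loss = F.cross_entropy(x_real @ w1 + b1, y_real)

        outer_opt.zero_grad()
        outer_loss.backward()
        outer_opt.step()

    # Return the distilled images and their learned label distributions.
    return x_syn.detach(), F.softmax(y_syn, dim=1).detach()
```

Because the labels are distributions rather than one-hot vectors, a single distilled sample can carry information about several classes, which is what allows fewer distilled samples than classes in the setting described above.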
Date of Conference: 18-22 July 2021
Date Added to IEEE Xplore: 20 September 2021