
0.2.0

@danieldk released this 19 Nov 11:31
  • Add the SqueezeBERT model (Iandola et al., 2020). SqueezeBERT replaces the matrix multiplications in the self-attention mechanism and the feed-forward layers with grouped convolutions, which results in fewer parameters and better computational performance.
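
    A minimal sketch of the idea with the tch crate: a 1×1 grouped convolution stands in for the dense projection that a regular transformer layer computes as a matrix multiplication. The dimensions and group count are illustrative, not SyntaxDot's actual configuration.

    ```rust
    use tch::nn::{self, Module};
    use tch::{Device, Kind, Tensor};

    fn main() {
        let vs = nn::VarStore::new(Device::Cpu);
        let root = vs.root();

        let hidden = 768;
        let groups = 4; // Illustrative group count.

        // Dense projection: a matrix multiplication over the hidden
        // dimension, with hidden * hidden weights.
        let dense = nn::linear(&root / "dense", hidden, hidden, Default::default());

        // Grouped replacement: a 1x1 grouped convolution over the sequence,
        // which splits the channels into `groups` groups and only needs
        // hidden * hidden / groups weights.
        let grouped = nn::conv1d(
            &root / "grouped",
            hidden,
            hidden,
            1,
            nn::ConvConfig { groups, ..Default::default() },
        );

        // linear expects [batch, seq, hidden]; conv1d expects [batch, hidden, seq].
        let input = Tensor::randn(&[8, 128, hidden], (Kind::Float, Device::Cpu));
        let dense_out = dense.forward(&input);
        let grouped_out = grouped.forward(&input.transpose(1, 2)).transpose(1, 2);

        assert_eq!(dense_out.size(), grouped_out.size());
    }
    ```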

  • Add the SqueezeAlbert model, which combines SqueezeBERT (Iandola et al., 2020) and ALBERT (Lan et al., 2020).

  • distill: add the attention-loss option. Enabling this option adds the mean squared error (MSE) between the teacher and student attentions to the loss. This can speed up convergence, because the student learns to attend to the same pieces as the teacher.

    Attention loss can only be computed when the teacher and student have the same sequence lengths. In practice, this means that they should use the same piece tokenizers. A sketch of such a loss term follows below.
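
    A minimal sketch with tch, assuming teacher and student attention tensors of identical shape (e.g. [batch, heads, seq_len, seq_len]); the function name and shapes are illustrative, not SyntaxDot's actual code.

    ```rust
    use tch::{Device, Kind, Tensor};

    /// Mean squared error between teacher and student attention weights.
    /// Both tensors must have the same shape, e.g. [batch, heads, seq, seq],
    /// which in practice requires the same piece tokenization.
    fn attention_loss(teacher: &Tensor, student: &Tensor) -> Tensor {
        (student - teacher).square().mean(Kind::Float)
    }

    fn main() {
        let teacher = Tensor::rand(&[8, 12, 128, 128], (Kind::Float, Device::Cpu));
        let student = Tensor::rand(&[8, 12, 128, 128], (Kind::Float, Device::Cpu));

        // This term would be added to the regular distillation loss.
        let loss = attention_loss(&teacher, &student);
        println!("attention loss: {:?}", loss);
    }
    ```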

  • Switch to the AdamW optimizer provided by libtorch. The tch binding now has support for the AdamW optimizer and for parameter groups. Consequently, we do not need our own AdamW optimizer implementation anymore. Switching to the Torch optimizer also speeds up training a bit.
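
    A minimal sketch of building libtorch's AdamW through tch, assuming tch's `nn::adamw` constructor; the model, hyperparameters, and dimensions are illustrative, not SyntaxDot's defaults.

    ```rust
    use tch::nn::{self, Module, OptimizerConfig};
    use tch::{Device, Kind, Tensor};

    fn main() -> Result<(), tch::TchError> {
        let vs = nn::VarStore::new(Device::Cpu);
        let root = vs.root();

        // Stand-in model; a real model would register its parameters in a
        // VarStore in the same way.
        let model = nn::linear(&root / "classifier", 768, 32, Default::default());

        // libtorch's AdamW (beta1, beta2, weight decay), built through tch.
        let mut optimizer = nn::adamw(0.9, 0.999, 0.01).build(&vs, 1e-4)?;

        // One toy optimization step.
        let input = Tensor::randn(&[16, 768], (Kind::Float, Device::Cpu));
        let loss = model.forward(&input).square().mean(Kind::Float);
        optimizer.backward_step(&loss);

        Ok(())
    }
    ```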

  • Move the subword tokenizers into a separate syntaxdot-tokenizers crate.

  • Update to libtorch 1.7.0.

  • Remove the server subcommand. The new REST server is a better replacement, with proper error handling and other improvements.