- Attention Is All You Need: the Transformer
  https://arxiv.org/abs/1706.03762
  Notes
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  https://arxiv.org/abs/1810.04805
- Efficient Estimation of Word Representations in Vector Space (word2vec)
  https://arxiv.org/pdf/1301.3781
- Long Short-Term Memory
  https://www.bioinf.jku.at/publications/older/2604.pdf
- Understanding LSTM — a tutorial into Long Short-Term Memory Recurrent Neural Networks
  https://arxiv.org/abs/1909.09586
- ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)
  https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- Learning Internal Representations by Error Propagation
  https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap8_PDP86.pdf
- Neural Machine Translation in Linear Time (ByteNet)
  https://arxiv.org/abs/1610.10099
- Adam: A Method for Stochastic Optimization
  https://arxiv.org/abs/1412.6980
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  https://arxiv.org/abs/2205.14135