This blog began in January 2018 as a record of my learning process. Over the past two to three years, the posting frequency has gradually decreased as I focused on sequence modeling research, which led to several publications such as cosFormer, TransNormer, HGRN1, and HGRN2. I have long wanted to consolidate my past models, and that led to the creation of the xmixers project. Each post is written in Chinese first and, once complete, translated into English by ChatGPT to broaden its reach. For the Chinese version, please refer to Xmixers(0) xmixers项目简介.

Motivation

This project is motivated by the following pain points:

  1. Huggingface has essentially become the standard interface for defining models, and most evaluation libraries are built to work with Huggingface-compatible models.
  2. Publishing trained models typically requires a Huggingface-compatible interface. For models defined without one, an "xxx to hf" conversion step is necessary. This conversion is error-prone and time-consuming, as each new model requires its own conversion script.
  3. I’ve developed various models over time, but due to a lack of organization, they have not been widely shared. This project presents an opportunity to compile my previous work.

Overall, I believe that defining models with a Huggingface-compatible interface, training them with whatever framework is convenient, and publishing the trained models in the same format is a time-efficient approach for research.
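As a minimal sketch of what "Huggingface-compatible" means here (the ToyMixer names and the placeholder layers are my own illustration, not the actual xmixers code), a model only needs a PretrainedConfig subclass and a PreTrainedModel subclass:

```python
import torch
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

class ToyMixerConfig(PretrainedConfig):
    model_type = "toy_mixer"

    def __init__(self, vocab_size=50257, hidden_size=768, num_layers=12, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        super().__init__(**kwargs)

class ToyMixerForCausalLM(PreTrainedModel):
    config_class = ToyMixerConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        # Placeholder token mixer; a real model would use attention,
        # linear attention, a linear RNN, and so on.
        self.layers = nn.ModuleList(
            nn.Linear(config.hidden_size, config.hidden_size)
            for _ in range(config.num_layers)
        )
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids):
        x = self.embed(input_ids)
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.lm_head(x)

# Subclassing PreTrainedModel provides save_pretrained / from_pretrained,
# so checkpoints plug directly into Huggingface-based evaluation tooling.
model = ToyMixerForCausalLM(ToyMixerConfig(num_layers=2))
model.save_pretrained("toy-mixer-checkpoint")
```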

Project Overview

This project aims to design Huggingface-compatible interfaces for several models, focusing on those from my past research papers and others that interest me or that I have reproduced. The repository does not contain operator-level code; it only includes the models themselves, with an emphasis on code simplicity. Testing scenarios for these models include:

  • Causal language modeling: for GPT-style language models.
  • Bidirectional language modeling: for BERT-style language models.
  • Image classification: for ViT-style image classification models.
  • Image generation: for DiT-style image generation models.

These four categories reflect my view of sequence modeling: a robust sequence model should be competitive in both speed and performance across unidirectional/bidirectional language modeling and image classification/generation tasks. This versatility is crucial for potential applications in multimodal domains.
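For concreteness, the main mechanical difference between the first two settings is just the attention mask (a minimal sketch; the function name is my own):

```python
import torch

def attention_mask(seq_len: int, causal: bool = True) -> torch.Tensor:
    # Causal (GPT-style) models may only attend to past and current
    # positions; bidirectional (BERT-style) models see the full sequence.
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

print(attention_mask(4, causal=True))   # lower-triangular mask
print(attention_mask(4, causal=False))  # all-ones mask
```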

The project is in its early stages, and my initial focus is on testing causal language modeling. The training code is based on nanoGPT, with a training recipe of 50 billion tokens over 100,000 updates. The trained models will be published here.
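For a sense of scale, that recipe works out to 500K tokens per optimizer step; the batch shape below is a hypothetical example on my part, only the token and update counts come from the recipe:

```python
# Back-of-the-envelope check of the training recipe.
total_tokens = 50_000_000_000  # 50B tokens
num_updates = 100_000          # 100K optimizer steps

tokens_per_step = total_tokens // num_updates
print(tokens_per_step)  # 500000

# For example (hypothetical batch shape), 256 sequences of length 2048
# give 524288 tokens per step, roughly matching this budget.
print(256 * 2048)  # 524288
```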

Project Roadmap

I will periodically update this blog with experimental results, about one post every one to two weeks. Below is an outline of my experimental plan.

Model Exploration

This section outlines the model architectures I plan to study:

  • Transformer
    • Analyzing the impact of different initialization methods.
    • Exploring prenorm/postnorm configurations.
    • Experimenting with various positional encodings.
  • Linear Transformer
    • Studying the effects of decay mechanisms in Linear Transformers (see the sketch after this list).
    • Developing new Linear Transformer architectures.
  • Linear RNN
  • Long Convolution
  • Hybrid Models
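To make the decay bullet above concrete, here is a minimal recurrent-form sketch of linear attention with a fixed scalar decay. The fixed scalar is my own simplification: real models typically make the decay learned or per-head (e.g., RetNet-style decay) or data-dependent (e.g., HGRN's gates).

```python
import torch

def linear_attention_with_decay(q, k, v, decay=0.99):
    # Recurrent form: S_t = decay * S_{t-1} + k_t v_t^T, then o_t = q_t S_t.
    # decay = 1.0 recovers vanilla linear attention; decay < 1.0 makes the
    # state forget distant tokens exponentially.
    B, T, d = q.shape
    S = torch.zeros(B, d, d)
    outputs = []
    for t in range(T):
        S = decay * S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)
        outputs.append(q[:, t].unsqueeze(-2) @ S)
    return torch.cat(outputs, dim=1)  # (B, T, d)

q = k = v = torch.randn(1, 8, 16)
print(linear_attention_with_decay(q, k, v).shape)  # torch.Size([1, 8, 16])
```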

Application Domains

This section lists the tasks for which I will conduct model testing:

  • Causal language modeling.
  • Bidirectional language modeling.
  • Image classification.
  • Image generation.

Evaluation Criteria

This section highlights the aspects I plan to evaluate, focusing primarily on causal language modeling. The goal is to identify effective proxy tasks that differentiate between architectures. For language models, perplexity (PPL) may not fully reflect real-world performance. Realistic assessments, such as the Needle in a Haystack setting, require roughly a 1B-parameter model trained on about 100B tokens (as discussed in our paper), which is computationally prohibitive. Synthetic tasks like MQAR are highly sensitive to hyperparameters (based on my experiments); a toy sketch of MQAR-style data follows the list below. While I am still exploring options, key evaluation metrics include:

  • Retrieval
  • Reasoning
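As mentioned above, here is a toy generator for MQAR-style data. The token layout and ids are illustrative assumptions on my part, not the exact format from the MQAR paper:

```python
import random

def mqar_sample(num_pairs=4, num_queries=2, vocab_size=64):
    # Toy multi-query associative recall (MQAR)-style sample: the context
    # lists key-value pairs, then the model must output the value for each
    # queried key.
    tokens = random.sample(range(vocab_size), 2 * num_pairs)  # distinct tokens
    keys, values = tokens[:num_pairs], tokens[num_pairs:]
    context = [t for pair in zip(keys, values) for t in pair]
    queried = random.sample(range(num_pairs), num_queries)
    inputs = context + [keys[i] for i in queried]
    targets = [values[i] for i in queried]
    return inputs, targets

inputs, targets = mqar_sample()
print(inputs, targets)  # e.g. [k1, v1, k2, v2, ..., k2, k1] and [v2, v1]
```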