# BURT: BERT-inspired Universal Representation from Learning Meaningful Segment

##### Authors

Yian Li, Hai Zhao

## Abstract

Although pre-trained contextualized language models such as BERT achieve significant performance on various downstream tasks, current language representation still focuses on a linguistic objective at a single granularity, which may not be applicable when multiple levels of linguistic units are involved at the same time. This work therefore introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space. We present a universal representation model, BURT (BERT-inspired Universal Representation from learning meaningful segmenT), to encode different levels of linguistic units into the same vector space. Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate objectives of different granularities into the pre-training stage. We conduct experiments on English and Chinese datasets, including the GLUE and CLUE benchmarks, where our model surpasses its baselines and alternatives on a wide range of downstream tasks. We present our approach to constructing analogy datasets in terms of words, phrases and sentences, and experiment with multiple representation models to examine geometric properties of the learned vector space through a task-independent evaluation. Finally, we verify the effectiveness of our unified pre-training strategy in two real-world text matching scenarios. As a result, our model significantly outperforms existing information retrieval (IR) methods and yields universal representations that can be directly applied to retrieval-based question-answering and natural language generation tasks.

## Keywords

Artificial Intelligence, Natural Language Processing, Transformer, Language Representation.

## 1 INTRODUCTION

Representations learned by deep neural models have attracted a lot of attention in Natural Language Processing (NLP). However, previous language representation learning methods such as word2vec [1], LASER [2] and USE [3] focus on either words or sentences alone. Later pre-trained contextualized language representations such as ELMo [4], GPT [5], BERT [6] and XLNet [7] can seemingly handle input sequences of different sizes, but all of them still learn sentence-level representations for each word, which leads to unsatisfactory performance in real-world situations. Although the latest BERT-wwm-ext [8], StructBERT [9] and SpanBERT [10] perform MLM at a higher linguistic level, the masked segments (whole words, trigrams, spans) either follow a pre-defined distribution or focus on a single granularity. Such sampling strategies ignore important semantic and syntactic information in a sequence, resulting in a large number of meaningless segments.

However, a universal representation across different levels of linguistic units would offer great convenience whenever free text has to be handled across the language hierarchy in a unified way. As is well known, the embedding representation of a certain linguistic unit (i.e., a word) enables linguistically meaningful arithmetic among vectors, also known as word analogy. For example, vector(“King”) - vector(“Man”) + vector(“Woman”) results in vector(“Queen”). A universal representation may generalize such analogy features, or meaningful arithmetic operations, to free text involving all language levels together. For example, Eat an onion : Vegetable :: Eat a pear : Fruit. In fact, manipulating embeddings in the vector space reveals syntactic and semantic relations between the original sequences, and this property is genuinely useful in real applications. For example, “London is the capital of England.” can be formalized as v(capital)+v(England)∼v(London). Given two documents, one containing “England” and “capital” and the other containing “London”, we consider the two documents relevant. Such features can be generalized to higher language levels for phrase/sentence embeddings.

In this paper, we explore the regularities of representations of words, phrases and sentences in the same vector space. To this end, we introduce a universal analogy task derived from Google’s word analogy dataset. To solve this task, we present BURT, a pre-trained model that aims at learning universal representations for sequences of various lengths. Our model follows the architecture of BERT but differs from it in the masking and training scheme. Specifically, we propose to efficiently extract and prune meaningful segments (n-grams) from an unlabeled corpus with little human supervision, and then use them to modify the masking and training objective of BERT. The n-gram pruning algorithm is based on point-wise mutual information (PMI) and automatically captures different levels of language information, which is critical to improving the model’s ability to handle multiple levels of linguistic objects in a unified way, i.e., embedding sequences of different lengths in the same vector space.
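To make the analogy arithmetic concrete, the following minimal sketch (ours, not from the original implementation) solves an analogy a : b :: c : ? by the vector-offset method. The `encode` function is a placeholder for any encoder that maps a word, phrase or sentence to a fixed-size vector:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, candidates, encode):
    """Solve a : b :: c : ? by the vector-offset method: pick the
    candidate whose embedding is closest to v(b) - v(a) + v(c)."""
    target = encode(b) - encode(a) + encode(c)
    return max(candidates, key=lambda x: cosine(encode(x), target))

# Illustrative usage with placeholder inputs; `encode` may be any model
# that embeds different granular units into the same vector space:
# solve_analogy("man", "king", "woman", ["queen", "prince", "castle"], encode)
```

Because `encode` is granularity-agnostic, the same routine applies unchanged to word, phrase and sentence analogies.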

Overall, our pre-trained models improve on our baselines in both English and Chinese. In English, BURT-base gains one point on average over Google BERT-base. In Chinese, BURT-wwm-ext obtains 74.48% on the WSC test set, a 3.45-point absolute improvement over BERT-wwm-ext, and exceeds the baselines by 0.2∼0.6 points of accuracy on five other CLUE tasks, including TNEWS, IFLYTEK, CSL, ChID and CMRC 2018. Extensive experimental results on our universal analogy task demonstrate that BURT is able to map sequences of variable lengths into a shared vector space where similar sequences are close to each other. Meanwhile, addition and subtraction of embeddings reflect semantic and syntactic connections between sequences. Moreover, BURT can be easily applied to real-world applications such as Frequently Asked Questions (FAQ) and Natural Language Generation (NLG) tasks, where it encodes words, sentences and paragraphs into the same embedding space and directly retrieves sequences that are semantically similar to a given query based on cosine similarity. All of the above experimental results demonstrate that our well-trained model yields universal representations that can adapt to various tasks and applications.

## 2 BACKGROUND

### 2.1 WORD AND SENTENCE EMBEDDINGS

Representing words as real-valued dense vectors is a core technique of deep learning in NLP. Word embedding models [1, 11, 12] map words into a vector space where similar words have similar latent representations. ELMo [4] attempts to learn context-dependent word representations through a two-layer bi-directional LSTM network. In recent years, more and more researchers have focused on learning sentence representations. The Skip-Thought model [13] is designed to predict the surrounding sentences of a given sentence. [14] improve the model structure by replacing the RNN decoder with a classifier. InferSent [15] is trained on the Stanford Natural Language Inference (SNLI) dataset [16] in a supervised manner. [17, 3] employ multi-task training and report considerable improvements on downstream tasks. LASER [2] is a BiLSTM encoder designed to learn multilingual sentence embeddings. Nevertheless, most previous work focuses on a specific granularity. In this work we extend the training objective to a unified level and enable the model to leverage information at different granularities, including, but not limited to, words, phrases and sentences.

### 2.2 PRE-TRAINED LANGUAGE MODELS

Most recently, the pre-trained language model BERT [6] has shown powerful performance on various downstream tasks. BERT is trained on a large amount of unlabeled data with two training objectives: Masked Language Model (MLM), for modeling deep bidirectional representations, and Next Sentence Prediction (NSP), for capturing the relationship between two sentences. [18] introduce Sentence-Order Prediction (SOP) as a substitute for NSP. [9] develop a sentence structural objective by combining the random sampling strategy of NSP with continuous sampling as in SOP. However, [19] and [10] use single contiguous sequences of at most 512 tokens for pre-training and show that removing the NSP objective improves model performance. Besides, BERT-wwm [8], StructBERT [9] and SpanBERT [10] perform MLM at higher linguistic levels, augmenting the MLM objective by masking whole words, trigrams or spans, respectively. In contrast, we concentrate on enhancing the masking and training procedures from a broader and more general perspective.

### 2.3 ANALYSIS ON EMBEDDINGS

Previous exploration of vector regularities mainly studies word embeddings [1, 20, 21]. After the introduction of sentence encoders and Transformer models [22], more work has been done to investigate sentence-level embeddings. Usually, performance on downstream tasks is taken as the measure of a model’s ability to represent sentences [15, 3, 23]. Some research proposes probing tasks to understand certain aspects of sentence embeddings [24, 25, 26]. Specifically, [27, 28, 29] look into BERT embeddings and reveal their internal working mechanisms. Besides, [30, 31] explore the regularities in sentence embeddings. Nevertheless, little work analyzes words, phrases and sentences in the same vector space. In this paper, we study embeddings of sequences of various lengths obtained from different models in a task-independent manner.

### 2.4 FAQ APPLICATIONS

The goal of a Frequently Asked Questions (FAQ) task is to retrieve the most relevant QA pairs from a pre-defined dataset given a query. Previous work focuses on feature-based methods [32, 33, 34]. Recently, Transformer-based representation models have made great progress in measuring query-Question or query-Answer similarities. [35] analyze Transformer models and propose a neural architecture to solve the FAQ task. [36] come up with an FAQ retrieval system that combines the characteristics of BERT and rule-based methods. In this work, we evaluate the performance of well-trained universal representation models on the FAQ task.
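As a reference for how such embedding-based FAQ retrieval typically works, here is a minimal sketch under stated assumptions: a hypothetical `encode` function producing sentence embeddings and a list of (Question, Answer) pairs. It ranks Questions by cosine similarity to the query; query-Answer similarity can be scored analogously:

```python
import numpy as np

def retrieve(query, qa_pairs, encode, top_k=3):
    """Rank pre-defined (Question, Answer) pairs by the cosine similarity
    between the query embedding and each Question embedding."""
    q = encode(query)
    q = q / np.linalg.norm(q)
    scored = []
    for question, answer in qa_pairs:
        v = encode(question)
        v = v / np.linalg.norm(v)
        scored.append((float(q @ v), question, answer))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]
```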

Fig. 1: An illustration of n-gram pre-training.

Fig. 2: An example from the Chinese Wikipedia corpus. n-grams of different lengths are marked with dashed boxes in different colors in the upper part of the figure. During training, we randomly mask n-grams and only the longest n-gram is masked if there are multiple matches, as shown in the lower part of the figure.

## 3 METHODOLOGY

Our BURT follows the Transformer encoder [22] architecture, where the input sequence is first split into subword tokens and a contextualized representation is learned for each token. We only perform MLM training on single sequences, as suggested in [10]. The basic idea is to mask some of the tokens in the input and force the model to recover them from the context. Here we propose a unified masking method and training objective that consider linguistic units of different granularities.

Specifically, we apply a pruning mechanism to collect meaningful n-grams from the corpus and then perform n-gram masking and prediction. Our model differs from the original BERT and other BERT-like models in several ways. First, instead of the token-level MLM of BERT, we incorporate different levels of linguistic units into the training objective in a comprehensive manner. Second, unlike SpanBERT and StructBERT, which sample random spans or trigrams, our n-gram sampling approach automatically discovers structures within any sequence and is not limited to a single granularity.
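The following Python sketch illustrates the general idea of n-gram masking with longest-match preference (cf. Fig. 2). The function name, masking rate and exact matching policy are illustrative assumptions on our part, not the original implementation:

```python
import random

def mask_ngrams(tokens, ngram_vocab, max_n=5, mask_rate=0.15, mask="[MASK]"):
    """Segment `tokens` by greedy longest-match against an extracted n-gram
    vocabulary, then mask matched units at random as wholes. When several
    n-grams match at a position, only the longest one is kept (cf. Fig. 2)."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            # Fall back to a single token when no longer n-gram matches.
            if n == 1 or tuple(tokens[i:i + n]) in ngram_vocab:
                spans.append((i, i + n))
                i += n
                break
    masked = list(tokens)
    for start, end in spans:
        if random.random() < mask_rate:
            masked[start:end] = [mask] * (end - start)
    return masked
```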

### 3.1 N-GRAM PRUNING

In this subsection, we introduce our approach of extracting a large number of meaningful n-grams from the monolingual corpus, which is a critical step of data processing.

First, we scan the corpus and extract all n-grams with lengths up to N using the SRILM toolkit (http://www.speech.sri.com/projects/srilm/download.html) [37]. In order to filter out meaningless n-grams and prevent the vocabulary from becoming too large, we prune by means of point-wise mutual information (PMI) [38]. To be specific, mutual information I(x,y) describes the association between tokens x and y by comparing the probability of observing x and y together with the probabilities of observing x and y independently. Higher mutual information indicates a stronger association between the two tokens.
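For reference, the standard pairwise form of PMI [38] scores a token pair (x, y) as

```latex
I(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```

where p(x, y) is estimated from the corpus co-occurrence count of x and y, and p(x), p(y) from their individual counts; n-grams scoring below a chosen threshold are pruned. (How scores are composed for n-grams longer than two is a design detail of the original work and is not reproduced here.)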

“B”-BM25, “L”-LASER, “S”-SpanBERT, “U”-BURT

Query: 端午节的由来 (The Origin of the Dragon Boat Festival)
B: 一个中学的高级教师陈老师生动地解读端午节的由来，诵读爱好者进行原创诗歌作品朗诵，深深打动了在场的观众… (Mr. Chen, senior teacher at a middle School, vividly introduced the origin of the Dragon Boat Festival and people are reciting original poems, which deeply moved the audience…)
L: 今天是端午小长假第一天…当天上午，在车厢内满目挂有与端午节相关的民俗故事及有关诗词的文字… (Today is the first day of the Dragon Boat Festival holiday…There are folk stories and poems posted in the carriage…)
S,U: …端午节又称端阳节、龙舟节、浴兰节，是中华民族的传统节日。端午节形成于先秦，发展于汉末魏晋，兴盛于唐… (…Dragon Boat Festival, also known as Duanyang Festival, Longzhou Festival and Yulan Festival is a traditional festival of the Chinese nation. It is formed in the Pre-Qin Dynasty, developed in the late Han and Wei-Jin, and prospered in the Tang…)

Comments: B and L are related to the topic but do not convey the meaning of the query.

Query: 狗的喂养知识 (Dog Feeding Tips)
B: …创建一个“比特狗”账户，并支付 99 元领养一只“比特狗”。然后购买喂养套餐喂养“比特狗”，“比特狗”就能通过每天挖矿产生 BTGS 虚拟货币。 (…First create a “Bitdog” account and pay 99 yuan to adopt a “Bitdog”. Then buy a package to feed the “Bitdog”, which can generate virtual currency BTGS through daily mining.)
L: 要养成定时定量喂食的好习惯，帮助狗狗更好的消化和吸收，同时也要选择些低盐健康的狗粮… (It is necessary to feed your dog regularly and quantitatively, which can help them digest and absorb better. Meanwhile, choose some low-salt and healthy dog food…)
S: 泰迪犬容易褪色是受到基因和护理不当的影响，其次是饮食太咸…一定要注意正确护理，定期洗澡，要给泰迪低盐营养的优质狗粮… (Teddy bear dog’s hair is easy to fade because of its genes and improper care. It is also caused by salty diet… So we must take good care of them, such as taking a bath regularly, and preparing dog food with low salt…)
U: 还可以一周自制一次狗粮给狗狗喂食，就是买些肉类，蔬菜，自己动手做。偶尔吃吃自制狗粮也能增加狗狗的营养，和丰富狗狗的口味，这样狗狗就不会那么容易出现挑食，厌食之类的问题。日常的话，建议选择些适口性强的狗粮，有助磨牙，防止口腔疾病。 (You can also make dog food once a week, such as meats and vegetables. Occasionally eating homemade dog food can also supplement nutrition and enrich the taste, so that dogs will not be so picky or anorexic. In daily life, it is recommended to choose some palatable dog food to help their teeth grinding and prevent oral diseases.)

Comments: B is not a relevant paragraph. S is relevant to the topic but is inaccurate.

TABLE IX: Examples of the retrieved paragraphs and corresponding comments from the judges.

| Model | R (Judge1) | R (Judge2) | R (Avg.) | CM (Judge1) | CM (Judge2) | CM (Avg.) |
|-------|-----------|-----------|----------|------------|------------|-----------|
| BM25  | 60.3 | 61.8 | 61.1 | 43.5 | 41.2 | 42.4 |
| LASER | 63.9 | 61.6 | 62.8 | 42.5 | 38.4 | 40.5 |
| BERT  | 65.9 | 67.3 | 66.6 | 48.5 | 47.8 | 48.2 |
| MLM   | 65.0 | 67.5 | 66.3 | 46.1 | 45.5 | 45.8 |
| Span  | 69.3 | 71.5 | 70.4 | 51.6 | 53.9 | 52.8 |
| BURT  | 71.8 | 71.0 | 71.4 | 54.2 | 56.5 | 55.4 |

TABLE XI: Results on NLG according to human judgment. “R” and “CM” represent the percentage of paragraphs that are “relevant” and “convey the meaning”, respectively.

## Visualization

### Single Pattern

Mikolov et al. (2013) use PCA to project word embeddings into a two-dimensional space to visualize a single pattern captured by the Word2Vec model, while in this work we consider embeddings of linguistic units of different granularities. All pairs in the figure belong to the “male-female” category, and subtracting the two vectors of each pair yields roughly the same direction.


| Phrases (p1) | Phrases (p2) |
|---|---|
| p_man: employed by the man | p_woman: hired by the woman |
| p_king: employed by the king | p_queen: hired by the queen |

| Sentences (s) | |
|---|---|
| s_man: He was employed by the man when he was 28. | s_woman: He was hired by the woman at age 28. |
| s_king: He was employed by the king when he was 28. | s_queen: He was hired by the queen at age 28. |
| s_dad: He was employed by his dad when he was 28. | s_mom: He was hired by his mom at age 28. |

TABLE XII: Annotation of phrases and sentences in Figure 4.

### Clustering

Given that embeddings of sequence pairs with the same kind of relationship exhibit the same pattern in the vector space, we compute the difference between the paired embeddings of words, phrases and sentences from different categories and visualize the differences with t-SNE. The figure shows that, after subtracting the two vectors, pairs belonging to the same category automatically fall into the same cluster. Only the pairs from the “capital-country” and “city-state” categories cannot be fully distinguished, which is reasonable because both describe relationships between geographical entities.
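For illustration, here is a minimal sketch of this visualization under stated assumptions: a hypothetical `encode` function and labeled (a, b) pairs. It projects embedding offsets to 2-D with scikit-learn’s t-SNE:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_pair_offsets(pairs, labels, encode):
    """Project the offsets v(b) - v(a) of labeled (a, b) pairs into 2-D
    with t-SNE; pairs sharing a relation should cluster together."""
    offsets = np.stack([encode(b) - encode(a) for a, b in pairs])
    # Note: for small demos, pass a perplexity smaller than len(pairs).
    points = TSNE(n_components=2, random_state=0).fit_transform(offsets)
    for label in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == label]
        plt.scatter(points[idx, 0], points[idx, 1], label=label)
    plt.legend()
    plt.show()
```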

### FAQ

We show examples where BURT successfully retrieves the correct answer while TF-IDF and BM25 fail. Two of the candidate sentences share a common word, which is a possible reason why TF-IDF considers them a strong match, ignoring that the two sentences actually describe two different issues. In contrast, using vector-based representations, BURT recognizes the one as a paraphrase of the other. As depicted in the figure, queries are close to their correct responses and away from other sentences.

## 8 CONCLUSION

This paper formally introduces the task of universal representation learning and presents a pre-trained language model for this purpose, mapping linguistic units of different granularities into the same vector space, where similar sequences have similar representations and unified vector operations are enabled across language hierarchies. In detail, we focus on this less studied form of language representation, seeking to learn a uniform vector form across different linguistic unit hierarchies. Rather than learning word-only or sentence-only representations, our method extends BERT’s masking and training objective to a more general level, which leverages information from sequences of different lengths in a comprehensive way and effectively learns a universal representation for words, phrases and sentences. Overall, our proposed BURT outperforms its baselines on a wide range of downstream tasks over sequences of different lengths in both English and Chinese. We additionally provide a universal analogy task, an insurance FAQ dataset and an NLG dataset for extensive evaluation, where our well-trained universal representation model demonstrates accurate vector arithmetic over words, phrases and sentences, and proves effective in real-world retrieval applications.