[Day86] Slow Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding

AIFFEL Life 2020. 12. 27. 12:16

Monday morning starts with Slow Paper. Today's paper is on XLNet (arxiv.org/pdf/1906.08237.pdf). Below are the notes I wrote while translating and reading through the paper.

Abstract With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

1 Introduction Unsupervised representation learning has been highly successful in the domain of natural language processing [7, 22, 27, 28, 10]. Typically, these methods first pretrain neural networks on large-scale unlabeled text corpora, and then finetune the models or representations on downstream tasks. Under this shared high-level idea, different unsupervised pretraining objectives have been explored in literature. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives.

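Introduction

Unsupervised representation learning has been very successful in natural language processing. Typically, these methods first pretrain a neural network on a large unlabeled text corpus and then finetune the model or its representations on downstream tasks. Under this shared high-level idea, various unsupervised pretraining objectives have been explored in the literature. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives.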

AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model [7, 27, 28]. Specifically, given a text sequence $\mathbf{x} = (x_1, \cdots, x_T)$, AR language modeling factorizes the likelihood into a forward product $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid \mathbf{x}_{<t})$ or a backward one $p(\mathbf{x}) = \prod_{t=T}^{1} p(x_t \mid \mathbf{x}_{>t})$. A parametric model (e.g. a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining.

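AR language modeling tries to estimate the probability distribution of a text corpus with an autoregressive model. Specifically, given a text sequence, an AR language model factorizes the likelihood into a forward product. A parametric model is trained to model each conditional distribution. Because an AR language model is trained to encode only a unidirectional context (forward or backward), it is not effective at modeling deep bidirectional contexts. Downstream language understanding tasks, on the other hand, often need bidirectional context information. So there is a gap between AR language modeling and effective pretraining.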
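To check my own understanding of the forward and backward factorizations, here is a minimal sketch on a toy distribution. The 2-word vocabulary, the length-2 sequences, and the probabilities in `joint` are all made up for illustration; nothing here comes from the paper's code. It only shows that, for an exact joint distribution, both factorization orders recover the same $p(\mathbf{x})$.

```python
import itertools
import numpy as np

# Hypothetical toy joint distribution over length-2 sequences, 2-word vocabulary.
# joint[a, b] = p(x1=a, x2=b); the numbers are invented for illustration.
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])

def forward_product(joint, x):
    """Left-to-right factorization: p(x1) * p(x2 | x1)."""
    x1, x2 = x
    p_x1 = joint[x1, :].sum()              # marginal of the first token
    p_x2_given_x1 = joint[x1, x2] / p_x1   # conditional of the second token
    return p_x1 * p_x2_given_x1

def backward_product(joint, x):
    """Right-to-left factorization: p(x2) * p(x1 | x2)."""
    x1, x2 = x
    p_x2 = joint[:, x2].sum()              # marginal of the second token
    p_x1_given_x2 = joint[x1, x2] / p_x2   # conditional of the first token
    return p_x2 * p_x1_given_x2

# Both orders give back the same joint probability for every sequence.
for x in itertools.product(range(2), repeat=2):
    assert np.isclose(forward_product(joint, x), joint[x])
    assert np.isclose(backward_product(joint, x), joint[x])
```

The two orders agree on the value of $p(\mathbf{x})$, but each conditional only ever sees context from one side, which is the limitation the paragraph above points out.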

In comparison, AE based pretraining does not perform explicit density estimation but instead aims to reconstruct the original data from corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of tokens are replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional contexts for reconstruction. As an immediate benefit, this closes the aforementioned bidirectional information gap in AR language modeling, leading to improved performance. However, the artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified as high-order, long-range dependency is prevalent in natural language [9].

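In comparison, AE-based pretraining does not perform explicit density estimation; it aims to reconstruct the original data from corrupted input. A notable example is BERT, which has been the state-of-the-art pretraining approach. Given an input token sequence, a certain portion of the tokens is replaced with a special symbol, [MASK], and the model is trained to recover the original tokens from the corrupted version. Because density estimation is not part of the objective, BERT is free to use bidirectional context for reconstruction. As an immediate benefit, this closes the bidirectional information gap of AR language modeling mentioned above and leads to better performance. However, artificial symbols such as [MASK] that BERT uses during pretraining do not appear in real data at finetuning time, which causes a pretrain-finetune discrepancy. Moreover, because the predicted tokens are masked in the input, BERT cannot model the joint probability with the product rule the way AR language modeling does. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is an oversimplification because high-order, long-range dependencies are common in natural language.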
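The independence assumption is easier for me to see with numbers, so here is a small sketch. The two masked positions, the candidate words, and the probabilities in `joint` are my own made-up example, not data from the paper. It compares scoring two masked tokens independently (BERT-style) with scoring them through the product rule (AR-style).

```python
import numpy as np

# Hypothetical candidates for two masked positions in the same sentence.
first = ["New", "Los"]
second = ["York", "Angeles"]

# joint[i, j] = p(first[i], second[j] | unmasked context); invented numbers.
# Only the pairs "New York" and "Los Angeles" are possible here.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])

p_first = joint.sum(axis=1)    # marginal over the first masked token
p_second = joint.sum(axis=0)   # marginal over the second masked token

# Independence assumption: each masked token is scored on its own.
independent = np.outer(p_first, p_second)

# Product rule: p(first) * p(second | first), as an AR factorization would do.
p_second_given_first = joint / p_first[:, None]
product_rule = p_first[:, None] * p_second_given_first

print(independent)   # every pair gets the same score, including impossible ones
print(product_rule)  # matches the joint exactly
```

Under the independence assumption the impossible pair "New Angeles" gets the same score as "New York", while the product rule reproduces the joint. This is only a toy, but it is the dependency between masked positions that the paragraph above says BERT neglects.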

Faced with the pros and cons of existing language pretraining objectives, in this work, we propose XLNet, a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations.
• Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.
• Secondly, as a generalized AR language model, XLNet does not rely on data corruption. Hence, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT.

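Given the pros and cons of existing language pretraining objectives, in this work we propose XLNet, a generalized autoregressive method that takes the best of both AR language modeling and AE while avoiding their limitations.

First, instead of using a fixed forward or backward factorization order, XLNet maximizes the expected log likelihood of a sequence with respect to all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both the left and the right. In expectation, each position learns to use contextual information from all positions, that is, to capture bidirectional context.
Second, as a generalized AR language model, XLNet does not rely on data corruption. XLNet therefore does not suffer from the pretrain-finetune discrepancy that affects BERT. Meanwhile, the autoregressive objective also provides a natural way to use the product rule to factorize the joint probability of the predicted tokens, which removes the independence assumption made in BERT.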

In addition to a novel pretraining objective, XLNet improves architectural designs for pretraining.
• Inspired by the latest advancements in AR language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL [9] into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.
• Naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work because the factorization order is arbitrary and the target is ambiguous. As a solution, we propose to reparameterize the Transformer(-XL) network to remove the ambiguity.

Empirically, under comparable experiment setting, XLNet consistently outperforms BERT [10] on a wide spectrum of problems including GLUE language understanding tasks, reading comprehension tasks like SQuAD and RACE, text classification tasks such as Yelp and IMDB, and the ClueWeb09-B document ranking task.

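In addition to the new pretraining objective, XLNet improves the architectural design for pretraining.

Inspired by the latest advances in AR language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL into pretraining, which empirically improves performance, especially on tasks involving longer text sequences.
Naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work because the factorization order is arbitrary and the target is ambiguous. As a solution, we propose reparameterizing the Transformer(-XL) network to remove the ambiguity. Empirically, under comparable experimental settings, XLNet consistently outperforms BERT on a wide range of problems, including GLUE language understanding tasks, reading comprehension tasks such as SQuAD and RACE, text classification tasks such as Yelp and IMDB, and the ClueWeb09-B document ranking task.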

Related Work

The idea of permutation-based AR modeling has been explored in [32, 12], but there are several key differences. Firstly, previous models aim to improve density estimation by baking an “orderless” inductive bias into the model while XLNet is motivated by enabling AR language models to learn bidirectional contexts. Technically, to construct a valid target-aware prediction distribution, XLNet incorporates the target position into the hidden state via two-stream attention while previous permutation-based AR models relied on implicit position awareness inherent to their MLP architectures. Finally, for both orderless NADE and XLNet, we would like to emphasize that “orderless” does not mean that the input sequence can be randomly permuted but that the model allows for different factorization orders of the distribution. Another related idea is to perform autoregressive denoising in the context of text generation [11], which only considers a fixed order though.

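2 Proposed Method 2.1 Background In this section, we first review and compare the conventional AR language modeling and BERT for language pretraining. Given a text sequence $\mathbf{x} = [x_1, \cdots, x_T]$, AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization:

$$\max_{\theta} \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t}) = \sum_{t=1}^{T} \log \frac{\exp\left(h_\theta(\mathbf{x}_{1:t-1})^\top e(x_t)\right)}{\sum_{x'} \exp\left(h_\theta(\mathbf{x}_{1:t-1})^\top e(x')\right)}, \qquad (1)$$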

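Proposed Method

2.1. Background

In this section, we first review and compare conventional AR language modeling and BERT for language pretraining. Given a text sequence, AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization.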

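where $h_\theta(\mathbf{x}_{1:t-1})$ is a context representation produced by neural models, such as RNNs or Transformers, and $e(x)$ denotes the embedding of $x$. In comparison, BERT is based on denoising auto-encoding. Specifically, for a text sequence $\mathbf{x}$, BERT first constructs a corrupted version $\hat{\mathbf{x}}$ by randomly setting a portion (e.g. 15%) of tokens in $\mathbf{x}$ to a special symbol [MASK]. Let the masked tokens be $\bar{\mathbf{x}}$. The training objective is to reconstruct $\bar{\mathbf{x}}$ from $\hat{\mathbf{x}}$:

$$\max_{\theta} \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}}) = \sum_{t=1}^{T} m_t \log \frac{\exp\left(H_\theta(\hat{\mathbf{x}})_t^\top e(x_t)\right)}{\sum_{x'} \exp\left(H_\theta(\hat{\mathbf{x}})_t^\top e(x')\right)}, \qquad (2)$$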

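Here $h_\theta$ is a context representation produced by a neural model such as an RNN or a Transformer, and $e(x)$ is the embedding of $x$. In comparison, BERT is based on denoising autoencoding. Specifically, for a text sequence $\mathbf{x}$, BERT first builds a corrupted version $\hat{\mathbf{x}}$ by randomly setting a portion of the tokens in $\mathbf{x}$ to a special symbol. Call the masked tokens $\bar{\mathbf{x}}$. The training objective is to reconstruct $\bar{\mathbf{x}}$ from $\hat{\mathbf{x}}$.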

where $m_t = 1$ indicates $x_t$ is masked, and $H_\theta$ is a Transformer that maps a length-$T$ text sequence $\mathbf{x}$ into a sequence of hidden vectors $H_\theta(\mathbf{x}) = [H_\theta(\mathbf{x})_1, H_\theta(\mathbf{x})_2, \cdots, H_\theta(\mathbf{x})_T]$. The pros and cons of the two pretraining objectives are compared in the following aspects:

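Here $m_t = 1$ means that $x_t$ is masked, and $H_\theta$ is a Transformer that maps a length-$T$ text sequence $\mathbf{x}$ to a sequence of hidden vectors $H_\theta(\mathbf{x})$. The pros and cons of the two pretraining objectives are compared along the following aspects.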

• Independence Assumption: As emphasized by the ≈ sign in Eq. (2), BERT factorizes the joint conditional probability $p(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})$ based on an independence assumption that all masked tokens $\bar{\mathbf{x}}$ are separately reconstructed. In comparison, the AR language modeling objective (1) factorizes $p_\theta(\mathbf{x})$ using the product rule that holds universally without such an independence assumption.
• Input noise: The input to BERT contains artificial symbols like [MASK] that never occur in downstream tasks, which creates a pretrain-finetune discrepancy. Replacing [MASK] with original tokens as in [10] does not solve the problem because original tokens can be only used with a small probability; otherwise Eq. (2) will be trivial to optimize. In comparison, AR language modeling does not rely on any input corruption and does not suffer from this issue.
• Context dependency: The AR representation $h_\theta(\mathbf{x}_{1:t-1})$ is only conditioned on the tokens up to position $t$ (i.e. tokens to the left), while the BERT representation $H_\theta(\mathbf{x})_t$ has access to the contextual information on both sides. As a result, the BERT objective allows the model to be pretrained to better capture bidirectional context.

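Independence assumption: As the ≈ sign in Eq. (2) emphasizes, BERT factorizes the joint conditional probability $p(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})$ under an independence assumption, namely that all masked tokens $\bar{\mathbf{x}}$ are reconstructed separately. In comparison, the AR language modeling objective factorizes $p_\theta(\mathbf{x})$ with the product rule, which holds universally without any such independence assumption.
Input noise: BERT's input contains artificial symbols such as [MASK] that never appear in downstream tasks, which creates a pretrain-finetune discrepancy. Replacing [MASK] with the original tokens as in [10] does not solve the problem, because the original tokens can only be used with a small probability; otherwise Eq. (2) would be trivial to optimize. In comparison, AR language modeling does not rely on any input corruption and does not have this issue.
Context dependency: The AR representation $h_\theta(\mathbf{x}_{1:t-1})$ is conditioned only on the tokens up to position $t$ (the tokens to the left), while the BERT representation $H_\theta(\mathbf{x})_t$ has access to contextual information on both sides. As a result, the BERT objective lets the model be pretrained to better capture bidirectional context.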
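To see how the pieces of Eq. (2) fit together, here is a rough NumPy sketch of the masked objective. The function name `masked_lm_objective`, the toy sizes, and the random arrays standing in for the Transformer output $H_\theta(\hat{\mathbf{x}})$ and the embeddings $e(x')$ are all my own assumptions; this is not BERT's actual implementation, only the sum $\sum_t m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})$ written out.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def masked_lm_objective(hidden, embeddings, targets, mask):
    """Sum over t of m_t * log p(x_t | x_hat), following the form of Eq. (2).

    hidden:     (T, d) array standing in for H_theta(x_hat)_t
    embeddings: (V, d) array standing in for e(x') for every vocabulary item
    targets:    (T,) original token ids x_t
    mask:       (T,) m_t, 1 where x_t was masked and 0 elsewhere
    """
    logits = hidden @ embeddings.T                 # (T, V): H_t^T e(x')
    log_probs = log_softmax(logits)                # normalize over the vocabulary
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return float((mask * token_log_probs).sum())   # only masked positions count

rng = np.random.default_rng(0)
T, d, V = 6, 4, 10                       # toy sizes, chosen arbitrarily
hidden = rng.normal(size=(T, d))         # random stand-in for a Transformer output
embeddings = rng.normal(size=(V, d))     # random stand-in for the embedding table
targets = rng.integers(0, V, size=T)     # made-up original tokens
mask = np.array([0, 1, 0, 0, 1, 0])      # two of the six positions are masked

print(masked_lm_objective(hidden, embeddings, targets, mask))
```

Each masked position contributes its own log-probability term and the terms are simply added, which is where the independence assumption behind the ≈ sign shows up in code.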

2.2 Objective: Permutation Language Modeling According to the comparison above, AR language modeling and BERT possess their unique advantages over the other. A natural question to ask is whether there exists a pretraining objective that brings the advantages of both while avoiding their weaknesses.

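2.2. Objective: Permutation Language Modeling

According to the comparison above, AR language modeling and BERT each have unique advantages over the other. A natural question is whether there is a pretraining objective that brings the advantages of both while avoiding their weaknesses.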

Borrowing ideas from orderless NADE [32], we propose the permutation language modeling objective that not only retains the benefits of AR models but also allows models to capture bidirectional contexts. Specifically, for a sequence $\mathbf{x}$ of length $T$, there are $T!$ different orders to perform a valid autoregressive factorization. Intuitively, if model parameters are shared across all factorization orders, in expectation, the model will learn to gather information from all positions on both sides.

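Borrowing ideas from orderless NADE, we propose the permutation language modeling objective, which not only keeps the benefits of AR models but also lets the model capture bidirectional context. Specifically, for a sequence $\mathbf{x}$ of length $T$, there are $T!$ different orders in which a valid autoregressive factorization can be performed. Intuitively, if model parameters are shared across all factorization orders, then in expectation the model will learn to gather information from all positions on both sides.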

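To formalize the idea, let $\mathcal{Z}_T$ be the set of all possible permutations of the length-$T$ index sequence $[1, 2, \ldots, T]$. We use $z_t$ and $\mathbf{z}_{<t}$ to denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z} \in \mathcal{Z}_T$.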

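To formalize the idea, let $\mathcal{Z}_T$ be the set of all possible permutations of the length-$T$ index sequence. We use $z_t$ and $\mathbf{z}_{<t}$ for the $t$-th element and the first $t-1$ elements of a permutation.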
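To get a feel for the notation, here is a small sketch that enumerates the permutation set for $T = 3$ and lists, for a given order, which position is predicted at each step and from which context positions. The helper name `factorization_steps` is hypothetical, and sampling a single order with `random.choice` is just my way of illustrating "one factorization order"; neither is taken from the paper's code.

```python
import itertools
import random

def factorization_steps(z):
    """For an order z, return (target position z_t, context positions z_<t) pairs."""
    return [(z[t], tuple(z[:t])) for t in range(len(z))]

T = 3
positions = list(range(1, T + 1))
all_orders = list(itertools.permutations(positions))   # the set Z_T, with T! elements

# Collect every context that position 2 is predicted from across all orders.
contexts_for_2 = {frozenset(ctx)
                  for z in all_orders
                  for target, ctx in factorization_steps(z)
                  if target == 2}

random.seed(0)
z = random.choice(all_orders)   # one factorization order (illustrative sampling)
for target, ctx in factorization_steps(z):
    print(f"predict x_{target} given positions {ctx}")
```

Across the six orders, position 2 is predicted from the empty context, from position 1 only, from position 3 only, and from both, so with shared parameters the same position sees context from its left and its right.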
