feat: Add Model

Files changed (7)

README.md ADDED

+---
+language: ko
+---
+# Pretrained BART in Korean
+This is a BART model pretrained on multiple Korean datasets.
+I used multiple datasets so that the model generalizes to both colloquial and written text.
+Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
+The script used to pretrain the model is [here](https://github.com/cosmoquester/transformers-bart-pretrain).
+When you use the inference API, you must wrap the sentence with `[BOS]` and `[EOS]`, as in the example below.
+```
+[BOS] 안녕하세요? 반가워요~~ [EOS]
+```
+You can also test mask-filling performance with the `[MASK]` token, like this.
+```
+[BOS] [MASK] 먹었어? [EOS]
+```
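Reviewer note: the wrapping rule described in the README can be sketched as a small helper. This is only an illustration, not part of the commit. The function names `wrap_for_inference` and `mask_prompt` and the `___` placeholder are hypothetical; only the `[BOS]`, `[EOS]`, and `[MASK]` token strings come from the README.

```python
# Token strings taken from the README; everything else here is illustrative.
BOS, EOS, MASK = "[BOS]", "[EOS]", "[MASK]"


def wrap_for_inference(text: str) -> str:
    """Wrap raw text as '[BOS] ... [EOS]' unless it is already wrapped."""
    text = text.strip()
    if not text.startswith(BOS):
        text = f"{BOS} {text}"
    if not text.endswith(EOS):
        text = f"{text} {EOS}"
    return text


def mask_prompt(template: str, placeholder: str = "___") -> str:
    """Replace a hypothetical placeholder with [MASK], then wrap the result."""
    return wrap_for_inference(template.replace(placeholder, MASK))


if __name__ == "__main__":
    print(wrap_for_inference("안녕하세요? 반가워요~~"))
    print(mask_prompt("___ 먹었어?"))
```

Calling the helper twice on the same string leaves it unchanged, so it should be safe to apply to input that may already carry the special tokens.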
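+## Datasets Used
+### [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/)
+- Everyday Conversation Corpus 2020 (일상 대화 말뭉치 2020)
+- Spoken Corpus (구어 말뭉치)
+- Written Corpus (문어 말뭉치)
+- Newspaper Corpus (신문 말뭉치)
+### AIhub
+- [Open Data: Specialized Domain Corpus (개방데이터 전문분야말뭉치)](https://aihub.or.kr/aidata/30717)
+- [Open Data: Korean Dialogue Summarization (개방데이터 한국어대화요약)](https://aihub.or.kr/aidata/30714)
+- [Open Data: Emotional Dialogue Corpus (개방데이터 감성 대화 말뭉치)](https://aihub.or.kr/aidata/7978)
+- [Open Data: Korean Speech (개방데이터 한국어 음성)](https://aihub.or.kr/aidata/105)
+- [Open Data: Korean SNS (개방데이터 한국어 SNS)](https://aihub.or.kr/aidata/30718)
+### [Sejong Corpus (세종 말뭉치)](https://ithub.korean.go.kr/)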

config.json ADDED

+{
+  "_name_or_path": "bart-ko-mini",
+  "activation_dropout": 0.1,
+  "activation_function": "gelu",
+  "architectures": [
+    "BartForConditionalGeneration"
+  ],
+  "attention_dropout": 0.1,
+  "bos_token_id": 2,
+  "classifier_dropout": 0.0,
+  "d_model": 256,
+  "decoder_attention_heads": 4,
+  "decoder_ffn_dim": 1024,
+  "decoder_layerdrop": 0.0,
+  "decoder_layers": 2,
+  "decoder_start_token_id": 2,
+  "dropout": 0.1,
+  "encoder_attention_heads": 4,
+  "encoder_ffn_dim": 1024,
+  "encoder_layerdrop": 0.0,
+  "encoder_layers": 2,
+  "eos_token_id": 3,
+  "forced_eos_token_id": 3,
+  "gradient_checkpointing": false,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2"
+  },
+  "init_std": 0.02,
+  "is_encoder_decoder": true,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_2": 2
+  },
+  "max_position_embeddings": 2048,
+  "model_type": "bart",
+  "num_hidden_layers": 2,
+  "pad_token_id": 0,
+  "scale_embedding": false,
+  "transformers_version": "4.7.0",
+  "use_cache": false,
+  "vocab_size": 32000
+}
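Reviewer note: as a rough sanity check on this configuration, the sketch below estimates the number of trainable parameters from the config values alone. It is not part of the commit. It assumes the usual BART layout: one token embedding shared by encoder, decoder, and output layer; learned position embeddings with two extra rows; biased linear projections; and a layer norm after each sublayer and after each embedding. The function name `estimate_bart_params` is hypothetical.

```python
# Relevant fields copied from config.json in this diff.
CONFIG = {
    "d_model": 256,
    "vocab_size": 32000,
    "max_position_embeddings": 2048,
    "encoder_layers": 2,
    "decoder_layers": 2,
    "encoder_ffn_dim": 1024,
    "decoder_ffn_dim": 1024,
}


def estimate_bart_params(cfg: dict) -> int:
    """Estimate trainable parameters under the assumptions stated above."""
    d = cfg["d_model"]
    token_emb = cfg["vocab_size"] * d  # assumed shared everywhere
    # Assumption: learned positions carry an offset of 2 extra rows.
    pos_emb = (cfg["max_position_embeddings"] + 2) * d
    layer_norm = 2 * d  # weight and bias
    attention = 4 * (d * d + d)  # q, k, v, and output projections with biases

    def ffn(hidden: int) -> int:
        # Two linear layers with biases: d -> hidden -> d.
        return d * hidden + hidden + hidden * d + d

    enc_layer = attention + layer_norm + ffn(cfg["encoder_ffn_dim"]) + layer_norm
    # Decoder layers have self-attention and cross-attention blocks.
    dec_layer = 2 * (attention + layer_norm) + ffn(cfg["decoder_ffn_dim"]) + layer_norm

    encoder = pos_emb + layer_norm + cfg["encoder_layers"] * enc_layer
    decoder = pos_emb + layer_norm + cfg["decoder_layers"] * dec_layer
    return token_emb + encoder + decoder


if __name__ == "__main__":
    n = estimate_bart_params(CONFIG)
    print(n, "parameters, about", round(n * 4 / 1e6, 1), "MB as float32")
```

Under these assumptions the estimate is about 12.9 million parameters, or roughly 51.7 MB at 4 bytes each, which is in the same range as the weight file sizes recorded in this diff. The token embedding accounts for well over half of the total.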

pytorch_model.bin ADDED

+version https://git-lfs.github.com/spec/v1
+oid sha256:e4aa890dd0c2c6b4108db89c3e46826fa02731b4bdf0560df4e344b1e1db9041
+size 51876625
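Reviewer note: the three lines above are a Git LFS pointer, not the weights themselves. Each line is a key followed by a space and a value. A minimal sketch of reading such a pointer is shown below. It is not part of the commit, and the function name `parse_lfs_pointer` is hypothetical. It also strips a leading `+` so that lines pasted from a diff can be used directly.

```python
# Pointer text as it appears in this diff, including the leading '+' markers.
POINTER = """\
+version https://git-lfs.github.com/spec/v1
+oid sha256:e4aa890dd0c2c6b4108db89c3e46826fa02731b4bdf0560df4e344b1e1db9041
+size 51876625
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # tolerate diff markers
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


if __name__ == "__main__":
    info = parse_lfs_pointer(POINTER)
    print(info["hash_algo"], info["size"])
```

The `size` field gives the byte length of the real file, so it can be compared against a downloaded checkpoint before loading it.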

special_tokens_map.json ADDED

@@ -0,0 +1 @@
+{"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "mask_token": "[MASK]"}

tf_model.h5 ADDED

+version https://git-lfs.github.com/spec/v1
+oid sha256:2a7863617a9c4d1495b4ffda42552de2a96784969d8662bd9f9129735cc5754e
+size 51823224

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

@@ -0,0 +1 @@
+{"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "mask_token": "[MASK]"}