This is a sentence-transformers model finetuned from intfloat/multilingual-e5-small on the mnlp_encoder_data dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Full model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
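The three modules correspond to a BERT encoder, mean pooling over token embeddings, and L2 normalization. As a minimal sketch (not part of the original card), the same computation can be reproduced with plain transformers; the example text is illustrative, and the base model is used here in place of the finetuned checkpoint:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Base model of this card; the finetuned checkpoint follows the same pipeline.
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

batch = tokenizer(["An illustrative sentence."], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 384)

# (1) Pooling with pooling_mode_mean_tokens=True: average over non-padding tokens.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# (2) Normalize(): unit-length vectors, so dot product equals cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([1, 384])
```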
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ngkan146/test-encoder-st-intfloatsmall")

# Run inference
sentences = [
    'What is the main purpose of chain coding in image segmentation? \nA. To enhance the color depth of images \nB. To compress binary images by tracing contours \nC. To convert images into three-dimensional models \nD. To increase the size of image files',
    'A chain code is a lossless compression based image segmentation method for binary images based upon tracing image contours. The basic principle of chain coding, like other contour codings, is to separately encode each connected component, or "blob", in the image.\n\nFor each such region, a point on the boundary is selected and its coordinates are transmitted. The encoder then moves along the boundary of the region and, at each step, transmits a symbol representing the direction of this movement.\n\nThis continues until the encoder returns to the starting position, at which point the blob has been completely described, and encoding continues with the next blob in the image.\n\nThis encoding method is particularly effective for images consisting of a reasonably small number of large connected components.\n\nVariations \nSome popular chain codes include:\n the Freeman Chain Code of Eight Directions (FCCE)\n Directional Freeman Chain Code of Eight Directions (DFCCE)\n Vertex Chain Code (VCC)\n Three OrThogonal symbol chain code (3OT)\n Unsigned Manhattan Chain Code (UMCC)\n Ant Colonies Chain Code (ACCC)\n Predator-Prey System Chain Code (PPSCC)\n Beaver Territories Chain Code (BTCC)\n Biological Reproduction Chain Code (BRCC)\n Agent-Based Modeling Chain Code (ABMCC)\n\nIn particular, FCCE, VCC, 3OT and DFCCE can be transformed from one to another\n\nA related blob encoding method is crack code. Algorithms exist to convert between chain code, crack code, and run-length encoding.\n\nA new trend of chain codes involve the utilization of biological behaviors. This started by the work of Mouring et al. who developed an algorithm that takes advantage of the pheromone of ants to track image information. An ant releases a pheromone when they find a piece of food. Other ants use the pheromone to track the food. In their algorithm, an image is transferred into a virtual environment that consists of food and paths according to the distribution of the pixels in the original image. Then, ants are distributed and their job is to move around while releasing pheromone when they encounter food items. This helps other ants identify information, and therefore, encode information.\n\nIn use \nRecently, the combination of move-to-front transform and adaptive run-length encoding accomplished efficient compression of the popular chain codes.\nChain codes also can be used to obtain high levels of compression for image documents, outperforming standards such as DjVu and JBIG2.',
    'Meripilus sumstinei, commonly known as the giant polypore or the black-staining polypore, is a species of fungus in the family Meripilaceae.\n\nTaxonomy \nOriginally described in 1905 by William Alphonso Murrill as Grifola sumstinei, the species was transferred to Meripilus in 1988.\n\nDescription \nThe cap of this polypore is wide, with folds of flesh up to thick. It has white to brownish concentric zones and tapers toward the base; the stipe is indistinct.\n\nDistribution and habitat \nIt is found in eastern North America from June to September. It grows in large clumps on the ground around hardwood (including oak) trunks, stumps, and logs.\n\nUses \nThe mushroom is edible.\n\nReferences',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
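Because the embeddings are L2-normalized, the same similarity call also supports simple semantic search. A minimal sketch (the query and corpus strings are illustrative assumptions, not from the training data):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ngkan146/test-encoder-st-intfloatsmall")

query = "How does chain coding compress binary images?"
corpus = [
    "A chain code compresses binary images by tracing image contours.",
    "Meripilus sumstinei is an edible polypore found in eastern North America.",
]

# Rank corpus passages by cosine similarity to the query.
scores = model.similarity(model.encode([query]), model.encode(corpus))  # shape [1, 2]
best = scores.argmax().item()
print(corpus[best])
```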
The model was trained on the mnlp_encoder_data dataset, where each example is a triplet with three string columns: anchor, positive, and negative.

| | anchor | positive | negative |
|---|---|---|---|
| type | string | string | string |

Sample triplets:

| anchor | positive | negative |
|---|---|---|
| What are the two key processes that relative nonlinearity depends on for maintaining species diversity? | Relative nonlinearity is a coexistence mechanism that maintains species diversity via differences in the response to and effect on variation in resource density or some other factor mediating competition. Relative nonlinearity depends on two processes: 1) species have to differ in the curvature of their responses to resource density and 2) the patterns of resource variation generated by each species must favor the relative growth of another species. In its most basic form, one species grows best under equilibrium competitive conditions and another performs better under variable competitive conditions. Like all coexistence mechanisms, relative nonlinearity maintains species diversity by concentrating intraspecific competition relative to interspecific competition. Because resource density can be variable, intraspecific competition is the reduction of per-capita growth rate under variable resources generated by conspecifics (i.e. individuals of the same species). Interspecific competitio... | Muellerella lichenicola is a species of lichenicolous fungus in the family Verrucariaceae. It was first formally described as a new species in 1826 by Søren Christian Sommerfelt, as Sphaeria lichenicola. David Leslie Hawksworth transferred it to the genus Muellerella in 1979. |
| What was the unemployment rate in Japan in 2010? | The labor force in Japan numbered 65.9 million people in 2010, which was 59.6% of the population of 15 years old and older, and amongst them, 62.57 million people were employed, whereas 3.34 million people were unemployed which made the unemployment rate 5.1%. The structure of Japan's labor market experienced gradual change in the late 1980s and continued this trend throughout the 1990s. The structure of the labor market is affected by: 1) shrinking population, 2) replacement of postwar baby boom generation, 3) increasing numbers of women in the labor force, and 4) workers' rising education level. Also, an increase in the number of foreign nationals in the labor force is foreseen. | The Aircraft Classification Rating (ACR) - Pavement Classification Rating (PCR) method is a standardized international airport pavement rating system developed by ICAO in 2022. The method is scheduled to replace the ACN-PCN method as the official ICAO pavement rating system by November 28, 2024. The method uses similar concepts as the ACN-PCN method, however, the ACR-PCR method is based on layered elastic analysis, uses standard subgrade categories for both flexible and rigid pavement, and eliminates the use of alpha factor and layer equivalency factors. |
| What was the original name of WordMARC before it was changed due to a trademark conflict? | WordMARC Composer was a scientifically oriented word processor developed by MARC Software, an offshoot of MARC Analysis Research Corporation (which specialized in high end Finite Element Analysis software for mechanical engineering). It ran originally on minicomputers such as Prime and Digital Equipment Corporation VAX. When the IBM PC emerged as the platform of choice for word processing, WordMARC allowed users to easily move documents from a minicomputer (where they could be easily shared) to PCs. | Parametric stereo (abbreviated as PS) is an audio compression algorithm used as an audio coding format for digital audio. It is considered an Audio Object Type of MPEG-4 Part 3 (MPEG-4 Audio) that serves to enhance the coding efficiency of low bandwidth stereo audio media. Parametric Stereo digitally codes a stereo audio signal by storing the audio as monaural alongside a small amount of extra information. This extra information (defined as "parametric overhead") describes how the monaural signal will behave across both stereo channels, which allows for the signal to exist in true stereo upon playback. |
The model was trained with TripletLoss using these parameters:

```json
{
    "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
    "triplet_margin": 5
}
```
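TripletLoss trains the encoder so that each anchor lands closer to its positive than to its negative by at least the margin, i.e. loss = max(||a - p|| - ||a - n|| + margin, 0) with the Euclidean metric and margin 5. A minimal construction sketch with the sentence-transformers API, starting from the base model named in this card:

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("intfloat/multilingual-e5-small")

# loss = max(||a - p|| - ||a - n|| + margin, 0), matching the config above.
loss = losses.TripletLoss(
    model=model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)
```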
Non-default hyperparameters:

- `learning_rate`: 2e-05
- `weight_decay`: 0.01
- `num_train_epochs`: 1
- `warmup_steps`: 10
- `remove_unused_columns`: False

All hyperparameters:

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 8
- `per_device_eval_batch_size`: 8
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.01
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 10
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: False
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

Training logs (a training sketch using these hyperparameters follows the table):

| Epoch | Step | Training Loss |
|---|---|---|
| 0.1 | 100 | 4.249 |
| 0.2 | 200 | 3.9999 |
| 0.3 | 300 | 3.985 |
| 0.4 | 400 | 3.9575 |
| 0.5 | 500 | 3.9236 |
| 0.6 | 600 | 3.9196 |
| 0.7 | 700 | 3.9299 |
| 0.8 | 800 | 3.8944 |
| 0.9 | 900 | 3.9088 |
| 1.0 | 1000 | 3.8702 |
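As an end-to-end illustration, the sketch below wires the dataset columns, the loss, and the non-default hyperparameters into the sentence-transformers v3 trainer. The toy triplets and `output_dir` are assumptions, not the actual training data:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# Toy triplets with the same column layout as mnlp_encoder_data (illustrative only).
train_dataset = Dataset.from_dict({
    "anchor": ["What is the main purpose of chain coding?"],
    "positive": ["A chain code compresses binary images by tracing contours."],
    "negative": ["Meripilus sumstinei is a species of fungus."],
})

model = SentenceTransformer("intfloat/multilingual-e5-small")
loss = losses.TripletLoss(
    model, distance_metric=losses.TripletDistanceMetric.EUCLIDEAN, triplet_margin=5
)

# Non-default hyperparameters from the list above; output_dir is hypothetical.
# (remove_unused_columns=False is applied automatically by these arguments.)
args = SentenceTransformerTrainingArguments(
    output_dir="outputs",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    warmup_steps=10,
    per_device_train_batch_size=8,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```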
Citation (BibTeX):

Sentence Transformers:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

TripletLoss:

```bibtex
@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
Base model: intfloat/multilingual-e5-small