nielsr (HF Staff) committed
Commit 5251df4 · verified · 1 Parent(s): c18e2f4

Improve model card with pipeline tag, library name, and license


This PR improves the model card by adding the `pipeline_tag`, `library_name`, and `license` metadata. This ensures the model is discoverable through relevant search filters on the Hugging Face Hub and provides crucial information for users.

Files changed (1)
  1. README.md +13 -0
README.md CHANGED
@@ -2,8 +2,21 @@
 tags:
 - model_hub_mixin
 - pytorch_model_hub_mixin
+library_name: pytorch
+pipeline_tag: feature-extraction
+license: mit
 ---
 
+# FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
+
+The model was presented in the paper [FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens](https://arxiv.org/abs/2506.03096).
+
+# Paper abstract
+
+The abstract of the paper is the following:
+
+Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval, while being comparable to baselines on unimodal tasks.
+
 This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
 - Code: https://github.com/chs20/fuselip
 - Paper: https://arxiv.org/abs/2506.03096
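
As a usage note for the mixin integration referenced in the card, below is a minimal sketch of the `PyTorchModelHubMixin` workflow. It assumes the actual FuseLIP model class (defined in https://github.com/chs20/fuselip) subclasses the mixin; the stub class, its layers, and the `chs20/fuselip` repo id used here are illustrative placeholders, not the real implementation.

```python
# Minimal sketch of the PyTorchModelHubMixin workflow referenced in the card.
# Assumptions: the real FuseLIP class (from https://github.com/chs20/fuselip)
# inherits from PyTorchModelHubMixin; the class body and repo id below are
# placeholders for illustration only.
import torch
from huggingface_hub import PyTorchModelHubMixin


class FuseLIPStub(torch.nn.Module, PyTorchModelHubMixin):
    """Stand-in module; the real architecture is defined in the FuseLIP repo."""

    def __init__(self, vocab_size: int = 1024, embed_dim: int = 512):
        super().__init__()
        # FuseLIP operates on a fused vocabulary of text and image tokens,
        # so a single embedding table can cover both modalities.
        self.token_embedding = torch.nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings into one multimodal feature vector.
        return self.token_embedding(token_ids).mean(dim=1)


# Subclassing the mixin adds from_pretrained/save_pretrained/push_to_hub,
# which is how this checkpoint was pushed to the Hub.
# model = FuseLIPStub.from_pretrained("chs20/fuselip")  # repo id is an assumption
model = FuseLIPStub()
model.save_pretrained("fuselip-local-demo")  # writes weights and config.json locally
```

The `library_name: pytorch` and `pipeline_tag: feature-extraction` metadata added in this PR correspond to this loading path and to the model's role as an embedding (feature-extraction) model.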