wingrune/3DGraphLLM

Image-Text-to-Text

3d-scene-understanding

vision-language-model

wingrune committed on Dec 25, 2024

Commit 4b9223c · verified · 1 Parent(s): 35ac252

Update README.md

Files changed (1)

README.md +30 -3

@@ -1,3 +1,30 @@
----
-license: mit
----

+---
+license: mit
+pipeline_tag: visual-question-answering
+---
+# 3DGraphLLM
+3DGraphLLM is a model that uses a 3D scene graph and an LLM to perform 3D vision-language tasks.
+<p align="center">
+<img src="ga.png" width="80%">
+</p>
+## Model Details
+We provide our best checkpoint, which uses [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the LLM, [Mask3D](https://github.com/JonasSchult/Mask3D) 3D instance segmentation to obtain scene graph nodes, [VL-SAT](https://github.com/wz7in/CVPR2023-VLSAT) to encode semantic relations, [Uni3D](https://github.com/baaivision/Uni3D) as the 3D object encoder, and [DINOv2](https://github.com/facebookresearch/dinov2) as the 2D object encoder.
+## Citation
+If you find 3DGraphLLM helpful, please consider citing our work as:
+```
+@misc{zemskova20243dgraphllm,
+      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding},
+      author={Tatiana Zemskova and Dmitry Yudin},
+      year={2024},
+      eprint={2412.18450},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2412.18450},
+}
+```