Abstract:
Large-scale pre-trained models have demonstrated impressive performance on vision and language tasks in open-world scenarios. Because comparable pre-trained models for 3D shapes are lacking, recent methods use language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pre-trained language-image models are not confident enough when generalizing to 3D shape recognition. This paper therefore aims to improve that confidence with view selection and hierarchical prompts. Building on the well-established CLIP model, we introduce view selection on the vision side, which minimizes entropy to identify the most informative views of a 3D shape. On the textual side, we propose hierarchical prompts that combine hand-crafted and GPT-generated prompts to refine predictions. The first layer produces several classification candidates using traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Extensive experiments demonstrate the effectiveness of the proposed modules for zero-shot 3D shape recognition. Remarkably, without any additional training, the proposed method achieves zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. We will make the code publicly available to facilitate reproducibility and further research in this area.
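The entropy-based view selection and the first prompt layer described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-view class logits (for example, CLIP image-text similarities for each rendered view) have already been computed. The function names `select_views` and `top_candidates`, and the parameters `k` and `n`, are hypothetical. The second, function-level refinement layer is not covered.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_views(view_logits, k):
    """Return the indices of the k views with the lowest prediction entropy.

    Assumption: low entropy is used as a proxy for an informative view.
    """
    scored = sorted((entropy(softmax(l)), i) for i, l in enumerate(view_logits))
    return [i for _, i in scored[:k]]


def top_candidates(view_logits, selected, n):
    """Average class probabilities over the selected views and return the
    n most likely class indices (hypothetical first-layer candidates)."""
    num_classes = len(view_logits[0])
    avg = [0.0] * num_classes
    for i in selected:
        probs = softmax(view_logits[i])
        for c in range(num_classes):
            avg[c] += probs[c] / len(selected)
    return sorted(range(num_classes), key=lambda c: -avg[c])[:n]


if __name__ == "__main__":
    # Made-up logits for 4 views and 3 classes, for illustration only.
    logits = [
        [5.0, 0.1, 0.2],  # confident view
        [1.0, 1.1, 0.9],  # nearly uniform, uninformative view
        [0.2, 4.0, 3.9],  # ambiguous between two classes
        [6.0, 0.0, 0.5],  # confident view
    ]
    chosen = select_views(logits, k=2)
    print("selected views:", chosen)
    print("candidate classes:", top_candidates(logits, chosen, n=2))
```

In this sketch the candidate classes would then be passed to a second set of prompts for refinement; how those prompts are built and scored is specific to the paper and is not reproduced here.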
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Early Access)
- IEEE Keywords
- Index Terms
- 3D Shape
- Shape Recognition
- 3D Shape Recognition
- Need For Training
- Need For Additional Training
- Semantic
- Average Accuracy
- Visual Features
- Point Cloud
- Electrical Engineering
- Depth Map
- Language Model
- Area Under Curve
- Inference Time
- Prediction Confidence
- 3D Point Cloud
- 3D Datasets
- Multimodal Learning
- Class Of Candidates
- Tianjin University
- Visual Encoding
- National University Of Singapore
- View Features
- Pre-trained Encoder
- Information Technology
- Recognition Performance
- 3D Data
- Textual Features