Vision Transformers and BERT both seem to be built from only the encoder section of a transformer. So, are they very similar to each other? If not, please give a short gist of where the differences lie. Even better, can you point me to material that explains what the BERT architecture is composed of? (From what I have seen, videos on YouTube are either confusing or very shallow when it comes to explaining BERT.)
Great question! While Vision Transformers (ViT) and BERT both use the encoder part of the transformer architecture, they serve different purposes and have some key differences:
Purpose: ViT is designed for computer vision tasks, while BERT is focused on natural language processing (NLP).
Input: ViT splits an image into fixed-size patches and processes them as a sequence of patch embeddings, whereas BERT processes sequences of word-piece tokens (see the sketch after this list).
Training: ViT is typically pre-trained on large labeled image datasets (e.g., ImageNet-21k) with a supervised classification objective, while BERT is pre-trained self-supervised on vast amounts of unlabeled text and then fine-tuned for specific tasks.
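To make the "images as sequences of patches" point concrete, here is a minimal sketch of ViT-style patch embedding in PyTorch. The sizes are illustrative (roughly ViT-Base), not pulled from any specific implementation:

```python
import torch
import torch.nn as nn

# Illustrative sizes (roughly ViT-Base): 224x224 RGB image,
# 16x16 patches, 768-dim embeddings.
image = torch.randn(1, 3, 224, 224)  # (batch, channels, H, W)
patch_size, embed_dim = 16, 768

# ViT's "tokenizer": a strided convolution that cuts the image into
# non-overlapping patches and linearly projects each one.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = to_patches(image).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 768]) - a sequence of 196 "patch tokens"

# From here on, these patch tokens (plus a [CLS] token and position
# embeddings) flow through the same kind of encoder stack BERT uses.
```

Once the image is flattened into that (sequence length, embedding dim) shape, the rest of the model looks just like a BERT encoder.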
For a detailed explanation of BERT’s architecture, you can check out the comprehensive guide on Analytics Vidhya or the Hugging Face blog.
Hello,
Vision Transformers (ViT) and BERT are indeed both based on the encoder part of the transformer architecture, but they serve different purposes and have some key differences:
Vision Transformers (ViT)
Purpose: Designed for image processing tasks.
Architecture: A standard transformer encoder that applies self-attention over a sequence of image patches, mirroring how BERT's encoder attends over a sequence of text tokens (compare the two encoders in the snippet below).
Application: Used in tasks like image classification, object detection, and more.
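If it helps, you can see how structurally close the two encoders are by loading both from the Hugging Face transformers library. This assumes the public bert-base-uncased and google/vit-base-patch16-224 checkpoints; any BERT/ViT pair would do:

```python
from transformers import BertModel, ViTModel

bert = BertModel.from_pretrained("bert-base-uncased")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

# Both are stacks of identical encoder layers; the main difference
# is the input embedding (word-piece tokens vs. image patches).
print(len(bert.encoder.layer), bert.config.hidden_size)  # 12 768
print(len(vit.encoder.layer), vit.config.hidden_size)    # 12 768
```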
BERT (Bidirectional Encoder Representations from Transformers)
Purpose: Designed for natural language processing (NLP) tasks.
Architecture: The same kind of encoder stack, pre-trained bidirectionally with masked language modeling, so each token's representation is conditioned on both its left and right context (see the fill-mask example after this section).
Application: Used in tasks like text classification, sentiment analysis, question answering, and more.
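To see that bidirectional objective in action, here is a small fill-mask example using the transformers pipeline (again assuming the public bert-base-uncased checkpoint):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills in [MASK] using context on *both* sides of the gap.
for pred in fill("Paris is the [MASK] of France."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```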
For a detailed explanation of BERT’s architecture, you can refer to the Hugging Face BERT 101 guide. It provides a comprehensive overview of how BERT works and its applications.