Intro to Vector Embeddings

Data Science Dojo & Weaviate

Victoria Slocum

Machine Learning Engineer

[1.5, -0.4, 7.2, 19.6, 20.2, 1.7, -0.3, 6.9, 19.1, 21.1]

You can do a bunch of cool things with vectors

[queen] - [king] = [woman] - [man]

source: @3blue1brown

What does this mean? 

The vectors are encoding the meaning of the things they represent in some way

RGB color vectors, with one axis each for R, G, and B:

[0, 0, 0] (black)
[255, 255, 255] (white)
[255, 0, 0] (red)
[255, 0, 255] (magenta)
[80, 200, 120] (emerald)

Each number represents how much red, green, or blue is in the color.
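To make "similar colors sit close together" concrete, here is a minimal sketch comparing these color vectors by Euclidean distance (the distance metric and color choices are illustrative, not from the slides):

```python
import numpy as np

# Colors as 3-dimensional vectors on the R, G, B axes.
red     = np.array([255, 0, 0])
magenta = np.array([255, 0, 255])
green   = np.array([0, 255, 0])

# Euclidean distance: a smaller distance means more similar colors.
print(np.linalg.norm(red - magenta))  # 255.0   (differs only in blue)
print(np.linalg.norm(red - green))    # ~360.6  (differs in red and green)
```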

This is exactly what a vector embedding is - a sequence of numbers that represents meaning.

[0, 0, 0] (3 dims: R, G, B)

cat → [1.5, -0.4, 7.2, 19.6, 20.2, 1.7, -0.3, 6.9, 19.1, 21.1, ...] (1000+ dim)
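The same idea scales up: embeddings are compared by how closely they point in the same direction. Below is a minimal sketch of cosine similarity on toy vectors (the 5-dim "cat", "dog", and "car" vectors are made up for illustration; real embeddings have 1000+ dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, made up for illustration.
cat = np.array([1.5, -0.4, 7.2, 19.6, 20.2])
dog = np.array([1.7, -0.3, 6.9, 19.1, 21.1])
car = np.array([-5.2, 3.1, 0.2, 8.1, 3.5])

print(cosine_similarity(cat, dog))  # close to 1.0: similar meaning
print(cosine_similarity(cat, car))  # lower: less related
```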

How do embedding models actually work?

word-level dense vector models (Word2vec)

Uses a neural network to learn word associations from a large corpus of text (it was initially trained by Google on 100 billion words). A usage sketch follows the list below.

Limitations
  • it doesn't address words with multiple meanings (polysemous words)
  • it doesn't address words with ambiguous meanings
  • it only embeds at the word level*
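A minimal sketch of the word-analogy behavior with a pretrained Word2vec model, using gensim's downloader (the model name "word2vec-google-news-300" and the printed score are assumptions based on gensim's public model hub, not from the slides):

```python
# Word-level dense vectors: the classic king/queen analogy with Word2vec.
# Downloads the ~1.6 GB Google News model on first use.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# [queen] ≈ [king] - [man] + [woman]
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71)]
```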

transformer models (BERT, etc.)

Uses the transformer architecture, taking the entire input text into account to modify each word's representation based on the surrounding text. (ELMo produces contextual embeddings in a similar spirit, but uses LSTMs rather than transformers.)

(the basis of most modern machine learning)

Limitations
  • increased compute and memory requirements
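A minimal sketch of context-aware embeddings with the sentence-transformers library (the model "all-MiniLM-L6-v2" and the example sentences are illustrative assumptions):

```python
# Transformer embeddings take surrounding text into account, so the two
# financial uses of "bank" land closer together than the river sense.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "She deposited the check at the bank.",
    "The bank approved her loan application.",
    "We had a picnic on the river bank.",
]
embeddings = model.encode(sentences)

print(util.cos_sim(embeddings[0], embeddings[1]))  # higher: same sense
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: different sense
```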

Demo time!!

So, how do you actually choose an embedding model?

What is your use case?

  • Data performance: comparison/benchmarking on the MTEB Leaderboard (take with a grain of salt); retrieval performance (a benchmarking sketch follows this list)
  • Infrastructure considerations: inference costs, storage costs, latency/throughput; influences hardware choices
  • Language specificity: multilingual, code, long context, etc.
  • Domain specificity: evaluate on specific domain datasets
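As a sketch of what benchmarking looks like in code, here is the common usage pattern of the `mteb` package for running a single task (the task and model names are illustrative assumptions; check the MTEB docs for the current API):

```python
# Evaluate an embedding model on one MTEB task instead of the full suite.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed example model

evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
```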

Let's talk about cost

Closed-Source Options (e.g., Google)

accessed through an API

Benefits
  • Speed
  • Easy to use and access
  • No hosting

Drawbacks
  • Rate limits
  • Batch upload
  • Locked-in to the provider
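As one example of the API route, here is a minimal sketch using Google's generative AI SDK (the model name and response shape are assumptions based on that SDK's public docs; other providers follow a similar pattern):

```python
# Closed-source route: get an embedding with a single API call.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

response = genai.embed_content(
    model="models/text-embedding-004",
    content="Vector embeddings encode meaning as numbers.",
)
print(len(response["embedding"]))  # dimensionality of the returned vector
```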

Open-Source Options

hosted locally

Benefits
  • Control
  • Flexibility
  • Visibility and open access

Drawbacks
  • Need to host yourself
  • Infrastructure costs
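To make the infrastructure-cost point concrete, a back-of-the-envelope storage estimate (every number here is an illustrative assumption):

```python
# Raw storage for float32 embeddings: vectors x dimensions x 4 bytes.
n_vectors = 10_000_000   # assumed: one embedding per document chunk
dims = 1024              # assumed embedding dimensionality
bytes_per_float = 4      # float32

total_gb = n_vectors * dims * bytes_per_float / 1024**3
print(f"{total_gb:.1f} GB of raw vector storage")  # ~38.1 GB
```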

Embedding models case study

Snowflake Arctic Embed 2.0 vs top models on MTEB

AI isn't free

(not even close)

next webinar!

with Data Science Dojo

connect with me on LinkedIn!