XlmRoBertaSentenceEmbeddings returns huge amount of embeddings instead of set dimensions #14181

Open
@maziyarpanahi

Description

Discussed in #14180

Originally posted by kkwasnioch February 20, 2024
I am trying to produce embeddings for whole documents in 3 languages: English, Polish, and Finnish. Previously I tried sentence-transformers/paraphrase-multilingual-mpnet-base-v2 from Hugging Face and it works fine, returning 768 dims. But when I load the model and run it with Spark NLP's XlmRoBertaSentenceEmbeddings, it produces, for example, 26k dims. Am I loading the model the wrong way? Or are there any other issues? Thanks!
https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_XlmRoBertaSentenceEmbeddings.ipynb -> this is the sample notebook I based my code on
Code:

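from pyspark.ml import Pipeline
from pyspark.sql import functions as f
from sparknlp.annotator import XlmRoBertaSentenceEmbeddings
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher, LightPipeline

# `spark` is the active Spark session; `sparkDF` is a DataFrame with a "text" column.
MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

# Load the exported ONNX model as a sentence-embeddings annotator.
robert = XlmRoBertaSentenceEmbeddings.loadSavedModel(EXPORT_PATH, spark) \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setStorageRef("xlmroberta_embeddings_paraphrase_mpnet_base_v2")

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Convert the embedding annotations into plain arrays of floats.
embeddings_finisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finished_vectors") \
    .setOutputAsVector(False)

pipeline = Pipeline(stages=[document_assembler, robert, embeddings_finisher])

pipelineModel = pipeline.fit(sparkDF)
lightModel = LightPipeline(pipelineModel, parse_embeddings=True)

# One row per embedding vector, plus the length of that vector.
out = (
    lightModel.transform(sparkDF)
    .select("text", f.explode("finished_vectors").alias("emb"))
    .withColumn("size", f.size("emb"))
)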

Output:
+--------------------+--------------------+-----+
|                text|                 emb| size|
+--------------------+--------------------+-----+
|Do kościoła jak "...|[0.028680567, 0.2...|29952|
|Audi Q7 właśnie p...|[-0.01756316, -0....|28416|
|Białoruś. KGB wpr...|[0.07118901, -0.0...|28416|
|"Są prawdziwym za...|[0.0972352, -0.04...|25344|
|Obsesja, za którą...|[0.07850968, 0.15...|32256|
|Ogromny sukces Po...|[-0.034644652, 0....|22272|
|  Rolnicy "zajęli...|[-0.06938014, 0.0...|29952|
|Szokujące wyznani...|[0.08084734, 0.18...|30720|
|Pogoda zaskoczy w...|[-0.086600736, 0....|34560|
|Kiedyś kary fizyc...|[0.059363756, 0.0...|28416|
+--------------------+--------------------+-----+
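
A quick arithmetic check on the sizes in this output may help narrow things down. The sketch below only uses the numbers printed above and the 768 dims reported for the Hugging Face model; the names `EXPECTED_DIM`, `sizes`, `remainders` and `row_counts` are made up for illustration. Every size turns out to be an exact multiple of 768, which is consistent with (but does not prove) each vector being several 768-dim embeddings concatenated, for example one per token, rather than a single pooled sentence embedding.

```python
# Sizes copied from the "size" column of the output above.
sizes = [29952, 28416, 28416, 25344, 32256, 22272, 29952, 30720, 34560, 28416]

# Dimension the Hugging Face model returns, per the description above.
EXPECTED_DIM = 768

# How many 768-dim blocks fit in each vector, and what is left over.
row_counts = [size // EXPECTED_DIM for size in sizes]
remainders = [size % EXPECTED_DIM for size in sizes]

for size, count, rem in zip(sizes, row_counts, remainders):
    print(f"{size} = {count} x {EXPECTED_DIM} + {rem}")
```

If the remainders are all zero, as they are here, a reasonable next step would be to compare the block count per row against the token count of the corresponding text. That comparison is an assumption about the cause, not something the output above confirms.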
