Description
Discussed in #14180
Originally posted by kkwasnioch February 20, 2024
I am trying to produce embeddings for whole documents in three languages: English, Polish, and Finnish. Previously I tried sentence-transformers/paraphrase-multilingual-mpnet-base-v2 from Hugging Face, and it works fine, returning 768 dimensions. But when I load the model and run it with Spark NLP's XlmRoBertaSentenceEmbeddings, it produces, for example, 26k dimensions. Am I loading the model the wrong way? Or are there any other issues? Thanks!
https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_XlmRoBertaSentenceEmbeddings.ipynb -> this is the sample code I based my approach on
Code:
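import pyspark.sql.functions as f
from pyspark.ml import Pipeline
from sparknlp.annotator import XlmRoBertaSentenceEmbeddings
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher, LightPipeline

# `spark` is the active Spark NLP session; `sparkDF` is a DataFrame with a "text" column.
MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

# Load the exported ONNX model as a sentence-embeddings annotator.
robert = XlmRoBertaSentenceEmbeddings.loadSavedModel(EXPORT_PATH, spark) \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setStorageRef("xlmroberta_embeddings_paraphrase_mpnet_base_v2")

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Convert the embedding annotations into plain arrays of floats.
embeddings_finisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finnished_vectors") \
    .setOutputAsVector(False)

pipeline = Pipeline(stages=[document_assembler, robert, embeddings_finisher])
pipelineModel = pipeline.fit(sparkDF)
LightPipelinelightModel = LightPipeline(pipelineModel, parse_embeddings=True)

# One row per embedding vector, plus the length of each vector.
out = LightPipelinelightModel.transform(sparkDF) \
    .select("text", f.explode("finnished_vectors").alias("emb")) \
    .withColumn("size", f.size("emb"))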
Output:
+--------------------+--------------------+-----+
|                text|                 emb| size|
+--------------------+--------------------+-----+
|Do kościoła jak "...|[0.028680567, 0.2...|29952|
|Audi Q7 właśnie p...|[-0.01756316, -0....|28416|
|Białoruś. KGB wpr...|[0.07118901, -0.0...|28416|
|"Są prawdziwym za...|[0.0972352, -0.04...|25344|
|Obsesja, za którą...|[0.07850968, 0.15...|32256|
|Ogromny sukces Po...|[-0.034644652, 0....|22272|
|  Rolnicy "zajęli...|[-0.06938014, 0.0...|29952|
|Szokujące wyznani...|[0.08084734, 0.18...|30720|
|Pogoda zaskoczy w...|[-0.086600736, 0....|34560|
|Kiedyś kary fizyc...|[0.059363756, 0.0...|28416|
+--------------------+--------------------+-----+
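
A quick sanity check on the reported sizes, as a minimal sketch: it assumes 768 (the dimension the Hugging Face model returns) is the expected per-vector size and tests whether each `size` value is a whole multiple of it. The names `sizes`, `HIDDEN_DIM`, and `chunks` are only illustrative. If every remainder is zero, that would be consistent with several 768-dimensional vectors being concatenated per document instead of pooled into one, although that explanation is a guess and not something confirmed here.

```python
# Values copied from the `size` column of the output above.
sizes = [29952, 28416, 28416, 25344, 32256, 22272, 29952, 30720, 34560, 28416]

# Assumption: 768 is the expected embedding dimension (what the Hugging Face model returns).
HIDDEN_DIM = 768

# divmod returns (quotient, remainder); a zero remainder means the size
# is a whole number of 768-dimensional vectors.
chunks = [divmod(size, HIDDEN_DIM) for size in sizes]

for size, (count, remainder) in zip(sizes, chunks):
    print(f"{size} = {count} x {HIDDEN_DIM} + {remainder}")
```

The quotient differs from row to row, so whatever is being concatenated seems to vary with the input text.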