How to install SynapseML on EMR from the Maven repository into /usr/lib/spark/jars #2355

Open
2 of 19 tasks
hueiyuan opened this issue Mar 19, 2025 · 0 comments
SynapseML version

1.0.10

System information

  • Language version (e.g. python 3.8, scala 2.12): python 3.9
  • Spark Version (e.g. 3.2.3): 3.5.1
  • Spark Platform (e.g. Synapse, Databricks): AWS EMR Release 7.3.1

Describe the problem

I would like to install SynapseML on EMR for use with PySpark. Applying the following configuration in a Jupyter notebook works:

%%configure -f
{
  "name": "synapseml",
  "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven"
  }
}

In production, however, we don't use Jupyter notebooks. We therefore downloaded the corresponding jars from the Maven repository and copied them to /usr/lib/spark/jars on EMR. This does not work and fails with the error: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM

Does anyone know the root cause of this? Thank you.
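For reference, the notebook configuration above can be carried over to a non-notebook job by passing the same properties to spark-submit as `--conf` flags. The sketch below only builds that command line; the script name `job.py` is hypothetical, and whether this resolves the error on EMR is not confirmed here. One relevant difference from copying jars by hand: `spark.jars.packages` also resolves transitive dependencies, whereas jars copied manually into /usr/lib/spark/jars include only what was downloaded.

```python
def spark_submit_args(conf, app):
    """Turn a dict of Spark properties into a spark-submit argument list."""
    args = ["spark-submit"]
    for key, value in conf.items():
        # Each property becomes one "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Same properties as the %%configure cell in the issue.
conf = {
    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
}

# "job.py" is a placeholder for the production PySpark script.
print(" ".join(spark_submit_args(conf, "job.py")))
```

This assumes the cluster nodes can reach the repository URL at submit time.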

Code to reproduce issue

from synapse.ml.isolationforest import IsolationForest

# print(type(IsolationForest))
hyper_params = {
    'n_estimators': 100,
    'max_samples': 32,
    'max_features': 1,
    'bootstrap': False,
    'contamination': 0.1,
}

isolation_forest_model = (
    IsolationForest()
    .setNumEstimators(hyper_params["n_estimators"])
    .setBootstrap(hyper_params["bootstrap"])
    .setMaxSamples(hyper_params["max_samples"])
    .setMaxFeatures(hyper_params["max_features"])
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(hyper_params["contamination"])
    .setContaminationError(0.01 * hyper_params["contamination"])
)

Other info / logs

An error was encountered:
com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/__init__.py", line 139, in wrapper
    return func(self, **kwargs)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/com.microsoft.azure_synapseml-core_2.12-1.0.9-spark3.5.jar/synapse/ml/isolationforest/IsolationForest.py", line 78, in __init__
    self._java_obj = self._new_java_obj("com.microsoft.azure.synapse.ml.isolationforest.IsolationForest", self.uid)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 84, in _new_java_obj
    java_obj = getattr(java_obj, name)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1664, in __getattr__
    raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
py4j.protocol.Py4JError: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
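
The error above means py4j could not resolve the class in the driver JVM. A quick way to narrow this down is to check whether any jar in the target directory actually contains the class file. The following is a minimal diagnostic sketch using only the standard library; the helper name `jars_containing` is hypothetical. Finding the class in a jar does not prove it can be loaded, because its own dependencies may still be missing from the classpath.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, jar_dir):
    """Return names of jars under jar_dir that hold the .class file for class_name."""
    # A fully qualified class name maps to a path-like entry inside the jar.
    entry = class_name.replace(".", "/") + ".class"
    root = Path(jar_dir)
    if not root.is_dir():
        return []
    hits = []
    for jar in sorted(root.glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            continue  # skip truncated or corrupt downloads
    return hits


if __name__ == "__main__":
    print(jars_containing(
        "com.microsoft.azure.synapse.ml.isolationforest.IsolationForest",
        "/usr/lib/spark/jars",
    ))
```

An empty list would suggest the jar holding the class was never copied; a non-empty list points toward a missing dependency or a classpath that does not include that directory.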

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations