First contact with spark-jobserver

This describes how to test a spark-jobserver instance locally. It derives from the project README. First impressions are great.

Server side

git clone git@github.com:spark-jobserver/spark-jobserver.git

Activate Basic Auth

# add this at the end of config/local.conf
shiro {
  authentication = on
  # absolute path to the shiro config file, including the file name
  config.path = "config/shiro.ini.basic"
}
# limit the number of concurrent jobs per context (default 8)
spark {
  jobserver.max-jobs-per-context = 2
}

More detail in the source code
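The shiro.ini.basic file referenced above defines the users. A minimal sketch (user and pwd are placeholder credentials, not values from the repository):

```ini
# config/shiro.ini.basic -- minimal example; user/pwd are placeholders
[users]
# format: username = password[, role1, role2, ...]
user = pwd
```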

Increase the accepted jar size

# in spark section
short-timeout = 60 s

# in root section
spray.can.server {
    parsing.max-content-length = 300m
    idle-timeout = 400s
    request-timeout = 300s
}

Start the server

sbt
# Run this
job-server-extras/reStart config/local.conf --- -Xmx8g

Client

# compile the jar
sbt job-server-tests/package

# add the jar and give it the "test-binary" name
curl --basic --user 'user:pwd' -X POST localhost:8090/binaries/test-binary -H "Content-Type: application/java-archive" --data-binary @job-server-tests/target/scala-2.11/job-server-tests_2.11-0.9.1-SNAPSHOT.jar

# create a context (passing the basic auth)
curl --basic --user 'user:pwd' -d "" "localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m"

# get the contexts
curl -k --basic --user 'user:pwd' https://localhost:8090/contexts

# get the jobs
curl -k --basic --user 'user:pwd' https://localhost:8090/jobs

# run a job based on the test-binary file
# also make it synchronous and increase the timeout
curl --basic --user 'user:pwd' -d "input.string = a b c a b see" "localhost:8090/jobs?appName=test-binary&classPath=spark.jobserver.WordCountExample&context=test-context&sync=true&timeout=100000"
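With sync=true the result comes back inline in the response body. Given the input above, it looks roughly like this (jobId elided):

```json
{
  "jobId": "...",
  "result": {
    "b": 2,
    "a": 2,
    "see": 1,
    "c": 1
  }
}
```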

Async Client

curl --basic --user 'user:pwd' -d "input.string = a b c a b see" "localhost:8090/jobs?appName=omop-spark-job&classPath=io.frama.parisni.spark.omop.job.WordCountExample&context=test-context&sync=false&timeout=100000"
{
  "duration": "Job not done yet",
  "classPath": "io.frama.parisni.spark.omop.job.WordCountExample",
  "startTime": "2020-06-12T16:43:46.980+02:00",
  "context": "test-context",
  "status": "STARTED",
  "jobId": "d492de45-dc1e-4d08-996c-19b47fdb986f",
  "contextId": "47607883-d90c-456e-8884-d160f4c41480"
}

curl -X GET --basic --user 'user:pwd' "localhost:8090/jobs/d492de45-dc1e-4d08-996c-19b47fdb986f"
{
  "duration": "0.157 secs",
  "classPath": "io.frama.parisni.spark.omop.job.WordCountExample",
  "startTime": "2020-06-12T16:43:46.980+02:00",
  "context": "test-context",
  "result": {
    "b": 2,
    "a": 2,
    "see": 1,
    "c": 1
  },
  "status": "FINISHED",
  "jobId": "d492de45-dc1e-4d08-996c-19b47fdb986f",
  "contextId": "47607883-d90c-456e-8884-d160f4c41480"
}

Enabling a Spark context with Hive support

By extending the SparkSessionJob trait and creating the context with the SessionContextFactory as below, you get Hive support:

# create a context backed by a Hive-enabled SparkSession
curl -X POST -k --basic --user "${USER}:${PWD}" "${HOST}:${PORT}/contexts/test-context?num-cpu-cores=4&memory-per-node=512m&context-factory=spark.jobserver.context.SessionContextFactory"
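A job against such a context can be sketched as follows. This is a sketch following spark-jobserver's SparkSessionJob API; HiveQueryJob and the sql parameter are illustrative names, not part of the project:

```scala
import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.scalactic._
import spark.jobserver.SparkSessionJob
import spark.jobserver.api.{JobEnvironment, SingleProblem, ValidationProblem}

// Sketch: run a SQL statement against the Hive-enabled SparkSession
// provided by SessionContextFactory.
object HiveQueryJob extends SparkSessionJob {
  type JobData = String              // the SQL text taken from the job config
  type JobOutput = Array[String]

  def validate(spark: SparkSession, runtime: JobEnvironment,
               config: Config): JobData Or Every[ValidationProblem] =
    if (config.hasPath("sql")) Good(config.getString("sql"))
    else Bad(One(SingleProblem("missing 'sql' parameter")))

  def runJob(spark: SparkSession, runtime: JobEnvironment, sql: JobData): JobOutput =
    spark.sql(sql).collect().map(_.toString)  // can read Hive tables
}
```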

Complex results (arrays/maps can be returned as objects):

{
  "jobId": "4e85111f-98aa-4736-8153-11f937d4fd2f",
  "result": [{
    "list": 1,
    "count": "800",
    "resourceType": "Patient"
  }, {
    "list": 2,
    "count": "100",
    "resourceType": "Condition"
  }]
}
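A job produces such a nested result by returning standard Scala collections, which the job server serializes to JSON. A sketch of the relevant parts of a job (field values copied from the example above):

```scala
// Sketch, inside a job extending SparkSessionJob: a JobOutput of nested
// Seq/Map values is serialized to the JSON array/object shape shown above.
type JobOutput = Seq[Map[String, Any]]

def runJob(spark: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = Seq(
  Map("list" -> 1, "count" -> "800", "resourceType" -> "Patient"),
  Map("list" -> 2, "count" -> "100", "resourceType" -> "Condition")
)
```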

Passing config parameters

One way is to store them in the env.conf file, in the passthrough section. They are then accessible from the runtime.contextConfig variable within the runJob method.

They can be accessed from code through val ZK = runtime.contextConfig.getString("passthrough.the.variable")
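A sketch of the corresponding env.conf fragment (the.variable and its value are placeholders):

```
spark {
  context-settings {
    # everything under passthrough ends up in runtime.contextConfig
    passthrough {
      the.variable = "zk-host:2181"
    }
  }
}
```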

Passing extra options

This can be done with the env.sh file. In this example, we configure basic auth for SolrCloud.

MANAGER_EXTRA_SPARK_CONFS="spark.executor.extraJavaOptions=-Dsolr.httpclient.config=<path-to>/auth_qua.txt -Dhdp.version=2.6.5.0-292|spark.driver.extraJavaOptions=-Dhdp.version=2.6.5.0-292,-Dsolr.httpclient.config=<path-to>/auth_qua.txt,spark.yarn.submit.waitAppCompletion=false|spark.files=$appdir/log4jcluster.properties,$conffile"