Apache Spark in Java

Apache Spark

Apache Spark is a fast, general-purpose engine for large-scale data processing, in the same space as Hadoop.
It provides APIs for Python, Java, R, and Scala.

Install

I tried Spark on a Mac.
Download the binary package (version 2.0) and decompress it.
That is all the preparation needed.

spark-2.0.0-bin-hadoop2.7
|- bin
    |- spark-submit

Use the Java API with a Gradle project

Create a Gradle project with IntelliJ.


build.gradle

Add the dependencies needed to work with Spark.

group 'com.atmarkplant'
version '1.0-SNAPSHOT'

apply plugin: 'java'


repositories {
    mavenCentral()
}

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

jar {
    baseName = 'sparksample'
    version = '0.0.1-SNAPSHOT'
}

Use the spark-core library version that matches the binary package you downloaded (2.0.0 here).

Test program

Let’s write a simple sample.

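import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class Main {
    public static void main(String[] args) {
        String logFile = "/Users/dj110/lib/spark-2.0.0-bin-hadoop2.7/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Cache the RDD because it is scanned twice below
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count the lines that contain "a"
        long numAs = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        // Count the lines that contain "b"
        long numBs = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        sc.stop(); // Release Spark resources before exiting
    }
}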
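
To make clear what the two filter/count calls compute, here is a minimal sketch of the same logic in plain Java streams, with no Spark involved. The class name LineCountSketch, the helper countContaining, and the sample lines are made up for illustration. Spark applies the same idea to the lines of a file, with the work split across tasks.

```java
import java.util.Arrays;
import java.util.List;

public class LineCountSketch {
    // Counts how many lines contain the given substring, mirroring
    // logData.filter(...).count() in the Spark program.
    // Hypothetical helper; not part of the Spark API.
    public static long countContaining(List<String> lines, String needle) {
        return lines.stream()
                .filter(line -> line.contains(needle))
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of README.md.
        List<String> lines = Arrays.asList("apache spark", "big data", "hello");
        long numAs = countContaining(lines, "a");
        long numBs = countContaining(lines, "b");
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
```

A line is counted once per letter no matter how many times the letter appears in it, so the result is a count of lines, not of characters.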

Run with Spark

Let’s build a jar file to run with Spark.
Build the jar file using the build.gradle above.
After the build, the jar file is under the build directory:
build/libs/sparksample-0.0.1-SNAPSHOT.jar
Go to the bin directory under your Spark home, where spark-submit is located,
and run:

./spark-submit --class Main --master local[4] xxx.jar

This runs the Main class in your jar file. Replace xxx.jar with the path to your jar file. The local[4] master runs Spark locally with four worker threads.
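
The value passed to --master can be read as follows: "local" uses one worker thread, "local[N]" uses N threads, and "local[*]" uses as many threads as the machine has logical cores. The sketch below spells out that reading in plain Java. The class LocalMasterSketch and the method localThreads are hypothetical names for illustration, not part of Spark and not how Spark parses the string internally. It handles only these three simple forms.

```java
public class LocalMasterSketch {
    // Returns the number of worker threads implied by a local master string.
    // Hypothetical helper for illustration only.
    public static int localThreads(String master) {
        if (master.equals("local")) {
            return 1; // plain "local" means a single worker thread
        }
        if (master.startsWith("local[") && master.endsWith("]")) {
            String n = master.substring("local[".length(), master.length() - 1);
            if (n.equals("*")) {
                // "local[*]" means one thread per logical core
                return Runtime.getRuntime().availableProcessors();
            }
            return Integer.parseInt(n); // "local[4]" gives 4
        }
        throw new IllegalArgumentException("Not a local master string: " + master);
    }

    public static void main(String[] args) {
        System.out.println(localThreads("local[4]"));
    }
}
```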