Apache Spark in Java

Apache Spark

Apache Spark is a fast, general-purpose engine for large-scale data processing, often compared to Hadoop MapReduce.
It provides APIs for Python, Java, R, and Scala.


I tried Spark on a Mac.
Download the binary package (version 2.0) and decompress it.
That’s all the preparation needed.

|- bin
    |- spark-submit
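The preparation steps above might look like this on the command line (the download URL and archive name are assumptions based on the 2.0.0 build for Hadoop 2.7; pick the matching link from the Spark downloads page):

```shell
# Download the prebuilt Spark 2.0.0 binary package (URL is an assumption)
curl -O https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz

# Decompress; the bin/ directory inside contains spark-submit
tar xzf spark-2.0.0-bin-hadoop2.7.tgz
ls spark-2.0.0-bin-hadoop2.7/bin/spark-submit
```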

Use the Java API in a Gradle project

Create a Gradle project with IntelliJ.
The project directory looks like the following:


Add the dependencies needed to work with Spark:

group 'com.atmarkplant'
version '1.0-SNAPSHOT'

apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

jar {
    baseName = 'sparksample'
    version = '0.0.1-SNAPSHOT'
}

Add the spark-core library with the same version as your binary package.

Test program

Let’s make a simple sample:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class Main {
    public static void main(String[] args) {
        String logFile = "/Users/dj110/lib/spark-2.0.0-bin-hadoop2.7/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count lines containing "a"
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        // Count lines containing "b"
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        sc.stop();
    }
}
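The filter-and-count logic is just a predicate over lines, so you can sanity-check it with plain Java 8 streams before involving Spark (no Spark required; `FilterCountSketch` and `countContaining` are hypothetical names used only for this illustration):

```java
import java.util.Arrays;
import java.util.List;

public class FilterCountSketch {
    // Mirrors logData.filter(...).count(): count the lines containing a substring
    static long countContaining(List<String> lines, String sub) {
        return lines.stream().filter(s -> s.contains(sub)).count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("apache spark", "big data", "banana");
        System.out.println("Lines with a: " + countContaining(lines, "a")
                + ", lines with b: " + countContaining(lines, "b"));
    }
}
```

The same shape carries over to the RDD version: `filter` takes the predicate and `count` collapses the result to a long.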

Run with Spark

Let’s make a jar file to run with Spark.
With the build.gradle above, build the jar file.
After the build, the jar file is under the build/libs directory.
Go to the bin directory under your SPARK_HOME and look for spark-submit.
Then run:

./spark-submit --class Main --master local[4]  xxx.jar

This runs the Main class in your jar file; xxx.jar is the path to your jar file.
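Putting it together, the invocation might look like the following (the install and project paths are assumptions; the jar name follows from the baseName and version in the build.gradle above):

```shell
# Run from the bin directory of your Spark install (path is an assumption)
cd ~/lib/spark-2.0.0-bin-hadoop2.7/bin
./spark-submit --class Main --master local[4] \
    ~/IdeaProjects/sparksample/build/libs/sparksample-0.0.1-SNAPSHOT.jar
```

local[4] runs Spark locally with 4 worker threads; local[*] would use all available cores.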