Apache Spark in Java
Apache Spark
Apache Spark is a fast, general-purpose engine for large-scale data processing, similar to Hadoop.
It provides APIs for Python, Java, R, and Scala.
Install
I tried Spark on a Mac.
Download the binary package (version 2.0) and decompress it.
That's all the preparation needed.
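For example, from a terminal (assuming the standard archive name for this release):

tar xzf spark-2.0.0-bin-hadoop2.7.tgz

Inside the extracted directory, spark-submit is under bin: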
spark-2.0.0-bin-hadoop2.7
 |- bin
     |- spark-submit
Use the Java API with a Gradle project
Create a Gradle project with IntelliJ.
The project directory follows the standard Gradle layout.
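A typical layout looks like this (the project name SparkSample is just an example; Main.java is the class we create below):

SparkSample
 |- build.gradle
 |- src
     |- main
         |- java
             |- Main.java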
build.gradle
Add the dependencies needed to work with Spark:
group 'com.atmarkplant'
version '1.0-SNAPSHOT'

apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

jar {
    baseName = 'sparksample'
    version = '0.0.1-SNAPSHOT'
}
Add the spark-core library with the same version as the binary package you downloaded; here that is 2.0.0, built against Scala 2.11 (hence the spark-core_2.11 artifact name).
Test program
Let's write a simple sample:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class Main {
    public static void main(String[] args) {
        String logFile = "/Users/dj110/lib/spark-2.0.0-bin-hadoop2.7/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the file as an RDD of lines; cache it because it is read twice
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count lines containing "a"
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("a");
            }
        }).count();

        // Count lines containing "b"
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("b");
            }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        sc.stop();
    }
}
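If you build with Java 8 or later, the same filters can be written more compactly with lambdas, since Spark's Function is a single-method interface. A minimal equivalent sketch:

// Same counts as above, using Java 8 lambdas
long numAs = logData.filter(s -> s.contains("a")).count();
long numBs = logData.filter(s -> s.contains("b")).count();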
Run with Spark
Let's build a jar file to run with Spark.
With the build.gradle above, build the jar.
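Since the build.gradle above applies the java plugin and configures the jar task, a single command builds it (use plain gradle jar if the project has no Gradle wrapper):

./gradlew jar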
After the build, your jar file is under the build directory:
build/libs/sparksample-0.0.1-SNAPSHOT.jar
Go to the bin directory under your SPARK_HOME, look for spark-submit, and run:
./spark-submit --class Main --master local[4] xxx.jar
This runs the Main class in your jar file; replace xxx.jar with the path to your jar.
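--master local[4] runs Spark locally with four worker threads. To submit the same jar to a standalone cluster instead, point --master at the cluster's master URL; the host and port below are placeholders:

./spark-submit --class Main --master spark://your-master-host:7077 xxx.jar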