Implementing Word Count in Python and Java

1. Introduction

In the blog post "Hadoop: Word Count with MapReduce" we learned how to implement word count with Hadoop MapReduce. Now let's look at how to implement the same functionality with Spark.

2. Spark's MapReduce Model

Spark is also a MapReduce-like framework: it adopts a "divide, then aggregate" strategy to process data in a distributed, parallel fashion. Compared with Hadoop MapReduce, however, it has the following two characteristics:

  • It models the framework's input/output and intermediate data, abstracting them into a unified data structure called the Resilient Distributed Dataset (RDD), and builds a set of general-purpose data operations on top of this structure, so that users can easily implement complex data processing pipelines.
  • It uses in-memory mechanisms such as in-memory aggregation and data caching to speed up application execution, which is especially beneficial for iterative and interactive applications.

The Spark community recommends that users adopt the high-level structured APIs such as Dataset and DataFrame (the Structured API) instead of the lower-level RDD API, because the high-level APIs carry more type information (schema), support SQL operations, and can be executed by the highly optimized Spark SQL engine. However, since the RDD API is more fundamental and better suited for illustrating the basic concepts and principles, the code later in this post uses the RDD API.
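
For reference only, here is a minimal PySpark sketch of what the same word count could look like with the Structured API (a sketch under the assumption that pyspark is installed and an input directory of text files exists; the names WordCountDF and input are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("WordCountDF").master("local[2]").getOrCreate()

# Each line of the text files becomes a row with a single "value" column
lines = spark.read.text("input")

# Split each line on spaces and flatten into one word per row
words = lines.select(explode(split(lines.value, " ")).alias("word"))

# Group by word and count; the plan is optimized by the Spark SQL engine
counts = words.groupBy("word").count()
counts.show()

spark.stop()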

A Spark RDD/Dataset is split into multiple partitions. Each partition of an RDD/Dataset maps to one or more data files, and Spark reads the input data into the RDD/Dataset through this mapping.

Because we use the local single-machine, multi-threaded debug mode here, the default number of partitions equals the number of threads the local machine uses: if the code sets local[N] (use N threads), the default is N partitions; if it sets local[*] (use as many threads as there are local CPU cores), the default number of partitions equals the number of local CPU cores. You can check the actual number of partitions by calling getNumPartitions() on the RDD object.
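
For instance, the following minimal PySpark sketch (the app name and data here are just placeholders) starts a local session with 3 threads and prints the partition count of a small RDD:

from pyspark.sql import SparkSession

# Start a local session with 3 worker threads
spark = SparkSession.builder.appName("PartitionCheck").master("local[3]").getOrCreate()

# For an RDD created with parallelize(), the default partition count
# follows the default parallelism, i.e. 3 in this configuration
rdd = spark.sparkContext.parallelize(range(100))
print(rdd.getNumPartitions())  # -> 3

spark.stop()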

In the walkthrough below, we assume that each file corresponds to one partition.

Spark's Map stage is illustrated below:

[Figure: Spark Map stage]

Spark's Reduce stage is illustrated below:

[Figure: Spark Reduce stage]

3. Word Count in Java

The project structure is as follows:

Word-Count-Spark
├─ input
│  ├─ file1.txt
│  ├─ file2.txt
│  └─ file3.txt
├─ output
│  └─ result.txt
├─ pom.xml
├─ src
│  ├─ main
│  │  └─ java
│  │     └─ WordCount.java
│  └─ test
└─ target

The WordCount.java file is as follows:

package com.orion;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.io.*;
import java.nio.file.*;

public class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: WordCount <input directory> <output directory> <number of local threads>");
            System.exit(1);
        }
        String input_path = args[0];
        String output_path = args[1];
        int n_threads = Integer.parseInt(args[2]);

        SparkSession spark = SparkSession.builder()
            .appName("WordCount")
            .master(String.format("local[%d]", n_threads))
            .getOrCreate();

        // Read every file under the input directory; each line becomes one RDD element
        JavaRDD<String> lines = spark.read().textFile(input_path).javaRDD();

        // Map stage: split each line into words and emit (word, 1) pairs
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
        JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
        // Reduce stage: sum the counts of each word
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Collect the results to the driver and write them to output/result.txt
        List<Tuple2<String, Integer>> output = counts.collect();

        String filePath = Paths.get(output_path, "result.txt").toString();
        BufferedWriter out = new BufferedWriter(new FileWriter(filePath));
        for (Tuple2<?, ?> tuple : output) {
            out.write(tuple._1() + ": " + tuple._2() + "\n");
        }
        out.close();
        spark.stop();
    }
}

The pom.xml configuration is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.WordCount</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>WordCount</name>
    <!-- FIXME change it to the project's website -->
    <url>http://www.example.com</url>
    <!-- Centralized version definitions -->
    <properties>
        <scala.version>2.12.10</scala.version>
        <scala.compat.version>2.12</scala.compat.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <project.timezone>UTC</project.timezone>
        <java.version>11</java.version>
        <scoverage.plugin.version>1.4.0</scoverage.plugin.version>
        <site.plugin.version>3.7.1</site.plugin.version>
        <scalatest.version>3.1.2</scalatest.version>
        <scalatest-maven-plugin>2.0.0</scalatest-maven-plugin>
        <scala.maven.plugin.version>4.4.0</scala.maven.plugin.version>
        <maven.compiler.plugin.version>3.8.0</maven.compiler.plugin.version>
        <maven.javadoc.plugin.version>3.2.0</maven.javadoc.plugin.version>
        <maven.source.plugin.version>3.2.1</maven.source.plugin.version>
        <maven.deploy.plugin.version>2.8.2</maven.deploy.plugin.version>
        <nexus.staging.maven.plugin.version>1.6.8</nexus.staging.maven.plugin.version>
        <maven.help.plugin.version>3.2.0</maven.help.plugin.version>
        <maven.gpg.plugin.version>1.6</maven.gpg.plugin.version>
        <maven.surefire.plugin.version>2.22.2</maven.surefire.plugin.version>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <spark.version>3.2.1</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <!-- ======SCALA====== -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
            <plugins>
                <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
                <plugin>
                    <artifactId>maven-clean-plugin</artifactId>
                    <version>3.1.0</version>
                </plugin>
                <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
                <plugin>
                    <artifactId>maven-resources-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.22.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-jar-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-install-plugin</artifactId>
                    <version>2.5.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-deploy-plugin</artifactId>
                    <version>2.8.2</version>
                </plugin>
                <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
                <plugin>
                    <artifactId>maven-site-plugin</artifactId>
                    <version>3.7.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-project-info-reports-plugin</artifactId>
                    <version>3.0.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.0</version>
                    <configuration>
                        <source>11</source>
                        <target>11</target>
                        <fork>true</fork>
                        <executable>/Library/Java/JavaVirtualMachines/jdk-11.0.15.jdk/Contents/Home/bin/javac</executable>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>

Remember to configure the three program arguments input, output, and 3, which stand for the input directory, the output directory, and the number of local threads, respectively (in VSCode this is done in the launch.json file).
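
Below is a minimal sketch of what such a launch.json entry might look like; the name and projectName values here are assumptions, so adjust them to your own setup:

{
    "type": "java",
    "name": "Launch WordCount",
    "request": "launch",
    "mainClass": "com.orion.WordCount",
    "projectName": "WordCount",
    "args": "input output 3"
}

After compiling and running, you can check result.txt under the output directory: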

Tom: 1
Hello: 3
Goodbye: 1
World: 2
David: 1

As we can see, the word count was completed successfully.

4. Word Count in Python

First, install pyspark with pip (version 3.2.1, matching the Spark version used above):

pip install pyspark==3.2.1

Note that PySpark only supports Java 8/11, so do not use a newer Java version. I am using Java 11 here; run java -version to check the Java version on your machine.

(base) orion-orion@MacBook-Pro ~ % java -version
java version “11.0.15” 2022-04-19 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.15+8-LTS-149)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.15+8-LTS-149, mixed mode)

The project structure is as follows:

Word-Count-Spark
├─ input
│  ├─ file1.txt
│  ├─ file2.txt
│  └─ file3.txt
├─ output
│  └─ result.txt
├─ src
│  └─ word_count.py

word_count.py is written as follows:

from pyspark.sql import SparkSession
import sys
import os
from operator import add

if len(sys.argv) != 4:
    print("Usage: WordCount <input directory> <output directory> <number of local threads>", file=sys.stderr)
    sys.exit(1)
input_path, output_path, n_threads = sys.argv[1], sys.argv[2], int(sys.argv[3])

spark = SparkSession.builder.appName("WordCount").master("local[%d]" % n_threads).getOrCreate()

# Read every file under the input directory; each row holds one line of text
lines = spark.read.text(input_path).rdd.map(lambda r: r[0])

# Map stage: split lines into words and emit (word, 1); Reduce stage: sum the counts per word
counts = lines.flatMap(lambda s: s.split(" "))\
    .map(lambda word: (word, 1))\
    .reduceByKey(add)

# Collect the results to the driver and write them to output/result.txt
output = counts.collect()

with open(os.path.join(output_path, "result.txt"), "wt") as f:
    for (word, count) in output:
        f.write(str(word) + ": " + str(count) + "\n")

spark.stop()

After running it with python word_count.py input output 3, you can find the corresponding output file result.txt in the output directory:

Hello: 3
World: 2
Goodbye: 1
David: 1
Tom: 1

As we can see, the word count was completed successfully.
