The ".NET 開発基盤部会 Wiki" is operated by the "Open棟梁Project" and the "OSSコンソーシアム .NET開発基盤部会".

Table of Contents

Overview

Details

Notebook environments

Given how Apache Spark is typically explored, an interactive notebook environment makes it much easier to learn.

Jupyter Notebook

Azure Databricks

Tutorials

on Docker

on Azure Databricks

Follow the link and try running it on Databricks.

Window operations with Structured Streaming

https://github.com/apache/spark/blob/master/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py
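The script below reads lines from a TCP socket, so something must be listening on localhost:9999 and sending newline-terminated text; the Spark documentation suggests `nc -lk 9999`. If netcat is not available, a minimal Python stand-in (an illustrative sketch, not part of the Spark example) could look like this:

```python
import socket


def serve_lines(lines, host='localhost', port=9999):
    """Minimal stand-in for `nc -lk 9999`: accept one client
    and send it the given lines, one per line."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            for line in lines:
                conn.sendall((line + '\n').encode('utf-8'))

# Example (run in a separate process before starting the Spark script):
# serve_lines(['hello spark', 'hello streaming'])
```

Start this (or netcat) first, then launch the Spark script; the socket source connects as a client and treats each received line as one input row.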

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.functions import window

host = 'localhost'
port = 9999
windowSize = 10
slideSize = 10

if slideSize > windowSize:
    print("<slide duration> must be less than or equal to <window duration>", file=sys.stderr)
    sys.exit(1)
windowDuration = '{} seconds'.format(windowSize)
slideDuration = '{} seconds'.format(slideSize)

spark = SparkSession\
    .builder\
    .appName("StructuredNetworkWordCountWindowed")\
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to host:port
lines = spark\
    .readStream\
    .format('socket')\
    .option('host', host)\
    .option('port', port)\
    .option('includeTimestamp', 'true')\
    .load()

# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array into multiple rows
words = lines.select(
    explode(split(lines.value, ' ')).alias('word'),
    lines.timestamp
)

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, windowDuration, slideDuration),
    words.word
).count().orderBy('window')

# Start running the query that prints the windowed word counts to the console
query = windowedCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .option('truncate', 'false')\
    .start()

query.awaitTermination()
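The window() call in the groupBy assigns each timestamped word to every sliding window that contains its timestamp, so with a 10-second window sliding every 5 seconds each word is counted in two windows. As a plain-Python illustration of that semantics (not Spark's actual implementation; it assumes windows are aligned to the epoch, which matches window()'s default):

```python
from datetime import datetime, timedelta


def sliding_windows(ts, window_size, slide):
    """Return (start, end) pairs of all epoch-aligned windows containing ts."""
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch) % slide
    start = ts - offset  # start of the latest window containing ts
    windows = []
    while start + window_size > ts:
        windows.append((start, start + window_size))
        start -= slide
    return list(reversed(windows))

# A word arriving at t=7s lands in the windows [0s, 10s) and [5s, 15s)
ts = datetime(1970, 1, 1, 0, 0, 7)
for start, end in sliding_windows(ts, timedelta(seconds=10), timedelta(seconds=5)):
    print(start.time(), end.time())
```

Because outputMode('complete') is used, every trigger re-emits the counts for all windows seen so far, ordered by window start.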

References

EnsekiTT Blog

CUBE SUGAR CONTAINER

Qiita

