It looks like I'll be using Elasticsearch at work and was told to read this book, so I'm reading it.

データ分析基盤構築入門[Fluentd、Elasticsearch、Kibanaによるログ収集と可視化]

Authors: 鈴木健太, 吉田健太郎, 大谷純, 道井俊介
Publisher: 技術評論社
Release date: 2017/09/21
Format: Paperback (softcover)

My study log follows.

Chapter 10

A high-level overview of Elasticsearch

Features
- OSS
- Document-oriented
- Distributed system
- Multi-tenant
- RESTful API
- (Near) real-time
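Architecture (terminology)
- Node
  - Corresponds to one Elasticsearch process
- Cluster
  - A group of multiple nodes
- Document
  - The smallest unit of data handled is called a document
- Field
  - Corresponds to a column in an RDBMS. You can specify a type for it
- Index
  - A collection of documents. Apparently data is basically managed per index. So it's not per field
- Type (index type)
  - A feature for logically classifying the documents registered in an index. A document can specify exactly one type. An index can apparently hold multiple types, but there seem to be restrictions I don't fully understand. So one type per index is the basic recommendation, and apparently it's already decided that the next version upgrade (which may already be out) will allow only one type per index
- Shard
  - An index is split into smaller units, and each unit is called a shard. Other data stores apparently call this a partition
  - Assigning shards to the nodes of a cluster distributes the data. The number of shards is the upper limit on how far the data can be spread
  - The number of shards can only be set when the index is created
- Primary shard / replica shard
  - When a write request arrives, Elasticsearch registers the data in the primary shard and then copies it to the replica shards
- Mapping
  - A mapping defines the structure of the data stored in an index. For each type, it describes what the document's fields are named and what type of data each one stores... I see. I don't get it (｀・ω・´)
- Inverted index
  - Elasticsearch's indexes are built on Apache Lucene, apparently. Good to know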
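Units for storing log data
- The basic approach is to create one index per collection of data you want to search, and store the data there
- Log data grows every day, so having to recreate the index each time you scale out is a hassle
- Also, because of how its data structures work, Elasticsearch is not good at operations such as deleting only the data in an index that matches a given condition
- So what should you do...?
- When storing log data, put one day's worth of log data into one index
- Creating a new index each time the date changes lets you handle scale-out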
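To picture the one-index-per-day idea, here is a minimal Python sketch. The prefix apache_log, the name format prefix-YYYY.MM.DD, and the helper names are assumptions made for illustration, not something taken from the book. It only computes names and does not talk to Elasticsearch.

```python
from datetime import date, timedelta

def daily_index_name(prefix, day):
    """Return an index name such as 'apache_log-2017.11.09' for the given day."""
    return f"{prefix}-{day:%Y.%m.%d}"

def indices_to_drop(prefix, today, keep_days, existing):
    """Return the existing daily indices that fall outside the retention window."""
    keep = {daily_index_name(prefix, today - timedelta(days=i)) for i in range(keep_days)}
    return sorted(
        name for name in existing
        if name.startswith(prefix + "-") and name not in keep
    )

today = date(2017, 11, 9)
existing = [daily_index_name("apache_log", today - timedelta(days=i)) for i in range(5)]

print(daily_index_name("apache_log", today))
print(indices_to_drop("apache_log", today, 3, existing))
```

Because old data lives in its own index, dropping a day means deleting a whole index instead of deleting documents by condition, which is the operation the notes above say Elasticsearch is bad at.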

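Chapter 11

This chapter explains the basics of using Elasticsearch. I wanted to spin up a container with Docker and play around, but Docker on my local machine wouldn't run, so for now I'm just reading through the content...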

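Notes
- Runs on Java
- A UUID is assigned to the node at startup
- If you change the setting (node.max_local_storage_nodes: 2), two nodes can be started from the same directory
  - But this approach isn't recommended. Apparently it's better to prepare separate directories and start each node from its own
Starting multiple nodes seems to form a cluster automatically. Awesome!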

Around this point, Docker started working properly.

$ docker run -it --rm -p 9200:9200 -p 9300:9300 \
-e "http.host=0.0.0.0" -e "transport.host=127.0.0.1" \
docker.elastic.co/elasticsearch/elasticsearch:5.1.1

Looks like it started.

$ curl -XGET http://127.0.0.1:9200/ | jq
{
  "error": {
    "root_cause": [
      {
        "type": "security_exception",
        "reason": "missing authentication token for REST request [/]",
        "header": {
          "WWW-Authenticate": "Basic realm=\"security\" charset=\"UTF-8\""
        }
      }
    ],
    "type": "security_exception",
    "reason": "missing authentication token for REST request [/]",
    "header": {
      "WWW-Authenticate": "Basic realm=\"security\" charset=\"UTF-8\""
    }
  },
  "status": 401
}

...?

Elastic StackのX-Packを試す（インストール編）｜ Developers.IO

It seems the Docker image I pulled ships with something called X-Pack, so basic authentication is required.

$ curl -u elastic 'localhost:9200?pretty'

Access as user elastic and enter changeme as the password, and a result comes back.

Do I have to do this every time? What a pain.

There was an option to disable it... phew.

$ docker run -it --rm -p 9200:9200 -p 9300:9300 \
-e "http.host=0.0.0.0" -e "transport.host=127.0.0.1" -e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:5.1.1

$ curl -XGET http://127.0.0.1:9200/ | jq
{
  "name": "fS_DHWK",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "2mg8kb44RGiFDpWLWyiFJw",
  "version": {
    "number": "5.1.1",
    "build_hash": "5395e21",
    "build_date": "2016-12-06T12:36:15.409Z",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}

OK

Checking the cluster state

$ curl -XGET http://localhost:9200/_cluster/health?pretty | jq
{
  "cluster_name": "docker-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 2,
  "active_shards": 2,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 2,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50
}

Working with data

Creating an index

$ curl -XPUT http://localhost:9200/test_index
{"acknowledged":true,"shards_acknowledged":true}

Forgot to use the jq command.

I hastily ran it a second time and got an error.

{
  "error": {
    "root_cause": [
      {
        "type": "index_already_exists_exception",
        "reason": "index [test_index/CvbPEUedT8aJbWciMdaUxQ] already exists",
        "index_uuid": "CvbPEUedT8aJbWciMdaUxQ",
        "index": "test_index"
      }
    ],
    "type": "index_already_exists_exception",
    "reason": "index [test_index/CvbPEUedT8aJbWciMdaUxQ] already exists",
    "index_uuid": "CvbPEUedT8aJbWciMdaUxQ",
    "index": "test_index"
  },
  "status": 400
}

Deleting an index

$ curl -XDELETE http://localhost:9200/test_index
{"acknowledged":true}

Ingesting data

$ curl -XPUT http://localhost:9200/test_index/apache_log/1 -d '
{
"host": "localhost",
"timestamp": "06/May/2014:06:11:48 + 0000",
"verb": "GET",
"request": "/category/finance",
"httpversion": "1.1",
"response": "200",
"bytes": "51"
}
' | jq

{
  "_index": "test_index",
  "_type": "apache_log",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

You can store data by specifying /index name/type name/ID in the URL.

Deleting data works the same way as before.

$ curl -XDELETE http://localhost:9200/test_index/apache_log/1

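Ingesting large amounts of data

Registering documents one at a time is tedious and performs poorly, so use the Bulk API.

It wants the request in a format called NDJSON (first time I've heard of it).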
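Since NDJSON was new to me, here is a minimal Python sketch of what a Bulk API body looks like: one JSON action line followed by one JSON source line per document, each on its own line, with a trailing newline at the end. The helper name build_bulk_body and the second sample document are made up for illustration. The sketch only builds the text and does not send it anywhere.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON body: an action line plus a source line per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The body must end with a newline after the last line.
    return "\n".join(lines) + "\n"

docs = [
    ("1", {"verb": "GET", "request": "/category/finance", "response": "200"}),
    ("2", {"verb": "GET", "request": "/category/sports", "response": "404"}),
]
body = build_bulk_body("test_index", "apache_log", docs)
print(body, end="")
```

A body like this would presumably be sent with a POST to the _bulk endpoint, but I have not tried that here.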

Search

Searching all documents

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {
    "match_all": {}
  }
}' | jq

{
  "took": 60,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "1",
        "_score": 1,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      }
    ]
  }
}

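Search parameters

query - The search conditions. There are several kinds of queries, written in what is called the Query DSL
from - The starting position of the data included in the search results
size - The number of documents included in the search results
sort - Sort specification. asc/desc
_source - Specifies which data to include in the search results
aggs - Aggregation. Runs aggregation processing

Response fields

took - Time taken by the search (ms)
hits - Information about the search results
hits.total - Number of documents that matched the search conditions
hits.hits - Array of matched documents (the documents at the position specified by from / size)

I don't really get it, so let me try it. (Before that, I added about three more documents.)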

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "from": 2
}' | jq

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "3",
        "_score": 1,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      }
    ]
  }
}

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {
    "match_all": {}
  },
  "size": 1
}' | jq

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "2",
        "_score": 1,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      }
    ]
  }
}
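As far as I can tell from the two results above, from and size behave like an offset and a limit over the list of matched documents. A minimal Python sketch of that reading, with made-up document names and a hypothetical page helper (the defaults of 0 and 10 mirror the search defaults):

```python
def page(hits, from_=0, size=10):
    """Return the slice of hits that a from/size pair would select."""
    return hits[from_:from_ + size]

matched = ["doc-a", "doc-b", "doc-c"]

print(page(matched, from_=2))  # skip the first two matches
print(page(matched, size=1))   # keep only the first match
```

Note that hits.total still reports every match (3 in the outputs above), while hits.hits only contains the selected slice.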

# sort didn't work
# Apparently it fails because the field is of type text
$  curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {"match_all": {}}, "sort": [{"bytes": "desc"}]
}' | jq

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [bytes] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_index",
        "node": "hRHOai-CSA2RDZKHGqtsrg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [bytes] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [bytes] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}

# Later on I learned that a field can have multiple sub-fields, which reminded me of this
# If text doesn't work, try sorting on keyword
$ curl -XGET http://localhost:9200/test_index/_search -d '
{"sort": [{"bytes.keyword": "desc"}]}' | jq

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "5",
        "_score": null,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance/hogehogehoghoehoghaoshdifahsidhfiashdifahsidfhiashdfihaisdhfiahsdifhaisdfkakwlejfoawejifjaisjdflkajefjaiohgoiahiehiwhfkasdofaoiefalksjdflkasjdkfalskhfkahsdfhasdhfasdfhasdfhasdfhashdfhasidfhaishdfiahsidfhiashfihasidfhias/category/finance/hogehogehoghoehoghaoshdifahsidhfiashdifahsidfhiashdfihaisdhfiahsdifhaisdfkakwlejfoawejifjaisjdflkajefjaiohgoiahiehiwhfkasdofaoiefalksjdflkasjdkfalskhfkahsdfhasdhfasdfhasdfhasdfhashdfhasidfhaishdfiahsidfhiashfihasidfhiasasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdf",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        },
        "sort": [
          "51"
        ]
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "2",
        "_score": null,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        },
        "sort": [
          "51"
        ]
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "4",
        "_score": null,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance/hogehogehoghoehoghaoshdifahsidhfiashdifahsidfhiashdfihaisdhfiahsdifhaisdfkakwlejfoawejifjaisjdflkajefjaiohgoiahiehiwhfkasdofaoiefalksjdflkasjdkfalskhfkahsdfhasdhfasdfhasdfhasdfhashdfhasidfhaishdfiahsidfhiashfihasidfhias/category/finance/hogehogehoghoehoghaoshdifahsidhfiashdifahsidfhiashdfihaisdhfiahsdifhaisdfkakwlejfoawejifjaisjdflkajefjaiohgoiahiehiwhfkasdofaoiefalksjdflkasjdkfalskhfkahsdfhasdhfasdfhasdfhasdfhashdfhasidfhaishdfiahsidfhiashfihasidfhiasasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfasdf",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        },
        "sort": [
          "51"
        ]
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "1",
        "_score": null,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        },
        "sort": [
          "51"
        ]
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "3",
        "_score": null,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        },
        "sort": [
          "51"
        ]
      }
    ]
  }
}

# It worked!
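One caveat with the keyword sort above: every document here has bytes "51", so the result doesn't reveal it, but a keyword field is sorted as a string, not as a number. A minimal Python sketch of the difference, using made-up values:

```python
byte_values = ["51", "100", "9"]

# Sorting as strings (what a keyword field does) compares character by character.
as_keyword = sorted(byte_values, reverse=True)

# Sorting as numbers (what a numeric field type would do) compares magnitudes.
as_number = sorted(byte_values, key=int, reverse=True)

print(as_keyword)  # ['9', '51', '100']
print(as_number)   # ['100', '51', '9']
```

So for a field like bytes, mapping it as a numeric type is probably the better fix if the sort order matters.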

query string query

Apparently it lets you write complex queries using a special query syntax (?)

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {
    "query_string": {
      "query": "request:category AND response:200"
    }
  }
}' | jq

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.5457982,
    "hits": [
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "2",
        "_score": 0.5457982,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "1",
        "_score": 0.5457982,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "3",
        "_score": 0.5457982,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      }
    ]
  }
}

Aggregation

Apparently it aggregates things for you.

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "request_aggs": {
      "terms": {
        "field": "request",
        "size": 10
      }
    }
  }
}' | jq

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_index",
        "node": "hRHOai-CSA2RDZKHGqtsrg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}

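An error!?

For string fields, Elasticsearch registers the data with the text field type by default. Aggregation can't be run on this field type by default. What...

I don't fully get it, but apparently changing the field to request.keyword makes the aggregation work.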

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "request_aggs": {
      "terms": {
        "field": "request.keyword",
        "size": 10
      }
    }
  }
}' | jq

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "2",
        "_score": 1,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "1",
        "_score": 1,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      },
      {
        "_index": "test_index",
        "_type": "apache_log",
        "_id": "3",
        "_score": 1,
        "_source": {
          "host": "localhost",
          "timestamp": "06/May/2014:06:11:48 + 0000",
          "verb": "GET",
          "request": "/category/finance",
          "httpversion": "1.1",
          "response": "200",
          "bytes": "51"
        }
      }
    ]
  },
  "aggregations": {
    "request_aggs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "/category/finance",
          "doc_count": 3
        }
      ]
    }
  }
}

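Aggregation types

terms - Counts documents for each value of the indexed field
range - Counts documents for each specified range, based on the indexed value
histogram - Counts documents for each specified interval, based on indexed numeric data
stats - Statistics for an indexed numeric field (min, max, document count)
filter - Counts the documents matching the specified query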
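To get a feel for what terms and stats compute, here is a minimal Python sketch that does the same kind of counting over an in-memory list. The sample documents and the function names terms_agg and stats_agg are made up for illustration, and the real aggregations return more fields than this.

```python
from collections import Counter

docs = [
    {"request": "/category/finance", "bytes": 51},
    {"request": "/category/finance", "bytes": 120},
    {"request": "/category/sports", "bytes": 7},
]

def terms_agg(docs, field, size=10):
    """Count documents per distinct field value, most frequent first."""
    counts = Counter(doc[field] for doc in docs)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common(size)]

def stats_agg(docs, field):
    """Compute simple statistics over a numeric field."""
    values = [doc[field] for doc in docs]
    return {"count": len(values), "min": min(values), "max": max(values)}

print(terms_agg(docs, "request"))
print(stats_agg(docs, "bytes"))
```

The shape of the terms result (a list of key / doc_count pairs) matches the buckets array in the response shown earlier.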

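Mapping definition

Elasticsearch apparently does a lot for you by default when you register data, but searching gets painful unless things are defined properly, so the point here is: define your fields properly.

First, check the current field information.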

$ curl -XGET http://localhost:9200/test_index/_mapping | jq

{
  "test_index": {
    "mappings": {
      "apache_log": {
        "properties": {
          "bytes": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "host": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "httpversion": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "request": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "response": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "timestamp": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "verb": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

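For string data, two field types, text and keyword, are generated by default. (!? I see... so this is how the data gets stored...)

Hmm... for someone who has only ever used RDBs, this is so flat that it's hard to get used to...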

Multi Field

$ curl -XGET http://localhost:9200/test_index/_search -d '
{
  "query": {
    "query_string": {
      "query": "request:finance"
    }
  },
  "aggs": {
    "request_aggs": {
      "terms": {
        "field": "request.keyword",
        "size": 10
      }
    }
  }
}' | jq

{
# ... omitted
  "aggregations": {
    "request_aggs": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "/category/finance",
          "doc_count": 3
        }
      ]
    }
  }
}

Huh, I see.
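My understanding of why request:finance matches on the text field while the aggregation needs request.keyword: a text field is split into terms at index time, whereas a keyword field keeps the whole string as one term. A minimal Python sketch of that difference follows. The regex is only a rough stand-in for the real text analysis, not what Elasticsearch actually runs, and both function names are made up.

```python
import re

def rough_text_tokens(value):
    """Very rough stand-in for text analysis: lowercase, split on non-word characters."""
    return re.findall(r"\w+", value.lower())

def keyword_token(value):
    """A keyword field keeps the whole string as a single term."""
    return [value]

request = "/category/finance"
print(rough_text_tokens(request))  # ['category', 'finance']
print(keyword_token(request))      # ['/category/finance']
```

Searching works on the individual terms, while the aggregation buckets use the untouched string, which is why the bucket key above is the full path.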

There's a setting called ignore_above, and apparently you can change it by specifying it in the mapping.

$ curl -XPUT http://localhost:9200/test_index2 -d '
{
  "mappings": {
    "apache_log": {
      "properties": {
        "request": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 1000
            }
          }
        }
      }
    }
  }
}' | jq

$ curl -XGET http://localhost:9200/test_index2 | jq

{
  "test_index2": {
    "aliases": {},
    "mappings": {
      "apache_log": {
        "properties": {
          "request": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 1000
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1510229419825",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "6T3UPgbVR6OIZ0i6VudoWQ",
        "version": {
          "created": "5010199"
        },
        "provided_name": "test_index2"
      }
    }
  }
}

Hmm.
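As I understand it, ignore_above means that strings longer than the limit are simply left out of the keyword sub-field, so they won't show up in keyword sorts or aggregations. A minimal Python sketch of that rule, with a made-up helper name and made-up values:

```python
def indexed_keyword_values(values, ignore_above=256):
    """Return the values that would still be indexed in the keyword sub-field."""
    return [v for v in values if len(v) <= ignore_above]

short = "/category/finance"
long_value = "/category/" + "x" * 500  # 510 characters

print(len(indexed_keyword_values([short, long_value])))                     # 1
print(len(indexed_keyword_values([short, long_value], ignore_above=1000)))  # 2
```

This matters for the very long request values I registered earlier: with the default of 256 they would be skipped in the keyword sub-field, and raising the limit to 1000 keeps them.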

Index Template

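When storing log data, creating one index per day is recommended. But specifying the mapping every time you create an index is tedious.

So you can turn the mapping into a template, and when a new index name matches the template's pattern, that template is used to set up its mapping.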

$ curl -XPUT http://localhost:9200/_template/apache_log_template -d '
{
  "template": "test_*",
  "mappings": {
    "apache_log": {
      "properties": {
        "request": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 1000
            }
          }
        }
      }
    }
  }
}
' | jq

{
  "acknowledged": true
}

Did it get registered...?

Let's try it.

$ curl -XPUT http://localhost:9200/test_index5 | jq

{
  "acknowledged": true,
  "shards_acknowledged": true
}

$ curl -XGET http://localhost:9200/test_index5 | jq
{
  "test_index5": {
    "aliases": {},
    "mappings": {
      "apache_log": {
        "properties": {
          "request": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 1000
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1510230084228",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "iIY3umCBQDmQBD_6M_kUmQ",
        "version": {
          "created": "5010199"
        },
        "provided_name": "test_index5"
      }
    }
  }
}
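The test above shows test_index5 picking up the mapping because its name matches test_*. A minimal Python sketch of that name matching, using fnmatchcase as an approximation (it also understands ? and [], which I am not assuming the template pattern supports). The function name matching_templates is made up.

```python
from fnmatch import fnmatchcase

# Template name -> index name pattern, as registered above.
templates = {"apache_log_template": "test_*"}

def matching_templates(index_name, templates):
    """Return the names of templates whose pattern matches the index name."""
    return sorted(
        name for name, pattern in templates.items()
        if fnmatchcase(index_name, pattern)
    )

print(matching_templates("test_index5", templates))   # ['apache_log_template']
print(matching_templates("sample_index", templates))  # []
```

Combined with daily indices, one template with a pattern covering the daily names would save specifying the mapping each day.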

Ingest Nodes

When registering data, you can run simple preprocessing before it is indexed.

$ curl -XPUT http://localhost:9200/_ingest/pipeline/test_pipeline -d '
{
  "description": "parse number and clientip using grok",
  "processors": [
    {
      "grok": {
        "field": "text",
        "patterns": ["%{NUMBER:duration} %{IP:client}"]
      },
      "remove": {
        "field": "text"
      }
    }
  ]
}' | jq

{
  "acknowledged": true
}

$ curl -XGET http://localhost:9200/_ingest/pipeline/ | jq

{
  "xpack_monitoring_2": {
    "description": "2: This is a placeholder pipeline for Monitoring API version 2 so that future versions may fix breaking changes.",
    "processors": []
  },
  "test_pipeline": {
    "description": "parse number and clientip using grok",
    "processors": [
      {
        "grok": {
          "field": "text",
          "patterns": [
            "%{NUMBER:duration} %{IP:client}"
          ]
        },
        "remove": {
          "field": "text"
        }
      }
    ]
  }
}

# Check that it works

$ curl -XPOST http://localhost:9200/_ingest/pipeline/test_pipeline/_simulate -d '
{
  "docs": [
    {
      "_source": {
        "text": "3.44 55.3.244.1"
      }
    }
  ]
}
' | jq

{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "duration": "3.44",
          "client": "55.3.244.1"
        },
        "_ingest": {
          "timestamp": "2017-11-09T12:30:48.371+0000"
        }
      }
    }
  ]
}

# Registering a document looks like this
$ curl -XPUT http://localhost:9200/sample_index/sample/1?pipeline=test_pipeline -d '
{
  "text": "3.44 55.3.244.1"
}
' | jq

{
  "_index": "sample_index",
  "_type": "sample",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

$ curl -XGET http://localhost:9200/sample_index/sample/1 | jq

{
  "_index": "sample_index",
  "_type": "sample",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "duration": "3.44",
    "client": "55.3.244.1"
  }
}
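To make sure I understand what the pipeline above does to a document, here is a minimal Python sketch that mimics it with a regular expression. The NUMBER and IPV4 regexes are simplified stand-ins I wrote for the grok patterns NUMBER and IP (IPv4 only), not the real pattern definitions, and run_pipeline is a made-up name.

```python
import re

# Simplified stand-ins for the grok patterns NUMBER and IP (IPv4 only).
NUMBER = r"[+-]?\d+(?:\.\d+)?"
IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"
PATTERN = re.compile(rf"(?P<duration>{NUMBER}) (?P<client>{IPV4})")

def run_pipeline(source):
    """Mimic the grok and remove steps on a copy of the document."""
    doc = dict(source)
    match = PATTERN.search(doc["text"])
    if match is None:
        raise ValueError("text did not match the pattern")
    doc.update(match.groupdict())  # grok adds the named captures as fields
    del doc["text"]                # remove drops the original field
    return doc

print(run_pipeline({"text": "3.44 55.3.244.1"}))
```

The result has only duration and client, which lines up with the _source returned for sample_index above.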

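Chapter 12

Memory
- Don't allocate more than half of the machine's physical memory to the Elasticsearch heap
- The actual index is stored as files, so apparently every search involves file access
- That's why you need to leave enough memory on the OS side too, so the file cache can be put to use
Thread pool size
- No need to change it unless there's a specific problem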
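Refresh interval
- When data is registered or updated, an index is built in memory
- Separately, by default this in-memory index is written out as a file once every second
- When registering or updating large amounts of data, file I/O increases and performance may degrade
- That's where the refresh interval comes in. Increasing it is worth considering as a way to improve performance (I don't fully understand this yet)
- Since there are so many writes, I suppose lengthening the interval makes sudden performance drops less likely?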
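Number of indices
- Indices are stored as files, so creating a large number of them can consume a lot of resources
- Keeping down the number of past-data indices that are searchable prevents cluster performance from degrading
- Also avoid making the number of shards larger than necessary (why is that, I wonder?)
- To keep the number of indices down, make it part of operations to delete indices periodically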
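Scale-out
- To scale out, just add nodes. The reasons for adding them are
  - Handling growth in data volume
  - Handling growth in search request volume
- Handling growth in data volume
  - If the data volume grows, why not add nodes to shrink the amount of data each node stores? That's the idea
  - By default, Elasticsearch splits one index into five shards
  - One caveat: the smallest unit that can be assigned to a node is a shard. With a five-shard configuration, even if the cluster has six or more nodes, only five of them can be put to effective use. If one index holds a lot of data and you are able to add more nodes, you need to change the number of shards the index is split into.
    - I don't get this??? What does it mean?
    - Ahh, I think I finally get it. One index with five shards means that with five nodes the ideal distribution gives each node one shard, so adding a sixth node just leaves one node idle
- Handling growth in search requests
  - Increase the replica shards and increase the number of nodes. Yeah, this one is easier to picture.
  - However, note that memory usage goes up and indexing performance degrades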
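The five-shards-on-six-nodes point took me a while, so here is a minimal Python sketch of it. It spreads primary shards over nodes round-robin, which is my own simplification: it ignores replicas and is not how Elasticsearch really decides allocation. The node names and assign_shards are made up.

```python
def assign_shards(num_shards, num_nodes):
    """Spread shard numbers over nodes round-robin and return node -> shards."""
    assignment = {f"node-{n + 1}": [] for n in range(num_nodes)}
    for shard in range(num_shards):
        assignment[f"node-{shard % num_nodes + 1}"].append(shard)
    return assignment

layout = assign_shards(num_shards=5, num_nodes=6)
idle = [node for node, shards in layout.items() if not shards]

print(layout)
print(idle)  # ['node-6']
```

With replicas added, the sixth node would have something to hold, which fits the note that replica shards are the lever for handling more search requests.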
