Elasticsearch Pinyin Tokenization

On April 10, 2017 (updated April 11, 2017) by yuer

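Besides Chinese word segmentation, a search feature usually also needs to support pinyin and mixed pinyin/Chinese queries. Fortunately, ES has a plugin for this as well.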

Installing the plugin

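Pinyin tokenization uses the elasticsearch-analysis-pinyin plugin. It works by converting Chinese text fields to pinyin with NLP-based techniques, splitting the pinyin into a number of terms, and building an inverted index over each term.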
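
To make the idea of "terms plus an inverted index" concrete, here is a minimal Python sketch. It is not the plugin's code: the pinyin term lists are written by hand, and the second document (张学友) is a hypothetical record added only so the index has more than one entry.

```python
from collections import defaultdict

# Hand-written pinyin terms per document id (an assumption for illustration;
# the real plugin derives terms from its own dictionary and settings).
DOC_TERMS = {
    1: ["liu", "de", "hua"],     # 刘德华
    2: ["zhang", "xue", "you"],  # 张学友 (hypothetical second record)
}


def build_inverted_index(doc_terms):
    """Map every term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(DOC_TERMS)
print(index["liu"])  # ids of the documents that contain the term "liu"
```

A lookup is then just a dictionary access per query term, which is why the choice of terms produced at index time decides what can be found later.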

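I installed the Chinese word segmentation plugin earlier; see the previous post for the prerequisite environment setup. This post only records the installation of the pinyin plugin.

Download the plugin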

[work@78d7fa0c263f ~]$ pwd
/home/work
[work@78d7fa0c263f ~]$ git clone https://github.com/medcl/elasticsearch-analysis-pinyin.git
[work@78d7fa0c263f ~]$ cd elasticsearch-analysis-pinyin

Check out the release tag

[work@78d7fa0c263f elasticsearch-analysis-pinyin]$ git checkout tags/v5.2.2

Build the plugin

[work@78d7fa0c263f elasticsearch-analysis-pinyin]$ /usr/local/apache-maven-3.3.9/bin/mvn package

Install the plugin and restart ES

unzip target/releases/elasticsearch-analysis-pinyin-5.2.2.zip -d /home/work/elasticsearch/elasticsearch0/plugins/pinyin
unzip target/releases/elasticsearch-analysis-pinyin-5.2.2.zip -d /home/work/elasticsearch/elasticsearch1/plugins/pinyin
unzip target/releases/elasticsearch-analysis-pinyin-5.2.2.zip -d /home/work/elasticsearch/elasticsearch2/plugins/pinyin

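After unzipping the plugin into the three ES directories, remember to restart the nodes.

Testing the plugin

Create an index that uses the tokenizer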

<?php

require_once __DIR__ . "/vendor/autoload.php";

// Client
$client = Elasticsearch\ClientBuilder::fromConfig([
    'hosts' => ['localhost:9200', 'localhost:9201', 'localhost:9203'], // ideally, put an HAProxy reverse proxy in front of the ES cluster
    'retries' => 2
]);

$indices = $client->indices();
// Delete the old test index first (only if it exists; deleting a missing index throws)
if ($indices->exists(['index' => 'test'])) {
    $indices->delete(['index' => 'test']);
}
// Create the test index and define the mapping of the singers type at the same time
$indices->create([
    'index' => 'test',
    'body' => [
        // Index settings
        'settings' => [
            "number_of_shards" => 3,    // 3 primary shards
            "number_of_replicas" => 2,  // each shard has 1 primary copy and 2 replica copies
            // Analysis configuration
            "analysis" => [
                // Analyzers
                "analyzer" => [
                    // An analyzer = character filters + a tokenizer + token filters
                    "default_pinyin_analyzer" => [
                        "type" => "custom", // custom analyzer
                        "tokenizer" => "default_pinyin_tokenizer", // the custom pinyin tokenizer defined below
                    ]
                ],
                // Tokenizers
                "tokenizer" => [
                    "default_pinyin_tokenizer" => [
                        "type" => "pinyin", // tokenizer provided by the pinyin plugin
                        // Plugin options (options whose defaults are already suitable are not listed)
                        'keep_separate_first_letter' => true,
                        'keep_joined_full_pinyin' => true,
                        'keep_none_chinese_together' => true,
                        'keep_none_chinese_in_joined_full_pinyin' => true,
                        "limit_first_letter_length" => 16,
                    ]
                ]
            ],
        ],
        // Type mappings
        'mappings' => [
            // The singers type
            'singers' => [
                // Fields
                'properties' => [
                    // Singer name
                    'singer_name' => [
                        'type' => 'string', // string (deprecated in ES 5.x in favor of text)
                        'index' => 'analyzed', // full-text indexed
                        'analyzer' => 'ik_max_word', // Chinese word segmentation
                        'fields' => [
                            'singer_name_pinyin' => [
                                'type' => 'string',    // string
                                'index' => 'analyzed', // full-text indexed
                                'analyzer' => 'default_pinyin_analyzer', // pinyin tokenization
                            ]
                        ]
                    ],
                ]
            ]
        ],
    ]
]);

$client->bulk([
    'index' => 'test',
    'type' => 'singers',
    'body' => [
        // Index action; its metadata is ['_id' => 1]
        ['index' => ['_id' => 1]],
        // Document source
        [
            'singer_name' => '刘德华',
        ],
    ]
]);

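A custom analyzer is defined that uses the pinyin plugin's tokenizer. Note that analyzers are configured at the index level.
A type named singers is created. It has only one field, singer_name, but that field is analyzed in two ways:
- The parent singer_name field uses the ik Chinese tokenizer.
- The sub-field (fields) singer_name_pinyin uses the pinyin analyzer. Its source data is the same as singer_name; only the tokenization differs.
One record is inserted whose singer_name is 刘德华.

Testing the tokenizer

The analyzer default_pinyin_analyzer configured on index test can be tested over HTTP. Let's see which terms the pinyin plugin produces for the stored value "刘德华":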

[work@78d7fa0c263f nuomi-search]$ curl 'http://localhost:9200/test/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=default_pinyin_analyzer&pretty'
{
  "tokens" : [
    {
      "token" : "l",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "d",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "h",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "liudehua",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 7
    }
  ]
}

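Note that the text parameter in the curl command cannot contain raw Chinese characters; it must be URL-encoded for ES to handle it correctly. Before encoding, the text here is the three characters "刘德华".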
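
As a quick check of that encoding, the following Python sketch percent-encodes the keyword with the standard library and builds the same _analyze URL used above. It only constructs the URL and sends no request; the host and analyzer name are the ones from this post.

```python
from urllib.parse import quote

text = "刘德华"
encoded = quote(text)  # percent-encodes the UTF-8 bytes of each character

url = (
    "http://localhost:9200/test/_analyze"
    f"?text={encoded}&analyzer=default_pinyin_analyzer&pretty"
)
print(url)
```

quote() emits uppercase hex digits while the curl command above uses lowercase ones; percent-encoding is case-insensitive, so both forms are equivalent.
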
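In the output above, position is important: it records the order of the terms and the distance between them (for example, liu comes before hua). This matters for queries. If we search for "华刘" or "刘华", we do not want "刘德华" to be returned, because in hua liu / liu hua the shared terms have a different order and distance than in liu de hua.
The output contains several kinds of tokens: the joined first letters (ldh), the joined full pinyin (liudehua), the pinyin of each character (liu, de, hua), and the first letter of each character (l, d, h). Each is indexed separately. Which kinds are produced is controlled by the tokenizer settings. The configuration above reflects my own needs; you can customize it by following the pinyin plugin's README on GitHub.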
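
The token order in that output can be reproduced with a small Python sketch. This is only a model of the observed result, not the plugin's algorithm, and the three-entry pinyin table is a hand-written assumption.

```python
# Hypothetical per-character pinyin table (an assumption for illustration;
# the plugin uses its own dictionary).
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def pinyin_tokens(text):
    """Return (token, position) pairs in the order shown in the output above."""
    syllables = [PINYIN[ch] for ch in text]
    tokens = []
    for syllable in syllables:
        tokens.append(syllable[0])  # first letter of one character, e.g. "l"
        tokens.append(syllable)     # full pinyin of one character, e.g. "liu"
    tokens.append("".join(syllables))                # joined full pinyin
    tokens.append("".join(s[0] for s in syllables))  # joined first letters
    return [(token, position) for position, token in enumerate(tokens)]


for token, position in pinyin_tokens("刘德华"):
    print(position, token)
```

Every token gets its own position here, which is what makes phrase queries against this field awkward, as the scenarios below show.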

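Query tests

Now that we know how "刘德华" is stored, the next step is to check whether various queries work as expected. We prepare two tools for this:

The code that issues the search: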

<?php

require_once __DIR__ . "/vendor/autoload.php";

// Client
$client = Elasticsearch\ClientBuilder::fromConfig([
    'hosts' => ['localhost:9200', 'localhost:9201', 'localhost:9203'], // ideally, put an HAProxy reverse proxy in front of the ES cluster
    'retries' => 2
]);

// Search keyword
$keyword = '刘德华';

// Run the search
$result = $client->search([
    'index' => 'test', // index (analogous to a database)
    'type' => 'singers',  // type (analogous to a table)
    'body' => [ // query body
        'query' => [
            // Full-text match against the pinyin sub-field
            'match' => ['singer_name.singer_name_pinyin' => $keyword],
        ],
    ]
]);

// Print the _analyze URL for inspecting how the keyword is tokenized
echo 'http://localhost:9200/test/_analyze?text=' . urlencode($keyword) . '&analyzer=default_pinyin_analyzer&pretty' . PHP_EOL;
print_r($result);

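A tool for inspecting how a query is tokenized:
- curl 'http://localhost:9200/test/_analyze?text=YOUR_URL_ENCODED_QUERY&analyzer=default_pinyin_analyzer&pretty'

Below I test some common query patterns. After each run of the search code, if the result puzzles me, I use the analyze tool to investigate.

Scenario 1

keyword="的": returns the record, because the pinyin of "的" is de. This is not surprising: searching for "的" in the Nuomi app returns results such as "德克士". The main benefit is tolerance for users who type the wrong character.

Scenario 2

keyword="刘华": returns the record, because the query is tokenized into liu, hua, l and h, each of which hits the index. As a user, though, I do not expect "刘德华" to come back. Why does it?

Because by default a match query ignores the position of the terms and the distance between them. As long as the document contains liu or hua, it matches, regardless of which comes first, whether both appear (any single term is enough), or whether they are adjacent.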
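
The behaviour just described can be modelled in a few lines of Python. This is a simplified sketch of a match query with the default OR operator, not the ES implementation, and it ignores scoring entirely. The term lists are the ones produced by the pinyin tokenizer for the two strings.

```python
def match_any(query_terms, doc_terms):
    """Simplified `match` with the default OR operator: one shared term is enough."""
    return any(term in doc_terms for term in query_terms)


# Terms indexed for 刘德华 and terms produced for the query 刘华.
doc = {"l", "liu", "d", "de", "h", "hua", "liudehua", "ldh"}
query = ["l", "liu", "h", "hua", "liuhua", "lh"]

print(match_any(query, doc))
```

The call returns True because l, liu, h and hua are all present in the document; positions never enter the decision.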

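The behaviour we want is that the document matches only if it contains the adjacent terms "liu hua". The document above contains "liu de hua", with de in between, so the query "刘华" should not match. To get this behaviour, replace match with match_phrase, known as "phrase matching". Its conditions are:

In the document, liu and hua must both appear in the same field.
In the document, the position of hua must be exactly 1 greater than the position of liu.

The terms above are only for illustration. In reality:

The terms for "刘华", in order, are: l,liu,h,hua,liuhua,lh.
The terms for "刘德华", in order, are: l,liu,d,de,h,hua,liudehua,ldh.

Let's analyze:

The terms liuhua and lh of "刘华" do not appear among the terms of "刘德华", so the first condition already fails.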
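
The two phrase-matching conditions can be checked with a short Python sketch. It is a simplified model of an exact (slop 0) phrase match over (term, position) pairs, written for illustration; with_positions is a hypothetical helper that numbers the terms the way the _analyze output does.

```python
def match_phrase(query_tokens, doc_tokens):
    """Simplified exact phrase match over (term, position) pairs.

    Every query term must occur in the document, and the document positions
    must keep the same relative offsets as the query positions.
    """
    positions = {}
    for term, pos in doc_tokens:
        positions.setdefault(term, set()).add(pos)
    if not query_tokens:
        return False
    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        shift = start - first_pos
        if all(qpos + shift in positions.get(term, ())
               for term, qpos in query_tokens):
            return True
    return False


def with_positions(terms):
    """Hypothetical helper: number terms 0, 1, 2, ... as in the _analyze output."""
    return [(term, pos) for pos, term in enumerate(terms)]


doc = with_positions(["l", "liu", "d", "de", "h", "hua", "liudehua", "ldh"])
query = with_positions(["l", "liu", "h", "hua", "liuhua", "lh"])

print(match_phrase(query, doc))                                    # 刘华
print(match_phrase(with_positions(["l", "liu", "d", "de"]), doc))  # a consecutive run
```

The first call returns False: h would have to sit at position 2, where the document has d, and liuhua and lh are missing altogether. The second returns True because those four terms occupy consecutive positions in the document.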

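Scenario 3

keyword="刘德": no match. Why? Because switching to match_phrase introduces a new problem. The tokenization is:

The terms for "刘德", in order, are: l,liu,d,de,liude,ld.
The terms for "刘德华", in order, are: l,liu,d,de,h,hua,liudehua,ldh.

Analysis:

liude and ld do not appear among the terms of "刘德华".
The remaining terms (l,liu,d,de) have the same distance and order as in the terms of "刘德华".

So why can't the pinyin plugin also emit prefixes such as liude and ld as terms when it analyzes "刘德华"?

Switching to the elasticsearch-analysis-lc-pinyin plugin

After repeated checking, I confirmed that the previous plugin does not support indexing pinyin prefixes. Another pinyin plugin, elasticsearch-analysis-lc-pinyin, does implement this, and it ships ready-to-use analyzers, so there is no need to configure a tokenizer yourself.

Installation

Unfortunately, the plugin officially supports only up to Elasticsearch 5.0.1. However, the 5.x versions are compatible, so it is enough to download its source and change the elasticsearch dependency in pom.xml to 5.2.2 (my ES version):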

    <properties>
        <elasticsearch.version>5.2.2</elasticsearch.version>

Then build and install the plugin following the same steps as before.

Rebuilding the index

Recreate the index and type:

<?php

require_once __DIR__ . "/vendor/autoload.php";

// Client
$client = Elasticsearch\ClientBuilder::fromConfig([
    'hosts' => ['localhost:9200', 'localhost:9201', 'localhost:9203'], // ideally, put an HAProxy reverse proxy in front of the ES cluster
    'retries' => 2
]);

$indices = $client->indices();
// Delete the old test index first (only if it exists; deleting a missing index throws)
if ($indices->exists(['index' => 'test'])) {
    $indices->delete(['index' => 'test']);
}
// Create the test index and define the mapping of the singers type at the same time
$indices->create([
    'index' => 'test',
    'body' => [
        // Index settings
        'settings' => [
            "number_of_shards" => 3,    // 3 primary shards
            "number_of_replicas" => 2,  // each shard has 1 primary copy and 2 replica copies
        ],
        // Type mappings
        'mappings' => [
            // The singers type
            'singers' => [
                // Fields
                'properties' => [
                    // Singer name
                    'singer_name' => [
                        'type' => 'string', // string (deprecated in ES 5.x in favor of text)
                        'index' => 'analyzed', // full-text indexed
                        'analyzer' => 'ik_max_word', // Chinese word segmentation
                        'fields' => [
                            'singer_name_pinyin' => [
                                'type' => 'string',    // string
                                'index' => 'analyzed', // full-text indexed
                                'analyzer' => 'lc_index', // analyzer used at index time
                                "search_analyzer" => "lc_search", // analyzer used at query time
                            ]
                        ]
                    ],
                ]
            ]
        ],
    ]
]);

$client->bulk([
    'index' => 'test',
    'type' => 'singers',
    'body' => [
        // Index action; its metadata is ['_id' => 1]
        ['index' => ['_id' => 1]],
        // Document source
        [
            'singer_name' => '刘德华',
        ],
    ]
]);

Indexing uses the lc analyzer lc_index.
Searching uses the lc analyzer lc_search.

Query tests

Continue testing with this code:

<?php

require_once __DIR__ . "/vendor/autoload.php";

// Client
$client = Elasticsearch\ClientBuilder::fromConfig([
    'hosts' => ['localhost:9200', 'localhost:9201', 'localhost:9203'], // ideally, put an HAProxy reverse proxy in front of the ES cluster
    'retries' => 2
]);

// Search keyword
$keyword = '刘德';

// Run the search
$result = $client->search([
    'index' => 'test', // index (analogous to a database)
    'type' => 'singers',  // type (analogous to a table)
    'body' => [ // query body
        'query' => [
            // Phrase match against the pinyin sub-field
            'match_phrase' => ['singer_name.singer_name_pinyin' => $keyword],
        ],
    ]
]);

// Print the _analyze URL for the index-time analyzer (use analyzer=lc_search to inspect the query side)
echo 'http://localhost:9200/test/_analyze?text=' . urlencode($keyword) . '&analyzer=lc_index&pretty' . PHP_EOL;
print_r($result);

Scenario 1

keyword="的": no hit. Let's look at how lc_index and lc_search tokenize it:

Tokenization by lc_index at index time:

[work@78d7fa0c263f nuomi-search]$ curl 'http://localhost:9200/test/_analyze?text=%e7%9a%84&analyzer=lc_index&pretty'
{
  "tokens" : [
    {
      "token" : "的",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "d",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "di",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "d",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

Tokenization by lc_search at query time:

[work@78d7fa0c263f nuomi-search]$ curl 'http://localhost:9200/test/_analyze?text=%e7%9a%84&analyzer=lc_search&pretty'
{
  "tokens" : [
    {
      "token" : "的",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

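The query analyzer does not convert "的" to de, so finding nothing is expected. Failing to match this kind of wrong-character (homophone) query is no great loss.

Pay close attention to the position of each term in the lc_index/lc_search output above: the full-pinyin terms of neighbouring characters sit at adjacent positions, and so do the Chinese-character terms and the first-letter terms. All of this is done to support match_phrase queries!

Scenario 2

keyword=刘华: no hit. With lc_search tokenization a phrase match clearly cannot succeed: at index time (lc_index) "刘" and "华" are 2 positions apart (with 德 between them), whereas in the lc_search output "刘" and "华" are adjacent, so the phrase rule is not satisfied.

Scenario 3

keyword=刘德: hit. The reasoning mirrors scenario 2: "刘" and "德" are adjacent in the document's lc_index term list.

Scenario 4

keyword=ld: hit. As in scenarios 3 and 2, "l" and "d" are adjacent first-letter terms at index time (look again at the positions in the lc_index output above).

Scenario 5

keyword=liu德: hit, for the same reason as scenarios 4, 3 and 2. If you look closely, "liu", "l" and "刘" share the same position in lc_index, and "de", "d" and "德" share the same position as well.

lc_search tokenizes "liu德" into "liu" and "德". Both appear in the lc_index term list, and their positions differ by exactly 1. By now the principle behind this plugin should be clear.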
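
The principle can be summarised in a small Python model. The per-position term sets for 刘德华 and the query token lists below are assumptions based on the description in this post (they are not captured _analyze output), and phrase_hits is a hypothetical helper, not part of the plugin.

```python
# Assumed lc_index result for 刘德华: the character, its full pinyin and its
# first letter all share one position (an assumption based on this post).
DOC_POSITIONS = [
    {"刘", "liu", "l"},
    {"德", "de", "d"},
    {"华", "hua", "h"},
]


def phrase_hits(query_terms, doc_positions):
    """True if the query terms occur at consecutive document positions."""
    n = len(query_terms)
    if n == 0:
        return False
    return any(
        all(query_terms[i] in doc_positions[start + i] for i in range(n))
        for start in range(len(doc_positions) - n + 1)
    )


# Assumed lc_search tokenizations of the keywords tested in this post.
QUERIES = {
    "刘华": ["刘", "华"],
    "刘德": ["刘", "德"],
    "ld": ["l", "d"],
    "liu德": ["liu", "德"],
    "德hua": ["德", "hua"],
}

for keyword, terms in QUERIES.items():
    print(keyword, phrase_hits(terms, DOC_POSITIONS))
```

Because every representation of a character shares that character's position, any mix of characters, full pinyin and first letters matches as long as it covers consecutive characters of the indexed value.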

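Scenario 6

keyword=德hua: hit. The principle is exactly the same: phrase matching requires every query term to appear, at the same distances as in the indexed terms, so this query is bound to match.

That concludes this post. I hope you build a search service you are happy with.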


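6 thoughts on "Elasticsearch Pinyin Tokenization"

邹鹏诚 said:

July 9, 2017, 11:11 AM

I followed your steps, but "刘德" did not match for me either. Could you give me some pointers?

邹鹏诚 said:

July 9, 2017, 11:39 AM

Sorry, I figured it out: I had misspelled the field name. Your article was a real eye-opener and I learned a lot. Thank you very much!

1. yuer said:
  
  August 15, 2017, 7:37 PM
  
  Glad it helped!

彭爽 said:

July 31, 2018, 9:50 AM

How can I do prefix matching? My prefix query does not seem to take effect.

1. yuer said:
  
  July 31, 2018, 4:38 PM
  
  Prefix matching is just a term query.

Pingback: elasticSearch服务搭建和集成java环境 - 栋先生的个人博客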
