
Lucene (11): Highlighting Search Results with Highlighter

Introduction to Highlighter

When we run a search, we want the places in each result that match the query to be emphasized, as in the effect below.

Here the search term is "区块" ("blocks"); the search engine colors the user's input differently in the results it shows. This visual distinction between ordinary text and the matched input is what highlighting means.

The benefits of doing this:

Visually, it makes the text blocks that match the search easy to spot;
The results page is friendlier to users.

Lucene provides the highlighter component to produce this effect.

Highlighting query keywords with highlighter

The highlighter package contains the functionality for highlighting query terms in result pages. The Highlighter class is the core component of the package; it works together with a Fragmenter, a fragment Scorer, and a Formatter to let users customize how highlights are rendered.
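In a nutshell, the wiring looks like this (a minimal sketch; query stands for whatever Lucene Query you have built):

QueryScorer scorer = new QueryScorer(query);               // scores fragments against the query terms
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(); // renders matches, <B>...</B> by default
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.setTextFragmenter(new SimpleFragmenter(100));  // cut the text into ~100-character fragments

The full example program below shows these pieces in context.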

Example program

Here I reuse the directory-file index built in an earlier post.

package com.lucene.search.util;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.BytesRef;

public class HighlighterTest {
    public static void main(String[] args) {
        IndexSearcher searcher;
        TopDocs docs;
        ExecutorService service = Executors.newCachedThreadPool();
        try {
            searcher = SearchUtil.getMultiSearcher("index", service);
            Term term = new Term("content", new BytesRef("lucene"));
            TermQuery termQuery = new TermQuery(term);
            docs = SearchUtil.getScoreDocsByPerPage(1, 30, searcher, termQuery);
            ScoreDoc[] hits = docs.scoreDocs;
            QueryScorer scorer = new QueryScorer(termQuery);
            SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>", "</B>"); // render matches as <B>keyword</B>; this is also the default format
            Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
            highlighter.setTextFragmenter(new SimpleFragmenter(20)); // number of characters per returned fragment
            Analyzer analyzer = new StandardAnalyzer();
            for (int i = 0; i < hits.length; i++) {
                Document doc = searcher.doc(hits[i].doc);
                String str = highlighter.getBestFragment(analyzer, "content", doc.get("content"));
                System.out.println(str);
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        } catch (InvalidTokenOffsetsException e) {
            e.printStackTrace();
        } finally {
            service.shutdown();
        }
    }
}

How Lucene's highlighter works:

A Highlighter object is created from a Formatter and a Scorer: the formatter defines how highlights are rendered, and the scorer defines how matches are scored.

The scoring algorithm first obtains each term's weight for the matching document from the query; on that basis it walks the text token by token, recording how often each query term occurs and at which positions it appears in the field (which is what makes highlighting possible). The per-token scoring code is:

public float getTokenScore() {
    position += posIncAtt.getPositionIncrement(); // track the current token position
    String termText = termAtt.toString();
    WeightedSpanTerm weightedSpanTerm;
    if ((weightedSpanTerm = fieldWeightedSpanTerms.get(termText)) == null) {
        return 0;
    }
    if (weightedSpanTerm.positionSensitive &&
            !weightedSpanTerm.checkPosition(position)) {
        return 0;
    }
    float score = weightedSpanTerm.getWeight(); // look up the term's weight
    // found a query term - is it unique in this doc?
    if (!foundTerms.contains(termText)) { // count each term only once
        totalScore += score;
        foundTerms.add(termText);
    }
    return score;
}

The formatter's job: it examines each group of tokens in the text; if the totalScore obtained by the scorer is greater than 0, i.e. the query terms occur in that group, it wraps the text in the form preTag + matched text + postTag.

The exact code:

public String highlightTerm(String originalText, TokenGroup tokenGroup) {
    if (tokenGroup.getTotalScore() <= 0) {
        return originalText;
    }
    // Allocate StringBuilder with the right number of characters from the
    // beginning, to avoid char[] allocations in the middle of appends.
    StringBuilder returnBuffer = new StringBuilder(preTag.length() + originalText.length() + postTag.length());
    returnBuffer.append(preTag);
    returnBuffer.append(originalText);
    returnBuffer.append(postTag);
    return returnBuffer.toString();
}

The default tags are <B> and </B>, so by default matches are rendered as <B>keyword</B>.
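If bold tags are not what you want, pass your own pre/post tags; a small sketch (the CSS class name here is purely illustrative):

SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");
Highlighter highlighter = new Highlighter(formatter, new QueryScorer(termQuery));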

Given the scorer and the formatter, Highlighter analyzes the document and builds the highlighted result through getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments). The process is as follows (a sketch of calling this method directly follows the list):

The scorer first initializes the positions at which the query terms occur, and a PositionIncrementAttribute is attached to the token stream to record each token's position;
The text is scanned token by token, checking whether the amount of analyzed text exceeds the configured limit; content beyond the limit is treated as too long and skipped;
When a matching token is found, the text is first cut into fragments (the fragment size being the value passed to setTextFragmenter), and the formatter's highlightTerm method is called to rebuild the matched text;
The text between the current match and the next one is carried over into the fragment unchanged.
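A minimal sketch of driving this entry point directly (the "content" field, analyzer, and highlighter are carried over from the example above; the getBestFragment call used earlier is a convenience wrapper around this method):

// request up to 3 fragments, merging contiguous ones;
// TextFragment lives in org.apache.lucene.search.highlight
Analyzer analyzer = new StandardAnalyzer();
String text = doc.get("content");
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
TextFragment[] fragments = highlighter.getBestTextFragments(tokenStream, text, true, 3);
for (TextFragment fragment : fragments) {
    if (fragment != null && fragment.getScore() > 0) {
        System.out.println(fragment.toString());
    }
}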

The query utility class

package com.lucene.search.util;

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

/** Utility class for querying Lucene indexes.
 * @author lenovo
 */
public class SearchUtil {
    /** Get an IndexSearcher over all index directories under a parent path.
     * @param parentPath parent directory containing one index per subdirectory
     * @param service executor used for parallel search
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByParentPath(String parentPath, ExecutorService service) throws IOException {
        File[] files = new File(parentPath).listFiles();
        IndexReader[] readers = new IndexReader[files.length];
        for (int i = 0; i < files.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
        }
        MultiReader reader = new MultiReader(readers);
        return new IndexSearcher(reader, service);
    }
    /** Multi-directory, multi-threaded search.
     * @param parentPath parent index directory
     * @param service executor used for parallel search
     * @return
     * @throws IOException
     */
    public static IndexSearcher getMultiSearcher(String parentPath, ExecutorService service) throws IOException {
        File file = new File(parentPath);
        File[] files = file.listFiles();
        IndexReader[] readers = new IndexReader[files.length];
        for (int i = 0; i < files.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
        }
        MultiReader multiReader = new MultiReader(readers);
        return new IndexSearcher(multiReader, service);
    }
    /** Get an IndexReader for the given index path.
     * @param indexPath
     * @return
     * @throws IOException
     */
    public static DirectoryReader getIndexReader(String indexPath) throws IOException {
        return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
    }
    /** Get an IndexSearcher for the given index path.
     * @param indexPath
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByIndexPath(String indexPath, ExecutorService service) throws IOException {
        IndexReader reader = getIndexReader(indexPath);
        return new IndexSearcher(reader, service);
    }
    /** If the index directory may have changed, use this to get a fresh IndexSearcher;
     * reusing the old reader when nothing changed keeps resource usage low.
     * @param oldSearcher
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher, ExecutorService service) throws IOException {
        DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        if (newReader == null) { // index unchanged, keep the old searcher
            return oldSearcher;
        }
        return new IndexSearcher(newReader, service);
    }
    /** Combine queries like SQL IN: any sub-query may match.
     * @param querys
     * @return
     */
    public static Query getMultiQueryLikeSqlIn(Query... querys) {
        BooleanQuery query = new BooleanQuery();
        for (Query subQuery : querys) {
            query.add(subQuery, Occur.SHOULD);
        }
        return query;
    }
    /** Combine queries like SQL AND: every sub-query must match.
     * @param querys
     * @return
     */
    public static Query getMultiQueryLikeSqlAnd(Query... querys) {
        BooleanQuery query = new BooleanQuery();
        for (Query subQuery : querys) {
            query.add(subQuery, Occur.MUST);
        }
        return query;
    }
    /** Build a query for the given field.
     * @param field field name
     * @param fieldType field type (int/double/float/long, anything else is parsed as text)
     * @param queryStr query string; for range queries, "min|max"
     * @param range whether this is a range query
     * @return
     */
    public static Query getQuery(String field, String fieldType, String queryStr, boolean range) {
        Query q = null;
        try {
            if (queryStr != null && !"".equals(queryStr)) {
                if (range) {
                    String[] strs = queryStr.split("\\|");
                    if ("int".equals(fieldType)) {
                        int min = Integer.parseInt(strs[0]);
                        int max = Integer.parseInt(strs[1]);
                        q = NumericRangeQuery.newIntRange(field, min, max, true, true);
                    } else if ("double".equals(fieldType)) {
                        double min = Double.parseDouble(strs[0]);
                        double max = Double.parseDouble(strs[1]);
                        q = NumericRangeQuery.newDoubleRange(field, min, max, true, true);
                    } else if ("float".equals(fieldType)) {
                        float min = Float.parseFloat(strs[0]);
                        float max = Float.parseFloat(strs[1]);
                        q = NumericRangeQuery.newFloatRange(field, min, max, true, true);
                    } else if ("long".equals(fieldType)) {
                        long min = Long.parseLong(strs[0]);
                        long max = Long.parseLong(strs[1]);
                        q = NumericRangeQuery.newLongRange(field, min, max, true, true);
                    }
                } else {
                    if ("int".equals(fieldType)) {
                        q = NumericRangeQuery.newIntRange(field, Integer.parseInt(queryStr), Integer.parseInt(queryStr), true, true);
                    } else if ("double".equals(fieldType)) {
                        q = NumericRangeQuery.newDoubleRange(field, Double.parseDouble(queryStr), Double.parseDouble(queryStr), true, true);
                    } else if ("float".equals(fieldType)) {
                        q = NumericRangeQuery.newFloatRange(field, Float.parseFloat(queryStr), Float.parseFloat(queryStr), true, true);
                    } else {
                        Analyzer analyzer = new StandardAnalyzer();
                        q = new QueryParser(field, analyzer).parse(queryStr);
                    }
                }
            } else {
                q = new MatchAllDocsQuery();
            }
            System.out.println(q);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return q;
    }
    /** Build a TermQuery from a field name and value.
     * @param fieldName
     * @param fieldValue
     * @return
     */
    public static Query getQuery(String fieldName, Object fieldValue) {
        Term term = new Term(fieldName, new BytesRef(fieldValue.toString()));
        return new TermQuery(term);
    }
    /** Fetch the full document for a doc ID.
     * @param searcher
     * @param docID
     * @return
     * @throws IOException
     */
    public static Document getDefaultFullDocument(IndexSearcher searcher, int docID) throws IOException {
        return searcher.doc(docID);
    }
    /** Fetch only the listed fields of the document for a doc ID.
     * @param searcher
     * @param docID
     * @param listField
     * @return
     * @throws IOException
     */
    public static Document getDocumentByListField(IndexSearcher searcher, int docID, Set<String> listField) throws IOException {
        return searcher.doc(docID, listField);
    }
    /** Paged query.
     * @param page current page number (1-based)
     * @param perPage hits per page
     * @param searcher the searcher
     * @param query the query
     * @return
     * @throws IOException
     */
    public static TopDocs getScoreDocsByPerPage(int page, int perPage, IndexSearcher searcher, Query query) throws IOException {
        if (query == null) {
            System.out.println(" Query is null return null ");
            return null;
        }
        // find the last hit of the previous page, then continue the search after it
        ScoreDoc before = null;
        if (page != 1) {
            TopDocs docsBefore = searcher.search(query, (page - 1) * perPage);
            ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
            if (scoreDocs.length > 0) {
                before = scoreDocs[scoreDocs.length - 1];
            }
        }
        return searcher.searchAfter(before, query, perPage);
    }
    public static TopDocs getScoreDocs(IndexSearcher searcher, Query query) throws IOException {
        return searcher.search(query, getMaxDocId(searcher));
    }
    /** Highlight a field in every hit for a keyword.
     * @param searcher
     * @param field field to search and highlight
     * @param keyword query keyword
     * @param preTag tag inserted before each match
     * @param postTag tag inserted after each match
     * @param fragmentSize number of characters per fragment
     * @return
     * @throws IOException
     * @throws InvalidTokenOffsetsException
     */
    public static String[] highlighter(IndexSearcher searcher, String field, String keyword, String preTag, String postTag, int fragmentSize) throws IOException, InvalidTokenOffsetsException {
        Term term = new Term(field, new BytesRef(keyword)); // build the query from the given field and keyword
        TermQuery termQuery = new TermQuery(term);
        TopDocs docs = getScoreDocs(searcher, termQuery);
        ScoreDoc[] hits = docs.scoreDocs;
        QueryScorer scorer = new QueryScorer(termQuery);
        SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter(preTag, postTag); // render matches as preTag+keyword+postTag; <B>keyword</B> is the default
        Highlighter highlighter = new Highlighter(simpleHtmlFormatter, scorer);
        highlighter.setTextFragmenter(new SimpleFragmenter(fragmentSize)); // number of characters per returned fragment
        Analyzer analyzer = new StandardAnalyzer();
        String[] result = new String[hits.length];
        for (int i = 0; i < result.length; i++) {
            Document doc = searcher.doc(hits[i].doc);
            result[i] = highlighter.getBestFragment(analyzer, field, doc.get(field));
        }
        return result;
    }
    /** Count the documents in the index; equivalent to a MatchAllDocsQuery count.
     * @param searcher
     * @return
     */
    public static int getMaxDocId(IndexSearcher searcher) {
        return searcher.getIndexReader().maxDoc();
    }
}
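With the utility class in place, the test program at the top reduces to a few lines (the "index" path, "content" field, and keyword are the same assumptions as before):

ExecutorService service = Executors.newCachedThreadPool();
try {
    IndexSearcher searcher = SearchUtil.getMultiSearcher("index", service);
    // highlight the "content" field of every hit for "lucene", 20 characters per fragment
    String[] fragments = SearchUtil.highlighter(searcher, "content", "lucene", "<B>", "</B>", 20);
    for (String fragment : fragments) {
        System.out.println(fragment);
    }
} catch (IOException | InvalidTokenOffsetsException e) {
    e.printStackTrace();
} finally {
    service.shutdown();
}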

Source code download:

http://download.csdn.net/detail/wuyinggui10000/8726407
