Lucene Index Deletion Policies: A Source Code Walkthrough


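Starting today, we begin looking at how Lucene builds an index. Index construction involves so many moving parts that starting from the entry point, IndexWriter, would make it hard to explain clearly. Instead, I will break the whole into pieces and introduce the components one at a time, trying to keep each article readable on its own and relying only on material I have already covered. Even so, some parts may only make sense once you see the overall flow. For those, keep a rough impression for now, and I will point back to them when we reach the related material.

Today we look at the logic around deleting index files.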

IndexCommit

In Lucene, index data that needs to be persisted goes through a commit operation, which writes a segments_N file recording the index information for that commit.
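One detail that helps when reading an index directory: the N in segments_N is the commit generation written in base 36, so generation 10 appears as segments_a rather than segments_10. The sketch below reproduces that naming rule. SegmentsFileNames is a hypothetical helper written for this article, not a Lucene class (Lucene's own logic lives in IndexFileNames and SegmentInfos).

```java
// Hypothetical helper, not part of Lucene: maps a commit generation
// to its segments_N file name and back, using base-36 encoding.
final class SegmentsFileNames {
  private static final String PREFIX = "segments_";

  /** Returns the segments_N file name for a generation >= 1. */
  static String fileNameFor(long generation) {
    if (generation < 1) {
      throw new IllegalArgumentException("generation must be >= 1: " + generation);
    }
    return PREFIX + Long.toString(generation, Character.MAX_RADIX);
  }

  /** Parses the generation back out of a segments_N file name. */
  static long generationOf(String fileName) {
    if (!fileName.startsWith(PREFIX)) {
      throw new IllegalArgumentException("not a segments file: " + fileName);
    }
    return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
  }
}
```

For the small generations used in this article (1, 2, 3) the base-36 and decimal spellings are identical, which is why the examples simply show segments_1, segments_2 and segments_3.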

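Once a commit has produced its segments_N file, it is represented by an IndexCommit. IndexCommit is an abstract class that defines what information can be obtained from a commit:

    public abstract class IndexCommit implements Comparable<IndexCommit> {
        // The segments_N file that belongs to this commit
        public abstract String getSegmentsFileName();

        // All index files referenced by this commit
        public abstract Collection<String> getFileNames() throws IOException;

        // The Directory that holds the index
        public abstract Directory getDirectory();

        // Delete this commit. As we will see later, deleting really just
        // decrements the reference counts of the files the commit refers to
        public abstract void delete();

        // Whether this commit has been deleted
        public abstract boolean isDeleted();

        // How many segments this commit references
        public abstract int getSegmentCount();

        // The N in segments_N
        public abstract long getGeneration();

        // A commit can carry user-defined data
        public abstract Map<String, String> getUserData() throws IOException;

        // Used to read the index data that belongs to this commit
        StandardDirectoryReader getReader() {
            return null;
        }
    }

IndexCommit has three implementations: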

  • CommitPoint
  • ReaderCommit
  • SnapshotCommitPoint

Each of the three has its own use case, and I will cover them in detail when they come up. This article involves SnapshotCommitPoint, which is explained further down.

IndexDeletionPolicy

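Over the lifetime of an index there can be many commits, and therefore many segments_N files. Whether each of them is kept or deleted is decided by an IndexDeletionPolicy. Here is the definition of that abstract class:

    public abstract class IndexDeletionPolicy {
        protected IndexDeletionPolicy() {}

        // Called with all existing commits when an index is opened
        public abstract void onInit(List<? extends IndexCommit> commits) throws IOException;

        // Called with all commits whenever a new commit is made
        public abstract void onCommit(List<? extends IndexCommit> commits) throws IOException;
    }

As the definition shows, a deletion policy is applied in only two places. The first is when an existing index is opened, at which point its commits are processed by the currently configured IndexDeletionPolicy. The second is when a new commit is made, which is used as an opportunity to process all commits. Lucene ships four deletion policies, which fall into three groups: NoDeletionPolicy, KeepOnlyLastCommitDeletionPolicy, and the two snapshot-based policies.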

NoDeletionPolicy

NoDeletionPolicy keeps every commit, so you end up with as many segments_N files as you have made commits. Here is an example:

    import java.io.File;
    import java.io.IOException;
    import java.util.Random;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.NoDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class DeletionPolicyTest {
        private static final Random RANDOM = new Random();

        public static void main(String[] args) throws IOException {
            Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            indexWriterConfig.setUseCompoundFile(true);
            indexWriterConfig.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // First commit, writes segments_1
            indexWriter.commit();

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // Second commit, writes segments_2
            indexWriter.commit();

            indexWriter.close();
        }

        private static Document getDoc(int... point) {
            Document doc = new Document();
            IntPoint intPoint = new IntPoint("point", point);
            doc.add(intPoint);
            return doc;
        }
    }

The example commits twice. Under NoDeletionPolicy, the index directory afterwards contains two segments_N files, segments_1 and segments_2.

The implementation of NoDeletionPolicy is very simple: it is a singleton, and both onCommit and onInit do nothing:

    public final class NoDeletionPolicy extends IndexDeletionPolicy {
        public static final IndexDeletionPolicy INSTANCE = new NoDeletionPolicy();

        private NoDeletionPolicy() {}

        @Override
        public void onCommit(List<? extends IndexCommit> commits) {}

        @Override
        public void onInit(List<? extends IndexCommit> commits) {}
    }

KeepOnlyLastCommitDeletionPolicy

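KeepOnlyLastCommitDeletionPolicy is Lucene's default deletion policy. It keeps only the most recent commit, so no matter how many commits you make, the index directory contains only the segments_N file with the largest N. After two commits under this policy, only segments_2 remains. To reproduce this, replace the deletion policy in the example above with KeepOnlyLastCommitDeletionPolicy, and remember to empty the index directory first.

The implementation is also fairly simple: every commit except the last one is deleted: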

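    public final class KeepOnlyLastCommitDeletionPolicy extends IndexDeletionPolicy {
        public KeepOnlyLastCommitDeletionPolicy() {}

        @Override
        public void onInit(List<? extends IndexCommit> commits) {
            onCommit(commits);
        }

        // commits is sorted from oldest to newest
        @Override
        public void onCommit(List<? extends IndexCommit> commits) {
            // Keep only the newest commit
            int size = commits.size();
            for (int i = 0; i < size - 1; i++) {
                commits.get(i).delete();
            }
        }
    }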
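A natural variation on this policy is to keep the newest N commits rather than just one. The sketch below models that idea in plain Java so that it runs without Lucene on the classpath. SimpleCommit and KeepLastNPolicy are hypothetical stand-ins for IndexCommit and an IndexDeletionPolicy subclass, not Lucene classes. As in the real API, the list is assumed to be sorted from oldest to newest.

```java
import java.util.List;

// Hypothetical stand-in for IndexCommit: only tracks a generation and a deleted flag.
final class SimpleCommit {
  private final long generation;
  private boolean deleted;

  SimpleCommit(long generation) {
    this.generation = generation;
  }

  long getGeneration() {
    return generation;
  }

  void delete() {
    deleted = true;
  }

  boolean isDeleted() {
    return deleted;
  }
}

// Hypothetical policy: keeps the newest numToKeep commits and deletes the rest.
final class KeepLastNPolicy {
  private final int numToKeep;

  KeepLastNPolicy(int numToKeep) {
    if (numToKeep < 1) {
      throw new IllegalArgumentException("numToKeep must be >= 1");
    }
    this.numToKeep = numToKeep;
  }

  void onInit(List<SimpleCommit> commits) {
    onCommit(commits);
  }

  // commits is assumed to be sorted from oldest to newest
  void onCommit(List<SimpleCommit> commits) {
    int toDelete = commits.size() - numToKeep;
    for (int i = 0; i < toDelete; i++) {
      commits.get(i).delete();
    }
  }
}
```

With numToKeep set to 1 this behaves like KeepOnlyLastCommitDeletionPolicy. A real version would extend IndexDeletionPolicy and accept List<? extends IndexCommit> in both callbacks.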

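The two snapshot-related deletion policies

There are two snapshot-related deletion policies, SnapshotDeletionPolicy and PersistentSnapshotDeletionPolicy. The first keeps its snapshot information only in memory, and the second persists it. Both wrap another IndexDeletionPolicy, which carries out the actual deletion. What they add is the ability to take a snapshot of the current latest commit. As long as a snapshot exists, every index file associated with it is retained unconditionally.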

SnapshotDeletionPolicy

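Example

    import java.io.File;
    import java.io.IOException;
    import java.util.Random;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.SnapshotDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SnapshotDeletionPolicyTest {
        private static final Random RANDOM = new Random();

        public static void main(String[] args) throws IOException {
            Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            indexWriterConfig.setUseCompoundFile(true);
            SnapshotDeletionPolicy snapshotDeletionPolicy =
                new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
            indexWriterConfig.setIndexDeletionPolicy(snapshotDeletionPolicy);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // First commit, writes segments_1
            indexWriter.commit();
            // Snapshot segments_1 so that it is retained unconditionally
            snapshotDeletionPolicy.snapshot();

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // Second commit, writes segments_2
            indexWriter.commit();

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // Third commit, writes segments_3
            indexWriter.commit();

            indexWriter.close();
        }

        private static Document getDoc(int... point) {
            Document doc = new Document();
            IntPoint intPoint = new IntPoint("point", point);
            doc.add(intPoint);
            return doc;
        }
    }

In this example we use a SnapshotDeletionPolicy that wraps KeepOnlyLastCommitDeletionPolicy and make three commits. On its own, KeepOnlyLastCommitDeletionPolicy would keep only the last one, but because we took a snapshot of the first commit, the first commit is retained as well. The directory ends up with both segments_1 and segments_3.

Next, let's see how SnapshotDeletionPolicy is implemented. The mechanism that protects a snapshotted commit from deletion is reference counting. SnapshotDeletionPolicy records how many snapshots have been taken of each commit, and a commit can only be deleted when its reference count is 0.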

Member variables

    // Key: generation of an IndexCommit; value: number of snapshots of that commit.
    // Only commits referenced by a snapshot are recorded here, so any entry
    // in refCounts has a count of at least 1
    protected final Map<Long, Integer> refCounts = new HashMap<>();

    // Key: generation of an IndexCommit; value: the IndexCommit itself
    protected final Map<Long, IndexCommit> indexCommits = new HashMap<>();

    // SnapshotDeletionPolicy only adds snapshot support;
    // the deletion logic comes from the wrapped primary policy
    private final IndexDeletionPolicy primary;

    // The most recent commit; snapshots are only ever taken of this one
    protected IndexCommit lastCommit;

    // Initialization flag; onInit must be called before the instance is used
    private boolean initCalled;

Taking a snapshot

A snapshot is always taken of the current latest commit:

    public synchronized IndexCommit snapshot() throws IOException {
        if (!initCalled) {
            throw new IllegalStateException(
                "this instance is not being used by IndexWriter; be sure to use the instance returned from writer.getConfig().getIndexDeletionPolicy()");
        }
        if (lastCommit == null) {
            throw new IllegalStateException("No index commit to snapshot");
        }
        // Increment the reference count of lastCommit
        incRef(lastCommit);
        return lastCommit;
    }

    protected synchronized void incRef(IndexCommit ic) {
        long gen = ic.getGeneration();
        Integer refCount = refCounts.get(gen);
        int refCountInt;
        if (refCount == null) {
            // First reference to this commit
            indexCommits.put(gen, lastCommit);
            refCountInt = 0;
        } else {
            refCountInt = refCount.intValue();
        }
        // Increment the reference count by 1
        refCounts.put(gen, refCountInt + 1);
    }

Releasing a snapshot

    public synchronized void release(IndexCommit commit) throws IOException {
        long gen = commit.getGeneration();
        releaseGen(gen);
    }

    protected void releaseGen(long gen) throws IOException {
        if (!initCalled) {
            throw new IllegalStateException(
                "this instance is not being used by IndexWriter; be sure to use the instance returned from writer.getConfig().getIndexDeletionPolicy()");
        }
        Integer refCount = refCounts.get(gen);
        if (refCount == null) {
            throw new IllegalArgumentException("commit gen=" + gen + " is not currently snapshotted");
        }
        int refCountInt = refCount.intValue();
        assert refCountInt > 0;
        refCountInt--;
        if (refCountInt == 0) {
            // The count dropped to 0: remove the entry from refCounts
            refCounts.remove(gen);
            indexCommits.remove(gen);
        } else {
            refCounts.put(gen, refCountInt);
        }
    }

Deleting commits

    @Override
    public synchronized void onCommit(List<? extends IndexCommit> commits) throws IOException {
        // Wrap every IndexCommit in a SnapshotCommitPoint, then let primary run its onCommit
        primary.onCommit(wrapCommits(commits));
        // Remember the newest commit
        lastCommit = commits.get(commits.size() - 1);
    }

    @Override
    public synchronized void onInit(List<? extends IndexCommit> commits) throws IOException {
        // Set the initialization flag
        initCalled = true;
        primary.onInit(wrapCommits(commits));
        for (IndexCommit commit : commits) {
            if (refCounts.containsKey(commit.getGeneration())) {
                indexCommits.put(commit.getGeneration(), commit);
            }
        }
        if (!commits.isEmpty()) {
            lastCommit = commits.get(commits.size() - 1);
        }
    }

    private List<IndexCommit> wrapCommits(List<? extends IndexCommit> commits) {
        List<IndexCommit> wrappedCommits = new ArrayList<>(commits.size());
        for (IndexCommit ic : commits) {
            // Wrap each IndexCommit in a SnapshotCommitPoint
            wrappedCommits.add(new SnapshotCommitPoint(ic));
        }
        return wrappedCommits;
    }

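Earlier we listed SnapshotCommitPoint as one of the IndexCommit implementations without going into detail. Besides exposing everything IndexCommit defines, its key behavior is in delete(): it first checks whether the wrapped IndexCommit is referenced by any snapshot, and only a commit with no snapshot references can be deleted:

    // cp is the wrapped IndexCommit
    @Override
    public void delete() {
        synchronized (SnapshotDeletionPolicy.this) {
            // Only forward the delete if no snapshot references this commit
            if (!refCounts.containsKey(cp.getGeneration())) {
                cp.delete();
            }
        }
    }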
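The interplay of snapshot(), release() and the guarded delete() can be modeled without Lucene. In the sketch below, SnapshotRefCounts is a hypothetical class that tracks generations as plain long values. It only illustrates the reference-counting rule and leaves out lastCommit, initCalled and the wrapped primary policy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the reference counting in SnapshotDeletionPolicy.
final class SnapshotRefCounts {
  // generation -> number of live snapshots; an entry always has a count >= 1
  private final Map<Long, Integer> refCounts = new HashMap<>();

  /** Registers one more snapshot of the given generation. */
  synchronized void snapshot(long gen) {
    refCounts.merge(gen, 1, Integer::sum);
  }

  /** Drops one snapshot; the entry disappears when the count reaches 0. */
  synchronized void release(long gen) {
    Integer count = refCounts.get(gen);
    if (count == null) {
      throw new IllegalArgumentException("commit gen=" + gen + " is not currently snapshotted");
    }
    if (count == 1) {
      refCounts.remove(gen);
    } else {
      refCounts.put(gen, count - 1);
    }
  }

  /** Mirrors the guard in delete(): deletion is allowed only without snapshots. */
  synchronized boolean canDelete(long gen) {
    return !refCounts.containsKey(gen);
  }
}
```

Because an entry is removed as soon as its count reaches 0, the delete guard only needs a containsKey check and never has to compare counts.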

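The limitation

Note that SnapshotDeletionPolicy does not persist its snapshot information. Let's reopen the index produced by the SnapshotDeletionPolicyTest example:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.SnapshotDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SnapshotDeletionPolicyTest2 {
        public static void main(String[] args) throws IOException {
            Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            indexWriterConfig.setUseCompoundFile(true);
            SnapshotDeletionPolicy snapshotDeletionPolicy =
                new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
            indexWriterConfig.setIndexDeletionPolicy(snapshotDeletionPolicy);
            // Reopen the index
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            indexWriter.close();
        }
    }

After this run, segments_1 has been deleted. Because the snapshot information was never persisted, the new policy instance knows nothing about the earlier snapshot, and KeepOnlyLastCommitDeletionPolicy keeps only the newest commit.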

PersistentSnapshotDeletionPolicy

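Example

PersistentSnapshotDeletionPolicy exists mainly to solve the problem that SnapshotDeletionPolicy cannot persist its state. When it persists, it writes a snapshots_N file into the index directory. Here is an example:

    import java.io.File;
    import java.io.IOException;
    import java.util.Random;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.PersistentSnapshotDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class PersistentSnapshotDeletionPolicyTest {
        private static final Random RANDOM = new Random();

        public static void main(String[] args) throws IOException {
            Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            indexWriterConfig.setUseCompoundFile(true);
            PersistentSnapshotDeletionPolicy persistentSnapshotDeletionPolicy =
                new PersistentSnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy(), directory);
            indexWriterConfig.setIndexDeletionPolicy(persistentSnapshotDeletionPolicy);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // First commit, writes segments_1
            indexWriter.commit();
            // Snapshot segments_1 so that it is retained unconditionally
            persistentSnapshotDeletionPolicy.snapshot();

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // Second commit, writes segments_2
            indexWriter.commit();

            indexWriter.addDocument(getDoc(RANDOM.nextInt(10000), RANDOM.nextInt(10000)));
            // Third commit, writes segments_3
            indexWriter.commit();

            indexWriter.close();
        }

        private static Document getDoc(int... point) {
            Document doc = new Document();
            IntPoint intPoint = new IntPoint("point", point);
            doc.add(intPoint);
            return doc;
        }
    }

This example follows the same logic as the SnapshotDeletionPolicy one, with SnapshotDeletionPolicy replaced by PersistentSnapshotDeletionPolicy.

After the run, segments_1 and segments_3 are retained as before, but there is one additional file, snapshots_0, which holds the persisted snapshot information. With this file in place, the snapshot information can be restored when the index is reopened, so segments_1 continues to be retained. Reopening the index with the following example confirms that segments_1 is still there:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.PersistentSnapshotDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class PersistentSnapshotDeletionPolicyTest2 {
        public static void main(String[] args) throws IOException {
            Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            indexWriterConfig.setUseCompoundFile(true);
            PersistentSnapshotDeletionPolicy persistentSnapshotDeletionPolicy =
                new PersistentSnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy(), directory);
            indexWriterConfig.setIndexDeletionPolicy(persistentSnapshotDeletionPolicy);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            indexWriter.close();
        }
    }

Next, let's look at the implementation of PersistentSnapshotDeletionPolicy, which mostly consists of the logic for persisting and restoring snapshot information.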

Member variables

    // The N in the snapshots_N file name used for the next write, starting at 0
    private long nextWriteGen;

    // The directory that holds the persisted snapshot files
    private final Directory dir;

Constructors

    public PersistentSnapshotDeletionPolicy(IndexDeletionPolicy primary, Directory dir) throws IOException {
        this(primary, dir, OpenMode.CREATE_OR_APPEND);
    }

    public PersistentSnapshotDeletionPolicy(IndexDeletionPolicy primary, Directory dir, OpenMode mode)
            throws IOException {
        super(primary);
        this.dir = dir;
        if (mode == OpenMode.CREATE) {
            // CREATE mode: clear all existing snapshot information
            // (open modes are covered in a later article)
            clearPriorSnapshots();
        }
        // Load the persisted snapshot information
        loadPriorSnapshots();
        if (mode == OpenMode.APPEND && nextWriteGen == 0) {
            throw new IllegalStateException("no snapshots stored in this directory");
        }
    }

Taking a snapshot

    @Override
    public synchronized IndexCommit snapshot() throws IOException {
        // Let SnapshotDeletionPolicy take the snapshot
        IndexCommit ic = super.snapshot();
        // Tracks whether persisting succeeded; if not, the snapshot must be undone
        boolean success = false;
        try {
            // Persist the latest snapshot information
            persist();
            success = true;
        } finally {
            if (!success) {
                // Persisting failed: undo the snapshot
                try {
                    super.release(ic);
                } catch (@SuppressWarnings("unused") Exception e) {
                    // Suppress so we keep throwing original exception
                }
            }
        }
        return ic;
    }

Releasing a snapshot

    @Override
    public synchronized void release(IndexCommit commit) throws IOException {
        // Let SnapshotDeletionPolicy release the snapshot
        super.release(commit);
        // Tracks whether persisting succeeded
        boolean success = false;
        try {
            // Persist the latest snapshot information
            persist();
            success = true;
        } finally {
            if (!success) {
                // Persisting failed: put the snapshot reference back
                try {
                    incRef(commit);
                } catch (@SuppressWarnings("unused") Exception e) {
                    // Suppress so we keep throwing original exception
                }
            }
        }
    }
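snapshot() and release() share one pattern: change the in-memory state first, then persist, and undo the change if persisting throws. Here is a minimal model of that pattern, assuming a hypothetical Persister callback in place of persist(). None of these names exist in Lucene.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "mutate in memory, persist, roll back on failure".
final class RollbackOnPersistFailure {
  /** Hypothetical stand-in for persist(): writes the whole state somewhere durable. */
  interface Persister {
    void persist(Map<Long, Integer> state) throws IOException;
  }

  private final Map<Long, Integer> refCounts = new HashMap<>();
  private final Persister persister;

  RollbackOnPersistFailure(Persister persister) {
    this.persister = persister;
  }

  synchronized void snapshot(long gen) throws IOException {
    // In-memory change first
    refCounts.merge(gen, 1, Integer::sum);
    boolean success = false;
    try {
      persister.persist(refCounts);
      success = true;
    } finally {
      if (!success) {
        // Persisting failed: undo the increment (dropping the entry at 0);
        // the original exception keeps propagating to the caller
        refCounts.computeIfPresent(gen, (g, c) -> c == 1 ? null : c - 1);
      }
    }
  }

  synchronized int refCount(long gen) {
    return refCounts.getOrDefault(gen, 0);
  }
}
```

The success flag combined with finally, rather than a catch block, means the rollback runs for any kind of failure while the caller still sees the original exception unchanged.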

Persisting snapshot information

    private synchronized void persist() throws IOException {
        // Name of the snapshot file
        String fileName = SNAPSHOTS_PREFIX + nextWriteGen;
        boolean success = false;
        try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
            CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT);
            out.writeVInt(refCounts.size());
            for (Entry<Long, Integer> ent : refCounts.entrySet()) {
                // Persist every reference count entry
                out.writeVLong(ent.getKey());
                out.writeVInt(ent.getValue());
            }
            success = true;
        } finally {
            if (!success) {
                IOUtils.deleteFilesIgnoringExceptions(dir, fileName);
            }
        }

        dir.sync(Collections.singletonList(fileName));

        if (nextWriteGen > 0) {
            String lastSaveFile = SNAPSHOTS_PREFIX + (nextWriteGen - 1);
            // Delete the previous snapshot file. Every persist writes the complete
            // snapshot state, so only the newest file needs to be kept.
            // The delete may fail, which is why loading on startup tries again
            // to remove all older files
            IOUtils.deleteFilesIgnoringExceptions(dir, lastSaveFile);
        }

        nextWriteGen++;
    }
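The same write-new-then-delete-old scheme can be tried out with nothing but the JDK. The sketch below is an illustration under simplifying assumptions: SnapshotFileWriter is a hypothetical class, it uses DataOutputStream with fixed-width integers instead of Lucene's IndexOutput, codec header and variable-length encoding, and it skips the fsync step that dir.sync performs.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical JDK-only model of persist(): full state per file, one file kept.
final class SnapshotFileWriter {
  static final String PREFIX = "snapshots_";

  /**
   * Writes the complete refCounts map as generation nextWriteGen, removes the
   * previous generation's file, and returns the generation to use next time.
   */
  static long persist(Path dir, long nextWriteGen, Map<Long, Integer> refCounts)
      throws IOException {
    Path file = dir.resolve(PREFIX + nextWriteGen);
    boolean success = false;
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
      out.writeInt(refCounts.size());
      for (Map.Entry<Long, Integer> e : refCounts.entrySet()) {
        out.writeLong(e.getKey());
        out.writeInt(e.getValue());
      }
      success = true;
    } finally {
      if (!success) {
        // Do not leave a half-written file behind
        deleteQuietly(file);
      }
    }
    if (nextWriteGen > 0) {
      // The new file holds the full state, so the previous one is redundant
      deleteQuietly(dir.resolve(PREFIX + (nextWriteGen - 1)));
    }
    return nextWriteGen + 1;
  }

  private static void deleteQuietly(Path p) {
    try {
      Files.deleteIfExists(p);
    } catch (IOException ignored) {
      // A leftover file is harmless; the loader can clean it up later
    }
  }
}
```

Writing the whole map every time costs a little extra I/O, but it means recovery never has to replay a history of changes: the newest readable file is the complete state.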

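Loading snapshot information

    private synchronized void loadPriorSnapshots() throws IOException {
        long genLoaded = -1;
        IOException ioe = null;
        List<String> snapshotFiles = new ArrayList<>();
        for (String file : dir.listAll()) {
            if (file.startsWith(SNAPSHOTS_PREFIX)) {
                // Found a snapshot file
                long gen = Long.parseLong(file.substring(SNAPSHOTS_PREFIX.length()));
                if (genLoaded == -1 || gen > genLoaded) {
                    // Track the snapshot file with the largest gen
                    snapshotFiles.add(file);
                    Map<Long, Integer> m = new HashMap<>();
                    IndexInput in = dir.openInput(file, IOContext.DEFAULT);
                    try {
                        CodecUtil.checkHeader(in, CODEC_NAME, VERSION_START, VERSION_START);
                        int count = in.readVInt();
                        for (int i = 0; i < count; i++) {
                            long commitGen = in.readVLong();
                            int refCount = in.readVInt();
                            m.put(commitGen, refCount);
                        }
                    } catch (IOException ioe2) {
                        // Remember the first exception and throw it at the end
                        if (ioe == null) {
                            ioe = ioe2;
                        }
                    } finally {
                        in.close();
                    }

                    genLoaded = gen;
                    refCounts.clear();
                    refCounts.putAll(m);
                }
            }
        }

        if (genLoaded == -1) {
            // Nothing was loaded
            if (ioe != null) {
                throw ioe;
            }
        } else {
            if (snapshotFiles.size() > 1) {
                // Remove any old or broken snapshot files
                String curFileName = SNAPSHOTS_PREFIX + genLoaded;
                for (String file : snapshotFiles) {
                    if (!curFileName.equals(file)) {
                        IOUtils.deleteFilesIgnoringExceptions(dir, file);
                    }
                }
            }
            nextWriteGen = 1 + genLoaded;
        }
    }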
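The selection logic at the heart of loading (pick the snapshots_N file with the largest N, treat the others as stale, continue writing at N + 1) can be isolated as pure functions over a directory listing. SnapshotFileSelector below is a hypothetical helper for illustration only. It works on file names and leaves out reading the file contents and handling corrupt files.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper: decides which snapshots_N file is current.
final class SnapshotFileSelector {
  static final String PREFIX = "snapshots_";

  /** Returns the largest generation among snapshots_N names, or -1 if there is none. */
  static long latestGeneration(Collection<String> fileNames) {
    long latest = -1;
    for (String name : fileNames) {
      if (name.startsWith(PREFIX)) {
        long gen = Long.parseLong(name.substring(PREFIX.length()));
        latest = Math.max(latest, gen);
      }
    }
    return latest;
  }

  /** Returns every snapshots_N name that is not the newest one. */
  static List<String> staleFiles(Collection<String> fileNames) {
    String current = PREFIX + latestGeneration(fileNames);
    List<String> stale = new ArrayList<>();
    for (String name : fileNames) {
      if (name.startsWith(PREFIX) && !name.equals(current)) {
        stale.add(name);
      }
    }
    return stale;
  }

  /** The generation the next persist should write: 0 for a fresh directory. */
  static long nextWriteGen(Collection<String> fileNames) {
    return latestGeneration(fileNames) + 1;
  }
}
```

A directory without any snapshot file yields a next write generation of 0, which is the condition the constructor checks to reject OpenMode.APPEND.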

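Summary

The deletion policies covered in this article operate at the granularity of an IndexCommit. How deletion is controlled for each individual index file is the topic of the next article.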
