ブルームフィルタ

お花フィルタ？ではなく、ブルームという人が発明したアルゴリズムらしい

ブルームフィルタ（bloom filter）とは

ある要素が集合に含まれているかどうか高速に判定できる
空間効率がいい

ただし

要素そのものを保持してるわけではないので、要素は別に保存する必要あり
偽陽性（ないのにあると誤判定）が発生する
- ただし偽陰性（あるのにないと誤判定）は発生しない。つまり、ないと判定されたものは確実にない

というもの

例えば、あるネトゲにキリトという名前のユーザがいるかどうか調べたいという場合、あらかじめブルームフィルタに全ユーザ名を突っ込んでおけば、高速に判定できるってワケ

ただ偽陽性が発生するのでキリトという名前のユーザがいないのにいると判定されてしまうことがある。例えば、同じ名前のユーザが他にいたら登録不可という仕様で、同名のユーザが「いない」と判定された場合はそのまま登録して良いが、「いる」と判定された場合はユーザテーブルをチェックして本当にいるかどうか確認しなくてはならない。

まあ、そのような用途では最初からDBをチェックするだろうからブルームフィルタを使うまでもないが、情報の取得にDBではなく外部APIをコールする必要があり、かつコール数で課金されるAPIだった場合、ないと判定された場合に無駄なAPI呼び出しを回避するために利用できるかもしれない

それでは実装してみる。アルゴリズムの詳細はWikipediaでも見てください。本気で使うなら何か良い感じのライブラリを探しましょう

用意するもの

コードはNode.jsです。

長さmのビット配列
- タダの配列で代用
k個のハッシュ関数（生成されるハッシュ値は0以上、m未満の整数になるようにする）
- md5の先頭から1バイトずつ取り出し、複数のハッシュ関数の結果とみなす
- なので今回はmは256未満、kは16以下という制限付き

// bloomfilter.js
'use strict';
const crypto = require('crypto');

class BloomFilter {
    constructor(k, m) {
        this.k = k;
        this.m = m;
        this.bits = new Array(m).fill(0);
    }

    hash(data) {
        const buf = Buffer.from(data, 'utf8');
        const hash = crypto.createHash('md5').update(buf).digest();
        return Array.from(hash.slice(0, this.k)).map(e => e % this.m);
    }

    add(data) {
        this.hash(data).forEach((e) => this.bits[e] = 1);
    }

    has(data) {
        return this.hash(data).every((e) => this.bits[e] === 1);
    }
}

exports.BloomFilter = BloomFilter;

// index.js
'use strict';
const k = 4;
const m = 256;

const bf = new (require('./bloomfilter').BloomFilter)(k, m);
const st = new Set();

const readline = require('readline');
const rl = readline.createInterface({
    input: process.stdin
});

rl.on('line', (line) => {
    line.replace(/[^a-zA-Z0-9 ]/g, '').split(' ').map(word => word.trim()).forEach((word) => {
        bf.add(word);
        st.add(word);
    });
});

const tests = process.argv.slice(2);
rl.on('close', () => {
    for (const test of tests) {
        if (bf.has(test)) {
            console.log(`${test} ある`, !st.has(test) ? '(偽陽性)': '');
        } else {
            console.log(`${test} ない`);
        }
    }
});

ブルームフィルタに突っ込む単語をファイルに保存。Wikipedia英語版のBloomFilterの1段落目です

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not –in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.'

実行。標準入力がブルームフィルタに登録する単語、引数が存在チェックする単語

$ <words.txt node index.js hoge fuga data nyan this true false right left
hoge ない
fuga ない
data ある
nyan ない
this ある
true ない
false ある
right ある (偽陽性)
left ない

rightなんて単語は含まれていないのにあると判定された。これが偽陽性というヤツですな。