对于字典加速,有一个通用的转换/过程可以大大减少提前的字典比较。
鉴于上面的网格只包含16个字符,其中一些字符是重复的,您可以通过简单地过滤掉具有不可获取字符的条目来大大减少字典中的总键数。
我认为这是明显的优化,但看到没有人这么做,我就提出来了。
在输入过程中,它将我的字典从20万个键减少到只有2000个键。这至少减少了内存开销,并且这肯定会映射到某个地方的速度增加,因为内存不是无限快的。
Perl实现
我的实现有点头重脚轻,因为我重视能够知道每个提取的字符串的确切路径,而不仅仅是其中的有效性。
我也有一些适应在那里,理论上允许一个网格中有洞的功能,网格有不同大小的线(假设你得到了正确的输入,它以某种方式对齐)。
早期筛选器是我的应用程序中最重要的瓶颈,正如之前怀疑的那样,注释掉了一行从1.5s膨胀到7.5s的代码。
在执行时,它似乎认为所有的个位数都在他们自己的有效单词上,但我很确定这是由于字典文件的工作方式。
它有点臃肿,但至少我重用了cpan中的Tree::Trie
其中有些部分是受到现有实现的启发,有些是我已经想到的。
建设性的批评和改进的方法欢迎(/我注意到他从来没有在CPAN上搜索过一个拼字游戏解决器,但这更有趣)
新标准更新
#!/usr/bin/perl
use strict;
use warnings;
{
# this package manages a given path through the grid.
# Its an array of matrix-nodes in-order with
# Convenience functions for pretty-printing the paths
# and for extending paths as new paths.
# Usage:
# my $p = Prefix->new(path=>[ $startnode ]);
# my $c = $p->child( $extensionNode );
# print $c->current_word ;
package Prefix;
use Moose;
has path => (
isa => 'ArrayRef[MatrixNode]',
is => 'rw',
default => sub { [] },
);
has current_word => (
isa => 'Str',
is => 'rw',
lazy_build => 1,
);
# Create a clone of this object
# with a longer path
# $o->child( $successive-node-on-graph );
sub child {
my $self = shift;
my $newNode = shift;
my $f = Prefix->new();
# Have to do this manually or other recorded paths get modified
push @{ $f->{path} }, @{ $self->{path} }, $newNode;
return $f;
}
# Traverses $o->path left-to-right to get the string it represents.
sub _build_current_word {
my $self = shift;
return join q{}, map { $_->{value} } @{ $self->{path} };
}
# Returns the rightmost node on this path
sub tail {
my $self = shift;
return $self->{path}->[-1];
}
# pretty-format $o->path
sub pp_path {
my $self = shift;
my @path =
map { '[' . $_->{x_position} . ',' . $_->{y_position} . ']' }
@{ $self->{path} };
return "[" . join( ",", @path ) . "]";
}
# pretty-format $o
sub pp {
my $self = shift;
return $self->current_word . ' => ' . $self->pp_path;
}
__PACKAGE__->meta->make_immutable;
}
{
# Basic package for tracking node data
# without having to look on the grid.
# I could have just used an array or a hash, but that got ugly.
# Once the matrix is up and running it doesn't really care so much about rows/columns,
# Its just a sea of points and each point has adjacent points.
# Relative positioning is only really useful to map it back to userspace
package MatrixNode;
use Moose;
has x_position => ( isa => 'Int', is => 'rw', required => 1 );
has y_position => ( isa => 'Int', is => 'rw', required => 1 );
has value => ( isa => 'Str', is => 'rw', required => 1 );
has siblings => (
isa => 'ArrayRef[MatrixNode]',
is => 'rw',
default => sub { [] }
);
# Its not implicitly uni-directional joins. It would be more effient in therory
# to make the link go both ways at the same time, but thats too hard to program around.
# and besides, this isn't slow enough to bother caring about.
sub add_sibling {
my $self = shift;
my $sibling = shift;
push @{ $self->siblings }, $sibling;
}
# Convenience method to derive a path starting at this node
sub to_path {
my $self = shift;
return Prefix->new( path => [$self] );
}
__PACKAGE__->meta->make_immutable;
}
{
package Matrix;
use Moose;
has rows => (
isa => 'ArrayRef',
is => 'rw',
default => sub { [] },
);
has regex => (
isa => 'Regexp',
is => 'rw',
lazy_build => 1,
);
has cells => (
isa => 'ArrayRef',
is => 'rw',
lazy_build => 1,
);
sub add_row {
my $self = shift;
push @{ $self->rows }, [@_];
}
# Most of these functions from here down are just builder functions,
# or utilities to help build things.
# Some just broken out to make it easier for me to process.
# All thats really useful is add_row
# The rest will generally be computed, stored, and ready to go
# from ->cells by the time either ->cells or ->regex are called.
# traverse all cells and make a regex that covers them.
sub _build_regex {
my $self = shift;
my $chars = q{};
for my $cell ( @{ $self->cells } ) {
$chars .= $cell->value();
}
$chars = "[^$chars]";
return qr/$chars/i;
}
# convert a plain cell ( ie: [x][y] = 0 )
# to an intelligent cell ie: [x][y] = object( x, y )
# we only really keep them in this format temporarily
# so we can go through and tie in neighbouring information.
# after the neigbouring is done, the grid should be considered inoperative.
sub _convert {
my $self = shift;
my $x = shift;
my $y = shift;
my $v = $self->_read( $x, $y );
my $n = MatrixNode->new(
x_position => $x,
y_position => $y,
value => $v,
);
$self->_write( $x, $y, $n );
return $n;
}
# go through the rows/collums presently available and freeze them into objects.
sub _build_cells {
my $self = shift;
my @out = ();
my @rows = @{ $self->{rows} };
for my $x ( 0 .. $#rows ) {
next unless defined $self->{rows}->[$x];
my @col = @{ $self->{rows}->[$x] };
for my $y ( 0 .. $#col ) {
next unless defined $self->{rows}->[$x]->[$y];
push @out, $self->_convert( $x, $y );
}
}
for my $c (@out) {
for my $n ( $self->_neighbours( $c->x_position, $c->y_position ) ) {
$c->add_sibling( $self->{rows}->[ $n->[0] ]->[ $n->[1] ] );
}
}
return \@out;
}
# given x,y , return array of points that refer to valid neighbours.
sub _neighbours {
my $self = shift;
my $x = shift;
my $y = shift;
my @out = ();
for my $sx ( -1, 0, 1 ) {
next if $sx + $x < 0;
next if not defined $self->{rows}->[ $sx + $x ];
for my $sy ( -1, 0, 1 ) {
next if $sx == 0 && $sy == 0;
next if $sy + $y < 0;
next if not defined $self->{rows}->[ $sx + $x ]->[ $sy + $y ];
push @out, [ $sx + $x, $sy + $y ];
}
}
return @out;
}
sub _has_row {
my $self = shift;
my $x = shift;
return defined $self->{rows}->[$x];
}
sub _has_cell {
my $self = shift;
my $x = shift;
my $y = shift;
return defined $self->{rows}->[$x]->[$y];
}
sub _read {
my $self = shift;
my $x = shift;
my $y = shift;
return $self->{rows}->[$x]->[$y];
}
sub _write {
my $self = shift;
my $x = shift;
my $y = shift;
my $v = shift;
$self->{rows}->[$x]->[$y] = $v;
return $v;
}
__PACKAGE__->meta->make_immutable;
}
use Tree::Trie;
sub readDict {
my $fn = shift;
my $re = shift;
my $d = Tree::Trie->new();
# Dictionary Loading
open my $fh, '<', $fn;
while ( my $line = <$fh> ) {
chomp($line);
# Commenting the next line makes it go from 1.5 seconds to 7.5 seconds. EPIC.
next if $line =~ $re; # Early Filter
$d->add( uc($line) );
}
return $d;
}
sub traverseGraph {
my $d = shift;
my $m = shift;
my $min = shift;
my $max = shift;
my @words = ();
# Inject all grid nodes into the processing queue.
my @queue =
grep { $d->lookup( $_->current_word ) }
map { $_->to_path } @{ $m->cells };
while (@queue) {
my $item = shift @queue;
# put the dictionary into "exact match" mode.
$d->deepsearch('exact');
my $cword = $item->current_word;
my $l = length($cword);
if ( $l >= $min && $d->lookup($cword) ) {
push @words,
$item; # push current path into "words" if it exactly matches.
}
next if $l > $max;
# put the dictionary into "is-a-prefix" mode.
$d->deepsearch('boolean');
siblingloop: foreach my $sibling ( @{ $item->tail->siblings } ) {
foreach my $visited ( @{ $item->{path} } ) {
next siblingloop if $sibling == $visited;
}
# given path y , iterate for all its end points
my $subpath = $item->child($sibling);
# create a new path for each end-point
if ( $d->lookup( $subpath->current_word ) ) {
# if the new path is a prefix, add it to the bottom of the queue.
push @queue, $subpath;
}
}
}
return \@words;
}
sub setup_predetermined {
my $m = shift;
my $gameNo = shift;
if( $gameNo == 0 ){
$m->add_row(qw( F X I E ));
$m->add_row(qw( A M L O ));
$m->add_row(qw( E W B X ));
$m->add_row(qw( A S T U ));
return $m;
}
if( $gameNo == 1 ){
$m->add_row(qw( D G H I ));
$m->add_row(qw( K L P S ));
$m->add_row(qw( Y E U T ));
$m->add_row(qw( E O R N ));
return $m;
}
}
sub setup_random {
my $m = shift;
my $seed = shift;
srand $seed;
my @letters = 'A' .. 'Z' ;
for( 1 .. 4 ){
my @r = ();
for( 1 .. 4 ){
push @r , $letters[int(rand(25))];
}
$m->add_row( @r );
}
}
# Here is where the real work starts.
my $m = Matrix->new();
setup_predetermined( $m, 0 );
#setup_random( $m, 5 );
my $d = readDict( 'dict.txt', $m->regex );
my $c = scalar @{ $m->cells }; # get the max, as per spec
print join ",\n", map { $_->pp } @{
traverseGraph( $d, $m, 3, $c ) ;
};
Arch/执行信息进行比较:
model name : Intel(R) Core(TM)2 Duo CPU T9300 @ 2.50GHz
cache size : 6144 KB
Memory usage summary: heap total: 77057577, heap peak: 11446200, stack peak: 26448
total calls total memory failed calls
malloc| 947212 68763684 0
realloc| 11191 1045641 0 (nomove:9063, dec:4731, free:0)
calloc| 121001 7248252 0
free| 973159 65854762
Histogram for block sizes:
0-15 392633 36% ==================================================
16-31 43530 4% =====
32-47 50048 4% ======
48-63 70701 6% =========
64-79 18831 1% ==
80-95 19271 1% ==
96-111 238398 22% ==============================
112-127 3007 <1%
128-143 236727 21% ==============================
关于正则表达式优化的更多嘟囔
我使用的正则表达式优化对于多解字典是无用的,而对于多解字典,您将需要一个完整的字典,而不是一个预先修整过的字典。
然而,也就是说,对于一次性解决,它真的很快。(Perl正则表达式是在C!:))
以下是一些不同的代码添加:
sub readDict_nofilter {
my $fn = shift;
my $re = shift;
my $d = Tree::Trie->new();
# Dictionary Loading
open my $fh, '<', $fn;
while ( my $line = <$fh> ) {
chomp($line);
$d->add( uc($line) );
}
return $d;
}
sub benchmark_io {
use Benchmark qw( cmpthese :hireswallclock );
# generate a random 16 character string
# to simulate there being an input grid.
my $regexen = sub {
my @letters = 'A' .. 'Z' ;
my @lo = ();
for( 1..16 ){
push @lo , $_ ;
}
my $c = join '', @lo;
$c = "[^$c]";
return qr/$c/i;
};
cmpthese( 200 , {
filtered => sub {
readDict('dict.txt', $regexen->() );
},
unfiltered => sub {
readDict_nofilter('dict.txt');
}
});
}
s/iter unfiltered filtered
unfiltered 8.16 -- -94%
filtered 0.464 1658% --
Ps: 8.16 * 200 = 27分钟。