2007-10-28

HTML::Feature::Engine::TsubuanLike

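Tsubuan is dead, which left me stuck, and then HTML::Feature came along and I thought "this is a win!" But I still want the result back as an HTML::Element, so I implemented an Engine with a Tsubuan-like algorithm. For the tag/text ratio algorithm that Tsubuan is based on, see the post "ブログの記事本文を抽出するスクリプトをつくってみた" ("I tried writing a script that extracts the body text of blog posts"). Because I want to return an Element as the result, it differs a little, but the output is roughly the same as Tsubuan's.
As for usage, $result->{element} gives you the HTML::Element, and its as_HTML output is also stored in $result->{html}. That algorithm can return an empty string, but this module returns the HTML::TreeBuilder root when nothing could be extracted. Incidentally, in a quick trial H::F::Engine::TagStructure seemed to give better results, so it would be nice to have something that uses that algorithm and returns an HTML::Element too.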
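
The selection rule described above can be sketched without HTML::TreeBuilder. This is only an illustration: the candidate hashes, their names and the helpers tag_text_ratio and pick_body are hypothetical stand-ins for HTML::Element nodes, while the formula and the 0 to 0.1 window are taken from the module listed further down.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for HTML::Element nodes: each candidate records
# how many descendant elements it has and the text it contains.
my @candidates = (
    { name => 'nav',     descendants => 12, text => 'Home About Archive' },
    { name => 'article', descendants => 3,  text => 'x' x 400 },
    { name => 'footer',  descendants => 1,  text => 'y' x 30 },
    { name => 'empty',   descendants => 0,  text => '' },
);

# Tag/text ratio: descendant count times 2, divided by the text length
# plus 1 (the +1 avoids division by zero for empty text).
sub tag_text_ratio {
    my ($node) = @_;
    return ( $node->{descendants} * 2 ) / ( length( $node->{text} ) + 1 );
}

# Keep nodes whose ratio lies strictly between 0 and 0.1, then pick the
# one with the longest text. Returns undef when nothing qualifies.
sub pick_body {
    my @kept = grep {
        my $ratio = tag_text_ratio($_);
        $ratio > 0 && $ratio < 0.1;
    } @_;
    my ($best) = sort { length( $b->{text} ) <=> length( $a->{text} ) } @kept;
    return $best;
}

my $best = pick_body(@candidates);
print( ( $best ? $best->{name} : 'root (fallback)' ), "\n" );
```

A node with no descendant elements scores exactly 0 and is filtered out, which is why the module needs a fallback to the root when no candidate survives.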

USAGE

use HTML::Feature;

my $feature = HTML::Feature->new(
    engine => 'TsubuanLike'
);
my $result = $feature->parse(shift);

unless ($result->{success}) {
    $result->{element} # root
}

$result->{element} # HTML::Element
$result->{html}    # HTML strings
$result->{text}    # default attributes: 'text', 'description' and 'title'

HTML/Feature/Engine/TsubuanLike.pm

package HTML::Feature::Engine::TsubuanLike;
use strict;
use warnings;
use base qw(HTML::Feature::Engine);
use HTML::TreeBuilder;

sub run {
    my $self = shift;
    my $c = shift;
    $self->_tag_cleaning($c);
    $self->_score($c);
    return $self;
} 

# this method is from HTML::Feature::Engine::TagStructure
sub _tag_cleaning {
    my $self = shift;
    my $c = shift;
    return unless $c->{html};
    # preprocessing
    $c->{html} =~ s{<!-.*?->}{}xmsg;
    $c->{html} =~ s{<script[^>]*>.*?<\/script>}{}xmgs;
    $c->{html} =~ s{&nbsp;}{ }xmg;
    $c->{html} =~ s{&quot;}{\'}xmg;
    $c->{html} =~ s{\r\n}{\n}xmg;
    $c->{html} =~ s{^\s*(.+)$}{$1}xmg;
    $c->{html} =~ s{^\t*(.+)$}{$1}xmg;
    # control code ( 0x00 - 0x1F, and 0x7F on ascii)
    for ( 0 .. 31 ) {
        my $control_code = '\x' . sprintf( "%x", $_ );
        $c->{html} =~ s{$control_code}{}xmg;
    }
    $c->{html} =~ s{\x7f}{}xmg;
}

sub _score {
    my $self = shift;
    my $c = shift;
    my $root = HTML::TreeBuilder->new;
    $root->parse( $c->{html} );
    $root->eof;    # signal end of input so the tree is finalized
    
    if (my $title = $root->find("title")) {
        $self->{title} = $title->as_text;
    }

    if (my $desc = $root->look_down(
        _tag => 'meta',
        name => 'description'
    )) {
        my $string = $desc->attr('content');
        # the meta tag may lack a content attribute
        if ( defined $string ) {
            $string =~ s{<br>}{}xmsg;
            $self->{desc} = $string;
        }
    }

    my @tsubuan_score = grep {
        ($self->_tag_text_frac($_) > 0)
        && ($self->_tag_text_frac($_) < 0.1)
    } $root->descendants;
    
    my $target;
    if (@tsubuan_score) {
        @tsubuan_score = sort {
            length($b->as_text) <=> length($a->as_text)
        } @tsubuan_score;
        $self->{success} = 1;
        $target = $tsubuan_score[0];
    }
    else {
        $self->{success} = 0;
        $target = $root;
    }
    
    $self->{html} = $target->as_HTML;
    $self->{text} = $target->as_text;
    $self->{element} = $target;
    delete $self->{tag_text_frac};
    
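    if ( $c->{enc_type} ) {
        require Encode;
        # encode each extracted field in place
        for my $key (qw/title desc text html/) {
            next unless defined $self->{$key};
            $self->{$key} = Encode::encode( $c->{enc_type}, $self->{$key} );
        }
    }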
}

sub _tag_text_frac {
    my ($self, $elem) = @_;
    
    unless ($self->{tag_text_frac}) {
        $self->{tag_text_frac} = {};
    }
    
    unless (defined($self->{tag_text_frac}->{$elem->idf})) {
        my $text = $elem->as_text;
        my @objs = $elem->descendants;
        
        $self->{tag_text_frac}->{$elem->idf} =
            (@objs * 2)
            / (length($text) + 1);
    }
    
    return $self->{tag_text_frac}->{$elem->idf};
}

1;
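
The control-code loop in _tag_cleaning runs one substitution per character. As a rough sketch of an alternative (not what HTML::Feature itself does), the same 0x00-0x1F and 0x7F range can be deleted in a single pass with tr; strip_control_chars is a hypothetical helper name. Note that this range includes tab and newline, so those are removed too.

```perl
use strict;
use warnings;

# Delete ASCII control characters (0x00-0x1F and 0x7F) in one pass.
# The range covers tab (0x09) and newline (0x0A) as well.
sub strip_control_chars {
    my ($html) = @_;
    $html =~ tr/\x00-\x1F\x7F//d;
    return $html;
}

print strip_control_chars("a\x00b\tc\n\x7Fd"), "\n";
```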

2007-10-20

A system for converting Livedoor Reader pins into a feed and reading them in LDR again

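I wrote a tool that lets me stick more than 100 pins in LDR, emit them as an Atom feed, save them with the /pin/clear command, and reread them at leisure in LDR, so I'm posting the source. To hijack the API it uses a slightly modified version of "Hack LDR API", the GM user script used by PlaggerLDR, so it probably won't work with PlaggerLDR as is. In fact it probably won't work as is in most environments, and I paid no attention at all to security or performance, so please treat it as a reference showing that there are weirdos who go this far to stick a huge number of pins.
Replaced the Greasemonkey script with a version that works on GM 0.8 + Fx3. Everything else is unchanged.
Put it on github.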

Source

http://github.com/fuba/ldr-enhanced-pin/tree/master

Sample Atom

http://fuba.moaningnerds.org/pin_atom/

2007-10-01

Reading the redesigned "latest My Mixi diaries" page with nothing but EntryFullText

Plagger

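A Cookie that is already logged in to mixi is required, so I don't think many people have an environment where this works. The redesign has drawn plenty of complaints, mainly from mixi addicts, but the fact that something like this can now be written so quickly is a very good thing.
Added author, and changed title to select the text node as well.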

assets/plugins/Filter-EntryFullText/mixi_diary_20071001.yaml

author: fuba
custom_feed_handle: http://mixi\.jp/new_friend_diary\.pl
custom_feed_follow_link: view_diary\.pl\?id=\d+\&owner_id=\d+
handle: http://mixi\.jp/view_diary\.pl\?id=\d+\&owner_id=\d+
extract_xpath:
  author: //div[@id="bodyMainArea"]//h2/text()
  title: //dl[@class='clearfix']/dt/text()
  body: //div[@id='diary_body']
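
The handle and custom_feed_follow_link values are regular expressions, so they can be checked against a sample URL before running Plagger. A minimal sketch, assuming a plain unanchored match (how Plagger applies the patterns internally is not shown here); the URL and its id values are made up, and matches is a hypothetical helper.

```perl
use strict;
use warnings;

# Patterns copied from the YAML above.
my $handle      = qr{http://mixi\.jp/view_diary\.pl\?id=\d+\&owner_id=\d+};
my $follow_link = qr{view_diary\.pl\?id=\d+\&owner_id=\d+};

# Hypothetical diary URL; the numeric ids are invented for illustration.
my $url = 'http://mixi.jp/view_diary.pl?id=123&owner_id=456';

# Returns 1 when the string matches the pattern, 0 otherwise.
sub matches {
    my ( $re, $str ) = @_;
    return $str =~ $re ? 1 : 0;
}

print "handle:      ", matches( $handle,      $url ), "\n";
print "follow_link: ", matches( $follow_link, $url ), "\n";
```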

config.yaml

plugins:
  - module: Subscription::Config
    config:
      feed:
        - url: http://mixi.jp/new_friend_diary.pl
  - module: Filter::EntryFullText
  - module: Publish::Gmail

2007-09-26

Turning the Tumblr Friends list into OPML from the Dashboard with Plagger

A buzz-ish recipe that probably fewer than 10 people will find useful. Once the OPML is generated, throw it into Fastladder.
Fixed to work with Tumblr v4.

global:
  user_agent:
    cookies: /Users/ec/Library/Cookies/Cookies.plist

plugins:
  - module: Subscription::XPath
    config:
      url: http://www.tumblr.com/following
      xpath: //a[@class="username"]

  - module: Publish::OPML
    config:
      filename: /Users/ec/Desktop/tumblr.xml

2007-09-21

Rewrote the Toranoana mail-order new-arrivals check as a script for Plagger::Plugin::CustomFeed::Script

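This is my first time using Web::Scraper, so the code feels pretty odd. There isn't much documentation on things like character encoding handling, so I guess that means "read the source". Can you convert the encoding while it's still HTML and then feed it to the scraper? Also, maybe processing like is_adult shouldn't be forced into the scraper... how much should be done in the scraper and how much in post-processing? Anyway, it now works in a way that is roughly compatible with PPC::ToranoanaMailorder, so I'm posting it and going to bed. The arguments are the date and the time zone; when omitted, it's today in Asia/Tokyo. Feedback welcome.
Made several fixes following miyagawa's corrections. Thank you, including for handling Toranoana's silly headers. What's left is probably the DateTime stuff.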
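
The argument handling described above (date, then time zone, defaulting to today in Asia/Tokyo) might look roughly like this. It is a sketch, not the code in the linked script: parse_args is a hypothetical helper, the MMDD date format is an assumption based on the "0920" in the config below, and core POSIX is used here instead of DateTime.

```perl
use strict;
use warnings;
use POSIX qw(strftime tzset);

# Hypothetical helper: returns (date, time zone) from the argument list.
# Assumes an MMDD date string and an Olson zone name such as Asia/Tokyo.
sub parse_args {
    my ( $date, $tz ) = @_;
    $tz = 'Asia/Tokyo' unless defined $tz && length $tz;
    unless ( defined $date && length $date ) {
        {
            # Compute "today" in the requested zone.
            local $ENV{TZ} = $tz;
            tzset();
            $date = strftime( '%m%d', localtime );
        }
        tzset();    # switch back to the process's original zone
    }
    return ( $date, $tz );
}

print join( ' ', parse_args(@ARGV) ), "\n";
```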

assets/plugins/CustomFeed-Script/toranoanamailorder.pl

http://fuba.moaningnerds.org/src/toranoanamailorder.pl

config.yaml

plugins:      
  - module: Subscription::Config
    config:
      feed:        
        - script:lib/Plagger/assets/plugins/CustomFeed-Script/toranoanamailorder.pl 0920
  - module: CustomFeed::Script
  
  - module: Aggregator::Simple
  
  - module: Filter::Rule
    rule:                                         
      expression: '!$args->{entry}->{meta}->{is_yaoi}'
                                            
  - module: Publish::Gmail

2007-09-21

Plagger::Plugin::CustomFeed::ToranoanaMailorder: publication suspended for now

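So I made PPC::ToranoanaMailorder, which roughly splits out only the mail-order part. From now on I'll maintain only this one. If the shop arrival information stops working because of a major system overhaul or the like, I won't follow up, so users, please bear that in mind.
otsune's comment reminded me of CustomFeed::Script, so I'll rewrite it to target that. For now I'm suspending publication as PPC::ToranoanaMailorder.
Rewrote it.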

2007-09-21

The history of Plagger::Plugin::Subscription::Toranoana

History? What history.

Original http://d.hatena.ne.jp/fuba/20060804/1154680334
- With large sample images http://plagger.g.hatena.ne.jp/SweetPotato/20070920/toranoana
Mail-order support http://d.hatena.ne.jp/fuba/20060909/1157765169
Metadata stored in meta, adult (18+) tags added, and so on http://d.hatena.ne.jp/fuba/20061026/1161806823

To avoid confusion, I've prepared an all-in-one version that supports large sample images, mail order, and adult / for-women flags in meta.
http://fuba.moaningnerds.org/src/Toranoana.pm
Actually, I personally no longer use the shop arrival information at all, so I'd like to split off just the maintenance of that part. And so