How to Back Up Images from Hinet Xuite

There is no existing software that can do this for you.
To discourage users from continuously downloading the image files behind a blog, some ISPs do not show the real link to the image file directly on the Web page.

To work around this, I need to analyze the content of the Web page and look for a pattern that reveals where the real image file lives.

In the case of Hinet Xuite, the image links on each photo album's index page follow a pattern like this:

http://3.share.photo.xuite.net/account/3193424/121088812_c.jpg........

where
         3193424   is the album's index
         121088812 is the image file's index
         *_c.jpg   is the real link of the thumbnail image file
         *_l.jpg   is the real link of the 640x480 image file
         *_x.jpg   is the real link of the 1024x768 image file

Therefore the page contains enough information for us to reconstruct the real link of each image file.
I'll use Perl to extract the links, since there is an existing module,
HTML::LinkExtractor, that does exactly this.

Preparing Perl


The easiest way to install Perl modules on Linux is with the CPAN module. Log in to your server over SSH or Telnet as root, then type:

perl -MCPAN -e shell;

If this is the first time you have used this method, you will need to answer a few configuration questions. Answer them carefully (usually just accept what it suggests). If it suggests 'yes', type 'yes', not 'y' or just Return. If it offers automatic configuration, go for it. From then on, all you have to do to install a Perl module is type:

 cpan> install HTML::LinkExtractor



OK, let's do it now.

Implementation 


1. The first step is to browse the photo album and save the source of that Web page.
The source should contain the pattern mentioned above; extract the lines containing it:


grep account  album_id.html |grep share > album_id.txt
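The filter can be checked against a hypothetical saved page (the file contents below are made up for illustration): only lines containing both "account" and "share" survive, which is exactly the image-host URLs.

```shell
# made-up sample of a saved album page, just to illustrate the filter
cat > album_id.html <<'EOF'
<img src="http://3.share.photo.xuite.net/account/3193424/121088812_c.jpg">
<a href="http://blog.xuite.net/some/other/page">not an image link</a>
EOF
grep account album_id.html | grep share > album_id.txt
cat album_id.txt
# prints only the img line with the share.photo.xuite.net URL
```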


2. Write a Perl script to extract the links and download them with wget.


# extract the links from the index page of the photo album
# with HTML::LinkExtractor, then download each image with wget


use strict;
use warnings;
use HTML::LinkExtractor;

print "album_id = $ARGV[0]\n";
my $album_index = $ARGV[0];
my $album_page  = $ARGV[0] . ".txt";
open(ALBUM, $album_page) or die("Error: cannot open file '$album_page'\n");

my $LX = new HTML::LinkExtractor();

# create a directory named after the album and work inside it
unless (-d $album_index) {
        mkdir($album_index, 0777) or die "cannot create directory $album_index";
        }
chdir($album_index);
system("pwd");

my $number = 0;
while (<ALBUM>) {
        chomp;
        $LX->parse(\$_);
        for my $Link ( @{ $LX->links } ) {
                my $size = length($$Link{src} || '');
                if ( ($$Link{tag} eq "img") && ($size > 0) ) {
                        # for each target link found, ask for the
                        # 1024x768 version instead of the thumbnail
                        my $target_link = $$Link{src};
                        $target_link =~ s/_c\.jpg/_x\.jpg/;
                        # the local file name is the last part of the URL
                        my ($image) = $target_link =~ m{([^/]+)$};
                        # skip files that have already been downloaded
                        unless (-e $image) {
                                my $command = "wget $target_link\n";
                                print "$number:$command";
                                system($command);
                                }
                        $number++;
                        }
                }
        }

close(ALBUM);
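If HTML::LinkExtractor is unavailable, the link extraction can also be sketched with core tools only (a rough equivalent, assuming the thumbnail URLs sit on the grep-filtered lines):

```shell
# illustrative input; in practice album_id.txt comes from the grep step above
echo '<img src="http://3.share.photo.xuite.net/account/3193424/121088812_c.jpg">' > album_id.txt

# pull out the thumbnail links and rewrite them to the full-size versions
grep -o 'http://[^"]*_c\.jpg' album_id.txt | sed 's/_c\.jpg/_x\.jpg/' > links.txt
cat links.txt
# then fetch everything in one go, skipping files that already exist:
# wget -nc -i links.txt
```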
