NAME
Web::PageMeta - get page open-graph / meta data
SYNOPSIS
    use feature 'say';
    use Web::PageMeta;
    my $page = Web::PageMeta->new(url => "https://www.apa.at/");
    say $page->title;
    say $page->image;
Fetch previews and images asynchronously:
    use feature 'say';
    use Future;
    use Web::PageMeta;
    my @urls = qw(
        https://www.apa.at/
        http://www.diepresse.at/
        https://metacpan.org/
        https://github.com/
    );
    my @page_views = map { Web::PageMeta->new( url => $_ ) } @urls;
    Future->wait_all( map { $_->fetch_image_data_ft } @page_views )->get;
    foreach my $pv (@page_views) {
        say 'title> ' . $pv->title;
        say 'img_size> ' . length( $pv->image_data );
    }
    # alternatively, instead of Future->wait_all()
    use Future::Utils qw( fmap_void );
    fmap_void(
        sub { return $_[0]->fetch_image_data_ft },
        foreach    => [@page_views],
        concurrent => 3
    )->get;
DESCRIPTION
Fetches web page meta data, Open Graph and otherwise. It can be used in
both synchronous and asynchronous code.
If a download returns any HTTP status code other than 200, an
HTTP::Exception is thrown.
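Because failed downloads surface as exceptions, callers will usually want to trap them. The following is a minimal sketch, not part of the module: "describe_fetch_error" and "title_or_undef" are hypothetical helper names, and the sketch assumes the thrown object provides the "code" and "status_message" methods that HTTP::Exception documents. Any other error (a timeout, for example) is treated as a plain message.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper: turn whatever a failed fetch died with into a log line.
# Assumes exception objects expose code() and status_message(), as
# HTTP::Exception objects do; anything else is stringified.
sub describe_fetch_error {
    my ($err) = @_;
    if ( blessed($err) && $err->can('code') && $err->can('status_message') ) {
        return sprintf( 'HTTP %d %s', $err->code, $err->status_message );
    }
    return "fetch failed: $err";
}

# Hypothetical wrapper: returns the page title, or undef (after a warning)
# when the download throws.
sub title_or_undef {
    my ($url) = @_;
    require Web::PageMeta;
    my $title;
    eval { $title = Web::PageMeta->new( url => $url )->title; 1 }
        or warn describe_fetch_error($@), "\n";
    return $title;
}
```

Calling title_or_undef('https://metacpan.org/') then either returns the title or warns with a line such as "HTTP 404 Not Found" and returns undef.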
ACCESSORS
new
Constructor, only "url" is required.
url
HTTP url to fetch data from.
timeout
In addition to the AnyEvent::HTTP timeout, the elapsed time is checked
while the data is being downloaded, and the download dies once it
exceeds the limit. Default is 5 minutes.
max_size
Will die when the document or image size is greater than this limit.
Default 100MB.
user_agent
User-Agent header to use for HTTP requests. The default is the one sent
by Chrome 89.0.4389.90.
extra_headers
HashRef with extra http request headers.
cookie_jar
Accepts an optional HTTP::Cookies compatible object that must provide a
"get_cookies()" method. If set, HTTP cookie headers are sent with each
request.
title
Returns title of the page.
description
Returns description of the page.
canonical_url
Returns the Open Graph URL. If it is not present, returns "url".
image
Returns image location of the page.
image_data
Returns image binary data of "image" link.
Throws a 404 exception if there is no "image" link.
page_meta
Returns hash ref with all open-graph data.
extra_scraper
Web::Scraper::LibXML object used to fetch the image, title or
description from a location other than the default one.
    use Web::Scraper::LibXML;
    use Web::PageMeta;
    my $escraper = scraper {
        process_first '.slider .camera_wrap div', 'image' => '@data-src';
    };
    my $wmeta = Web::PageMeta->new(
        url           => 'https://www.meon.eu/',
        extra_scraper => $escraper,
    );
page_body_hdr
Returns array ref with page [$body,$headers]. Can be useful for
post-processing or special/additional data extractions.
Only "text/html" content-type is accepted for fetching.
fetch_page_meta_ft
Returns a future object for fetching page meta data. See "ASYNC USE".
On done, the "page_meta" hash is returned.
fetch_image_data_ft
Returns a future object for fetching image data. See "ASYNC USE". On
done, the "image_data" scalar is returned.
fetch_page_body_hdr_ft
Returns a future object for fetching page content and headers. See
"ASYNC USE". On done, the "page_body_hdr" array ref is returned.
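The constructor attributes listed above can be combined in a single call. The sketch below is illustrative only: the values are arbitrary, and it assumes that "timeout" is given in seconds and "max_size" in bytes, which this document does not state explicitly (only the defaults of 5 minutes and 100MB are given).

```perl
use strict;
use warnings;
use Web::PageMeta;

# Assumption: timeout is in seconds and max_size is in bytes.
my $page = Web::PageMeta->new(
    url           => 'https://metacpan.org/',
    timeout       => 30,                             # stricter than the default
    max_size      => 5 * 1024 * 1024,                # refuse anything over 5MB
    extra_headers => { 'Accept-Language' => 'en' },  # sent with each request
);

# The values passed to the constructor can be read back via the accessors.
printf "timeout=%s max_size=%s\n", $page->timeout, $page->max_size;
```

Reading a data accessor such as "title" or "image_data" on this object is what performs the actual download, under the limits set here.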
ASYNC USE
To run multiple page meta data or image HTTP requests in parallel, or to
use the module in async programs, call "fetch_page_meta_ft" and
"fetch_image_data_ft", which return Future objects. See "SYNOPSIS" or
t/02_async.t for sample use.
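When several fetches run in parallel, some may fail while others succeed. Future->wait_all() completes once every component future is ready, whether it succeeded or failed, so each future can be inspected afterwards. The sketch below shows one way to do that; "partition_futures" and "fetch_meta_for" are hypothetical helper names, not part of this module.

```perl
use strict;
use warnings;
use Future;

# Hypothetical helper: split settled futures into successes and failures.
# Takes a hash ref of label => Future; every future must already be ready.
sub partition_futures {
    my ($ft_by_label) = @_;
    my ( %ok, %failed );
    for my $label ( keys %$ft_by_label ) {
        my $ft = $ft_by_label->{$label};
        if ( $ft->is_failed ) {
            $failed{$label} = scalar $ft->failure;    # the exception or message
        }
        elsif ( $ft->is_done ) {
            $ok{$label} = $ft->get;                   # first result of the future
        }
    }
    return ( \%ok, \%failed );
}

# Hypothetical driver: fetch meta data for several URLs in parallel and
# return ( \%page_meta_by_url, \%error_by_url ).
sub fetch_meta_for {
    my (@urls) = @_;
    require Web::PageMeta;
    my @pages = map { Web::PageMeta->new( url => $_ ) } @urls;
    my %ft    = map { $_->url => $_->fetch_page_meta_ft } @pages;
    Future->wait_all( values %ft )->get;    # returns even if some fetches failed
    return partition_futures( \%ft );
}
```

A caller could then write my ($meta, $errors) = fetch_meta_for(@urls); and report the entries of %$errors without losing the pages that were fetched successfully.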
SEE ALSO
AUTHOR
Jozef Kutej
LICENSE AND COPYRIGHT
Copyright 2021 jkutej@cpan.org
This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.