package AI::Embedding;

use strict;
use warnings;

use HTTP::Tiny;
use JSON;
use Data::CosineSimilarity;

our $VERSION = '0.1_1';
$VERSION = eval $VERSION;

my $http = HTTP::Tiny->new;

# Create Embedding object
sub new {
    my $class = shift;
    my %attr  = @_;
    
    $attr{'error'}  = '';
    
    $attr{'api'}    = 'OpenAI' unless $attr{'api'};
    $attr{'error'}  = 'Invalid API' unless $attr{'api'} eq 'OpenAI';
    $attr{'error'}  = 'API Key missing' unless $attr{'key'};
    
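    $attr{'model'}  = 'text-embedding-ada-002' unless $attr{'model'};

    # A comparator passed to the constructor arrives as a CSV string;
    # convert it to the vector form that compare() expects
    $attr{'comparator'} = $class->_make_vector($attr{'comparator'})
        if defined $attr{'comparator'};

    return bless \%attr, $class;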
}

# Define endpoints for APIs
my %url    = (
    'OpenAI' => 'https://api.openai.com/v1/embeddings',
);

# Define HTTP Headers for APIs
my %header = (
    'OpenAI' => \&_get_header_openai,
);

# Returns true if last operation was success
sub success {
    my $self = shift;
    return !$self->{'error'};
}

# Returns error if last operation failed
sub error {
    my $self = shift;
    return $self->{'error'};
}

# Header for calling OpenAI
sub _get_header_openai {
    my $self = shift;
    return {
        headers => {
            # OpenAI expects the key as a Bearer token
            'Authorization' => 'Bearer ' . $self->{'key'},
            'Content-type'  => 'application/json'
        }
    };
}
 
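# Fetch Embedding response
sub _get_embedding {
    my ($self, $text) = @_;

    # Look up the header builder for this API and call it as a method
    my $get_header = $header{$self->{'api'}};

    return $http->post($url{$self->{'api'}}, {
        %{ $self->$get_header() },
        content => encode_json({
            input  => $text,
            model  => $self->{'model'},
        }),
    });
}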
 
# Return Embedding as a CSV string
sub embedding {
    my ($self, $text) = @_;

    $self->{'error'} = '';    # clear any error left by an earlier call
    my $response = $self->_get_embedding($text);
    if ($response->{'success'}) {
        my $embedding = decode_json($response->{'content'});
        return join (',', @{$embedding->{'data'}[0]->{'embedding'}});
    }
    $self->{'error'} = 'HTTP Error - ' . $response->{'reason'};
    return $response;
}

# Return Embedding as an array
sub raw_embedding {
    my ($self, $text) = @_;

    $self->{'error'} = '';    # clear any error left by an earlier call
    my $response = $self->_get_embedding($text);
    if ($response->{'success'}) {
        my $embedding = decode_json($response->{'content'});
        return @{$embedding->{'data'}[0]->{'embedding'}};
    }
    $self->{'error'} = 'HTTP Error - ' . $response->{'reason'};
    return $response;
}
 
# Convert a CSV Embedding into a hashref
sub _make_vector {
    my ($self, $embed_string) = @_;

    my %vector;
    my @embed = split /,/, $embed_string;
    for my $i (0 .. $#embed) {
        $vector{'feature' . $i} = $embed[$i];
    }
    return \%vector;
}
 
# Set a vector to compare
sub comparator {
    my ($self, $embed) = @_;
    
    $self->{'comparator'} = $self->_make_vector($embed);
    return;
}

# Compare 2 Embeddings
sub compare {
    my ($self, $embed1, $embed2) = @_;
    
    my $vector1 = $self->_make_vector($embed1);
    my $vector2;
    if (defined $embed2) {
        $vector2 = $self->_make_vector($embed2);
    } else {
        $vector2 = $self->{'comparator'};
    }
    
    if (!defined $vector2) {
        $self->{'error'} = 'Nothing to compare!';
        return;
    }
    
    if (scalar keys %$vector1 != scalar keys %$vector2) {
        $self->{'error'} = 'Embeds are unequal length';
        return;
    }
    
    my $cs = Data::CosineSimilarity->new;
    $cs->add( label1 => $vector1 );
    $cs->add( label2 => $vector2 );
    return $cs->similarity('label1', 'label2')->cosine;
}

1;

__END__

=head1 NAME

AI::Embedding - Perl module for working with text embeddings using various APIs

=head1 VERSION

Version 0.01

=head1 SYNOPSIS

    use AI::Embedding;

    my $embedding = AI::Embedding->new(
        api => 'OpenAI',
        key => 'your-api-key'
    );

    my $csv_embedding = $embedding->embedding('Some text');
    my @raw_embedding = $embedding->raw_embedding('Some text');
    $embedding->comparator($csv_embedding2);

    my $similarity = $embedding->compare($csv_embedding1);
    my $similarity_with_other_embedding = $embedding->compare($csv_embedding1, $csv_embedding2);

=head1 DESCRIPTION

The L<AI::Embedding> module provides an interface for working with text embeddings using various APIs. It currently supports the L<OpenAI|https://www.openai.com> L<Embeddings API|https://platform.openai.com/docs/guides/embeddings/what-are-embeddings>. This module allows you to generate embeddings for text and to compare embeddings by calculating the cosine similarity between them.

An Embedding is a multi-dimensional vector representing the meaning of a piece of text.  The Embedding vector is created by an AI Model.  The default model (OpenAI's C<text-embedding-ada-002>) produces a 1536-dimensional vector.  The resulting vector can be obtained as a Perl array or as a comma-separated string.  Because an Embedding is normally used as a whole rather than element by element, the comma-separated string is usually the more convenient form.  It is suitable for storing in a C<TEXT> field of a database.
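
As a rough illustration of the two forms, the following sketch converts between them with C<join> and C<split>.  The four-element vector is made up for brevity; a real C<text-embedding-ada-002> vector has 1536 elements.

    use strict;
    use warnings;

    # Hypothetical, shortened embedding (real ones are much longer)
    my @raw = (0.25, -0.5, 0.75, 0.125);

    # Array to CSV string, e.g. for a TEXT database column
    my $csv = join ',', @raw;

    # CSV string back to an array of numbers
    my @restored = split /,/, $csv;

    print "$csv\n";                  # prints 0.25,-0.5,0.75,0.125
    print scalar(@restored), "\n";   # prints 4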

=head2 Comparator

Embeddings are used to compare similarity of meaning between two passages of text.  A typical use case is to store a number of pieces of text (e.g. articles or blogs) in a database and compare each one to some user-supplied search text.  L<AI::Embedding> provides a C<compare> method to compare either two Embeddings or one Embedding to a previously supplied C<comparator>.  The C<comparator> can be set either when the object is constructed or by using the B<comparator> method.  When comparing multiple Embeddings to the same Embedding (such as search text) it is faster to use a C<comparator>, because the comparator is converted to vector form only once.
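
The search scenario described above might look something like the following sketch.  The article titles and the three-element embedding strings are invented for illustration; in practice the strings would come from the B<embedding> method or from a database.  No API call is made here, so the key is only a placeholder.

    use strict;
    use warnings;
    use AI::Embedding;

    my $embedding = AI::Embedding->new( key => 'your-api-key' );

    # Hypothetical stored embeddings, shortened to three elements each
    my %stored = (
        'Article A' => '0.9,0.1,0.1',
        'Article B' => '0.1,0.9,0.2',
        'Article C' => '0.7,0.3,0.1',
    );

    # Hypothetical embedding of the search text, set once as the comparator
    $embedding->comparator('1.0,0.1,0.1');

    # Compare every stored embedding against the comparator
    my %score;
    for my $title (keys %stored) {
        $score{$title} = $embedding->compare($stored{$title});
    }

    # List the best matches first
    for my $title (sort { $score{$b} <=> $score{$a} } keys %score) {
        printf "%-10s %.3f\n", $title, $score{$title};
    }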

=head1 CONSTRUCTOR

=head2 new

    my $embedding = AI::Embedding->new(
        api         => 'OpenAI', 
        key         => 'your-api-key',
        model       => 'text-embedding-ada-002',
        comparator  => $csv_embedding,
    );
    
Creates a new AI::Embedding object. The C<key> parameter, the API key issued by the service provider, is required.

Parameters:

=over

=item *

C<key> - B<required> The API Key

=item *

C<api> - The API to use.  Currently only 'OpenAI' is supported and this is the default.

=item *

C<model> - The language model to use.  Defaults to C<text-embedding-ada-002> - see L<OpenAI docs|https://platform.openai.com/docs/guides/embeddings/what-are-embeddings>

=item *

C<comparator> - A comma-separated embedding string to use as the C<comparator> - see L</"Comparator">

=back

=head1 METHODS

=head2 success

Returns true if the last method call was successful

=head2 error

Returns the last error message or an empty string if B<success> returned true
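
Going by the checks in the constructor, B<success> and B<error> can also be used straight after C<new> to catch a missing key.  A minimal sketch:

    use strict;
    use warnings;
    use AI::Embedding;

    # No key supplied, so the constructor records an error
    my $embedding = AI::Embedding->new( api => 'OpenAI' );

    unless ($embedding->success) {
        # prints: Setup failed: API Key missing
        print 'Setup failed: ', $embedding->error, "\n";
    }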

=head2 embedding

    my $csv_embedding = $embedding->embedding('Some text passage');

Generates an embedding for the given text and returns it as a comma-separated string. The C<embedding> method takes a single parameter, the text to generate the embedding for.

Returns a (rather long) string that can be stored in a C<TEXT> database field.

If the method call fails it sets the L</"error"> message and returns the complete L<HTTP::Tiny> response (a hash reference) instead of the string, so check L</"success"> before using the return value.
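
Because the return value differs between success and failure, calling code will probably want to test B<success> first.  The sketch below assumes the API key is held in an environment variable named C<OPENAI_API_KEY>; that name is only an assumption made for this example.

    use strict;
    use warnings;
    use AI::Embedding;

    # Assumption: the key is supplied through the environment
    my $embedding = AI::Embedding->new( key => $ENV{'OPENAI_API_KEY'} );

    my $result = $embedding->embedding('Some text passage');

    if ($embedding->success) {
        # $result is the comma-separated embedding string
        print length($result), " characters received\n";
    } else {
        # $result holds the HTTP::Tiny response hash reference instead
        warn 'Embedding failed: ', $embedding->error, "\n";
        warn "HTTP status: $result->{'status'}\n" if ref $result eq 'HASH';
    }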

=head2 raw_embedding

    my @raw_embedding = $embedding->raw_embedding('Some text passage');

Generates an embedding for the given text and returns it as an array. The C<raw_embedding> method takes a single parameter, the text to generate the embedding for.

It is not normally necessary to use this method as the Embedding will almost always be used as a single homogeneous unit.

If the method call fails it sets the L</"error"> message and returns the complete L<HTTP::Tiny> response (a hash reference) instead of the array, so check L</"success"> before using the return value.

=head2 comparator

    $embedding->comparator($csv_embedding2);

Sets a vector as a C<comparator> for future comparisons. The B<comparator> method takes a single parameter, the comma-separated embedding string to use as the comparator.

See L</"Comparator">

=head2 compare

    my $similarity = $embedding->compare($csv_embedding1);
    my $similarity_with_other_embedding = $embedding->compare($csv_embedding1, $csv_embedding2);

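Compares two embeddings and returns the cosine similarity between them. The B<compare> method takes one or two parameters: $csv_embedding1 and, optionally, $csv_embedding2 (both comma-separated embedding strings).

If only one parameter is provided, it is compared with the previously set C<comparator>.  If no C<comparator> has been set, or the two embeddings have different lengths, the method sets the L</"error"> message and returns nothing.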

Returns the cosine similarity as a floating-point number between -1 and 1, where 1 represents identical embeddings, 0 represents no similarity, and -1 represents opposite embeddings.
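
To make the range concrete, here is a plain-Perl sketch of cosine similarity on short, made-up vectors.  It is not how the module computes the value (the module delegates to L<Data::CosineSimilarity>), and C<cosine_csv> is a hypothetical helper written only for this example.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Cosine similarity of two equal-length, comma-separated vectors
    sub cosine_csv {
        my ($csv1, $csv2) = @_;
        my @x = split /,/, $csv1;
        my @y = split /,/, $csv2;

        # Dot product divided by the product of the vector lengths
        my $dot    = sum( map { $x[$_] * $y[$_] } 0 .. $#x );
        my $norm_x = sqrt( sum( map { $_ * $_ } @x ) );
        my $norm_y = sqrt( sum( map { $_ * $_ } @y ) );

        return $dot / ($norm_x * $norm_y);
    }

    print cosine_csv('3,4', '3,4'),   "\n";   # identical:  prints 1
    print cosine_csv('1,0', '0,1'),   "\n";   # orthogonal: prints 0
    print cosine_csv('3,4', '-3,-4'), "\n";   # opposite:   prints -1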

The absolute number is not usually relevant for text comparison.  It is usually sufficient to rank the comparison results from high to low, which orders them from the best match to the worst match.

=head1 SEE ALSO

L<https://openai.com> - OpenAI official website

=head1 AUTHOR

Ian Boddison <ian at boddison.com>

=head1 BUGS

Please report any bugs or feature requests to C<bug-ai-embedding at rt.cpan.org>, or through
the web interface at L<https://rt.cpan.org/NoAuth/ReportBug.html?Queue=bug-ai-embedding>.  I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.

=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc AI::Embedding

You can also look for information at:

=over 4

=item * RT: CPAN's request tracker (report bugs here)

L<https://rt.cpan.org/NoAuth/Bugs.html?Dist=AI-Embedding>

=item * Search CPAN

L<https://metacpan.org/release/AI::Embedding>

=back

=head1 ACKNOWLEDGEMENTS

Thanks to the help and support provided by members of Perl Monks L<https://perlmonks.org/>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2023 by Ian Boddison.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut
