
TopicModelsVB.jl

v1.x compatible.

A Julia package for variational Bayesian topic modeling.

Topic models are Bayesian hierarchical models designed to discover the latent low-dimensional thematic structure within corpora. Topic models are fit using either Markov chain Monte Carlo (MCMC), or variational inference (VI).

Markov chain Monte Carlo methods are slow but consistent: given enough time, MCMC will fit the exact model asymptotically. Contrarily, variational inference is fast but inconsistent, as one must approximate distributions in order to ensure tractability.

This package takes the latter approach to topic modeling.

Installation

              (@v1.7) pkg> add TopicModelsVB

Dependencies

DelimitedFiles SpecialFunctions LinearAlgebra Random Distributions OpenCL Crayons

Datasets

Included in TopicModelsVB.jl are two datasets:

  1. National Science Foundation Abstracts 1989 - 2003:
  • 128804 documents
  • 25319 vocabulary
  2. CiteULike Science Article Database:
  • 16980 documents
  • 8000 vocabulary
  • 5551 users

Corpus

Let's begin with the Corpus data structure. The Corpus data structure has been designed for maximum ease-of-use. Datasets must still be cleaned and put into the appropriate format, but once a dataset is in the proper format and read into a corpus, it can easily be modified to meet the user's needs.

There are four plaintext files that make up a corpus:

  • docfile
  • vocabfile
  • userfile
  • titlefile

None of these files are mandatory to read a corpus, and in fact reading no files will result in an empty corpus. However, in order to train a model a docfile will be necessary, since it contains all quantitative information known about the documents. On the other hand, the vocab, user and title files are used solely for interpreting output.

The docfile should be a plaintext file containing lines of delimited numerical values. Each document is a block of lines, the number of which depends on what data is known about the documents. Since a document is at its essence a list of terms, each document must contain at least one line containing a nonempty list of delimited positive integer values corresponding to the terms of which it is composed. Any further lines in a document block are optional, however if they are present they must be present for all documents and must come in the following order:

terms - A line of delimited positive integers corresponding to the terms which make up the document (this line is mandatory).

counts - A line of delimited positive integers, equal in length to terms, corresponding to the number of times a term appears in a document.

readers - A line of delimited positive integers corresponding to those users which have read the document.

ratings - A line of delimited positive integers, equal in length to readers, corresponding to the rating each reader gave the document.

An example of a single document block from a docfile with all possible lines included,

              ...
              4,10,3,100,57
              1,1,2,1,3
              1,9,10
              1,1,5
              ...

The vocab and user files are tab delimited dictionaries mapping positive integers to terms and usernames (resp.). For example,

              1    this
              2    is
              3    a
              4    vocab
              5    file

A userfile is identical to a vocabfile, except usernames will appear in place of vocabulary terms.

Finally, a titlefile is simply a list of titles, not a dictionary, and is of the form,

              title1
              title2
              title3
              title4
              title5

The order of these titles corresponds to the order of document blocks in the associated docfile.

To read a corpus into Julia, use the following function,

              readcorp(;docfile="", vocabfile="", userfile="", titlefile="", delim=',', counts=false, readers=false, ratings=false)

The file keyword arguments indicate the path where the respective file is located.
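As a hedged illustration (the file paths below are hypothetical, and the vocabulary/title files are optional), reading a corpus whose docfile includes a counts line might look like:

```julia
## Hypothetical paths; counts=true tells readcorp to expect a counts line
## in each document block of the docfile.
corp = readcorp(docfile="data/docfile.txt", vocabfile="data/vocabfile.txt",
                titlefile="data/titlefile.txt", delim=',', counts=true)
```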

It is often the case that even once files are correctly formatted and read, the corpus will still contain formatting defects which prevent it from being loaded into a model. Therefore, before loading a corpus into a model, it is important that one of the following is run,

              fixcorp!(corp)

or

              fixcorp!(corp, pad=true)

Padding a corpus will ensure that any documents which contain vocab or user keys not in the vocab or user dictionaries are not removed. Instead, generic vocab and user keys will be added as necessary to the vocab and user dictionaries (resp.).

The fixcorp! function allows for significant customization of the corpus object.

For example, let's begin by loading the CiteULike corpus,

              corp = readcorp(:citeu)

A standard preprocessing step might involve removing stop words, removing terms which appear less than 200 times, and alphabetizing our corpus.

              fixcorp!(corp, stop=true, abridge=200, alphabetize=true, trim=true)

              ## Generally you will also want to trim your corpus.
              ## Setting trim=true will remove leftover terms from the corpus vocabulary.

After removing stop words and abridging our corpus, the vocabulary size has gone from 8000 to 1692.

A consequence of removing so many terms from our corpus is that some documents may now be empty. We can remove these documents from our corpus with the following command,

              fixcorp!(corp, remove_empty_docs=true)

In addition, if you would like to preserve term order in your documents, then you should refrain from condensing your corpus.

For example,

corp = Corpus(Document(1:9), vocab=split("the quick brown fox jumped over the lazy dog"))
showdocs(corp)

               ●●● Document 1
              the quick brown fox jumped over the lazy dog

fixcorp!(corp, condense=true)
showdocs(corp)

               ●●● Document 1
              jumped fox over the quick dog lazy brown the

Important. A corpus is simply a container for documents.

Whenever you load a corpus into a model, a copy of that corpus is made, such that if you modify the original corpus at corpus-level (remove documents, reorder vocab keys, etc.), this will not affect any corpus attached to a model. However! Since corpora are containers for their documents, modifying an individual document will affect it in all corpora which contain it. Therefore,

  1. Using fixcorp! to modify the documents of a corpus will not result in corpus defects, but will cause them also to be changed in all other corpora which contain them.

  2. If you would like to make a copy of a corpus with independent documents, use deepcopy(corp).

  3. Manually modifying documents is dangerous, and can result in corpus defects which cannot be fixed by fixcorp!. It is advised that you don't do this without good reason.
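A minimal sketch of this sharing behavior, assuming the Corpus and Document constructors shown earlier (copy vs. deepcopy semantics here follow standard Julia conventions):

```julia
doc = Document(1:3)

corp_a = Corpus(doc)
corp_b = copy(corp_a)        # shares the same Document objects as corp_a
corp_c = deepcopy(corp_a)    # owns fully independent copies of the documents

## Mutating doc directly (dangerous, see point 3) would show up in both
## corp_a and corp_b, but corp_c would be unaffected.
```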

Models

The available models are as follows:

CPU Models

              LDA(corp, K)
Latent Dirichlet allocation model with K topics.

              fLDA(corp, K)
Filtered latent Dirichlet allocation model with K topics.

              CTM(corp, K)
Correlated topic model with K topics.

              fCTM(corp, K)
Filtered correlated topic model with K topics.

              CTPF(corp, K)
Collaborative topic Poisson factorization model with K topics.

GPU Models

              gpuLDA(corp, K)
GPU accelerated latent Dirichlet allocation model with K topics.

              gpuCTM(corp, K)
GPU accelerated correlated topic model with K topics.

              gpuCTPF(corp, K)
GPU accelerated collaborative topic Poisson factorization model with K topics.

Tutorial

Latent Dirichlet Allocation

Let's begin our tutorial with a simple latent Dirichlet allocation (LDA) model with 9 topics, trained on the first 5000 documents from the NSF corpus.

              using TopicModelsVB
              using Random
              using Distributions

              Random.seed!(7);

              corp = readcorp(:nsf)

              corp.docs = corp[1:5000];
              fixcorp!(corp, trim=true)

              ## It's strongly recommended that you trim your corpus when reducing its size in order to remove excess vocabulary.
              ## Notice that the post-fix vocabulary is smaller after removing all but the first 5000 docs.

              model = LDA(corp, 9)
              train!(model, iter=150, tol=0)

              ## Setting tol=0 will ensure that all 150 iterations are completed.
              ## If you don't want to compute the ∆elbo, set checkelbo=Inf.

              ## training...

              showtopics(model, cols=9, 20)
              topic 1       topic 2       topic 3       topic 4       topic 5       topic 6       topic 7        topic 8      topic 9
              research      system        data          theory        research      research      research       research     plant
              problems      research      earthquake    study         university    data          project        study        cell
              design        data          project       problems      support       project       study          chemistry    species
              systems       systems       research      research      students      study         data           high         protein
              algorithms    control       study         equations     program       ocean         social         studies      cells
              parallel      time          soil          work          science       water         understanding  properties   plants
              ...

If you are interested in the raw topic distributions, for LDA and CTM models you may access them via the matrix,

model.beta

              ## K x V matrix
              ## K = number of topics.
              ## V = number of vocabulary terms, ordered identically to the keys in model.corp.vocab.
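For instance, one could recover the top terms of a topic directly from this matrix. The following is a sketch assuming the trained model from the tutorial above, and that the vocabulary keys run 1 through V after fixing the corpus:

```julia
## Rank the vocabulary by probability under topic 1 and print the top 10 terms.
topic = 1
top_terms = sortperm(model.beta[topic, :], rev=true)[1:10]
println([model.corp.vocab[v] for v in top_terms])
```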

Now that we've trained our LDA model we can, if we desire, take a look at the topic proportions for individual documents.

For instance, document 1 has topic breakdown,

              println(round.(topicdist(model, 1), digits=3))

              ## = [0.0, 0.0, 0.0, 0.0, 0.0, 0.435, 0.082, 0.0, 0.482]

This vector of topic weights suggests that document 1 is mostly about biology, and in fact looking at the document text confirms this observation,

              showdocs(model, 1)

              ## Could also have done showdocs(corp, 1).
               ●●● Document 1
               ●●● CRB: Genetic Diversity of Endangered Populations of Mysticete Whales: Mitochondrial DNA and Historical Demography
              commercial exploitation past hundred years great extinction variation sizes populations prior minimal population size current permit analyses effects differing levels species distributions life history...

Just for fun, let's consider one more document (document 25),

              println(round.(topicdist(model, 25), digits=3))

              ## = [0.0, 0.0, 0.0, 0.849, 0.0, 0.149, 0.0, 0.0, 0.0]

              showdocs(model, 25)
               ●●● Document 25
               ●●● Mathematical Sciences: Nonlinear Partial Differential Equations from Hydrodynamics
              work project continues mathematical research nonlinear elliptic problems arising perfect fluid hydrodynamics emphasis analytical study propagation waves stratified media techniques analysis partial differential equations form basis studies primary goals understand nature internal presence vortex rings arise density stratification due salinity temperature...

We see that in this case document 25 appears to be about environmental computational fluid dynamics, which corresponds precisely to topics 4 and 6.
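topicdist can also be used to search the corpus, for example for the documents most dominated by a given topic. This is a sketch assuming the trained model above, and that a Corpus supports length:

```julia
## Score every document by its weight on topic 9 and show the top 5 titles.
weights = [topicdist(model, d)[9] for d in 1:length(model.corp)]
top_docs = sortperm(weights, rev=true)[1:5]
showtitles(model.corp, top_docs)
```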

Furthermore, if we want to, we can also generate artificial corpora by using the gencorp function.

Generating artificial corpora will in turn run the underlying probabilistic graphical model as a generative process in order to produce entirely new collections of documents, let's try it out,

Random.seed!(7);

artificial_corp = gencorp(model, 5000, laplace_smooth=1e-5)

## The laplace_smooth argument governs the amount of Laplace smoothing (defaults to 0).

artificial_model = LDA(artificial_corp, 9)
train!(artificial_model, iter=150, tol=0, checkelbo=10)

## training...

showtopics(artificial_model, cols=9)
              topic 1       topic 2       topic 3       topic 4       topic 5       topic 6       topic 7       topic 8       topic 9
              system        plant         research      research      research      research      project       theory        data
              research      species       project       design        study         university    data          study         research
              data          cell          study         issues        chemistry     support       earthquake    problems      project
              systems       studies       data          algorithms    high          students      research      research      study
              control       protein       social        systems       properties    program       structures    equations     water
              project       cells         important     parallel      studies       science       study         work          ocean
              ...

Correlated Topic Model

For our next model, let's upgrade to a (filtered) correlated topic model (fCTM).

Filtering the correlated topic model will dynamically identify and suppress stop words which would otherwise clutter up the topic distribution output.

Random.seed!(7);

model = fCTM(corp, 9)
train!(model, tol=0, checkelbo=Inf)

## training...

showtopics(model, 20, cols=9)
              topic 1       topic 2       topic 3        topic 4       topic 5        topic 6      topic 7      topic 8      topic 9
              algorithms    earthquake    theory         students      ocean          economic     chemistry    physics      protein
              design        data          problems       science       water          social       chemical     optical      cell
              parallel      soil          equations      support       sea            theory       metal        solar        cells
              system        damage        geometry       university    climate        policy       reactions    high         plant
              systems       species       investigator   research      marine         political    molecular    laser        species
              performance   seismic       mathematical   program       measurements   market       surface      particle     gene
              ...

Based on the top 20 terms in each topic, we might tentatively assign the following topic labels:

  • topic 1: Computer Science
  • topic 2: Archaeology
  • topic 3: Mathematics
  • topic 4: Academia
  • topic 5: Earth Science
  • topic 6: Economics
  • topic 7: Chemistry
  • topic 8: Physics
  • topic 9: Molecular Biology

Now let's have a look at the topic-covariance matrix,

model.sigma

## Top two off-diagonal positive entries:
model.sigma[1,3] # = 18.275
model.sigma[5,9] # = 11.393

## Top two negative entries:
model.sigma[3,9] # = -27.430
model.sigma[3,5] # = -19.441

According to the list above, the most closely related topics are topics 1 and 3, which correspond to the Computer Science and Mathematics topics, followed by topics 5 and 9, corresponding to Earth Science and Molecular Biology.

As for the most unlikely topic pairings, most strongly negatively correlated are topics 3 and 9, corresponding to Mathematics and Molecular Biology, followed by topics 3 and 5, corresponding to Mathematics and Earth Science.
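The entries above could also be found programmatically, by ranking all off-diagonal entries of model.sigma. This is a sketch assuming the 9-topic fCTM trained above:

```julia
## Collect every (i, j) topic pair with its covariance and sort by value.
pairs = [(i, j, model.sigma[i, j]) for i in 1:9 for j in (i+1):9]
sort!(pairs, by=p -> p[3], rev=true)

pairs[1:2]        # most positively correlated topic pairs
pairs[end-1:end]  # most negatively correlated topic pairs
```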

Topic Prediction

The topic models so far discussed can also be used to train a classification algorithm designed to predict the topic distribution of new, unseen documents.

Let's take our 5,000 document NSF corpus, and partition it into training and test corpora,

train_corp = copy(corp)
train_corp.docs = train_corp[1:4995];

test_corp = copy(corp)
test_corp.docs = test_corp[4996:5000];

Now we can train our LDA model on just the training corpus, and then use that trained model to predict the topic distributions of the five documents in our test corpus,

Random.seed!(7);

train_model = LDA(train_corp, 9)
train!(train_model, checkelbo=Inf)

test_model = predict(test_corp, train_model)

The predict function works by taking in a corpus of new, unseen documents, and a trained model, and returning a new model of the same type. This new model can then be inspected directly, or using topicdist, in order to see the topic distribution for the documents in the test corpus.

Let's first take a look at both the topics for the trained model and the documents in our test corpus,

              showtopics(train_model, cols=9, 20)
              topic 1       topic 2       topic 3       topic 4       topic 5       topic 6       topic 7        topic 8      topic 9
              research      system        data          theory        research      research      research       research     plant
              design        research      earthquake    study         university    data          project        study        cell
              problems      data          project       problems      support       project       study          chemistry    species
              systems       systems       research      research      students      study         data           high         protein
              algorithms    control       study         equations     program       ocean         social         studies      cells
              parallel      time          soil          work          science       water         understanding  chemical     plants
              ...
              showtitles(corp, 4996:5000)
               • Document 4996 Decision-Making, Modeling and Forecasting Hydrometeorologic Extremes Under Climate Change
               • Document 4997 Mathematical Sciences: Representation Theory Conference, September 13-15, 1991, Eugene, Oregon
               • Document 4998 Irregularity Modeling & Plasma Line Studies at High Latitudes
               • Document 4999 Uses and Simulation of Randomness: Applications to Cryptography, Program Checking and Counting Problems
               • Document 5000 New Possibilities for Understanding the Role of Neuromelanin

Now let's have a look at the predicted topic distributions for these 5 documents,

              for d in 1:5
                  println("Document ", 4995 + d, ": ", round.(topicdist(test_model, d), digits=3))
              end
              Document 4996: [0.372, 0.003, 0.0, 0.0, 0.001, 0.588, 0.035, 0.001, 0.0]
              Document 4997: [0.0, 0.0, 0.0, 0.538, 0.385, 0.001, 0.047, 0.027, 0.001]
              Document 4998: [0.0, 0.418, 0.0, 0.0, 0.001, 0.462, 0.0, 0.118, 0.0]
              Document 4999: [0.46, 0.04, 0.002, 0.431, 0.031, 0.002, 0.015, 0.002, 0.016]
              Document 5000: [0.0, 0.044, 0.0, 0.001, 0.001, 0.001, 0.0, 0.173, 0.78]

Collaborative Topic Poisson Factorization

For our final model, we take a look at the collaborative topic Poisson factorization (CTPF) model.

CTPF is a collaborative filtering topic model which uses the latent thematic structure of documents to improve the quality of document recommendations beyond what would be possible using the document-user matrix alone. This blending of thematic structure with known user preferences not only improves recommendation accuracy, but also mitigates the cold-start problem of recommending to users never-before-seen documents. As an example, let's load the CiteULike dataset into a corpus and then randomly remove a single reader from each of the documents.

Random.seed!(1);

corp = readcorp(:citeu)

ukeys_test = Int[];
for doc in corp
    index = sample(1:length(doc.readers), 1)[1]
    push!(ukeys_test, doc.readers[index])
    deleteat!(doc.readers, index)
    deleteat!(doc.ratings, index)
end

Important. We refrain from fixing our corpus in this case, first because the CiteULike dataset is pre-packaged and thus pre-fixed, but more importantly, because removing user keys from documents and then fixing a corpus may result in a re-ordering of its user dictionary, which would in turn invalidate our test set.

After training, we will evaluate model quality by measuring our model's success at imputing the correct user back into each of the document libraries.

It's also worth noting that after removing a single reader from each document, 158 of the documents now have zero readers,

              sum([isempty(doc.readers) for doc in corp]) # = 158

Fortunately, since CTPF can, if need be, depend entirely on thematic structure when making recommendations, this poses no problem for the model.

Now that we've set up our experiment, let's instantiate and train a CTPF model on our corpus. Furthermore, in the interest of time, we'll also go ahead and GPU accelerate it.

model = gpuCTPF(corp, 100)
train!(model, iter=50, checkelbo=Inf)

## training...

Finally, we evaluate the performance of our model on the test set.

ranks = Float64[];
for (d, u) in enumerate(ukeys_test)
    urank = findall(model.drecs[d] .== u)[1]
    nrlen = length(model.drecs[d])
    push!(ranks, (nrlen - urank) / (nrlen - 1))
end

The following histogram shows the proportional ranking of each test user within the list of recommendations for their respective document.


Let's also have a look at the top recommendations for a particular document,

ukeys_test[1] # = 997
ranks[1] # = 0.978
showdrecs(model, 1, 120)
 ●●● Document 1
 ●●● The metabolic world of Escherichia coli is not small
 ...
117. #user4586
118. #user5395
119. #user531
120. #user997

What the above output tells us is that user 997's test document placed him or her in the top 2.2% (position 120) of all non-readers.
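To make that 0.978 figure concrete, here is the arithmetic from the ranks loop above; the `nrlen` value below is hypothetical, chosen only to show how position 120 maps to roughly the top 2.2%,

```julia
## Hypothetical values: urank is the held-out user's position in the
## document's recommendation list, nrlen the length of that list.
urank = 120
nrlen = 5411

## Fraction of non-readers ranked below the held-out user.
prop_rank = (nrlen - urank) / (nrlen - 1)  # ≈ 0.978
```

One minus this proportional rank gives the top-percentile figure quoted above.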

For evaluating our model's user recommendations, let's take a more holistic approach.

Since large heterogeneous libraries make the qualitative assessment of recommendations difficult, let's search for a user with a small focused library,

 ●●● User 1741
 • Region-Based Memory Management
 • A Syntactic Approach to Type Soundness
 • Imperative Functional Programming
 • The essence of functional programming
 • Representing monads
 • The marriage of effects and monads
 • A Taste of Linear Logic
 • Monad transformers and modular interpreters
 • Comprehending Monads
 • Monads for functional programming
 • Building interpreters by composing monads
 • Typed memory management via static capabilities
 • Computational Lambda-Calculus and Monads
 • Why functional programming matters
 • Tackling the Awkward Squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell
 • Notions of Computation and Monads
 • Recursion schemes from comonads
 • There and back again: arrows for invertible programming
 • Composing monads using coproducts
 • An Introduction to Category Theory, Category Theory Monads, and Their Relationship to Functional Programming

The 20 articles in user 1741's library suggest that he or she is interested in programming language theory.

Now compare this with the top 25 recommendations (the top 0.15%) made by our model,

              showurecs(model,              1741,              25)
 ●●● User 1741
 1.  On Understanding Types, Data Abstraction, and Polymorphism
 2.  Functional programming with bananas, lenses, envelopes and barbed wire
 3.  Can programming be liberated from the von {N}eumann style? {A} functional style and its algebra of programs
 4.  Monadic Parser Combinators
 5.  Domain specific embedded compilers
 6.  Type Classes with Functional Dependencies
 7.  Theorems for Free!
 8.  Scrap your boilerplate: a practical design pattern for generic programming
 9.  Types, abstraction and parametric polymorphism
 10. Linear types can change the world!
 11. Haskell's overlooked object system
 12. Lazy functional state threads
 13. Functional response of a generalist insect predator to one of its prey species in the field.
 14. Improving literature based discovery support by genetic knowledge integration.
 15. A new notation for arrows
 16. Total Functional Programming
 17. Monadic Parsing in Haskell
 18. Types and programming languages
 19. Applicative Programming with Effects
 20. Triangle: {E}ngineering a {2D} {Q}uality {M}esh {G}enerator and {D}elaunay {T}riangulator
 21. Motion doodles: an interface for sketching character motion
 22. 'I've Got Nothing to Hide' and Other Misunderstandings of Privacy
 23. Human cis natural antisense transcripts initiated by transposable elements.
 24. Codata and Comonads in Haskell
 25. How to make ad-hoc polymorphism less ad hoc

For the CTPF models, you may access the raw topic distributions by computing,

Raw scores, as well as document and user recommendations, may be accessed via,

model.scores ## M x U matrix
## M = number of documents, ordered identically to the documents in model.corp.docs.
## U = number of users, ordered identically to the keys in model.corp.users.

model.drecs
model.urecs
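The relationship between scores and the two recommendation fields can be illustrated on a toy matrix; the code below is a sketch of the ranking idea only, not how the package populates these fields internally,

```julia
## Toy 2x3 scores matrix: 2 documents, 3 users (values illustrative).
scores = [0.9 0.1 0.4;
          0.2 0.8 0.5]

## For each document, users sorted from most to least recommended,
## and for each user, documents sorted likewise.
drecs = [sortperm(scores[d, :], rev=true) for d in 1:size(scores, 1)]
urecs = [sortperm(scores[:, u], rev=true) for u in 1:size(scores, 2)]

drecs[1]  # [1, 3, 2]: user 1 is the top recommendation for document 1
```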

Note, as was done by Blei et al. in their original paper, if you would like to warm start your CTPF model using the topic distributions generated by one of the other models, simply do the following prior to training your model,

ctpf_model.alef              =              exp.(model.beta)                              ## For model of type: LDA, fLDA, CTM, fCTM, gpuLDA, gpuCTM.            

GPU Acceleration

GPU accelerating your model runs its performance bottlenecks on the GPU.

There's no reason to instantiate GPU models directly; instead, you can simply instantiate the normal version of a supported model, then use the @gpu macro to train it on the GPU,

model = LDA(readcorp(:nsf), 20)
@gpu train!(model, checkelbo=Inf) ## training...

Important. Notice that we did not check the ELBO at all during training. While you may check the ELBO if you wish, it's recommended that you do so infrequently, as computing the ELBO is done entirely on the CPU.

Here are the log-scaled benchmarks of the coordinate ascent algorithms for the GPU models, compared against their CPU equivalents,

GPU Benchmark

As we can see, running your model on the GPU is significantly faster than running it on the CPU.

Note that it's expected that your computer will lag when training on the GPU, since you're effectively siphoning off its rendering resources to fit your model.

Glossary

Types

mutable struct Document
    "Document mutable struct."

    "terms:   A Vector{Int} containing keys for the Corpus vocab Dict."
    "counts:  A Vector{Int} denoting the counts of each term in the Document."
    "readers: A Vector{Int} denoting the keys for the Corpus users Dict."
    "ratings: A Vector{Int} denoting the ratings for each reader in the Document."
    "title:   The title of the document (String)."

    terms::Vector{Int}
    counts::Vector{Int}
    readers::Vector{Int}
    ratings::Vector{Int}
    title::String

mutable struct Corpus
    "Corpus mutable struct."

    "docs:  A Vector{Document} containing the documents which belong to the Corpus."
    "vocab: A Dict{Int, String} containing a mapping term Int (key) => term String (value)."
    "users: A Dict{Int, String} containing a mapping user Int (key) => user String (value)."

    docs::Vector{Document}
    vocab::Dict{Int, String}
    users::Dict{Int, String}

abstract type TopicModel end

mutable struct LDA <: TopicModel
    "LDA mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct fLDA <: TopicModel
    "fLDA mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct CTM <: TopicModel
    "CTM mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct fCTM <: TopicModel
    "fCTM mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct CTPF <: TopicModel
    "CTPF mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct gpuLDA <: TopicModel
    "gpuLDA mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct gpuCTM <: TopicModel
    "gpuCTM mutable struct."

    corpus::Corpus
    K::Int
    ...

mutable struct gpuCTPF <: TopicModel
    "gpuCTPF mutable struct."

    corpus::Corpus
    K::Int
    ...

Document/Corpus Functions

function check_doc(doc::Document)
    "Check Document parameters."

function check_corp(corp::Corpus)
    "Check Corpus parameters."

function readcorp(;docfile::String="", vocabfile::String="", userfile::String="", titlefile::String="", delim::Char=',', counts::Bool=false, readers::Bool=false, ratings::Bool=false)
    "Load a Corpus object from text file(s)."

    ## readcorp(:nsf)   - National Science Foundation Corpus.
    ## readcorp(:citeu) - CiteULike Corpus.

function writecorp(corp::Corpus; docfile::String="", vocabfile::String="", userfile::String="", titlefile::String="", delim::Char=',', counts::Bool=false, readers::Bool=false, ratings::Bool=false)
    "Write a corpus."

function abridge_corp!(corp::Corpus, n::Integer=0)
    "All terms which appear less than n times in the corpus are removed from all documents."

function alphabetize_corp!(corp::Corpus; vocab::Bool=true, users::Bool=true)
    "Alphabetize vocab and/or user dictionaries."

function remove_terms!(corp::Corpus; terms::Vector{String}=[])
    "Vocab keys for specified terms are removed from all documents."

function compact_corp!(corp::Corpus; vocab::Bool=true, users::Bool=true)
    "Relabel vocab and/or user keys so that they form a unit range."

function condense_corp!(corp::Corpus)
    "Ignore term order in documents."
    "Multiple separate occurrences of terms are stacked and their associated counts increased."

function pad_corp!(corp::Corpus; vocab::Bool=true, users::Bool=true)
    "Enter generic values for vocab and/or user keys which appear in documents but not in the vocab/user dictionaries."

function remove_empty_docs!(corp::Corpus)
    "Documents with no terms are removed from the corpus."

function remove_redundant!(corp::Corpus; vocab::Bool=true, users::Bool=true)
    "Remove vocab and/or user keys which map to redundant values."
    "Reassign Document term and/or reader keys."

function stop_corp!(corp::Corpus)
    "Filter stop words in the associated corpus."

function trim_corp!(corp::Corpus; vocab::Bool=true, users::Bool=true)
    "Those keys which appear in the corpus vocab and/or user dictionaries but not in any of the documents are removed from the corpus."

function trim_docs!(corp::Corpus; terms::Bool=true, readers::Bool=true)
    "Those vocab and/or user keys which appear in documents but not in the corpus dictionaries are removed from the documents."

function fixcorp!(corp::Corpus; vocab::Bool=true, users::Bool=true, abridge::Integer=0, alphabetize::Bool=false, condense::Bool=false, pad::Bool=false, remove_empty_docs::Bool=false, remove_redundant::Bool=false, remove_terms::Vector{String}=String[], stop::Bool=false, trim::Bool=false)
    "Generic function to ensure that a Corpus object can be loaded into a TopicModel object."
    "Either pad_corp! or trim_docs!."
    "compact_corp!."
    "Contains other optional keyword arguments."

function showdocs(corp::Corpus, docs / doc_indices)
    "Display document(s) in readable format."

function showtitles(corp::Corpus, docs / doc_indices)
    "Display document title(s) in readable format."

function getvocab(corp::Corpus)

function getusers(corp::Corpus)

Model Functions

function showdocs(model::TopicModel, docs / doc_indices)
    "Display document(s) in readable format."

function showtitles(model::TopicModel, docs / doc_indices)
    "Display document title(s) in readable format."

function check_model(model::TopicModel)
    "Check model parameters."

function train!(model::TopicModel; iter::Integer=150, tol::Real=1.0, niter::Integer=1000, ntol::Real=1/model.K^2, viter::Integer=10, vtol::Real=1/model.K^2, checkelbo::Union{Integer, Inf}=1, printelbo::Bool=true)
    "Train TopicModel."

    ## 'iter'      - maximum number of iterations through the corpus.
    ## 'tol'       - absolute tolerance for ∆elbo as a stopping criterion.
    ## 'niter'     - maximum number of iterations for Newton's and interior-point Newton's methods. (not included for CTPF and gpuCTPF models.)
    ## 'ntol'      - tolerance for change in function value as a stopping criterion for Newton's and interior-point Newton's methods. (not included for CTPF and gpuCTPF models.)
    ## 'viter'     - maximum number of iterations for optimizing variational parameters (at the document level).
    ## 'vtol'      - tolerance for change in variational parameter values as a stopping criterion.
    ## 'checkelbo' - number of iterations between ∆elbo checks (for both evaluation and convergence of the evidence lower-bound).
    ## 'printelbo' - if true, print ∆elbo to REPL.

@gpu train!
    "Train model on GPU."

function gendoc(model::TopicModel, laplace_smooth::Real=0.0)
    "Generate a generic document from model parameters by running the associated graphical model as a generative process."

function gencorp(model::TopicModel, M::Integer, laplace_smooth::Real=0.0)
    "Generate a generic corpus of size M from model parameters."

function showtopics(model::TopicModel, V::Integer=15; topics::Union{Integer, Vector{<:Integer}, UnitRange{<:Integer}}=1:model.K, cols::Integer=4)
    "Display the top V words for each topic in topics."

function showlibs(model::Union{CTPF, gpuCTPF}, users::Union{Integer, Vector{<:Integer}, UnitRange{<:Integer}})
    "Show the document(s) in a user's library."

function showdrecs(model::Union{CTPF, gpuCTPF}, docs::Union{Integer, Vector{<:Integer}, UnitRange{<:Integer}}, U::Integer=16; cols=4)
    "Show the top U user recommendations for a document(s)."

function showurecs(model::Union{CTPF, gpuCTPF}, users::Union{Integer, Vector{<:Integer}, UnitRange{<:Integer}}, M::Integer=10; cols=1)
    "Show the top M document recommendations for a user(s)."

function predict(corp::Corpus, train_model::Union{LDA, gpuLDA, fLDA, CTM, gpuCTM, fCTM}; iter::Integer=10, tol::Real=1/train_model.K^2, niter::Integer=1000, ntol::Real=1/train_model.K^2)
    "Predict topic distributions for corpus of documents based on trained LDA or CTM model."

function topicdist(model::TopicModel, doc_indices::Union{Integer, Vector{<:Integer}, UnitRange{<:Integer}})
    "Get TopicModel topic distributions for document(s) as a probability vector."

Bibliography

  1. Latent Dirichlet Allocation (2003); Blei, Ng, Jordan. pdf
  2. Filtered Latent Dirichlet Allocation: Variational Algorithm (2016); Proffitt. pdf
  3. Correlated Topic Models (2006); Blei, Lafferty. pdf
  4. Content-based Recommendations with Poisson Factorization (2014); Gopalan, Charlin, Blei. pdf
  5. Numerical Optimization (2006); Nocedal, Wright. Amazon
  6. Machine Learning: A Probabilistic Perspective (2012); Murphy. Amazon
  7. OpenCL in Action: How to Accelerate Graphics and Computation (2011); Scarpino. Amazon


Source: https://github.com/ericproffitt/TopicModelsVB.jl
