6 April 2015 | Draft

Mapping of Co-occurrences of Document Citations in Laetus-in-Praesens

Eliciting a coherent overview from a large set of interrelated documents



Introduction
Preliminary experiments in document relationship visualization
Comments relative to co-occurrences
Technical issues
Future developments under consideration
Preliminary test variants


Introduction

Over 2,000 articles and other documents are accessible on this Laetus-in-Praesens site, some dating back to the 1960s. Many of the documents cite other documents on the site, resulting in a total of over 1,500 links from one to another. The documents cover a wide range of topics and have been variously grouped under some 30 specific subjects.

The question of interest is the nature of any coherence characterizing this range of writings. One approach is to take advantage of the analysis of these documents for such links in the process of importing segmented versions of the documents into the parallel Kairos content management system (using Drupal).

The challenge is how to gain some sense of the interrelationships between the contents by exploiting the relatively new Data-Driven Documents library (D3.js, or just D3). This is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It makes use of the widely implemented SVG, HTML5, and CSS standards. The links can of course be represented together using the D3 approach. This was first done experimentally in a series of preliminary experiments noted below.

A more promising alternative is to analyze patterns of co-occurrence of citations and use these as a basis for generating maps with an integrative bias. The procedure is described below. The set of maps is accessible via any one of them. Two sets of maps have been created.

The first set of maps is based on co-occurrences of citations of documents on the Laetus-in-Praesens site within other documents on that site. The second set applies the same technique to the authors of the many off-site documents cited within the documents on the Laetus-in-Praesens site. The aim is to elicit the patterns of most-cited authors, namely the co-occurrences of authors in documents on that site. This offers the possibility of recognizing those authors most influential in the writings on the Laetus site.

The purpose of this page is to offer access to ongoing experiments using this technique, with indication of constraints. A notable constraint is the technical limitation of the author in exploiting the possibilities. A further interest is to compare the results with previous exercises using SVG. Many of these are accessible via a separate page, arising partly from work on the Encyclopedia of World Problems and Human Potential. The following experiments were enabled by Tomáš Fülöpp.

It should be emphasized that these mapping exercises are experimental and primarily undertaken to respond to the question of the nature of any coherence underlying the array of documents on a wide range of variously interconnected subjects -- as well as the degree of implicit bias of the author. The fundamental preoccupation here is the possibility of eliciting any sense of integration amongst the wide range of topics in the light of the manner in which the documents link to each other or cite a relatively limited set of references by other authors. The extension of the approach to explicit topics is currently a secondary concern.

Given the relatively large number of items, exploring these interactive visualizations through a browser requires some attention to the degree of zoom (using that browser facility) appropriate in any given case, especially given the manner in which the viewing window may be affected by zooming.

Preliminary experiments in document relationship visualization

Visualization based on co-occurrences between authors of external references: Here the data set is based on the names of authors -- typically of books cited in the references in documents on the Laetus-in-Praesens website. The approach was to identify co-occurrences of authors cited in a document and to focus on those co-occurrences of highest frequency. A total of 311,278 author pairs (from a total of 576 documents with sets of references) constituted the pool from which multiple co-occurrences were selected, naturally excluding those with a single occurrence. Expressed otherwise, the purpose was to select the pairs of authors to be found most frequently in any single document. This gives a sense of the set of authors most influential in the elaboration of the documents as a set over the years. Author names are given in popups on mouseover (with the number of internal documents in which the author is referenced given in parentheses).

Co-occurrences | Authors | Links | Lesser distance between nodes | Greater distance between nodes
>20 (161)      | 39      | 2370  | Test p20                      |
>15 (301)      | 53      | 2747  | Test p15                      |
>10 (601)      | 95      | 3476  | Test p10                      |
>5 (1844)      | 202     | 4560  | Test p05                      |
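
The counting of co-occurring pairs described above can be illustrated with a minimal Python sketch. It is not the procedure actually used for these maps: the function names, the dictionary layout and the toy author names are all hypothetical, and the only assumption carried over from the text is that a pair is counted once for each document citing both members.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(citations_by_doc):
    """Count how many documents cite each unordered pair of items together."""
    pair_counts = Counter()
    for cited in citations_by_doc.values():
        # De-duplicate and sort so that (a, b) and (b, a) share one key.
        for pair in combinations(sorted(set(cited)), 2):
            pair_counts[pair] += 1
    return pair_counts


def pairs_above(pair_counts, threshold):
    """Keep only the pairs co-occurring in more than `threshold` documents."""
    return {pair: n for pair, n in pair_counts.items() if n > threshold}


# Hypothetical toy data: document id -> authors cited in its references.
citations_by_doc = {
    "doc1": ["Author A", "Author B", "Author C"],
    "doc2": ["Author A", "Author B"],
    "doc3": ["Author B", "Author A", "Author D"],
}

counts = count_cooccurrences(citations_by_doc)
frequent = pairs_above(counts, 1)  # drop pairs found in a single document
print(frequent)
```

With real data the threshold would be raised (5, 10, 15, 20) to obtain progressively smaller pools of the kind listed in the table.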

Visualization based on co-occurrences between references internal to the website: Here the data set is based on the names of documents on the Laetus-in-Praesens website -- namely the specific links included in one document to another on the website. The approach was to identify co-occurrences of such document citations and to focus on those co-occurrences of highest frequency. A total of 126,249 document pairs (from a total of 986 documents citing others) constituted the pool from which multiple co-occurrences were selected, naturally excluding those with a single occurrence. Expressed otherwise, the purpose was to select the pairs of documents to be found most frequently in any single document. This gives a sense of the core interconnectivity of the set of documents -- namely the core preoccupation of the website. Document names are given in popups on mouseover (with the number of associated internal references in parentheses).

Co-occurrences | Docs. | Links | Lesser distance between nodes | Greater distance between nodes
>20 (388)      | 80    | 3182  | Test d20                      |
>15 (1123)     | 138   | 4909  | Test d15 (does not work)      |
>10 (3645)     | 232   | 7655  | Test 1                        | Test 2
>9 (4621)      | 264   | 8560  |                               |
>7 (7599)      | 319   | 10211 |                               |
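
Pairs above a given threshold can be converted into the node-and-link structure that force-directed layouts commonly consume. The Python sketch below is illustrative only: the `nodes`/`links` shape with index-based `source` and `target` fields is an assumption modelled on typical force-layout input, not the data format used for the tests listed here, and the function name and document names are hypothetical.

```python
import json


def to_node_link(pair_counts, threshold):
    """Build a nodes/links dictionary from pairs exceeding the threshold."""
    kept = [(a, b, n) for (a, b), n in sorted(pair_counts.items()) if n > threshold]
    # Every document appearing in a retained pair becomes one node.
    names = sorted({name for a, b, _ in kept for name in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    return {
        "nodes": [{"id": name} for name in names],
        "links": [
            {"source": index[a], "target": index[b], "value": n}
            for a, b, n in kept
        ],
    }


# Hypothetical toy pool: (document, document) -> number of co-occurrences.
pair_counts = {
    ("Doc A", "Doc B"): 12,
    ("Doc A", "Doc C"): 11,
    ("Doc B", "Doc D"): 3,
}

graph = to_node_link(pair_counts, 10)
print(json.dumps(graph, indent=2))
```

Raising the threshold shrinks both lists, which is one way of trading completeness for legibility.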
         

The results are interesting as a matter of curiosity and technical feasibility, but they are far from useful, especially given the number of documents.

Comments relative to co-occurrences

Between documents: The subsequent approach taken has been to assume that eliciting an overview of the core concerns of the Laetus-in-Praesens site could be achieved by discovering which documents were cited most frequently together in documents of that site, namely patterns of co-occurrences. The focus was not therefore on the citing documents but on those cited.

The first step was to take document pairs and identify the documents citing those pairs. This gave a set of 126,249 pairs. Whilst a very high proportion of these were naturally only associated with a single document, others are indicated below:

Co-occurrences | Document pairs | Co-occurrences | Document pairs | Co-occurrences | Document pairs | Co-occurrences | Document pairs
43 | 1  | 28 | 17  | 19 | 116 | 9     | 1301
38 | 1  | 27 | 20  | 18 | 132 | 8     | 1677
37 | 3  | 26 | 27  | 17 | 175 | 7     | 2272
36 | 1  | 25 | 35  | 16 | 213 | 6     | 3201
35 | 1  | 24 | 32  | 15 | 286 | 5     | 4578
33 | 6  | 23 | 49  | 14 | 405 | 4     | 7035
32 | 4  | 22 | 85  | 13 | 469 | 3     | 11936
31 | 7  | 21 | 71  | 12 | 586 | 2     | 25316
30 | 8  | 20 | 99  | 11 | 776 | 1     | 64312
29 | 20 | 19 | 116 | 10 | 976 | Total | 126249
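
As a consistency check, the distribution above can be summed directly. The short Python sketch below takes each co-occurrence value once (the value 19 is listed in two columns of the table) and reproduces the stated total of 126,249 pairs. It also gives the cumulative number of pairs above a threshold, for example 1123 pairs above 15 and 3645 above 10, which agree with the parenthesized figures in the earlier table of document co-occurrences. The variable and function names are illustrative.

```python
# Frequency table from the text: co-occurrence count -> number of document pairs.
distribution = {
    43: 1, 38: 1, 37: 3, 36: 1, 35: 1, 33: 6, 32: 4, 31: 7, 30: 8, 29: 20,
    28: 17, 27: 20, 26: 27, 25: 35, 24: 32, 23: 49, 22: 85, 21: 71, 20: 99,
    19: 116, 18: 132, 17: 175, 16: 213, 15: 286, 14: 405, 13: 469, 12: 586,
    11: 776, 10: 976, 9: 1301, 8: 1677, 7: 2272, 6: 3201, 5: 4578, 4: 7035,
    3: 11936, 2: 25316, 1: 64312,
}


def pairs_exceeding(threshold):
    """Number of document pairs with more than `threshold` co-occurrences."""
    return sum(n for k, n in distribution.items() if k > threshold)


total = sum(distribution.values())  # 126249
single_share = distribution[1] / total  # share of pairs found in one document only

print(total)
for threshold in (20, 15, 10, 9, 7):
    print(threshold, pairs_exceeding(threshold))
print(f"{single_share:.1%}")
```

Roughly half of all pairs occur in a single document only, which is why they are excluded before any mapping.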

The choice was then made to isolate triplets of pairs, namely three pairs constituting a triangle. The issue was then how to filter the triplets so as to present those pairs-forming-triangles reflecting a higher proportion of co-occurrences.

The maps of this kind reflect the result, using filtration criteria up to 27 pairs -- one map for each filter criterion. The configuration of small circles in each case corresponds to the pool of documents meeting the particular filtration criterion -- within which the triangles were identified as reflecting the highest degree of co-occurrence.

Triangles were identified by sorting the pool of documents (for a given map) in descending order of the number of pairs. Starting with that of the highest frequency (A), the issue was to isolate two pairs (A-B and A-C) associated with the next lowest frequency. A procedure was then used to ensure that the remaining/resulting pair (B-C) was of the highest frequency. The process was designed to ensure that all three points (A, B, and C) constituted a unique selection from the pool, meaning that as triangles were selected the points were excluded from the process of selecting further triangles.
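
One possible reading of this selection procedure is sketched below in Python. It is an interpretation rather than the code used for the maps: it assumes that documents are ranked by the number of pooled pairs in which they take part, that the closing pair B-C must itself be in the pool, and that ties are broken by the combined strength of A-B and A-C. The function name and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def select_triangles(pair_counts):
    """Greedily pick triangles that share no points from a pool of pairs."""
    weight = {frozenset(pair): n for pair, n in pair_counts.items()}
    neighbours = defaultdict(set)
    for a, b in pair_counts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Assumption: rank documents by the number of pairs they take part in.
    ranked = sorted(neighbours, key=lambda d: (-len(neighbours[d]), d))
    used, triangles = set(), []
    for a in ranked:
        if a in used:
            continue
        free = sorted(neighbours[a] - used)
        # Candidate points B and C must themselves form a pair in the pool.
        candidates = [
            (b, c) for b, c in combinations(free, 2)
            if frozenset((b, c)) in weight
        ]
        if not candidates:
            continue
        # Prefer the strongest closing pair B-C, then the strongest A-B plus A-C.
        b, c = max(candidates, key=lambda bc: (
            weight[frozenset(bc)],
            weight[frozenset((a, bc[0]))] + weight[frozenset((a, bc[1]))],
        ))
        triangles.append((a, b, c))
        used.update((a, b, c))  # exclude these points from later triangles
    return triangles


# Hypothetical toy pool: (document, document) -> number of co-occurrences.
pool = {
    ("A", "B"): 30, ("A", "C"): 28, ("B", "C"): 27,
    ("D", "E"): 12, ("D", "F"): 11, ("E", "F"): 10,
    ("C", "D"): 5,
}
triangles = select_triangles(pool)
print(triangles)
```

Because each point is consumed by the first triangle that claims it, the result depends on the ranking order, and a document left without two free neighbours forming a pair yields no triangle.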

The maps are presented as circles of documents linked by triangles. The name of the document is given for the point of each triangle.

By moving from map to map, there is a sense of how the core preoccupation emerges from those maps based on the highest filter criteria (namely 27 rather than 10, for example). The documents configured in the later (simpler) maps offer a sense of the preoccupation of the author, whether this is conscious or simply a hidden bias, or a combination of both.

Given the interdisciplinary preoccupation of many documents, and of the site as a whole, of interest is how any connectivity -- potentially of a higher order -- emerges from the exercise.

From documents: A similar exercise was undertaken to isolate which clusters of authors (of external texts) were cited most frequently together from documents on the Laetus-in-Praesens site.

The maps of this kind reflect the result, using filtration criteria up to 27 pairs -- one map for each filter criterion.

Technical issues

Legibility of the document names: This is clearly an issue, partially alleviated by the possibility of using the browser zooming feature. Experiments were undertaken using mouseover techniques to increase the font size of the document on which the cursor was placed. This feature has not been enabled in all browsers and tends anyway to give unpredictable results.

Resizing: Clearly use of the browser resizing facility enables adjustments to be made to what is seen. A compromise is necessary between an overview of the pattern and any ability to read the document names.

Missing names in a map: This is obviously a bug. It results from difficulties in isolating triangles without associating them with other triangles -- and is dependent on the particular filter criteria. Some variants of this bug have already been eliminated. Further improvements are part of work in progress.

Hotlinking from document names: Again this is a feature which should work, but proved problematic in experiments. It is a known bug on some browsers. If working, it would allow direct access to documents.

SVG issues: There are presumably interesting possibilities for improving the displays with greater skills in scalable vector graphics than possessed by the author -- notably with respect to sizing the maps and zooming facilities (avoiding dependence on browser facilities). Again one difficulty is which features work on which browsers, on which versions of those browsers, and on which operating systems. It is noteworthy that older browsers/operating systems do not display the document names, for example.

SVG vs HTML: Earlier experiments in generating SVG documents served to highlight some of the issues of browser compatibility. A decision was therefore made to embed the SVG code within a more conventional HTML document.

Future developments under consideration

Possibilities include:

  • Additional information: Clearly there is a case for indicating the number of triangles identified within each selection, rather than (or in addition to) the minimum number of links.
  • Greater range of maps: Clearly it is possible to extend the selection to fewer than 10 co-occurrences.
  • Presentation: Clearly much more attention could be given to the presentation of the maps and enabling users to modify those presentations. Possibilities include:
    • Line thickness and colour
    • Font size
    • Appending other information to document names (year, etc)
  • Additional configurations (square, pentagram, hexagram, etc): Whether as an alternative or superimposed, the pattern of triangles could be interlinked by indicating square, or other configurations between the points associated with different triangles. This would serve to interlink the currently isolated triangles, increasing the integrity of the pattern as a whole.
  • Configuration of polygons on polyhedra: Clearly triangles, squares and pentagrams identified by this process could be "transferred" onto polyhedral configurations to offer an even greater sense of integration.

Preliminary test variants

Final maps of citations between documents are presented separately -- following the experimental failures indicated below in seeking a useful mode of presentation. Also of interest in their own right, the screen shots below can be viewed at larger scale.

[Screenshots of nine preliminary test variants: Mapping of inter-document citations on a website]
[Screenshot: Interesting image resulting from a programming error (if only this corresponded to the integrity of the pattern of documents on the site!)]

This work is licensed under a Creative Commons licence.