6 April 2015 | Draft
Mapping of Co-occurrences of Document Citations in Laetus-in-Praesens
Eliciting a coherent overview from a large set of interrelated documents
Over 2,000 articles and other documents are accessible on this Laetus-in-Praesens site, some dating back to the 1960s. Many of the documents cite other documents on the site, resulting in a total of over 1,500 links from one to another. The documents cover a wide range of topics and have been variously grouped under some 30 specific subjects.
The question of interest is the nature of any coherence characterizing this range of writings. One approach is to take advantage of the analysis of these documents for such links in the process of importing segmented versions of the documents into the parallel Kairos content management system (using Drupal).
A more promising alternative is to analyze patterns of co-occurrence of citations and use these as a basis for generating maps with an integrative bias. The procedure is described below. The set of maps is accessible via any one of them. Two sets of maps have been created:
The second set of maps applies the same technique to the authors of the many off-site documents cited within the documents on the Laetus-in-Praesens site. The aim is to elicit the patterns of most-cited authors, namely the co-occurrences of authors in documents on that site. This offers the possibility of recognizing those authors most influential in the writings on the Laetus site.
The purpose of this page is to offer access to ongoing experiments using this technique, with indication of constraints. A notable constraint is the technical limitation of the author in exploiting the possibilities. A further interest is to compare the results with previous exercises using SVG. Many of these are accessible via a separate page, arising partly from work on the Encyclopedia of World Problems and Human Potential. The following experiments were enabled by Tomáš Fülöpp.
It should be emphasized that these mapping exercises are experimental and primarily undertaken to respond to the question of the nature of any coherence underlying the array of documents on a wide range of variously interconnected subjects -- as well as the degree of implicit bias of the author. The fundamental preoccupation here is the possibility of eliciting any sense of integration amongst the wide range of topics in the light of the manner in which the documents link to each other or cite a relatively limited set of references by other authors. The extension of the approach to explicit topics is currently a secondary concern.
Given the relatively large number of items, exploring these interactive visualizations through a browser requires some attention to the degree of zoom (using the browser's zoom facility) appropriate in any given case, especially given the manner in which the viewing window may be affected by zooming.
Visualization based on co-occurrences between authors of external references: Here the data set is based on the names of authors -- typically of books cited in the references in documents on the Laetus-in-Praesens website. The approach was to identify co-occurrences of authors cited in a document and to focus on those co-occurrences of highest frequency. A total of 311,278 author pairs (from a total of 576 documents with sets of references) constituted the pool from which multiple co-occurrences were selected (naturally excluding those with a single occurrence). Expressed otherwise, the purpose was to select the pairs of authors to be found most frequently together in any single document. This gives a sense of the set of authors most influential in the elaboration of the documents as a set over the years. Author names are given in popups on mouseover (with the number of internal documents in which the author is referenced given in parentheses).
Visualization based on co-occurrences between references internal to the website: Here the data set is based on the names of documents on the Laetus-in-Praesens website -- namely the specific links included in one document to another on the website. The approach was to identify co-occurrences of such document citations and to focus on those co-occurrences of highest frequency. A total of 126,249 document pairs (from a total of 986 documents citing others) constituted the pool from which multiple co-occurrences were selected, naturally excluding those with a single occurrence. Expressed otherwise, the purpose was to select the pairs of documents to be found most frequently together in any single document. This gives a sense of the core interconnectivity of the set of documents -- namely the core preoccupation of the website. Document names are given in popups on mouseover (with the number of associated internal references in parentheses).
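The counting of co-occurrences described above can be illustrated with a minimal Python sketch. It is not the code used to produce the maps: the function names, the dictionary layout and the toy data are assumptions made purely for illustration. Each citing document is reduced to the list of items it cites (documents or authors), every unordered pair within that list is counted once per citing document, and pairs with a single occurrence are then dropped.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(citations):
    """Count, for each unordered pair of cited items, how many citing documents contain both.

    `citations` maps a citing document to the list of items it cites (hypothetical layout).
    """
    counts = Counter()
    for cited in citations.values():
        # sorted(set(...)) drops repeated citations and gives each pair one canonical order
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


def multiple_cooccurrences(counts, minimum=2):
    """Keep only the pairs found together in at least `minimum` citing documents."""
    return {pair: n for pair, n in counts.items() if n >= minimum}


# Toy data, invented for illustration only
citations = {
    "doc1": ["A", "B", "C"],
    "doc2": ["A", "B"],
    "doc3": ["B", "C", "D"],
}
counts = cooccurrence_counts(citations)
frequent = multiple_cooccurrences(counts)
print(frequent)  # {('A', 'B'): 2, ('B', 'C'): 2}
```

Because every citing document contributes all pairs from its reference list, the number of pairs grows roughly with the square of the list length, which is consistent with a pool far larger than the number of citing documents.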
The results are interesting as a matter of curiosity and technical feasibility, but they are far from useful, especially given the number of documents.
Between documents: The subsequent approach taken has been to assume that eliciting an overview of the core concerns of the Laetus-in-Praesens site could be achieved by discovering which documents were cited most frequently together in documents of that site, namely patterns of co-occurrences. The focus was therefore not on the citing documents but on those cited.
The first step was to take document pairs and identify the documents citing those pairs. This gave a set of 126,249 pairs. Whilst a very high proportion of these were naturally only associated with a single document, others are indicated below.
The choice was then made to isolate triplets of pairs, namely three pairs constituting a triangle. The issue was then how to filter triplets so as to present those pairs-forming-triangles that reflect a higher proportion of co-occurrences.
The maps of this kind reflect the result, using filtration criteria up to 27 pairs -- one map for each filter criterion. The configuration of small circles in each case corresponds to the pool of documents meeting the particular filtration criterion -- within which the triangles were identified as reflecting the highest degree of co-occurrence.
Triangles were identified by sorting the pool of documents (for a given map) in descending order of the number of pairs. Starting with that of the highest frequency (A), the issue was to isolate two pairs (A-B and A-C) associated with the next lowest frequency. A procedure was then used to ensure that the remaining/resulting pair (B-C) was of the highest frequency. The process was designed to ensure that all three points (A, B, and C) constituted a unique selection from the pool, meaning that as triangles were selected the points were excluded from the process of selecting further triangles.
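One possible reading of this selection procedure is sketched below in Python. It is an interpretation, not the procedure actually used: the function name, the use of "number of pairs" as the ranking key, and the tie-breaking rules are all assumptions. Documents are ranked by the number of pairs in which they appear; for each unused document A, the sketch looks for two unused partners B and C such that the closing pair B-C exists and has the highest frequency available, and then removes all three points from further consideration.

```python
from itertools import combinations


def select_triangles(pair_counts):
    """Greedily pick triangles that share no points (hypothetical interpretation).

    `pair_counts` maps a tuple (x, y) to the number of documents citing both x and y.
    """
    # neighbours[x][y] = co-occurrence count of x and y
    neighbours = {}
    for (x, y), n in pair_counts.items():
        neighbours.setdefault(x, {})[y] = n
        neighbours.setdefault(y, {})[x] = n

    # Rank documents by number of pairs, descending; ties broken by name for determinism
    ranked = sorted(neighbours, key=lambda d: (-len(neighbours[d]), d))

    used = set()
    triangles = []
    for a in ranked:
        if a in used:
            continue
        free = sorted(d for d in neighbours[a] if d not in used)
        best = None
        for b, c in combinations(free, 2):
            bc = neighbours[b].get(c)
            if bc is None:
                continue  # B and C never co-occur, so the triangle cannot be closed
            # Prefer the strongest closing pair B-C, then the strongest A-B plus A-C
            score = (bc, neighbours[a][b] + neighbours[a][c])
            if best is None or score > best[0]:
                best = (score, (a, b, c))
        if best is not None:
            triangles.append(best[1])
            used.update(best[1])  # points are excluded from later triangles
    return triangles


# Toy pair frequencies, invented for illustration only
pair_counts = {
    ("A", "B"): 5, ("A", "C"): 4, ("B", "C"): 6,
    ("A", "D"): 2,
    ("D", "E"): 3, ("D", "F"): 3, ("E", "F"): 4,
}
triangles = select_triangles(pair_counts)
print(triangles)  # [('A', 'B', 'C'), ('D', 'E', 'F')]
```

A greedy pass of this kind depends on the order in which points are visited, so a different ranking or tie-break would in general yield a different set of triangles.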
The maps are presented as circles of documents linked by triangles. The name of the document is given for the point of each triangle.
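A layout of this kind can be sketched in a few lines of Python that write SVG markup directly. This is only an illustration under stated assumptions: the function names, sizes and styling are invented, and the maps on this site were not necessarily generated in this way. Points are spaced evenly around a circle, each triangle is drawn as a polygon between three of them, and each point carries a `<title>` child, which most browsers display as a tooltip on mouseover.

```python
import math
from xml.sax.saxutils import escape


def circle_layout(names, radius=200.0, centre=(250.0, 250.0)):
    """Place names evenly around a circle; returns {name: (x, y)}."""
    if not names:
        return {}
    cx, cy = centre
    step = 2 * math.pi / len(names)
    return {
        name: (cx + radius * math.cos(i * step), cy + radius * math.sin(i * step))
        for i, name in enumerate(names)
    }


def render_svg(names, triangles, size=500):
    """Return SVG markup: one small circle per name, one polygon per triangle."""
    pos = circle_layout(names, radius=size * 0.4, centre=(size / 2, size / 2))
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}" '
        f'viewBox="0 0 {size} {size}">'
    ]
    for tri in triangles:
        points = " ".join(f"{pos[n][0]:.1f},{pos[n][1]:.1f}" for n in tri)
        parts.append(f'<polygon points="{points}" fill="none" stroke="black"/>')
    for name, (x, y) in pos.items():
        # The <title> child provides the mouseover tooltip; escape() guards &, < and >
        parts.append(
            f'<circle cx="{x:.1f}" cy="{y:.1f}" r="4"><title>{escape(name)}</title></circle>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = render_svg(["A", "B", "C", "D"], [("A", "B", "C")])
```

The `viewBox` attribute lets the drawing scale with the width and height given to it, which is one way of reducing dependence on browser zooming.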
By moving from map to map, there is a sense of how the core preoccupation emerges from those maps based on the highest filter criteria (namely 27 rather than 10, for example). The documents configured in the later (simpler) maps offer a sense of the preoccupation of the author, whether this is conscious or simply a hidden bias, or a combination of both.
Given the interdisciplinary preoccupation of many documents, and of the site as a whole, it is of interest how any connectivity -- potentially of a higher order -- emerges from the exercise.
From documents: A similar exercise was undertaken to isolate which clusters of authors (of external texts) were cited most frequently together from documents on the Laetus-in-Praesens site.
The maps of this kind reflect the result, using filtration criteria up to 27 pairs -- one map for each filter criterion.
Legibility of the document names: This is clearly an issue, partially alleviated by the possibility of using the browser zooming feature. Experiments were undertaken using mouseover techniques to increase the font size of the document on which the cursor was placed. This feature has not been enabled in all browsers and tends anyway to give unpredictable results.
Resizing: Clearly use of the browser resizing facility enables adjustments to be made to what is seen. A compromise is necessary between an overview of the pattern and any ability to read the document names.
Missing names in a map: This is obviously a bug. It results from difficulties in isolating triangles without associating them with other triangles -- and is dependent on the particular filter criterion. Some variants of this bug have already been eliminated. Further improvements are part of work in progress.
Hotlinking from document names: Again this is a feature which should work, but proved problematic in experiments. It is a known bug on some browsers. If working, it would give direct access to documents.
SVG issues: There are presumably interesting possibilities for improving the displays with greater skills in scalable vector graphics than possessed by the author -- notably with respect to sizing the maps and zooming facilities (avoiding dependence on browser facilities). Again one difficulty is which features work on which browsers, on which versions of those browsers, and on which operating systems. It is noteworthy that older browsers/operating systems do not display the document names, for example.
SVG vs HTML: Earlier experiments in generating SVG documents served to highlight some of the issues of browser compatibility. A decision was therefore made to embed the SVG code within a more conventional HTML document.
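As a rough illustration of that decision, the following Python sketch wraps a fragment of SVG markup in a minimal HTML5 page, relying on the fact that HTML5 allows an `<svg>` element to appear inline in the body. The function name and the page skeleton are assumptions for illustration, not the template actually used for the maps.

```python
from html import escape


def wrap_svg_in_html(svg_markup, title):
    """Return a minimal HTML5 page with the given SVG markup embedded inline."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{svg_markup}\n"  # the SVG is inserted as-is, so it must already be well-formed
        "</body>\n"
        "</html>\n"
    )


# Placeholder drawing, invented for illustration only
svg_markup = '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100"><circle cx="50" cy="50" r="4"/></svg>'
page = wrap_svg_in_html(svg_markup, "Co-occurrence map (filter 27)")
```

Serving the result as an ordinary HTML page leaves sizing and zooming to the browser's normal handling of HTML.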
Final maps of citations between documents are presented separately -- following the experimental failures indicated below in seeking a useful mode of presentation. Also of interest in their own right, the screen shots below can be viewed at larger scale.
This work is licensed under a Creative Commons licence.