Conversation Map/Warren Sack

The Conversation Map system is a Usenet newsgroup browser that analyzes the text of an archive of newsgroup messages and outputs a graphical interface that can be used to search and read the messages of the archive. The system incorporates a series of novel text analysis procedures that automatically compute (1) a set of social networks detailing who is responding to and/or citing whom in the newsgroup; (2) a set of “discussion themes” that are frequently used in the newsgroup archive; and (3) a set of semantic networks that represent the main terms under discussion and some of their relationships to one another. The text analysis procedures are written in the Perl programming language. Their results are recorded as HTML, and the HTML is displayed with a Java applet. With the Java-based graphical interface one can browse a set of Usenet newsgroup articles according to who is “talking” to whom, what they are “talking” about, and the central terms and possible emergent metaphors of the conversation. This paper argues that the Conversation Map system is one example of a new kind of content-based browser that combines the analytic power of computational linguistics with a graphical interface, allowing network documents and messages to be viewed in ways not possible with today's format-based browsers, which do not analyze the contents of the documents or messages.

1. INTRODUCTION

Recent advances in computational linguistics and quantitative sociology make it possible to envision new designs for existing, network-based browsers and clients (e.g., web browsers, news readers, email clients, etc.). These new content-based browsers and clients will treat the contents of the messages and documents displayed and not just their formats. Roughly speaking, these new designs will incorporate the functionality of existing browsers together with text analysis and information retrieval capabilities more sophisticated than those now used in, for example, web-based search engines.

This paper describes the design of a prototype Usenet newsgroup browser, Conversation Map. The Conversation Map system employs a set of text analysis procedures to produce a graphical interface. With the graphical interface one can browse a set of Usenet newsgroup articles according to who is “talking” to whom, what they are “talking” about, and the central terms and possible emergent metaphors of the conversation. To allow this combination of social and semantic navigation [5] the Conversation Map system computes a social network (cf., [24]) corresponding to who is replying to (or citing) whose messages. The Conversation Map system also parses and analyzes the contents of the newsgroup articles to calculate a semantic network (cf., [15]) that highlights frequently used terms that are similar to one another in the Usenet newsgroup discussion. For example, if the discussion includes messages concerning “time” and other messages concerning “money” and these two terms (“time” and “money”) are used in similar ways by the discussants (e.g., “You're wasting my time,” “You're wasting my money,” “You need to budget your time,” “You need to budget your money”) then the two terms will show up close to one another in the graphically displayed semantic network and so indicate the presence of a literal or metaphorical similarity between the terms (e.g., “Time is money”). In addition, the Conversation Map system analyzes connections between messages to extract an approximation of the discussion themes shared between newsgroup participants.

The output of the text analysis procedures is automatically translated into interface devices that allow one to browse the Usenet newsgroup articles in ways that would be impossible with a conventional, “format-based” news reader (e.g., RN, Eudora, or Netscape). One of the purposes of this research is to produce a better Usenet newsgroup browser for newsgroup participants and others who might like a quick way of discovering the terms and social structure of a newsgroup (e.g., sociologists and anthropologists of on-line text and social activity). The text analysis procedures are implemented in the Perl programming language and the graphical interface is programmed in Java. The example of the Conversation Map interface to be discussed in this paper can be found here: http://www.media.mit.edu/~wsack/CM. Viewing the example Java interface requires a newer web browser (e.g., Netscape >= version 4.5) and an operating system that supports Java 1.2 (e.g., Windows or Linux).

2. THE GRAPHICAL INTERFACE

The image shown below was produced by the Conversation Map system after an analysis of over 1200 messages from the Usenet newsgroup soc.culture.albanian, a group devoted to the discussion of Albanian culture in general but, during the period analyzed (16 April 1999 - 4 May 1999), especially to the war in Kosovo. The following explanations of the interface will use images from the analysis of this newsgroup as an example. However, it should be clear that this is only one example. The Conversation Map system can be run on the message archive of any newsgroup concerning any topic and will produce a unique interface image for each and every newsgroup archive.

2.1 Social Networks

By automatically identifying who has either responded to and/or quoted from whom, the Conversation Map system calculates a social network given an archive of Usenet newsgroup messages. The nodes in the network represent people -- i.e., participants in the online discussion -- and the links represent reciprocating quotations and/or responses. Thus, if participant A responds to or quotes a message from participant B and then, later in the discussion, participant B quotes from or responds to a message from participant A, a link is drawn between nodes labeled “A” and “B.” In the calculated social networks, if A and B have reciprocated frequently, the link between them will be shorter than if they have only quoted from or responded to one another once or twice. By positioning the mouse over the social networks panel and then pushing the right mouse button, the names labeling the nodes of the social network can be turned off.
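The reciprocity rule described above can be illustrated with a minimal sketch. It is not the system's Perl implementation: the function names, the choice of the weaker direction as the link strength, and the inverse-length formula are all assumptions made for illustration.

```python
from collections import Counter


def reciprocal_links(interactions):
    """Return {frozenset({a, b}): strength} for reciprocated pairs.

    `interactions` is an iterable of (source, target) pairs, meaning that
    `source` replied to or quoted `target`. A link exists only if the pair
    interacted in both directions. Taking the weaker direction as the
    strength is an assumption, not something the paper specifies.
    """
    directed = Counter((s, t) for s, t in interactions if s != t)
    links = {}
    for (s, t), count in directed.items():
        back = directed.get((t, s), 0)
        if back:
            links[frozenset((s, t))] = min(count, back)
    return links


def link_length(strength, base=100.0):
    """Hypothetical display rule: more reciprocation gives a shorter link."""
    return base / strength


if __name__ == "__main__":
    log = [("A", "B"), ("B", "A"), ("A", "B"), ("B", "A"), ("C", "A")]
    for pair, strength in reciprocal_links(log).items():
        print(sorted(pair), strength, link_length(strength))
```

In this toy log, C replied to A but A never answered C, so no link is drawn between them; only the A-B pair, which reciprocated, appears in the network.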

With the names off, it becomes easier to see that some participants are central to the newsgroup discussion and others are more marginal. The nodes with many connections represent participants who are both responding to and being responded to by many other participants. In other words, reciprocity is highlighted in the computed social networks. The layout algorithm used tends to push the central participants to the center. By simultaneously holding down the Shift key and the mouse button one can drag the nodes of the social networks around and get a better feel for the connectivity of various portions of the networks.

If one clicks the mouse button over one of the nodes in the networks, a small portion of a network is highlighted and the rest of the social networks disappear. The node selected (representing one participant in the newsgroup) and all the nodes linked to it are highlighted. At the same time, every thread in the archive in which the selected participant posted one or more messages is highlighted with a light gray oval.

By holding down the Control key and simultaneously clicking the mouse button, a second participant in the social networks can be selected. The edge between the two selected participants is highlighted, the threads where the two exchanged messages (and/or citations) are highlighted (in the case shown below, only one thread is highlighted), and, also, the discussion themes apropos of the messages exchanged by the pair are highlighted in the themes menu (in this case, two themes are visibly highlighted: the posters sent messages and/or quoted one another on the subject of the North Atlantic Treaty Organization (NATO) and the subject of war).

2.2 Discussion Themes

If participant A mentioned the word “baseball” in a post that also quoted a part of a message from participant B wherein B wrote about the term “football,” and then, later in the conversation participant B wrote about basketball in response to a message by A concerning soccer, then the link between A and B in the social network might be labeled with the term “sports” since baseball, football, soccer, and basketball are all sports. An analysis of discussion themes of this sort is done by the Conversation Map system.
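The sports example can be sketched as a lookup in a small hand-built thesaurus. The `BROADER` table below is a hypothetical stand-in for a real lexical database, and choosing the most specific shared concept as the tie label is an assumption made for illustration.

```python
# Toy thesaurus: each term points to one broader term (an assumption;
# a real lexical database has many senses and relations per word).
BROADER = {
    "baseball": "sports",
    "football": "sports",
    "basketball": "sports",
    "soccer": "sports",
    "sports": "activity",
}


def ancestors(term, broader=BROADER):
    """Return the term followed by every broader term reachable from it."""
    chain = [term]
    while term in broader and broader[term] not in chain:
        term = broader[term]
        chain.append(term)
    return chain


def lexical_ties(terms_a, terms_b, broader=BROADER):
    """Label each related pair of terms with their most specific shared concept."""
    ties = set()
    for a in terms_a:
        chain_a = ancestors(a, broader)
        for b in terms_b:
            chain_b = set(ancestors(b, broader))
            shared = [concept for concept in chain_a if concept in chain_b]
            if shared:
                ties.add(shared[0])
    return ties


if __name__ == "__main__":
    # A's post mentions baseball and quotes B's message about football.
    print(lexical_ties({"baseball"}, {"football"}))
```

Because the toy table stops at “activity,” unrelated terms share nothing. In a full thesaurus nearly all nouns meet at some very general root, so a practical version would need a limit on how far up the hierarchy it searches.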

A parenthetical note on “discussion themes”: Strictly speaking -- i.e., according to the terminology of linguistics -- the Conversation Map system does not identify discussion themes per se, but, rather, performs an analysis of lexical cohesion. Performing an analysis of lexical cohesion is only one step of many that would be required if -- within linguistics -- it were to be claimed that the Conversation Map system identified discussion themes. However, since an analysis of lexical cohesion is a necessary step in the determination of discussion themes, we will call the analysis an analysis of discussion themes for the sake of simplicity.

In the interface, the results of the discussion theme analysis are displayed as a menu of themes. When one clicks on the menu item “sports” the link between A and B is highlighted (along with the links between any other pairs of posters who are connected through a discussion of sports). We refer to this combination of the social network and a discussion themes analysis as an analysis of social cohesion [19]. Following is a picture of the same social network shown in the previous figure along with the menu of discussion themes that link messages, and thus, people together in conversation about the larger topic of Kosovo and Albanian culture in general. The “NATO” item in the themes menu has been highlighted by clicking on it with the mouse. The figure shows which pairs of posters have exchanged messages concerning NATO. Again, the unhighlighted portions of the social networks disappear from view and the portions of the archive where NATO connects two or more messages together are highlighted in the lower portion of the interface.

Note that only two pairs of posters seem to have exchanged messages about NATO, but many threads in the archive use NATO as a lexical tie between messages. It is probably not the case that the four participants highlighted in the social networks are responsible for all of the threads concerning NATO. Rather, it must be kept in mind that a pair of posters is highlighted if and only if they have a two-way, back-and-forth exchange involving a given theme while, in contrast, the criterion for highlighting a thread in the archive is less rigorous: a thread in the archive is highlighted for a given theme if the theme connects even one pair of messages in the thread.

Themes in the menu are listed according to the number of pairs of participants they connect in the social network. Thus “United States” is listed above “NATO” because “NATO” links only two pairs of posters while “United States” links three pairs. All of the themes down to “war; state of war; warfare” link two pairs; “America; the Americas” links one pair as do the rest of the following items in the menu.
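The two-way criterion for linking a pair of posters, and the ordering of the themes menu, can be sketched as follows. The input format (a list of replier, original poster, and the themes tying their two messages) and the function names are assumptions for illustration, not the system's actual data structures.

```python
from collections import Counter, defaultdict


def pair_themes(exchanges):
    """Return {frozenset({a, b}): themes} for themes used in both directions.

    `exchanges` is an iterable of (replier, original_poster, themes) tuples,
    one per reply or quotation, where `themes` are the lexical ties found
    between the two messages (hypothetical input format).
    """
    directed = defaultdict(set)
    for replier, poster, themes in exchanges:
        directed[(replier, poster)] |= set(themes)
    linked = {}
    for (a, b), themes in directed.items():
        shared = themes & directed.get((b, a), set())
        if shared:
            linked[frozenset((a, b))] = shared
    return linked


def theme_menu(linked):
    """Order themes by the number of poster pairs they connect."""
    counts = Counter(t for themes in linked.values() for t in themes)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    exchanges = [
        ("A", "B", {"NATO", "war"}),
        ("B", "A", {"NATO", "war"}),
        ("C", "D", {"NATO"}),
        ("D", "C", {"NATO", "United States"}),
        ("E", "A", {"war"}),
    ]
    print(theme_menu(pair_themes(exchanges)))
```

Here “NATO” connects two pairs and “war” one; “United States” and E's one-way reply to A connect none, because no back-and-forth exchange involves them.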

Clicking on a theme is equivalent to searching the message threads, but the search performed differs from a conventional keyword search. A keyword search would find, for instance, every mention of the term “NATO.” In contrast, the theme search criteria are more rigorous. The theme search criteria are only fulfilled if, for instance, “NATO” is mentioned in one message of the thread and then again in a response or quoting message later in the thread.
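The difference between the two kinds of search can be shown with a small sketch. The thread representation (a dictionary of message ids, each with a `reply_to` field and a `text` field) is hypothetical, and for brevity the sketch follows only reply links and matches only the literal term, whereas the system also follows quotations and uses a thesaurus.

```python
import re


def words(text):
    """Lower-cased alphabetic tokens of a message."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_match(thread, term):
    """True if any message in the thread mentions the term."""
    term = term.lower()
    return any(term in words(msg["text"]) for msg in thread.values())


def theme_match(thread, term):
    """True only if the term appears in a message and again in a reply to it."""
    term = term.lower()
    for msg in thread.values():
        parent = thread.get(msg["reply_to"])
        if parent and term in words(msg["text"]) and term in words(parent["text"]):
            return True
    return False


if __name__ == "__main__":
    thread = {
        1: {"reply_to": None, "text": "NATO acted today."},
        2: {"reply_to": 1, "text": "I disagree about the weather."},
    }
    print(keyword_match(thread, "NATO"), theme_match(thread, "NATO"))
```

The example thread satisfies the keyword search but not the theme search, since the reply never takes up the term.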

2.3 The Messages

Threads in a newsgroup discussion consist of an initial message concerning some subject, a set of responses to the initial message, a set of responses to the responses, and so forth. Therefore, conceptually, a thread is a “tree” in which the initial message is the “root” and links between responses are the “branches” of the “tree.” Graphically, a thread tree can be plotted as a “spider web” in which the initial post is placed in the middle, the responses to the initial post are plotted in a circle around the initial post, the responses to the responses are plotted in a circle around the responses, etc. One of the nice features of plotting the thread trees as “spider webs” is that, at least in theory, any size tree can be plotted within a given amount of space.
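A minimal sketch of such a “spider web” (radial) layout follows. It assumes one common way of dividing the circle, giving each reply an angular wedge proportional to the number of leaf messages beneath it; the paper does not specify the system's exact rule, and the function and parameter names are hypothetical.

```python
import math


def radial_layout(children, root, ring=1.0):
    """Place the root at the center and each reply level on its own ring.

    `children` maps a message id to the list of ids that reply to it.
    Returns {message_id: (x, y)}.
    """
    def leaf_count(node):
        kids = children.get(node, [])
        return 1 if not kids else sum(leaf_count(k) for k in kids)

    positions = {}

    def place(node, depth, start, end):
        angle = (start + end) / 2.0
        radius = depth * ring
        positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
        kids = children.get(node, [])
        total = sum(leaf_count(k) for k in kids)
        cursor = start
        for kid in kids:
            span = (end - start) * leaf_count(kid) / total
            place(kid, depth + 1, cursor, cursor + span)
            cursor += span

    place(root, 0, 0.0, 2.0 * math.pi)
    return positions


if __name__ == "__main__":
    thread = {"root": ["a", "b"], "a": ["a1", "a2"]}
    for message, (x, y) in radial_layout(thread, "root").items():
        print(message, round(x, 2), round(y, 2))
```

Choosing `ring` as the available radius divided by the depth of the tree is one way to fit a tree of any size into a fixed space, at the cost of crowding when the tree is deep or wide.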

In the bottom half of the figure below, over 400 threads are plotted as spider webs constrained into rectangular (rather than circular) spaces. The threads are arranged chronologically from upper-left to lower-right. By passing the mouse over each thread, the start and end dates and the subject lines of each thread can be read in turn in light gray text written into the dark gray strip at the bottom of the interface.

Since each thread is allotted the same amount of screen space, a rough guide to newsgroup activity can be read off of the panel in which all of the threads are plotted. If a thread without many messages is plotted, the rectangle containing it in the panel appears as mostly black. Threads containing many messages, and thus a lot of activity, appear very green.

In the figure below, one thread from the archive has been selected with a mouse click. The thread selected has a white oval drawn around it. Note also that the dates when the messages of the thread were posted (27 April 1999 - 1 May 1999) and the subject line of the first message in the thread are printed in the dark gray strip at the bottom of the interface: “Re: Response to: European trouble from a bird eye.” In addition, parts of the social network, the themes menu, and the semantic network have also been highlighted. In the social network, those participants who are part of the social networks and who also have posted to the selected thread are highlighted. In the themes menu, those themes which connect two or more messages in the selected thread are highlighted. In the semantic network, those terms which correspond to the highlighted themes are also highlighted. The connection between the themes and the terms in the semantic network will be more fully explained in the section below devoted to the semantic network.

2.4 Message Threads

Normally, the nodes of a thread (representing messages in the thread) would be labeled with the names of the participants who posted them. In the figure above, however, the names have been turned off (using the right mouse button or Meta-click combination). In addition, some of the nodes of the thread have been moved around (by holding down the Shift key and dragging the mouse).

The spider web shape of the thread tree can be seen. If the thread were perfectly balanced (i.e., if each message had exactly the same number of responses as every other message), then the graphical plot of the thread would more closely resemble a symmetrical web. However, a symmetrical shape is more the exception than the rule. The initial message of the thread is plotted as the largest green node in the center. In the thread shown above, the discussion theme “Croatia” has been highlighted. The menu of discussion themes can be scrolled by holding down the Shift key and dragging the mouse. By clicking on a discussion theme in the menu of themes, it is highlighted in white and the portion of the thread in which it is used as a theme is also highlighted in white. In this case, it can be seen that three of the messages of the thread are connected together by the theme “Croatia.”

2.5 Message Display

The use of “Croatia” as a discussion theme that links two of the messages of the thread is visible in the display of the message shown above. “Montenegro” is mentioned in a quote from a previous message and “Croatia” is discussed in the present message. The discussion themes analysis procedure of the Conversation Map system connected these two terms together because, in the thesaurus used in the Conversation Map system (i.e., Wordnet version 1.6, [6]), Montenegro is listed as a part of Croatia. The text of the message displayed above also illustrates two other features of the Conversation Map system as a Usenet newsgroup browser: (1) Since quotations within messages are identified as a part of the analysis procedure for building the social networks, quotations within a given message are automatically highlighted as hypertext within the display of the text of the message. Clicking on the text of a quotation will open a new window containing the full text of the quoted message. (2) Near the top of every message is a PREVIOUS and a NEXT label. If there is a • symbol listed next to the PREVIOUS label, clicking on the • will open a window containing the text of the message that precedes the current message. A message, A, is said to precede another message, B, if B is sent in reply to A. Since several messages might be sent in reply to a message, one or more •s might appear after a NEXT label. Click on each of the •s listed after the NEXT label to see all of the messages sent in response to the current message.

2.6 Semantic Network

The central terms of a discussion are often connected to two or more other terms. Thus, in the soc.culture.albanian archive “people” is computed to be a central, perhaps neutral, term in the vicious argumentation that characterizes the content of many of the messages in the example archive. In this archive Albanians are “talked about” as people, Serbs are talked about as people, refugees are talked about as people, as are governments and countries. In other words, it appears to be the case that all sides of the argument (which is predominantly an argument pitting the Albanian view of the Kosovo situation against the Serbian view) can agree that the more general term “people” is applicable to both Serbs and Albanians.

The graphical interface uses the same spider web algorithm to lay out the semantic network as it uses to display the thread trees. Note that the algorithm sometimes overlaps nodes of the graph. In the figure above, the nodes of the semantic network have been rearranged for legibility by holding down the Shift key and dragging the mouse.

Nodes of the semantic network can be selected by clicking the mouse. For example, if the term “country” is selected, all of the themes synonymous with “country” are highlighted in the themes menu. Simultaneously, all of the participants in the social network connected by the highlighted themes are also highlighted, and all of the threads wherein “country,” or a synonym of “country,” is used as a discussion theme are also highlighted.

The associations displayed in the image above were calculated by the Conversation Map system. The Conversation Map system parses and analyzes the contents of the newsgroup messages to calculate the semantic network. In the semantic network, terms that are similar to one another in the newsgroup messages are connected together by a line. To calculate which terms are similar to one another, the Conversation Map system compares the list of associations for each term against the list of associations of every other term. For example, if the discussion includes messages concerning “time” and other messages concerning “money” and these two terms (“time” and “money”) are used in similar ways by the discussants (e.g., “You're wasting my time,” “You're wasting my money,” “You need to budget your time,” “You need to budget your money”) then the two terms will show up close to one another in the graphically displayed semantic network and so indicate the presence of a literal or metaphorical similarity between the terms (e.g., “Time is money”). Specifically, two terms are “talked about” in similar ways if they are often used with the same verbs, appear together with the same nouns, and share a large number of adjectives with which they are both modified.
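The comparison of association lists can be sketched with the “time” and “money” example. The overlap measure used here (shared associations divided by all associations, i.e., Jaccard similarity) is an assumption chosen for simplicity and is not claimed to be the measure the system uses; the association lists are invented to match the example sentences.

```python
def similarity(assoc_a, assoc_b):
    """Fraction of associations that two terms have in common (Jaccard)."""
    if not assoc_a or not assoc_b:
        return 0.0
    return len(assoc_a & assoc_b) / len(assoc_a | assoc_b)


def nearest_neighbors(associations):
    """For each term, find the other term with the most similar associations."""
    result = {}
    for term, contexts in associations.items():
        scored = [
            (similarity(contexts, associations[other]), other)
            for other in associations
            if other != term
        ]
        if scored:
            score, best = max(scored)
            result[term] = (best, score)
    return result


if __name__ == "__main__":
    # Each association is a (relation, word) pair; hypothetical data.
    associations = {
        "time": {("object-of", "waste"), ("object-of", "budget"), ("adjective", "free")},
        "money": {("object-of", "waste"), ("object-of", "budget"), ("object-of", "spend")},
        "war": {("object-of", "wage"), ("adjective", "civil")},
    }
    print(nearest_neighbors(associations)["time"])
```

With these lists, “time” and “money” are each other's nearest neighbor, which is the situation in which the two terms would be drawn close together in the semantic network.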

The word associations that can be viewed by double-clicking on a term in the semantic network form a complete list of the verbs, adjectives, and nouns that are used with the given term. Each of the word associations can be “opened” with a single click. If the verb “consider” is clicked on from the display shown above, a web browser window containing the following table appears. This table shows all of the sentences in the archive of messages where the term “country” has appeared as the subject of the verb “consider.” To see the message that contains an example sentence, click on the sentence and a new web browser window will be opened containing the text of the message.

It is also possible to compare the associations of one term with the associations of another term. Return to the main window displaying the semantic network. In the semantic network, hold down the Control key and click the mouse twice, once over the term “country” and then over “nation.” Now, hold down the Control key again and move the mouse over one of the two selected terms, and double click the mouse.

A new window is created. It displays the difference and union of the associations for “country” and “nation.” Associations unique to “country” are displayed in green. Associations unique to “nation” are shown in silver. And, associations common to both “country” and “nation” are written in white. Clicking on any of the terms listed in green or silver will create a window of example sentences like the window shown above for the examples of “country” used as the subject of the verb “consider.” If any of the white terms are clicked on, a similar window of examples will be created containing sentences using the term “country” and other sentences using the term “nation.”
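The three groups shown in the comparison window correspond to simple set operations on the two association lists. The sketch below uses invented association lists and hypothetical names; only the grouping itself follows the description above.

```python
def compare_associations(first, second):
    """Split two association sets into the three groups shown in the window."""
    return {
        "only_first": sorted(first - second),   # displayed in green
        "only_second": sorted(second - first),  # displayed in silver
        "shared": sorted(first & second),       # displayed in white
    }


if __name__ == "__main__":
    country = {"consider", "invade", "poor"}   # hypothetical associations
    nation = {"consider", "build", "sovereign"}
    print(compare_associations(country, nation))
```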

3. THE TEXT ANALYSIS PROCEDURE

(f) The words in the messages are divided into sentences, tagged with part-of-speech information, and their roots are identified. To divide the words into sentences, a tool built at the University of Pennsylvania is used [18]. To accomplish the part-of-speech tagging, a simple trigram-based tagger has been constructed. The morphological analyzer built for the Conversation Map system uses a freely available morphology and syntax database [12].

(i) An analysis of lexical cohesion is performed on every pair of messages where a pair consists of one message of a “thread” and another message that either immediately follows the first message in the thread (i.e., is a reply to the first message) and/or follows the first message in the thread and contains a quotation from the first message. This analysis produces a series of lexical ties between messages that can be understood as a crude approximation to the theme of the conversation in a sequence of messages. The lexical database WordNet [6] is used in the lexical cohesion procedure. See [8] for a definition of lexical cohesion. See [10] for an example implementation of a somewhat analogous lexical cohesion routine.

(j) By using the index created in step (d) with the results of step (i), a set of lexical ties is computed for every pair of posters who have replied to and/or quoted from one another over the course of time represented by the Usenet newsgroup archive under analysis. These aggregated lexical ties are layered on top of the social network computed in step (e). The result is that most of the links between pairs of posters are labeled with one or more lexical ties (i.e., one or more “discussion themes”). The combination of social networks and lexical cohesion results is called social cohesion. The social cohesion analysis procedure developed for the Conversation Map system is partially described in [19].

(k) The lexicosyntactic context of every noun in the archive is compared to the lexicosyntactic context of every other noun in the archive. Nouns that are used or discussed in the same manner are calculated to be similar and are placed close to one another in the semantic network. An algorithm similar to the one described in [7] is used. Once all of the noun-noun pairs have been compared and a nearest neighbor for each noun computed, a subset of the computed semantic networks is selected for display by ranking the semantic networks. The top-ranked semantic network contains a set of terms (used as “discussion themes”) that connect the greatest number of poster pairs linked in step (j). In this manner, information about the social networks of the newsgroup is used as a kind of “lens” to select an important subset of the semantic information. Effectively, this type of interlacing of the social and semantic information supports social and semantic navigation [5] in the interface generated for the newsgroup.

4. RELATED WORK

Several other content-based Usenet newsgroup readers have been built with text analysis procedures simpler than those incorporated into the Conversation Map system discussed in this paper. For example, [11] describe an intelligent network news reader that performs a sort of example-based, relevance feedback procedure to select small collections of messages from an archive given an example message. The intelligent network news reader also contains a method for identifying sub-threads within larger threads by analyzing the content of the messages in a thread [23]. However, systems of this sort (cf., [21]) are mostly concerned with filtering messages rather than with one of the problems addressed by the Conversation Map system: How can all of the messages in an archive be graphically displayed and organized according to content of the messages and the social structure representative of the participants' interactions?

Many of the computational techniques developed for the analysis of Usenet newsgroups do not take the linguistic content of the messages into account at all; instead, they rely exclusively on information that can be garnered from the headers of the messages (see, for example, [22]). Other work does employ some keyword spotting techniques to identify and sort the messages into categories but does not involve the analysis of grammatical or discourse structures; see, for instance, [4].

Work that does use the contents of the messages for analysis often does not take the threading of the messages into account, or, if it does, does not pay attention to the social network produced by newsgroup participants (e.g., [2]). Or, if the work does take the threading and citation information into account it does not necessarily use any of the linguistic contents of the messages to compute the graphical display (cf., [3]).

Research that has combined content analysis with an analysis of co-referencing of messages and discussion participants has often employed non-computational means to categorize the contents of messages (e.g., [1]). Some of the most interesting work that analyzes message threading, participant interaction, and the form and content of messages consists of ethnographically oriented, sociolinguistic analyses of newsgroup interactions that are done without the assistance of computers and so are necessarily based on a reading of only a small handful of messages (e.g., [9]). Ideally one could program the computer to emulate the latter sort of analysis, but that will require many advances in the field of computational linguistics. What is unique to the text analysis procedures of the Conversation Map system is the automatic construction and combination of social and semantic networks that, together, provide a means for exploring both the social and semantic structure of a Usenet newsgroup.

The novel text analysis procedures in combination with a graphical interface make the Conversation Map system an example of a new sort of content-based browser. Earlier examples of content-based browsers (e.g., [17]) used simpler text analysis procedures akin to those employed in information retrieval systems. New content-based browsers, clients, and readers (like the Conversation Map system) will incorporate more sophisticated text analysis (and probably, eventually, image analysis) techniques.

5. CONCLUSIONS

The Conversation Map system is an attempt to construct a prototype, content-based Usenet newsgroup browser that shows not only the terms being discussed but also how the discussion conducted in the newsgroup constitutes a set of social relations between participants. The text analysis procedures of the Conversation Map system produce (1) a set of social networks; (2) a list of high-frequency “discussion themes”; and, (3) a set of semantic networks. These three results are displayed in a Java-based graphical interface. Using this interface one can get a quick overview of some of the social and semantic structures of the newsgroup discussion.

The prototype system is being developed with two different groups of users in mind: (1) The Conversation Map system could be used as a newsgroup reader by newsgroup participants. This use would require that a newsgroup be archived (as is done at sites like, for instance, www.dejanews.com) and the Conversation Map system run periodically on the archive. The graphical interface of the Conversation Map system would then provide newsgroup participants an alternative way of reading the archive of past messages. (2) The Conversation Map system is being developed in coordination with a small set of professional users; i.e., anthropologists, sociologists, and others who are professional discourse analysts interested in having a tool that provides them with a first cut at their data. Specifically, the Conversation Map system gives them a means to quickly survey thousands of newsgroup messages and so provides a place to start doing closer readings of parts of the archive. One example collaboration of this sort involves the anthropologist of science and technology Joseph Dumit at MIT. Together we are attempting to use the Conversation Map system to explore a set of newsgroups concerned with health and medicine [20]. While the Conversation Map system is currently being used as an archive interface by a small number of newsgroup participants, the development, refinement, and evaluation of the Conversation Map system are currently being accomplished more through a process of participatory design with professional discourse analysts.

6. REFERENCES

[1] Michael Berthold, Fay Sudweeks, Sid Newton, Richard Coyne. “It makes sense: Using an autoassociative neural network to explore typicality in computer mediated discussions” In F. Sudweeks, M. McLaughlin, and S. Rafaeli (editors) Network and Netplay: Virtual Groups on the Internet (Cambridge, MA: AAAI/MIT Press, 1998)