Summary Slides (MS-PowerPoint97)
Annotations:
(Shawn)
Slide 1: Introduction
What is Text Summarization?
A method of compressing a document into a shorter text
that describes what the original document is about...
Slide 2: Introduction
It produces a summary.
Slide 3: Introduction
...which is automatically generated...
Slide 4: Introduction
...with regard to a certain document or set of documents...
Slide 5: Introduction
...and we want to produce a summary that is at least
as good as one a human would be able to produce.
Slide 6: Introduction
Typically though, Summarization
is a hard problem. So, as is usual, we borrow from people who have
part of the problem solved, and try to come up with something from there.
Information Extraction: Text -> Structured Representation (DB). Structured representations are easier for computers to deal with.
Information Retrieval: retrieving documents based on free-text (natural language) queries.
Text Mining: extracting specific information from free text.
Text Generation: generating text from some sort of information (meaning) structure.
(Keri)
Slide 7: Types of Text Summarization
There are many different types of summaries that fulfill
different user needs. Indicative summaries are used for quick categorization
while informative summaries employ some content processing. Similarly,
an extract simply lists key phrases from a text, while an abstract is cohesive
and coherent and may be re-phrased. Generic summaries reflect the
author's viewpoint, while query-oriented summaries respond to a user's needs.
Background summaries offer extensive information because they assume a
reader's understanding of the subject is poor while just-the-news summaries
assume that the reader understands the topic well and only needs to be
updated. Single-document summaries only have one source while multi-document
summaries fuse together information from many sources (Hovy and Marcu,
2000).
Slide 8: Types of Text Summarization
Summaries can look at all the information in a document or only the information that is deemed relevant.
Slide 9: Types of Text Summarization
Summarization tasks can also be categorized into two broad approaches: top-down (a query-driven focus) versus bottom-up (a text-driven focus) (Hovy and Marcu, 2000).
Slide 10: What do Human Summarizers Do?
Before taking a look at the computational approach to
the summarization problem, it is interesting to
examine how humans handle the problem. General
processes that humans use when summarizing written
or spoken text include: deleting extraneous information
where possible and rewriting remaining
information to make it more general and more compact.
Slide 11: What do Human Summarizers Do?
To illustrate, consider an example from Endres-Niggemeyer (1998).
Example:
"Father was washing dishes. Mother was working
on her new book. The daughter was busy painting the window frames."
After summarization:
"The whole family was busy."
The task of summarizing these sentences involves all three steps mentioned above (deletion, generalization, and construction). First, extraneous material (that the book Mother was working on was new) was deleted. Then, generalizations were made: mother, father, and daughter are all members of a family; washing, writing, and painting are all things that one can be busy with; dishes, book, and window frames are all objects that can be replaced by any-object. Finally, a process that Endres-Niggemeyer calls construction allows all three sentences to be compacted into one: one member of the family is busy doing something with any-object + one member of the family is busy doing something with any-object + one member of the family is busy doing something with any-object = the whole family is busy (with any-object). "With any-object" can then be deleted, because it is extraneous information.
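To make the three steps concrete in code, here is a toy sketch (my own illustration, not from the slides); the hypernym table and the grouping test are hand-made assumptions standing in for real world knowledge:

# A toy sketch (illustration only) of deletion, generalization, and
# construction on the Endres-Niggemeyer example. The hypernym table
# is a hand-made assumption standing in for real world knowledge.
HYPERNYMS = {
    "father": "family member", "mother": "family member",
    "daughter": "family member",
    "dishes": "any-object", "book": "any-object",
    "window frames": "any-object",
}

# Step 1 (deletion) has already happened: "new" was dropped from "new book".
facts = [
    ("father", "washing", "dishes"),
    ("mother", "working on", "book"),
    ("daughter", "painting", "window frames"),
]

# Step 2 (generalization): replace specific agents and objects with hypernyms.
general = [(HYPERNYMS[agent], "busy with", HYPERNYMS[obj])
           for agent, _verb, obj in facts]

# Step 3 (construction): identical generalized facts about every member
# of a group collapse into a single statement about the whole group.
if len(set(general)) == 1:
    print("The whole family was busy.")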
Slide 12: What do Human Summarizers Do?
However, this summarization does not work when the paragraph is expanded to include additional information (see slide for expanded example).
Slide 13: What do Human Summarizers Do?
Now, the most important point of the story is not that the whole family was busy, but rather that the entire family supported Mother in getting her book done. This example is intended to stress the importance of understanding the entire story before abstracting from it.
Slide 14: What do Human Summarizers Do?
Humans summarize information about the world on a constant basis in daily life. Faced with large amounts of incoming information, humans combine information into meaningful and manageable representative bits. In conversation, for example, we chunk information from the discourse into distilled and reduced higher representations. We generalize (a bottom-up process) to a representation of meaning when taking in information, and go from a generalized meaning to a specific meaning (a top-down process) when we create or reinterpret information. In addition, we make use of cues to understand what the discourse is about, and therefore how to generalize (or summarize) it best. These signals include knowledge about the domain of the topic, syntactic cues (such as topic-comment structures and connectives such as "like", "but", "however", "because"), stylistic and rhetorical cues ("The most pressing thing to do was", "I conclude that"), knowledge of structure (narrative structure), and context or situational cues (Endres-Niggemeyer, 1998).
Slide 15: What do Human Summarizers Do?
When summarizing, humans tend to keep statements of fact, items relating to the topic, items that discuss purpose, items that are stated positively (rather than negatively), items that contrast with each other, and items that are stressed. We tend to discard reasons for an argument, comments about a topic, and examples illustrating a point.
Slide 16: What do Human Summarizers Do?
Studies (TRW, as cited in Endres-Niggemeyer, 1998) have
found that when abstracting documents, individuals vary even when
summarizing the same material at two different points in time. Two
different human subjects vary even more significantly. Even though
human subjects produce only moderately consistent results, all the
summaries they produced were judged adequate. This is not the case
with summaries produced by computers.
(Shawn)
Slide 17: Computational Approaches
There are two approaches: a knowledge-based approach and a selection-based approach. So what is the difference? On the surface, the knowledge-based approach takes longer. Why? The knowledge-based approach actually builds semantic representations of what the text means. It then tries to cut out anything extraneous or unimportant and leave only the main points (some systems simply rate the topics for importance and then pick out enough key points to fill the requested summary size), and finally it produces a text that contains all of the meanings that were found to be key. Unfortunately, this can take a long time, especially when one considers how we gain meaning from a sentence. For example:
"The tall man chased the small Yorkshire terrier into an alley."
can be summarized by simply cutting out descriptive words:
"The man chased the terrier into an alley."
or one can generalize "terrier" to simply "dog". But when one adds the next sentence:
"The tall man chased the small Yorkshire terrier into an alley. Unfortunately for the terrier, the alley was a dead end."
one can now summarize this as:
"The man trapped the dog in an alley."
This example also illustrates something that is not immediately obvious about summarization: we tend to generalize about something when it is not important to include the specifics. So "terrier" can become "dog" if no properties unique to terriers are being called upon. Unfortunately, selectional methods don't have this flexibility. Why? Selectional methods choose words from the original text in order to summarize it, usually by simply cutting out unimportant words. This will usually work, but for the above example the end summarization would have to look like:
"The man chased the terrier into an alley."
This loses generalizations such as terrier -> dog.
Selectional methods do have their strong points, though. They tend to use well-defined statistical (mathematical) formulas to determine what is included and what is not, which has the distinct advantage of being much faster. So here we have to balance the speed and the accuracy of the summary. Typically, though, we use some combination of selectional and knowledge-based methods, because of the speed problems of the knowledge-based method on large documents.
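As a minimal sketch of the selectional idea (cutting out unimportant words), here is a naive adjective-deleting compressor. This is my own illustration, not any published system; it assumes NLTK with its tokenizer and tagger models installed:

# A naive selectional compressor: delete "descriptive" words, here
# approximated as adjectives found by an off-the-shelf POS tagger.
# Illustration only; not any published system's method.
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def drop_adjectives(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    kept = [word for word, tag in tagged if not tag.startswith("JJ")]
    return " ".join(kept)

print(drop_adjectives(
    "The tall man chased the small Yorkshire terrier into an alley."))
# -> roughly: The man chased the Yorkshire terrier into an alley .
# Note what is missing: no generalization (terrier -> dog) is possible,
# because the method can only keep or delete the original words.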
(Keri)
Slide 18: Historical Approaches
In order to understand the basic subtasks in a computational
summarization task, it is useful to outline an early algorithm created
by Luhn (1958) (as cited in Endres-Niggemeyer, 1998). This simple
algorithm is a selection-based summarization approach. The method
works as follows:
1) words are input from the text;
2) common/non-substantive words are deleted through table
look-up;
3) content words are stored, along with their position
in the text, as well as any punctuation that is located immediately to
the left and/or right of the word;
4) content words are sorted alphabetically;
Slide 19: Historical Approaches
Luhn (1958) Algorithm (cont.):
5a) similar spellings are consolidated into word types (a rough approximation of a stemmer: any token with fewer than seven non-matching letters was considered to be of the same word type),
Slide 20: Historical Approaches
Luhn (1958) Algorithm (cont.):
5b) the frequencies of the word types are compared,
5c) word types with low frequencies are deleted, and
5d) the remaining words are considered significant;
An aside - anaphora resolution is just one problem presented by word count methods. Word counts for frequency are skewed if there is no anaphora resolution. In a paragraph about white elephants, any reference to "elephants" or "those big animals" or "they" should all count as equal to "white elephants." Anaphora resolution helps to more accurately select the topic.
In a similar way, word sense disambiguation is an important subtask, as ambiguous words may or may not be counted as relating to the topic depending on which sense is used.
Slide 21: Historical Approaches
Luhn (1958) Algorithm (cont.):
6) the remaining word types are sorted into location
order and
7) sentence representativeness is determined by
dividing sentences into substrings that consist of significant words
separated from other significant words by no more than four words (significant
words separated from other significant words by more than four words are
viewed as isolated and not considered further);
Sentence representativeness substrings are illustrated with the sentence "Better to eat you with, my dear." Assume that the following words are considered significant by the algorithm: better, eat, you, dear. The substrings that are selected include:
Substring 1: Better to
Substring 2: to eat
Substring 3: you with, my
Substring 4: with, my dear
Slide 22: Historical Approaches
Luhn (1958) Algorithm (cont.):
8) for each substring, a representativeness value is calculated by dividing the number of representative tokens in the cluster by the total number of tokens in the cluster;
In this example, assume that "better" has a frequency of 2, "eat" has a frequency of 4, "you" has a frequency of 6, and "dear" has a frequency of 1 in the sentence "Better to eat you with, my dear."
Substring 1: 2/2=1
Substring 2: 4/2=2
Substring 3: 6/3=2
Substring 4: 1/3=0.333
Total value for sentence = 5.33
9) sentences reaching a representativeness value above a preset threshold are selected for inclusion in the abstract (Endres-Niggemeyer, 1998).
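To make steps 1 through 9 concrete, here is a compact sketch of the procedure in Python (standard library only). The stopword list, frequency cutoff, threshold, and sentence splitter are placeholder assumptions, and the scoring follows the division described in step 8 (Luhn's published measure actually squares the significant-word count, but the version described above is kept here):

# A compact sketch of the Luhn-style selection procedure in steps 1-9.
# Stopwords, cutoffs, and the sentence splitter are placeholder choices.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "in", "and", "was", "with", "my"}
MIN_FREQ = 2   # step 5c: word types below this frequency are dropped
GAP = 4        # step 7: max words allowed between significant words

def luhn_summary(text, threshold):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # steps 1-5: collect content words and keep the frequent ones
    words = [w for s in sentences for w in re.findall(r"[a-z]+", s.lower())]
    freq = Counter(w for w in words if w not in STOPWORDS)
    significant = {w for w, c in freq.items() if c >= MIN_FREQ}

    selected = []
    for sent in sentences:
        toks = re.findall(r"[a-z]+", sent.lower())
        positions = [i for i, w in enumerate(toks) if w in significant]
        if not positions:
            continue  # no significant words: sentence cannot qualify
        # step 7: cluster significant words separated by <= GAP other words
        clusters, current = [], [positions[0]]
        for i in positions[1:]:
            if i - current[-1] <= GAP + 1:
                current.append(i)
            else:
                clusters.append(current)
                current = [i]
        clusters.append(current)
        # step 8: significant tokens divided by total tokens in the cluster
        score = max(len(c) / (c[-1] - c[0] + 1) for c in clusters)
        # step 9: keep sentences whose best cluster clears the threshold
        if score > threshold:
            selected.append(sent)
    return selected

Calling luhn_summary(text, 0.5), for instance, returns the sentences whose densest cluster is at least half significant words.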
Slide 23: Historical Approaches
The TRW approach of the 1960s built upon Luhn's model by adding weights for words that occurred in the title or subtitles of the document. It also weighted sentences based upon their location: sentences earlier or later in a paragraph were given higher weights than those in the middle. However, the largest drawback at this point is that whole sentences are extracted, not rewritten (Endres-Niggemeyer, 1998).
Slide 24: Historical Approaches
The Luhn model and the TRW model are both selection-based approaches. Knowledge-based approaches, in contrast, make use of frames and schemas, which are formats for representing knowledge. Two models are influenced by models used in cognitive science: FRUMP and PAULINE.
Slide 25: Historical Approaches
FRUMP is an expectation-driven model. It has a pre-built knowledge
base, and the model looks for instances of that knowledge base in the
text to be summarized. Full parsing is not necessary for this method
to work: the model only needs to recognize a member of one of its
"sketchy scripts" in order to classify a word as belonging to a certain
domain. A sketchy script is not as fully fleshed-out as a regular
script (Endres-Niggemeyer, 1998).
Slide 26: Historical Approaches
PAULINE is a model that works on the pragmatics of the situation. It generates a number of summaries, each targeted towards a different goal, from politeness to persuasion; it can generate 100 different summaries from one original. The system initially asks the user for information to help guide its behavior and then asks the user for conversation topics that are included in its scripts. PAULINE then collects information on the topic and creates sentences. Some examples of the pragmatic goals used include: make the listener like me, use a "highfalutin" tone of voice, and persuade the listener to change their opinion (Endres-Niggemeyer, 1998).
Slide 27: Current Approaches
So far, we have only looked at older approaches to text
summarization. Newer methods are characterized by: the use of stochastic
methods, integration of corpus linguistics, shallow parsing methods,
lexical-semantic knowledge through use of WordNet, integration of
different methods in one model, summarization from structured knowledge,
and integration of information from different media (Endres-Niggemeyer, 1998).
(Shawn)
Slide 28: Current Approaches
One of the current approaches involves combining related fields in
order to produce summaries. Information Extraction is used to put the
text into a more computer-friendly structure; this structure is then
compressed by weighting its elements for importance and creating a new
structure that includes only the key concepts. Finally, the new
structure is turned into a text document through text-generation
procedures. This approach is actually quite popular because it takes
less development time: it mainly uses already-existing technologies
and simply glues them together in different ways.
(All sentence compression data
taken from: Knight & Marcu, 2000)
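A schematic sketch of that pipeline, with every stage reduced to a toy stand-in (all function bodies here are my own assumptions, not an existing system's components):

# Extraction -> structure compression -> generation, each stage a toy.
from collections import Counter

def extract(text):
    # IE stand-in: one flat record per sentence with a crude keyword set
    records = []
    for sent in filter(None, text.split(". ")):
        words = {w.strip(".,").lower() for w in sent.split()}
        records.append({"sentence": sent.rstrip("."), "keywords": words})
    return records

def compress(records, k):
    # compression stand-in: weight records by how many globally frequent
    # keywords they contain, and keep only the top k
    freq = Counter(w for r in records for w in r["keywords"])
    ranked = sorted(records,
                    key=lambda r: -sum(freq[w] for w in r["keywords"]))
    return ranked[:k]

def generate(records):
    # generation stand-in: re-emit the surviving records as text
    return ". ".join(r["sentence"] for r in records) + "."

doc = ("The man chased the terrier. The terrier ran into an alley. "
       "The alley was a dead end. The weather was pleasant.")
print(generate(compress(extract(doc), k=2)))
# -> The terrier ran into an alley. The alley was a dead end.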
Slide 29: Current Approaches
Why think so large, though? Instead, we can simply work on compressing more manageable chunks (individual sentences) and then, once that works, apply the technique in divide-and-conquer fashion.
Slide 30: Current Approaches
Imagine for a moment that a sentence is just a channel with a lot of
noise: a signal obscured by a lot of extra data. All we have to do is
remove the noise, and then we have the actual signal. Thinking about
it this way, we can use a machine-learning method to train a program
to determine what is noise and what is material we want to keep. This
is done by presenting the "noisy" data together with what the data
looks like once all the noise has been stripped out.
Slide 31: Current Approaches
Source: generate shorter strings using the words in the original sentence, and then probabilistically determine which of the shorter strings is most likely to be the source of the original string. As one can see, this can get REALLY time-consuming for a long sentence, because ALL possible sentences are generated using one or more words from the original sentence.
Channel: the strings are compared in pairs (the original and the short string) to see if the original string is a likely expansion of the generated short string.
Decoder: search for the short string that maximizes the probability that the original is just an expansion of it.
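Putting the three components together (notation mine; this is the standard noisy-channel decomposition the slide is describing): with t the original long sentence and s a candidate short sentence, the decoder seeks

    s* = argmax_s P(s | t) = argmax_s P(s) * P(t | s)

where P(s) is the source model, P(t | s) is the channel model, and P(t) drops out of the argmax because it is constant across candidates.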
Slide 32: Current Approaches
Our focus is on preserving the important information, not so much on
making the output grammatical; we still do care about this, just not
as much.
Slide 33: Current Approaches
We can't really work with raw sentences well on a computer, but we can
work well with parse trees. For this, parse trees such as those
produced by the Collins parser are typically used (Collins, 1997).
Slide 34: Current Approaches
Slide 35: Current Approaches
Each sentence is followed by a rating. The rating is determined in such a way that the lower the rating, the better.
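The slide does not define the rating, but one convention that produces "lower is better" numbers (an assumption on my part, not stated in the slides) is a length-normalized negative log-probability:

    rating(s) = -log( P(s) * P(t | s) ) / len(s)

so a smaller value means the model finds the compressed sentence more probable.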
Slide 36: Current Approaches
Slide 37: Current Approaches
Slide 38: Current Approaches
Slide 39: Current Approaches
Slide 40: Current Approaches
Slide 41: Current Approaches
Slide 42: Current Approaches
Slide 43: Current Approaches
This is interesting... Make
sure to notice the rating.
Slide 44: Current Approaches
...And here is why that was interesting. This example has a higher
rating but is more compressed. The reason? The compression routine
likes determiners such as "the" because of the specificity they
denote. So "Operations" is not rated as favorably as "The
Operations"...
Slide 45: Current Approaches
Slide 46: Current Approaches
Slide 47: Current Approaches
Slide 48: Current Approaches
Slide 49: Current Approaches
Slide 50: Current Approaches
Slide 51: Current Approaches
From here on the examples do not
include the intermediate steps...
Slide 52: Current Approaches
Slide 53: Current Approaches
Slide 54: Current Approaches
Slide 55: Current Approaches
Slide 56: Current Approaches
The users of this approach measured themselves against human
summarizers. Here we can see it did well, performing the same as the
humans.
Slide 57: Current Approaches
Well... nothing is perfect. What happened here is that the
compression was too conservative: it did not notice that it could
remove so much of the sentence, and in fact judged the entire sentence
to be important to the meaning. This is really an inflexibility of
the algorithm rather than an outright error (better safe than sorry
appears to be the approach here).
Slide 58: Current Approaches
Here is something very interesting: the two methods of compressing the
sentence behaved differently. The Noisy-Channel method was too
conservative, while the Decision-based method was too liberal...
AHHH!! The politics are here too! Unfortunately, we must regard the
too-liberal method as worse off, because it does not include all of
the information that the conservative one does, and so is a less
accurate summary. Though there is also the chance that the humans
were wrong...
Slide 59: Future Work
All of these have promise.
Noisy Channel: This appears to be the way to go. As the approach becomes more sophisticated (perhaps with better learning algorithms or other improvements), it could easily become a very accurate method.
Knowledge-Based: This is supposedly the way humans do it, so it should probably be the direction people head; the problem with this method is that it takes so long. One solution is faster computers, and faster computers come out every year, so it is foreseeable that we may eventually have enough computing power to make this practical. We already have some ideas here, too. The CYC project is based on using common-sense knowledge to perform intelligently, and one of the things humans appear to rely on heavily in summarization is common-sense knowledge of what really matters and what does not. In fact, one of the goals of the CYC project was to produce a machine that could learn not a specific topic, but how to learn. So this could quickly turn out to matter, and considering the speed of the CYC system, there may yet be a way to do knowledge-based summarization efficiently.
Other: Here we can think about the IE -> DB compression -> text
generation method. There are better and better algorithms for each of
these stages every year, which means that methods using them will keep
improving. It is also likely that not only the individual stages but
the ways of gluing them together will get better, improving this line
of work as well.
So, the short story here is that all of the current methods appear to be viable, and with increasing computing power and accumulated experience, all of them appear to be directions for the future. Something else interesting is that the separation between knowledge-based and selection-based systems may become less sharp in the future; the two methods could well be combined to produce better summarization systems. One can quickly see that if we first remove all of the extraneous data that has nothing to do with the topic, or that is simply 'icing', a previously too-slow knowledge-based method could then be run on the remaining data to produce a much better summary.
(Keri)
Slide 60: Summary
Text summarization has several different methods and subtasks
and, like most recent developments in the
area of Computational Linguistics, there is more to be
done to make automatic processes match human
expectations.
Links
Little Red Riding Hood - an example of something to summarize
MS-Word reportedly uses the ProSum summary method, which appears to
simply take the first sentences from each paragraph and output those
as a summary (an extract method).
Example of a MS-Word 97 Summary
SUMMARIST uses the method of summarization in which Information
Retrieval is used to get the key concepts, those concepts are then
interpreted, and finally a text generation method is used to produce
the summary.
SUMMARIST
Example
Bibliography
Endres-Niggemeyer, B. (1998). Summarizing Information. Springer, New York, NY.
Hovy, E. and Marcu, D. (2000). Automated Text Summarization Tutorial. Pre-conference tutorial at the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics (ACL-98), Montreal, Quebec, Canada.
Knight, K. and Marcu, D. (2000). Statistics-Based Summarization - Step One: Sentence Compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Austin, Texas.
Collins, M. (1997). Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), Madrid, Spain.
Luhn, H. P. (1958). The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2), 159-165.