Summary Slides (MS-PowerPoint97)
Annotations:
(Shawn)
Slide 1: Introduction
What is Text Summarization?
A method of compressing a document into a shorter text
that describes what the original document is about...
Slide 2: Introduction
It produces a summary.
Slide 3: Introduction
...which is automatically generated...
Slide 4: Introduction
...with regard to a certain document or set of documents...
Slide 5: Introduction
...and we want to produce a summary that is at least
as good as one a human would be able to produce.
Slide 6: Introduction
Typically though, Summarization
is a hard problem. So, as is usual, we borrow from people who have
part of the problem solved, and try to come up with something from there.
Information Extraction: Text -> Structured Representation (DB). Structured representations are easier for computers to deal with.
Information Retrieval: retrieving documents based on free-text (natural language) queries.
Text Mining: extracting specific information from free text.
Text Generation: generating text from some sort of information (meaning) structure.
(Keri)
Slide 7: Types of Text Summarization
There are many different types of summaries that fulfill
different user needs. Indicative summaries are used for quick categorization
while informative summaries employ some content processing. Similarly,
an extract simply lists key phrases from a text, while an abstract is cohesive
and coherent and may be re-phrased. Generic summaries reflect the
author's viewpoint, while query-oriented summaries respond to a user's needs.
Background summaries offer extensive information because they assume a
reader's understanding of the subject is poor while just-the-news summaries
assume that the reader understands the topic well and only needs to be
updated. Single-document summaries only have one source while multi-document
summaries fuse together information from many sources (Hovy and Marcu,
2000).
Slide 8: Types of Text Summarization
Summaries can look at all the information in a document or only the information that is deemed relevant.
Slide 9: Types of Text Summarization
Summarization tasks can also be categorized into two broad approaches: top-down (a query-driven focus) versus bottom-up (a text-driven focus) (Hovy and Marcu, 2000).
Slide 10: What do Human Summarizers Do?
Before taking a look at the computational approach to
the summarization problem, it is interesting to
examine how humans handle the problem. General
processes that humans use when summarizing written
or spoken text include: deleting extraneous information
where possible and rewriting remaining
information to make it more general and more compact.
Slide 11: What do Human Summarizers Do?
To illustrate, consider an example from Endres-Niggemeyer (1998).
Example:
"Father was washing dishes. Mother was working
on her new book. The daughter was busy painting the window frames."
After summarization:
"The whole family was busy."
The task of summarizing these sentences involves all three steps mentioned above (deletion, generalization, and construction). First, extraneous material (that the book Mother was working on was new) was deleted. Then, generalizations were made: mother, father, and daughter are all members of a family; washing, writing, and painting are all things that one can be busy with; dishes, book, and window frames are all objects that can be replaced by any-object. Finally, a process that Endres-Niggemeyer calls construction allows all three sentences to be compacted into one: one member of the family is busy doing something with any-object + one member of the family is busy doing something with any-object + one member of the family is busy doing something with any-object = the whole family is busy (with any-object). "With any-object" can then be deleted, because it is extraneous information.
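To make the three steps concrete in code, here is a toy sketch (my own illustration, not from the slides); the hypernym table and the grouping test are hand-made assumptions standing in for real world knowledge:

# A toy sketch (illustration only) of deletion, generalization, and
# construction on the Endres-Niggemeyer example. The hypernym table
# is a hand-made assumption standing in for real world knowledge.
HYPERNYMS = {
    "father": "family member", "mother": "family member",
    "daughter": "family member",
    "dishes": "any-object", "book": "any-object",
    "window frames": "any-object",
}

# Step 1 (deletion) has already happened: "new" was dropped from "new book".
facts = [
    ("father", "washing", "dishes"),
    ("mother", "working on", "book"),
    ("daughter", "painting", "window frames"),
]

# Step 2 (generalization): replace specific agents and objects with hypernyms.
general = [(HYPERNYMS[agent], "busy with", HYPERNYMS[obj])
           for agent, _verb, obj in facts]

# Step 3 (construction): identical generalized facts about every member
# of a group collapse into a single statement about the whole group.
if len(set(general)) == 1:
    print("The whole family was busy.")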
Slide 12: What do Human Summarizers Do?
However, this summarization does not work when the paragraph is expanded to include additional information (see slide for expanded example).
Slide 13: What do Human Summarizers Do?
Now, the most important point of the story is not that the whole family was busy, but rather that the entire family supported Mother in getting her book done. This example is intended to stress the importance of understanding the entire story before abstracting from it.
Slide 14: What do Human Summarizers Do?
Humans summarize information about the world on a constant basis in daily life. Faced with large amounts of incoming information, humans combine information into meaningful and manageable representative bits. In conversation, for example, we chunk information from the discourse into distilled and reduced higher representations. We generalize (a bottom-up process) to a representation of meaning when taking in information, and go from a generalized meaning to a specific meaning (a top-down process) when we create or reinterpret information. In addition, we make use of cues to understand what the discourse is about, and therefore how to generalize (or summarize) it best. These signals include knowledge about the domain of the topic, syntactic cues (such as topic-comment structures and connectives such as "like", "but", "however", "because"), stylistic and rhetorical cues ("The most pressing thing to do was", "I conclude that"), knowledge of structure (narrative structure), and context or situational cues (Endres-Niggemeyer, 1998).
Slide 15: What do Human Summarizers Do?
When summarizing, humans tend to keep statements of fact, items relating to the topic, items that discuss purpose, items that are stated positively (rather than negatively), items that contrast with each other, and items that are stressed. We tend to discard reasons for an argument, comments about a topic, and examples illustrating a point.
Slide 16: What do Human Summarizers Do?
Studies (TRW, as cited in Endres-Niggemeyer, 1998) have
found that when abstracting documents, individuals vary even when
summarizing the same material at two different points in time. Two
different human subjects vary even more significantly. Even though
human subjects produce only moderately consistent results, all the
summaries they produced were judged adequate. This is not the case
with summaries produced by computers.
(Shawn)
Slide 17: Computational Approaches
There are two approaches: a knowledge-based approach and a selection-based approach. So what is the difference? On the surface, the knowledge-based approach takes longer. Why? The knowledge-based approach actually builds semantic representations of what the text means. It then tries to cut out anything extraneous or unimportant and leave only the main points (some systems simply rate the topics for importance and then pick out enough key points to fill the requested summary size), and finally it produces a text that contains all of the meanings that were found to be key. Unfortunately, this can take a long time, especially when one considers how we gain meaning from a sentence. For example:
"The tall man chased the small Yorkshire terrier into an alley."
can be summarized by simply cutting out descriptive words:
"The man chased the terrier into an alley."
or one can generalize "terrier" to simply "dog". But when one adds the next sentence:
"The tall man chased the small Yorkshire terrier into an alley. Unfortunately for the terrier, the alley was a dead end."
one can now summarize this as:
"The man trapped the dog in an alley."
This example also illustrates something that is not immediately obvious about summarization: we tend to generalize about something when it is not important to include the specifics. So "terrier" can become "dog" if no properties unique to terriers are being called upon. Unfortunately, selectional methods don't have this flexibility. Why? Selectional methods choose words from the original text in order to summarize it, usually by simply cutting out unimportant words. This will usually work, but for the above example the end summarization would have to look like:
"The man chased the terrier into an alley."
This loses generalizations such as terrier -> dog.
Selectional methods do have their strong points, though. They tend to use well-defined statistical (mathematical) formulas to determine what is included and what is not, which has the distinct advantage of being much faster. So here we have to balance the speed and the accuracy of the summary. Typically, though, we use some combination of selectional and knowledge-based methods, because of the speed problems of the knowledge-based method on large documents.
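As a minimal sketch of the selectional idea (cutting out unimportant words), here is a naive adjective-deleting compressor. This is my own illustration, not any published system; it assumes NLTK with its tokenizer and tagger models installed:

# A naive selectional compressor: delete "descriptive" words, here
# approximated as adjectives found by an off-the-shelf POS tagger.
# Illustration only; not any published system's method.
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def drop_adjectives(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    kept = [word for word, tag in tagged if not tag.startswith("JJ")]
    return " ".join(kept)

print(drop_adjectives(
    "The tall man chased the small Yorkshire terrier into an alley."))
# -> roughly: The man chased the Yorkshire terrier into an alley .
# Note what is missing: no generalization (terrier -> dog) is possible,
# because the method can only keep or delete the original words.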
(Keri)
Slide 18: Historical Approaches
In order to understand the basic subtasks in a computational
summarization task, it is useful to outline an early algorithm created
by Luhn (1958) (as cited in Endres-Niggemeyer, 1998). This simple
algorithm is a selection-based summarization approach. The method
works as follows:
1) words are input from the text;
2) common/non-substantive words are deleted through table
look-up;
3) content words are stored, along with their position
in the text, as well as any punctuation that is located immediately to
the left and/or right of the word;
4) content words are sorted alphabetically;
Slide 19: Historical Approaches
Luhn (1958) Algorithm (cont.):
5a) similar spellings are consolidated into word types (a rough approximation of a stemmer: any token with fewer than seven non-matching letters was considered to be of the same word type),
Slide 20: Historical Approaches
Luhn (1958) Algorithm (cont.):
5b) the frequencies of the word types are compared,
5c) word types with low frequencies are deleted, and
5d) the remaining words are considered significant;
An aside - anaphora resolution is just one problem presented by word count methods. Word counts for frequency are skewed if there is no anaphora resolution. In a paragraph about white elephants, any reference to "elephants" or "those big animals" or "they" should all count as equal to "white elephants." Anaphora resolution helps to more accurately select the topic.
In a similar way, word sense disambiguation is an important subtask, as ambiguous words may or may not be counted as relating to the topic depending on which sense is used.
Slide 21: Historical Approaches
Luhn (1958) Algorithm (cont.):
6) the remaining word types are sorted into location
order and
7) sentence representativeness is determined by
dividing sentences into substrings that consist of significant words
separated from other significant words by no more than four words (significant
words separated from other significant words by more than four words are
viewed as isolated and not considered further);
Sentence representativeness substrings are illustrated with the sentence "Better to eat you with, my dear." Assume that the following words are considered significant by the algorithm: better, eat, you, dear. The substrings that are selected include:
Substring 1: Better to
Substring 2: to eat
Substring 3: you with, my
Substring 4: with, my dear
Slide 22: Historical Approaches
Luhn (1958) Algorithm (cont.):
8) for each substring, a representativeness value is calculated by dividing the number of representative tokens in the cluster by the total number of tokens in the cluster;
In this example, assume that "better" has a frequency of 2, "eat" has a frequency of 4, "you" has a frequency of 6, and "dear" has a frequency of 1 in the sentence "Better to eat you with, my dear."
Substring 1: 2/2=1
Substring 2: 4/2=2
Substring 3: 6/3=2
Substring 4: 1/3=0.333
Total value for sentence = 5.33
9) sentences reaching a representativeness value above a preset threshold are selected for inclusion in the abstract (Endres-Niggemeyer, 1998).
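To make steps 1 through 9 concrete, here is a compact sketch of the procedure in Python (standard library only). The stopword list, frequency cutoff, threshold, and sentence splitter are placeholder assumptions, and the scoring follows the division described in step 8 (Luhn's published measure actually squares the significant-word count, but the version described above is kept here):

# A compact sketch of the Luhn-style selection procedure in steps 1-9.
# Stopwords, cutoffs, and the sentence splitter are placeholder choices.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "in", "and", "was", "with", "my"}
MIN_FREQ = 2   # step 5c: word types below this frequency are dropped
GAP = 4        # step 7: max words allowed between significant words

def luhn_summary(text, threshold):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # steps 1-5: collect content words and keep the frequent ones
    words = [w for s in sentences for w in re.findall(r"[a-z]+", s.lower())]
    freq = Counter(w for w in words if w not in STOPWORDS)
    significant = {w for w, c in freq.items() if c >= MIN_FREQ}

    selected = []
    for sent in sentences:
        toks = re.findall(r"[a-z]+", sent.lower())
        positions = [i for i, w in enumerate(toks) if w in significant]
        if not positions:
            continue  # no significant words: sentence cannot qualify
        # step 7: cluster significant words separated by <= GAP other words
        clusters, current = [], [positions[0]]
        for i in positions[1:]:
            if i - current[-1] <= GAP + 1:
                current.append(i)
            else:
                clusters.append(current)
                current = [i]
        clusters.append(current)
        # step 8: significant tokens divided by total tokens in the cluster
        score = max(len(c) / (c[-1] - c[0] + 1) for c in clusters)
        # step 9: keep sentences whose best cluster clears the threshold
        if score > threshold:
            selected.append(sent)
    return selected

Calling luhn_summary(text, 0.5), for instance, returns the sentences whose densest cluster is at least half significant words.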
Slide 23: Historical Approaches
The TRW approach of the 1960s built upon Luhn's model by adding weights for words that occurred in the title or subtitles of the document. It also weighted sentences based upon their location: sentences earlier or later in a paragraph were given higher weights than those in the middle. However, the largest drawback at this point is that whole sentences are extracted, not rewritten (Endres-Niggemeyer, 1998).
Slide 24: Historical Approaches
The Luhn model and the TRW model are both selection-based approaches. Knowledge-based approaches, in contrast, make use of frames and schemas, which are formats for representing knowledge. Two models are influenced by models used in cognitive science: FRUMP and PAULINE.
Slide 25: Historical Approaches
FRUMP is an expectation-driven model. It has a pre-built knowledge
base, and the model looks for instances of that knowledge base in the
text to be summarized. Full parsing is not necessary for this method
to work: the model only needs to recognize a member of one of its
"sketchy scripts" in order to classify a word as belonging to a certain
domain. A sketchy script is not as fully fleshed-out as a regular
script (Endres-Niggemeyer, 1998).
Slide 26: Historical Approaches
PAULINE is a model that works on the pragmatics of the situation. It generates a number of summaries, each targeted towards a different goal, from politeness to persuasion; it can generate 100 different summaries from one original. The system initially asks the user for information to help guide its behavior and then asks the user for conversation topics that are included in its scripts. PAULINE then collects information on the topic and creates sentences. Some examples of the pragmatic goals used include: make the listener like me, use a "highfalutin" tone of voice, and persuade the listener to change their opinion (Endres-Niggemeyer, 1998).
Slide 27: Current Approaches
So far, we have only looked at older approaches to text
summarization. Newer methods are characterized by: the use of stochastic
methods, integration of corpus linguistics, shallow parsing methods,
lexical-semantic knowledge through use of WordNet, integration of
different methods in one model, summarization from structured knowledge,
and integration of information from different media (Endres-Niggemeyer, 1998).
(Shawn)
Slide 28: Current Approaches
One of the current approaches involves combining related fields in
order to produce summaries. Information Extraction is used to put the
text into a more computer-friendly structure; this structure is then
compressed by weighting its elements for importance and creating a new
structure that includes only the key concepts. Finally, the new
structure is turned into a text document through text-generation
procedures. This approach is actually quite popular because it takes
less development time: it mainly uses already-existing technologies
and simply glues them together in different ways.
(All sentence compression data
taken from: Knight & Marcu, 2000)
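A schematic sketch of that pipeline, with every stage reduced to a toy stand-in (all function bodies here are my own assumptions, not an existing system's components):

# Extraction -> structure compression -> generation, each stage a toy.
from collections import Counter

def extract(text):
    # IE stand-in: one flat record per sentence with a crude keyword set
    records = []
    for sent in filter(None, text.split(". ")):
        words = {w.strip(".,").lower() for w in sent.split()}
        records.append({"sentence": sent.rstrip("."), "keywords": words})
    return records

def compress(records, k):
    # compression stand-in: weight records by how many globally frequent
    # keywords they contain, and keep only the top k
    freq = Counter(w for r in records for w in r["keywords"])
    ranked = sorted(records,
                    key=lambda r: -sum(freq[w] for w in r["keywords"]))
    return ranked[:k]

def generate(records):
    # generation stand-in: re-emit the surviving records as text
    return ". ".join(r["sentence"] for r in records) + "."

doc = ("The man chased the terrier. The terrier ran into an alley. "
       "The alley was a dead end. The weather was pleasant.")
print(generate(compress(extract(doc), k=2)))
# -> The terrier ran into an alley. The alley was a dead end.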
Slide 29: Current Approaches
Why think so large, though? Instead, we can simply work on compressing more manageable chunks (individual sentences) and then, once that works, apply the technique in divide-and-conquer fashion.
Slide 30: Current Approaches
Imagine for a moment that a sentence is just a channel with a lot of
noise: a signal obscured by a lot of extra data. All we have to do is
remove the noise, and then we have the actual signal. Thinking about
it this way, we can use a machine-learning method to train a program
to determine what is noise and what is material we want to keep. This
is done by presenting the "noisy" data together with what the data
looks like once all the noise has been stripped out.
Slide 31: Current Approaches
Source: generate shorter strings using the words in the original sentence, and then probabilistically determine which of the shorter strings is most likely to be the source of the original string. As one can see, this can get REALLY time-consuming for a long sentence, because ALL possible sentences are generated using one or more words from the original sentence.
Channel: the strings are compared in pairs (the original and the short string) to see if the original string is a likely expansion of the generated short string.
Decoder: search for the short string that maximizes the probability that the original is just an expansion of it.
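Putting the three components together (notation mine; this is the standard noisy-channel decomposition the slide is describing): with t the original long sentence and s a candidate short sentence, the decoder seeks

    s* = argmax_s P(s | t) = argmax_s P(s) * P(t | s)

where P(s) is the source model, P(t | s) is the channel model, and P(t) drops out of the argmax because it is constant across candidates.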
Slide 32: Current Approaches
Our focus is on preserving the important information, not so much on
making the output grammatical; we still do care about this, just not
as much.
Slide 33: Current Approaches
We can't really work with raw sentences well on a computer, but we can
work well with parse trees. For this, parse trees such as those
produced by the Collins parser are typically used (Collins, 1997).
Slide 34: Current Approaches
Slide 35: Current Approaches
Each sentence is followed by a rating. The rating is determined in such a way that the lower the rating, the better.
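The slide does not define the rating, but one convention that produces "lower is better" numbers (an assumption on my part, not stated in the slides) is a length-normalized negative log-probability:

    rating(s) = -log( P(s) * P(t | s) ) / len(s)

so a smaller value means the model finds the compressed sentence more probable.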
Slide 36: Current Approaches
Slide 37: Current Approaches
Slide 38: Current Approaches
Slide 39: Current Approaches
Slide 40: Current Approaches
Slide 41: Current Approaches
Slide 42: Current Approaches
Slide 43: Current Approaches
This is interesting... Make
sure to notice the rating.
Slide 44: Current Approaches
...And here is why that was interesting. This example has a higher
rating but is more compressed. The reason? The compression routine
likes determiners such as "the" because of the specificity they
denote. So "Operations" is not rated as favorably as "The
Operations"...
Slide 45: Current Approaches
Slide 46: Current Approaches
Slide 47: Current Approaches
Slide 48: Current Approaches
Slide 49: Current Approaches
Slide 50: Current Approaches
Slide 51: Current Approaches
From here on the examples do not
include the intermediate steps...
Slide 52: Current Approaches
Slide 53: Current Approaches
Slide 54: Current Approaches
Slide 55: Current Approaches
Slide 56: Current Approaches
The users of this approach measured themselves against human
summarizers. Here we can see it did well, performing the same as the
humans.
Slide 57: Current Approaches
Well... nothing is perfect. What happened here is that the
compression was too conservative: it did not notice that it could
remove so much of the sentence, and in fact judged the entire sentence
to be important to the meaning. This is really an inflexibility of
the algorithm rather than an outright error (better safe than sorry
appears to be the approach here).
Slide 58: Current Approaches
Here is something very interesting: the two methods of compressing the
sentence behaved differently. The Noisy-Channel method was too
conservative, while the Decision-based method was too liberal...
AHHH!! The politics are here too! Unfortunately, we must regard the
too-liberal method as worse off, because it does not include all of
the information that the conservative one does, and so is a less
accurate summary. Though there is also the chance that the humans
were wrong...
Slide 59: Future Work
All of these have promise.
Noisy Channel: This appears to be the way to go. As the approach becomes more sophisticated (perhaps with better learning algorithms or other improvements), it could easily become a very accurate method.
Knowledge-Based: This is supposedly the way humans do it, so it should probably be the direction people head; the problem with this method is that it takes so long. One solution is faster computers, and faster computers come out every year, so it is foreseeable that we may eventually have enough computing power to make this practical. We already have some ideas here, too. The CYC project is based on using common-sense knowledge to perform intelligently, and one of the things humans appear to rely on heavily in summarization is common-sense knowledge of what really matters and what does not. In fact, one of the goals of the CYC project was to produce a machine that could learn not a specific topic, but how to learn. So this could quickly turn out to matter, and considering the speed of the CYC system, there may yet be a way to do knowledge-based summarization efficiently.
Other: Here we can think about the IE -> DB compression -> text
generation method. There are better and better algorithms for each of
these stages every year, which means that methods using them will keep
improving. It is also likely that not only the individual stages but
the ways of gluing them together will get better, improving this line
of work as well.
So, the short story here is that all of the current methods appear to be viable, and with increasing computing power and accumulated experience, all of them appear to be directions for the future. Something else interesting is that the separation between knowledge-based and selection-based systems may become less sharp in the future; the two methods could well be combined to produce better summarization systems. One can quickly see that if we first remove all of the extraneous data that has nothing to do with the topic, or that is simply 'icing', a previously too-slow knowledge-based method could then be run on the remaining data to produce a much better summary.
(Keri)
Slide 60: Summary
Text summarization has several different methods and subtasks
and, like most recent developments in the
area of Computational Linguistics, there is more to be
done to make automatic processes match human
expectations.
Links
Little Red Riding Hood - an example of something to summarize
MS-Word reportedly uses the ProSum summary method, which appears to
simply take the first sentences from each paragraph and output those
as a summary (an extract method).
Example of a MS-Word 97 Summary
SUMMARIST uses the method of summarization in which Information
Retrieval is used to get the key concepts, those concepts are then
interpreted, and finally a text generation method is used to produce
the summary.
SUMMARIST
Example
Bibliography
Endres-Niggemeyer, B. (1998). Summarizing Information. Springer, New York, NY.
Hovy, E. and Marcu, D. (2000). Automated Text Summarization Tutorial. Pre-conference tutorial at the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics (ACL-98), Montreal, Quebec, Canada.
Knight, K. and Marcu, D. (2000). Statistics-Based Summarization - Step One: Sentence Compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Austin, Texas.
Collins, M. (1997). Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), Madrid, Spain.
Luhn, H. P. (1958). The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2), 159-165.