FAQ for ConceptDoppler
Press releases: U.C. Davis, University of New Mexico.
Press: BBC, Schneier on Security, EWeek, Slashdot, The California Aggie, Epoch Times, Ars Technica, WSJ Business Technology Blog, Albuquerque Tribune, Daily Lobo
In other languages: Vietnamese, 2, 3, Spanish, Chinese (Big5), (GB), Czech, German, French, Japanese, Norwegian, Arabic, Romanian
Paper: here.
E-mail: conceptREMOVECAPITALSdoppler@gmail.com.
What is the GFC?
The Great Firewall of China (GFC) is a system set up by the Chinese government for censoring the Internet. It comprises many techniques, including IP address blocking, DNS redirection, and keyword filtering.
Keyword filtering can be done to prevent the transmission of blacklisted keywords or to flag pages for manual inspection. It can be implemented in specific programs (e.g. QQChat, a popular chat client), blog sites, e-mail, or in Internet routers.
All of our results are specific to HTTP keyword filtering by Internet routers. This filtering is applied to World Wide Web traffic. We chose to focus on a specific mechanism and learn as much as possible about that mechanism alone. For more general information on other mechanisms, we refer you to The Open Net Initiative.
What is ConceptDoppler?
ConceptDoppler is a weather tracker for Internet censorship. Using ConceptDoppler we can track the list of keywords that a government uses to censor Internet traffic. For GFC keyword filtering, we can also locate the routers performing filtering and deduce the architecture of this censorship mechanism.
Why do you call it ConceptDoppler?
We use Latent Semantic Analysis (LSA) to prioritize the words we check. Just as an understanding of the mixing of gases led to effective weather tracking, understanding the relationship between sensitive concepts and blocked keywords will lead to more effective tracking of Internet censorship. More details are available in the paper.
Is the GFC a "fraud?"
We do not consider the GFC to be a "fraud," nor were we the originators of that particular language. The hypothesis set forth in our paper is that the GFC is effective in a very different way, because it acts as a Panopticon and not strictly as a firewall. The Chinese government does not make any specific claims as to the GFC's effectiveness, so we have not debunked any specific claim. We are attempting to clear up a misconception: that all information is necessarily filtered strictly at the border of the Chinese Internet without exception.
What is really interesting about this work, i.e., what are your suggested soundbites?
- 1) That the GFC's keyword filtering mechanism is not a firewall that peremptorily blocks at the border of the Chinese Internet (blocking may occur deeper into the Chinese Internet, may be sporadic during busy Internet periods, and for 28% of the paths we tested did not occur at all);
- 2) The keywords that are targeted with censorship are surprising, and include words such as (in Chinese) conversion rate, Mein Kampf, Hitler, Deauville, and many other unexpected results; and
- 3) Keyword filtering is a seemingly precise mechanism with imprecise results that have unique implications: a great deal more content is filtered than the censors intended. Examples are any web page that contains the keyword (in Chinese) massacre, or that contains (in Chinese) North Rhine-Westphalia, which, when spelled in the Chinese characters used for foreign words, appears to contain the word "falun" that is meant to target content about Falun Gong.
Why is knowing the keyword blacklist important?
A major reason why knowing the blacklist is important is that we can compare a particular government's application of Internet censorship to their stated purposes. For example, consider this page that quotes U.S. Congressman Jim Leach:
Even though the Chinese Constitution requires that restrictions on freedom
of speech and press be openly legislated and transparently applied, he
said, "In reality, restrictions imposed by officials are often premised
upon ill-defined concepts of 'social stability,' 'state security,' and
'sedition' that mask what is in fact mere intolerance of dissent."
Knowing exactly what is on the keyword blacklist can help us to support or refute these kinds of statements. For example, the fact that keywords such as (in Chinese) conversion rate, Mein Kampf, Hitler, and Deauville (where the Asian Film Festival is located) appear on the list lends support to Congressman Leach's statement.
Did you call the GFC a "Panopticon" to imply that the Chinese Internet is a prison?
We never intended this sort of political connotation. The target audience of our paper was other technical researchers whose research focus is on privacy and censorship, and we were trying to communicate that censorship mechanisms do not have to be 100% effective to fulfill the goals of the censor. This is important because it means that evading the GFC is a harder problem. While evading a firewall a single time defeats its purpose, it would be necessary to evade a Panopticon almost every time.
Are you trying to "topple" the Great Firewall of China?
No. We are trying to understand it, and to help those who care about forming policy concerning Internet censorship (in all countries around the world) to understand it as well. There are many lessons to be learned about keyword filtering and the technical challenges of Internet router filtering, and these lessons have broad implications.
If we choose to do any research about evasion in the future it will be only to aid in understanding.
Are your efforts aimed against China and/or against Censorship?
No. Different members of the ConceptDoppler team have different political opinions on various issues, including government Internet censorship. We have tried very hard to not politicize our technical research, but this is very difficult and sometimes particular opinions come out in a choice of wording or in our answers to questions from the media. Also, sometimes media reports offer their own interpretation of the results of a technical study on which they are reporting. We are happy to see this variety of opinions in the discussion of our research, but encourage you to read our paper and the original press release before attributing any particular statements to the ConceptDoppler team.
What about censorship in other countries?
We chose to focus on China because the GFC's keyword filtering implementation, where reset packets are sent in both directions, enabled the type of probing we wanted to do, and because the GFC is the most elaborate Internet censorship mechanism. The Open Net Initiative is a good source of general information about Internet censorship in other countries and censorship mechanisms other than keyword filtering.
Do your results reflect on the GFC as a whole?
Without similarly testing other components of the GFC (e.g., IP blocking or e-mail keyword filtering), the only definitive conclusions we can reach are specific to HTTP keyword filtering.
Why focus on keyword filtering?
One reason is that the GFC's keyword filtering mechanism allows us to probe it from outside of China. This allows us to perform two types of probing: 1) to see where in the Chinese Internet the filtering routers are located, and 2) to test keywords to find out if they are blocked or not. Locating the filtering routers can give us insights into how and why the filtering is implemented, such as our proposition that the GFC is more of a Panopticon than a firewall. Reverse-engineering the blacklist of keywords can tell us what topics the government is targeting, with sometimes surprising results such as Hitler and Mein Kampf (in Chinese) or the Deauville Asian Film Festival.
A more important reason is that keyword filtering is unique compared to other forms of Internet censorship. Filtering a keyword can lead to censorship of many more topics than intended. For example, filtering for Falun Gong (in Chinese) can lead to censorship of articles about North Rhine-Westphalia, a state in western Germany.
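The over-blocking effect can be illustrated with a minimal sketch, assuming the filter does plain substring matching. The one-entry blacklist, the page texts, and the function name are all hypothetical; the romanized strings are stand-ins for the Chinese characters and are not taken from the actual blacklist.

```python
# Hypothetical one-entry blacklist; a romanized stand-in for the Chinese keyword.
BLACKLIST = ["falun"]


def is_filtered(text, blacklist=BLACKLIST):
    """Return True if any blacklisted keyword occurs anywhere in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in blacklist)


# Stand-in page texts. The second one is unrelated to the targeted topic but
# happens to contain the same character sequence (made-up transliteration).
pages = {
    "targeted": "an article about falun gong",
    "collateral": "tourism in north rhine westfalun",
    "unaffected": "weather report for albuquerque",
}

results = {name: is_filtered(text) for name, text in pages.items()}
for name, blocked in results.items():
    print(name, "filtered" if blocked else "passed")
```

A substring match cannot tell the targeted page from the collateral one, which is why a single keyword can censor far more than its intended topic.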
Why did you choose the term "Panopticon?"
We wanted to capture the idea that the GFC is a mechanism that promotes self-censorship, and a Panopticon was a convenient analogy for doing this. The idea behind a Panopticon is that those being watched modify their behavior precisely because they are not sure if others may be watching them. The fact that the original coinage of the term Panopticon describes a prison design was not an important part of our choice of wording. We are following the more abstract usage of Michel Foucault in his book "Discipline and Punish: The Birth of the Prison."
Many experts had made this observation about the GFC, that it promotes self-censorship, before we did our work, but we chose the term "Panopticon" as an alternative to "firewall," since our results clearly show that the GFC is not a "firewall" in the strict sense of the term. A firewall would block all offending traffic at the border of the country's Internet. Our work adds quantitative measurements to an idea that had already been expressed by many experts about the GFC.
The opinions of Chinese Internet users vary widely, but many that we have talked to feel that keyword filtering is applied not just to block access to web sites but also to log Internet traffic for review by law enforcement. This is (as of September 2007) stated as a fact on the Chinese-language version of Wikipedia: http://zh.wikipedia.org/wiki/GFW. Whether this is rumor or fact, the effect that such a belief has on the Internet usage behavior of those who believe it is real. The fact that some Chinese Internet users regularly use proxies and other evasion techniques to get around the censorship does not change this fact.
What assumptions are you making by testing only GET requests from outside of China?
There are six possibilities for keyword filtering of HTTP traffic: both GET requests (when you request a web page from a web server, in other words) and HTML responses (the actual web page that comes back to you) may be filtered; and this might occur for U.S.-to-China traffic, China-to-U.S. traffic, and China-to-China traffic. We believe the keyword filtering is symmetric in two ways: it does not matter to the filtering router which direction the traffic is going, and the blacklist of keywords is the same for GET requests and HTML responses. We also believe that China-to-China connections can be subject to keyword filtering, but due to router placement this is less likely than for international connections. We plan to verify these assumptions, and would be interested in any input you might have (conceptREMOVECAPITALSdoppler@gmail.com).
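The six cases are simply two message types crossed with three traffic directions. The short sketch below enumerates them; treating "GET requests from outside of China" as the U.S.-to-China GET case is our reading of the question above, and the variable names are illustrative.

```python
from itertools import product

MESSAGE_TYPES = ["GET request", "HTML response"]
DIRECTIONS = ["U.S.-to-China", "China-to-U.S.", "China-to-China"]

# Every combination of message type and traffic direction.
cases = list(product(MESSAGE_TYPES, DIRECTIONS))

# Assumption: probing with GET requests from outside China covers this one case.
tested = ("GET request", "U.S.-to-China")
untested = [case for case in cases if case != tested]

for message, direction in cases:
    status = "tested directly" if (message, direction) == tested else "relies on assumptions"
    print(f"{message}, {direction}: {status}")
```

The remaining five cases are covered only to the extent that the symmetry assumptions described above hold.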
How does the keyword filtering work?
From our news release: "In 2006, a team at the University of Cambridge, England, discovered that when the Chinese system detects a banned word in data travelling across the network, it sends a series of three 'reset' commands to both the source and the destination. These 'resets' effectively break the connection." For more details see our paper or Clayton et al., "Ignoring the Great Firewall of China," at the 6th Workshop on Privacy Enhancing Technologies, 2006.
Much work is needed before we can answer specific questions, such as whether the filtering routers reconstruct TCP flows or simply scan each packet and ignore the possibility that a keyword is broken up across two packets. Also, we believe that the keyword filtering implementation is heterogeneous, meaning that it works differently in different places even if the keyword list appears to be the same, and both we and the Cambridge researchers have found the implementation to change over time. One example is that their research concluded that no SYN/SYN-ACK handshake was necessary for keyword filtering to occur, while we found this handshake to be necessary in at least some places. The GFC implementation might have changed between their experiments and ours.
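To make the open question about packet scanning concrete, here is a minimal sketch contrasting the two behaviors. It does not describe how the GFC actually works; the function names and the example payloads (including the URL layout) are assumptions for illustration, and joining payloads in order is only a stand-in for real TCP flow reconstruction.

```python
KEYWORD = b"falun"  # the example keyword used elsewhere in this FAQ


def per_packet_match(packets, keyword=KEYWORD):
    """Scan each packet payload on its own; a keyword split across packets is missed."""
    return any(keyword in payload for payload in packets)


def reassembled_match(packets, keyword=KEYWORD):
    """Join payloads in order (stand-in for flow reconstruction), then scan once."""
    return keyword in b"".join(packets)


# Hypothetical GET request carried in one packet, and the same bytes split in two.
whole = [b"GET /search?q=falun HTTP/1.1\r\n"]
split = [b"GET /search?q=fal", b"un HTTP/1.1\r\n"]

for label, packets in (("whole", whole), ("split", split)):
    print(label, per_packet_match(packets), reassembled_match(packets))
```

A probe that splits a known keyword across two packets would be reset only by a filter that behaves like `reassembled_match`, so comparing the two outcomes is one way such a question could be tested.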
Why not ask the Chinese government for the blacklist and the locations of the filtering routers?
We doubt that the Chinese government would make this information public. We base this on several events that have been reported in the media. The first was the release of the QQChat keyword blacklist, where the hackers responsible for extracting this list from the QQChat software chose to remain anonymous. Given the media attention this release received, if the HTTP list were easy to obtain we believe that the media would have obtained it already. Comparing the first keyword list produced by ConceptDoppler to this list, it is apparent that the HTTP list is different or has changed a great deal; for example, Hitler and Mein Kampf were not on the 2005 QQChat list. Also, dissidents' names and other keywords may be added to the list and removed based on current events, making it necessary to track the list over time. Finally, it appears that many companies that are complicit in keyword censorship are unwilling to release the keywords; for example, "[No] Skype executive, however, has clarified exactly which regulations are being complied with or which keywords are involved." If the companies are unwilling to release the blacklist of keywords, it is unlikely that the government would be willing to do so.
Furthermore, if the information were made public it would still be necessary to verify it. Also, ConceptDoppler can be applied to any form of keyword-based censorship where an answer is returned as to whether a word was censored or not.
If you know who to contact to get the up-to-date blacklist and the locations of the filtering routers, please e-mail us (conceptREMOVECAPITALSdoppler@gmail.com) and let us know.
Have you tested every word?
We have not. Rather than testing every word, we test words that are related to concepts that are known to be controversial or blocked. Efficiency in terms of the number of words that we test is critical for several reasons. One reason is that continually testing millions of terms that will probably never be blacklisted hinders our ability to discover a new keyword as soon as possible after it is added to the blacklist. Another important reason is that probing is invasive, since it uses the network resources of others.
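The idea of testing only words related to a sensitive concept can be sketched as a ranking step. The example below is not our actual pipeline: it assumes each word already has a vector (for instance from LSA, which is not shown here), ranks candidates by cosine similarity to a concept vector, and keeps the top few for probing. The vectors, word labels, and function names are toy placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(concept_vec, candidates, top_n):
    """Return the top_n candidate words most similar to the concept vector."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(concept_vec, item[1]),
        reverse=True,
    )
    return [word for word, _ in scored[:top_n]]


# Toy vectors standing in for LSA output; the labels are placeholders.
concept = [1.0, 0.0, 0.0]
candidates = {
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.0, 1.0, 0.0],
    "term_c": [0.5, 0.5, 0.0],
    "term_d": [0.0, 0.0, 1.0],
}

to_probe = rank_candidates(concept, candidates, top_n=2)
print(to_probe)
```

Probing only the highest-ranked words keeps the number of probes small, which matters both for discovering new keywords quickly and for limiting the load placed on other people's networks.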
Can I try it?
If you are curious about just trying out some of our words then you can. The simplest way to test a word is to enter it into www.yahoo.cn. For instance, if you search for 'falun' in www.yahoo.cn you will, most likely, receive an error message saying that your connection has been reset. Typically, if you do receive a reset, you will remain blocked from yahoo.cn for approximately 90 seconds.
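For readers who would rather script this than use a browser, the following is a rough sketch and not the ConceptDoppler tool itself. It sends an HTTP GET whose URL carries the keyword, treats a connection reset as a sign of possible filtering, and waits out the roughly 90-second block before the next probe. The URL layout, function names, and parameters are assumptions; a reset can also have unrelated causes, so a single result proves little.

```python
import socket
import time
from urllib.parse import quote

COOLDOWN_SECONDS = 90  # approximate block duration reported after a reset


def build_get_request(host, keyword):
    """Build a raw HTTP GET with the keyword in the URL (path layout is hypothetical)."""
    path = "/search?q=" + quote(keyword)  # non-ASCII keywords become percent-encoded UTF-8
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode("ascii")


def probe(host, keyword, port=80, timeout=10.0):
    """Return 'reset' if the connection is reset, otherwise 'no reset'.

    Timeouts and other network errors are not handled here and will propagate.
    """
    request = build_get_request(host, keyword)
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            while conn.recv(4096):  # drain the response until the server closes
                pass
    except ConnectionResetError:
        return "reset"
    return "no reset"


def probe_all(host, keywords, probe_fn=probe, sleep_fn=time.sleep):
    """Probe keywords one at a time, pausing after each reset so the block can expire."""
    results = {}
    for keyword in keywords:
        outcome = probe_fn(host, keyword)
        results[keyword] = outcome
        if outcome == "reset":
            sleep_fn(COOLDOWN_SECONDS)
    return results
```

Calling `probe_all(host, ["falun"])` with the host you want to test would attempt one probe. Please keep the number of probes small, since probing uses the network resources of others.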
What should I do if I find new words that are not on your list?
We would be very interested to hear about new words that you might have stumbled upon or have special insight into. Please e-mail us at conceptREMOVECAPITALSdoppler@gmail.com and let us know. We can feed each new word into our Doppler engine to search for a whole new batch of words.
How can I contribute?
If you have ideas about topics or concepts that might be blocked, please let us know. You can e-mail us at conceptREMOVECAPITALSdoppler@gmail.com.
What is a Panopticon?
The original Panopticon was a prison design developed by the English philosopher Jeremy Bentham in the 18th century. Bentham proposed that a central observer would be able to watch all the prisoners, while the prisoners would not know when they were being watched.
Is ConceptDoppler open source?
Please e-mail us at conceptREMOVECAPITALSdoppler@gmail.com with any inquiries about the source code. Also, all packets (with timestamps) from the Internet measurement part of the research were saved in an SQL database. If you want to attempt to duplicate our results or use this for some other research purpose we can make the database available to you.
From North America/Europe, it appears that yahoo.cn is subject to keyword filtering and google.cn is not. What does this mean?
We are not sure. If you know, or have tested this from another continent, please e-mail us (conceptREMOVECAPITALSdoppler@gmail.com).
One possibility is that there is a router between where you are testing from and yahoo.cn where keyword filtering is being applied, but there is no such router between your source location and google.cn. Whether keyword filtering is applied depends on the path that the packets take, and whether that path contains at least one router that performs keyword filtering. For example, we see keyword filtering of GET requests destined for yahoo.cn, but there is apparently no filtering of HTML responses from yahoo.cn, not even for keywords that we know to be filtered in HTML responses coming from other places. Internet routes are often asymmetric, meaning that there may be a filtering router along our route to yahoo.cn but no filtering router along yahoo.cn's route to us.
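The path-dependence described above can be expressed as a small sketch. The router names, the paths, and the set of filtering routers are invented for illustration and do not correspond to any measured topology.

```python
def path_is_filtered(path, filtering_routers):
    """A path is subject to filtering if at least one router on it filters keywords."""
    return any(router in filtering_routers for router in path)


# Hypothetical topology: only router "r2" performs keyword filtering.
FILTERING_ROUTERS = {"r2"}

# Asymmetric routing: requests and responses traverse different routers.
forward_path = ["r1", "r2", "r3"]  # client -> server, carries the GET request
reverse_path = ["r3", "r4", "r1"]  # server -> client, carries the HTML response

print("GET request filtered:", path_is_filtered(forward_path, FILTERING_ROUTERS))
print("HTML response filtered:", path_is_filtered(reverse_path, FILTERING_ROUTERS))
```

In this toy topology the GET request crosses a filtering router while the response does not, which matches the asymmetry we observe for yahoo.cn.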
Another possible explanation as to why google.cn does not seem to be subject to keyword filtering:
"Since announcing its intent to comply with Internet censorship laws in
the People's Republic of China, Google China has been the focus of
controversy over what critics view as capitulation to the 'Golden
Shield Project' (also known as the Great Firewall of China). Because
of its self-imposed censorship, whenever people search for interdicted
Chinese keywords on a blocked list maintained by the PRC government,
google.cn will display the following at the bottom of the page
(translated): In accordance with local laws, regulations and policies,
part of the search result is not shown."