Advancing Natural Language Understanding with Collaboratively Generated Content: paper presented at the U.S. National Academy of Engineering's 2011 U.S. Frontiers of Engineering Symposium (excerpt), English original, 20120027-8

Advancing Natural Language Understanding with Collaboratively Generated Content
Evgeniy Gabrilovich
Yahoo! Research
 
 
 
The proliferation of ubiquitous Internet access enables millions of Web users to collaborate online in a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this paper, we discuss how to use the cornucopia of world knowledge encoded in repositories of collaboratively generated content (CGC) to advance computers’ ability to process human language.
Prior to the advent of CGC repositories, many computational approaches to natural language employed the WordNet electronic dictionary (Fellbaum, 1998), which covers approximately 150,000 words painstakingly encoded by professional linguists over the course of more than 20 years. In contrast, the collaborative Wiktionary project (www.wiktionary.org) includes more than 2.5 million words in English alone. Encyclopaedia Britannica, published since 1768, has approximately 65,000 articles, while Wikipedia has over 3.7 million articles in English and over 15 million articles in over 200 other languages. Ramakrishnan and Tomkins (2007) estimated the amount of user-generated content produced worldwide on a daily basis to be 8-10 gigabytes, and this amount has likely increased considerably since then.
 
REPOSITORIES OF COLLABORATIVELY GENERATED CONTENT AS AN ENABLING RESOURCE
The unprecedented amounts of information in CGC enable new, knowledge-rich approaches to natural language processing, which are significantly more powerful than conventional word-based methods. Considerable progress has been made in this direction over the past few years. Examples include the explicit manipulation of human-defined concepts to augment the bag of words in information retrieval (Egozi et al., 2011) and the use of Wikipedia for better word sense disambiguation (Bunescu and Pasca, 2006; Cucerzan, 2007).
One way to use CGC repositories is to treat them as huge additional corpora, for instance, to compute more reliable term statistics or to construct comprehensive lexicons and gazetteers. They can also be used to extend existing knowledge repositories, increasing concept coverage and adding usage examples for previously listed concepts. Some CGC repositories, such as Wikipedia, record each and every change to their content, thus making the document authoring process directly observable. This abundance of editing information allows us to come up with better models of term importance in documents, on the assumption that terms introduced earlier in a document's life are more central to its topic. The recently proposed Revision History Analysis (Aji et al., 2010) captures this intuition to provide more accurate retrieval of versioned documents.
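As a rough illustration of this idea, the following Python sketch weights each term by how early it first entered a document's revision history. The revision data and the linear decay schedule are hypothetical; this is only the stated intuition, not the actual Revision History Analysis model of Aji et al. (2010).

```python
def revision_term_weights(revisions, decay=0.5):
    """Toy term-importance model over a document's revision history.

    revisions: list of token lists, ordered from oldest to newest.
    A term's weight depends on how early it first appeared: terms from
    the first revision get 1.0, and later additions decay linearly.
    """
    n = len(revisions)
    weights = {}
    for i, tokens in enumerate(revisions):
        for term in tokens:
            if term not in weights:  # record only the first appearance
                weights[term] = 1.0 - decay * (i / max(n - 1, 1))
    return weights

# Hypothetical revision history: "cosine" is a late addition, so it is
# weighted as less central than the terms present from the start.
revisions = [
    "information retrieval ranks documents".split(),
    "information retrieval ranks documents by relevance".split(),
    "information retrieval ranks documents by relevance using cosine scores".split(),
]
print(revision_term_weights(revisions))
```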
An even more promising research direction, however, is to distill the world knowledge from the structure and content of CGC repositories. This knowledge can give rise to new representations of texts beyond the conventional bag of words and allow reasoning about the meaning of texts at the level of concepts rather than individual words or phrases. Consider, for example, the following text fragment: “Wal-Mart supply chain goes real time.” Without relying on large amounts of external knowledge, it would be quite difficult for a computer to understand the meaning of this sentence. Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2009) offers a way to consult Wikipedia in order to fetch highly relevant concepts such as “Sam Walton” (the Wal-Mart founder); “Sears,” “Target,” and “Albertsons” (prominent competitors of Wal-Mart); “United Food and Commercial Workers” (a labor union that has been trying to organize Wal-Mart’s workers); and “hypermarket” and “chain store” (relevant general concepts). Arguably, the most insightful concept generated by consulting Wikipedia is “RFID” (radio frequency identification), a technology extensively used by Wal-Mart to manage its stock.
None of these concepts are explicitly mentioned in the given text fragment, yet when available they help shed light on the meaning of this short text. In the remainder of this article, I first discuss using CGC repositories for computing semantic relatedness of words and then proceed to higher-level applications such as information retrieval.
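To make the ESA mechanism concrete, here is a minimal Python sketch that builds a TF-IDF inverted index over a tiny, hand-made stand-in for Wikipedia and maps a text into the resulting concept space. The articles below are invented; a real implementation indexes a full Wikipedia dump and applies the weighting and pruning described by Gabrilovich and Markovitch (2009).

```python
import math
from collections import Counter, defaultdict

# Hypothetical miniature "Wikipedia": concept title -> article text.
ARTICLES = {
    "RFID": "radio frequency identification tags track stock in a supply chain",
    "Hypermarket": "a hypermarket is a large store combining a supermarket and a department store",
    "Sam Walton": "sam walton founded the wal-mart retail chain",
}

def build_index(articles):
    """Inverted index mapping each word to {concept: tf-idf weight}."""
    doc_freq = Counter()
    term_freqs = {}
    for concept, text in articles.items():
        tf = Counter(text.lower().split())
        term_freqs[concept] = tf
        doc_freq.update(tf.keys())
    n = len(articles)
    index = defaultdict(dict)
    for concept, tf in term_freqs.items():
        for word, count in tf.items():
            index[word][concept] = count * math.log(n / doc_freq[word])
    return index

def esa_vector(text, index):
    """Represent a text as a weighted vector of Wikipedia concepts."""
    vec = Counter()
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return vec

index = build_index(ARTICLES)
print(esa_vector("Wal-Mart supply chain goes real time", index).most_common())
```

Even in this toy setting, the query about Wal-Mart activates the "Sam Walton" and "RFID" concepts through shared vocabulary, which is exactly the kind of implicit association described above.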
 
COMPUTING SEMANTIC SIMILARITY OF WORDS AND TEXTS
How related are “cat” and “mouse”? And what about “preparing a manuscript” and “writing an article”? Reasoning about the semantic relatedness of natural language utterances is routinely performed by humans but remains challenging for computers. Humans do not judge text relatedness merely at the level of text words.
Words trigger reasoning at a much deeper level that manipulates concepts: the basic units of meaning that humans use to organize and share their knowledge. Thus, humans interpret the specific wording of a document in the much larger context of their background knowledge and experience. Prior work on semantic relatedness was based either on purely statistical techniques that did not make use of background knowledge (Deerwester et al., 1990) or on lexical resources that incorporate limited knowledge about the world (Budanitsky and Hirst, 2006). CGC-based approaches differ from the former in that they manipulate concepts explicitly defined by humans, and from the latter in the sheer number of concepts and the amount of background knowledge.

One class of new approaches to computing semantic relatedness uses the structure of CGC repositories, such as category hierarchies (Strube and Ponzetto, 2006) or links among the concepts (Milne and Witten, 2008). Given a pair of words whose relatedness needs to be assessed, these methods map them to relevant concepts (e.g., articles in Wikipedia) and then use the structure of the repository to compute the relatedness between these concepts. Gabrilovich and Markovitch (2009) proposed an alternative approach that uses the entire content of Wikipedia and represents the meaning of words and texts in the space of Wikipedia concepts. Their method, ESA, represents texts as weighted vectors of concepts. The meaning of a text fragment is thus interpreted in terms of its affinity with a host of Wikipedia concepts.
Computing semantic relatedness of texts then amounts to comparing their vectors in the space defined by the concepts, for example, using the cosine metric. Subsequently proposed approaches offer ways to combine the structure-based and concept-based methods in a principled manner (Yeh et al., 2009). Beyond Wikipedia, Zesch et al. (2008) proposed a method for computing semantic relatedness of words using Wiktionary. Recently, Radinsky et al. (2011) proposed a way to augment the knowledge extracted from CGC repositories with temporal information by studying patterns of word usage over time. Consider, for example, an archive of the New York Times spanning 150 years. Two words such as “war” and “peace” might rarely co-occur in the same articles, yet their patterns of use over time might be similar, which allows us to better judge their true relatedness.
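A small sketch may help: both the concept-based and the temporal comparisons reduce to similarity between sparse vectors. The concept weights and frequency series below are invented, and plain cosine is only a crude stand-in for the temporal matching actually used by Radinsky et al. (2011).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up ESA vectors (concept -> weight) for two short texts.
manuscript = {"Manuscript": 2.1, "Academic publishing": 1.4, "Peer review": 0.6}
article = {"Academic publishing": 1.7, "Essay": 0.9, "Peer review": 0.8}
print("concept relatedness:", round(cosine(manuscript, article), 3))

# The same comparison applies to temporal profiles (year -> relative
# frequency) extracted from a news archive: "war" and "peace" may rarely
# co-occur, yet their usage over time rises and falls together.
war = {1914: 0.9, 1939: 1.0, 1945: 0.8, 1969: 0.3}
peace = {1918: 0.7, 1939: 0.6, 1945: 0.9, 1969: 0.4}
print("temporal relatedness:", round(cosine(war, peace), 3))
```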
 
CONCEPT-BASED INFORMATION RETRIEVAL
Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between those related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Previous approaches have attempted to tackle these difficulties by using manually built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. ESA, introduced in the previous section, which represents the meaning of texts in a very high-dimensional space of Wikipedia concepts, has been shown to outperform the previous state-of-the-art algorithms. In contrast to the task of computing semantic relatedness, which usually deals with short texts whose overlap is often empty, information retrieval usually deals with longer documents. It is noteworthy that in such cases optimal results are obtained by extending the bag of words with concepts rather than relying on the conceptual representation alone.
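A minimal sketch of this word-plus-concept indexing, assuming some ESA mapper like the esa_vector sketch above (the stub below is hypothetical): the top concepts are appended to the token list as namespaced pseudo-terms before indexing.

```python
def augmented_terms(text, esa_vector_fn, top_k=3):
    """Bag of words extended with the top-k ESA concepts as pseudo-terms.

    esa_vector_fn maps a text to {concept: weight}. Concepts are
    namespaced so they cannot collide with ordinary words in the index.
    """
    terms = text.lower().split()
    concepts = sorted(esa_vector_fn(text).items(), key=lambda kv: -kv[1])
    terms += ["CONCEPT:" + concept for concept, _ in concepts[:top_k]]
    return terms

# Toy stand-in for an ESA mapper; documents and queries are expanded the
# same way, so they can match on shared concepts with no word overlap.
stub = lambda text: {"Supply chain management": 2.0, "RFID": 1.5}
print(augmented_terms("Wal-Mart supply chain goes real time", stub))
```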
Intuitively, one might expect domain-specific knowledge to be key for processing texts in terminology-rich domains such as medicine. However, as Gabrilovich and Markovitch (2007) showed, it is general-purpose knowledge that leads to much higher improvements in text classification accuracy. In a follow-up article (Gabrilovich and Markovitch, 2009), the authors also showed that using larger repositories of knowledge (e.g., later Wikipedia snapshots) leads to superior performance as more knowledge becomes available.
Potthast et al. (2008) and Sorg and Cimiano (2008) independently proposed CL-ESA, a cross-lingual extension of ESA. Using the cross-language links available between a growing number of Wikipedia articles, the approach makes it possible to map the meaning of texts across different languages. This allows, for example, a query to be formulated in one language and then used to retrieve documents written in a different language.
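A minimal sketch of the CL-ESA mapping step, assuming a hypothetical table of cross-language links between article titles: an ESA vector built over English concepts is re-labeled into the Spanish concept space, where it can be compared against Spanish documents.

```python
# Hypothetical cross-language links: English article title -> Spanish title.
CROSS_LINKS = {
    "Supply chain": "Cadena de suministro",
    "RFID": "RFID",
    "Hypermarket": "Hipermercado",
}

def map_vector(vec, links):
    """Project an ESA concept vector into another language's concept space.

    Concepts without a cross-language link are dropped, mirroring how
    CL-ESA restricts itself to articles linked across Wikipedia editions.
    """
    return {links[concept]: weight
            for concept, weight in vec.items() if concept in links}

# The mapped query vector becomes comparable (e.g., by cosine) with ESA
# vectors of Spanish documents, enabling cross-language retrieval.
english_query = {"Supply chain": 2.3, "RFID": 1.1, "Sam Walton": 0.7}
print(map_vector(english_query, CROSS_LINKS))
```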
 
CONCLUSION
Publicly available repositories of collaboratively generated content encode massive amounts of human knowledge about the world. In this paper, we showed that the structure and content of these repositories can be used to augment the representation of natural language texts with information that cannot be deduced from the input text alone.
Using knowledge from CGC repositories leads to double-digit accuracy improvements in a range of tasks, from computing semantic relatedness of words and texts to information retrieval and text classification. The most important aspect of using exogenous knowledge is its ability to address synonymy and polysemy, which are arguably the two most important problems in natural language processing. The former manifests itself when two texts discuss the same topic using different words, so the conventional bag-of-words representation is not able to identify this commonality. Conversely, the mere fact that two texts contain the same polysemous word does not necessarily imply that they discuss the same topic, since that word could be used in the two texts with two different meanings. We believe that concept-based representations are so successful because they allow generalizations and refinements, which partially address synonymy and polysemy.