Automatic Text Understanding of Content and Text Quality: a paper presented at the U.S. National Academy of Engineering's 2011 U.S. Frontiers of Engineering Symposium (excerpt), English original, 20120027-8

Automatic Text Understanding of Content and Text Quality
Ani Nenkova
University of Pennsylvania
Reading involves two rather different kinds of semantic processing. One is related to understanding what information is conveyed in the text and the other to appreciating the style of the text—how well or poorly it is written. For people, text content and stylistic quality are inextricably linked. For machines, robust understanding of written material has become feasible in many contexts but text quality has been out of reach so far. The mismatch matters a great deal because people rely on machines to locate and navigate information sources and increasingly read machine-generated text, for example as machine translations or text summaries. In this presentation I discuss some of the simple and elegant intuitions that have enabled semantic processing in machines, as well as some of the emerging directions in text quality assessment.
TEXT SEMANTICS (MEANING)
Reading and Understanding the Web
A single insight about language semantics has led to successes in a variety of automatic text understanding tasks. Words tend to appear in specific contexts, and these contexts convey rich information about the type of word, its meaning, and its connotation (Harris, 1968). Computers can learn much semantic information without human supervision simply by collecting statistics over (hundreds of) thousands of texts.
The context of a target word, consisting of other phrases or words that occur nearby in texts more often than expected by chance, is accumulated over large text collections. For example, the word tea may be characterized by the context [drink: 60, green: 55, milk: 40, sip: 30, enjoy: 10, . . .].
Each entry shows a word that appeared within five words before or after tea, and the number of times the pair was seen in a large text collection. Keeping just the counts of the context words makes the representation even more convenient, because standard geometric methods exist for comparing numeric vectors. In this manner, a machine can compute the similarity between any two words.
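To make the representation concrete, the following minimal Python sketch builds such context-count vectors over a toy four-sentence corpus and compares target words with cosine similarity. The corpus, the five-word window, and the word pairs are illustrative assumptions, not data from the paper.

    from collections import Counter
    import math

    def context_vectors(tokenized_texts, window=5):
        # For every word, count the words seen within `window` positions of it.
        vectors = {}
        for tokens in tokenized_texts:
            for i, word in enumerate(tokens):
                ctx = vectors.setdefault(word, Counter())
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        ctx[tokens[j]] += 1
        return vectors

    def cosine(u, v):
        # Cosine similarity between two sparse count vectors (Counters).
        dot = sum(u[w] * v[w] for w in u if w in v)
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    corpus = [
        "drink green tea and enjoy the aroma".split(),
        "sip green tea with milk".split(),
        "drink strong coffee with milk".split(),
        "sip hot coffee and enjoy it".split(),
    ]
    vecs = context_vectors(corpus)
    print(cosine(vecs["tea"], vecs["coffee"]))  # relatively high: drink, sip, milk shared
    print(cosine(vecs["tea"], vecs["strong"]))  # lower: few shared contexts

On a realistic corpus the same counting procedure, applied to (hundreds of) thousands of texts, yields the kind of neighbor lists shown next.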
Here is an example from Pantel and Lin (2002) of the 15 words most similar to wine computed by this approach:
Wine: beer, white wine, red wine, Chardonnay, champagne, fruit, food, coffee, juice, Cabernet, cognac, vinegar, Pinot noir, milk, vodka, . . .
The list may not look immediately useful, but it is certainly impressive if one considers how little surface similarity there is among the letter sequences wine, beer, and Chardonnay.
Building on these representations, it has become possible to automatically discover words with multiple senses by clustering the words most similar to them (plant: (plant, factory, facility, refinery) (shrub, ground cover, perennial, bulb)), as well as to find synonyms and antonyms.
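The sense-discovery step can be sketched as follows: if the distributional neighbors of plant are themselves represented as context-count vectors, clustering those vectors separates the "factory" neighbors from the "vegetation" neighbors. In the sketch below the count vectors are invented toy values, and scikit-learn's KMeans stands in for whatever clustering method a real system would use.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy context-count vectors for distributional neighbors of "plant".
    # Columns are hypothetical context words: [worker, production, garden, flower].
    neighbors = ["factory", "facility", "refinery", "shrub", "perennial", "bulb"]
    vectors = np.array([
        [40, 35,  1,  0],   # factory
        [30, 25,  2,  1],   # facility
        [25, 30,  0,  0],   # refinery
        [ 1,  0, 45, 30],   # shrub
        [ 0,  1, 35, 40],   # perennial
        [ 2,  0, 30, 25],   # bulb
    ])

    # Two clusters correspond to two senses: industrial vs. vegetation.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for sense in sorted(set(labels)):
        print(sense, [w for w, l in zip(neighbors, labels) if l == sense])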
To aid analysis of customer reviews, researchers at Google developed a large lexicon of almost 200,000 positive and negative words and phrases, identified through their similarity to a handful of predefined positive or negative words such as excellent, amazing, bad, horrible. Among the positive phrases in the automatically constructed lexicon were cute, fabulous, top of the line, melt in your mouth; negative examples included subpar, crappy, out of touch, sick to my stomach (Velikovich et al., 2010).

Another line of research in semantic processing exploits the stable meaning of some contexts. For example, a pattern like "X such as Y," if it occurs often in texts, is very likely an indicator that Y is a kind of X (i.e., "Red wines such as Cabernet and Pinot noir . . ."). Similarly, a phrase like "the mayor of X" is a good indicator that X is a city. NELL (Never Ending Language Learning, http://rtw.ml.cmu.edu/rtw/) is a system that constantly learns unary and binary predicates, corresponding to categories and relations such as isCity(Philadelphia) and playsInstrument(George_Harrison, guitar). The learning of each type of fact starts with minimal supervision in the form of several examples, given by the researchers, of category instances or of entities between which a relation holds. The system then starts an infinite loop in which it finds web pages that contain the examples, finds phrase patterns that typically occur with the examples, selects the patterns that indicate the predicate with high probability, and applies those patterns to new texts to discover more instances for which the predicate is true. Different flavors of this approach to machine understanding have been developed to support search and question answering (Etzioni et al., 2008; Pasca et al., 2006).
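As an illustration of pattern-based harvesting, the Python sketch below applies a crude version of the "X such as Y" pattern with a regular expression. A system like NELL uses noun-phrase chunking and statistical filtering rather than a single regex, so this is only a toy; the example sentences are invented.

    import re

    # A crude version of the "X such as Y" pattern: the category is the single
    # word before "such as"; the instances are capitalized tokens joined by
    # commas and "and". Real systems use noun-phrase chunking, so multiword
    # names like "Pinot noir" are beyond this toy.
    PATTERN = re.compile(
        r"(\w+)\s+such as\s+"
        r"((?:[A-Z]\w*)(?:(?:,\s*(?:and\s+)?|\s+and\s+)[A-Z]\w*)*)"
    )

    def extract_isa(text):
        # Return (instance, category) pairs harvested from "X such as Y".
        pairs = []
        for category, instances in PATTERN.findall(text):
            for inst in re.split(r",\s*(?:and\s+)?|\s+and\s+", instances):
                if inst:
                    pairs.append((inst, category))
        return pairs

    text = ("Wines such as Cabernet and Chardonnay age well. "
            "He visited cities such as Philadelphia, Boston, and Chicago.")
    print(extract_isa(text))
    # [('Cabernet', 'Wines'), ('Chardonnay', 'Wines'),
    #  ('Philadelphia', 'cities'), ('Boston', 'cities'), ('Chicago', 'cities')]

Applied to web-scale text and filtered for reliability, even a handful of such patterns can populate large numbers of predicates like isCity(Philadelphia).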
Reading and Understanding a Text
In the semantic processing I have discussed so far, the computer reads numerous documents with the objective of learning representations of words, building a lexicon of phrases with positive or negative connotation, or acquiring category instances and relations. A more difficult task for a computer is to understand a specific text.
Much traditional research on computer processing of a single text has relied on supervised techniques. Researchers invested effort in preparing collections in which human annotators marked positive and negative examples of a semantic distinction of interest. For example, they could mark the different senses of a word, the part of speech of each word, or the fact that Roger Federer is a person and Bulgaria is a country. Then features describing the context of the categories of interest would be extracted from the text, and a statistical classifier would use the positive and negative examples to combine the features and predict the same type of information in unseen text. More recently it has become clear that this supervised approach and the unsupervised approach, in which computers accumulate knowledge and statistics from large amounts of text, can be combined effectively to build better systems for semantic processing.

When reading a specific text, computers also need to resolve which entity in the document is referred to by pronouns such as "he/his," "she/her," and "it/its." Systems are far from perfect but are getting better at this task. Often the pronoun simply refers to a nearby noun phrase, as in "the professor prepared his lecture," but in other situations gender and number information is necessary to correctly resolve the pronoun, as in "John told Mary he had booked the trip." Machines can learn the likely gender of names and nouns rather accurately, again by reading large volumes of text and collecting co-occurrence statistics. Statistics about the co-occurrence of a pronoun of a given gender and the immediately preceding noun, or of honorifics and names (Mr. John Black, Mrs. Mary White), collected over thousands of documents, give surprisingly good guesses about the likely gender of nouns (Bergsma, 2005).
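The gender-statistics idea can be sketched as simple counting: scan text for honorific-plus-name patterns and for gendered pronouns occurring shortly after a name, and tally the evidence. The tiny corpus, regular expressions, and cue lists below are illustrative assumptions, not the actual method of Bergsma (2005).

    import re
    from collections import defaultdict

    # Tally evidence about the likely gender of names from two cheap cues:
    # an honorific before the name, and a gendered pronoun shortly after it.
    MALE_HON = {"Mr."}

    def gender_counts(sentences):
        counts = defaultdict(lambda: {"male": 0, "female": 0})
        for s in sentences:
            for hon, name in re.findall(r"\b(Mr\.|Mrs\.|Ms\.)\s+([A-Z]\w+)", s):
                counts[name]["male" if hon in MALE_HON else "female"] += 1
            for name, pron in re.findall(
                    r"\b([A-Z]\w+)\b[^.]{0,30}?\b(he|she|his|her)\b", s):
                counts[name]["male" if pron in ("he", "his") else "female"] += 1
        return counts

    corpus = [
        "Mr. Black said he would attend.",
        "Mrs. White noted that her flight was late.",
        "Black praised his team after the match.",
    ]
    for name, c in gender_counts(corpus).items():
        guess = "male" if c["male"] > c["female"] else "female"
        print(name, c, "->", guess)

Over thousands of documents rather than three sentences, such tallies become reliable enough to support pronoun resolution.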
TEXT QUALITY (STYLE)
Automatic assessment of text quality, or style, is far more difficult than the acquisition of semantics, or at least considerably less researched. Much of the effort in my lab has been focused on developing models of text quality.
I will discuss two successful endeavors: (1) prediction of general versus specific sentences, and (2) automatic assessment of sentence fluency in machine translation and of summary coherence in text summarization.

A well-written text contains a balanced mix of general overview statements and specific detailed sentences. If a text contains too many general sentences it will be perceived as insufficiently informative, while too much specificity can be confusing for the reader. To train a classifier, we exploited a resource of 1 million words of Wall Street Journal text with discourse annotations (Louis and Nenkova, 2011a). The discourse annotations specify, among other things, how two adjacent sentences in the text are related. There could be an implicit comparison between two statements (John is always punctual. Mary often arrives late.), a contingency (causal) relation (I hurt my foot. I cannot go dancing tonight.), or a temporal relation. One of the discourse relations annotated in the corpus is instantiation. It holds between two adjacent sentences where the second gives a specific example of information mentioned in the first, as in, "He is very smart. He solved the problem in five minutes." In all instances of the instantiation relation, we treated the first sentence as general and the second as specific.

We computed a number of features that, according to our intuition, would distinguish between the two categories. We expected that general sentences would be characterized by the presence of opinion or evaluative statements, as well as by unusual use of language that would later be interpreted or clarified in a specific sentence. Among the features were
• the length of the sentence.
• the number of opinion or subjective words, derived from existing dictionaries.
• the specificity of the words in the sentence, derived from corpus statistics as the fraction of documents in one year of New York Times articles that contain the word; the fewer the documents that contain a word, the more specific it is.
• mentions of numbers and of people, companies, and geographical locations; such mentions are detected automatically.
• syntactic features related to adjectives, adverbs, verbs, and prepositions.
• probabilities of sequences of one, two, or three consecutive words, computed over one year of New York Times articles.

A logistic regression classifier, trained on around 2,800 examples of general and specific sentences drawn from instantiation relations, learned to predict the distinction remarkably well. On a completely independent set of news articles, five different people were asked to mark each sentence as general or specific. For sentences on which all five annotators agreed about the class, the classifier predicted the correct class with 95 percent accuracy. For examples on which only four of the five annotators agreed, the accuracy was 85 percent. Over all examples, including sentences that people found hard to classify as general or specific, prediction accuracy was 75 percent. Moreover, the confidence of the classifier turned out to be highly correlated with annotator agreement, so it was possible to identify which sentences would not fit squarely into one of the classes. The degree of specificity assigned to a sentence by the classifier thus gives an accurate indication of how the sentence will be perceived by people.

Applying the general-or-specific classifier to samples of automatic and human summaries of clusters of news articles has demonstrated that machine summaries are overly specific and has indicated ways of improving system performance (Louis and Nenkova, 2011b). Word co-occurrence statistics and subjective language have also been successful in automatically distinguishing implicit comparison, contingency, and temporal discourse relations (Pitler et al., 2009). Identification of such relations is not only necessary for semantic processing of text; it is also required for robust assessment of text quality (Pitler and Nenkova, 2008). Finally, statistics on the types, lengths, and distances between verb, noun, and prepositional phrases, as well as probabilities of occurrence and co-occurrence of words, are highly predictive of the perceived quality of summaries (Nenkova et al., 2010).
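Returning to the general-versus-specific classifier, the toy sketch below shows how the recipe fits together: each sentence is mapped to a few of the features listed above (sentence length, a count of subjective words, and average word specificity from document frequencies), and a scikit-learn logistic regression is fit on labeled examples. The word lists, document frequencies, and six training sentences are all invented for illustration; the real classifier used richer features and about 2,800 annotated examples.

    import math
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented stand-ins for the paper's resources: a small subjective-word
    # list and document frequencies from an imaginary one-year news corpus.
    SUBJECTIVE = {"smart", "great", "terrible", "impressive", "remarkable"}
    DOC_FREQ = {"problem": 900, "minutes": 800, "solved": 300, "five": 950,
                "smart": 400, "company": 850, "quarter": 500, "percent": 700}
    N_DOCS = 1000  # documents in the imaginary corpus

    def features(sentence):
        words = sentence.lower().rstrip(".").split()
        subjective = sum(w in SUBJECTIVE for w in words)
        # Word specificity: the fewer documents contain a word, the higher
        # the value. Unseen words default to a low document frequency.
        specificity = np.mean(
            [-math.log(DOC_FREQ.get(w, 50) / N_DOCS) for w in words])
        return [len(words), subjective, specificity]

    train = [
        ("He is very smart.", "general"),
        ("The company did great.", "general"),
        ("Results were impressive overall.", "general"),
        ("He solved the problem in five minutes.", "specific"),
        ("Revenue rose 12 percent in the third quarter.", "specific"),
        ("The company hired 40 engineers in March.", "specific"),
    ]
    X = np.array([features(s) for s, _ in train])
    y = [label for _, label in train]

    clf = LogisticRegression().fit(X, y)
    test = "She is remarkable."
    print(clf.predict([features(test)])[0],
          clf.predict_proba([features(test)]).max())  # label and confidence

The predicted probability plays the role of the classifier confidence discussed above: examples the model is unsure about are the ones annotators tend to disagree on.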