Mark Hopkins
Senior Research Scientist
Conference Papers (Refereed)

We present a simple preordering approach for machine translation based on a feature-rich logistic regression model that predicts whether two children of the same node in the source-side parse tree should be swapped. Given the pairwise children regression scores, we conduct an efficient depth-first branch-and-bound search through the space of possible children permutations, avoiding the use of a cascade of classifiers or a limited list of possible ordering outcomes. We report experiments in translating English to Japanese and Korean, demonstrating superior performance: (a) the number of crossing links drops by more than 10% absolute with respect to other state-of-the-art preordering approaches, (b) BLEU scores improve by 2.2 points over the baseline with a lexicalised reordering model, and (c) decoding can be carried out 80 times faster.
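The permutation search described above can be sketched in a few lines. This is a minimal illustration under assumed inputs, not the paper's implementation: `score[i][j]` stands in for the model's pairwise preference that child `i` precede child `j`, and the branch-and-bound prune uses the obvious optimistic bound in which every undecided pair takes its better orientation.

```python
import math

def best_permutation(score):
    """Depth-first branch-and-bound over permutations of n children.

    score[i][j] is a (hypothetical) pairwise preference score for
    placing child i before child j; we maximize the total score of
    all ordered pairs in the permutation.
    """
    n = len(score)
    best = [-math.inf, None]  # best value found, best permutation

    def dfs(prefix, remaining, value):
        if not remaining:
            if value > best[0]:
                best[0], best[1] = value, list(prefix)
            return
        # Exact contribution of (prefix, remaining) pairs: every
        # remaining child will land after every prefix child.
        fixed = sum(score[p][r] for p in prefix for r in remaining)
        # Optimistic bound for still-undecided remaining pairs.
        free = sum(max(score[i][j], score[j][i])
                   for a, i in enumerate(remaining)
                   for j in remaining[a + 1:])
        if value + fixed + free <= best[0]:
            return  # prune: this branch cannot beat the incumbent
        for k, nxt in enumerate(remaining):
            gain = sum(score[p][nxt] for p in prefix)
            dfs(prefix + [nxt], remaining[:k] + remaining[k + 1:],
                value + gain)

    dfs([], list(range(n)), 0.0)
    return best[1]
```

For the typically small fan-out of parse-tree nodes this search is fast, and the bound prunes most of the n! orderings for wider nodes.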

What do we want to learn from a translation competition and how do we learn it with confidence? We argue that a disproportionate focus on ranking competition participants has led to lots of different rankings, but little insight about which rankings we should trust. In response, we provide the first framework that allows an empirical comparison of different analyses of competition results. We then use this framework to compare several analytical models on data from the Workshop on Machine Translation (WMT).

We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO's scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large-scale data scenarios.
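The reduction at the heart of PRO is easy to see in code. The sketch below is illustrative, not the paper's implementation: a stand-in metric score replaces sentence-level BLEU, and a plain perceptron replaces the off-the-shelf binary classifier, but the key move is the same: sampled pairs of hypotheses become classification examples over feature *differences*.

```python
import random

def pro_weights(hyps, samples=20, seed=0):
    """Pairwise ranking optimization (PRO) sketch.

    `hyps` is a list of (feature_vector, metric_score) pairs for one
    source sentence's n-best list. For each sampled hypothesis pair,
    the feature difference is a positive example iff the first
    hypothesis scores higher under the metric; a perceptron stands
    in for the off-the-shelf linear classifier.
    """
    rng = random.Random(seed)
    dim = len(hyps[0][0])
    w = [0.0] * dim
    for _ in range(samples):
        i, j = rng.sample(range(len(hyps)), 2)
        (fi, gi), (fj, gj) = hyps[i], hyps[j]
        if gi == gj:
            continue  # no preference, no training example
        x = [a - b for a, b in zip(fi, fj)]   # feature difference
        y = 1 if gi > gj else -1              # metric's preference
        margin = y * sum(wk * xk for wk, xk in zip(w, x))
        if margin <= 0:                        # misranked pair: update
            w = [wk + y * xk for wk, xk in zip(w, x)]
    return w
```

Because the learner only ever sees difference vectors, any binary classifier over real-valued features can be dropped in, which is what makes the method easy to bolt onto an existing tuning loop.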

Extraction Programs: A Unified Approach to Translation Rule Extraction

Conventional wisdom dictates that synchronous context-free grammars (SCFGs) must be converted to Chomsky Normal Form (CNF) to ensure cubic-time decoding. For arbitrary SCFGs, this is typically accomplished via the synchronous binarization technique of Zhang et al. (2006). A drawback to this approach is that it inflates the constant factors associated with decoding, and thus the practical running time. DeNero et al. (2009) tackle this problem by defining a superset of CNF called Lexical Normal Form (LNF), which also supports cubic-time decoding under certain implicit assumptions. In this paper, we make these assumptions explicit, and in doing so, show that LNF can be further expanded to a broader class of grammars (called "scope-3") that also supports cubic-time decoding. By simply pruning non-scope-3 rules from a GHKM-extracted grammar, we obtain better translation performance than synchronous binarization.
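The pruning criterion is simple to compute. The helper below is a sketch under an assumed rule encoding (the source-side RHS as a list of symbols, with uppercase strings marking nonterminals — a convention of this illustration, not the paper's): a rule's scope counts its boundary nonterminals plus its adjacent nonterminal pairs, and a grammar whose rules all have scope at most 3 admits cubic-time decoding.

```python
def scope(rhs):
    """Scope of an SCFG rule's source-side RHS (a list of symbols).

    Count nonterminals at the rule boundaries plus pairs of adjacent
    nonterminals; lexical anchors between nonterminals keep the
    scope low. Uppercase marks a nonterminal (illustrative only).
    """
    is_nt = [s.isupper() for s in rhs]
    total = int(is_nt[0]) + int(is_nt[-1])          # boundary NTs
    total += sum(1 for a, b in zip(is_nt, is_nt[1:])
                 if a and b)                        # adjacent NT pairs
    return total

def prune_to_scope3(rules):
    """Keep only rules decodable in cubic time (scope <= 3)."""
    return [r for r in rules if scope(r) <= 3]
```

For example, the lexicalized rule A b C has scope 2, while the CNF-style rule A B already has scope 3, which is why CNF grammars sit exactly at the cubic-time boundary.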

We present the main ideas behind a new syntax-based machine translation system, based on reducing the machine translation task to a tree-labeling task. This tree labeling is further reduced to a sequence of decisions (of four varieties), which can be discriminatively trained. The optimal tree labeling (i.e. translation) is then found through a simple depth-first branch-and-bound search. An early system founded on these ideas has been shown to be competitive with Pharaoh when both are trained on a small subsection of the Europarl corpus.

In this paper, we propose a new syntax-based machine translation (MT) approach based on reducing the MT task to a tree-labeling task, which is further decomposed into a sequence of simple decisions for which discriminative classifiers can be trained. The approach is very flexible, and we believe that it is particularly well-suited for exploiting the linguistic knowledge encoded in deep grammars whenever possible, while at the same time taking advantage of the data-based techniques that have proven a powerful basis for MT, as recent advances in statistical MT show.

We revisit the idea of history-based parsing, and present a history-based parsing framework that strives to be simple, general, and flexible. We also provide a decoder for this probability model that is linear-space, optimal, and anytime. A parser based on this framework, when evaluated on Section 23 of the Penn Treebank, compares favorably with other state-of-the-art approaches, in terms of both accuracy and speed.

The standard PCFG approach to parsing is quite successful on certain domains, but is relatively inflexible in the type of feature information we can include in its probabilistic model. In this work, we discuss preliminary work in developing a new probabilistic parsing model that allows us to easily incorporate many different types of features, including cross-lingual information. We show how this model can be used to build a successful parser for a small hand-made gold-standard corpus of 188 sentences (in

We propose a theory that gives formal semantics to word-level alignments defined over parallel corpora. We use our theory to introduce a linear algorithm that can be used to derive, from word-aligned parallel corpora, the minimal set of syntactically motivated transformation rules that explain human translation data.

We analyze a new property of directed acyclic graphs (DAGs), called layer-width, arising from a class of DAGs proposed by Eiter and Lukasiewicz. This class of DAGs permits certain problems of structural model-based causality and explanation to be tractably solved. In this paper, we first address an open question raised by Eiter and Lukasiewicz – the computational complexity of deciding whether a given graph has bounded layer-width. After proving that this problem is NP-complete, we proceed by proving numerous important properties of layer-width that are helpful in efficiently computing the optimal layer-width. Finally, we compare this new DAG property to two other important DAG properties: treewidth and bandwidth.

Recently, Halpern and Pearl proposed a definition of actual cause within the framework of structural models. In this paper, we explicate some of the assumptions underlying their definition, and reevaluate the effectiveness of their account. We also briefly contemplate the suitability of structural models as a language for expressing subtle notions of commonsense causation.

Algorithms that triangulate a graph such that the resulting triangulation is at most a constant factor from the treewidth are important theoretically, but perform poorly in practice. In this paper, we show that a successful heuristic method for triangulating graphs (based on a structure known as a dtree) can be viewed as a relaxation of a well-known family of constant-factor approximation algorithms. In doing so, we provide a theoretical underpinning for this heuristic, and show how these theoretically elegant approximation algorithms can be effectively applied in practice.

Strategies for Determining Causes of Events

Using Recursive Decomposition to Construct Elimination Orders, Jointrees, and Dtrees
Darwiche has recently proposed a graphical model for driving conditioning algorithms, called a dtree, which specifies a recursive decomposition of a directed acyclic graph (DAG) into its families. A main property of a dtree is its width, and it was shown previously how to convert a DAG elimination order of width w into a dtree of width ≤ w. The importance of this conversion is that any algorithm for constructing low-width elimination orders can be directly used for constructing low-width dtrees. We propose in this paper a more direct method for constructing dtrees based on hypergraph partitioning.