
Can We Translate English To English?

November 12, 2010


Can an automatic system improve our writing?

Mary-Claire van Leunen is neither a mathematician nor a complexity theorist. She is the author of a great book on technical writing called A Handbook for Scholars. The book was at one time used in a course at Stanford run by Don Knuth. Even though she is not a theorist, she does have a publication in SIGACT News (see this). I have mentioned her before here.

Today I want to talk about writing, and how technology might be able to make it better.

When I arrived at Yale in 1973, Mary-Claire was a secretary for the computer science department. In those days, way before TeX, we wrote our papers in longhand. Then a “typist” used an IBM Selectric to turn the manuscript into a typed paper. Special symbols like Γ or →, which we take for granted now, were added either by hand like this:

or by using special IBM type balls. Modern typesetting has made writing papers so much easier.

When I first got to Yale I was a terrible writer, just terrible. Mary-Claire helped me tremendously to become just bad. I hope I have learned over the years and am now okay. But in those days I really needed her help. One trick she used was that when she typed up my papers she purposely made tons of typing errors, even though she was a near-perfect typist. This was her subtle way to let me know that the “final draft” I had given her needed some work.

She eventually wrote her book and mentioned me in the preface to the book:

To Richard Lipton, who suggested many fine points that I hadn’t thought of and will undoubtedly take my gratitude as sarcasm.

She had strong opinions; one I recall was not to use quotes at all, or at least to use them very sparingly. I try to avoid them in general, but today it is hard. There are just too many cool quotes about her. William Waterhouse wrote a great review of her book; here is the end of the review:

When explicit motivation is necessary, be on guard against grandiose, far-reaching statements. Early in my career I had the task of correcting an extraordinary essay from a student that began, “All the world is turning to thoughts of mortuary science.” A book like this cannot really be reviewed. It can be (and is) recommended.

When I first arrived at Princeton one of the major priorities was to raise money. We wrote NSF and other grants of all kinds to try and get dollars. There was a program that every year asked for asked for large department wide projects every year. One year I ran the proposal writing, was the PI, and helped put everything together. We did get a site visit, but did not get funded. The NSF folks did like the proposal enough—or had pity on us—that they did partially fund us. So at least we got some serious dollars from them. This money was much appreciated in the early days of the department. Princeton had then and still does have a huge endowment, but that does not mean that they are interested in using any of those dollars to fund a new effort. So getting even partial money from NSF was important.

We waited a year and then decided to try again to raise a large amount of money from NSF. This time David Dobkin was the PI, and the rest of us were helpers. We planned to spend most of the summer just working on the proposal. I had the following idea: why not invite Mary-Claire to visit us, to help make the proposal better written? By then, after her book was a success, she had moved on to be a member of the technical staff at Xerox PARC. Dobkin agreed, and soon we had arranged to have her visit us for about a month. The plan was that she would help us write a great proposal.

We would, under her leadership, spend hours working on the structure of the proposal. We would even sometimes spend hours working on the lead sentence for a whole section. She felt very strongly that getting even the first sentence just right would help make the document well written. We all worked very hard, and the proposal was finally done just in time and sent to NSF for review.

That year we did not even get a site visit from NSF. We were not even competitive enough to make it that far. Dobkin, as PI, got the written reviews, which in those days came in a letter from NSF. He was not happy. He showed them to me. They were terrible. I mean really terrible. The reviewers did not like our proposal. I still recall my favorite:

This is one of the best written and slickest proposal I have ever read. Do NOT fund this work under any circumstances.

Somehow we had written a terrific piece of English, but we had squeezed out all the content. David was not amused; my idea of bringing in Mary-Claire had not worked.

We never did that again. Two years later we again tried and this time landed a fully funded project. The proposal had content, but was not nearly as well written as the previous one. Oh well.

The Challenge

Google can now do a pretty good job on translation. That sentence is translated into French as:

Google peut maintenant faire un très bon travail sur la traduction.

Then, back to English as:

Google can now do a very good job on the translation.

Which becomes a fixed point. Pretty good job.
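The fixed-point idea can be made concrete with a small Python sketch. Here `round_trip` stands for any English-to-French-to-English translator. No real translation service is called; `toy_round_trip` and its lookup table are made-up stand-ins that only know the two sentences above.

```python
from typing import Callable, List, Tuple


def round_trip_orbit(
    text: str, round_trip: Callable[[str], str], max_steps: int = 20
) -> Tuple[List[str], int]:
    """Apply round_trip repeatedly until some text repeats.

    Returns the distinct texts seen, in order, and the length of the cycle
    that was entered: 1 means a fixed point, 0 means nothing repeated
    within max_steps.
    """
    seen = {text: 0}  # text -> step at which it first appeared
    orbit = [text]
    for step in range(1, max_steps + 1):
        text = round_trip(text)
        if text in seen:
            return orbit, step - seen[text]
        seen[text] = step
        orbit.append(text)
    return orbit, 0


# Hypothetical stand-in for a real translator: a lookup table holding
# only the two sentences from the example above.
START = "Google can now do a pretty good job on translation."
FIXED = "Google can now do a very good job on the translation."
TOY_TABLE = {START: FIXED, FIXED: FIXED}


def toy_round_trip(sentence: str) -> str:
    return TOY_TABLE.get(sentence, sentence)


orbit, cycle_length = round_trip_orbit(START, toy_round_trip)
print(orbit[-1], cycle_length)
```

A cycle length greater than 1 would mean the round trip never settles on a single sentence and instead loops among several.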

I wondered in a note to them a while ago if they could tackle the following problem: translate English not to French or some other language but translate English to English. Or more generally X to X. The idea was that perhaps they could help improve the quality of this or any other written piece. Another idea was to translate within types of English. Formal to informal, legal to layperson, and so on.

Open Problems

Can Google, or anyone, actually build a system that would take an English paper and make it into “better” English?

I am very serious about this. Currently I rely on Subrahmanyam Kalyanasundaram to help make this into English. It would be great to have a way to automate the changes he makes to each post. Some changes are to content, but many are to the wording.
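A first step toward automating an editor's wording changes might be simply to record them. The sketch below is only an illustration, not a description of how any post is actually edited: it assumes we have pairs of draft and edited sentences, and uses Python's standard `difflib` module to list the word-level substitutions between them. The function names and the sample sentences are hypothetical.

```python
import difflib
from collections import Counter
from typing import Iterable, List, Tuple


def wording_changes(draft: str, edited: str) -> List[Tuple[str, str]]:
    """List (before, after) word-level changes that turn draft into edited."""
    a, b = draft.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


def tally_changes(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each change occurs across many draft/edited pairs."""
    counts: Counter = Counter()
    for draft, edited in pairs:
        counts.update(wording_changes(draft, edited))
    return counts


# Hypothetical draft/edited pair, for illustration only.
print(wording_changes(
    "We did not get a site visit form NSF.",
    "We did not get a site visit from NSF.",
))
```

Changes that recur across many posts would be candidates for automatic suggestions; changes to content would still need a human.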

49 Comments
  1. November 12, 2010 10:03 am

    Some interesting efforts in this direction are http://www.polishmywriting.com/ and http://code.google.com/p/soylent/

    I’ve used the first of these services a fair bit, and it’s often quite helpful. I have not used the latter, but it’s a fascinating idea – crowdsourced editing.

    • rjlipton
      November 12, 2010 1:29 pm

      Michael,

      Thanks for the pointers

      • November 12, 2010 1:49 pm

        I should point out that neither service is fully automated. But both are attempts to move in that direction. I find polishmywriting.com particularly helpful for identifying overuse of the passive voice in my writing.

  2. anon
    November 12, 2010 10:26 am

    “Currently I rely on Subrahmanyam Kalyanasundaram to help make this into English. It would be great to have a way to automate the changes he makes to each post.”
    looks like Subrahmanyam is going to graduate soon😉

  3. November 12, 2010 10:34 am

    Very interesting idea. I am not native English speaker, and so the translator to “better” English would be very helpful for me. Unfortunately, I did not see any good translator program from English to Russian, for example. Yes, computer program is capable to translate a simple statement (like this “Google can now do a very good job on the translation.”) without problem. But the fact is that now we have a lot of papers and books translated to very poor Russian by computer. A man (editor) is necessary to improve such translations. I think this is open problem for AI. If and when we will have something like C-3PO (Star Wars space opera by George Lucas), when he would be able to “take an English paper and make it into “better” English”😉

  4. Phillip Hammonds
    November 12, 2010 10:48 am

    Dr. Lipton,

    I think it is possible. I have been thinking about this for some time. My colleague, Bernard Zeigler and I have written about it, although we were focused specifically on data/metadata engineering for Service Oriented Architecture/NetCentric web applications, but the problem is quite similar if not the same. It is not an easy task, but I think possible. The key is tagging words that have rich and specific syntactic, semantic and pragmatic content (The right meaning in the right context at the right time). I would like to discuss it further if you are interested.

    • Phillip Hammonds
      November 12, 2010 11:05 am

      Here is an overly simple example:

      I see the plane.

      The syntax, semantics and pragmatics of the first three words are fairly clear, although “see” could have several synonyms. “Plane”, however, is ambiguous. Does the author mean a geometric object, an airplane of some kind, or a tool for finishing wood? Suppose you tag the word plane with contextual information (let’s say you use XML or some other coding scheme). If the context is a battlefield or a radar site, it is unlikely that a tool or a geometric object is relevant.
      The tagging can be automated to a high degree, but the author could be prompted to amplify on certain terms during the writing or editing process. Once a text is tagged, the translation process would become much easier.

      Phil

  5. Sylvain
    November 12, 2010 10:51 am

    On a related subject, Google announced some time ago on its research blog an EMNLP paper on poetry translation; in particular they noted that “the system is also able to translate anything into poetry” . They are mostly interested in meter and rhyme—I am not sure technical writing would also be amenable to a similar process.

  6. Jack
    November 12, 2010 10:55 am

    I was in this class, and still have MC’s writing exercises. The one I use (and recommend) most is to underline all verbs in a passage, giving being verbs double underlines and passive verbs wiggly underlines. If more than half of the verbs are of these types, consider rewriting.

  7. Carsten Milkau
    November 12, 2010 11:04 am

    From what Google tells the public, we know that they use web documents available in multiple languages and a clever indexing method to guess a good translation of a given phrase. This method is expensive, but in most cases superior to any other fully-automated translation I am aware of. The sources used have to be carefully chosen, so the resulting English is likely to be quite formal (or junk).

    Indeed the same method might work if applied to original and reviewed versions of text, suggesting improved versions. The (in my eyes brilliant, though much disputed) key idea of Google’s new approach to translation is that it actually does not care much what it translates from or to, as long as a “correct” mapping has some kind of continuity (small mistakes don’t have disastrous effects), and it has enough data to deduce on.

    • November 12, 2010 1:29 pm

      Yes this is the so called Context-based machine translation and it could indeed be applied to english to english translation, improving the writing style of a paper by matching it to a large corpus of “good papers” from the same domain.

  8. aaa
    November 12, 2010 11:11 am

    What is a Subrahmanyam Kalyanasundaram and where can I get one?

  9. November 12, 2010 11:21 am

    I’ve always thought that this was possible. A paper contains not just the underlying information, but also a lot of convention and style. If you could extract the raw information in some form, you could simply re-apply a ‘writing style’ presentation layer back onto it. That style could be anything. It’s basically the same as how different language translators will alter the things they are working on, in different ways.

    In that way, you could get personalized views on all possible information. I don’t think there are any real-world reasons why we can’t do this, we’re just not technically sophisticated enough yet, and we’re still learning how to leverage our computers. Someday.

    Paul.

  10. November 12, 2010 11:43 am

    Your fixed point idea is interesting, modulo an equivalence relation for exact synonyms.

    However, I’m not sure that such a translator would be a net gain for the world. A “standard” version might be a big improvement for terrible originals, but enforcing canonicity would polish off the slight edges that are required for interesting writing. I suspect the fixed point would become a goal in itself in scientific publishing, much as word counts, citation standards, manuals of style, specific rules of typesetting, and even proprietary document formats have become. This would result in a blandly uniform writing style, enforced by automated submission systems. In the long run, the so-so papers are forgotten, but the key papers are read and re-read for decades. So ensuring that great papers are not made slightly worse seems more important than ensuring that so-so work is improved.

    • November 12, 2010 1:41 pm

      Google translated your message into Russian:

      Ваша навязчивая идея точки Интересно, по модулю отношения эквивалентности для точных синонимов.

      Однако, я не уверен, что такой переводчик будет чистая прибыль для всего мира. “Стандартной” версией может быть большой шаг вперед для страшное оригиналы, но соблюдения каноничности бы отполировать небольшим края, которые необходимы для интересных письменной форме. Я подозреваю, что неподвижная точка станет самоцелью в научной публикации, так же как слово имеет значение, цитата стандартов, руководств стиля, конкретные правила верстки, и даже собственные форматы документов стали. Это привело бы вежливо единый стиль письма, в жизнь автоматизированных систем представления. В долгосрочной перспективе, так себе документы забыл, но ключевые документы читать и перечитывать на протяжении десятилетий. Так обеспечение того, чтобы большие документы не сделал немного хуже, кажется более важным, чем обеспечение того, чтобы так улучшена работа.

      — something from this has no sense, but I asked google to translate this Russian text to English as is. I got:

      “Your obsession is interesting points, modulo the equivalence relations for exact synonyms.

      However, I am not sure that such a translator will be the net profit for the whole world. “Standard” version, can be a big step forward for the dreadful originals, but compliance is canonical to polish the little edge needed for interesting writing. I suspect that the fixed point becomes an end in itself in a scientific publication, as well as word counts, citation standards, guidelines, style, specific rules for layout, and even their own document formats have become. It would politely single style of writing, the life of automated reporting systems. In the long run, since the documents themselves have forgotten, but the key documents to read and reread for decades. So to ensure that large documents have not made any worse, it seems more important than ensuring that both improved.”

      — something from this has sense, but I am not sure that for 3rd or 4th pass (English ->Russian->English) we will get more sense😉

  11. Chris Surname permalink
    November 12, 2010 12:57 pm

    Well, it’s not the most original idea (see halfbakery.com/idea/Smart_20Ass_20Translator) but it’s a great one all the same😉

  12. November 12, 2010 1:18 pm

    An example from my paper in Изв. РАН (original Russian text):

    “Из доказанного утверждения можно сделать следующие
    выводы.
    1. Если графы G и G´ изоморфны, то для любого i существует j такое, что решения систем A•X=e(i) и A´•X´=e(j) совпадают с точностью до перестановки координат векторов X и X´.”

    Interpreter (the best interpreter in Russian Academy of Sci) did (see http://dx.doi.org/10.1007/s11172-006-0105-6 ):

    “The assertion proved suggests that
    1. If graphs G and G´ are isomorphic, for any i there exists j such that the solutions of the systems A•X=e(i) and A´•X´=e(j) coincide to an accuracy of permutation of the coordinates of the
    vectors X and X´.”

    Google did:

    “From the above statements can make the following
    conclusions.
    1. If graphs G and G ‘are isomorphic, then for any i there exists j such that the solution of systems of A • X = e (i) and A ‘• X’ = e (j) coincide up to permutation of the vectors X and X ‘.”

    My opponent wrote about http://dx.doi.org/10.1007/s11172-006-0105-6 :
    “”Coincide to an accuracy of ε” makes sense, although not, generally, in a mathematical paper. “Coincide to an accuracy” without a specified value does not.”

    I asked my colleague and friend, he is prof. from well-known Canadian university and he wrote six well-known books. He suggested:

    “Proved Assertion 1 suggests that:
    Conclusion 1. If G and G´are isomorphic, for any i there exists j such that the solutions of
    the systems A•X=e(i) and A´•X´=e(j) coincide modulo a permutation of the coordinates of the vectors X and X´.”

    I used this suggestion (with my thanks to him) in my preprint http://arxiv.org/pdf/1004.1808 Nobody asked me about this version!😉

    Please, compare google’s version and the professor’s version.

  13. November 12, 2010 1:37 pm

    Why not use a CAPTCHA that corrects grammar as a byproduct? Sort of like reCAPTCHA but for grammar?

    For example, the CAPTCHA might present the user with two sentences. The user would determine for each sentence whether there is a grammar error. Any error found must be corrected in the most obvious way(s).

    One of the sentences would be generated automatically by introducing an obvious grammar error.

    The other sentence would be taken from text whose grammar is to be corrected.

    The user does not know which is which and must at least correct the automatically introduced grammar error to pass the CAPTCHA.

    • Sid
      November 12, 2010 2:22 pm

      I’m not sure web users in general would perform better than bots for detecting and correcting grammatical errors…

    • Sid
      November 12, 2010 2:22 pm

      In addition, I think the concern here is not grammatical errors but the quality of writing.

  14. November 12, 2010 3:49 pm

    One should question why good writing is important. Isn’t the content more important?

    If the writing is good enough so that the content is easily understood, then what is the problem?

    Why make a big deal about mastering a messy natural language? Perhaps this is an IQ test in disguise? One day it might become illegal to discriminate against someone with poor writing skills.

    • November 12, 2010 4:35 pm

      > One day it might become illegal to discriminate against someone with poor writing skills.

      Interesting idea, I think😉 But I see: good writing is very important. I am able to do very good writing in Russian, and I can compare with my writing in English. Perhaps, somebody here understands what I want to say, but I can much more😦

    • November 12, 2010 4:46 pm

      Writing is our primary means of communicating ideas. If the writing isn’t good enough to encourage the readers, then they will never get past it. If you make it too difficult for the readers to continue, few will and the ideas will languish.

      These days there is so much out there that someone with great ideas has to compete to get them heard. I’d hate to think about all of the knowledge and experience that is quietly buried in unreadable texts. It must be huge. The language may be messy, but it’s not about the writer, it is the reader that really matters. If you want to share what you know, you have to make it easy for the other guy (or gal) to absorb it.

  15. November 12, 2010 4:04 pm

    The sentence, “There was a program that every year asked for asked for large department wide projects every year” has at least one extra “every year” in it.🙂

    I, too, would love an English-to-English translator, but, for now, I have to merely say, thank you for introducing me to Ms. van Leunen’s book; I don’t know how I’d missed it.

    • rjlipton
      November 12, 2010 4:36 pm

      Kate,

      That was not intentional. But I guess makes the point. We proof this stuff carefully, but oops.

  16. November 12, 2010 4:41 pm

    Spammers have been doing this for a long time, their goal is not to make the content better but merely “original”. They’ll take an entire blog and duplicate it, “spinning” it to make it “original” and then feeding off of it like a parasite. The language translators are a popular way of doing this (English->French->English or whatever).

    • November 12, 2010 4:55 pm

      > English->French->English or whatever

      May be it would be very-very-very productive idea for spammers. Let’s keep silence about 🙂

  17. Vijay D
    November 12, 2010 9:39 pm

    Translation party tries to find an English-Japanese fixed point.
    http://www.translationparty.com

    It’s fun to play with.

    • Anonymous
      November 14, 2010 11:43 pm

      Nice. I found some phrases where it enters a loop of length four, and so never reaches equilibrium.

    • November 15, 2010 5:44 am

      Hmmmm …. the fixed point associated to “Yes, we have no bananas” turns out to be “Yes, we have bananas.” Somehow I don’t think this approach is going to unlock the secrets of AI. 🙂

      • November 16, 2010 12:57 am

        Lol. That’s awesome. I don’t know anything about Japanese, but I infer from this result that Japanese is one of those “double negative just means negative” languages.

  18. November 13, 2010 3:46 am

    I think Google probably is working on this already (full disclosure: Google is my employer). For example, they have published a DVD containing n-gram using the Web corpus, which could be used to build language models:

    http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

    The hope is that people will use this to develop novel algorithms, and it appears their “evil” plan is working:

    http://n-gram-patterns.sourceforge.net/sourceforge-flamengo-report.pdf

    I imagine you could use that kind of data to build an English to English translation system. I’m not sure exactly how it would work, but if you could get corpora in different styles (formal, legal, lyrical, South African, etc.), you’d come up with different n-gram frequencies, which would lead to different language models.

    Unfortunately, you probably can’t do the same thing they did to translate between different languages, because their method relies on pairs of translated documents, and you won’t find many of those where BOTH languages are in the same language.

    • November 13, 2010 3:49 am

      DVD containing n-gram using -> DVD containing n-gram counts using

    • November 13, 2010 3:51 am

      BOTH languages are in -> BOTH documents are in

      WordPress really needs an edit button, or I need to proof read harder (or both).

  19. proaonuiq
    November 13, 2010 3:16 pm

    Mes excuses si ce commentaire est hors topique mais il pourrait etre interessant pour les lecteurs de votre blog:

    http://www.fqxi.org/community/essay : Is reality digital or analogue ?

    Hmmm…..la quantité n´est pas intéressante pour moi…mais pour ceux qui pourraient etre interessés quelques informations sur le membres de cette organisation:

    http://www.fqxi.org/who
    http://www.fqxi.org/members

    La creme de la creme !

    P.s. as it must be apparent by the abondance of typos, no automatic translation in here.

    • November 14, 2010 2:26 pm

      To counter Mary-Claire van Leunen’s “strong opinion … not to use quotes” we can quote Emerson:

      ———
      “All minds quote. Old and new make the warp and woof of every moment. There is no thread that is not a twist of these two strands. By necessity, by proclivity, and by delight, we all quote…. A great man quotes bravely and will not draw on his invention when his memory serves him with a word as good. … There is, besides, a new charm in such intellectual works as, passing through long time, have had a multitude of authors and improvers. … He that comes second must needs quote him that comes first.”
      ———

  20. November 14, 2010 3:09 pm

    Here’s another interesting writing assistant: http://www.netspeak.eu. It’s a phrase dictionary that includes usage frequencies derived from a large quantity of English Web pages for every phrase, and it allows for wildcard queries. This way, one may check alternative phrases for their commonness in everyday writing, which is helpful in selecting common phrases over uncommon ones, especially for non-native English speakers.

  21. Gil Kalai
    November 14, 2010 6:45 pm

    “Currently I rely on Subrahmanyam Kalyanasundaram to help make this into English. It would be great to have a way to automate the changes he makes to each post. ”

    So whose role on our blogs can be automated earlier the writers or the editors?

    Sometimes our insights about what is easy and what is hard to automatize can be misleading, no?

  22. Greg
    November 16, 2010 6:01 pm

    The answer to your question is “no”. I say that it is impossible to build a machine that would “improve” English writing (beyond simply correcting spelling mistakes or grammatical errors).

    My reasoning is straightforward. If a machine could be capable of improving a written piece, then surely a human reader (who starts out with a tremendous cognitive advantage over a machine) could make the same necessary improvements mentally (albeit with some effort) while reading the piece. However, the usefulness of this machine is to improve the writing in ways that a human reader could not. But since we have just established that any such mechanical improvements to writing could not exceed any improvements made by the reader – then the only value of this machine would be labor-saving (that is, sparing the reader the effort of making sense out of the writing). But in all other respects, a machine would add no unique value to the written document.

    Another way to look at this situation is to consider the question: what makes bad writing bad? After all, it’s much easier to describe bad writing than good writing (because the set of all poorly-written versions of a particular document greatly outnumbers the set of all well-written versions of the same document). And as others have already pointed out, a poorly written document essentially lacks information. That is, the information presented may be ambiguous (for example, pronouns whose antecedents are unclear) or incomplete or even contradictory (such as technical terms used without being defined or terms that seem to be defined in more than one way).

    So, just as a human reader is not able to resolve ambiguities, resolve contradictions, or provide missing information safely – a machine would not be able to do any of those things either. However, where a machine could perhaps excel, would be in the detection and in the reporting of such problems to the author (in much the same way that any good human editor would).

    So my conclusion is that we will never have a mechanical co-author who will make original contributions to our writings – but we could very well (someday) have a mechanical editor show us where our writing needs to be improved.

    • rjlipton
      November 16, 2010 6:10 pm

      Greg

      Does this apply to chess playing? Computers are rated much higher than any humans

      • November 17, 2010 9:34 am

        No. Chess problem may be solved via simple alpha-beta algorithm. It may be very-very long process but we will get the best solution. We do not know any algorithm to solve some important writing problems!

    • November 16, 2010 6:20 pm

      Not so sure I agree. A skilled editor can dramatically lift up the work of a poor writer. Also there are often secondary writers that create interesting works by analyzing and adding to a difficult piece. When I was interested in the Riemann Hypothesis for example, I chose to read a rather large book that nicely explained the original eight pages in a level of detail that I could understand (well, almost 🙂).

      Writers can build on other writer’s work. They might not be able to tell you exactly what was in the author’s mind at the time, but they can certainly put it into context, clarify it and even modernize it. If it’s wrong, it is wrong but at least a good writer can explain why.

      A computer that could extract the raw information, then re-apply a comfortable writing style to it, could also cross reference a massive database like Wikipedia and fill in the blanks. It could even change it significantly to address a specific audience (although in my case with Riemann’s work the resulting book may be millions of pages …)

      Paul.

    • November 17, 2010 3:21 am

      But in all other respects, a machine would add no unique value to the written document.

      Heavier than air cannot fly, it’s well known.
      Except that the knowledge database used by the machine would embody, not only grammatical/stylistic knowledge, but out of necessity (for contexts recognition) essential domain knowledge from the field of interest which would supplement the paper’s findings.
      Not that achieving this is an easy endeavor but there is no impossibility from principles.

  23. November 25, 2010 10:21 am

    there are too many comment about this subject. I haven’t have time to read all of the comments. But in my experiences there is no better than people. Machines doesn’t help us now and in the future.

  24. madhuvanthinie arangannal
    February 27, 2011 7:05 am

    hi how r u all????????????????

  25. ashok
    August 19, 2011 11:17 am

    Thanks for a lot of details at one site…..
