Introduction to regular expressions in Transit NXT

Regular expressions, also called regex or regexp for short, are a powerful feature that lets you perform many text operations with a single search pattern, which streamlines workflows and saves time. As an advanced feature, it takes some effort to master, but the effort pays off. The feature is not exclusive to Transit/TermStar NXT and WebTerm: most programming languages, text editors and CAT tools include it, with some variations between systems.

Definition: In short, a regular expression is a string that matches a text pattern. More precisely, it is a sequence of characters that describes the properties shared by a set of text strings, so that we can match them all in one go rather than running a separate search for each string. To do this, certain characters are given special meanings. For example, the dot (.) stands for any single character, and square brackets let you be more precise by listing the characters or ranges allowed, e.g. [aeiou] matches any vowel and [0-5] matches any digit from 0 to 5.
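As a rough illustration outside Transit NXT, the sketch below uses Python's built-in re module to show how one pattern can stand in for several literal searches. Python is only an assumption made for the sake of a runnable example, and the word list is invented; Transit NXT's own syntax differs in places.

```python
import re

# Hypothetical sample words, invented for this illustration.
words = ["bat", "bet", "bit", "bot", "but", "bct"]

# [aeiou] matches exactly one vowel between "b" and "t".
vowel_pattern = re.compile(r"b[aeiou]t")
vowel_hits = [w for w in words if vowel_pattern.fullmatch(w)]

# The dot matches any single character, so "bct" is found as well.
dot_pattern = re.compile(r"b.t")
dot_hits = [w for w in words if dot_pattern.fullmatch(w)]

# [0-5] matches a single digit from 0 to 5.
digit_pattern = re.compile(r"[0-5]")
digit_hits = [d for d in "0123456789" if digit_pattern.fullmatch(d)]

print(vowel_hits)
print(dot_hits)
print(digit_hits)
```

Here a single pattern does the work of five separate searches for "bat", "bet", "bit", "bot" and "but".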

Functions: In Transit NXT, regular expressions can be used to find (and replace) text in segments, to create segment filters, to create translation exceptions or to customize conversion filters. In TermStar NXT, they can be used to create record filters and input verification rules, to search and replace text in any field of the term record, and to perform other, more advanced operations. In short, they are of great use in QA tasks. Hopefully we’ll devote future tooltips to all those functions, but for the time being let’s stick to the simplest example: searching with a regexp.

Example: Launch the search dialog (e.g. by pressing Ctrl+F) and check the option Regular expression. Enter the expression you want to find and configure all the other options as appropriate. In the screenshot, you can see the expression on[\s\-\.]?line. This will find “online”, “on line”, “on-line”, etc. The part between square brackets matches a space, a hyphen or a dot, and the question mark makes that character optional.
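The same expression can be tried outside Transit NXT. The following is a minimal sketch using Python's re module, which happens to accept this particular expression unchanged; the sample strings are invented for illustration.

```python
import re

# The expression from the example above, as a Python raw string.
pattern = re.compile(r"on[\s\-\.]?line")

# Hypothetical test strings, not taken from any real project.
samples = ["online", "on line", "on-line", "on.line", "on--line", "offline"]

# Keep only the strings in which the pattern is found.
matches = [s for s in samples if pattern.search(s)]
print(matches)
```

Note that “on--line” is not found: the question mark allows at most one separator character between “on” and “line”.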

Searching regular expressions

Basic syntax: To write a valid regular expression, you must know what special meaning each character has and how to combine them. You can use:

  • Standard characters and symbols. They will match themselves, e.g. online will match “online”, % will match “%”, [%@#] will match any one of “%”, “@” or “#”, and so on. This category includes all characters except those that are used as metacharacters.
  • Metacharacters. They include . & * + ? [ ] ( ) $ ^ ! \ | # and have a special meaning when used in a regular expression, unless they are escaped (i.e. preceded by a backslash, e.g. \.).
  • Control characters. They are non-printing characters that control the appearance of the text: \s is a space, \t is a tab, \n is a line break, etc.
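The difference between a metacharacter and its escaped form can be sketched as follows. This uses Python's re module as a stand-in (an assumption, since Transit NXT has its own engine and its own list of metacharacters), and the sample strings are invented.

```python
import re

candidates = ["3.50", "3x50"]

# An unescaped dot is a metacharacter: it matches any single character.
unescaped = re.compile(r"3.50")
loose = [s for s in candidates if unescaped.fullmatch(s)]

# An escaped dot matches only a literal full stop.
escaped = re.compile(r"3\.50")
strict = [s for s in candidates if escaped.fullmatch(s)]

# re.escape() adds the backslashes when the whole search text is literal.
literal = re.escape("a+b.txt")
found = re.fullmatch(literal, "a+b.txt") is not None

# Control characters: \t matches a tab.
# In Python, \s matches any whitespace character, not only a space.
tab_split = re.split(r"\t", "source\ttarget")

print(loose, strict, found, tab_split)
```

Forgetting to escape is a common source of false positives: the unescaped pattern also accepts “3x50”.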

A few examples. The special meaning of metacharacters might be more easily understood with a few instances of how they are used.

  • The dot (.) matches any character, so gr.y will match both “gray” and “grey”.
  • A metacharacter must be escaped (\.) in order to match itself literally, so you will need to write gr.y\.doc to match “grey.doc” and “gray.doc”.
  • The question mark makes the preceding character optional, so counsell?or will match both “counsellor” and “counselor”, and judge?ment will match both “judgement” and “judgment”.
  • Square brackets enclose a set or range of characters, any one of which may appear at that position, so s[ck]eptical will match both “skeptical” and “sceptical”, and reali[sz]e will match both “realize” and “realise”.
  • The asterisk (*) matches zero or more occurrences of the preceding character or character class, so AB[0-9]* will match “AB”, “AB1”, “AB45”, “AB42490”, etc.
  • The plus sign matches one or more occurrences of the preceding character or character class, so cros+ection will match “crosection”, “crossection”, “crosssection”, etc., and AB[0-9]+ will match “AB1”, “AB45”, “AB42490”, etc. but not “AB”.
  • The ampersand (&) must be used between other characters and will match one or more characters, so counse&or will match “counselor”, “counsellor”, but also “counselllor”, etc.
  • The exclamation mark negates the class or range it precedes, so that AB[!0-9] will match “ABC” but not “AB3”.
  • The placement characters ^ and $ match the beginning and the end of the line respectively, so ^[a-z] will match lines beginning with a lowercase letter (as long as the Match case option is checked) and [!,;:\.]$ will match lines ending in anything but a comma, semicolon, colon or full stop.
  • Parentheses group an expression together so that other operators can be applied to it as if it were a single character. So if [a-z\.\-_]+\@[a-z\.\-_]+ matches an email address, the expression ([a-z\.\-_]+\@[a-z\.\-_]+, )+ will match one or more email addresses, each followed by a comma and a space.
  • The vertical bar can be used to separate alternatives, so kerb|curb will match both “curb” and “kerb”.
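The examples in the list above can be checked with a short script. This is a sketch in Python's re module, not in Transit NXT itself, so two operators are translated under stated assumptions: Python has no & operator, so .+ is used as the closest equivalent, and Python negates a class with ^ inside the brackets rather than with !. The strings that should not match, the sample lines and the email addresses are invented for illustration.

```python
import re

# (pattern, strings that should match in full, strings that should not)
cases = [
    (r"gr.y", ["gray", "grey"], ["gry"]),
    (r"gr.y\.doc", ["grey.doc", "gray.doc"], ["grey-doc"]),
    (r"counsell?or", ["counsellor", "counselor"], ["counselllor"]),
    (r"judge?ment", ["judgement", "judgment"], ["judgeement"]),
    (r"s[ck]eptical", ["skeptical", "sceptical"], ["septical"]),
    (r"reali[sz]e", ["realise", "realize"], ["realice"]),
    (r"AB[0-9]*", ["AB", "AB1", "AB45", "AB42490"], ["ABC"]),
    (r"AB[0-9]+", ["AB1", "AB45", "AB42490"], ["AB"]),
    (r"cros+ection", ["crosection", "crossection"], ["croection"]),
    (r"kerb|curb", ["kerb", "curb"], ["carb"]),
    # Assumption: ".+" stands in for Transit's "&" (one or more characters).
    (r"counse.+or", ["counselor", "counsellor", "counselllor"], ["counseor"]),
    # Assumption: "[^0-9]" stands in for Transit's "[!0-9]".
    (r"AB[^0-9]", ["ABC"], ["AB3"]),
]


def check(pattern, good, bad):
    """Return True if pattern fully matches every good string and no bad one."""
    rx = re.compile(pattern)
    return all(rx.fullmatch(s) for s in good) and not any(
        rx.fullmatch(s) for s in bad
    )


results = {pattern: check(pattern, good, bad) for pattern, good, bad in cases}

# ^ and $ anchor to the start and end of each line tested.
# Python is case-sensitive by default, like Transit with Match case checked.
lines = ["first line,", "Second line", "third line."]
lower_start = [ln for ln in lines if re.search(r"^[a-z]", ln)]
no_punct_end = [ln for ln in lines if re.search(r"[^,;:.]$", ln)]

# Parentheses let "+" repeat a whole "address, comma, space" group.
email = r"[a-z\.\-_]+\@[a-z\.\-_]+"
email_list = re.compile(rf"({email}, )+")
list_ok = email_list.fullmatch("ann@example.com, bob@example.org, ") is not None
single_ok = email_list.fullmatch("ann@example.com") is not None

print(results)
print(lower_start, no_punct_end, list_ok, single_ok)
```

The checks use fullmatch so that each pattern must cover the whole string. A plain search would, for instance, still find “AB” inside “ABC” with AB[0-9]*, because the asterisk also allows zero digits.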

More information and more advanced examples will appear in future tooltips. In the meantime you can also check chapter 12 Regular expressions of the Reference guide (which you will find in http://www.star-transit.net > Downloads > Technical documentation).

Special thanks to Karen Ellis for reviewing this post.

About Manuel Souto Pico

Linguist and translation technologist.