The system tr utility transliterates characters. For example, it is
often used to map upper-case letters into lower-case, for further
processing.
    generate data | tr '[A-Z]' '[a-z]' | process data ...
You give tr two lists of characters enclosed in square brackets.
Usually, the lists are quoted to keep the shell from attempting to do a
filename expansion.(23) When processing the input, the
first character in the first list is replaced with the first character in the
second list, the second character in the first list is replaced with the
second character in the second list, and so on.
If there are more characters in the "from" list than in the "to" list,
the last character of the "to" list is used for the remaining characters
in the "from" list.
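As a small illustration of the positional mapping described above, the following sketch runs tr on fixed sample strings. The sample text and the shell variable names lowered and mapped are invented for this example and are not part of the original program.

```shell
# Map each upper-case letter to its lower-case counterpart.
lowered=$(printf 'Hello, World\n' | tr '[A-Z]' '[a-z]')
echo "$lowered"

# Positional mapping: a->x, b->y, c->z; other characters pass through.
mapped=$(printf 'aabbcc-d\n' | tr 'abc' 'xyz')
echo "$mapped"
```

The first command should print the text in lower-case. The second shows that characters not named in the "from" list (here the hyphen and the d) are copied unchanged.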
Some time ago, a user proposed that we add a transliteration function to gawk.
Being opposed to "creeping featurism," I wrote the following program to
prove that character transliteration could be done with a user-level
function. This program is not as complete as the system tr utility,
but it will do most of the job.
The translate program demonstrates one of the few weaknesses of
standard awk: dealing with individual characters is very painful,
requiring repeated use of the substr, index, and gsub built-in functions
(see section Built-in Functions for String Manipulation).(24)
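There are two functions. The first, stranslate, takes three
arguments:
from
    A list of characters to translate from.
to
    A list of characters to translate to.
target
    The string to do the translation on.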
Associative arrays make the translation part fairly easy. t_ar holds
the "to" characters, indexed by the "from" characters. Then a simple
loop goes through from, one character at a time. For each character
in from, if the character appears in target, gsub
is used to change every occurrence of it to the corresponding "to" character.
The translate function simply calls stranslate using $0
as the target. The main program sets two global variables, FROM and
TO, from the command line, and then changes ARGV so that
awk will read from the standard input.
Finally, the processing rule simply calls translate for each record.
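The ARGV manipulation described above can be tried in isolation. The following is a minimal sketch, not part of the translate program: the variable PREFIX and the argument item are hypothetical stand-ins for FROM, TO, and their command-line values.

```shell
# Hypothetical demo: PREFIX plays the role that FROM and TO play
# in the translate program.
tagged=$(printf 'one\ntwo\n' | awk '
BEGIN {
    PREFIX = ARGV[1]   # save the command-line argument
    ARGC = 2           # keep only ARGV[0] and ARGV[1]
    ARGV[1] = "-"      # "-" tells awk to read standard input
}
{ print PREFIX ": " $0 }
' item)
echo "$tagged"
```

Without the changes to ARGC and ARGV, awk would try to open a file named item instead of reading the piped data.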
    # translate --- do tr like stuff
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # August 1989

    # bugs: does not handle things like: tr A-Z a-z, it has
    # to be spelled out. However, if `to' is shorter than `from',
    # the last character in `to' is used for the rest of `from'.

    function stranslate(from, to, target,     lf, lt, t_ar, i, c)
    {
        lf = length(from)
        lt = length(to)
        for (i = 1; i <= lt; i++)
            t_ar[substr(from, i, 1)] = substr(to, i, 1)
        if (lt < lf)
            for (; i <= lf; i++)
                t_ar[substr(from, i, 1)] = substr(to, lt, 1)
        for (i = 1; i <= lf; i++) {
            c = substr(from, i, 1)
            if (index(target, c) > 0)
                gsub(c, t_ar[c], target)
        }
        return target
    }

    function translate(from, to)
    {
        return $0 = stranslate(from, to, $0)
    }

    # main program
    BEGIN {
        if (ARGC < 3) {
            print "usage: translate from to" > "/dev/stderr"
            exit
        }
        FROM = ARGV[1]
        TO = ARGV[2]
        ARGC = 2
        ARGV[1] = "-"
    }

    {
        translate(FROM, TO)
        print
    }
While it is possible to do character transliteration in a user-level
function, it is not necessarily efficient, and we started to consider adding
a built-in function. However, shortly after writing this program, we learned
that the System V Release 4 awk had added the toupper and
tolower functions. These functions handle the vast majority of the
cases where character transliteration is necessary, and so we chose to
simply add those functions to gawk as well, and then leave well
enough alone.
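For the common case of changing case, the two built-in functions can be used directly. This is a brief sketch on a made-up sample string; the shell variable names lower and upper are chosen only for this example.

```shell
# tolower and toupper return a case-converted copy of their argument.
lower=$(printf 'Hello, World\n' | awk '{ print tolower($0) }')
upper=$(printf 'Hello, World\n' | awk '{ print toupper($0) }')
echo "$lower"
echo "$upper"
```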
An obvious improvement to this program would be to set up the
t_ar array only once, in a BEGIN rule. However, this
assumes that the "from" and "to" lists
will never change throughout the lifetime of the program.
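One way this improvement might look is sketched below. It is an illustration under stated assumptions rather than a replacement for the program above: the shell function name translate_once is hypothetical, the lists are passed with -v instead of through ARGV (so awk processes backslash escapes in them), and each record is rebuilt one character at a time instead of calling gsub.

```shell
# Hypothetical helper: translate_once FROM TO  (reads standard input)
translate_once() {
    awk -v from="$1" -v to="$2" '
    BEGIN {
        # Build the lookup table a single time, before any input is read.
        lf = length(from)
        lt = length(to)
        for (i = 1; i <= lf; i++) {
            c = substr(from, i, 1)
            # Past the end of "to", reuse its last character.
            t_ar[c] = substr(to, (i <= lt ? i : lt), 1)
        }
    }
    {
        # Rebuild each record one character at a time.
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            out = out ((c in t_ar) ? t_ar[c] : c)
        }
        print out
    }'
}

printf 'hello\n' | translate_once elo 31
```

Because every input character is looked up exactly once, a swap such as translating "ab" to "ba" behaves as expected, whereas successive gsub calls would first turn each a into b and then turn every b back into a. The "from" characters are also never used as regular expressions, so a character such as "." matches only itself.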