Tuesday, July 10, 2012

Text Manipulation with sed

Replace text on the fly, without even starting an editor, using this classic tool.
The filter sed can process text from standard input and write its results to standard output. The input can be redirected from a file, and the output also can be redirected to a file using your shell's redirection capabilities. It has hundreds of uses, and once you learn sed, you really would miss it if you lost it.
sed can append lines, remove lines, change lines, rearrange lines, substitute text strings and more. Using sed, you can write simple scripts that can become powerful text manipulating commands.
sed can use regular expressions to define what processing will occur on lines of text and which lines it processes. If you have never seen or used regular expressions before, you may want to become familiar with the basic syntax of regular expressions. In this article, we use a few regular expressions to make sed do some simple text processing.
Ways to Run sed
sed can be run on the command line as follows:
cat sample.txt | sed -e '1,15d'
Here cat writes the file sample.txt into the pipe, and sed reads those lines as its input. The -e option tells sed to treat the next argument as a sed command. In that command, 1,15 is a line range and d is the delete command, so sed deletes lines 1–15 of the input stream, which in this case is the lines read from sample.txt. The rest of the file (if any) appears on standard output, your terminal window, unless redirected elsewhere.
Also, you simply can specify the input file as a command-line argument, so the above sed command also can be written as:
sed -e '1,15d' sample.txt
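As a quick check that the two invocations behave the same, here is a minimal sketch. It builds a throwaway sample.txt in a temporary directory (the 20 numbered lines are made up for illustration) and runs the delete command both ways:

```shell
# Build a hypothetical 20-line sample.txt in a scratch directory.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$workdir/sample.txt"

# Form 1: feed the file to sed through a pipe.
piped=$(cat "$workdir/sample.txt" | sed -e '1,15d')

# Form 2: name the file as a command-line argument.
direct=$(sed -e '1,15d' "$workdir/sample.txt")

# Both variables now hold "line 16" through "line 20".
echo "$direct"

rm -rf "$workdir"
```

The second form saves a process and is usually the one to prefer when the input is a plain file.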
You also can tell sed to read its commands from a script file by using the -f script-file option.
sed Command Format
A sed command has this format:
[pattern1][,pattern2][!] command [args]
The pattern1 and pattern2 are optional line ranges. Some commands don't use the patterns, some commands use only one and some can use both to specify a range of lines that the sed command can operate on, as we did in our simple example above.
pattern1 and pattern2 can be numbers, in which case they are treated as line numbers. Each also can be a regular expression delimited by slashes (/pattern/). With a single regular expression, every line that matches it is passed to the sed command. With two, the range runs from a line matching the first expression through the next line matching the second.
If no pattern is specified, the sed command operates on every line of input.
The ! causes sed to operate on every line not included in the pattern range. You can change our example above to be:
cat sample.txt | sed -e '1,15!d'
This command deletes all lines except lines 1–15.
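The address forms described in this section can be compared side by side. The following is only a sketch: the five input lines are invented for the illustration, and each variable holds the output of one variation.

```shell
# Hypothetical five-line input used only for this illustration.
input=$(printf '%s\n' alpha beta gamma delta epsilon)

# Line-number range: delete lines 2 through 4.
by_number=$(echo "$input" | sed -e '2,4d')

# Regular-expression range: delete from the first line matching /beta/
# through the next line matching /delta/.
by_regex=$(echo "$input" | sed -e '/beta/,/delta/d')

# Negation: delete every line that is NOT in the range 2-4.
negated=$(echo "$input" | sed -e '2,4!d')

# No address at all: d applies to every line, so nothing is printed.
everything=$(echo "$input" | sed -e 'd')

echo "$by_number"   # alpha and epsilon
echo "$negated"     # beta, gamma and delta
```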
A Few sed Commands
Here are a few basic sed examples. These can all be run right from the command line. Testing and debugging your sed commands individually on the command line before integrating them into a larger script will save you a lot of time that otherwise would be spent debugging the commands from within a running script.
Let's say that you have a file that lists customers called customer.txt. For the following examples, it contains simple lines of text, like this:
Sam Jones
Brenda Jones
Carl Simon
Liz Smith
Let's use some sed commands to manipulate this file. For example, if you want to remove lines containing Carl Simon and update your customer file, you can do the following:
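sed -e '/Carl Simon/d' customer.txt > customer.tmp && \
mv customer.tmp customer.txt
The output goes to a temporary file first because the shell empties the target of > before sed reads anything. Redirecting straight back into customer.txt would typically wipe the file.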
The pattern /Carl Simon/ is used by sed as a regular expression and matches every line that has that pattern somewhere on the line. The d command deletes every line that matches the pattern. So, any lines containing Carl Simon are removed from the file.
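To try this update without touching real data, here is a minimal sketch that works on a copy of the sample customer file in a temporary directory. It writes sed's output to a temporary file (customer.tmp is just an illustrative name) and renames it over the original only when sed exits successfully, so the file is never read and overwritten at the same time.

```shell
# Recreate the sample customer file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Sam Jones' 'Brenda Jones' 'Carl Simon' 'Liz Smith' \
    > "$workdir/customer.txt"

# Filter into a temporary file, then replace the original only on success.
sed -e '/Carl Simon/d' "$workdir/customer.txt" > "$workdir/customer.tmp" &&
    mv "$workdir/customer.tmp" "$workdir/customer.txt"

# The file now lists Sam Jones, Brenda Jones and Liz Smith.
result=$(cat "$workdir/customer.txt")
echo "$result"

rm -rf "$workdir"
```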
If you want to perform some type of text substitution on a text file, the s command is probably what you are looking for. It substitutes one text string for another. We tend to use this a lot in our scripts. For example, if Sam Jones calls up and tells you that you should have him listed as Samuel Jones, you can use this command to make the change:
sed -e 's/Sam Jones/Samuel Jones/' customer.txt > customer.tmp && \
mv customer.tmp customer.txt
The s command in sed is followed by three slashes. The text between the first and second slash is the pattern you want to match. The text between the second and third slash is the replacement text. If you wanted all instances of Sam to be Samuel (not just Sam Jones), you could rewrite this example as follows:
sed -e 's/Sam/Samuel/g' customer.txt > customer.tmp && \
mv customer.tmp customer.txt
The trailing g makes sed replace every match on a line rather than only the first. Note that this pattern also matches Sam inside longer names, such as Samantha.
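To see how the s command treats repeated matches on one line, consider this small sketch. The input line is invented for the illustration; it is not part of the customer file.

```shell
# Made-up line with three occurrences of "Sam".
line='Sam met Sam and Samantha'

# Without a flag, only the first match on the line is replaced.
first_only=$(echo "$line" | sed -e 's/Sam/Samuel/')
echo "$first_only"    # Samuel met Sam and Samantha

# With the g flag, every match is replaced, including the one inside Samantha.
all=$(echo "$line" | sed -e 's/Sam/Samuel/g')
echo "$all"           # Samuel met Samuel and Samuelantha

# Including the trailing space in the pattern limits matches to "Sam "
# followed by a space. This would miss a "Sam" at the very end of a line.
with_space=$(echo "$line" | sed -e 's/Sam /Samuel /g')
echo "$with_space"    # Samuel met Samuel and Samantha
```

The more specific the pattern, the less likely a substitution is to touch text you did not intend to change.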
The commands for append (a), change (c) and insert (i) are easiest to use from a separate script file, because the text they add goes on its own line after a backslash. For example, say you want to append the line After Brenda right after the line that contains the text Brenda. You can use the a sed command to append the text there, so fire up your favorite editor and create the following sed command file:
#
# sed command file (# are comment lines)
#
# append the line 'After Brenda'
# in this customer file
#
/Brenda/a\
After Brenda
Save this script file as sed1.cmd. Then, to run sed using this script file, use this syntax:
sed -f sed1.cmd customer.txt
You should see the contents of your customer file with the additional line added after the line Brenda Jones. The pattern /Brenda/ (in the sed command file) determines where in the output the appended line appears.
The difference between the append command and the insert command is where the text is added. For the append command, the text is added after the line containing the match. For the insert command, the text is added before the line that contains the match.
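The difference is easiest to see with both commands in one script. The sketch below assumes a second command file, here called sed2.cmd (a hypothetical name), and writes it with printf so the whole example can be pasted into a shell.

```shell
# Recreate the sample customer file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Sam Jones' 'Brenda Jones' 'Carl Simon' 'Liz Smith' \
    > "$workdir/customer.txt"

# Write the sed command file: one insert (i) and one append (a),
# both attached to lines matching /Brenda/.
printf '%s\n' \
    '# insert before, and append after, the Brenda line' \
    '/Brenda/i\' \
    'Before Brenda' \
    '/Brenda/a\' \
    'After Brenda' \
    > "$workdir/sed2.cmd"

result=$(sed -f "$workdir/sed2.cmd" "$workdir/customer.txt")

# Prints: Sam Jones, Before Brenda, Brenda Jones, After Brenda,
# Carl Simon, Liz Smith (one per line).
echo "$result"

rm -rf "$workdir"
```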

For those who have never used regular expressions, here are three regular expressions that are very useful when combined with sed:
  1. To match the start of a line, use the ^ character.
  2. To match the end of a line, use the $ character.
  3. To match any run of characters in a regular expression, use the characters .*. The . matches any single character, and the * matches zero or more repetitions of whatever precedes it, so .* matches any sequence of characters (including none at all).
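Here is a short sketch of the three expressions at work. The three input lines are invented for the illustration.

```shell
# Made-up input: "error" appears at the start, middle and end of a line.
input=$(printf '%s\n' 'error: disk full' 'no error here' 'warning: error')

# ^ anchors the match to the start of the line.
no_leading=$(echo "$input" | sed -e '/^error/d')
echo "$no_leading"     # no error here / warning: error

# $ anchors the match to the end of the line.
no_trailing=$(echo "$input" | sed -e '/error$/d')
echo "$no_trailing"    # error: disk full / no error here

# .* matches the rest of the line after the first colon.
before_colon=$(echo "$input" | sed -e 's/:.*//')
echo "$before_colon"   # error / no error here / warning
```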
Practical Examples
Filter out empty lines from a file:
sed -e '/^$/d' your_file.txt
Add the computer named mycomputer to the end of every line in /etc/exports:
sed -e 's/$/ mycomputer/' /etc/exports > exports.tmp && \
mv exports.tmp /etc/exports
Add the computer named comp2 only to the directories beginning with /data/ in /etc/exports:
sed -e '/^\/data\//s/$/ comp2/' /etc/exports > exports.tmp && \
mv exports.tmp /etc/exports
Notice that the forward slashes in the directory name have to be escaped with backslashes. Without the backslashes, sed interprets those forward slashes as the delimiters of the sed command itself. However, the backslashes can make the sed command difficult to read and follow.
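One way around the escaping is to pick a different delimiter. sed accepts another character in place of the slash: for an address, write a backslash before the first delimiter (\%pattern%), and for the s command, simply use the new character three times. The sketch below uses made-up export lines rather than a real /etc/exports.

```shell
# Hypothetical export entries, used only for this illustration.
input=$(printf '%s\n' '/data/projects hostA' '/home hostA' '/data/scratch hostA')

# Address written with escaped slashes, as in the article.
escaped=$(echo "$input" | sed -e '/^\/data\//s/$/ comp2/')

# Same address with % as the delimiter: no escaping needed.
custom=$(echo "$input" | sed -e '\%^/data/%s/$/ comp2/')

# The s command can use another delimiter too, here |.
renamed=$(echo "$input" | sed -e 's|^/data/|/srv/data/|')

echo "$custom"    # the two /data/ lines gain " comp2"; /home is unchanged
```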
Remove the first word on each line (including any leading spaces and the trailing space):
cat test3.txt | sed -e 's/^ *[^ ]* //'
More regular expression matching is used in this example. Here's what it is doing.
The initial ^ * is used to match any number of spaces at the beginning of the line. The [^ ]* then matches any number of characters that are not spaces (the ^ inside the brackets reverses the match on the space), so it matches a single word. The trailing space at the end matches the space found at the end of the first word. The empty replacement removes the matched text.
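A small sketch shows the expression on a few invented lines, including one with leading spaces and one with only a single word:

```shell
# Made-up test lines: plain, indented, and a single word.
input=$(printf '%s\n' 'one two three' '   padded first word' 'single')

result=$(echo "$input" | sed -e 's/^ *[^ ]* //')

# Prints: "two three", "first word", "single".
echo "$result"
```

The last line is left alone because the pattern requires a space after the first word, and a one-word line has none.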
Remove the last word on each line:
cat test3.txt | sed -e 's/^\(.*\) .*/\1/'
This command introduces back-references. A pair of escaped parentheses, \( and \), saves the part of the line matched by the pattern between them, and \1 in the replacement inserts that saved text into the result. If an additional set of parentheses were in the match pattern, its text would be recalled in the replacement as \2, and so on, up to \9. (These are distinct from sed's hold space, a separate buffer used by commands such as h and g.) In this example, the pattern contained within the parentheses matches from the start of the line up to the last space (the space after the parentheses), so everything after that space is dropped.
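With two sets of parentheses, \1 and \2 can be used to rearrange a line. As a sketch, the following turns the "First Last" names from the customer file into "Last, First" and also repeats the last-word removal on an invented line:

```shell
# Two names in the same format as the customer file.
input=$(printf '%s\n' 'Sam Jones' 'Brenda Jones')

# \1 holds the first word, \2 holds the rest of the line.
swapped=$(echo "$input" | sed -e 's/^\([^ ]*\) \(.*\)$/\2, \1/')
echo "$swapped"         # Jones, Sam / Jones, Brenda

# The article's command: keep everything before the last space.
last_removed=$(echo 'one two three' | sed -e 's/^\(.*\) .*/\1/')
echo "$last_removed"    # one two
```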
To strip everything up to and including a { and the closing } (plus any commas that follow it) from each line:
sed -e 's/^.*{\(.*\)},*/\1/' table.txt
I'll leave it to the reader to dig in to this regular expression to see how it operates. Keep this in mind: the more comfortable you are with regular expressions and saved subexpressions such as \1, the more powerful the sed command becomes.
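For readers who want to check their reading of the expression, here is a sketch that runs it on two invented lines. The real contents of table.txt are not shown in this article, so the input format is an assumption.

```shell
# Assumed input format: brace-enclosed rows, optionally indented
# and optionally followed by a comma.
input=$(printf '%s\n' '  {1, 2, 3},' '{4, 5, 6}')

# ^.*{    everything up to and including the opening brace
# \(.*\)  the text to keep, saved as \1
# },*     the closing brace and any commas after it
result=$(echo "$input" | sed -e 's/^.*{\(.*\)},*/\1/')

# Prints: "1, 2, 3" and "4, 5, 6".
echo "$result"
```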
Conclusion
sed recognizes many other commands. However, even with these basic commands, you can successfully manipulate text files from within your own shell scripts or right from the command line.
Larry Richardson develops meteorological workstation software for 3SI. He has developed software for UNIX and Windows using C and C++ for more than 13 years. Now living in Georgia with his wife and son, he enjoys playing bass in his spare time.


-------------------------------------
Example: finding a telephone number in a file named dummy. Suppose the telephone number appears in this format:
(301)594-8346

Here is how you can locate such a phone number in the file:

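sed -E 's/\([0-9]+\)[0-9]+-[0-9]+/YES/g' dummy

The phone number will be replaced by 'YES'.
Note: [0-9]+ means a run of one or more digits. The -E option selects extended regular expressions, where + has that meaning and literal parentheses must be written as \( and \).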
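A quick way to check a pattern like this is to run it on a single test line. The sketch below compares an extended-regular-expression form (-E) with an equivalent basic form that spells "one or more digits" as [0-9][0-9]* and leaves the parentheses unescaped, since plain parentheses are literal in basic regular expressions. The second phone number in the test line is made up.

```shell
# Test line: the number from the example plus one invented number.
line='Call (301)594-8346 or (202)555-0100 today'

# Extended syntax: + means "one or more"; literal parentheses are escaped.
extended=$(echo "$line" | sed -E 's/\([0-9]+\)[0-9]+-[0-9]+/YES/g')

# Basic syntax: no + operator, so repeat the bracket expression instead.
basic=$(echo "$line" | sed -e 's/([0-9][0-9]*)[0-9][0-9]*-[0-9][0-9]*/YES/g')

echo "$extended"   # Call YES or YES today
echo "$basic"      # Call YES or YES today
```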
