11.4. Text manipulation tools

Tip: Also see

See tac and cat in Section 11.2, as they can perform text manipulation too.

sort

With no options, sort orders lines alphabetically. It can be run on text files to sort them (note that it also concatenates files that are given together), and it can be used with a pipe '|' to sort the output of a command.

Use sort -r to reverse the sort order. Use the -g option to sort numerically, so that each number is compared by its value rather than character by character.

Examples:

cat shoppinglist.txt | sort

The above command would run cat on the shopping list then sort the results and display them in alphabetical order.

sort -r shoppinglist.txt

The above command would run sort on a file and sort the file in reverse alphabetical order.

Advanced sort commands:

sort is a powerful utility; here are some of its harder to learn (and less used) options. Use the -t option to set a particular symbol as the field separator, then use the -k option to specify which column to sort by, where column 1 is the first column before the separator. Add the -g option if the column holds numbers (without -g, sort compares the text character by character). Here is a complex example:

sort -t : -k 4,4g -k 1,1 /etc/passwd | more

This will sort the “/etc/passwd” file, using the colon ':' as the separator. It will sort by the 4th column (the GID field in the file) and then use the first column (the name) to resolve any ties. Writing the key as 4,4 limits it to that column alone, since a bare -k 4 would run from column 4 to the end of the line. The g is there so the column is compared as full numbers; otherwise 4000 would come before 50, because a text comparison only sees that '4' comes before '5'.
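The effect of the key options can be tried safely on made-up data instead of the real /etc/passwd. The sketch below is only an illustration: the three records are hypothetical, and LC_ALL=C is set purely so that the result does not depend on your locale.

```shell
# Made-up records in an /etc/passwd-like layout: name:password:UID:GID
records=$(printf '%s\n' 'carol:x:1003:4000' 'bob:x:1002:50' 'alice:x:1001:50')

# Sort numerically on field 4 only, then break ties on field 1 only.
numeric_sorted=$(printf '%s\n' "$records" | LC_ALL=C sort -t : -k 4,4g -k 1,1)

# The same keys without g: field 4 is compared as text, so 4000 comes first.
text_sorted=$(printf '%s\n' "$records" | LC_ALL=C sort -t : -k 4,4 -k 1,1)

echo "Numeric sort on the GID column:"
echo "$numeric_sorted"
echo "Text sort on the GID column:"
echo "$text_sorted"
```

In the numeric result alice and bob (GID 50) come before carol (GID 4000), and the tie between alice and bob is settled by the name.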

join

Joins lines from two files that share a common value in the join field (by default the first field of each line). Lines that have no match in the other file are not printed. Both files need to be sorted on the join field.
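A minimal sketch of how the matching works, using two small throwaway files. The file names and their contents are invented for this illustration.

```shell
# Work in a temporary directory so nothing of yours is touched.
workdir=$(mktemp -d)

# Both files are already sorted on their first field, which join requires.
printf '%s\n' '1 apple' '2 banana' '3 cherry' > "$workdir/names.txt"
printf '%s\n' '1 red' '3 dark-red' '4 green' > "$workdir/colours.txt"

# Only keys present in both files (1 and 3) produce an output line.
joined=$(join "$workdir/names.txt" "$workdir/colours.txt")
echo "$joined"

rm -r "$workdir"
```

Keys 2 and 4 each appear in only one file, so they are left out of the output.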

Command syntax:

join file1 file2

cut

Prints selected parts of lines (of a text file), or, in other words, removes certain sections of a line. You may wish to remove things according to tabs or commas, or anything else you can think of...

Options for cut:

  • -d --- allows you to specify another delimiter (the default is the tab character) and is used together with -f. For example ':' is often used with /etc/passwd:

    cut -d ':' -f 3 /etc/passwd

    This would print only the third field (the numeric user ID) of each line.

  • -f --- this option works with the text by columns (fields), separated according to the delimiter. For example each line of /etc/passwd starts “username:password:UID” and if you only wanted the username you would use:

    cut -d ':' -f 1 /etc/passwd

    This would get you only the usernames in /etc/passwd.

  • “,” (commas) --- used to separate numbers, these allow you to cut particular columns. For example:

    cut -d ':' -f 1,7 /etc/passwd

    This would show only the username and the shell that each user is set up with in /etc/passwd.

  • “-” (hyphen) --- used to give a range from position x to position y, for example 1-4 (would be from position 1 to position 4).

    cut -c 1-50 file1.txt

    This would cut (display) characters (columns) 1 to 50 of each line (and anything else on that line is ignored).

  • -x --- where x is a number, to cut from position 1 to “x”

  • x- --- where x is a number, to cut from “x” to the end.

    cut -c -5,8,20- file2.txt

    This would display (“cut”) characters (columns) 1 to 5, 8 and from 20 to the end.
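The options above can be tried on a single sample line without touching any real file. This is only a sketch; the sample line is a made-up entry in the /etc/passwd layout.

```shell
# A made-up line in the /etc/passwd layout.
line='alice:x:1001:100:Alice Example:/home/alice:/bin/bash'

# Fields 1 and 7 (user name and shell), using ':' as the delimiter.
name_and_shell=$(printf '%s\n' "$line" | cut -d ':' -f 1,7)

# Characters 1 to 5 of the line.
first_five=$(printf '%s\n' "$line" | cut -c -5)

# Field 6 through to the last field.
from_field_six=$(printf '%s\n' "$line" | cut -d ':' -f 6-)

echo "$name_and_shell"
echo "$first_five"
echo "$from_field_six"
```

Note that when several fields are selected, cut keeps the delimiter between them in its output.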

ispell/aspell

Used to spell check a file interactively; each one prompts you to replace a word or continue. aspell is said to be better at suggesting replacement words, but it's probably best to find out for yourself.

aspell example:

aspell -c FILE.txt

This will run aspell on a particular file called “FILE.txt”. aspell will run interactively and prompt for user input.

ispell example:

ispell FILE.txt

This will run ispell on a particular file called “FILE.txt”. ispell will run interactively and prompt for user input.

chcase

Is used to change the uppercase letters in a file name to lowercase (or vice versa).

You could also use tr to change the case of the text inside a file...

cat fileName.txt | tr '[A-Z]' '[a-z]' > newFileName.txt

The above would convert uppercase to lowercase using the file “fileName.txt” as input and outputting the results to “newFileName.txt”.

cat fileName.txt | tr '[a-z]' '[A-Z]' > newFileName.txt

The above would convert lowercase to uppercase using the file “fileName.txt” as input and outputting the results to “newFileName.txt”.

chcase (a perl script) can be found at the chcase homepage.

fmt

(format) a simple text formatter. Use fmt with the -u option to output text with "uniform spacing", where the space between words is reduced to one space character and the space between sentences is reduced to two space characters.

Example:

fmt -u myessay.txt

Will make sure the amount of space between sentences is two spaces and the amount of space between words is one space.
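As a quick way to see uniform spacing in action, the sketch below feeds fmt a single unevenly spaced line. The sample sentence is made up for this illustration.

```shell
# A line with irregular runs of spaces between its words.
messy='This   sentence has    uneven spacing.  So   does this   one.'

# -u asks fmt for uniform spacing between words and sentences.
tidy=$(printf '%s\n' "$messy" | fmt -u)

echo "Before: $messy"
echo "After:  $tidy"
```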

paste

Puts lines from two files together. Either the matching lines of each file are placed side by side (normally separated by a tab, but you can use any symbol(s) you like...), or, in serial mode, all the lines of the first file are placed on one line and all the lines of the second file on the next.

To obtain a list of lines side by side, with the first line of the first file on the left, then a tab, then the first line of the second file, you would type:

paste file1.txt file2.txt

To have the list displayed in serial (first line from the first file, [Tab], second line from the first file, then the third and fourth until the end of the first file, followed by the second file on a new line), type:

paste --serial file1.txt file2.txt
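The two modes are easiest to compare on a pair of tiny files. The sketch below creates them in a temporary directory; the file contents are invented for this illustration.

```shell
# Work in a temporary directory so nothing of yours is touched.
workdir=$(mktemp -d)

printf '%s\n' 'first line first file' 'second line first file' > "$workdir/file1.txt"
printf '%s\n' 'first line second file' 'second line second file' > "$workdir/file2.txt"

# Default: line N of file1, a tab, then line N of file2.
side_by_side=$(paste "$workdir/file1.txt" "$workdir/file2.txt")

# Serial: all lines of file1 on one row, then all lines of file2 on the next.
serial=$(paste --serial "$workdir/file1.txt" "$workdir/file2.txt")

echo "Side by side:"
echo "$side_by_side"
echo "Serial:"
echo "$serial"

rm -r "$workdir"
```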

Tip: This command is very simple to understand if you make yourself an example

It's much easier if you create an example for yourself. With just a couple of lines, I used "first line first file" and "first line second file" et cetera for a quick example.

expand

Will convert tabs to spaces and output the result. Use the option -t num to specify the size of a “tabstop”, the number of characters between each tab position (the default is 8).

Command syntax:

expand file_name.txt
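To see what the tabstop size changes, the sketch below expands the same one-tab line twice. The sample text is made up, and the square brackets in the output are only there to make the spaces visible.

```shell
# A sample line: the letter a, one tab, the letter b.
tabbed=$(printf 'a\tb')

# Default tabstops every 8 columns: the tab becomes 7 spaces.
expanded_default=$(printf '%s\n' "$tabbed" | expand)

# Tabstops every 4 columns: the tab becomes 3 spaces.
expanded_four=$(printf '%s\n' "$tabbed" | expand -t 4)

echo "[$expanded_default]"
echo "[$expanded_four]"
```

A tab always moves to the next tabstop, so the number of spaces it becomes depends on how far along the line it sits.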

unexpand

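Will convert spaces to tabs and output the result. By default only the leading spaces on each line are converted; use the -a option to convert all runs of spaces.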

Command syntax:

unexpand file_name.txt

uniq

Eliminates duplicate lines from a file, which can greatly simplify the display. uniq only compares adjacent lines, so the input is normally sorted first (for example with sort).

uniq options:

  • -c --- count the number of occurrences of each entry

  • -u --- list only unique entries

  • -d --- list only duplicate entries

For example:

uniq -cd phone_list.txt

This would display any duplicate entries only and a count of the number of times that entry has appeared.
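Because uniq only notices duplicates that sit next to each other, it is usually paired with sort. The sketch below shows the difference on a made-up list of names; LC_ALL=C is set only to keep the sort order independent of your locale.

```shell
# A made-up list in which no two identical names are adjacent.
entries=$(printf '%s\n' 'bob' 'alice' 'bob' 'carol' 'alice' 'bob')

# Without sorting, uniq sees no adjacent duplicates and -d prints nothing.
unsorted_duplicates=$(printf '%s\n' "$entries" | uniq -d)

# After sorting, identical names are adjacent and can be counted.
duplicate_counts=$(printf '%s\n' "$entries" | LC_ALL=C sort | uniq -cd)

echo "Unsorted input: [$unsorted_duplicates]"
echo "Sorted input:"
echo "$duplicate_counts"
```

The sorted run reports alice twice and bob three times; carol appears once and is therefore not listed by -d.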

tr

(translation). A filter useful to replace all instances of characters in a text file or "squeeze" the whitespace.

Example:

cat some_file | tr '3' '5' > new_file

This will run the cat program on some_file and send its output to the tr command. tr will replace all instances of 3 with 5, like a search and replace for single characters. You can also do other things such as:

cat some_file | tr '[A-Z]' '[a-z]' > new_file

This will run cat on some_file and convert any capital letters to lowercase letters (you could use this to change the case of file names too...).
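The "squeeze" use mentioned at the start of this entry relies on the -s option. A minimal sketch, with sample strings made up for this illustration:

```shell
# -s squeezes each run of the listed character down to a single one.
spaced='too    many     spaces'
squeezed=$(printf '%s\n' "$spaced" | tr -s ' ')

# Character ranges translate position by position: A to a, B to b, and so on.
lowered=$(printf '%s\n' 'Hello World' | tr 'A-Z' 'a-z')

echo "$squeezed"
echo "$lowered"
```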

Tip: Alternatives

You can also do a search and replace with a one-line Perl command; read about it at the end of this section.

nl

The number lines tool. Its default action is to write its input (either the files named as arguments, or the standard input) to the standard output, with line numbers added.

Line numbers are added to every non-empty line and the text is indented.

This command can take some more advanced numbering options; simply read the info page on it.

These advanced options mainly relate to customisation of the numbering, including different forms of separation for sections/pages/footers etc.

Also try cat -n (number all lines) or cat -b (number all non-blank lines). For more info on cat check under this section: Section 11.2
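The three commands differ mainly in how they treat blank lines. The sketch below counts how many lines each one numbers in a made-up three-line text whose middle line is empty; counting lines that contain a digit works here only because the sample text itself has no digits.

```shell
# Three lines of sample text; the second line is empty.
text=$(printf 'first\n\nthird')

echo "nl:";     printf '%s\n' "$text" | nl
echo "cat -n:"; printf '%s\n' "$text" | cat -n
echo "cat -b:"; printf '%s\n' "$text" | cat -b

# Count the numbered lines produced by each command.
nl_count=$(printf '%s\n' "$text" | nl | grep -c '[0-9]')
cat_n_count=$(printf '%s\n' "$text" | cat -n | grep -c '[0-9]')
cat_b_count=$(printf '%s\n' "$text" | cat -b | grep -c '[0-9]')

echo "numbered lines: nl=$nl_count, cat -n=$cat_n_count, cat -b=$cat_b_count"
```

With its default settings nl behaves like cat -b and skips the empty line, while cat -n numbers all three lines.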

There are two ways you can use nl:

nl some_text_file.txt

The above command would add numbers to each line of some_text_file. You could also use nl to number the output of another command, as shown in the example below:

grep some_string some_file | nl

Perl search and replace text

One way to search and replace text in a file is to use the following one-line Perl command[1]:

$ perl -pi -e "s/oldstring/newstring/g;" filespec [RET]

In this example, “oldstring” is the string to search for, “newstring” is the string to replace it with, and “filespec” is the name of the file or files to work on. You can use this for more than one file.

Example: To replace the string “helpless” with the string “helpful” in all files in the current directory, type:

$ perl -pi -e "s/helpless/helpful/g;" * [RET]

For replacing single characters you can also use tr (see further above in this section).

Tip: If these tools are too primitive

If these text tools are too simple for your purposes then you are probably looking at doing some programming or scripting.

If you would like more information on bash scripting then please see the Advanced Bash-Scripting Guide, authored by Mendel Cooper.

sed and awk are traditional UNIX system tools for working with text; this guide does not provide an explanation of them. sed works on a line-by-line basis performing substitution, and awk can perform a similar task or assist by working on a file and printing out certain information (it's a programming language).

You will normally find them installed on your GNU/Linux system and will find many tutorials all over the internet, feel free to look them up if you ever have to perform many similar operations on a text file.

Notes

[1]

This information has been taken from the Linux Cookbook (without editing). See [3] in the Bibliography for further information.