Tuesday, 1 October 2013

Character Strings

Reference

The material in this page of notes is organized after "C Programming, A Modern Approach" by K.N. King, Chapter 13. Some content may also be from other sources.

Overview

  • The C language does not have a specific "String" data type, the way some other languages such as C++ and Java do.
  • Instead C stores strings of characters as arrays of chars, terminated by a null byte.
  • This page of notes covers all the details of using strings of characters in C, including how they are stored, how to read and write them, common functions from the string library for dealing with them, and how to process command-line arguments.
  • Note that many of the concepts covered in this section also apply to other situations involving combinations of arrays, pointers, and functions.

String Literals

  • A String Literal, also known as a string constant or constant string, is a string of characters enclosed in double quotes, such as "To err is human - To really foul things up requires a computer."
  • String literals are stored in C as an array of chars, terminted by a null byte.
    • A null byte is a char having a value of exactly zero, noted as '\0'.
    • Do not confuse the null byte, '\0', with the character '0', the integer 0, the double 0.0, or the pointer NULL.
  • String literals may contain as few as one or even zero characters.
    • Do not confuse a single-character string literal, e.g. "A" with a character constant, 'A'. The former is actually two characters, because of the terminating null byte stored at the end.
    • An empty string, "", consists of only the null byte, and is considered to have a string length of zero, because the terminating null byte does not count when determining string lengths.
  • String literals may contain any valid characters, including escape sequences such as \n, \t, etc. Octal and hexadecimal escape sequences are technically legal in string literals, but not as commonly used as they are in character constants, and have some potential problems of running on into following text.

Passing String Literals to Functions

  • String literals are passed to functions as pointers to a stored string.
  • For example, given the statement:
    printf( "Please enter a positive value for the angle: " );
the string literal "Please enter . . . " will be stored somewhere in memory, and the address will be passed to printf. The first argument to printf is actually defined as a char *.

Continuing a String Literal

If a string literal needs to be continued across multiple lines, there are three options:
  1. If the new line character is included between the double quotes, then it is included as part of the string:
  2.     printf( "This will
             print over three
             lines, ( and will include extra tabs or spaces. )" );
  3. Escaping the newline character with the backslash will cause it to be ignored:
  4.     printf( "This will \
             print over a single \
             line, ( but will still include extra tabs or spaces. )" );
  5. However the best approach is to note that when two strings appear adjacent to one another, C will automatically concatenate them into a single string:
  6.     printf( "This will "
             "print over a single "
             "line, ( without extra tabs or spaces! )" );

Operations on String Literals

  • Character pointers may hold the address of string literals:
    char *prompt = "Please enter the next number: ";
  • String literals may be subscripted:

    • "abc" [ 2 ] results in 'c'
    • "abc" [ 0 ] results in 'a'
    • "abc" [ 3 ] results in the null byte.

  • Attempting to modify a string literal is undefined, and may cause problems in different ways depending on the compiler:
    prompt[ 0 ] = 'S'; // From above. Don't do this.

String Variables

String variables are typically stored as arrays of chars, terminated by a null byte.

Initializing String Variables

String variables can be initialized either with individual characters or more commonly and easily with string literals:
char s1[ ] = { 'h', 'e', 'l', 'l', 'o', '\0' };
char s2[ ] = "hello"; // Array of size six
char s3[ 20 ] = "hello"; // Fifteen null bytes
char s4[ 5 ] = "hello"; // Five chars, sacrifices the null byte
char s5[ 3 ] = "hello"; // Illegal

Character Arrays versus Character Pointers

Character arrays and character pointers are often interchangeable, but there can be some very important differences. Consider the following two variables:
char s6[ ] = "hello", *s7 = "hello";
  • s6 is a fixed constant address, determined by the compiler; s7 is a pointer variable, that can be changed to point elsewhere.
  • s6 allocates space for exactly 6 bytes; s7 allocates space for 10 ( typically ) - 6 for the characters plus another 4 for the pointer veriable.
  • The contents of s6, can be changed, e.g. s6[ 0 ] = 'J'; The contents of s7 should not be changed.

Reading and Writing Strings

Writing Strings Using printf and puts

  • The standard format specifier for printing strings with printf is %s
    • The width parameter can be used to print a short string in a long space
    • The .precision parameter limits the length of longer strings.
  • int puts ( const char * str );
    • Writes the string out to standard output ( the screen ) and automatically appends a newline character at the end.
      This is a lower-level routine that printf, without formatting, and possibly faster.

Reading Strings Using scanf, gets, and fgets

  • The standard format specifier for reading strings with scanf is %s
    • scanf reads in a string of characters, only up to the first non-whitespace character. I.e. it stops reading when it encounters a space, tab, or newline character.
    • scanf appends a null byte to the end of the character string stored.
    • scanf does skip over any leading whitespace characters in order to find the first non-whitespace character.
    • The width field can be used to limit the maximum number of characters read.
    • The argument passed to scanf needs to be of type pointer-to-character, i.e. char *, and is normally the name of a char array. ( In this case the size of the array -1 should be given as the width field, to make sure scanf can append the null byte without overflowing the array.
  • char * gets ( char * str );
    • The gets function will read a string of characters from standard input ( the keyboard ) until it encounters either a newline character or the end of file.
    • Unlike scanf, gets does not care about spaces in the input, treating them the same as any other character.
    • Note that gets has no provision for limiting the number of characters read, leading to possible overflow problems.
    • When gets encounters a new line character, it stops reading, but does NOT store it.
    • A null byte is always appended to the end of the string of stored characters.
  • char * fgets ( char * str, int num, FILE * stream );
    • fgets is equivalent to gets, except for two additional arguments and one key difference:
      • num provides for a limit on the maximum number of characters read
      • The FILE * argument allows fgets to read from any file. stdin may be entered to read from the keyboard.
      • fgets stops reading when it encounters a new line character, and it DOES store the newline character.
      • A null byte is always appended to the end of the string of stored characters.

Reading Strings Character by Character

  • Another option for reading in strings is to read them in character by character.
  • This is useful when you don't know how long the string might be, or if you want to consider other stopping conditions besides spaces and newlines. ( E.g. stop on periods, or when two successive slashes, //, are encountered. )
  • The scanf format specifier for reading individual characters is %c
    • If a width greater than 1 is given, then multiple characters are read, and stored in successive positions in a char array.
  • Characters scanned this way are often stored in an array, although the could also be read into an ordinary char variable for further processing first.

Accessing the Characters in a String

Characters in a string can be acessed using either the [ ] notation or a char pointer.

Using the C String Library

  • The C language includes a standard library full of routines for working with null-byte terminated arrays of chars.
  • #include < string.h > to use any of these functions.
  • CPlusPlus.com documents the library at http://cplusplus.com/reference/cstring/
  • Some of the more commonly used and useful routines include:

size_t strlen ( const char * str );

  • Returns the length of the string, not counting the null byte

char * strcpy ( char * destination, const char * source );

  • Copies characters from source to destination, until a null byte is eventually encountered

char * strncpy ( char * destination, const char * source, size_t num );

  • Copies from source to destination, up to a maximum of num characters.
  • Pads destination with extra null bytes if the length of source is smaller than num.
  • May sacrifice the null byte if the source is longer than num.

char * strcat ( char * destination, const char * source );

  • Concatenates ( appends ) source onto the end of destination.
  • The terminating null byte of destination is over-written by the first char of source, and a new null-byte is appended at the end of the concatenated string.
  • char * strncat ( char * destination, char * source, size_t num ); is a version that sets a limit on the number of characters concatenated.

int strcmp ( const char * str1, const char * str2 );

  • Compares strings str1 and str2, up until the first encountered null byte
    • Returns zero if the two strings are equal
    • Returns a positive value ( 1? ) if the first encountered difference has a larger value in str1 than str2
    • Returns a negative value ( -1? ) if the first encountered differenc has a smaller value in str1 than str2
  • int strncmp ( const char * str1, const char * str2, size_t num ); checks only up to num characters.
  • The methods stricmp and strnicmp do case-insensitive comparisons, but may be available for C++ programs only.

int atoi (const char * str);
double atof (const char* str);
long int atol ( const char * str );
long long int atoll ( const char * str );

  • parses a string of numeric characters into a number of type int, double, long int, or long long int, respectively.
  • Most often used when parsing command line arguments.
  • Can also be used for rigorous checking of user input - Read in an entire line into a char array using gets, fgets, or char-by-char, then rigorously test all the characters in the string, ( e.g. are they all numeric ), and then use one of these routines to convert the char array into a number once all the checks have been made.

sprintf( char [ ], format [ , data . . . ] );

  • "Prints" into a character array, similar to printf. Useful for converting data from other data types into a character string and/or combining things.

sscanf( char *, format [ , address . . . ] );

  • Reads data from a character string, just as if scanf read it from the keyboard.
  • One approach to rigorous input checking is to use getline( ) to read in an entire line of input from a user, then examine it character by character for any problems, e.g. non-numeric characters when trying to read in a number. Then when the string has been verified, use scanf to read in the data from the character string.
See also strchr, strrchr, strstr, and strtok for searching and tokenizing strings.

String Idioms

Searching for the End of a String

Copying a String

		void strcpy( const char * s, char * d ) {

			do {
				*d++ = *s++;
			} while( *s );

			return;
		}

Arrays of Strings

  • There is a subtle difference between:
    • a one-dimensional arrays of pointers to chars, ( char * [ ] ) and
    • a two-dimensional array of chars, ( char [ ] [ ] )
  • Both can be accessed using double subscripts
  • However the storage is different.

Command Line Arguments

  • int main( void );

  • int main( int argc, char * argv[ ] );
  • int main( int argc, char ** argv ); // Same as above
    • argv is an array of character pointers, of length argc

  • int main( int argc, char ** argv, char ** envp );
    • envp is an array of character pointers, terminated by a NULL pointer

  • Sample Program: argprinter.c

  • If integers, doubles, etc. need to be read in from the command line, use atoi, atof, etc. to convert them from character strings.

No comments:

Post a Comment