Saturday, 30 November 2013

Complexity and Big-O Notation

An important question is: How efficient is an algorithm or piece of code? Efficiency covers lots of resources, including:
  • CPU (time) usage
  • memory usage
  • disk usage
  • network usage
All are important but we will mostly talk about CPU time in 367. Other classes will discuss other resources (e.g., disk usage may be an important topic in a database class).
Be careful to differentiate between:
  1. Performance: how much time/memory/disk/... is actually used when a program is run. This depends on the machine, compiler, etc. as well as the code.
  2. Complexity: how do the resource requirements of a program or algorithm scale, i.e., what happens as the size of the problem being solved gets larger.
Complexity affects performance but not the other way around.
The time required by a method is proportional to the number of "basic operations" that it performs. Here are some examples of basic operations:
  • one arithmetic operation (e.g., +, *).
  • one assignment
  • one test (e.g., x == 0)
  • one read
  • one write (of a primitive type)
Some methods perform the same number of operations every time they are called. For example, the size method of the List class always performs just one operation: return numItems; the number of operations is independent of the size of the list. We say that methods like this (that always perform a fixed number of basic operations) require constant time.
Other methods may perform different numbers of operations, depending on the value of a parameter or a field. For example, for the array implementation of the List class, the remove method has to move over all of the items that were to the right of the item that was removed (to fill in the gap). The number of moves depends both on the position of the removed item and the number of items in the list. We call the important factors (the parameters and/or fields whose values affect the number of operations performed) the problem size or the input size.
When we consider the complexity of a method, we don't really care about the exact number of operations that are performed; instead, we care about how the number of operations relates to the problem size. If the problem size doubles, does the number of operations stay the same? double? increase in some other way? For constant-time methods like the size method, doubling the problem size does not affect the number of operations (which stays the same).
Furthermore, we are usually interested in the worst case: what is the most operations that might be performed for a given problem size (other cases -- best case and average case -- are discussed below). For example, as discussed above, the remove method has to move all of the items that come after the removed item one place to the left in the array. In the worst case, all of the items in the array must be moved. Therefore, in the worst case, the time for remove is proportional to the number of items in the list, and we say that the worst-case time for remove is linear in the number of items in the list. For a linear-time method, if the problem size doubles, the number of operations also doubles.


TEST YOURSELF #1
Assume that lists are implemented using an array. For each of the following List methods, say whether (in the worst case) the number of operations is independent of the size of the list (is a constant-time method), or is proportional to the size of the list (is a linear-time method):
  • the constructor
  • add (to the end of the list)
  • add (at a given position in the list)
  • isEmpty
  • contains
  • get
Answer:

  • constructor: This method allocates the initial array, sets current to -1 and sets numItems to 0. This has nothing to do with the sequence size and is constant-time.
  • add (to the end of the list): In the worst case, the array was full and you have to allocate a new, larger array, and copy all items. In this case the number of operations is proportional to the size of the list. If the array is not full, this is a constant-time operation (because all you have to do is copy one value into the array and increment numItems).
  • add (at a given position in the list): As for the other version of add, if the array is full, time proportional to the size of the list is required to copy the values from the old array to the new array. However, even if the array is not full, this version of add can require time proportional to the size of the list. This is because, when adding at position k, all of the items in positions k to the end must be moved over. In the worst case (when the new item is added at the beginning of the list), this requires moving all items over, and that takes time proportional to the number of items in the list.
  • isEmpty: This method simply returns the result of comparing numItems with 0; this is a constant=time operation.
  • contains: This method involves looking at each item in the list in turn to see if it is equal to the given item. In the worst case (the given item is at the end of the list or is not in the list at all), this takes time proportional to the size of the list.
  • get: This method checks for a bad position and either throws an exception or returns the value in the given position in the array. In either case it is independent of the size of the list and so it is a constant-time operation.

Constant and linear times are not the only possibilities. For example, consider method createList:
    List createList( int N ) {
      List L = new List();
      for (int k=1; k<=N; k++) L.add(0, new Integer(k));
      return L;
    }
    
Note that, for a given N, the for-loop above is equivalent to:
    L.add(0, new Integer(1) );
    L.add(0, new Integer(2) );
    L.add(0, new Integer(3) );
        ...
    L.add(0, new Integer(N) );
    
If we assume that the initial array is large enough to hold N items, then the number of operations for each call to add is proportional to the number of items in the list when add is called (because it has to move every item already in the array one place to the right to make room for the new item at position 0). For the N calls shown above, the list lengths are: 0, 1, 2, ..., N-1. So what is the total time for all N calls? It is proportional to 0 + 1 + 2 + ... + N-1.
Recall that we don't care about the exact time, just how the time depends on the problem size. For method createList, the "problem size" is the value of N (because the number of operations will be different for different values of N). It is clear that the time for the N calls (and therefore the time for method createList) is not independent of N (so createList is not a constant-time method). Is it proportional to N (linear in N)? That would mean that doubling N would double the number of operations performed by createList. Here's a table showing the value of 0+1+2+...+(N-1) for some different values of N:

N0+1+2+...+(N-1)
46
828
16120
Clearly, the value of the sum does more than double when the value of N doubles, so createList is not linear in N. In the following graph, the bars represent the lengths of the list (0, 1, 2, ..., N-1) for each of the N calls.
The value of the sum (0+1+2+...+(N-1)) is the sum of the areas of the individual bars. You can see that the bars fill about half of the square. The whole square is an N-by-N square, so its area is N2; therefore, the sum of the areas of the bars is about N2/2. In other words, the time for method createList is proportional to the square of the problem size; if the problem size doubles, the number of operations will quadruple. We say that the worst-case time for createList is quadratic in the problem size.


TEST YOURSELF #2
Consider the following three algorithms for determining whether anyone in the room has the same birthday as you.
Algorithm 1: You say your birthday, and ask whether anyone in the room has the same birthday. If anyone does have the same birthday, they answer yes.
Algorithm 2: You tell the first person your birthday, and ask if they have the same birthday; if they say no, you tell the second person your birthday and ask whether they have the same birthday; etc, for each person in the room.
Algorithm 3: You only ask questions of person 1, who only asks questions of person 2, who only asks questions of person 3, etc. You tell person 1 your birthday, and ask if they have the same birthday; if they say no, you ask them to find out about person 2. Person 1 asks person 2 and tells you the answer. If it is no, you ask person 1 to find out about person 3. Person 1 asks person 2 to find out about person 3, etc.
Question 1: For each algorithm, what is the factor that can affect the number of questions asked (the "problem size")?
Question 2: In the worst case, how many questions will be asked for each of the three algorithms?
Question 3: For each algorithm, say whether it is constant, linear, or quadratic in the problem size in the worst case.

Answer:
Question 1: The problem size is the number of people in the room.
Question 2: Assume there are N people in the room. In algorithm 1 you always ask 1 question. In algorithm 2, the worst case is if no one has your birthday. Here you have to ask every person to figure this out. This is N questions. In algorithm 3, the worst case is the same as algorithm 2. The number of questions is 1 + 2 + 3 + ... + N-1 + N. We showed before that this sum is N(N+1)/2.

Question 3: Given the number of questions you can see that algorithm 1 is constant time, algorithm 2 is linear time, and algorithm 3 is quadratic time in the problem size.

Big-O Notation

We express complexity using big-O notation. For a problem of size N:
  • a constant-time method is "order 1": O(1)
  • a linear-time method is "order N": O(N)
  • a quadratic-time method is "order N squared": O(N2)
Note that the big-O expressions do not have constants or low-order terms. This is because, when N gets large enough, constants and low-order terms don't matter (a constant-time method will be faster than a linear-time method, which will be faster than a quadratic-time method). See below for an example.
Formal definition:
    A function T(N) is O(F(N)) if for some constant c and for all values of N greater than some value n0:
    T(N) <= c * F(N)
The idea is that T(N) is the exact complexity of a method or algorithm as a function of the problem size N, and that F(N) is an upper-bound on that complexity (i.e., the actual time/space or whatever for a problem of size N will be no worse than F(N)). In practice, we want the smallest F(N) -- the least upper bound on the actual complexity.
For example, consider T(N) = 3 * N2 + 5. We can show that T(N) is O(N2) by choosing c = 4 and n0 = 2. This is because for all values of N greater than 2:
3 * N2 + 5 <= 4 * N2
T(N) is not O(N), because whatever constant c and value n0 you choose, I can always find a value of N greater than n0 so that 3 * N2 + 5 is greater than c * N.

How to Determine Complexities

In general, how can you determine the running time of a piece of code? The answer is that it depends on what kinds of statements are used.
  1. Sequence of statements
      statement 1;
      statement 2;
        ...
      statement k;
      
    (Note: this is code that really is exactly k statements; this is not an unrolled loop like the N calls to add shown above.) The total time is found by adding the times for all statements:total time = time(statement 1) + time(statement 2) + ... + time(statement k)
    If each statement is "simple" (only involves basic operations) then the time for each statement is constant and the total time is also constant: O(1). In the following examples, assume the statements are simple unless noted otherwise.
  2. if-then-else statements
      if (condition) {
          sequence of statements 1
      }
      else {
          sequence of statements 2
      }
      
    Here, either sequence 1 will execute, or sequence 2 will execute. Therefore, the worst-case time is the slowest of the two possibilities: max(time(sequence 1), time(sequence 2)). For example, if sequence 1 is O(N) and sequence 2 is O(1) the worst-case time for the whole if-then-else statement would be O(N).
  3. for loops
      for (i = 0; i < N; i++) {
          sequence of statements
      }
      
    The loop executes N times, so the sequence of statements also executes N times. Since we assume the statements are O(1), the total time for the for loop is N * O(1), which is O(N) overall.
  4. Nested loopsFirst we'll consider loops where the number of iterations of the inner loop is independent of the value of the outer loop's index. For example:
      for (i = 0; i < N; i++) {
          for (j = 0; j < M; j++) {
              sequence of statements
          }
      }
      
    The outer loop executes N times. Every time the outer loop executes, the inner loop executes M times. As a result, the statements in the inner loop execute a total of N * M times. Thus, the complexity is O(N * M). In a common special case where the stopping condition of the inner loop is j < N instead of j < M (i.e., the inner loop also executes N times), the total complexity for the two loops is O(N2).Now let's consider nested loops where the number of iterations of the inner loop depends on the value of the outer loop's index. For example:
      for (i = 0; i < N; i++) {
          for (j = i+1; j < N; j++) {
              sequence of statements
          }
      }
      
    Now we can't just multiply the number of iterations of the outer loop times the number of iterations of the inner loop, because the inner loop has a different number of iterations each time. So let's think about how many iterations that inner loop has. That information is given in the following table:
    Value of iNumber of iterations of inner loop
    0N
    1N-1
    2N-2
    ......
    N-22
    N-11
    So we can see that the total number of times the sequence of statements executes is: N + N-1 + N-2 + ... + 3 + 2 + 1. We've seen that formula before: the total is O(N2).

    TEST YOURSELF #3
    What is the worst-case complexity of the each of the following code fragments?
    1. Two loops in a row:
        for (i = 0; i < N; i++) {
            sequence of statements
        }
        for (j = 0; j < M; j++) {
            sequence of statements
        }
        
      How would the complexity change if the second loop went to N instead of M?
    2. A nested loop followed by a non-nested loop:
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                sequence of statements
            }
        }
        for (k = 0; k < N; k++) {
            sequence of statements
        }
        
    3. A nested loop in which the number of times the inner loop executes depends on the value of the outer loop index:
        for (i = 0; i < N; i++) {
            for (j = N; j > i; j--) {
                sequence of statements
            }
        }
        
    Answer:
    1. The first loop is O(N) and the second loop is O(M). Since you don't know which is bigger, you say this is O(N+M). This can also be written as O(max(N,M)). In the case where the second loop goes to N instead of M the complexity is O(N). You can see this from either expression above. O(N+M) becomes O(2N) and when you drop the constant it is O(N). O(max(N,M)) becomes O(max(N,N)) which is O(N).
    2. The first set of nested loops is O(N2) and the second loop is O(N). This is O(max(N2,N)) which is O(N2).
    3. This is very similar to our earlier example of a nested loop where the number of iterations of the inner loop depends on the value of the index of the outer loop. The only difference is that in this example the inner-loop index is counting down from N to i+1. It is still the case that the inner loop executes N times, then N-1, then N-2, etc, so the total number of times the innermost "sequence of statements" execites is O(N2).

  5. Statements with method calls:When a statement involves a method call, the complexity of the statement includes the complexity of the method call. Assume that you know that method f takes constant time, and that method g takes time proportional to (linear in) the value of its parameter k. Then the statements below have the time complexities indicated.
      f(k);  // O(1)
      g(k);  // O(k)
      
    When a loop is involved, the same rule applies. For example:
      for (j = 0; j < N; j++) g(N);
      
    has complexity (N2). The loop executes N times and each method call g(N) is complexity O(N).

    TEST YOURSELF #4
    For each of the following loops with a method call, determine the overall complexity. As above, assume that method f takes constant time, and that method g takes time linear in the value of its parameter.
      1. for (j = 0; j < N; j++) f(j);
      
      2. for (j = 0; j < N; j++) g(j);
      
      3. for (j = 0; j < N; j++) g(k);
      
    Answer: 

  1. Each call to f(j) is O(1). The loop executes N times so it is N x O(1) or O(N).
  2. The first time the loop executes j is 0 and g(0) takes "no operations". The next time j is 1 and g(1) takes 1 operations. The last time the loop executes j is N-1 and g(N-1) takes N-1 operations. The total work is the sum of the first N-1 numbers and is O(N2).
  3. Each time through the loop g(k) takes k operations and the loop executes N times. Since you don't know the relative size of k and N, the overall complexity is O(N x k).

Best-case and Average-case Complexity

Some methods may require different amounts of time on different calls, even when the problem size is the same for both calls. For example, consider the add method that adds an item to the end of the list. In the worst case (the array is full), that method requires time proportional to the number of items in the list (because it has to copy all of them into the new, larger array). However, when the array is not full, add will only have to copy one value into the array, so in that case its time is independent of the length of the list; i.e., constant time.
In general, we may want to consider the best and average time requirements of a method as well as its worst-case time requirements. Which is considered the most important will depend on several factors. For example, if a method is part of a time-critical system like one that controls an airplane, the worst-case times are probably the most important (if the plane is flying towards a mountain and the controlling program can't make the next course correction until it has performed a computation, then the best-case and average-case times for that computation are not relevant -- the computation needs to be guaranteed to be fast enough to finish before the plane hits the mountain).
On the other hand, if occasionally waiting a long time for an answer is merely inconvenient (as opposed to life-threatening), it may be better to use an algorithm with a slow worst-case time and a fast average-case time, rather than one with so-so times in both the average and worst cases.
Note that calculating the average-case time for a method can be tricky. You need to consider all possible values for the important factors, and whether they will be distributed evenly.

When do Constants Matter?

Recall that when we use big-O notation, we drop constants and low-order terms. This is because when the problem size gets sufficiently large, those terms don't matter. However, this means that two algorithms can have thesame big-O time complexity, even though one is always faster than the other. For example, suppose algorithm 1 requires N2 time, and algorithm 2 requires 10 * N2 + N time. For both algorithms, the time is O(N2), but algorithm 1 will always be faster than algorithm 2. In this case, the constants and low-order terms do matter in terms of which algorithm is actually faster.
However, it is important to note that constants do not matter in terms of the question of how an algorithm "scales" (i.e., how does the algorithm's time change when the problem size doubles). Although an algorithm that requires N2 time will always be faster than an algorithm that requires 10*N2 time, for both algorithms, if the problem size doubles, the actual time will quadruple.
When two algorithms have different big-O time complexity, the constants and low-order terms only matter when the problem size is small. For example, even if there are large constants involved, a linear-time algorithm will always eventually be faster than a quadratic-time algorithm. This is illustrated in the following table, which shows the value of 100*N (a time that is linear in N) and the value of N2/100 (a time that is quadratic in N) for some values of N. For values of N less than 104, the quadratic time is smaller than the linear time. However, for all values of N greater than 104, the linear time is smaller.
N100*NN2/100
102104102
103105104
104106106
105107108
1061081010
1071091012

A brief overview of Complexity Theory

Complexity Theory is concerned with the study of the intrinsic complexity of computational tasks. Its ``final'' goals include the determination of the complexity of any well-defined task. Additional ``final'' goals include obtaining an understanding of the relations between various computational phenomena (e.g., relating one fact regarding computational complexity to another). Indeed, we may say that the former type of goals is concerned with absolute answers regarding specific computational phenomena, whereas the latter type is concerned with questions regarding the relation between computational phenomena.
Interestingly, the current success of Complexity Theory in coping with the latter type of goals has been more significant. In fact, the failure to resolve questions of the ``absolute'' type, led to the flourishing of methods for coping with questions of the ``relative'' type. Putting aside for a moment the frustration caused by the failure, we must admit that there is something fascinating in the success: in some sense, establishing relations between phenomena is more revealing than making statements about each phenomenon. Indeed, the first example that comes to mind is the theory of NP-completeness. Let us consider this theory, for a moment, from the perspective of these two types of goals.
Complexity theory has failed to determine the intrinsic complexity of tasks such as finding a satisfying assignment to a given (satisfiable) propositional formula or finding a 3-coloring of a given (3-colorable) graph. But it has established that these two seemingly different computational tasks are in some sense the same (or, more precisely, are computationally equivalent). The author finds this success amazing and exciting, and hopes that the reader shares his feeling. The same feeling of wonder and excitement is generated by many of the other discoveries of Complexity theory. Indeed, the reader is invited to join a fast tour of some of the other questions and answers that make up the field of Complexity theory.
We will indeed start with the ``P versus NP Question''. Our daily experience is that it is harder to solve a problem than it is to check the correctness of a solution (e.g., think of either a puzzle or a research problem). Is this experience merely a coincidence or does it represent a fundamental fact of life (or a property of the world)? Could you imagine a world in which solving any problem is not significantly harder than checking a solution to it? Would the term ``solving a problem'' not lose its meaning in such a hypothetical (and impossible in our opinion) world? The denial of the plausibility of such a hypothetical world (in which ``solving'' is not harder than ``checking'') is what ``P different from NP'' actually means, where P represents tasks that are efficiently solvable and NP represents tasks for which solutions can be efficiently checked.
The mathematically (or theoretically) inclined reader may also consider the task of proving theorems versus the task of verifying the validity of proofs. Indeed, finding proofs is a special type of the aforementioned task of ``solving a problem'' (and verifying the validity of proofs is a corresponding case of checking correctness). Again, ``P different from NP'' means that there are theorems that are harder to prove than to be convinced of their correctness when presented with a proof. This means that the notion of a proof is meaningful (i.e., that proofs do help when trying to be convinced of the correctness of assertions). Here NP represents sets of assertions that can be efficiently verified with the help of adequate proofs, and P represents sets of assertions that can be efficiently verified from scratch (i.e., without proofs).
In light of the foregoing discussion it is clear that the P-versus-NP Question is a fundamental scientific question of far-reaching consequences. The fact that this question seems beyond our current reach led to the development of the theory of NP-completeness. Loosely speaking, this theory identifies a set of computational problems that are as hard as NP. That is, the fate of the P-versus-NP Question lies with each of these problems: if any of these problems is easy to solve then so are all problems in NP. Thus, showing that a problem is NP-complete provides evidence to its intractability (assuming, of course, ``P different than NP''). Indeed, demonstrating NP-completeness of computational tasks is a central tool in indicating hardness of natural computational problems, and it has been used extensively both in computer science and in other disciplines. NP-completeness indicates not only the conjectured intractability of a problem but rather also its ``richness'' in the sense that the problem is rich enough to ``encode'' any other problem in NP. The use of the term ``encoding'' is justified by the exact meaning of NP-completeness, which in turn is based on establishing relations between different computational problems (without referring to their ``absolute'' complexity).
The foregoing discussion of the P-versus-NP Question also hints to the importance of representation, a phenomenon that is central to complexity theory. In general, complexity theory is concerned with problems the solutions of which are implicit in the problem's statement. That is, the problem contains all necessary information, and one merely needs to process this information in order to supply the answer. Thus, complexity theory is concerned with manipulation of information, and its transformation from one representation (in which the information is given) to another representation (which is the one desired). Indeed, a solution to a computational problem is merely a different representation of the information given; that is, a representation in which the answer is explicit rather than implicit. For example, the answer to the question of whether or not a given Boolean formula is satisfiable is implicit in the formula itself (but the task is to make the answer explicit). Thus, complexity theory clarifies a central issue regarding representation; that is, the distinction between what is explicit and what is implicit in a representation. Furthermore, it even suggests a quantification of the level of non-explicitness.
In general, complexity theory provides new viewpoints on various phenomena that were considered also by past thinkers. Examples include the aforementioned concepts of proofs and representation as well as concepts like randomness, knowledge, interaction, secrecy and learning. We next discuss some of these concepts and the perspective offered by complexity theory.
The concept of randomness has puzzled thinkers for ages. Their perspective can be described as ontological: they asked ``what is randomness'' and wondered whether it exist at all (or is the world deterministic). The perspective of complexity theory is behavioristic: it is based on defining objects as equivalent if they cannot be told apart by any efficient procedure. That is, a coin toss is (defined to be) ``random'' (even if one believes that the universe is deterministic) if it is infeasible to predict the coin's outcome. Likewise, a string (or a distribution of strings) is ``random'' if it is infeasible to distinguish it from the uniform distribution (regardless of whether or not one can generate the latter). Interestingly, randomness (or rather pseudorandomness) defined this way is efficiently expandable; that is, under a reasonable complexity assumption (to be discussed next), short pseudorandom strings can be deterministically expanded into long pseudorandom strings. Indeed, it turns out that randomness is intimately related to intractability. Firstly, note that the very definition of pseudorandomness refers to intractability (i.e., the infeasibility of distinguishing a pseudorandomness object from a uniformly distributed object). Secondly, as hinted above, a complexity assumption that refers to the existence of functions that are easy to evaluate but hard to invert (called one-way functions) imply the existence of deterministic programs (called pseudorandom generators) that stretch short random seeds into long pseudorandom sequences. In fact, it turns out that the existence of pseudorandom generators is equivalent to the existence of one-way functions.
Complexity theory offers its own perspective on the concept of knowledge (and distinguishes it from information). It views knowledge as the result of a hard computation. Thus, whatever can be efficiently done by anyone is not considered knowledge. In particular, the result of an easy computation applied to publicly available information is not considered knowledge. In contrast, the value of a hard to compute function applied to publicly available information is knowledge, and if somebody provides you with such a value then it has provided you with knowledge. This discussion is related to the notion of zero-knowledge interactions, which are interactions in which no knowledge is gained. Such interactions may still be useful, because they may assert the correctness of specific data that was provided beforehand.
The foregoing paragraph has explicitly referred to interaction. It has pointed one possible motivation for interaction: gaining knowledge. It turns out that interaction may help in a variety of other contexts. For example, it may be easier to verify an assertion when allowed to interact with a prover rather than when reading a proof. Put differently, interaction with some teacher may be more beneficial than reading any book. We comment that the added power of such interactive proofs is rooted in their being randomized (i.e., the verification procedure is randomized), because if the verifier's questions can be determined beforehand then the prover may just provide the transcript of the interaction as a traditional written proof.
Another concept related to knowledge is that of secrecy: knowledge is something that one party has while another party does not have (and cannot feasibly obtain by itself) - thus, in some sense knowledge is a secret. In general, complexity theory is related to Cryptography, where the latter is broadly defined as the study of systems that are easy to use but hard to abuse. Typically, such systems involve secrets, randomness and interaction as well as a complexity gap between the ease of proper usage and the infeasibility of causing the system to deviate from its prescribed behavior. Thus, much of Cryptography is based on complexity theoretic assumptions and its results are typically transformations of relatively simple computational primitives (e.g., one-way functions) into more complex cryptographic applications (e.g., a secure encryption scheme).
We have already mentioned the context of learning when referring to learning from a teacher versus learning from a book. Recall that complexity theory provides evidence to the advantage of the former. This is in the context of gaining knowledge about publicly available information. In contrast, computational learning theory is concerned with learning objects that are only partially available to the learner (i.e., learning a function based on its value at a few random locations or even at locations chosen by the learner). Complexity theory sheds light on the intrinsic limitations of learning (in this sense).
Complexity theory deals with a variety of computational tasks. We have already mentioned two fundamental types of tasks: searching for solutions (or ``finding solutions'') and making decisions (e.g., regarding the validity of assertion). We have also hinted that in some cases these two types of tasks can be related. Now we consider two additional types of tasks: counting the number of solutions and generating random solutions. Clearly, both the latter tasks are at least as hard as finding arbitrary solutions to the corresponding problem, but it turns out that for some natural problems they are not significantly harder. Specifically, under some natural conditions on the problem, approximately counting the number of solutions and generating an approximately random solution is not significantly harder than finding an arbitrary solution.
Having mentioned the notion of approximation, we note that the study of the complexity of finding approximate solutions has also received a lot of attention. One type of approximation problems refers to an objective function defined on the set of potential solutions. Rather than finding a solution that attains the optimal value, the approximation task consists of finding a solution that attains an ``almost optimal'' value, where the notion of ``almost optimal'' may be understood in different ways giving rise to different levels of approximation. Interestingly, in many cases even a very relaxed level of approximation is as difficult to achieve as the original (exact) search problem (i.e., finding an approximate solution is as hard as finding an optimal solution). Surprisingly, these hardness of approximation results are related to the study of probabilistically checkable proofs, which are proofs that allow for ultra-fast probabilistic verification. Amazingly, every proof can be efficiently transformed into one that allows for probabilistic verification based on probing a constant number of bits (in the alleged proof). Turning back to approximation problems, we note that in other cases a reasonable level of approximation is easier to achieve than solving the original (exact) search problem.
Approximation is a natural relaxation of various computational problems. Another natural relaxation is the study of average-case complexity, where the ``average'' is taken over some ``simple'' distributions (representing a model of the problem's instances that may occur in practice). We stress that, although it was not stated explicitly, the entire discussion so far has referred to ``worst-case'' analysis of algorithms. We mention that worst-case complexity is a more robust notion than average-case complexity. For starters, one avoids the controversial question of what are the instances that are ``important in practice'' and correspondingly the selection of the class of distributions for which average-case analysis is to be conducted. Nevertheless, a relatively robust theory of average-case complexity has been suggested, albeit it is far less developed than the theory of worst-case complexity.
In view of the central role of randomness in complexity theory (as evident, say, in the study of pseudorandomness, probabilistic proof systems, and cryptography), one may wonder as to whether the randomness needed for the various applications can be obtained in real-life. One specific question, which received a lot of attention, is the possibility of ``purifying'' randomness (or ``extracting good randomness from bad sources''). That is, can we use ``defected'' sources of randomness in order to implement almost perfect sources of randomness. The answer depends, of course, on the model of such defected sources. This study turned out to be related to complexity theory, where the most tight connection is between some type of randomness extractors and some type of pseudorandom generators.
So far we have focused on the time complexity of computational tasks, while relying on the natural association of efficiency with time. However, time is not the only resource one should care about. Another important resource is space: the amount of (temporary) memory consumed by the computation. The study of space complexity has uncovered several fascinating phenomena, which seem to indicate a fundamental difference between space complexity and time complexity. For example, in the context of space complexity, verifying proofs of validity of assertions (of any specific type) has the same complexity as verifying proofs of invalidity for the same type of assertions.
In case the reader feels dizzy, it is no wonder. We took an ultra-fast air-tour of some mountain tops, and dizziness is to be expected. Needless to say, the rest of the course will be in a totally different style. We will climb some of these mountains by foot, step by step, and will stop to look around and reflect.
Absolute Results (a.k.a. Lower-Bounds). As stated up-front, absolute results are not known for many of the ``big questions'' of complexity theory (most notably the P-versus-NP Question). However, several highly non-trivial absolute results have been proved. For example, it was shown that using negation can speed-up the computation of monotone functions (which do not require negation for their mere computation). In addition, many promising techniques were introduced and employed with the aim of providing a low-level analysis of the progress of computation. However, the focus of this course is elsewhere.

Sunday, 3 November 2013

Finite Automata

finite automaton (FA, also called a finite-state automaton or a finite-state machine) is a mathematical tool used to describe processes involving inputs and outputs. An FA can be in one of several states and can switch between states depending on symbols that it inputs. Once it settles in a state, it reads the next input symbol, performs some computational task associated with the new input, outputs a symbol, and switches to a new state depending on the input. Notice that the new state may be identical to the current state.


Figure 1 shows an example of a two-state FA that can be used to add binary numbers. The FA starts in state 1 (no carry) and inputs a pair of bits. If the pair is 11, the FA outputs a 0 and switches to state 2 (carry), where the next pair of bits is input and is added to a carry bit of 1. Table F.1 shows the individual steps in adding the two 6-bit numbers 14and 23 (these are actually 5-bit numbers, but the simple design of this FA requires that an extra 0 bit be appended to the left end of every number). Since adding numbers is done from right to left, the six columns of the table should also be read in this direction.



Definition: A DFA is 5-tuple or quintuple M = (Q, ∑, δ, q0, A) where
                 Q is non-empty, finite set of states.
                 ∑ is non-empty, finite set of input alphabets.
                 δ is transition function, which is a mapping from Q ∑ to Q.
                q0 ∈ Q is the start state.
                A  Q is set of accepting or final states.
Note: For each input symbol a, from a given state there is exactly one transition (there can be no transitions from a state also) and we are sure (or can determine) to which state the machine enters. So, the machine is called Deterministic machine. Since it has finite number of states the machine is called Deterministic finite machine or Deterministic Finite Automaton or Finite State Machine (FSM).
The language accepted by DFA is
      L(M) = { w | w  ∑* and δ*(q0, w) ∈ A }
The non-acceptance of the string w by an FA or DFA can be defined in formal notation as:
       = { w | w  ∑* and δ*(q0, w)  A }




So, the DFA which accepts strings of a’s and b’s starting with the string ab is given by M 
= (Q, ∑ , δ, q0, A) where


     Q = {q0, q1, q2, q3}
     ∑ = {a, b}
     q0­ is the start state
     A = {q2}.
     δ is shown by the transition table 3.
Example 2: Draw a DFA to accept string of 0’s and 1’s ending with the string 011.


Example 3: Obtain a DFA to accept strings of a’s and b’s having a sub string aa


Example 4: Obtain a DFA to accept strings of a’s and b’s except those containing the sub string aab.


Example 5: Obtain DFAs to accept strings of a’s and b’s having exactly one a. atleast one a and others


Example 6: Obtain a DFA to accept strings of a’s and b’s having even number of a’s and b’s
The machine to accept even number of a’s and b’s



Definition: Let M = (Q, ∑, δ, q0, A) be a DFA. The language L is regular if there exists a machine M such that L = L(M).
Applications of Finite Automata
  • String matching/processing
  • Compiler Construction
The various compilers such as C/C++, Pascal, FORTRAN or any other compiler is designed using the finite automata. The DFAs are extensively used in the building the various phases of compiler such as
  • Lexical analysis (To identify the tokens, identifiers, to strip of the comments etc.)
  • Syntax analysis (To check the syntax of each statement or control statement used in the program)
  • Code optimization (To remove the un wanted code)
  • Code generation (To generate the machine code)
The concept of finite automata is used in wide applications. It is not possible to list all the applications, as there is infinite number of applications. This section lists some applications:
  1. Large natural vocabularies can be described using finite automaton which includes the applications such as spelling checkers and advisers, multi-language dictionaries, to indent the documents, in calculators to evaluate complex expressions based on the priority of an operator etc. to name a few. Any editor that we use uses finite automaton for implementation.
  2. Finite automaton is very useful in recognizing difficult problems i.e., sometimes it is very essential to solve an un-decidable problem. Even though there is no general solution exists for the specified problem, using theory of computation, we can find the approximate solutions.
  3. Finite automaton is very useful in hardware design such as circuit verification, in design of the hardware board (mother board or any other hardware unit), automatic traffic signals, radio controlled toys, elevators, automatic sensors, remote sensing or controller etc.
  4. In game theory and games wherein we use some control characters to fight against a monster, economics, computer graphics, linguistics etc., finite automaton plays a very important role. 

Security


15.1 The Security Problem

  • Chapter 14 ( Protection ) dealt with protecting files and other resources from accidental misuse by cooperating users sharing a system, generally using the computer for normal purposes.
  • This chapter ( Security ) deals with protecting systems from deliberate attacks, either internal or external, from individuals intentionally attempting to steal information, damage information, or otherwise deliberately wreak havoc in some manner.
  • Some of the most common types of violations include:
    • Breach of Confidentiality - Theft of private or confidential information, such as credit-card numbers, trade secrets, patents, secret formulas, manufacturing procedures, medical information, financial information, etc.
    • Breach of Integrity - Unauthorized modification of data, which may have serious indirect consequences. For example a popular game or other program's source code could be modified to open up security holes on users systems before being released to the public.
    • Breach of Availability - Unauthorized destruction of data, often just for the "fun" of causing havoc and for bragging rites. Vandalism of web sites is a common form of this violation.
    • Theft of Service - Unauthorized use of resources, such as theft of CPU cycles, installation of daemons running an unauthorized file server, or tapping into the target's telephone or networking services.
    • Denial of Service, DOS - Preventing legitimate users from using the system, often by overloading and overwhelming the system with an excess of requests for service.
  • One common attack is masquerading, in which the attacker pretends to be a trusted third party. A variation of this is the man-in-the-middle, in which the attacker masquerades as both ends of the conversation to two targets.
  • replay attack involves repeating a valid transmission. Sometimes this can be the entire attack, ( such as repeating a request for a money transfer ), or other times the content of the original message is replaced with malicious content.
  • There are four levels at which a system must be protected:
    1. Physical - The easiest way to steal data is to pocket the backup tapes. Also, access to the root console will often give the user special privileges, such as rebooting the system as root from removable media. Even general access to terminals in a computer room offers some opportunities for an attacker, although today's modern high-speed networking environment provides more and more opportunities for remote attacks.
    2. Human - There is some concern that the humans who are allowed access to a system be trustworthy, and that they cannot be coerced into breaching security. However more and more attacks today are made via social engineering, which basically means fooling trustworthy people into accidentally breaching security.
      • Phishing involves sending an innocent-looking e-mail or web site designed to fool people into revealing confidential information. E.g. spam e-mails pretending to be from e-Bay, PayPal, or any of a number of banks or credit-card companies.
      • Dumpster Diving involves searching the trash or other locations for passwords that are written down. ( Note: Passwords that are too hard to remember, or which must be changed frequently are more likely to be written down somewhere close to the user's station. )
      • Password Cracking involves divining users passwords, either by watching them type in their passwords, knowing something about them like their pet's names, or simply trying all words in common dictionaries. ( Note: "Good" passwords should involve a minimum number of characters, include non-alphabetical characters, and not appear in any dictionary ( in any language ), and should be changed frequently. Note also that it is proper etiquette to look away from the keyboard while someone else is entering their password. )
    3. Operating System - The OS must protect itself from security breaches, such as runaway processes ( denial of service ), memory-access violations, stack overflow violations, the launching of programs with excessive privileges, and many others.
    4. Network - As network communications become ever more important and pervasive in modern computing environments, it becomes ever more important to protect this area of the system. ( Both protecting the network itself from attack, and protecting the local system from attacks coming in through the network. ) This is a growing area of concern as wireless communications and portable devices become more and more prevalent.

15.2 Program Threats

  • There are many common threats to modern systems. Only a few are discussed here.

15.2.1 Trojan Horse

  • Trojan Horse is a program that secretly performs some maliciousness in addition to its visible actions.
  • Some Trojan horses are deliberately written as such, and others are the result of legitimate programs that have become infected with viruses, ( see below. )
  • One dangerous opening for Trojan horses is long search paths, and in particular paths which include the current directory ( "." ) as part of the path. If a dangerous program having the same name as a legitimate program ( or a common mis-spelling, such as "sl" instead of "ls" ) is placed anywhere on the path, then an unsuspecting user may be fooled into running the wrong program by mistake.
  • Another classic Trojan Horse is a login emulator, which records a users account name and password, issues a "password incorrect" message, and then logs off the system. The user then tries again ( with a proper login prompt ), logs in successfully, and doesn't realize that their information has been stolen.
  • ( Special Note to UIC students: Beware that someone has registered the domain name of uic.EU ( without the "D" ), and is running an ssh server which will accept requests to any machine in the domain, and gladly accept your login and password information, without, of course, actually logging you in. Access to this site is blocked from campus, but you are on your own off campus. )
  • Two solutions to Trojan Horses are to have the system print usage statistics on logouts, and to require the typing of non-trappable key sequences such as Control-Alt-Delete in order to log in. ( This is why modern Windows systems require the Control-Alt-Delete sequence to commence logging in, which cannot be emulated or caught by ordinary programs. I.e. that key sequence always transfers control over to the operating system. )
  • Spyware is a version of a Trojan Horse that is often included in "free" software downloaded off the Internet. Spyware programs generate pop-up browser windows, and may also accumulate information about the user and deliver it to some central site. ( This is an example of covert channels, in which surreptitious communications occur. ) Another common task of spyware is to send out spam e-mail messages, which then purportedly come from the infected user.

15.2.2 Trap Door

  • A Trap Door is when a designer or a programmer ( or hacker ) deliberately inserts a security hole that they can use later to access the system.
  • Because of the possibility of trap doors, once a system has been in an untrustworthy state, that system can never be trusted again. Even the backup tapes may contain a copy of some cleverly hidden back door.
  • A clever trap door could be inserted into a compiler, so that any programs compiled with that compiler would contain a security hole. This is especially dangerous, because inspection of the code being compiled would not reveal any problems.

15.2.3 Logic Bomb

  • A Logic Bomb is code that is not designed to cause havoc all the time, but only when a certain set of circumstances occurs, such as when a particular date or time is reached or some other noticeable event.
  • A classic example is the Dead-Man Switch, which is designed to check whether a certain person ( e.g. the author ) is logging in every day, and if they don't log in for a long time ( presumably because they've been fired ), then the logic bomb goes off and either opens up security holes or causes other problems.

15.2.4 Stack and Buffer Overflow

  • This is a classic method of attack, which exploits bugs in system code that allows buffers to overflow. Consider what happens in the following code, for example, if argv[ 1 ] exceeds 256 characters:
    • The strcpy command will overflow the buffer, overwriting adjacent areas of memory.
    • ( The problem could be avoided using strncpy, with a limit of 255 characters copied plus room for the null byte. )
  • So how does overflowing the buffer cause a security breach? Well the first step is to understand the structure of the stack in memory:
    • The "bottom" of the stack is actually at a high memory address, and the stack grows towards lower addresses.
    • However the address of an array is the lowest address of the array, and higher array elements extend to higher addresses. ( I.e. an array "grows" towards the bottom of the stack.
    • In particular, writing past the top of an array, as occurs when a buffer overflows with too much input data, can eventually overwrite the return address, effectively changing where the program jumps to when it returns.
  • Now that we know how to change where the program returns to by overflowing the buffer, the second step is to insert some nefarious code, and then get the program to jump to our inserted code.
  • Our only opportunity to enter code is via the input into the buffer, which means there isn't room for very much. One of the simplest and most obvious approaches is to insert the code for "exec( /bin/sh )". To do this requires compiling a program that contains this instruction, and then using an assembler or debugging tool to extract the minimum extent that includes the necessary instructions.
  • The bad code is then padded with as many extra bytes as are needed to overflow the buffer to the correct extent, and the address of the buffer inserted into the return address location. ( Note, however, that neither the bad code or the padding can contain null bytes, which would terminate the strcpy. )
  • The resulting block of information is provided as "input", copied into the buffer by the original program, and then the return statement causes control to jump to the location of the buffer and start executing the code to launch a shell.
  • Unfortunately famous hacks such as the buffer overflow attack are well published and well known, and it doesn't take a lot of skill to follow the instructions and start attacking lots of systems until the law of averages eventually works out. ( Script Kiddies are those hackers with only rudimentary skills of their own but the ability to copy the efforts of others. )
  • Fortunately modern hardware now includes a bit in the page tables to mark certain pages as non-executable. In this case the buffer-overflow attack would work up to a point, but as soon as it "returns" to an address in the data space and tries executing statements there, an exception would be thrown crashing the program.
  • ( More details about stack-overflow attacks are available on-line from http://www.insecure.org/stf/smashstack.txt )

15.2.5 Viruses

  • A virus is a fragment of code embedded in an otherwise legitimate program, designed to replicate itself ( by infecting other programs ), and ( eventually ) wreaking havoc.
  • Viruses are more likely to infect PCs than UNIX or other multi-user systems, because programs in the latter systems have limited authority to modify other programs or to access critical system structures ( such as the boot block. )
  • Viruses are delivered to systems in a virus dropper, usually some form of a Trojan Horse, and usually via e-mail or unsafe downloads.
  • Viruses take many forms ( see below. ) Figure 15.5 shows typical operation of a boot sector virus:
  • Some of the forms of viruses include:
    • File - A file virus attaches itself to an executable file, causing it to run the virus code first and then jump to the start of the original program. These viruses are termed parasitic, because they do not leave any new files on the system, and the original program is still fully functional.
    • Boot - A boot virus occupies the boot sector, and runs before the OS is loaded. These are also known as memory viruses, because in operation they reside in memory, and do not appear in the file system.
    • Macro - These viruses exist as a macro ( script ) that are run automatically by certain macro-capable programs such as MS Word or Excel. These viruses can exist in word processing documents or spreadsheet files.
    • Source code viruses look for source code and infect it in order to spread.
    • Polymorphic viruses change every time they spread - Not their underlying functionality, but just their signature, by which virus checkers recognize them.
    • Encrypted viruses travel in encrypted form to escape detection. In practice they are self-decrypting, which then allows them to infect other files.
    • Stealth viruses try to avoid detection by modifying parts of the system that could be used to detect it. For example the read( ) system call could be modified so that if an infected file is read the infected part gets skipped and the reader would see the original unadulterated file.
    • Tunneling viruses attempt to avoid detection by inserting themselves into the interrupt handler chain, or into device drivers.
    • Multipartite viruses attack multiple parts of the system, such as files, boot sector, and memory.
    • Armored viruses are coded to make them hard for anti-virus researchers to decode and understand. In addition many files associated with viruses are hidden, protected, or given innocuous looking names such as "...".
  • In 2004 a virus exploited three bugs in Microsoft products to infect hundreds of Windows servers ( including many trusted sites ) running Microsoft Internet Information Server, which in turn infected any Microsoft Internet Explorer web browser that visited any of the infected server sites. One of the back-door programs it installed was a keystroke logger, which records users keystrokes, including passwords and other sensitive information.
  • There is some debate in the computing community as to whether a monoculture, in which nearly all systems run the same hardware, operating system, and applications, increases the threat of viruses and the potential for harm caused by them.

15.3 System and Network Threats

  • Most of the threats described above are termed program threats, because they attack specific programs or are carried and distributed in programs. The threats in this section attack the operating system or the network itself, or leverage those systems to launch their attacks.

15.3.1 Worms

  • worm is a process that uses the fork / spawn process to make copies of itself in order to wreak havoc on a system. Worms consume system resources, often blocking out other, legitimate processes. Worms that propagate over networks can be especially problematic, as they can tie up vast amounts of network resources and bring down large-scale systems.
  • One of the most well-known worms was launched by Robert Morris, a graduate student at Cornell, in November 1988. Targeting Sun and VAX computers running BSD UNIX version 4, the worm spanned the Internet in a matter of a few hours, and consumed enough resources to bring down many systems.
  • This worm consisted of two parts:
    1. A small program called a grappling hook, which was deposited on the target system through one of three vulnerabilities, and
    2. The main worm program, which was transferred onto the target system and launched by the grappling hook program.
  • The three vulnerabilities exploited by the Morris Internet worm were as follows:
    1. rsh ( remote shell ) is a utility that was in common use at that time for accessing remote systems without having to provide a password. If a user had an account on two different computers ( with the same account name on both systems ), then the system could be configured to allow that user to remotely connect from one system to the other without having to provide a password. Many systems were configured so that any user ( except root ) on system A could access the same account on system B without providing a password.
    2. finger is a utility that allows one to remotely query a user database, to find the true name and other information for a given account name on a given system. For example "finger joeUser@somemachine.edu" would access the finger daemon at somemachine.edu and return information regarding joeUser. Unfortunately the finger daemon ( which ran with system privileges ) had the buffer overflow problem, so by sending a special 536-character user name the worm was able to fork a shell on the remote system running with root privileges.
    3. sendmail is a routine for sending and forwarding mail that also included a debugging option for verifying and testing the system. The debug feature was convenient for administrators, and was often left turned on. The Morris worm exploited the debugger to mail and execute a copy of the grappling hook program on the remote system.
  • Once in place, the worm undertook systematic attacks to discover user passwords:
    1. First it would check for accounts for which the account name and the password were the same, such as "guest", "guest".
    2. Then it would try an internal dictionary of 432 favorite password choices. ( I'm sure "password", "pass", and blank passwords were all on the list. )
    3. Finally it would try every word in the standard UNIX on-line dictionary to try and break into user accounts.
  • Once it had gotten access to one or more user accounts, then it would attempt to use those accounts to rsh to other systems, and continue the process.
  • With each new access the worm would check for already running copies of itself, and 6 out of 7 times if it found one it would stop. ( The seventh was to prevent the worm from being stopped by fake copies. )
  • Fortunately the same rapid network connectivity that allowed the worm to propagate so quickly also quickly led to its demise - Within 24 hours remedies for stopping the worm propagated through the Internet from administrator to administrator, and the worm was quickly shut down.
  • There is some debate about whether Mr. Morris's actions were a harmless prank or research project that got out of hand or a deliberate and malicious attack on the Internet. However the court system convicted him, and penalized him heavy fines and court costs.
  • There have since been many other worm attacks, including the W32.Sobig.F@mm attack which infected hundreds of thousands of computers and an estimated 1 in 17 e-mails in August 2003. This worm made detection difficult by varying the subject line of the infection-carrying mail message, including "Thank You!", "Your details", and "Re: Approved".

15.3.2 Port Scanning

  • Port Scanning is technically not an attack, but rather a search for vulnerabilities to attack. The basic idea is to systematically attempt to connect to every known ( or common or possible ) network port on some remote machine, and to attempt to make contact. Once it is determined that a particular computer is listening to a particular port, then the next step is to determine what daemon is listening, and whether or not it is a version containing a known security flaw that can be exploited.
  • Because port scanning is easily detected and traced, it is usually launched from zombie systems, i.e. previously hacked systems that are being used without the knowledge or permission of their rightful owner. For this reason it is important to protect "innocuous" systems and accounts as well as those that contain sensitive information or special privileges.
  • There are also port scanners available that administrators can use to check their own systems, which report any weaknesses found but which do not exploit the weaknesses or cause any problems. Two such systems are nmap http://www.insecure.org/nmap ) and nessus ( http://www.nessus.org ). The former identifies what OS is found, what firewalls are in place, and what services are listening to what ports. The latter also contains a database of known security holes, and identifies any that it finds.

15.3.3 Denial of Service

  • Denial of Service ( DOS ) attacks do not attempt to actually access or damage systems, but merely to clog them up so badly that they cannot be used for any useful work. Tight loops that repeatedly request system services are an obvious form of this attack.
  • DOS attacks can also involve social engineering, such as the Internet chain letters that say "send this immediately to 10 of your friends, and then go to a certain URL", which clogs up not only the Internet mail system but also the web server to which everyone is directed. ( Note: Sending a "reply all" to such a message notifying everyone that it was just a hoax also clogs up the Internet mail service, just as effectively as if you had forwarded the thing. )
  • Security systems that lock accounts after a certain number of failed login attempts are subject to DOS attacks which repeatedly attempt logins to all accounts with invalid passwords strictly in order to lock up all accounts.
  • Sometimes DOS is not the result of deliberate maliciousness. Consider for example:
    • A web site that sees a huge volume of hits as a result of a successful advertising campaign.
    • CNN.com occasionally gets overwhelmed on big news days, such as Sept 11, 2001.
    • CS students given their first programming assignment involving fork( ) often quickly fill up process tables or otherwise completely consume system resources. :-)
    • ( Please use ipcs and ipcrm when working on the inter-process communications assignment ! )

15.4 Cryptography as a Security Tool

  • Within a given computer the transmittal of messages is safe, reliable and secure, because the OS knows exactly where each one is coming from and where it is going.
  • On a network, however, things aren't so straightforward - A rogue computer ( or e-mail sender ) may spoof their identity, and outgoing packets are delivered to a lot of other computers besides their ( intended ) final destination, which brings up two big questions of security:
    • Trust - How can the system be sure that the messages received are really from the source that they say they are, and can that source be trusted?
    • Confidentiality - How can one ensure that the messages one is sending are received only by the intended recipient?
  • Cryptography can help with both of these problems, through a system of secrets and keys. In the former case, the key is held by the sender, so that the recipient knows that only the authentic author could have sent the message; In the latter, the key is held by the recipient, so that only the intended recipient can receive the message accurately.
  • Keys are designed so that they cannot be divined from any public information, and must be guarded carefully. ( Assymmetric encryption involve both a public and a private key. )

15.4.1 Encryption

  • The basic idea of encryption is to encode a message so that only the desired recipient can decode and read it. Encryption has been around since before the days of Caesar, and is an entire field of study in itself. Only some of the more significant computer encryption schemes will be covered here.
  • The basic process of encryption is shown in Figure 15.7, and will form the basis of most of our discussion on encryption. The steps in the procedure and some of the key terminology are as follows:
    1. The sender first creates a message, m in plaintext.
    2. The message is then entered into an encryption algorithm, E, along with the encryption key, Ke.
    3. The encryption algorithm generates the ciphertext, c, = E(Ke)(m). For any key k, E(k) is an algorithm for generating ciphertext from a message, and both E and E(k) should be efficiently computable functions.
    4. The ciphertext can then be sent over an unsecure network, where it may be received by attackers.
    5. The recipient enters the ciphertext into a decryption algorithm, D, along with the decryption key, Kd.
    6. The decryption algorithm re-generates the plaintext message, m, = D(Kd)(c). For any key k, D(k) is an algorithm for generating a clear text message from a ciphertext, and both D and D(k) should be efficiently computable functions.
    7. The algorithms described here must have this important property: Given a ciphertext c, a computer can only compute a message m such that c = E(k)(m) if it possesses D(k). ( In other words, the messages can't be decoded unless you have the decryption algorithm and the decryption key. )

15.4.1.1 Symmetric Encryption

  • With symmetric encryption the same key is used for both encryption and decryption, and must be safely guarded. There are a number of well-known symmetric encryption algorithms that have been used for computer security:
    • The Data-Encryption Standard, DES, developed by the National Institute of Standards, NIST, has been a standard civilian encryption standard for over 20 years. Messages are broken down into 64-bit chunks, each of which are encrypted using a 56-bit key through a series of substitutions and transformations. Some of the transformations are hidden ( black boxes ), and are classified by the U.S. government.
    • DES is known as a block cipher, because it works on blocks of data at a time. Unfortunately this is a vulnerability if the same key is used for an extended amount of data. Therefore an enhancement is to not only encrypt each block, but also to XOR it with the previous block, in a technique known as cipher-block chaining.
    • As modern computers become faster and faster, the security of DES has decreased, to where it is now considered insecure because its keys can be exhaustively searched within a reasonable amount of computer time. An enhancement called triple DES encrypts the data three times using three separate keys ( actually two encryptions and one decryption ) for an effective key length of 168 bits. Triple DES is in widespread use today.
    • The Advanced Encryption Standard, AES, developed by NIST in 2001 to replace DES uses key lengths of 128, 192, or 256 bits, and encrypts in blocks of 128 bits using 10 to 14 rounds of transformations on a matrix formed from the block.
    • The twofish algorithm, uses variable key lengths up to 256 bits and works on 128 bit blocks.
    • RC5 can vary in key length, block size, and the number of transformations, and runs on a wide variety of CPUs using only basic computations.
    • RC4 is a stream cipher, meaning it acts on a stream of data rather than blocks. The key is used to seed a pseudo-random number generator, which generates a keystream of keys. RC4 is used in WEP, but has been found to be breakable in a reasonable amount of computer time.

15.4.1.2 Asymmetric Encryption

  • With asymmetric encryption, the decryption key, Kd, is not the same as the encryption key, Ke, and more importantly cannot be derived from it, which means the encryption key can be made publicly available, and only the decryption key needs to be kept secret. ( or vice-versa, depending on the application. )
  • One of the most widely used asymmetric encryption algorithms is RSA, named after its developers - Rivest, Shamir, and Adleman.
  • RSA is based on two large prime numbers, and q, ( on the order of 512 bits each ), and their product N.
    • Ke and Kd must satisfy the relationship:
      ( Ke * Kd ) % [ ( p - 1 ) * ( q - 1 ) ] = = 1
    • The encryption algorithm is:
      c = E(Ke)(m) = m^Ke % N
    • The decryption algorithm is:
      m = D(Kd)(c) = c^Kd % N
  • An example using small numbers:
    • p = 7
    • q = 13
    • N = 7 * 13 = 91
    • ( p - 1 ) * ( q - 1 ) = 6 * 12 = 72
    • Select Ke < 72 and relatively prime to 72, say 5
    • Now select Kd, such that ( Ke * Kd ) % 72 = = 1, say 29
    • The public key is now ( 5, 91 ) and the private key is ( 29, 91 )
    • Let the message, m = 42
    • Encrypt: c = 42^5 % 91 = 35
    • Decrypt: m = 35^29 % 91 = 42
  • Note that asymmetric encryption is much more computationally expensive than symmetric encryption, and as such it is not normally used for large transmissions. Asymmetric encryption is suitable for small messages, authentication, and key distribution, as covered in the following sections.

15.4.1.3 Authentication

  • Authentication involves verifying the identity of the entity who transmitted a message.
  • For example, if D(Kd)(c) produces a valid message, then we know the sender was in possession of E(Ke).
  • This form of authentication can also be used to verify that a message has not been modified
  • Authentication revolves around two functions, used for signatures ( or signing ), and verification:
    • A signing function, S(Ks) that produces an authenticator, A, from any given message m.
    • A Verification function, V(Kv,m,A) that produces a value of "true" if A was created from m, and "false" otherwise.
    • Obviously S and V must both be computationally efficient.
    • More importantly, it must not be possible to generate a valid authenticator, A, without having possession of S(Ks).
    • Furthermore, it must not be possible to divine S(Ks) from the combination of ( m and A ), since both are sent visibly across networks.
  • Understanding authenticators begins with an understanding of hash functions, which is the first step:
    • Hash functions, H(m) generate a small fixed-size block of data known as a message digest, or hash value from any given input data.
    • For authentication purposes, the hash function must be collision resistant on m. That is it should not be reasonably possible to find an alternate message m' such that H(m') = H(m).
    • Popular hash functions are MD5, which generates a 128-bit message digest, and SHA-1, which generates a 160-bit digest.
  • Message digests are useful for detecting ( accidentally ) changed messages, but are not useful as authenticators, because if the hash function is known, then someone could easily change the message and then generate a new hash value for the modified message. Therefore authenticators take things one step further by encrypting the message digest.
  • message-authentication code, MAC, uses symmetric encryption and decryption of the message digest, which means that anyone capable of verifying an incoming message could also generate a new message.
  • An asymmetric approach is the digital-signature algorithm, which produces authenticators called digital signatures. In this case Ks and Kv are separate, Kv is the public key, and it is not practical to determine S(Ks) from public information. In practice the sender of a message signs it ( produces a digital signature using S(Ks) ), and the receiver uses V(Kv) to verify that it did indeed come from a trusted source, and that it has not been modified.
  • There are three good reasons for having separate algorithms for encryption of messages and authentication of messages:
    1. Authentication algorithms typically require fewer calculations, making verification a faster operation than encryption.
    2. Authenticators are almost always smaller than the messages, improving space efficiency. (?)
    3. Sometimes we want authentication only, and not confidentiality, such as when a vendor issues a new software patch.
  • Another use of authentication is non-repudiation, in which a person filling out an electronic form cannot deny that they were the ones who did so.

15.4.1.4 Key Distribution

  • Key distribution with symmetric cryptography is a major problem, because all keys must be kept secret, and they obviously can't be transmitted over unsecure channels. One option is to send them out-of-band, say via paper or a confidential conversation.
  • Another problem with symmetric keys, is that a separate key must be maintained and used for each correspondent with whom one wishes to exchange confidential information.
  • Asymmetric encryption solves some of these problems, because the public key can be freely transmitted through any channel, and the private key doesn't need to be transmitted anywhere. Recipients only need to maintain one private key for all incoming messages, though senders must maintain a separate public key for each recipient to which they might wish to send a message. Fortunately the public keys are not confidential, so this key-ring can be easily stored and managed.
  • Unfortunately there are still some security concerns regarding the public keys used in asymmetric encryption. Consider for example the following man-in-the-middle attack involving phony public keys:
  • One solution to the above problem involves digital certificates, which are public keys that have been digitally signed by a trusted third party. But wait a minute - How do we trust that third party, and how do we know they are really who they say they are? Certain certificate authorities have their public keys included within web browsers and other certificate consumers before they are distributed. These certificate authorities can then vouch for other trusted entities and so on in a web of trust, as explained more fully in section 15.4.3.

15.4.2 Implementation of Cryptography

  • Network communications are implemented in multiple layers - Physical, Data Link, Network, Transport, and Application being the most common breakdown.
  • Encryption and security can be implemented at any layer in the stack, with pros and cons to each choice:
    • Because packets at lower levels contain the contents of higher layers, encryption at lower layers automatically encrypts higher layer information at the same time.
    • However security and authorization may be important to higher levels independent of the underlying transport mechanism or route taken.
  • At the network layer the most common standard is IPSec, a secure form of the IP layer, which is used to set up Virtual Private Networks, VPNs.
  • At the transport layer the most common implementation is SSL, described below.

15.4.3 An Example: SSL

  • SSL (  Secure Sockets Layer ) 3.0 was first developed by Netscape, and has now evolved into the industry-standard TLS protocol. It is used by web browsers to communicate securely with web servers, making it perhaps the most widely used security protocol on the Internet today.
  • SSL is quite complex with many variations, only a simple case of which is shown here.
  • The heart of SSL is session keys, which are used once for symmetric encryption and then discarded, requiring the generation of new keys for each new session. The big challenge is how to safely create such keys while avoiding man-in-the-middle and replay attacks.
  • Prior to commencing the transaction, the server obtains a certificate from a certification authority, CA, containing:
    • Server attributes such as unique and common names.
    • Identity of the public encryption algorithm, E( ), for the server.
    • The public key, k_e for the server.
    • The validity interval within which the certificate is valid.
    • A digital signature on the above issued by the CA:
      • a = S(K_CA )( ( attrs, E(k_e), interval )
  • In addition, the client will have obtained a public verification algorithm, V( K_CA ), for the certifying authority. Today's modern browsers include these built-in by the browser vendor for a number of trusted certificate authorities.
  • The procedure for establishing secure communications is as follows:
    1. The client, c, connects to the server, s, and sends a random 28-byte number, n_c.
    2. The server replies with its own random value, n_s, along with its certificate of authority.
    3. The client uses its verification algorithm to confirm the identity of the sender, and if all checks out, then the client generates a 46 byte random premaster secret, pms, and sends an encrypted version of it as cpms = E(k_s)(pms)
    4. The server recovers pms as D(k_s)(cpms).
    5. Now both the client and the server can compute a shared 48-byte master secret, ms, = f( pms, n_s, n_c )
    6. Next, both client and server generate the following from ms:
      • Symmetric encryption keys k_sc_crypt and k_cs_crypt for encrypting messages from the server to the client and vice-versa respectively.
      • MAC generation keys k_sc_mac and k_cs_mac for generating authenticators on messages from server to client and client to server respectively.
    7. To send a message to the server, the client sends:
      • c = E(k_cs_crypt)(m, S(k_cs_mac) )( m ) ) )
    8. Upon receiving c, the server recovers:
      • (m,a) = D(k_cs_crypt)(c)
      • and accepts it if V(k_sc_mac)(m,a) is true.
  • This approach enables both the server and client to verify the authenticity of every incoming message, and to ensure that outgoing messages are only readable by the process that originally participated in the key generation.
  • SSL is the basis of many secure protocols,including Virtual Private Networks, VPNs, in which private data is distributed over the insecure public internet structure in an encrypted fashion that emulates a privately owned network.

15.5 User Authentication

  • A lot of chapter 14, Protection, dealt with making sure that only certain users were allowed to perform certain tasks, i.e. that a users privileges were dependent on his or her identity. But how does one verify that identity to begin with?

15.5.1 Passwords

  • Passwords are the most common form of user authentication. If the user is in possession of the correct password, then they are considered to have identified themselves.
  • In theory separate passwords could be implemented for separate activities, such as reading this file, writing that file, etc. In practice most systems use one password to confirm user identity, and then authorization is based upon that identification. This is a result of the classic trade-off between security and convenience.

15.5.2 Password Vulnerabilities

  • Passwords can be guessed.
    • Intelligent guessing requires knowing something about the intended target in specific, or about people and commonly used passwords in general.
    • Brute-force guessing involves trying every word in the dictionary, or every valid combination of characters. For this reason good passwords should not be in any dictionary ( in any language ), should be reasonably lengthy, and should use the full range of allowable characters by including upper and lower case characters, numbers, and special symbols.
  • "Shoulder surfing" involves looking over people's shoulders while they are typing in their password.
    • Even if the lurker does not get the entire password, they may get enough clues to narrow it down, especially if they watch on repeated occasions.
    • Common courtesy dictates that you look away from the keyboard while someone is typing their password.
    • Passwords echoed as stars or dots still give clues, because an observer can determine how many characters are in the password. :-(
  • "Packet sniffing" involves putting a monitor on a network connection and reading data contained in those packets.
    • SSH encrypts all packets, reducing the effectiveness of packet sniffing.
    • However you should still never e-mail a password, particularly not with the word "password" in the same message or worse yet the subject header.
    • Beware of any system that transmits passwords in clear text. ( "Thank you for signing up for XYZ. Your new account and password information are shown below". ) You probably want to have a spare throw-away password to give these entities, instead of using the same high-security password that you use for banking or other confidential uses.
  • Long hard to remember passwords are often written down, particularly if they are used seldomly or must be changed frequently. Hence a security trade-off of passwords that are easily divined versus those that get written down. :-(
  • Passwords can be given away to friends or co-workers, destroying the integrity of the entire user-identification system.
  • Most systems have configurable parameters controlling password generation and what constitutes acceptable passwords.
    • They may be user chosen or machine generated.
    • They may have minimum and/or maximum length requirements.
    • They may need to be changed with a given frequency. ( In extreme cases for every session. )
    • A variable length history can prevent repeating passwords.
    • More or less stringent checks can be made against password dictionaries.

15.5.3 Encrypted Passwords

  • Modern systems do not store passwords in clear-text form, and hence there is no mechanism to look up an existing password.
  • Rather they are encrypted and stored in that form. When a user enters their password, that too is encrypted, and if the encrypted version match, then user authentication passes.
  • The encryption scheme was once considered safe enough that the encrypted versions were stored in the publicly readable file "/etc/passwd".
    • They always encrypted to a 13 character string, so an account could be disabled by putting a string of any other length into the password field.
    • Modern computers can try every possible password combination in a reasonably short time, so now the encrypted passwords are stored in files that are only readable by the super user. Any password-related programs run as setuid root to get access to these files. ( /etc/shadow )
    • A random seed is included as part of the password generation process, and stored as part of the encrypted password. This ensures that if two accounts have the same plain-text password that they will not have the same encrypted password. However cutting and pasting encrypted passwords from one account to another will give them the same plain-text passwords.

15.5.4 One-Time Passwords

  • One-time passwords resist shoulder surfing and other attacks where an observer is able to capture a password typed in by a user.
    • These are often based on a challenge and a response. Because the challenge is different each time, the old response will not be valid for future challenges.
      • For example, The user may be in possession of a secret function f( x ). The system challenges with some given value for x, and the user responds with f( x ), which the system can then verify. Since the challenger gives a different ( random ) x each time, the answer is constantly changing.
      • A variation uses a map ( e.g. a road map ) as the key. Today's question might be "On what corner is SEO located?", and tomorrow's question might be "How far is it from Navy Pier to Wrigley Field?" Obviously "Taylor and Morgan" would not be accepted as a valid answer for the second question!
    • Another option is to have some sort of electronic card with a series of constantly changing numbers, based on the current time. The user enters the current number on the card, which will only be valid for a few seconds. A two-factor authorization also requires a traditional password in addition to the number on the card, so others may not use it if it were ever lost or stolen.
    • A third variation is a code book, or one-time pad. In this scheme a long list of passwords is generated, and each one is crossed off and cancelled as it is used. Obviously it is important to keep the pad secure.

15.5.5 Biometrics

  • Biometrics involve a physical characteristic of the user that is not easily forged or duplicated and not likely to be identical between multiple users.
    • Fingerprint scanners are getting faster, more accurate, and more economical.
    • Palm readers can check thermal properties, finger length, etc.
    • Retinal scanners examine the back of the users' eyes.
    • Voiceprint analyzers distinguish particular voices.
    • Difficulties may arise in the event of colds, injuries, or other physiological changes.

15.6 Implementing Security Defenses

15.6.1 Security Policy

  • A security policy should be well thought-out, agreed upon, and contained in a living document that everyone adheres to and is updated as needed.
  • Examples of contents include how often port scans are run, password requirements, virus detectors, etc.

15.6.2 Vulnerability Assessment

  • Periodically examine the system to detect vulnerabilities.
    • Port scanning.
    • Check for bad passwords.
    • Look for suid programs.
    • Unauthorized programs in system directories.
    • Incorrect permission bits set.
    • Program checksums / digital signatures which have changed.
    • Unexpected or hidden network daemons.
    • New entries in startup scripts, shutdown scripts, cron tables, or other system scripts or configuration files.
    • New unauthorized accounts.
  • The government considers a system to be only as secure as its most far-reaching component. Any system connected to the Internet is inherently less secure than one that is in a sealed room with no external communications.
  • Some administrators advocate "security through obscurity", aiming to keep as much information about their systems hidden as possible, and not announcing any security concerns they come across. Others announce security concerns from the rooftops, under the theory that the hackers are going to find out anyway, and the only one kept in the dark by obscurity are honest administrators who need to get the word.

15.6.3 Intrusion Detection

  • Intrusion detection attempts to detect attacks, both successful and unsuccessful attempts. Different techniques vary along several axes:
    • The time that detection occurs, either during the attack or after the fact.
    • The types of information examined to detect the attack(s). Some attacks can only be detected by analyzing multiple sources of information.
    • The response to the attack, which may range from alerting an administrator to automatically stopping the attack ( e.g. killing an offending process ), to tracing back the attack in order to identify the attacker.
      • Another approach is to divert the attacker to a honeypot, on a honeynet. The idea behind a honeypot is a computer running normal services, but which no one uses to do any real work. Such a system should not see any network traffic under normal conditions, so any traffic going to or from such a system is by definition suspicious. Honeypots are normally kept on a honeynet protected by a reverse firewall, which will let potential attackers in to the honeypot, but will not allow any outgoing traffic. ( So that if the honeypot is compromised, the attacker cannot use it as a base of operations for attacking other systems. ) Honeypots are closely watched, and any suspicious activity carefully logged and investigated.
  • Intrusion Detection Systems, IDSs, raise the alarm when they detect an intrusion. Intrusion Detection and Prevention Systems, IDPs, act as filtering routers, shutting down suspicious traffic when it is detected.
  • There are two major approaches to detecting problems:
    • Signature-Based Detection scans network packets, system files, etc. looking for recognizable characteristics of known attacks, such as text strings for messages or the binary code for "exec /bin/sh". The problem with this is that it can only detect previously encountered problems for which the signature is known, requiring the frequent update of signature lists.
    • Anomaly Detection looks for "unusual" patterns of traffic or operation, such as unusually heavy load or an unusual number of logins late at night.
      • The benefit of this approach is that it can detect previously unknown attacks, so called zero-day attacks.
      • One problem with this method is characterizing what is "normal" for a given system. One approach is to benchmark the system, but if the attacker is already present when the benchmarks are made, then the "unusual" activity is recorded as "the norm."
      • Another problem is that not all changes in system performance are the result of security attacks. If the system is bogged down and really slow late on a Thursday night, does that mean that a hacker has gotten in and is using the system to send out SPAM, or does it simply mean that a CS 385 assignment is due on Friday? :-)
      • To be effective, anomaly detectors must have a very low false alarm ( false positive ) rate, lest the warnings get ignored, as well as a low false negative rate in which attacks are missed.

15.6.4 Virus Protection

  • Modern anti-virus programs are basically signature-based detection systems, which also have the ability ( in some cases ) of disinfecting the affected files and returning them back to their original condition.
  • Both viruses and anti-virus programs are rapidly evolving. For example viruses now commonly mutate every time they propagate, and so anti-virus programs look for families of related signatures rather than specific ones.
  • Some antivirus programs look for anomalies, such as an executable program being opened for writing ( other than by a compiler. )
  • Avoiding bootleg, free, and shared software can help reduce the chance of catching a virus, but even shrink-wrapped official software has on occasion been infected by disgruntled factory workers.
  • Some virus detectors will run suspicious programs in a sandbox, an isolated and secure area of the system which mimics the real system.
  • Rich Text Format, RTF, files cannot carry macros, and hence cannot carry Word macro viruses.
  • Known safe programs ( e.g. right after a fresh install or after a thorough examination ) can be digitally signed, and periodically the files can be re-verified against the stored digital signatures. ( Which should be kept secure, such as on off-line write-only medium. )

15.6.5 Auditing, Accounting, and Logging

  • Auditing, accounting, and logging records can also be used to detect anomalous behavior.
  • Some of the kinds of things that can be logged include authentication failures and successes, logins, running of suid or sgid programs, network accesses, system calls, etc. In extreme cases almost every keystroke and electron that moves can be logged for future analysis. ( Note that on the flip side, all this detailed logging can also be used to analyze system performance. The down side is that the logging also affects system performance ( negatively! ), and so a Heisenberg effect applies. )
  • "The Cuckoo's Egg" tells the story of how Cliff Stoll detected one of the early UNIX break ins when he noticed anomalies in the accounting records on a computer system being used by physics researchers.

Tripwire Filesystem ( New Sidebar )

  • The tripwire filesystem monitors files and directories for changes, on the assumption that most intrusions eventually result in some sort of undesired or unexpected file changes.
  • The tw.config file indicates what directories are to be monitored, as well as what properties of each file are to be recorded. ( E.g. one may choose to monitor permission and content changes, but not worry about read access times. )
  • When first run, the selected properties for all monitored files are recorded in a database. Hash codes are used to monitor file contents for changes.
  • Subsequent runs report any changes to the recorded data, including hash code changes, and any newly created or missing files in the monitored directories.
  • For full security it is necessary to also protect the tripwire system itself, most importantly the database of recorded file properties. This could be saved on some external or write-only location, but that makes it harder to change the database when legitimate changes are made.
  • It is difficult to monitor files that are supposed to change, such as log files. The best tripwire can do in this case is to watch for anomalies, such as a log file that shrinks in size.
  • Free and commercial versions are available at http://tripwire.org and http://tripwire.com.

15.7 Firewalling to Protect Systems and Networks

  • Firewalls are devices ( or sometimes software ) that sit on the border between two security domains and monitor/log activity between them, sometimes restricting the traffic that can pass between them based on certain criteria.
  • For example a firewall router may allow HTTP: requests to pass through to a web server inside a company domain while not allowing telnet, ssh, or other traffic to pass through.
  • A common architecture is to establish a de-militarized zone, DMZ, which sort of sits "between" the company domain and the outside world, as shown below. Company computers can reach either the DMZ or the outside world, but outside computers can only reach the DMZ. Perhaps most importantly, the DMZ cannot reach any of the other company computers, so even if the DMZ is breached, the attacker cannot get to the rest of the company network. ( In some cases the DMZ may have limited access to company computers, such as a web server on the DMZ that needs to query a database on one of the other company computers. )
  • Firewalls themselves need to be resistant to attacks, and unfortunately have several vulnerabilities:
    • Tunneling, which involves encapsulating forbidden traffic inside of packets that are allowed.
    • Denial of service attacks addressed at the firewall itself.
    • Spoofing, in which an unauthorized host sends packets to the firewall with the return address of an authorized host.
  • In addition to the common firewalls protecting a company internal network from the outside world, there are also some specialized forms of firewalls that have been recently developed:
    • personal firewall is a software layer that protects an individual computer. It may be a part of the operating system or a separate software package.
    • An application proxy firewall understands the protocols of a particular service and acts as a stand-in ( and relay ) for the particular service. For example, and SMTP proxy firewall would accept SMTP requests from the outside world, examine them for security concerns, and forward only the "safe" ones on to the real SMTP server behind the firewall.
    • XML firewalls examine XML packets only, and reject ill-formed packets. Similar firewalls exist for other specific protocols.
    • System call firewalls guard the boundary between user mode and system mode, and reject any system calls that violate security policies.

15.8 Computer-Security Classifications ( Optional )

  • No computer system can be 100% secure, and attempts to make it so can quickly make it unusable.
  • However one can establish a level of trust to which one feels "safe" using a given computer system for particular security needs.
  • The U.S. Department of Defense's "Trusted Computer System Evaluation Criteria" defines four broad levels of trust, and sub-levels in some cases:
    • Level D is the least trustworthy, and encompasses all systems that do not meet any of the more stringent criteria. DOS and Windows 3.1 fall into level D, which has no user identification or authorization, and anyone who sits down has full access and control over the machine.
    • Level C1 includes user identification and authorization, and some means of controlling what users are allowed to access what files. It is designed for use by a group of mostly cooperating users, and describes most common UNIX systems.
    • Level C2 adds individual-level control and monitoring. For example file access control can be allowed or denied on a per-individual basis, and the system administrator can monitor and log the activities of specific individuals. Another restriction is that when one user uses a system resource and then returns it back to the system, another user who uses the same resource later cannot read any of the information that the first user stored there. ( I.e. buffers, etc. are wiped out between users, and are not left full of old contents. ) Some special secure versions of UNIX have been certified for C2 security levels, such as SCO.
    • Level B adds sensitivity labels on each object in the system, such as "secret", "top secret", and "confidential". Individual users have different clearance levels, which controls which objects they are able to access. All human-readable documents are labeled at both the top and bottom with the sensitivity level of the file.
    • Level B2 extends sensitivity labels to all system resources, including devices. B2 also supports covert channels and the auditing of events that could exploit covert channels.
    • B3 allows creation of access-control lists that denote users NOT given access to specific objects.
    • Class A is the highest level of security. Architecturally it is the same as B3, but it is developed using formal methods which can be used to prove that the system meets all requirements and cannot have any possible bugs or other vulnerabilities. Systems in class A and higher may be developed by trusted personnel in secure facilities.
    • These classifications determine what a system can implement, but it is up to security policy to determine how they are implemented in practice. These systems and policies can be reviewed and certified by trusted organizations, such as the National Computer Security Center. Other standards may dictate physical protections and other issues.

15.9 An Example: Windows XP ( Optional )

  • Windows XP is a general purpose OS designed to support a wide variety of security features and methods. It is based on user accounts which can be grouped in any manner.
  • When a user logs on, a security access token is issued that includes the security ID for the user, security IDs for any groups of which the user is a member, and a list of any special privileges the user has, such as performing backups, shutting down the system, and changing the system clock.
  • Every process running on behalf of a user gets a copy of the users security token, which determines the privileges of that process running on behalf of that user.
  • Authentication is normally done via passwords, but the modular design of XP allows for alternative authentication such as retinal scans or fingerprint readers.
  • Windows XP includes built-in auditing that allows many common security threats to be monitored, such as successful and unsuccessful logins, logouts, attempts to write to executable files, and access to certain sensitive files.
  • Security attributes of objects are described by security descriptors, which include the ID of the owner, group ownership for POSIX subsystems only, a discretionary access-control list describing exactly what permissions each user or group on the system has for this particular object, and auditing control information.
  • The access control lists include for each specified user or group either AccessAllowed or AccessDenied for the following types of actions: ReadData,WriteData, AppendData, Execute, ReadAttributes, WriteAttributes, ReadExtendedAttribute, and WriteExtendedAttribute.
  • Container objects such as directories can logically contain other objects. When a new object is created in a container or copied into a container, by default it inherits the permissions of the new container.Noncontainer objects inherit no other permissions. If the permissions of the container are changed later, that does not affect the permissions of the contained objects.
  • Although Windows XP is capable of supporting a secure system, many of the security features are not enabled by default, resulting in a fair number of security breaches on XP systems. There are also a large number of system daemons and other programs that start automatically at startup, whether the system administrator has thought about them or not. ( My system currently has 54 processes running, most of which I did not deliberately start and which have short cryptic names which makes it hard to divine exactly what they do or why. Faced with this situation, most users and administrators will simply leave alone anything they don't understand. )