# 8.1.1: Sample Spaces and Probability (Exercises) - Mathematics

## SECTION 8.1 PROBLEM SET: SAMPLE SPACES AND PROBABILITY

In problems 1 - 6, write a sample space for the given experiment.

1) A die is rolled.
2) A penny and a nickel are tossed.
3) A die is rolled, and a coin is tossed.
4) Three coins are tossed.
5) Two dice are rolled.
6) A jar contains four marbles numbered 1, 2, 3, and 4. Two marbles are drawn.

In problems 7 - 12, one card is randomly selected from a deck. Find the following probabilities.

7) P(an ace)
8) P(a red card)
9) P(a club)
10) P(a face card)
11) P(a jack or a spade)
12) P(a jack and a spade)

For problems 13 - 16: A jar contains 6 red, 7 white, and 7 blue marbles. If one marble is chosen at random, find the following probabilities.

13) P(red)
14) P(white)
15) P(red or blue)
16) P(red and blue)

For problems 17 - 22: Consider a family of three children. Find the following probabilities.

17) P(two boys and a girl)
18) P(at least one boy)
19) P(children of both sexes)
20) P(at most one girl)
21) P(first and third children are male)
22) P(all children are of the same gender)

For problems 23 - 27: Two dice are rolled. Find the following probabilities.

23) P(the sum of the dice is 5)
24) P(the sum of the dice is 8)
25) P(the sum is 3 or 6)
26) P(the sum is more than 10)
27) P(the result is a double) (Hint: a double means that both dice show the same value)
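Problems 23 to 27 can be checked by brute-force enumeration of the 36 equally likely outcomes. A minimal Python sketch (the helper `prob` is illustrative, not part of the text):

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: 36 equally likely ordered pairs.
space = list(product(range(1, 7), repeat=2))

def prob(event):
    # Probability of an event = favourable outcomes / total outcomes.
    return Fraction(sum(1 for a, b in space if event(a, b)), len(space))

print(prob(lambda a, b: a + b == 5))        # 1/9
print(prob(lambda a, b: a + b == 8))        # 5/36
print(prob(lambda a, b: a + b in (3, 6)))   # 7/36
print(prob(lambda a, b: a + b > 10))        # 1/12
print(prob(lambda a, b: a == b))            # 1/6
```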

For problems 28-31: A jar contains four marbles numbered 1, 2, 3, and 4. Two marbles are drawn randomly WITHOUT REPLACEMENT. That means that after a marble is drawn it is NOT replaced in the jar before the second marble is selected. Find the following probabilities.

28) P(the sum of the numbers is 5)
29) P(the sum of the numbers is odd)
30) P(the sum of the numbers is 9)
31) P(one of the numbers is 3)
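For the without-replacement draws in problems 28 to 31, the 12 equally likely ordered outcomes can be listed with `itertools.permutations`; a quick check along these lines (helper names are illustrative):

```python
from fractions import Fraction
from itertools import permutations

# Ordered draws of two distinct marbles from {1, 2, 3, 4}: 12 equally likely outcomes.
draws = list(permutations([1, 2, 3, 4], 2))

def prob(event):
    return Fraction(sum(1 for a, b in draws if event(a, b)), len(draws))

print(prob(lambda a, b: a + b == 5))        # 1/3
print(prob(lambda a, b: (a + b) % 2 == 1))  # 2/3
print(prob(lambda a, b: a + b == 9))        # 0 (the largest possible sum is 7)
print(prob(lambda a, b: 3 in (a, b)))       # 1/2
```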

For problems 32-33: A jar contains four marbles numbered 1, 2, 3, and 4. Two marbles are drawn randomly WITH REPLACEMENT. That means that after a marble is drawn it is replaced in the jar before the second marble is selected. Find the following probabilities.

32) P(the sum of the numbers is 5)
33) P(the sum of the numbers is 2)


Rolling an ordinary six-sided die is a familiar example of a random experiment, an action for which all possible outcomes can be listed, but for which the actual outcome on any given trial of the experiment cannot be predicted with certainty. In such a situation we wish to assign to each outcome, such as rolling a two, a number, called the probability of the outcome, that indicates how likely it is that the outcome will occur. Similarly, we would like to assign a probability to any event, or collection of outcomes, such as rolling an even number, which indicates how likely it is that the event will occur if the experiment is performed. This section provides a framework for discussing probability problems, using the terms just mentioned.

### Definition

A random experiment is a mechanism that produces a definite outcome that cannot be predicted with certainty. The sample space associated with a random experiment is the set of all possible outcomes. An event is a subset of the sample space.

### Definition

An event E is said to occur on a particular trial of the experiment if the outcome observed is an element of the set E.

### Example 1

Construct a sample space for the experiment that consists of tossing a single coin.

The outcomes could be labeled h for heads and t for tails. Then the sample space is the set S = {h, t}.

### Example 2

Construct a sample space for the experiment that consists of rolling a single die. Find the events that correspond to the phrases “an even number is rolled” and “a number greater than two is rolled.”

The outcomes could be labeled according to the number of dots on the top face of the die. Then the sample space is the set S = {1, 2, 3, 4, 5, 6}.

The outcomes that are even are 2, 4, and 6, so the event that corresponds to the phrase “an even number is rolled” is the set {2, 4, 6}, which it is natural to denote by the letter E. We write E = {2, 4, 6}.

Similarly, the event that corresponds to the phrase “a number greater than two is rolled” is the set T = {3, 4, 5, 6}.

A graphical representation of a sample space and events is a Venn diagram, as shown in Figure 3.1 "Venn Diagrams for Two Sample Spaces" for Note 3.6 "Example 1" and Note 3.7 "Example 2". In general the sample space S is represented by a rectangle, outcomes by points within the rectangle, and events by ovals that enclose the outcomes that compose them.

Figure 3.1 Venn Diagrams for Two Sample Spaces

### Example 3

A random experiment consists of tossing two coins.

1. Construct a sample space for the situation that the coins are indistinguishable, such as two brand new pennies.
2. Construct a sample space for the situation that the coins are distinguishable, such as one a penny and the other a nickel.
1. After the coins are tossed one sees either two heads, which could be labeled 2h, two tails, which could be labeled 2t, or coins that differ, which could be labeled d. Thus a sample space is S = {2h, 2t, d}.
2. Since we can tell the coins apart, there are now two ways for the coins to differ: the penny heads and the nickel tails, or the penny tails and the nickel heads. We can label each outcome as a pair of letters, the first of which indicates how the penny landed and the second of which indicates how the nickel landed. A sample space is then S′ = {hh, ht, th, tt}.

A device that can be helpful in identifying all possible outcomes of a random experiment, particularly one that can be viewed as proceeding in stages, is what is called a tree diagram. It is described in the following example.

### Example 4

Construct a sample space that describes all three-child families according to the genders of the children with respect to birth order.

Two of the outcomes are “two boys then a girl,” which we might denote bbg, and “a girl then two boys,” which we would denote gbb. Clearly there are many outcomes, and when we try to list all of them it could be difficult to be sure that we have found them all unless we proceed systematically. The tree diagram shown in Figure 3.2 "Tree Diagram For Three-Child Families" gives a systematic approach.

Figure 3.2 Tree Diagram For Three-Child Families

The diagram was constructed as follows. There are two possibilities for the first child, boy or girl, so we draw two line segments coming out of a starting point, one ending in a b for “boy” and the other ending in a g for “girl.” For each of these two possibilities for the first child there are two possibilities for the second child, “boy” or “girl,” so from each of the b and g we draw two line segments, one segment ending in a b and one in a g. For each of the four ending points now in the diagram there are two possibilities for the third child, so we repeat the process once more.

The line segments are called branches of the tree. The right ending point of each branch is called a node. The nodes on the extreme right are the final nodes; to each one there corresponds an outcome, as shown in the figure.

From the tree it is easy to read off the eight outcomes of the experiment, so the sample space is, reading from the top to the bottom of the final nodes in the tree, S = {bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg}.
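The same enumeration can be generated mechanically, since `itertools.product` walks the tree in the same top-to-bottom order. A short Python sketch (the function name is illustrative):

```python
from itertools import product

def three_child_sample_space():
    # Each child is independently a boy (b) or a girl (g); three stages of the tree.
    return ["".join(outcome) for outcome in product("bg", repeat=3)]

print(three_child_sample_space())
# ['bbb', 'bbg', 'bgb', 'bgg', 'gbb', 'gbg', 'ggb', 'ggg']
```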


The course will introduce the basic notion of probability theory and its application to statistics. The focus will be on the discussion of applications.

The text that will be used is:

Jay L. Devore, Probability and Statistics, 8th or 9th ed., Thomson

The syllabus can be found here.

There will be two midterms.

The exercises listed are for HW collection. I will collect them every two weeks and grade 2 or 3 exercises among the ones assigned. In case of differences between the 9th and 8th editions of the book, I will indicate in square brackets the number relative to the 8th edition.

The final grade will be based on the following rules: 45% final, 40% midterms, 15% HW. Curving will be done on the final result.

The first midterm will be on Wednesday February 17 and the second on Wednesday March 30.

#### Topics covered

• Axioms, Interpretations and Properties of Probabilities
• Probability Distributions for Discrete Random Variables
• Examples of Discrete Random Variables
• Continuous Random Variables and Probability Density Functions
• Examples of Continuous Random Variables
• The central limit theorem
• Jointly Distributed Random Variables
• Population, Sample and Processes
• Point Estimation
• Statistical Intervals
• Test of Hypotheses
• Simple Linear Regression (time permitting)
• 1.1 (Population, Sample and Processes)
• 1.2 (Pictorial and Tabular methods in Descriptive Statistics)
• 1.3 (Measure of Location)
• 1.4 (Measure of Variability)

First HW due on January 25.

• 2.1 (Sample Spaces and Events)
• 2.2 (Axioms, Interpretations and Properties of Probabilities)
• 3.1 (Random Variables)
• 3.2 (Probability Distributions for Discrete Random Variables)
• 3.3 (Expected Values of Discrete Random Variable)
• (3.1) 6, 8, 10
• (3.2) 16, 23, 27
• (3.3) 29, 35, 39, 42
• 3.4 (The Binomial Probability Distribution)
• 3.5 (Hypergeometric Distribution)
• 3.6 (The Poisson Probability Distribution)

The first midterm will be on February 17. The midterm will cover the material up to section 3.6.

Preparation material for the first midterm:

Sixth and seventh weeks

• 4.1 (Continuous Random Variables and Probability Density Functions)
• 4.2 (Cumulative Distribution Functions and Expected Values)
• 4.3 (The Normal Distribution)
• 4.4 (The Exponential Distribution)
• 5.1 (Jointly Distributed Random Variables)
• 5.2 (Expected Values, Covariance and Correlation)
• 5.5 (The Distribution of a Linear Combination)
• 5.3 (Statistics and their distribution)
• 5.4 (The Distribution of the Sample Mean)
• (5.3) 37, 41, 42
• (5.4) 48, 49, 53, 56

Preparation material for the second midterm:

Fifth HW due on April 4. The exercises are those listed above for Chapter 5.

Here is the text of the Optional Makeup for the second midterm. Check it out and let me know if you have questions. It is due Monday the 11th in the evening. You can submit it in class, via email, or by sliding it under my office door. Your submission must contain the first page, signed.

## 3.3: Conditional Probability and Independent Events

### Basic

1. Q3.3.1 For two events \(A\) and \(B\), \(P(A)=0.73\), \(P(B)=0.48\), and \(P(A\cap B)=0.29\).
1. Find \(P(A\mid B)\).
2. Find \(P(B\mid A)\).
3. Determine whether or not \(A\) and \(B\) are independent.
1. Find \(P(A\mid B)\).
2. Find \(P(B\mid A)\).
3. Determine whether or not \(A\) and \(B\) are independent.
1. \(P(A\cap B)\).
2. Find \(P(A\mid B)\).
3. Find \(P(B\mid A)\).
1. \(P(A\cap B)\).
2. Find \(P(A\mid B)\).
3. Find \(P(B\mid A)\).
1. Find \(P(A\mid B)\).
2. Find \(P(B\mid A)\).
1. Find \(P(A\mid B)\).
2. Find \(P(B\mid A)\).
1. The probability that the roll is even.
2. The probability that the roll is even, given that it is not a two.
3. The probability that the roll is even, given that it is not a one.
1. The probability that the second toss is heads.
2. The probability that the second toss is heads, given that the first toss is heads.
3. The probability that the second toss is heads, given that at least one of the two tosses is heads.
1. The probability that the card drawn is red.
2. The probability that the card is red, given that it is not green.
3. The probability that the card is red, given that it is neither red nor yellow.
4. The probability that the card is red, given that it is not a four.
1. The probability that the card drawn is a two or a four.
2. The probability that the card is a two or a four, given that it is not a one.
3. The probability that the card is a two or a four, given that it is either a two or a three.
4. The probability that the card is a two or a four, given that it is red or green.
1. \(P(A)\), \(P(R)\), and \(P(A\cap R)\).
2. Based on the answer to (a), determine whether or not the events \(A\) and \(R\) are independent.
3. Based on the answer to (b), determine whether or not \(P(A\mid R)\) can be predicted without any computation. If so, make the prediction. In any case, compute \(P(A\mid R)\) using the Rule for Conditional Probability.
1. \(P(A)\), \(P(R)\), and \(P(A\cap R)\).
2. Based on the answer to (a), determine whether or not the events \(A\) and \(R\) are independent.
3. Based on the answer to (b), determine whether or not \(P(A\mid R)\) can be predicted without any computation. If so, make the prediction. In any case, compute \(P(A\mid R)\) using the Rule for Conditional Probability.
1. \(P(A\cap B)\).
2. \(P(A\cap B)\), with the extra information that \(A\) and \(B\) are independent.
3. \(P(A\cap B)\), with the extra information that \(A\) and \(B\) are mutually exclusive.
1. \(P(A\cap B)\).
2. \(P(A\cap B)\), with the extra information that \(A\) and \(B\) are independent.
3. \(P(A\cap B)\), with the extra information that \(A\) and \(B\) are mutually exclusive.
1. \(P(A\cap B\cap C)\).
2. \(P(A^c\cap B^c\cap C^c)\).
1. \(P(A\cap B\cap C)\).
2. \(P(A^c\cap B^c\cap C^c)\).

### Applications

#### Q3.3.17

The sample space that describes all three-child families according to the genders of the children with respect to birth order is \[S=\{bbb,\ bbg,\ bgb,\ bgg,\ gbb,\ gbg,\ ggb,\ ggg\}.\] In the experiment of selecting a three-child family at random, compute each of the following probabilities, assuming all outcomes are equally likely.

1. The probability that the family has at least two boys.
2. The probability that the family has at least two boys, given that not all of the children are girls.
3. The probability that at least one child is a boy.
4. The probability that at least one child is a boy, given that the first born is a girl.
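All four parts reduce to counting equally likely outcomes, restricting to the conditioning event first. A Python sketch for checking answers (the helper `prob` is illustrative):

```python
from fractions import Fraction
from itertools import product

# The eight equally likely birth-order outcomes for a three-child family.
space = ["".join(p) for p in product("bg", repeat=3)]

def prob(event, given=lambda w: True):
    # Conditional probability by counting: restrict to the conditioning event first.
    pool = [w for w in space if given(w)]
    return Fraction(sum(1 for w in pool if event(w)), len(pool))

print(prob(lambda w: w.count("b") >= 2))                              # 1/2
print(prob(lambda w: w.count("b") >= 2, given=lambda w: w != "ggg"))  # 4/7
print(prob(lambda w: "b" in w))                                       # 7/8
print(prob(lambda w: "b" in w, given=lambda w: w[0] == "g"))          # 3/4
```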

#### Q3.3.18

The following two-way contingency table gives the breakdown of the population in a particular locale according to age and number of vehicular moving violations in the past three years:

| Age | 0 Violations | 1 Violation | 2+ Violations |
| --- | --- | --- | --- |
| Under 21 | 0.04 | 0.06 | 0.02 |
| 21–40 | 0.25 | 0.16 | 0.01 |
| 41–60 | 0.23 | 0.10 | 0.02 |
| 60+ | 0.08 | 0.03 | 0.00 |

A person is selected at random. Find the following probabilities.

1. The person is under (21).
2. The person has had at least two violations in the past three years.
3. The person has had at least two violations in the past three years, given that he is under (21).
4. The person is under (21), given that he has had at least two violations in the past three years.
5. Determine whether the events “the person is under (21)” and “the person has had at least two violations in the past three years” are independent or not.
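All five parts follow from marginalising and conditioning on the joint table; a way to check your answers in Python (variable names are illustrative):

```python
from fractions import Fraction as F

# Joint probabilities from the table: (age group, number of violations) -> probability.
table = {
    ("under 21", "0"): F(4, 100), ("under 21", "1"): F(6, 100), ("under 21", "2+"): F(2, 100),
    ("21-40", "0"): F(25, 100),   ("21-40", "1"): F(16, 100),   ("21-40", "2+"): F(1, 100),
    ("41-60", "0"): F(23, 100),   ("41-60", "1"): F(10, 100),   ("41-60", "2+"): F(2, 100),
    ("60+", "0"): F(8, 100),      ("60+", "1"): F(3, 100),      ("60+", "2+"): F(0, 100),
}

p_under21 = sum(p for (age, _), p in table.items() if age == "under 21")  # marginal, 0.12
p_twoplus = sum(p for (_, v), p in table.items() if v == "2+")            # marginal, 0.05
p_both = table[("under 21", "2+")]                                        # joint, 0.02

print(p_both / p_under21)               # P(2+ violations | under 21) = 1/6
print(p_both / p_twoplus)               # P(under 21 | 2+ violations) = 2/5
print(p_both == p_under21 * p_twoplus)  # independent? False
```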

#### Q3.3.19

The following two-way contingency table gives the breakdown of the population in a particular locale according to party affiliation (\(A\), \(B\), \(C\), or none) and opinion on a bond issue:

| Affiliation | Favors | Opposes | Undecided |
| --- | --- | --- | --- |
| A | 0.12 | 0.09 | 0.07 |
| B | 0.16 | 0.12 | 0.14 |
| C | 0.04 | 0.03 | 0.06 |
| None | 0.08 | 0.06 | 0.03 |

A person is selected at random. Find each of the following probabilities.

1. The person is in favor of the bond issue.
2. The person is in favor of the bond issue, given that he is affiliated with party (A).
3. The person is in favor of the bond issue, given that he is affiliated with party (B).

#### Q3.3.20

The following two-way contingency table gives the breakdown of the population of patrons at a grocery store according to the number of items purchased and whether or not the patron made an impulse purchase at the checkout counter:

| Number of Items | Impulse Purchase | No Impulse Purchase |
| --- | --- | --- |
| Few | 0.01 | 0.19 |
| Many | 0.04 | 0.76 |

A patron is selected at random. Find each of the following probabilities.

1. The patron made an impulse purchase.
2. The patron made an impulse purchase, given that the total number of items purchased was many.
3. Determine whether or not the events “few purchases” and “made an impulse purchase at the checkout counter” are independent.

#### Q3.3.21

The following two-way contingency table gives the breakdown of the population of adults in a particular locale according to employment type and level of life insurance:

| Employment Type | Low | Medium | High |
| --- | --- | --- | --- |
| Unskilled | 0.07 | 0.19 | 0.00 |
| Semi-skilled | 0.04 | 0.28 | 0.08 |
| Skilled | 0.03 | 0.18 | 0.05 |
| Professional | 0.01 | 0.05 | 0.02 |

An adult is selected at random. Find each of the following probabilities.

1. The person has a high level of life insurance.
2. The person has a high level of life insurance, given that he does not have a professional position.
3. The person has a high level of life insurance, given that he has a professional position.
4. Determine whether or not the events “has a high level of life insurance” and “has a professional position” are independent.

#### Q3.3.22

The sample space of equally likely outcomes for the experiment of rolling two fair dice is \[\begin{matrix} 11 & 12 & 13 & 14 & 15 & 16 \\ 21 & 22 & 23 & 24 & 25 & 26 \\ 31 & 32 & 33 & 34 & 35 & 36 \\ 41 & 42 & 43 & 44 & 45 & 46 \\ 51 & 52 & 53 & 54 & 55 & 56 \\ 61 & 62 & 63 & 64 & 65 & 66 \end{matrix}\] Identify the events \(N\), \(F\), and \(T\).

1. Find \(P(N)\).
2. Find \(P(N\mid F)\).
3. Find \(P(N\mid T)\).
4. Determine from the previous answers whether or not the events \(N\) and \(F\) are independent, and whether or not \(N\) and \(T\) are.

#### Q3.3.23

The sensitivity of a drug test is the probability that the test will be positive when administered to a person who has actually taken the drug. Suppose that there are two independent tests to detect the presence of a certain type of banned drug in athletes. One has sensitivity (0.75); the other has sensitivity (0.85). If both are applied to an athlete who has taken this type of drug, what is the chance that his usage will go undetected?

#### Q3.3.24

A man has two lights in his well house to keep the pipes from freezing in winter. He checks the lights daily. Each light has probability (0.002) of burning out before it is checked the next day (independently of the other light).

1. If the lights are wired in parallel one will continue to shine even if the other burns out. In this situation, compute the probability that at least one light will continue to shine for the full (24) hours. Note the greatly increased reliability of the system of two bulbs over that of a single bulb.
2. If the lights are wired in series neither one will continue to shine even if only one of them burns out. In this situation, compute the probability that at least one light will continue to shine for the full (24) hours. Note the slightly decreased reliability of the system of two bulbs over that of a single bulb.
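Both parts are short independence computations; a quick check of the arithmetic (a sketch, not part of the original exercise):

```python
p_burn = 0.002       # probability a given bulb burns out before the next check
p_ok = 1 - p_burn

# Parallel wiring: the system fails only if BOTH bulbs burn out.
p_parallel = 1 - p_burn ** 2

# Series wiring: the system shines only if BOTH bulbs survive.
p_series = p_ok ** 2

print(p_parallel)   # about 0.999996, better than a single bulb's 0.998
print(p_series)     # about 0.996004, slightly worse than a single bulb
```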

#### Q3.3.25

An accountant has observed that (5\%) of all copies of a particular two-part form have an error in Part I, and (2\%) have an error in Part II. If the errors occur independently, find the probability that a randomly selected form will be error-free.

#### Q3.3.26

A box contains (20) screws which are identical in size, but (12) of which are zinc coated and (8) of which are not. Two screws are selected at random, without replacement.

1. Find the probability that both are zinc coated.
2. Find the probability that at least one is zinc coated.

#### Q3.3.27

Events \(A\) and \(B\) are mutually exclusive. Find \(P(A\mid B)\).

#### Q3.3.28

The city council of a particular city is composed of five members of party (A), four members of party (B), and three independents. Two council members are randomly selected to form an investigative committee.

1. Find the probability that both are from party (A).
2. Find the probability that at least one is an independent.
3. Find the probability that the two have different party affiliations (that is, not both (A), not both (B), and not both independent).

#### Q3.3.29

A basketball player makes (60\%) of the free throws that he attempts, except that if he has just tried and missed a free throw then his chances of making a second one go down to only (30\%). Suppose he has just been awarded two free throws.

1. Find the probability that he makes both.
2. Find the probability that he makes at least one. (A tree diagram could help.)
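The tree here has only four paths, so both answers can be checked directly by multiplying along branches (a sketch under the problem's stated probabilities):

```python
p_make = 0.60        # chance of making a free throw normally
p_after_miss = 0.30  # chance of making a free throw right after a miss

# Four paths in the tree: make-make, make-miss, miss-make, miss-miss.
p_both = p_make * p_make
p_neither = (1 - p_make) * (1 - p_after_miss)
p_at_least_one = 1 - p_neither

print(p_both)          # about 0.36
print(p_at_least_one)  # about 0.72
```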

#### Q3.3.30

An economist wishes to ascertain the proportion (p) of the population of individual taxpayers who have purposely submitted fraudulent information on an income tax return. To truly guarantee anonymity of the taxpayers in a random survey, taxpayers questioned are given the following instructions.

1. Flip a coin.
2. If the coin lands heads, answer “Yes” to the question “Have you ever submitted fraudulent information on a tax return?” even if you have not.
3. If the coin lands tails, give a truthful “Yes” or “No” answer to the question “Have you ever submitted fraudulent information on a tax return?”

The questioner is not told how the coin landed, so he does not know if a “Yes” answer is the truth or is given only because of the coin toss.
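With a fair coin, \(P(\text{Yes}) = \tfrac{1}{2} + \tfrac{1}{2}p\), so the proportion \(p\) can be recovered as \(2\,P(\text{Yes}) - 1\). A simulation sketch of this inversion (function and variable names are illustrative, and the true proportion is an assumed value):

```python
import random

def survey_response(p_fraud, rng):
    # Heads: forced "Yes". Tails: truthful answer.
    if rng.random() < 0.5:
        return True
    return rng.random() < p_fraud

rng = random.Random(0)
true_p = 0.10
n = 100_000
yes_rate = sum(survey_response(true_p, rng) for _ in range(n)) / n

# P(Yes) = 1/2 + p/2 for a fair coin, so invert: p_hat = 2 * P(Yes) - 1.
p_hat = 2 * yes_rate - 1
print(round(p_hat, 2))   # close to the true value 0.10
```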

## Multivariate Statistics

A popular way to develop a discriminant rule is to start by assuming (or estimating) a different distribution for \(\mathbf{x}\in\mathbb{R}^p\) for each population. For example, suppose that the observations in population \(j\) have a distribution with pdf \(f_j(\mathbf{x})\), for \(j=1,\ldots,g\).

We will begin by supposing the different population distributions \(f_1(\mathbf{x}),\ldots,f_g(\mathbf{x})\) are known, and in particular, that they are multivariate normal distributions.

Example 8.2 Consider the univariate case with \(g=2\) where \(\Pi_1\) is the \(N(\mu_1,\sigma_1^2)\) distribution and \(\Pi_2\) is the \(N(\mu_2,\sigma_2^2)\) distribution. The ML discriminant rule allocates \(x\) to \(\Pi_1\) if and only if \[ f_1(x) > f_2(x), \] which is equivalent to \[ \frac{1}{(2\pi\sigma_1^2)^{1/2}} \exp\left(-\frac{1}{2\sigma_1^2}(x-\mu_1)^2\right) > \frac{1}{(2\pi\sigma_2^2)^{1/2}} \exp\left(-\frac{1}{2\sigma_2^2}(x-\mu_2)^2\right). \] Collecting terms together on the left hand side (LHS) gives \[ \begin{aligned} &\qquad \frac{\sigma_2}{\sigma_1} \exp\left(-\frac{1}{2\sigma_1^2}(x-\mu_1)^2 + \frac{1}{2\sigma_2^2}(x-\mu_2)^2\right) > 1 \\ &\iff \log\left(\frac{\sigma_2}{\sigma_1}\right) - \frac{1}{2\sigma_1^2}(x-\mu_1)^2 + \frac{1}{2\sigma_2^2}(x-\mu_2)^2 > 0 \\ &\iff x^2\left(\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}\right) + x\left(\frac{2\mu_1}{\sigma_1^2} - \frac{2\mu_2}{\sigma_2^2}\right) + \frac{\mu_2^2}{\sigma_2^2} - \frac{\mu_1^2}{\sigma_1^2} + 2\log\frac{\sigma_2}{\sigma_1} > 0. \end{aligned} \tag{8.2} \] Suppose, for example, that \(\mu_1 = \sigma_1 = 1\) and \(\mu_2 = \sigma_2 = 2\); then this reduces to the quadratic expression \[ -\frac{3}{4}x^2 + x + 2\log 2 > 0. \] Suppose that our new observation is \(x=0\), say. Then the LHS is \(2\log 2\), which is greater than zero and so we would allocate \(x\) to population 1.

Using the quadratic formula we find that \(f_1(x)=f_2(x)\) when \[ x = \frac{-1 \pm \sqrt{1+6\log 2}}{-3/2} = \frac{2}{3} \mp \frac{2}{3}\sqrt{1+6\log 2}, \] i.e. at \(x = -0.85\) and \(x = 2.18\). Our discriminant rule is thus to allocate \(x\) to \(\Pi_1\) if \(-0.85 < x < 2.18\) and to allocate it to \(\Pi_2\) otherwise. This is illustrated in Figure 8.2.

Figure 8.2: Discriminant rule for the two Gaussians example.
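Equation (8.2) and the boundary points are easy to verify numerically. A quick Python check (the function name is illustrative):

```python
import math

def lhs_82(x, mu=(1.0, 2.0), sigma=(1.0, 2.0)):
    # Left-hand side of (8.2); positive means x is allocated to population 1.
    m1, m2 = mu
    s1, s2 = sigma
    return (x**2 * (1/s2**2 - 1/s1**2)
            + x * (2*m1/s1**2 - 2*m2/s2**2)
            + m2**2/s2**2 - m1**2/s1**2
            + 2 * math.log(s2 / s1))

# Roots of -3/4 x^2 + x + 2 log 2 = 0, i.e. the boundary points of the rule:
s = math.sqrt(1 + 6 * math.log(2))
lower, upper = 2/3 - 2/3 * s, 2/3 + 2/3 * s
print(round(lower, 2), round(upper, 2))   # -0.85 2.18

print(lhs_82(0.0) > 0)   # x = 0 is allocated to population 1: True
```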

Note that this has not resulted in connected convex discriminant regions \(\mathcal{R}_i\). This is because our discriminant functions were not linear functions of \(\mathbf{x}\), so we did not find a linear discriminant rule.

Note also that if \(\sigma_1=\sigma_2\) then the \(x^2\) term in Equation (8.2) cancels, and we are left with a linear discriminant rule. For example, if \(\sigma_2=1\) with the other parameters as before, then we classify \(x\) to population 1 if

\[ 2x(\mu_1-\mu_2) + \mu_2^2 - \mu_1^2 = -2x + 3 > 0, \] i.e., if \(x < \frac{3}{2}\). In this case, we do get discriminant regions \(\mathcal{R}_j\) which are connected and convex.

Figure 8.3: Discriminant rule for the two Gaussians example when \(\sigma_2=1\)

### 8.1.1 Multivariate normal populations

Now we consider the case of \(g\) multivariate normal populations. We shall assume that for population \(k\), \[\mathbf{x}\sim N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}),\] i.e., we allow the mean of each population to vary, but have assumed a common covariance matrix between groups. We call the \(\boldsymbol{\mu}_k\) the population means or centroids.

Proposition 8.2 If cases in population \(\Pi_k\) have a \(N_p(\boldsymbol{\mu}_k,\boldsymbol{\Sigma})\) distribution, then the ML discriminant rule is \[d(\mathbf{x}) = \arg\min_k\, (\mathbf{x}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k).\]

Equivalently, if \(\delta_k(\mathbf{x}) = 2\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}^{-1}\mathbf{x} - \boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k\), then \[d(\mathbf{x}) = \arg\max_k\, \delta_k(\mathbf{x}).\] That is, this is a linear discriminant rule.

Proof. The \(k\)th likelihood is \[f_k(\mathbf{x}) = |2\pi\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right). \tag{8.3}\] This is maximised when the exponent is minimised, due to the minus sign in the exponent and because \(\boldsymbol{\Sigma}\) is positive definite.

\[\begin{aligned} (\mathbf{x}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) &= \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_k^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_k^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k \\ &= \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - \delta_k(\mathbf{x}). \end{aligned}\] Thus, \[\arg\min_k\, (\mathbf{x}-\boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) = \arg\max_k\, \delta_k(\mathbf{x}),\] as \(\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x}\) does not depend on \(k\).

### 8.1.2 The sample ML discriminant rule

To use the ML discriminant rule, we need to know the model parameters for each group, \(\boldsymbol{\mu}_k\), as well as the common covariance matrix \(\boldsymbol{\Sigma}\). We will usually not know these parameters, and instead must estimate them from training data. We then substitute these estimates into the discriminant rule. Training data typically consists of samples \(\mathbf{x}_{1,k},\ldots,\mathbf{x}_{n_k,k}\) known to be from population \(\Pi_k\), where \(n_k\) is the number of observations from population \(\Pi_k\).

We estimate the unknown population means by the sample mean for each population \[\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} \mathbf{x}_{i,k}.\]

To estimate the shared covariance matrix \(\boldsymbol{\Sigma}\), first calculate the sample covariance matrix for the \(k\)th group: \[\mathbf{S}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} (\mathbf{x}_{i,k}-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}_{i,k}-\hat{\boldsymbol{\mu}}_k)^\top.\]

Then [egin widehat<oldsymbol> = frac<1> sum_^g n_k mathbf S_k ag <8.4>end] is an unbiased estimate of (oldsymbol) where (n = n_1 + n_2 + ldots + n_g) . Note that this is not the same as the total covariance matrix (i.e. ignoring the class labels).

The sample ML discriminant rule is then defined by substituting these estimates into the rule of Proposition 8.2.
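A minimal sketch of the sample rule in the univariate case (\(p=1\)), where the pooled estimate reduces to a scalar variance; function names and the toy data are illustrative:

```python
from statistics import fmean

def fit(groups):
    # groups: label -> list of univariate observations from that population.
    means = {k: fmean(v) for k, v in groups.items()}
    n = sum(len(v) for v in groups.values())
    g = len(groups)
    # Pooled estimate in the spirit of (8.4): within-group squared deviations over (n - g).
    pooled_var = sum(sum((x - means[k]) ** 2 for x in v)
                     for k, v in groups.items()) / (n - g)
    return means, pooled_var

def classify(x, means, pooled_var):
    # Sample ML rule: minimise (x - mu_k)^2 / pooled_var, i.e. pick the nearest centroid.
    return min(means, key=lambda k: (x - means[k]) ** 2 / pooled_var)

groups = {1: [0.9, 1.1, 1.3, 0.7], 2: [1.9, 2.2, 2.1, 1.8]}
means, pooled = fit(groups)
print(classify(1.0, means, pooled))   # 1
print(classify(2.05, means, pooled))  # 2
```

With a common variance the rule is just "classify to the nearest centroid"; the division by the pooled variance matters only when comparing across features in the multivariate case.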

### 8.1.3 Two populations

If we think about the situation where \(\boldsymbol{\Sigma} = \mathbf{I}\), we can make sense of this rule geometrically. If the variance of the two populations is the identity matrix, then we can simply classify to the nearest population mean/centroid, and the decision boundary is thus the perpendicular bisector of the two centroids. Moreover, \(\mathbf{a} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\) is the vector between the two population centroids, and thus will be perpendicular to the decision boundary. An equation for the decision boundary is \(\mathbf{a}^\top(\mathbf{x} - \mathbf{h}) = 0\), where \(\mathbf{h} = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\) is the midpoint of the centroids. Thinking of the scalar product, we can see that \(\mathbf{a}^\top(\mathbf{x} - \mathbf{h})\) is proportional to the cosine of the angle between \(\mathbf{a}\) and \(\mathbf{x} - \mathbf{h}\). The point \(\mathbf{x}\) will be closer to \(\boldsymbol{\mu}_1\) than \(\boldsymbol{\mu}_2\) if the angle between \(\mathbf{a}\) and \(\mathbf{x} - \mathbf{h}\) is between \(-90^\circ\) and \(90^\circ\), or equivalently, if the cosine of the angle is greater than 0. Thus we classify \(\mathbf{x}\) to population 1 if \(\mathbf{a}^\top(\mathbf{x} - \mathbf{h}) > 0\), and to population 2 if \(\mathbf{a}^\top(\mathbf{x} - \mathbf{h}) < 0\). This situation is illustrated in Figure 8.4.

If we have more than 2 populations, then for \(\boldsymbol{\Sigma} = \mathbf{I}\) the decision boundaries are the perpendicular bisectors between the population centroids (the \(\boldsymbol{\mu}_i\)) and we simply classify to the nearest centroid.

When \(\boldsymbol{\Sigma} \neq \mathbf{I}\), we think of \(\boldsymbol{\Sigma}\) as distorting space. Instead of measuring distance using the Euclidean distance, we adjust distances to account for \(\boldsymbol{\Sigma}\). The decision boundaries are then no longer the perpendicular bisectors of the centroids.
##### Example 2

Consider the bivariate case (\(p=2\)) with \(g=2\) groups, where \(\Pi_1\) is the \(N_2(\boldsymbol{\mu}_1,\mathbf{I}_2)\) distribution and \(\Pi_2\) is the \(N_2(\boldsymbol{\mu}_2,\mathbf{I}_2)\) distribution. Suppose \(\boldsymbol{\mu}_1 = \begin{pmatrix} c \\ 0 \end{pmatrix}\) and \(\boldsymbol{\mu}_2 = \begin{pmatrix} -c \\ 0 \end{pmatrix}\) for some constant \(c>0\).

Here, \(\mathbf{a} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \begin{pmatrix} 2c \\ 0 \end{pmatrix}\) and \(\mathbf{h} = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}\). The ML discriminant rule allocates \(\mathbf{x}\) to \(\Pi_1\) if \(\mathbf{a}^\top(\mathbf{x} - \mathbf{h}) = \mathbf{a}^\top\mathbf{x} > 0\). If we write \(\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\) then \(\mathbf{a}^\top\mathbf{x} = 2cx_1\), which is greater than zero if \(x_1 > 0\). Hence we allocate \(\mathbf{x}\) to \(\Pi_1\) if \(x_1 > 0\) and allocate \(\mathbf{x}\) to \(\Pi_2\) if \(x_1 \leq 0\).

Figure 8.4: LDA when the covariance matrix is the identity

##### Example 3

Let's now generalise the previous example, making no assumptions about \(\boldsymbol{\Sigma}\), but still assuming \(\boldsymbol{\mu}_1 = -\boldsymbol{\mu}_2\). Write \(\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}\) and note that \(\mathbf{h} = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) = \boldsymbol{0}\). Then the ML discriminant rule allocates \(\mathbf{x}\) to \(\Pi_1\) if \(\mathbf{a}^\top\mathbf{x} > 0\). If we write \(\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}\) then the boundary separating \(\mathcal{R}_1\) and \(\mathcal{R}_2\) is given by \(\mathbf{a}^\top\mathbf{x} = a_1 x + a_2 y = 0\), i.e. \(y = -\frac{a_1}{a_2}x\). This is a straight line through the origin with gradient \(-a_1/a_2\).

If the variance of the \(y\) component is very small compared to the variance of the \(x\) component, then we begin to classify solely on the basis of \(y\). For example, if \(\boldsymbol{\mu}_1 = \begin{pmatrix} 2 \\ 1 \end{pmatrix}\) and \(\boldsymbol{\Sigma} = \begin{pmatrix} 1 & 0.09 \\ 0.09 & 0.1 \end{pmatrix}\) we find \(\mathbf{a} = \begin{pmatrix} 2.39 \\ 17.8 \end{pmatrix}\), which gives the line \(y = -0.13x\), i.e. a line that is getting close to being horizontal.
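The value of \(\mathbf{a} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\) in Example 3 can be checked with a few lines of Python; `solve2` is an illustrative helper that inverts a 2x2 system by Cramer's rule:

```python
def solve2(S, b):
    # Solve the 2x2 linear system S z = b by Cramer's rule (z = S^{-1} b).
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    return ((s22 * b[0] - s12 * b[1]) / det,
            (s11 * b[1] - s21 * b[0]) / det)

mu1 = (2.0, 1.0)
mu2 = (-2.0, -1.0)                  # mu_2 = -mu_1
Sigma = ((1.0, 0.09), (0.09, 0.1))

diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])   # mu_1 - mu_2 = (4, 2)
a = solve2(Sigma, diff)                     # Sigma^{-1} (mu_1 - mu_2)
print(round(a[0], 2), round(a[1], 2))       # about 2.39 and 17.85
print(round(-a[0] / a[1], 2))               # gradient of the boundary, about -0.13
```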
### 8.1.4 More than two populations

When \(g>2\), the boundaries for the ML rule will be piece-wise linear. In the exercises you will look at an example with 3 populations in two dimensions.

## Minitab Express – Frequency Tables

To create a frequency table of dog ownership in Minitab Express:

1. Open the data set FALL2016STDATA.MTW
2. On a PC: in the menu bar select STATISTICS > Describe > Tally
3. On a Mac: in the menu bar select Statistics > Summary Statistics > Tally
4. Double click the variable Dog in the box on the left to insert the variable into the Variable box
5. Under Statistics, check Counts
6. Click OK

This should result in the following frequency table:

| Dog | Count |
| --- | --- |
| No | 252 |
| Yes | 272 |
| N = | 524 |
| * = | 1 |

This chapter covers the most basic definitions of probability theory and explores some fundamental properties of the probability function. Our starting point is the concept of an abstract random experiment. This is an experiment whose outcome is not necessarily determined before it is conducted. Examples include flipping a coin, the outcome of a soccer match, and the weather. The set of all possible outcomes associated with the random experiment is called the sample space. Events are subsets of the sample space, or in other words sets of possible outcomes. The probability function assigns real values to events in a way that is consistent with our intuitive understanding of probability. Formal definitions appear below.

A sample space can be finite, for example \[\Omega=\{1,\ldots,10\}\] in the experiment of observing a number from 1 to 10. Or \(\Omega\) can be countably infinite, for example \[\Omega=\{0,1,2,3,\ldots\}\] in the experiment of counting the number of phone calls made on a specific day. A sample space may also be uncountably infinite, for example an interval of real numbers, in the experiment of measuring the height of a passer-by.
The notation \(\mathbb{N}\) corresponds to the natural numbers \(\{1,2,3,\ldots\}\), and the notation \(\mathbb{N}\cup\{0\}\) corresponds to the set \(\{0,1,2,3,\ldots\}\). The notation \(\mathbb{R}\) corresponds to the real numbers and the notation \(\mathbb{R}_+\) corresponds to the non-negative real numbers. See Chapter A in the appendix for an overview of set theory, including the notions of a power set and countably infinite and uncountably infinite sets.

In the examples above, the sample space contained unachievable values (the number of phone calls and the height of a person are bounded quantities). A more careful definition could have been used, taking into account bounds on the number of potential phone calls or potential height values. For the sake of simplicity, we often use simpler sample spaces containing some unachievable outcomes. This is not a significant problem, since we can later assign zero probability to such values.

In particular, the empty set \(\emptyset\) and the sample space \(\Omega\) are events. Figure 1.2.1 shows an example of a sample space \(\Omega\) and two events \(A,B\subset\Omega\) that are neither \(\emptyset\) nor \(\Omega\). The R code below shows all possible events of an experiment with a finite \(\Omega\). There are \(2^{|\Omega|}\) such sets when \(\Omega\) is finite (see Chapter A on set theory for more information on the power set).

For an event \(E\), the outcome of the random experiment \(\omega\in\Omega\) is either in \(E\) (\(\omega\in E\)) or not in \(E\) (\(\omega\notin E\)). In the first case, we say that the event \(E\) occurred, and in the second case we say that the event \(E\) did not occur. \(A\cup B\) is the event that either \(A\) or \(B\) occurs, and \(A\cap B\) is the event that both \(A\) and \(B\) occur. The complement \(A^c\) (where the universal set is taken to be \(\Omega\): \(A^c=\Omega\setminus A\)) represents the event that \(A\) did not occur. If the events \(A,B\) are disjoint (\(A\cap B=\emptyset\)), the two events cannot happen at the same time, since no outcome of the random experiment belongs to both \(A\) and \(B\). If \(A\subset B\), then \(A\) occurring implies that \(B\) occurs as well.
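The enumeration of all events described above (done in R in the original text) can be sketched in Python; \(\Omega=\{1,2,3\}\) here is an assumed example, since the original sample space is not preserved:

```python
from itertools import chain, combinations

def all_events(omega):
    """All possible events of a finite sample space: its power set."""
    s = sorted(omega)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

events = all_events({1, 2, 3})
print(len(events))  # 2**3 = 8 events, including the empty set and Omega itself
```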
## 3.6 Variance and standard deviation

The variance of a random variable measures the spread of the variable around its expected value. Rvs with large variance can be quite far from their expected values, while rvs with small variance stay near their expected value. The standard deviation is simply the square root of the variance. The standard deviation also measures spread, but in more natural units that match the units of the random variable itself.

Let \(X\) be a random variable with expected value \(\mu = E[X]\). The variance of \(X\) is defined as
\[
\text{Var}(X) = E[(X - \mu)^2].
\]
The standard deviation of \(X\) is written \(\sigma(X)\) and is the square root of the variance:
\[
\sigma(X) = \sqrt{\text{Var}(X)}.
\]
Note that the variance of an rv is always positive (in the French sense), as it is the integral or sum of a positive function.

The next theorem gives a formula for the variance that is often easier to use than the definition when performing computations. Applying linearity of expected values (Theorem 5.8) to the definition of variance yields:
\[
\begin{aligned}
E[(X - \mu)^2] &= E[X^2 - 2\mu X + \mu^2] \\
&= E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - \mu^2,
\end{aligned}
\]
as desired.

Let \(X \sim \text{Binom}(3, 0.5)\). Here \(\mu = E[X] = 1.5\). In Example 5.35, we saw that \(E[(X-1.5)^2] = 0.75\). Then \(\text{Var}(X) = 0.75\) and the standard deviation is \(\sigma(X) = \sqrt{0.75} \approx 0.866\). We can check both of these using simulation and the built-in R functions var and sd.

Compute the variance of \(X\) if the pdf of \(X\) is given by \(f(x) = e^{-x}\), \(x > 0\). We have already seen that \(E[X] = 1\) and \(E[X^2] = 2\) (Example 5.37). Therefore, the variance of \(X\) is
\[
\text{Var}(X) = E[X^2] - E[X]^2 = 2 - 1 = 1.
\]
The standard deviation is \(\sigma(X) = \sqrt{1} = 1\). We interpret the standard deviation \(\sigma\) as a spread around the mean, as shown in this picture:

Compute the standard deviation of the uniform random variable \(X\) on \([0,1]\).
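The original performs the \(\text{Binom}(3, 0.5)\) check in R with var and sd; an equivalent simulation sketch in Python (illustrative only):

```python
import random

random.seed(1)
n_sim = 100_000

# Simulate X ~ Binom(3, 0.5) as the number of heads in 3 fair coin flips
xs = [sum(random.random() < 0.5 for _ in range(3)) for _ in range(n_sim)]

mean = sum(xs) / n_sim
var = sum((x - mean) ** 2 for x in xs) / (n_sim - 1)  # sample variance
sd = var ** 0.5
print(mean, var, sd)  # close to 1.5, 0.75, and 0.866
```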
\[
\begin{aligned}
\text{Var}(X) &= E[X^2] - E[X]^2 = \int_0^1 x^2 \cdot 1\, dx - \left(\frac{1}{2}\right)^2 \\
&= \frac{1}{3} - \frac{1}{4} = \frac{1}{12} \approx 0.083.
\end{aligned}
\]
So the standard deviation is \(\sigma(X) = \sqrt{1/12} \approx 0.289\). Shown as a spread around the mean of 1/2:

For many distributions, most of the values will lie within one standard deviation of the mean, i.e. within the spread shown in the example pictures. Almost all of the values will lie within 2 standard deviations of the mean. What do we mean by "almost all"? Well, 85% would be almost all. 15% would not be almost all. This is a very vague rule of thumb. Chebyshev's Theorem is a more precise statement. It says in particular that the probability of being more than 2 standard deviations away from the mean is at most 25%.

Sometimes, you know that the data you collect will likely fall in a certain range of values. For example, if you are measuring the height in inches of 100 randomly selected adult males, you would be able to guess that your data will very likely lie in the interval 60-84. You can get a rough estimate of the standard deviation by dividing the expected range of values by 6; in this case, that gives 24/6 = 4. Here, we are using the heuristic that it is very rare for data to fall more than three standard deviations from the mean. This can be useful as a quick check on your computations.

Unlike expected value, variance and standard deviation are not linear. However, they do have scaling properties, and variance does distribute over sums in the special case of independent random variables:

Let \(X\) be an rv and \(c\) a constant. Then
\[
\begin{aligned}
\text{Var}(cX) &= c^2\,\text{Var}(X), \\
\sigma(cX) &= |c|\,\sigma(X).
\end{aligned}
\]

Let \(X\) and \(Y\) be independent random variables. Then
\[
\text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y).
\]

We prove part 1 here, and verify part 2 through simulation in Exercise 5.37.
\[
\text{Var}(cX) = E[(cX)^2] - E[cX]^2 = c^2 E[X^2] - (cE[X])^2 = c^2\bigl(E[X^2] - E[X]^2\bigr) = c^2\,\text{Var}(X).
\]

Theorem 5.10 part 2 is only true when \(X\) and \(Y\) are independent. If \(X\) and \(Y\) are independent, then taking \(a = 1\) and \(b = -1\) gives \(\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y)\).

Let \(X \sim \text{Binom}(n, p)\). We have seen that \(X = \sum_{i=1}^n X_i\), where the \(X_i\) are independent Bernoulli random variables. Therefore,
\[
\begin{aligned}
\text{Var}(X) &= \text{Var}\Bigl(\sum_{i=1}^n X_i\Bigr) \\
&= \sum_{i=1}^n \text{Var}(X_i) \\
&= \sum_{i=1}^n p(1-p) = np(1-p),
\end{aligned}
\]
where we have used that the variance of a Bernoulli random variable is \(p(1-p)\). Indeed, \(E[X_i^2] - E[X_i]^2 = p - p^2 = p(1-p)\).

## 1 Answer

Just as for rolling two ordinary dice, the sample space consists of a \(6 \times 6\) array of pairs of faces.

Enumeration: For the sum \(S\) on the two dice, each of the 36 cells can also be labeled with the total of the two corresponding faces. Then count the cells for each total. (The first two of the six rows are shown below.)

Analytic methods: It is easy to show that \(E(S) = E(D_a) + E(D_b) = 15/6 + 27/6 = 42/6 = 7,\) which is the same as for regular dice. A bit more tediously, one can show that \(\mathrm{Var}(S)\) is the same as for regular dice. Probability generating functions could be used to show that the distribution of \(S\) agrees with the (triangular) distribution of the sum of two ordinary dice.

Simulation: The distribution of \(S\) can be very closely approximated by simulating the sums on a million rolls of these two special dice and tallying the results. (Simulation in R statistical software gives probabilities accurate to about three places.)
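The 36-cell sample space is small enough to enumerate exactly rather than simulate. A Python sketch (the faces below are an assumption: the answer does not list them, and these are the classic Sicherman dice, whose face totals of 15 and 27 match the expectations quoted above):

```python
from collections import Counter
from itertools import product

# Assumed faces: the standard Sicherman dice, with totals 15 and 27,
# matching E(D_a) = 15/6 and E(D_b) = 27/6 as stated in the answer.
die_a = [1, 2, 2, 3, 3, 4]
die_b = [1, 3, 4, 5, 6, 8]
regular = [1, 2, 3, 4, 5, 6]

# Exact enumeration over all 36 cells of the 6 x 6 sample space
special_sums = Counter(a + b for a, b in product(die_a, die_b))
regular_sums = Counter(a + b for a, b in product(regular, regular))
print(special_sums == regular_sums)  # True: same triangular distribution

ES = sum(s * n for s, n in special_sums.items()) / 36
print(ES)  # 7.0, agreeing with the analytic computation above
```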

The plot below shows a histogram of the million simulated totals obtained when rolling a pair of these special dice. The dots show the exact distribution.

## 4.5 Probability and Statistics

Modern science may be characterized by a systematic collection of empirical measurements and the attempt to model laws of nature using mathematical language. The drive to deliver better measurements led to the development of more accurate and more sensitive measurement tools. Nonetheless, at some point it became apparent that measurements may not be perfectly reproducible, and any repeated measurement of presumably the exact same phenomenon will typically produce variability in the outcomes. On the other hand, scientists also found that there are general laws that govern this variability in repetitions. For example, it was discovered that the average of several independent repeats of the measurement is less variable and more reproducible than each of the single measurements themselves.

Probability was first introduced as a branch of mathematics in the investigation of uncertainty associated with gambling and games of chance. During the early 19th century, probability began to be used to model variability in measurements. This application of probability turned out to be very successful. Indeed, one of the major achievements of probability was the development of the mathematical theory that explains the phenomenon of reduced variability that is observed when averages are used instead of single measurements. In Chapter 7 we discuss the conclusions of this theory.

Statistics is the study of methods for inference based on data. Probability serves as the mathematical foundation for the development of statistical theory. In this chapter we introduced the probabilistic concept of a random variable. This concept is key for understanding statistics. In the rest of Part I of this book we discuss the probability theory that is used for statistical inference. Statistical inference itself is discussed in Part II of the book.