1. Probability Models and More.
1.1 Probability Models.
• Here’s an example to introduce jargon.
• You flip a coin repeatedly and record an H for heads and a T for tails. E.g. after 10 flips: T H T T T H H H H T.
• On any particular flip we can’t predict if we’ll be recording an H or T.
• In many flips though we can predict (with some measure of accuracy) the proportion of H’s (and proportion of T’s).
• Individual outcomes are unpredictable but a long-run pattern of relative frequency emerges; we use the word random to describe outcomes (events) of this nature.
Flip:                  1     2     3     4       · · ·   1000
Outcome:               T     T     T     H       · · ·           (unpredictable)
Relative frequency:    0     0     0     1/4 = 0.25   · · ·      (kinda predictable)
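The long-run behaviour in the table is easy to reproduce. Below is a quick simulation (our own sketch, not part of the original notes) that flips a simulated fair coin and prints the running proportion of heads; the proportion drifts toward 0.5 as n grows.

    import random

    random.seed(1)  # reproducible flips

    heads = 0
    for n in range(1, 100_001):
        heads += random.random() < 0.5  # one fair coin flip; True counts as 1
        if n in (10, 100, 1_000, 10_000, 100_000):
            # running relative frequency N(H, n) / n
            print(f"n = {n:>6}: proportion of heads = {heads / n:.4f}")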
• Definition: In this context we call the limiting relative frequency of the outcome (or event) the probability of the outcome (or event).
• For example I found that I could bring the proportion of heads flipped as close to 0.5 as I wanted by flipping the coin more and more.
• For that reason I say the probability of a head on any particular flip is 0.5.
• Mathematically: Provided the limit exists, and denoting by N(H, n) the number of heads in the first n independent flips of the fair coin, we assign

    lim_{n→∞} N(H, n)/n

as the probability of getting a head in a single flip of a fair coin.
• Let’s go further into probability theory; a good path starts with some definitions.
• Definition: A probability model is a framework for measuring the probability that an event occurs.
• There are 3 or 4 key components to a probability model:
• Definition: The experiment is the procedure or phenomenon that generates a random outcome. For example: Flipping a coin and recording the outcome. Waiting at the bus stop and recording the waiting time.
• Definition: The Sample Space is the set of all possible outcomes of the experiment. • Notation: We denote the sample space of an experiment by S. For example: If the experiment is flipping a coin and recording the outcome then S = {H, T}. If the experiment is waiting at the bus stop and recording the waiting time then S = {x : x ≥ 0}.
• Definition: Events are subsets of the sample space. • Notation: We denote events by capital letters from the beginning of the alphabet. • Notation: We denote the set of events by S. For Example: If we denote by A the event that we flipped a tail then A = {T}. If we denote by A the event that we waited more than 10 minutes then A = {x : x > 10}. • Important: Although there are technical issues that sometimes make this impossible, we will ignore them and always assume that S is the set of all subsets of S. For Example: If we denote by S the set of all subsets of S = {H, T} then S = {∅, {H}, {T}, S}. If we denote by S the set of all subsets of S = {x : x ≥ 0} then S = {A : A ⊂ S}.
• Definition: A Probability Measure is a rule for assigning probabilities to events.
• Notation: We denote probability measures by P . • Mathematically: A probability measure is a set function mapping events in S into [0, 1] that satisfies 3 conditions that we call the axioms of probability. • Axioms of Probability: Suppose that S is a sample space and S are the events of S. If P is to be a probability measure for the model then P must satisfy: 1. 0 ≤ P (A) ≤ 1 for A ∈ S
2. P (S) = 1
3. if A1, A2, A3, . . . are a countable collection of pairwise disjoint events in S then P(A1 ∪ A2 ∪ A3 ∪ · · · ) = P(A1) + P(A2) + P(A3) + · · · For example: If we denote by A the event we flip a tail and assuming that the coin is fair, then P(A) = P({T}) = 1/2. If we denote by A the event we wait longer than 10 minutes for the bus, and supposing that in many, many visits to the bus stop, in 1 out of every 3 visits we had to wait longer than 10 minutes, then P(A) = P({x : x > 10}) = 1/3.
• We’ve just talked about one theoretical (and sometimes practical) rule for assigning probabilities to events; the long run relative frequency rule. • The long run relative frequency rule is the most intuitively sound so always keep it in mind.
1.2 Review of Set Theory.
• Some familiarity with set theory is required to study probability theory. • Definition: A set is a collection of elements. • We usually write down sets in one of two ways; listing out the elements or specifying a criterion for inclusion. For Example: The set of Jane’s household pets: A = {cat, dog, fish}. The set of real numbers between zero and one: A = {x : x ∈ R and 0 ≤ x ≤ 1}.
• Suppose A and B are two sets; then we write A ⊂ B if and only if x ∈ A implies x ∈ B, and we say A is contained in B or A is a subset of B. A = B if and only if A ⊂ B and B ⊂ A, and we say that A equals B. • The set with no elements is denoted by ∅ and called the empty set. • In the context of probability theory ∅ is called the impossible event. • Definitions: Denote by S a set and A, A1, A2, A3, . . . a collection of subsets of S. The union of A1, A2, A3, . . . is those elements of S that are in A1 or A2 or A3 or . . . . Notation: The union of A1, A2, A3, . . . is denoted by A1 ∪ A2 ∪ A3 ∪ · · · . In the context of probability, if S is a sample space then A1 ∪ A2 ∪ A3 ∪ · · · is the event that A1 occurs or A2 occurs or A3 occurs or . . . , i.e. the event that at least one of them occurs. The intersection of A1, A2, A3, . . . is those elements of S that are in A1 and A2 and A3 and . . . . Notation: The intersection of A1, A2, A3, . . . is denoted by A1 ∩ A2 ∩ A3 ∩ · · · . In the context of probability, if S is a sample space then A1 ∩ A2 ∩ A3 ∩ · · · is the event that A1 occurs and A2 occurs and A3 occurs and . . . , i.e. the event that they all occur. The complement of A is those elements of S that are not in A. Notation: The complement of A is denoted by A^C. In the context of probability, if S is a sample space then A^C is the event that A does not occur. • Definition: Suppose S is a set and A and B are subsets of S; if A ∩ B = ∅ then we say A and B are disjoint or mutually exclusive.
• In the context of probability it is impossible for A and B to occur simultaneously. • Important: Disjointness of events is a set theory thing, not a probability one. For Example: For experiment, you drop your pencil onto table B (the table of random numbers at the back of the textbook) and record the number the pencil points to. S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. If A is the event the pencil points to an even number, then A = {0, 2, 4, 6, 8}. If B is the event the pencil points to a number bigger than 5, then B = {6, 7, 8, 9}. A ∪ B = {0, 2, 4, 6, 7, 8, 9}. A ∩ B = {6, 8}. A ∩ B ≠ ∅ implies that A and B are not disjoint. A^C = {1, 3, 5, 7, 9}.
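The pencil-drop events can be checked mechanically; here is a small sketch (names are ours, not the notes’) using Python’s built-in sets.

    S = set(range(10))            # sample space {0, 1, ..., 9}
    A = {0, 2, 4, 6, 8}           # even number
    B = {6, 7, 8, 9}              # bigger than 5

    print(A | B)                  # union: {0, 2, 4, 6, 7, 8, 9}
    print(A & B)                  # intersection: {6, 8}
    print(S - A)                  # complement of A: {1, 3, 5, 7, 9}
    print(A.isdisjoint(B))        # False, since A ∩ B ≠ ∅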
• Definition: if S is a set and A1, A2, . . . , An are a collection of subsets of S that satisfy 1. they are pairwise disjoint, i.e. Ai ∩ Aj = ∅ for i ≠ j, and
2. A1 ∪ A2 ∪ · · · ∪ An = S
then we say that A1, A2, . . . , An partitions S. For Example: With respect to the pencil experiment, if A1 = {0, 1, 2}, A2 = {3, 4, 5}, A3 = {6, 7, 8}, A4 = {9} then the A’s are pairwise disjoint (no pair of the events have any outcomes in common) and the union of all of them is S. They partition S. For Example: Suppose that S = R and An = [0, 1/n) for n = 1, 2, 3, . . . Then An = {x : x ∈ R and 0 ≤ x < 1/n} for n = 1, 2, 3, . . . Notice that A1 ⊃ A2 ⊃ A3 ⊃ · · ·. ∪_{k=1}^∞ Ak = A1, since if two sets A and B satisfy A ⊂ B then A ∪ B = B. ∩_{k=1}^∞ Ak = {0}, since 0 is the only number that is in all the Ak’s: for any x > 0 there exists an N(x) big enough that 1/k < x for all k ≥ N(x).
1.3 Probability Measures
• Let’s talk a little bit more about the probability measure in our probability model. • Axiom 1 follows immediately from the frequentist view of probability and actually, together with common sense about counting, so does Axiom 3. • To get a feel for axiom 3, suppose that A and B are two disjoint events and consider the long run relative frequency view of probability: If A and B are disjoint events with probabilities P(A) and P(B) then N(A ∪ B, n) = N(A, n) + N(B, n), where as before N(A ∪ B, n) is the number of times that A occurs or B occurs in n opportunities, so that

    P(A ∪ B) = lim_{n→∞} N(A ∪ B, n)/n
             = lim_{n→∞} (N(A, n)/n + N(B, n)/n)
             = lim_{n→∞} N(A, n)/n + lim_{n→∞} N(B, n)/n
             = P(A) + P(B)
• Axioms 1-3 have some implications that we will put into our tool bag for calculating probabilities.
• Complement Rule: For any event A ∈ S, P(A^C) = 1 − P(A). Proof: Note that S = A ∪ A^C and A ∩ A^C = ∅, so by axioms 2 and 3,

    1 = P(S) = P(A ∪ A^C) = P(A) + P(A^C),

implying P(A^C) = 1 − P(A).
For Example: Suppose that S = {1, 2, 3, . . . , 100} and P({1}) = 0.1, and find P({2, 3, . . . , 100}). P({2, 3, . . . , 100}) = 1 − P({1}) = 1 − 0.1 = 0.9.
• Law of Total Probability: If B1, B2, B3, . . . is a countable collection of events in S that partitions S then for any A ∈ S,

    P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + · · ·

Proof: Note that A = A ∩ S = A ∩ (B1 ∪ B2 ∪ · · · ) = (A ∩ B1) ∪ (A ∩ B2) ∪ · · ·. Since Bi ∩ Bj = ∅ for i ≠ j and A ∩ Bi ⊂ Bi for all i, it follows that (A ∩ Bi) ∩ (A ∩ Bj) = ∅ for all i ≠ j. By axiom 3,

    P(A) = P((A ∩ B1) ∪ (A ∩ B2) ∪ (A ∩ B3) ∪ · · · ) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + · · ·
For Example: Forty four percent of STA B52 students are female and have long hair. Fifteen percent of STA B52 students are male and have long hair. Find the probability that a randomly chosen STA B52 student has long hair. If we denote by A the event that they have long hair and by B1 and B2 the events that they are female and male respectively, then P(A) = P(A ∩ B1) + P(A ∩ B2) = 0.44 + 0.15 = 0.59. • Monotonicity: If A and B are two events in S such that A ⊂ B then P(A) ≤ P(B). Proof: Note that B = A ∪ (B ∩ A^C) and A ∩ (B ∩ A^C) = ∅, so it follows by axiom 3 that P(B) = P(A) + P(B ∩ A^C).
By axiom 1, P(B ∩ A^C) ≥ 0, together implying that P(B) ≥ P(A).
For Example: Suppose that S = {1, 2, 3, . . . , 100} and P({1}) = 0.1, and estimate P({3, . . . , 100}). P({2, 3, . . . , 100}) = 1 − P({1}) = 1 − 0.1 = 0.9 and moreover {3, . . . , 100} ⊂ {2, 3, . . . , 100}, implying that P({3, . . . , 100}) ≤ P({2, 3, . . . , 100}) = 0.9. • Inclusion-Exclusion Principle: If A and B are two events in S then P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Proof:
Note that A = (A ∩ B) ∪ (A ∩ B^C) and (A ∩ B) ∩ (A ∩ B^C) = ∅, implying via axiom 3 that P(A) = P(A ∩ B) + P(A ∩ B^C). Moreover, A ∪ B = B ∪ (A ∩ B^C) and B ∩ (A ∩ B^C) = ∅, implying via axiom 3 that P(A ∪ B) = P(B) + P(A ∩ B^C).
Re-arranging the first equation gives us P(A ∩ B^C) = P(A) − P(A ∩ B). Substituting this into the second equation gives us P(A ∪ B) = P(B) + P(A) − P(A ∩ B). For Example: A STA B52 student arrives late ten percent of the time, leaves early twenty percent of the time and arrives late and leaves early five percent of the time. Find the probability that a STA B52 student arrives late or leaves early. If we denote by A the event the student arrives late and by B the event they leave early, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.10 + 0.20 − 0.05 = 0.25. • Sub-Additivity: If A1, A2, A3, . . . are a countable collection of events in S then P(A1 ∪ A2 ∪ A3 ∪ · · · ) ≤ P(A1) + P(A2) + P(A3) + · · · Proof sketch: write the union as a disjoint union by setting B1 = A1 and Bk = Ak ∩ (A1 ∪ · · · ∪ Ak−1)^C for k ≥ 2; then by axiom 3 and monotonicity (Bk ⊂ Ak so P(Bk) ≤ P(Ak)) it follows that P(∪k Ak) = P(∪k Bk) = Σ_k P(Bk) ≤ Σ_k P(Ak).
For Example: Suppose that P(A) = 0.2 and P(B) = 0.5. Find upper and lower bounds for P(A ∪ B). By sub-additivity, P(A ∪ B) ≤ P(A) + P(B) = 0.2 + 0.5 = 0.7. By the inclusion-exclusion principle, P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Note that A ∩ B ⊂ A and A ∩ B ⊂ B, implying that P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B). It follows that P(A ∩ B) ≤ min(P(A), P(B)). Putting all of this together we get

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≥ P(A) + P(B) − min(P(A), P(B)) = 0.2 + 0.5 − 0.2 = 0.5.

• Continuity of Probability: If A ∈ S and A1, A2, A3, . . . is a countable collection of events in S such that {Ak} ↑ A or {Ak} ↓ A, then

    lim_{k→∞} P(Ak) = P(A).

Notation: We write {Ak} ↑ A if A1 ⊂ A2 ⊂ A3 ⊂ · · · and ∪_{k=1}^∞ Ak = A.
Notation: We write {Ak} ↓ A if A1 ⊃ A2 ⊃ A3 ⊃ · · · and ∩_{k=1}^∞ Ak = A.
Proof: Suppose that {Ak} ↑ A.
Let B1 = A1 and Bk = Ak ∩ A_{k−1}^C for k = 2, 3, . . . Then Bi ∩ Bj = ∅ for i ≠ j, but ∪_{k=1}^∞ Bk = ∪_{k=1}^∞ Ak = A and ∪_{k=1}^n Bk = ∪_{k=1}^n Ak = An, so that

    P(A) = P(∪_{k=1}^∞ Ak) = P(∪_{k=1}^∞ Bk)
         = Σ_{k=1}^∞ P(Bk)
         = lim_{n→∞} Σ_{k=1}^n P(Bk)
         = lim_{n→∞} P(∪_{k=1}^n Bk)
         = lim_{n→∞} P(∪_{k=1}^n Ak)
         = lim_{n→∞} P(An)
For Example: Suppose that P([0, 1]) = 1 but P([1/n, 1]) = 0 for all n. Find P({0}). If we let Ak = [1/k, 1] then note A1 ⊂ A2 ⊂ A3 ⊂ · · · and ∪_{k=1}^∞ Ak = (0, 1], implying via the continuity of probability that

    P((0, 1]) = lim_{n→∞} P([1/n, 1]) = lim_{n→∞} 0 = 0.

Using this together with axiom 3 we get 1 = P([0, 1]) = P({0}) + P((0, 1]) = P({0}) + 0 = P({0}).
1.4 Finite Sample Spaces
• There is one setting where finding probabilities of events is very straightforward (though not necessarily easy going). • The sample space S can be finite, countably infinite or uncountably infinite. • In the situation where S is finite we can without loss of generality write S = {s1, s2, . . . , sk} for some k ∈ N and completely describe P on S with

    P({sj}) = pj,  j = 1, 2, . . . , k,

where p1, p2, . . . , pk ≥ 0 and p1 + p2 + · · · + pk = 1, together with the formula

    P(A) = Σ_{sj ∈ A} P({sj})  for A ∈ S.
• A demonstration of this would use axiom three. For example: If A ∈ S then there exists {i1, i2, . . . , in} ⊂ {1, 2, . . . , k} such that A = {si1, si2, . . . , sin}, and then by axiom 3 it follows that P(A) = P({si1, si2, . . . , sin}) = P({si1}) + P({si2}) + · · · + P({sin}).
• All this says is, if there is only a finite number of possible outcomes and we know the probability of each one occurring, then to calculate the probability of any event just sum up the probabilities of the outcomes that make up the event. • A special case of this is when the finite number of possible outcomes are equally likely, so that we can write S = {s1, s2, . . . , sk} for some k ∈ N and

    P({sj}) = 1/|S| = 1/k,  j = 1, 2, . . . , k.

Then the formula for P(A), A ∈ S, becomes

    P(A) = Σ_{sj ∈ A} P({sj}) = Σ_{sj ∈ A} 1/|S| = |A|/|S| = |A|/k  for A ∈ S.
• That the outcomes are equally likely has to be a reasonable assumption or justified empirically via the long run relative frequency view of probability.
• Sometimes you can list out and count the outcomes in an event or sample space; sometimes that is not feasible. • Here are some useful results from combinatorics for counting outcomes in events. • The Multiplication Rule: If a process involves m steps and step k can be done in nk ways for k = 1, 2, . . . , m, then the process can be done in n1 n2 · · · nm ways. • The Combination Rule: The number of ways that you can choose k objects from n is n!/(k!(n − k)!) and is denoted by (n choose k). • The Partition Rule: The number of ordered partitions of n objects into k sets of sizes n1, n2, . . . , nk is n!/(n1! · · · nk!) and is denoted by (n choose n1, n2, . . . , nk).
For Example: You flip a fair coin 5 times independently; what is the probability of getting all heads? Denote by A the event that you get all heads. Then by the fact that all 5-tuples in {H, T}^5 are equally likely and the multiplication rule it follows that

    P(A) = |A|/|S| = 1/2^5.
For Example: An urn contains 5 red balls and 7 blue balls. You select 3 balls randomly from the urn. What is the probability all 3 balls are the same color? Denote by A the event all the balls are the same color, by A1 the event all the balls are red and by A2 the event all the balls are blue. Then A = A1 ∪ A2 and A1 ∩ A2 = ∅. By axiom 3, the fact that all possible 3-ball combinations are equally likely, and the combination rule it follows that

    P(A) = P(A1 ∪ A2) = P(A1) + P(A2) = |A1|/|S| + |A2|/|S| = [ (5 choose 3)(7 choose 0) + (5 choose 0)(7 choose 3) ] / (12 choose 3).
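The combination-rule arithmetic is easy to check with Python’s math.comb; a minimal sketch of the computation above:

    from math import comb

    total = comb(12, 3)                            # all equally likely 3-ball choices
    all_red = comb(5, 3) * comb(7, 0)              # |A1|
    all_blue = comb(5, 0) * comb(7, 3)             # |A2|

    p_same_colour = (all_red + all_blue) / total   # P(A) = P(A1) + P(A2)
    print(p_same_colour)                           # 45/220 ≈ 0.2045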
1.5 Conditional Probability and Independence
• Not all of probability theory is implied by the axioms of probability measures. • There are a couple of important ideas in probability theory that we introduce via definition. • Definition: Given two events A and B, the conditional probability of A given that B occurs is denoted by P(A|B) and is equal to

    P(A|B) = P(A ∩ B)/P(B),

provided that P(B) > 0. • We like the long run relative frequency approach to probability because it holds a lot of intuitive weight. • Let’s reconcile the definition above with the long run relative frequency approach to probability. Suppose A and B are two events with P(B) > 0. Then lim_{n→∞} N(B, n)/n > 0, implying N(B, n) → ∞ as n → ∞. An ideal candidate for P(A|B), we think, is

    lim_{N(B,n)→∞} N(A ∩ B, n)/N(B, n),

the long run relative frequency of A amongst those experiments where B occurred. Dividing top and bottom by n we get

    lim_{N(B,n)→∞} N(A ∩ B, n)/N(B, n) = lim_{n→∞} (N(A ∩ B, n)/n) / (N(B, n)/n) = P(A ∩ B)/P(B).
• Visually:

    Opportunity:    1    2    3    4    5    6    · · ·
    A:              T    F    T    F    F    T    · · ·
    B:              F    T    T    T    F    T    · · ·
    A ∩ B:          F    F    T    F    F    T    · · ·

    Relative frequency of A amongst the opportunities where B occurred: 0/1, 1/2, 1/3, 2/4, . . . → P(A|B)
For Example: A STA B52 student arrives late 10 percent of the time, leaves early 20 percent of the time and does both 5 percent of the time. What is the probability that they leave early given that they arrived late? Denote by A the event that they arrive late and by B the event that they leave early. Then

    P(B|A) = P(B ∩ A)/P(A) = 0.05/0.10 = 1/2.
• The definition of conditional probability immediately leads to another important formula for calculating probabilities. • The Multiplication Rule: For two events A and B, P(A ∩ B) = P(A)P(B|A), and this makes sense even when P(A) = 0. • Recall the Law of Total Probability: If B1, B2, B3, . . . is a countable collection of events in S that partitions S then for any A ∈ S,

    P(A) = Σ_{k=1}^∞ P(A ∩ Bk).
• The multiplication rule gives us a new way to apply the law of total probability. • Law of Total Probability, Conditional Form: If B1, B2, B3, . . . is a countable collection of events in S that partitions S then for any A ∈ S,

    P(A) = Σ_{k=1}^∞ P(A ∩ Bk) = Σ_{k=1}^∞ P(Bk)P(A|Bk).
For Example: 55 percent of STA B52 students are female, of which 4/5 have long hair. 45 percent of students are male, of which 1/3 have long hair. What is the probability that a student selected randomly has long hair? Denote by A the event the student is female, by B the event the student is male and by C the event the student has long hair. Note that A ∩ B = ∅ and A ∪ B = S. Then P(C) = P(A)P(C|A) + P(B)P(C|B) = (0.55)(4/5) + (0.45)(1/3) = 0.59.
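The same computation as a one-line sketch in Python, with the numbers from the example:

    # Law of total probability, conditional form:
    # P(C) = P(A) P(C|A) + P(B) P(C|B), where A, B partition S
    p_female, p_long_given_female = 0.55, 4 / 5
    p_male, p_long_given_male = 0.45, 1 / 3

    p_long_hair = p_female * p_long_given_female + p_male * p_long_given_male
    print(p_long_hair)  # 0.59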
• Two events A and B are said to be independent if knowing that one occurred does not affect your measurement of the probability of the other occurring. • The idea of independence is important in probability but to incorporate it into the theory we need to define it mathematically. • If A and B are independent we should expect a definition that implies P(B|A) = P(B), if knowledge of A is to have no effect on your measurement of the probability of B. • Definition: Two events A and B are independent if and only if P(A ∩ B) = P(A)P(B). • Theorem: Two events A and B of positive probability are independent if and only if P(A|B) = P(A) and P(B|A) = P(B).
Proof: (⇒) P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A); P(B|A) = P(B) follows the same way. (⇐) P(A ∩ B) = P(B)P(A|B) = P(B)P(A). In the situation where one or both of P(A) or P(B) is zero, one or both of the conditional probabilities in the theorem are meaningless. • What does independence mean in the context of the long run relative frequency approach to probability? Provided P(B) > 0, then for n big enough N(B, n) > 0 and

    N(A ∩ B, n)/n = (N(A ∩ B, n)/N(B, n)) · (N(B, n)/n),

and

    P(A ∩ B) = lim_{n→∞} N(A ∩ B, n)/n = lim_{n→∞} (N(A ∩ B, n)/N(B, n)) · (N(B, n)/n) = lim_{n→∞} N(A ∩ B, n)/N(B, n) · P(B).

So A and B are independent if and only if

    lim_{n→∞} N(A ∩ B, n)/N(B, n) = P(A) = lim_{n→∞} N(A, n)/n.
That is to say, the long run relative frequency of A equals the long run relative frequency of A amongst experiments where B occurred. For Example: A hat contains 3 cards, one black on both sides, one white on both sides and one black on one side and white on the other. The experiment consists of randomly selecting a card from the hat and putting it down on a table. Denote by A the event “the face up is black,” and by B the event “the face down is white.” Are A and B independent? Recall, A and B are independent if and only if P(A ∩ B) = P(A)P(B). For this experiment

    S = { b/b, b/b, w/w, w/w, b/w, w/b },

where for example b/w is the outcome that the face up is black and the face down is white. Then P(A) = |A|/|S| = 3/6, P(B) = 3/6 and P(A ∩ B) = 1/6. P(A ∩ B) = 1/6 ≠ 3/6 · 3/6 = P(A)P(B) and we conclude that A and B are not independent.
For Example: A follow up question. Suppose a card is placed on the table and the face up is black; what is the probability that the face down is white?

    P(B|A) = P(A ∩ B)/P(A) = (1/6)/(3/6) = 1/3.
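The three-card answer surprises many people, so it is worth checking by simulation; a sketch under the stated setup (cards stored as (up, down) pairs, orientation randomized):

    import random

    random.seed(2)
    cards = [("b", "b"), ("w", "w"), ("b", "w")]

    n = a = b = ab = 0
    for _ in range(200_000):
        up, down = random.choice(cards)
        if random.random() < 0.5:          # the card lands either side up
            up, down = down, up
        n += 1
        a += up == "b"                     # event A: face up is black
        b += down == "w"                   # event B: face down is white
        ab += up == "b" and down == "w"

    print(a / n, b / n, ab / n)            # ≈ 0.5, 0.5, 1/6: P(A ∩ B) ≠ P(A)P(B)
    print(ab / a)                          # ≈ 1/3 = P(B | A)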
• We can generalize the multiplication rule to more than two events. • Multiplication Rule: If A1, A2, . . . , An are a collection of n events in S, then P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1)P(A2|A1)P(A3|A1 ∩ A2) · · · P(An|A1 ∩ · · · ∩ An−1). For Example: For experiment, 3 cards are dealt from an ordinary 52 card deck. What is the probability the first card is the ace of clubs, the second card is the 5 of clubs and the third card is the king of diamonds? Denote by A the event the first card is the ace of clubs, by B the event the second card is the 5 of clubs and by C the event the third card is the king of diamonds. Then P(A ∩ B ∩ C) = P(A)P(B|A)P(C|A ∩ B) = (1/52)(1/51)(1/50). • We can generalize the idea of independence to more than two events. • Definition: A countable collection of events A1, A2, . . . in S are independent if and only if P(Ai1 ∩ Ai2 ∩ · · · ∩ Ain) = P(Ai1)P(Ai2) · · · P(Ain) for any finite subcollection of distinct events. For Example: You flip a fair coin 5 times independently. What is the probability of getting at least 1 head? Denote by Ak the event you get a tail on the kth flip for k = 1, 2, 3, 4, 5, and by A the event that you get at least one head. Noticing that the complement of the event you get at least one head is the event you get all tails, and using independence of the Ak’s:
    P(A) = 1 − P(A^C) = 1 − P(A1 ∩ A2 ∩ · · · ∩ A5) = 1 − P(A1) · · · P(A5) = 1 − (1/2)^5.
For Example: An experiment consists of 2 procedures. First a fair six-sided die is thrown, and then as many fair coins as there were dots showing on the die are thrown. What is the probability of getting at least one head? Denote by A the event you get at least one head and by Bk the event you rolled a k, for k = 1, 2, . . . , 6. Then

    P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + · · · + P(B6)P(A|B6)
         = P(B1)(1 − P(A^C|B1)) + P(B2)(1 − P(A^C|B2)) + · · · + P(B6)(1 − P(A^C|B6))
         = (1/6)(1 − 1/2) + (1/6)(1 − (1/2)^2) + · · · + (1/6)(1 − (1/2)^6).
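Evaluating the sum, a quick sketch:

    # P(at least one head) = sum over die outcomes k of (1/6) * (1 - (1/2)^k)
    p = sum((1 / 6) * (1 - 0.5 ** k) for k in range(1, 7))
    print(p)  # 0.8359375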
2. Random Variables and Distributions.
2.1 Random Variables
• The probability space (S, S, P) can be very abstract. Nothing rules out the possibility that S is a set whose elements are H’s and T’s, or even “cat”, “dog” and “fish” for that matter. • Many times a scientist, mathematician or otherwise is interested in a number valued quantity that depends on the outcome of an experiment. • Definition: A random variable, denoted by X, is a function from the sample space, S, to the set R of real numbers. • I.e. X : S → R. For Example: For experiment a fair coin is flipped repeatedly and the outcomes recorded. Denote by X the random variable equal to the number of heads in the first 2 flips. Then S = {(x1, x2, . . .) : xk ∈ {H, T}} and for example X((H, T, . . .)) = 1,
X((T, T, . . .)) = 0,
etc. . .
• Notation: I will sometimes write supp(X) for the possible values that the random variable X can take and call those possible values the “support” of X. For Example: For experiment a fair coin is flipped repeatedly and the outcomes recorded. Denote by X the random variable equal to the number of heads in the first 2 flips. Then supp(X) = {0, 1, 2}.
• Random variables are functions and can be combined with binary operations like you would ordinary functions. • Suppose (S, S) is an experiment space and X and Y are random variables on that space. Then (X + Y)(s) = X(s) + Y(s), (X − Y)(s) = X(s) − Y(s), (X × Y)(s) = X(s) × Y(s), etc . . .
For Example: For experiment a fair coin is flipped repeatedly and the outcomes recorded. Denote by X the random variable equal to the number of heads in the first 2 flips and by Y the random variable equal to the flip number on which you get the first head. Then for example
    (X + Y)((T, T, T, H, . . .)) = X((T, T, T, H, . . .)) + Y((T, T, T, H, . . .)) = 0 + 4 = 4
etc . . .
• There are some random variables worth mentioning:
• Definition: If A ∈ S, then the indicator of A, denoted by IA, is the random variable equal to

    IA(s) = 1 if s ∈ A,  0 if s ∉ A.
• Visually:
For Example: For experiment a fair coin is flipped repeatedly and the outcomes recorded. Denote by A the event at least one head in the first 2 flips. Then for example IA ((H, H, . . .)) = 1,
IA ((T, T, . . .)) = 0
etc . . .
For Example: We sometimes write down random variables using indicator random variables. Suppose X is a random variable with supp(X) = {0, 1, 2, . . .}. Then

    X(s) = Σ_{k=1}^∞ I_{X≥k}(s)  for s ∈ S.
• Definition: A random variable X is said to be simple if

    X(s) = Σ_{k=1}^n ck I_{Ak}(s),

where the ck are real numbers and the Ak ∈ S for k = 1, 2, . . . , n. • Visually:
For Example: We sometimes approximate random variables using simple random variables. Suppose S = [0, 1] and X is the random variable such that X(s) = √s for s ∈ S. If Ak = ((k − 1)/n, k/n] for k = 1, 2, 3, . . . , n, then

    Xn(s) = Σ_{k=1}^n √((k − 1)/n) I_{Ak}(s)

is a simple random variable approximation for X.
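A small sketch of this approximation (our own illustration, not from the notes), comparing Xn(s) with X(s) = √s at a few points; Xn takes the value √((k − 1)/n) on Ak:

    from math import sqrt, ceil

    def x_exact(s):
        return sqrt(s)

    def x_simple(s, n):
        # s lies in A_k = ((k-1)/n, k/n]; the simple r.v. equals sqrt((k-1)/n) there
        if s == 0:
            return 0.0
        k = ceil(s * n)
        return sqrt((k - 1) / n)

    for s in (0.13, 0.47, 0.92):
        print(s, x_exact(s), x_simple(s, 10), x_simple(s, 1000))

As n grows, the simple approximation tracks √s ever more closely.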
2.2 Distributions
• Terminology: In our framework for studying probability, we call the pair (S, S) an experiment space and the triple (S, S, P) a probability space. • Random variables take us from the probability space (home of the experiment space and probability measure P) to the space of real numbers. • Definition: We denote by B the set of all subsets of R. • B will be our model for events in the space of real numbers. • B is to R what S was to S. • Visually:
• Our trip from the probability space (S, S, P ) to the new space (R, B) is almost complete, we are only missing a probability measure that will measure the events in B. • Important: If (S, S, P ) is a probability space and X is a random variable on that space, then X induces a probability measure on (R, B), denoted by PX , in the following way: If B ∈ B then PX (B) = P ({s ∈ S : X(s) ∈ B}) = P (X −1 (B)). • PX : B → [0, 1]. • Basically, to find the probability of B ∈ B, we find the probability of all those outcomes in S that get mapped into B. • Visually:
For Example: For experiment you flip a fair coin 2 times and record the outcomes. Denote by X the random variable equal to the number of heads. Note S = {(H, H), (H, T), (T, H), (T, T)} and the outcomes are equally likely. Consider B = (−∞, 1.5] ∈ B. When considering the random variable X, this corresponds to the event we flip 1½ heads or fewer. Then

    PX((−∞, 1.5]) = PX({X ≤ 1.5})
                  = P({s ∈ S : X(s) ≤ 1.5})
                  = P({(H, T), (T, H), (T, T)})
                  = |{(H, T), (T, H), (T, T)}| / |{(H, H), (H, T), (T, H), (T, T)}|
                  = 3/4.
For Example: Same experiment and random variable as above. Again S = {(H, H), (H, T), (T, H), (T, T)} and the outcomes are equally likely. Consider B = {π} ∈ B. When considering the random variable X, this corresponds to the event we flip exactly π heads. Then

    PX({π}) = PX({X = π}) = P({s ∈ S : X(s) = π}) = P(∅) = 0.
• Random variables are by their very nature unpredictable. • If you are interested in the speed of light and you believe that it is constant then one number describes the whole story. Random variables are different. • If you are interested in a random variable the best that you can hope to describe are the possible values that the variable takes together with the probability that it takes those values. • The distribution of a random variable is exactly that: the possible values that the variable takes together with the probability that it takes those values. • Definition: If X is a random variable, then the distribution of X is the collection of probabilities {PX (B) : B ∈ B}. • That’s a lot of probabilities to keep track of!
2.3 Cumulative Distribution Functions.
• Sets B ∈ B of the form (−∞, x] play an important role in probability theory. • Definition: Given a distribution {PX(B) : B ∈ B}, its cumulative distribution function is the function FX : R → [0, 1] defined by FX(x) = PX((−∞, x]) = P(X ≤ x). • There are way more B ∈ B than just B of the form (−∞, x] but . . . • Theorem: Let {PX(B) : B ∈ B} be a distribution with c.d.f FX. If B ∈ B then PX(B) can be determined solely from the values of FX(x). • I.e. FX completely describes PX : B → [0, 1]. For Example: Suppose {PX(B) : B ∈ B} is a distribution with cumulative distribution function FX and consider B = (a, b]. Notice that (−∞, b] = (−∞, a] ∪ (a, b]. It follows by axiom 3 that PX((−∞, b]) = PX((−∞, a] ∪ (a, b]) = PX((−∞, a]) + PX((a, b]). Re-arranging, and by the definition of FX, it follows that

    PX((a, b]) = PX((−∞, b]) − PX((−∞, a]) = FX(b) − FX(a).

For Example: Suppose {PX(B) : B ∈ B} is a distribution with cumulative distribution function FX and consider B = (a, b). Notice that if Bn = (a, b − 1/n] then 1. B1 ⊂ B2 ⊂ · · · and 2. ∪_{n=1}^∞ Bn = B = (a, b). It follows by the continuity of probability that
    PX((a, b)) = PX(B) = lim_{n→∞} PX(Bn) = lim_{n→∞} PX((a, b − 1/n]) = lim_{n→∞} FX(b − 1/n) − FX(a) ≡ FX(b⁻) − FX(a).
• Obviously, the more complicated B ∈ B, the more complicated the expression of PX(B) in terms of FX(x). • Lucky for us, the events that we typically want the probability of are not complicated at all. • In fact, most of the time we are interested in events of the form B = {k} and B = (−∞, k]. For Example: For experiment a fair coin is flipped once and the outcome recorded. Denote by X the random variable equal to the number of heads. It follows that FX(x) = 0 for x < 0, 1/2 for 0 ≤ x < 1, and 1 for x ≥ 1.
• A cumulative distribution function FX satisfies the following properties:
1. 0 ≤ FX(x) ≤ 1 for all x ∈ R,
2. FX is non-decreasing,
3. lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1,
4. FX is right-continuous.
For the last property: if x1 > x2 > · · · is any sequence approaching x then 1. (−∞, x1] ⊃ (−∞, x2] ⊃ · · · and 2. ∩_{n=1}^∞ (−∞, xn] = (−∞, x]. It follows by the continuity of probability that

    lim_{n→∞} FX(xn) = lim_{n→∞} PX((−∞, xn]) = PX((−∞, x]) = FX(x).
• If you are specifying the distribution of a random number valued quantity using a function F satisfying the properties above, you may want assurance that there exists a probability space and a random variable X on it whose distribution {PX(B) : B ∈ B} has cumulative distribution function equal to F. • Rest assured, here is the theorem. • Theorem: If a function F : R → R satisfies properties 1-4 above, then there exists a unique distribution {PX(B) : B ∈ B} with c.d.f equal to F. • We will probably never make use of this last theorem but we mention it for completeness.
2.4 Discrete and Absolutely Continuous Distributions
• There are a couple of situations where it’s even easier to keep track of all those probabilities to completely describe the distribution of a random variable. • Definition: A distribution, {PX(B) : B ∈ B}, is discrete if there is a finite or countably infinite sequence x1, x2, x3, . . . of distinct real numbers and a corresponding sequence p1, p2, p3, . . . of non-negative real numbers such that

    PX({xk}) = pk for all k and Σ_k pk = 1.

Visually:
• There are more B ∈ B than just the singletons {x} but . . .
• Theorem: Those pk for k ∈ N together with the formula

    PX(B) = Σ_{xk ∈ B} PX({xk})  for B ∈ B

completely describe {PX(B) : B ∈ B}. For Example: If X has discrete distribution {PX(B) : B ∈ B} and B ∈ B is any event then

    PX(B) = PX((B ∩ supp(X)) ∪ (B ∩ supp(X)^C)) = PX(B ∩ supp(X)) + PX(B ∩ supp(X)^C) = Σ_{xk ∈ B} PX({xk}) = Σ_{xk ∈ B} pk,

the second term vanishing since PX(supp(X)^C) = 1 − Σ_k pk = 0.
• The importance of those singleton events (for discrete distributions) and their probabilities motivate the following definition: • Definition: For a discrete distribution PX, its probability mass function, pX : R → [0, 1], is defined by pX(x) = PX({x}). For Example: For experiment a biased coin is flipped repeatedly and the outcomes recorded. The coin is such that the probability of getting a head on any particular flip is p. Denote by X the random variable equal to the number of heads in the first n flips. Each arrangement of k heads and n − k tails has probability p^k (1 − p)^{n−k}, and there are (n choose k) of them, so pX(k) = (n choose k) p^k (1 − p)^{n−k} for k = 0, 1, . . . , n; we write X ∼ Binomial(n, p).
For Example: Flip a coin repeatedly. The coin is such that P({H}) = p. Denote by X the random variable equal to the flip on which we get the first head. The first head on flip k requires k − 1 tails followed by a head, so independence gives pX(k) = (1 − p)^{k−1} p for k = 1, 2, . . .; we say X has the geometric distribution with parameter p.
For Example: Flip a coin repeatedly. The coin is such that P({H}) = p. Denote by X the random variable equal to the number of heads in the first n flips. We’ve already seen that X ∼ Binomial(n, p) and

    pX(k) = (n choose k) p^k (1 − p)^{n−k}.
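The binomial p.m.f is easy to tabulate with math.comb; a minimal sketch (parameter values are ours):

    from math import comb

    def binom_pmf(k, n, p):
        # p_X(k) = C(n, k) p^k (1-p)^(n-k)
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    n, p = 10, 0.3
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    print(sum(pmf))               # 1.0 up to rounding: the p_k sum to one
    print(binom_pmf(3, n, p))     # ≈ 0.2668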
• There is another situation where you don’t have to keep track of the probabilities of every subset of the real numbers to completely describe the distribution of a random variable. • Definition: A distribution, PX, is continuous if PX({x}) = 0 for all x ∈ R. For Example: Consider P, the uniform probability measure on S = [0, 1], and where X(s) = s for s ∈ S. Then PX({x}) = P({x}) = 0 for every x, so PX is continuous.
• Unlike discrete distributions, this definition doesn’t suggest a procedure (in general) for finding PX(B) for B ∈ B. • It is generally harder to keep track of all PX(B), B ∈ B, for continuous distributions than for discrete distributions. Although . . . • Definition: Let f : R → R; then f is a density function if

    f(x) ≥ 0 for all x ∈ R and ∫_{−∞}^{∞} f(x) dx = 1.
• Definition: A distribution, PX, is absolutely continuous (stronger than continuous) if there is a density function fX such that

    PX([a, b]) = ∫_a^b fX(x) dx whenever a ≤ b.
For Example: Consider P, the uniform probability measure on S = [0, 1], and where X(s) = s for s ∈ S. Here fX(x) = 1 for 0 ≤ x ≤ 1 (and 0 otherwise) works: PX([a, b]) = b − a = ∫_a^b 1 dx for 0 ≤ a ≤ b ≤ 1.
• There are more B ∈ B than just the closed intervals [a, b] but . . . • Theorem: (For absolutely continuous distributions) That fX (if it exists) together with the formula

    PX(B) = ∫_B fX(x) dx for B ∈ B

completely describes {PX(B) : B ∈ B}. • We make the obvious connection between probability and area under the p.d.f:
• We will on occasion make use of the facts that
– If PX is the distribution of a discrete random variable with probability mass function pX, then the connection between pX and FX is described by pX(x) = FX(x) − FX(x⁻).
– If PX is the distribution of an absolutely continuous random variable with probability density function fX, then the connection between fX and FX is described by fX(x) = dFX(x)/dx.
– The proofs of these are left as exercises.
2.5 The Normal Distribution
• There is one absolutely continuous distribution that must be singled out as possibly the most important and/or pervasive in all of probability theory: the normal distribution. • Definition: Any absolutely continuous random variable X with probability density function given by

    fX(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²} for x ∈ R,

where µ ∈ R and σ > 0, is said to have the normal distribution and we write X ∼ N(µ, σ). • The probability density function for the normal distribution has the familiar bell shaped graph:
• µ is a location parameter and σ is a spread parameter. • Note that ((x − µ)/σ)² = (x − µ)(σ²)⁻¹(x − µ) is the squared distance from x to µ in σ units. • If a random variable X has the normal distribution with parameters µ and σ, then we’ll see later that µ has the interpretation of being the average value of X and σ has the interpretation of being (kind of) the average deviation of X from µ. • If X ∼ N(µ, σ) then you can say to yourself: X is on average µ, but not all values of X are µ, and the number that describes the variability in X is σ. For Example: The length of the humerus bone in sparrows has approximately the normal distribution with parameters µ = 1.5 cm and σ = 0.25 cm.
We can infer that the average humerus bone length is 1.5 cm, but not all humerus bone lengths are 1.5 cm, and the number that describes the variability in humerus bone lengths is 0.25 cm. In fact, approximately 68 percent of humerus bone lengths are between 1.25 and 1.75 cm and almost all of them are between 0.75 and 2.25 cm. • When µ = 0 and σ = 1 it is called the standard normal distribution. • The normal distribution shows up in many places: – As the limit of the binomial distribution. – Empirically everywhere. – The centerpiece of the central limit theorem (one of the most important theorems in probability theory and in the practice of statistics). • There is a normal distribution for every pair (µ, σ), but there is a theorem that makes a connection between all of them. • Theorem: (Scale/translation invariance of the normal distribution)

    X ∼ N(µ, σ) if and only if Z = (X − µ)/σ ∼ N(0, 1)
Proof:
• Unfortunately there is no closed form expression for the antiderivative of the probability density function. • To calculate probabilities for normal random variables we use software or tables.
• We sometimes write Φ(k) = P (Z ≤ k) when Z ∼ N (0, 1). For Example: Suppose {Xn : n ≥ 1} is a collection of random variables such that Xn ∼ N (0, 1/n) for n ≥ 1 and consider the cumulative distribution functions Fn (x) for n ≥ 1.
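Here is a minimal sketch of the software route using Python’s statistics.NormalDist (standard library), redoing the sparrow numbers and the Φ notation; the variable names are ours:

    from statistics import NormalDist

    humerus = NormalDist(mu=1.5, sigma=0.25)        # sparrow humerus length model

    # P(1.25 < X < 1.75): within one sigma of mu, roughly 68 percent
    print(humerus.cdf(1.75) - humerus.cdf(1.25))    # ≈ 0.6827

    Z = NormalDist()                                # standard normal, Φ(k) = P(Z ≤ k)
    print(Z.cdf(1.96))                              # ≈ 0.9750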
2.6 Change of Variable.
• Now we turn our attention to the following problem: If X is a random variable with known distribution and h : R → R, then can we find the distribution of the “new” random variable Y = h(X)? For Example: For experiment you flip a coin repeatedly and the probability of heads on any particular flip is p. Denote by X the random variable equal to the flip on which you get a first head and by Y the random variable equal to min(X, 2). X has geometric distribution with parameter p; what is the distribution of Y ?
• Visually: In the discrete case:
• Discrete Change of Variable Formula: Let X be a discrete random variable with probability mass function pX. Let Y = h(X), where h : R → R is some function. Then Y is also discrete and its probability mass function pY satisfies

    pY(y) = Σ_{x ∈ h⁻¹({y})} pX(x),

where h⁻¹({y}) is the inverse image of {y}.
• This should remind you of how a random variable X : S → R induced a probability measure on B via the formula PX(B) = P(X⁻¹(B)).
For Example: For experiment you flip a coin repeatedly and the probability of heads on any particular flip is p. Denote by X the random variable equal to the flip on which you get a first head and by Y the random variable equal to min(X, 2). X has the geometric distribution with parameter p; what is the distribution of Y? supp(X) = {1, 2, 3, . . .} and Y = min(X, 2) imply that supp(Y) = {1, 2}. It follows by the discrete change of variable formula that

    PY({1}) = P(Y = 1) = P(min(X, 2) = 1) = P(X = 1) = PX({1}) = pX(1) = (1 − p)⁰p = p

and

    PY({2}) = P(Y = 2) = P(min(X, 2) = 2) = Σ_{k=2}^∞ P(X = k) = 1 − PX({1}) = 1 − pX(1) = 1 − p.
• The continuous case is similar. For Example: Suppose that X ∼ Uniform(0, 1). (I.e., X is absolutely continuous with probability density function fX(x) = 1 for 0 ≤ x ≤ 1.) Also suppose that Y = 3X. If X ∼ Uniform(0, 1) then supp(X) = [0, 1], and it follows that 0 ≤ x ≤ 1 implies 0 ≤ y/3 ≤ 1, i.e. 0 ≤ y ≤ 3, and we conclude that supp(Y) = [0, 3]. Also, if X ∼ Uniform(0, 1) then FX(x) = 0 for x < 0, x for 0 ≤ x < 1, and 1 for x ≥ 1. For y < 0, FY(y) = P(Y ≤ y) = 0 and for y ≥ 3, FY(y) = P(Y ≤ y) = 1. For 0 ≤ y < 3,

    fY(y) = d/dy FY(y) = d/dy P(Y ≤ y) = d/dy P(3X ≤ y) = d/dy P(X ≤ y/3) = d/dy FX(y/3) = fX(y/3) · d/dy (y/3) = 1/3,

and we conclude that Y ∼ Uniform(0, 3).
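A simulation agrees with the change-of-variable conclusion; a quick sketch comparing the empirical c.d.f of Y = 3X with the Uniform(0, 3) c.d.f y/3:

    import random

    random.seed(3)
    ys = [3 * random.random() for _ in range(100_000)]  # Y = 3X with X ~ Uniform(0, 1)

    for y in (0.5, 1.5, 2.5):
        empirical = sum(v <= y for v in ys) / len(ys)
        print(y, empirical, y / 3)  # empirical F_Y(y) ≈ y/3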
• I would describe this as the distribution function technique. • In the distribution function technique you find an expression for the cumulative distribution function by a direct calculation. For Example: For experiment, a point is picked uniformly at random from the perimeter of a unit circle. Denote by X the angle that a line segment from the origin to the point makes with the x-axis. Find the distribution of the random variable Y equal to the x co-ordinate of the point. We gather from the information in the question that X ∼ Uniform(0, 2π) and Y = cos(X). Moreover, we know that if X ∼ Uniform(0, 2π) then FX(x) = 0 for x < 0, x/(2π) for 0 ≤ x < 2π, and 1 for x ≥ 2π. It follows, for −1 ≤ y < 1, that

    FY(y) = P(cos(X) ≤ y) = P(arccos(y) ≤ X ≤ 2π − arccos(y)) = (2π − 2 arccos(y))/(2π) = 1 − arccos(y)/π,

with FY(y) = 0 for y < −1 and FY(y) = 1 for y ≥ 1.
• For a special case of the distribution function technique check out the following theorem: • Continuous Change of Variable Formula: Let X be an absolutely continuous random variable with probability density function fX. Let Y = h(X), where h : R → R is a function that is 1. differentiable and 2. strictly increasing or strictly decreasing. Then Y is also absolutely continuous and its probability density function fY is given by

    fY(y) = fX(h⁻¹(y)) |(h⁻¹)′(y)|,

where h⁻¹ is the inverse of h and (h⁻¹)′ is its derivative. • Note that if fX(x) = 0 in an interval of x values, then it does not matter how h behaves in that interval. Proof: Suppose X has p.d.f fX and Y = h(X) where 1. h is differentiable and 2. h is strictly decreasing. Note that if h is decreasing then h⁻¹ is too. We are required to find a density for Y; a function fY that can be used for the purposes of calculating P(a ≤ Y ≤ b).

    P(a ≤ Y ≤ b) = P(h⁻¹(b) ≤ X ≤ h⁻¹(a))
                 = ∫_{h⁻¹(b)}^{h⁻¹(a)} fX(x) dx          (change of variable y = h(x))
                 = ∫_b^a fX(h⁻¹(y)) (h⁻¹)′(y) dy
                 = −∫_a^b fX(h⁻¹(y)) (h⁻¹)′(y) dy
                 = ∫_a^b fX(h⁻¹(y)) |(h⁻¹)′(y)| dy       (since (h⁻¹)′ < 0)
For Example: Suppose that X ∼ N(µ, σ) and Y = aX + b where a ≠ 0. Find the p.d.f of Y. If y = h(x) = ax + b then x = h⁻¹(y) = (y − b)/a and (h⁻¹)′(y) = 1/a. By the continuous change of variable formula it follows that

    fY(y) = fX(h⁻¹(y)) |(h⁻¹)′(y)| = (1/|a|) (1/(σ√(2π))) e^{−(1/2)(((y−b)/a − µ)/σ)²} = (1/(|a|σ√(2π))) e^{−(1/2)((y − (aµ+b))/(|a|σ))²}

for −∞ < y < ∞, which we recognize as the p.d.f for a random variable with N(aµ + b, |a|σ) distribution.
3. Random Vectors and Joint Distributions.
3.1 Random Vectors. • So far we’ve only considered one random variable at a time. • Now we are going to consider more than one (but at most a finite collection of) random variables. • Suppose X1 , X2 , . . . , Xn are n random variables. Viewed as a random vector, (X1 , . . . , Xn )T : S → Rn . For Example: For experiment two fair 4 sided dice are rolled and the outcomes recorded. The sample space is S = {(1, 1), (1, 2), . . . , (4, 4)}. Denote by X the random variable equal to the sum of the face values and by Y the random variable equal to the size of the difference. For example (X, Y )T ((1, 2)) = (X((1, 2)), Y ((1, 2)))T = (3, 1)T . Note (X, Y )T : S → R2 not (X, Y )T : S2 → R2 .
• Most of the time, in practice, we want to describe probabilistically the relationship between the variables. For Example: Suppose X is the random variable equal to the exiting high school grade point average of a UTSC student and Y is the random variable equal to their undergraduate cumulative grade point average.
For Example: Suppose X1 , X2 , . . . , XN are the random variables equal to the lifetimes of UTSC faculty.
3.2 Joint Distributions
• Notation: We denote by B n the set of all subsets of Rn . • Important: If (S, S, P ) is a probability space and (X1 , X2 , . . . , Xn )T is a random vector on that space, then (X1 , X2 , . . . , Xn )T induces a probability measure on (Rn , B n ), denoted by PX1 ,...,Xn , in the following way: if B ∈ Bn then PX1 ,...,Xn (B) = P ({s ∈ S : (X1 (s), X2 (s), . . . , Xn (s))T ∈ B}) = P (((X1 , X2 , . . . , Xn )T )−1 (B)) • Definition: Let (X1 , X2 , . . . , Xn )T be a random vector. Then its joint distribution is the collection of probabilities {PX1 ,...,Xn (B) : B ∈ Bn }
3.3 Joint Cumulative Distribution Functions.
• As with the distribution of one random variable, the joint distribution of n random variables has a lot of probabilities to keep track of. • Lucky for us we have the following important definition and theorem, analogous to those for one random variable. • Sets B ∈ Bⁿ of the form (−∞, x1] × (−∞, x2] × · · · × (−∞, xn] play an important role in probability theory. • Recall that A × B = {(x, y) : x ∈ A and y ∈ B}. • We call sets of the form (−∞, x1] × (−∞, x2] × · · · × (−∞, xn] n-dimensional cuts of Rⁿ. • Definition: Let (X1, X2, . . . , Xn)T be a random vector. Then its joint cumulative distribution function is the function FX1,...,Xn : Rⁿ → [0, 1], defined by FX1,...,Xn(x1, . . . , xn) = PX1,...,Xn((−∞, x1] × · · · × (−∞, xn]) = P(X1 ≤ x1, . . . , Xn ≤ xn). • There are way more B ∈ Bⁿ than B of the form (−∞, x1] × · · · × (−∞, xn] but . . . • Theorem: Let (X1, X2, . . . , Xn)T be a random vector with joint cumulative distribution function FX1,...,Xn. If B ∈ Bⁿ then PX1,...,Xn(B) can be determined solely from the values of FX1,...,Xn(x1, . . . , xn). • I.e. the joint c.d.f completely describes the joint distribution of (X1, X2, . . . , Xn)T. • One limitation with the theory of random variables that we’ve seen so far is that the probability distributions of the variables individually say nothing about the relationships between the variables.
For Example: Suppose X ∼ Bernoulli(1/2), Y1 = X and Y2 = 1 − X. It follows that Y1 and Y2 are equal in distribution but
P (X ≤ 0, Y1 ≤ 0) = 1/2 and P (X ≤ 0, Y2 ≤ 0) = 0 and we conclude that (X, Y1 )T and (X, Y2 )T have different joint distributions.
• The joint cumulative distribution function describes probabilistically the relationship between the variables and the variables individually. • It has properties analogous to those of the c.d.f of a single random variable.
• We can recover the c.d.f of any of the individual random variables in the collection. • Theorem: Let (X1, X2, . . . , Xn)T be a random vector with joint cumulative distribution function FX1,...,Xn. The cumulative distribution function FXk of Xk satisfies

    FXk(xk) = lim_{xi→∞, ∀i≠k} FX1,...,Xn(x1, . . . , xn).

Proof: Note that {s ∈ S : Xk ≤ xk} = {s ∈ S : Xk ≤ xk, Xi < ∞, ∀i ≠ k}, and that if ni1 < ni2 < · · · are any sequences of numbers approaching ∞ for i ≠ k, then the events

    Aj = {s ∈ S : Xk ≤ xk, Xi ≤ nij, ∀i ≠ k} for j = 1, 2, . . .

satisfy

    A1 ⊂ A2 ⊂ · · · and ∪_{j=1}^∞ Aj = {s ∈ S : Xk ≤ xk, Xi < ∞, ∀i ≠ k}.

It follows by the continuity of probability that

    FXk(xk) = P(Xk ≤ xk) = P(Xk ≤ xk, Xi < ∞, ∀i ≠ k) = lim_{j→∞} P(Xk ≤ xk, Xi ≤ nij, ∀i ≠ k) = lim_{xi→∞, ∀i≠k} FX1,...,Xn(x1, . . . , xn).
For Example: Suppose that (X, Y)T is a random vector with joint cumulative distribution function given by

    FX,Y(x, y) = 0 for x < 0 or y < 0;  xy² for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1;  x for 0 ≤ x ≤ 1, y ≥ 1;  y² for x ≥ 1, 0 ≤ y ≤ 1;  1 for x > 1 and y > 1.

We can retrieve the cumulative distribution function of X:

    FX(x) = lim_{y→∞} FX,Y(x, y) = lim_{y→∞} x = x

for 0 ≤ x ≤ 1, and FX(x) = 0 for x < 0 and FX(x) = 1 for x > 1. • Terminology: The distribution of a single random variable amongst several in a random vector is called its marginal distribution.
3.4 Jointly Discrete and Jointly Absolutely Continuous Distributions.
• Joint cumulative distribution functions are not in general easy to work with. • We will almost always consider the case when (X1, X2, . . . , Xn)T are jointly discrete or jointly absolutely continuous. • Definition: Let (X1, X2, . . . , Xn)T be jointly discrete random variables (i.e. the support is at most a countable subset of Rⁿ, n-tuples in the support have positive probability and those probabilities sum to one). Then their joint probability mass function, pX1,...,Xn : Rⁿ → [0, 1], is defined by pX1,...,Xn(x1, . . . , xn) = PX1,...,Xn({x1} × · · · × {xn}) = P(X1 = x1, . . . , Xn = xn). • Note that there are way more sets in Bⁿ than of the form {x1} × {x2} × · · · × {xn} but . . . • Theorem: pX1,...,Xn completely describes the joint distribution of (X1, X2, . . . , Xn)T. For Example: For B ∈ Bⁿ,

    PX1,...,Xn(B) = Σ_{(x1,...,xn) ∈ B ∩ supp((X1,...,Xn)T)} pX1,...,Xn(x1, . . . , xn).

For Example: For experiment two fair 4-sided dice are rolled. The sample space is S = {(1, 1), (1, 2), . . . , (4, 4)}. Denote by X the random variable equal to the sum of the faces, and by Y the size of the difference between the faces. Note that supp(X) = {2, 3, . . . , 8}, supp(Y) = {0, 1, 2, 3} but

    supp((X, Y)T) = {(2, 0), (3, 1), (4, 2), (4, 0), (5, 3), (5, 1), (6, 2), (6, 0), (7, 1), (8, 0)} ≠ {2, 3, . . . , 8} × {0, 1, 2, 3} = supp(X) × supp(Y),

and for example pX,Y(2, 0) = P(X = 2, Y = 0) = P({s ∈ S : (X, Y)T(s) = (2, 0)T}) = P({(1, 1)}) = 1/16.
• We can recover the p.m.f of any of the individual random variables in the collection. • Theorem: Let (X1, X2, . . . , Xn)T be a random vector with joint probability mass function pX1,...,Xn. The probability mass function pXk of Xk satisfies

    pXk(xk) = Σ_{xi, i≠k} pX1,...,Xn(x1, . . . , xn).

For Example: Suppose (X, Y)T are jointly discrete with joint p.m.f given by

              Y = 1    Y = 2    Y = 3
    X = 1      1/6       0       2/6
    X = 2       0       1/6      2/6

We can recover the marginal p.m.f of Y by summing over the possible values of X:

    pY(1) = pX,Y(1, 1) + pX,Y(2, 1) = 1/6 + 0 = 1/6 and
    pY(2) = pX,Y(1, 2) + pX,Y(2, 2) = 0 + 1/6 = 1/6 and
    pY(3) = pX,Y(1, 3) + pX,Y(2, 3) = 2/6 + 2/6 = 4/6.
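Summing rows or columns of a joint p.m.f table is a one-liner; a sketch using the table above (dictionary keys are our (x, y) pairs):

    from fractions import Fraction as F

    p_xy = {(1, 1): F(1, 6), (1, 2): F(0), (1, 3): F(2, 6),
            (2, 1): F(0),    (2, 2): F(1, 6), (2, 3): F(2, 6)}

    # marginal of Y: p_Y(y) = sum over x of p_{X,Y}(x, y)
    p_y = {y: sum(p for (x, v), p in p_xy.items() if v == y) for y in (1, 2, 3)}
    print(p_y)  # {1: 1/6, 2: 1/6, 3: 2/3}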
• Toward working with several absolutely continuous random variables, here is an important definition: • Definition: Let f : Rⁿ → R be a function. Then f is a joint probability density function if

    f(x1, . . . , xn) ≥ 0 for all (x1, x2, . . . , xn) ∈ Rⁿ and ∫_{Rⁿ} f(x1, . . . , xn) dx1 · · · dxn = 1.

• This definition is completely analogous to the single variable p.d.f.
• Visually:
• The height of fX,Y at (x, y) gives how dense in probability the plane is at (x, y). • We make a connection between volume under the joint p.d.f and probability.
• Definition: If (X1, X2, . . . , Xn)T is a random vector then it is jointly absolutely continuous if there exists a joint probability density function fX1,...,Xn such that

    PX1,...,Xn([a1, b1] × · · · × [an, bn]) = ∫_{a1}^{b1} · · · ∫_{an}^{bn} fX1,...,Xn dx1 · · · dxn

for all ai ≤ bi, i = 1, 2, . . . , n. • There are more sets in Bⁿ than sets of the form [a1, b1] × · · · × [an, bn] but . . . • Theorem: When fX1,...,Xn exists, it completely describes the joint distribution of (X1, X2, . . . , Xn)T. For Example: For B ∈ Bⁿ,

    PX1,...,Xn(B) = ∫ · · · ∫_B fX1,...,Xn(x1, . . . , xn) dx1 · · · dxn.
For Example: Suppose that X and Y are jointly absolutely continuous with joint probability density function given by

    fX,Y(x, y) = 120x³y for x ≥ 0, y ≥ 0 and x + y ≤ 1; 0 otherwise.

Visually:
For example we can find the probability that X < Y. P(X < Y) equals the volume under fX,Y and above the region x < y intersected with supp((X, Y)T):

    P(X < Y) = ∫∫_{x<y} fX,Y(x, y) dx dy = ∫_0^{1/2} ∫_x^{1−x} 120x³y dy dx = ∫_0^{1/2} 60x³(1 − 2x) dx = 3/16.

3.5 The Multivariate Normal Distribution.
• Definition: A random vector (X1, . . . , Xn)T is said to have the multivariate normal distribution with parameters µ⃗ = (µ1, . . . , µn)T ∈ Rⁿ and Σ, an n × n symmetric positive definite matrix, if it is jointly absolutely continuous with joint probability density function

    f(x⃗) = (2π)^{−n/2} (det Σ)^{−1/2} exp( −(1/2)(x⃗ − µ⃗)T Σ⁻¹ (x⃗ − µ⃗) ) for x⃗ ∈ Rⁿ.

• Positive definite means that a⃗T Σ a⃗ > 0 for all vectors a⃗ ≠ 0. • Note that (x⃗ − µ⃗)T Σ⁻¹ (x⃗ − µ⃗) is the squared distance of x⃗ to µ⃗ in Σ units. • Notation: When n = 2 and for the random vector (X, Y)T ∼ N2((µX, µY)T, [σX², σXY; σXY, σY²]) we will often write ρ = σXY/(σX σY).
For Example: Denote by (X, Y)T the random vector where X is the weight of a B52 student in pounds and Y equals the height of the same B52 student in feet. Suppose that

    (X, Y)T ∼ N2((µX, µY)T, [σX², σXY; σXY, σY²])

and let ρ = σXY/(σX σY).
A well known result in linear algebra implies

    Σ⁻¹ = (1/(σX²σY² − σXY²)) [σY², −σXY; −σXY, σX²].

By substitution and some algebra it follows that

    Σ⁻¹ = (1/(σX²σY²(1 − ρ²))) [σY², −ρσXσY; −ρσXσY, σX²].

Then

    (x⃗ − µ⃗)T Σ⁻¹ (x⃗ − µ⃗) = (x − µX, y − µY) (1/(σX²σY²(1 − ρ²))) [σY², −ρσXσY; −ρσXσY, σX²] (x − µX, y − µY)T
    = (σY²(x − µX)² − 2ρσXσY(x − µX)(y − µY) + σX²(y − µY)²) / (σX²σY²(1 − ρ²))
    = (1/(1 − ρ²)) [ ((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ].

Putting all of this together it follows that

    fX,Y(x, y) = (1/(2πσXσY√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ ((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ] }.
• Visually:
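The bivariate density formula above translates directly into code; a minimal sketch (the function name is ours):

    from math import pi, sqrt, exp

    def bvn_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
        # bivariate normal density in the (mu_x, mu_y, sigma_x, sigma_y, rho) parameterization
        zx = (x - mu_x) / sigma_x
        zy = (y - mu_y) / sigma_y
        q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
        return exp(-q / 2) / (2 * pi * sigma_x * sigma_y * sqrt(1 - rho * rho))

    print(bvn_pdf(0, 0, 0, 0, 1, 1, 0))  # 1/(2π) ≈ 0.1592, the standard case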
• We can investigate those ordered pairs in the support of (X, Y)T that have equal density in probability. • For example, and for fixed C > 0,

    C = (1/(2πσXσY√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ ((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ] }

if and only if

    ((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² = −2(1 − ρ²) log( C · 2πσXσY√(1 − ρ²) ),

which is the equation of an ellipse. • In the situation where ρ = σXY/(σXσY) = 0, the axes of each ellipse of constant density are in the same direction as the co-ordinate axes. • In the situation where ρ = σXY/(σXσY) ≠ 0, the axes of each ellipse of constant density are in the direction of the eigenvectors of Σ⁻¹. • In the situation where ρ = σXY/(σXσY) ≠ 0 we conclude that σXY is the parameter that determines the orientation of the ellipse. • To describe the orientation of the ellipse using σXY we note that probability density is concentrated along the line

    y − µY = (σXY/σX²)(x − µX).
• There’s a sometimes preferable definition of the multivariate normal distribution. • Definition: A random vector (X1, X2, . . . , Xn)T is said to have the multivariate normal distribution with parameters

    µ⃗ = (µ1, . . . , µn)T and Σ = [σ1², σ12, . . . , σ1n; σ21, σ2², . . . , σ2n; . . . ; σn1, σn2, . . . , σn²]

if and only if

    a1X1 + a2X2 + · · · + anXn ∼ N(µ⃗T a⃗, a⃗T Σ a⃗)

for every vector a⃗T = (a1, a2, . . . , an) ≠ 0.
• It follows by the assumption of positive definiteness that the spread parameter is bigger than zero. • We won’t prove that the two definitions of the multivariate normal distribution are equivalent; the proof is left as an exercise. • From this definition it follows immediately that if (X1, X2, . . . , Xn)T is multivariate normal then every subcollection of the variables is multivariate normal. For Example: Suppose that

    (X, Y)T ∼ N2((µX, µY)T, [σX², σXY; σXY, σY²])

and let ρ = σXY/(σXσY). We can recover the marginal distribution of Y. It follows by the definition that

    a1X + a2Y ∼ N(a1µX + a2µY, a1²σX² + 2a1a2σXY + a2²σY²)

for all choices of a1 and a2. Consider a1 = 0 and a2 = 1; it follows that Y ∼ N(µY, σY²).
3.6 Conditional Distributions
• Recall that for random vector (X1 , X2 , . . . , Xn )T the joint cumulative distribution function FX1 ,...,Xn completely describes the joint distribution of the variables and is given by:
FX1 ,...,Xn (x1 , . . . , xn ) = PX1 ,...,Xn ((−∞, x1 ] × · · · × (−∞, xn ]) = P (X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn )
• Recall also that for events A1, A2, . . . , An the repeated application of the multiplication rule gives: P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1)P(A2|A1)P(A3|A1 ∩ A2) · · · P(An|A1 ∩ · · · ∩ An−1). • These motivate the following calculation/example: For Example: When n = 2, consider X and Y with joint c.d.f FX,Y, and let A = (−∞, x] × (−∞, ∞) and B = (−∞, ∞) × (−∞, y]. It follows that

    FX,Y(x, y) = PX,Y((−∞, x] × (−∞, y])
    = PX,Y( (−∞, x] × (−∞, ∞) ∩ (−∞, ∞) × (−∞, y] )
    = PX,Y((−∞, x] × (−∞, ∞)) PX,Y( (−∞, ∞) × (−∞, y] | (−∞, x] × (−∞, ∞) )
    = PX((−∞, x]) PX,Y( (−∞, ∞) × (−∞, y] | (−∞, x] × (−∞, ∞) )
    = P(X ≤ x) P(Y ≤ y | X ≤ x)
    = FX(x) P(Y ≤ y | X ≤ x).
• This suggests that we might be able to describe the joint distribution of a random vector with conditional probability measures between the variables in the vector. • Definition: Let X, X1, X2, . . . , Xn be random variables and suppose that PX1,...,Xn({x1} × · · · × {xn}) > 0. If B ∈ B then the conditional probability that X ∈ B given that X1 = x1, . . . , Xn = xn is denoted by PX|X1,...,Xn(B | {x1} × · · · × {xn}) and is defined to be

    PX|X1,...,Xn(B | {x1} × · · · × {xn}) = PX,X1,...,Xn(B × {x1} × · · · × {xn}) / PX1,...,Xn({x1} × · · · × {xn}).

• Note that PX|X1,...,Xn : B → [0, 1] and satisfies all the axioms required to make it a probability measure. • Definition: The conditional distribution of X given that X1 = x1, . . . , Xn = xn is the collection of probabilities

    {PX|X1,...,Xn(B | {x1} × · · · × {xn}) : B ∈ B}.
• As always, we comment that: that’s a lot of probabilities to keep track of. • Also as always we will look at 2 special cases: discrete collections of random variables and absolutely continuous collections of random variables. • Definition: Suppose that X, X1, X2, . . . , Xn are discrete random variables. Then the conditional probability mass function of X given that X1 = x1, . . . , Xn = xn is the function pX|X1,...,Xn defined by

    pX|X1,...,Xn(x | x1, . . . , xn) = pX,X1,...,Xn(x, x1, . . . , xn) / pX1,...,Xn(x1, . . . , xn),

provided pX1,...,Xn(x1, . . . , xn) > 0, and where pX,X1,...,Xn and pX1,...,Xn are the joint probability mass functions of X, X1, . . . , Xn and X1, . . . , Xn respectively. • Important: To reinforce what has been said and to step back to see the big picture: for X1, X2, . . . , Xn jointly discrete random variables, several applications of the multiplication rule gives

    pX1,...,Xn(x1, . . . , xn) = pX1(x1) pX2|X1(x2|x1) · · · pXn|X1,...,Xn−1(xn | x1, . . . , xn−1),

so these conditional probability mass functions completely describe the joint distribution of the random variables X1, . . . , Xn. For Example: Suppose that (X, Y)T are jointly discrete random variables with joint probability mass function given in the table below.
            X
         1    2    3
    -1  2/9  1/9   0
 Y   0  1/9  1/9  1/9
     1   0   1/9  2/9
We can find the conditional distribution of Y when X = 1. First we find that P(X = 1) = P(X = 1, Y = −1) + P(X = 1, Y = 0) + P(X = 1, Y = 1) = 2/9 + 1/9 + 0 = 3/9. It then follows by definition that
    pY|X(−1|1) = P(X = 1, Y = −1)/P(X = 1) = (2/9)/(3/9) = 2/3,
    pY|X(0|1)  = P(X = 1, Y = 0)/P(X = 1)  = (1/9)/(3/9) = 1/3, and
    pY|X(1|1)  = P(X = 1, Y = 1)/P(X = 1)  = (0/9)/(3/9) = 0/3.
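• Aside (a sketch added to these notes, assuming numpy): the conditional p.m.f. above is just a renormalized column of the joint table.

    import numpy as np

    # rows are y = -1, 0, 1; columns are x = 1, 2, 3 (the table above)
    joint = np.array([[2/9, 1/9, 0.0],
                      [1/9, 1/9, 1/9],
                      [0.0, 1/9, 2/9]])

    p_x1 = joint[:, 0].sum()       # P(X = 1) = 3/9
    cond = joint[:, 0] / p_x1      # p_{Y|X}(y | 1) for y = -1, 0, 1
    print(cond)                    # [0.666..., 0.333..., 0.]
    print(cond.sum())              # 1.0, as a p.m.f. must satisfy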
• It probably goes without saying, but pY|X is a p.m.f. (you can verify this as an exercise) so it must satisfy:
    (i) 0 ≤ pY|X ≤ 1 and
    (ii) Σk pY|X(k|x) = 1.
• The absolutely continuous case is similar but with the obvious calculus-flavored twist. Here is an example to motivate: For Example: Suppose that (X, Y)T is jointly absolutely continuous with joint probability density function given by:
    fX,Y(x, y) = 4x²y + 2y⁵ for 0 < x < 1 and 0 < y < 1.
Conditional on X = 8/10, what is the probability that 2/10 < Y < 3/10?
A naive first attempt might go like this:
    P(2/10 < Y < 3/10 | X = 8/10) = P(X = 8/10, 2/10 < Y < 3/10) / P(X = 8/10) = 0/0.
Our experience with calculus suggests that we try to approximate first
    P(2/10 < Y < 3/10 | X = 8/10) ≈ P(2/10 < Y < 3/10 | 8/10 − ε < X < 8/10 + ε)
                                  = P(8/10 − ε < X < 8/10 + ε, 2/10 < Y < 3/10) / P(8/10 − ε < X < 8/10 + ε)
and then take limits as ε ↓ 0. It may play out something like this:
    P(a < Y < b | x − ε < X < x + ε)
        = P(x − ε < X < x + ε, a < Y < b) / P(x − ε < X < x + ε)
        = [ ∫_a^b ∫_{x−ε}^{x+ε} fXY(u, v) du dv ] / [ ∫_{x−ε}^{x+ε} fX(u) du ]
        ≈ [ ∫_a^b ∫_{x−ε}^{x+ε} fXY(x, v) du dv ] / [ ∫_{x−ε}^{x+ε} fX(x) du ]
        = [ ∫_a^b 2ε fXY(x, v) dv ] / [ 2ε fX(x) ]
        = ∫_a^b [ fXY(x, v) / fX(x) ] dv
The approximation step could be made rigorous with an assumption like continuity of fXY and fX. If fXY(x, v)/fX(x), as a function of v, satisfied the requirements of a p.d.f., then we’d have every right to call it something like the conditional probability density function of Y given that X = x.
• In fact, check out the following definition: • Definition: Suppose that X, X1, X2, ..., Xn are jointly absolutely continuous random variables. Then the conditional probability density function of X given that X1 = x1, ..., Xn = xn is the function fX|X1,...,Xn defined by
    fX|X1,...,Xn(x|x1, ..., xn) = fX,X1,...,Xn(x, x1, ..., xn) / fX1,...,Xn(x1, ..., xn)
provided fX1,...,Xn(x1, ..., xn) > 0, and where fX,X1,...,Xn and fX1,...,Xn are the joint probability density functions of X, X1, ..., Xn and X1, ..., Xn respectively.
• Important: To reinforce what has been said and to step back to see the big picture: for X1, X2, ..., Xn jointly absolutely continuous random variables, several applications of the construction/definition above give
    fX1,...,Xn(x1, ..., xn) = fX1(x1)fX2|X1(x2|x1) ··· fXn|X1,...,Xn−1(xn|x1, ..., xn−1)
so these conditional probability density functions completely describe the joint distribution of the random variables X1, ..., Xn. For Example: Returning to the example above, we can find PY|X(2/10 < Y < 3/10 | X = 8/10). First we find the marginal p.d.f. of X as follows:
    fX(x) = ∫_0^1 (4x²y + 2y⁵) dy = 2x² + 1/3 for 0 < x < 1.
It follows by the definition that
    fY|X(y|8/10) = fXY(8/10, y) / fX(8/10) = (4(8/10)²y + 2y⁵) / (2(8/10)² + 1/3) for 0 < y < 1.
Lastly we calculate
    PY|X(2/10 < Y < 3/10 | X = 8/10) = ∫_{2/10}^{3/10} fY|X(y|8/10) dy
                                     = ∫_{2/10}^{3/10} (4(8/10)²y + 2y⁵) / (2(8/10)² + 1/3) dy
                                     ≈ 0.0398.
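• Aside (a numerical check added to these notes; only numpy is assumed): a crude midpoint Riemann sum reproduces the value above.

    import numpy as np

    x = 8/10
    f_X = 2 * x**2 + 1/3                        # marginal p.d.f. of X at x = 8/10

    m = 100_000                                 # number of midpoint cells
    dy = (3/10 - 2/10) / m
    ys = 2/10 + dy * (np.arange(m) + 0.5)       # midpoints of [2/10, 3/10]
    f_cond = (4 * x**2 * ys + 2 * ys**5) / f_X  # conditional p.d.f. of Y given X = 8/10
    print(f_cond.sum() * dy)                    # about 0.0398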
• With the definition of conditional distributions (and in particular conditional probability density functions) introduced, we can revisit the bivariate normal example. For Example: Suppose that (X, Y) ∼ N2((µX, µY)T, [σX²  σXY; σXY  σY²]) and let ρ = σXY/(σXσY).
It follows by the second definition of the bivariate normal distribution that X ∼ N(µX, σX²). We’ve just seen that if X and Y are jointly absolutely continuous with joint p.d.f. fX,Y then the definition of fY|X implies fX,Y(x, y) = fX(x)fY|X(y|x). Factoring out the marginal p.d.f. of X gives a remarkable result: when (X, Y) ∼ N2((µX, µY)T, [σX²  σXY; σXY  σY²]),
    Y | X = x ∼ N( µY + σYρ(x − µX)/σX , σY²(1 − ρ²) ).
Important: It is worth noting that ρ = 0 if and only if σXY = 0, and the interpretation of σXY as the number describing the relationship between X and Y is now explicit.
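• Aside (a simulation sketch added to these notes, assuming numpy; the parameters are arbitrary): conditioning on X landing in a thin slab around x0 is a crude stand-in for conditioning on X = x0, and the sample mean and variance of Y in the slab match the formula above.

    import numpy as np

    rng = np.random.default_rng(1)
    mu_X, mu_Y, s_X, s_Y, s_XY = 0.0, 2.0, 1.0, 2.0, 1.2
    rho = s_XY / (s_X * s_Y)

    X, Y = rng.multivariate_normal([mu_X, mu_Y],
                                   [[s_X**2, s_XY], [s_XY, s_Y**2]],
                                   size=2_000_000).T
    x0 = 0.5
    slab = np.abs(X - x0) < 0.01               # draws with X near x0

    print(Y[slab].mean(), mu_Y + s_Y * rho * (x0 - mu_X) / s_X)  # both near 2.6
    print(Y[slab].var(), s_Y**2 * (1 - rho**2))                  # both near 2.56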
3.7 Independence
• Definition: Let (X1, X2, ..., Xn)T be a random vector. Then the random variables in the vector are independent if, for all subsets B1, B2, ..., Bn ∈ B,
    PX1,...,Xn(B1 × ··· × Bn) = PX1(B1) ··· PXn(Bn)
• I.e. the events “X1 ∈ B1,” ..., “Xn ∈ Bn” are independent events. • It is hard to verify independence by checking that the left side equals the right side for all B1, B2, ..., Bn ∈ B. • Given the importance of sets of the form (−∞, x] in probability theory, it should come as no surprise that to verify independence we only need to check this for B1, ..., Bn of the form B1 = (−∞, x1], ..., Bn = (−∞, xn]. • Theorem: Let (X1, X2, ..., Xn)T be a random vector. The random variables in the vector are independent if and only if
    PX1,...,Xn((−∞, x1] × ··· × (−∞, xn]) = PX1((−∞, x1]) ··· PXn((−∞, xn])
• We won’t prove the theorem above. • This just says that X1, X2, ..., Xn are independent if the joint cumulative distribution function satisfies
    FX1,...,Xn(x1, ..., xn) = FX1(x1) ··· FXn(xn)
• Important: It follows from the above theorem that when X1, X2, ..., Xn are a collection of independent random variables, the joint distribution of (X1, ..., Xn)T is completely described by the marginal distributions of X1, ..., Xn. • Theorem: If X1, X2, ..., Xn are a collection of independent random variables, then any finite sub-collection of them are independent. Proof: Take any sub-collection Xi1, Xi2, ..., Xik for 1 ≤ k ≤ n. It follows by independence that FX1,...,Xn(x1, ..., xn) = FX1(x1) ··· FXn(xn). Taking limits as xj → ∞ for j ∉ {i1, i2, ..., ik} it follows that
    FXi1,...,Xik(xi1, ..., xik) = lim_{xj→∞; j∉{i1,...,ik}} FX1,...,Xn(x1, ..., xn)
                                = lim_{xj→∞; j∉{i1,...,ik}} FX1(x1) ··· FXn(xn)
                                = FXi1(xi1) ··· FXik(xik)
• We are usually working with joint probability mass functions and/or joint probability density functions. • You could probably guess the following theorem: • Theorem: Let (X1, X2, ..., Xn)T be a random vector.
(a) If (X1, X2, ..., Xn)T is jointly discrete, then X1, X2, ..., Xn are independent if and only if
    pX1X2···Xn(x1, x2, ..., xn) = pX1(x1)pX2(x2) ··· pXn(xn) for all (x1, x2, ..., xn) ∈ Rⁿ.
(b) If (X1, X2, ..., Xn)T is jointly absolutely continuous, then X1, X2, ..., Xn are independent if and only if
    fX1X2···Xn(x1, x2, ..., xn) = fX1(x1)fX2(x2) ··· fXn(xn) for all (x1, x2, ..., xn) ∈ Rⁿ.
Proof: We’ll prove the second part. Recall
    fX1,...,Xn(x1, ..., xn) = ∂ⁿ/∂x1···∂xn FX1,...,Xn(x1, ..., xn)
                            = ∂ⁿ/∂x1···∂xn FX1(x1) ··· FXn(xn)
                            = dFX1(x1)/dx1 ··· dFXn(xn)/dxn
                            = fX1(x1) ··· fXn(xn)
For Example: Suppose that (X, Y)T is a jointly discrete random vector with joint probability mass function given in the table below.

            X
         1    2    3
    -1  1/9  2/9   0
 Y   0  2/9  1/9  1/9
     1   0   1/9  1/9
Are X and Y independent? Let’s just start checking.
    P(X = 1)P(Y = −1) = (1/3)(1/3) = 1/9 = P(X = 1, Y = −1)
but
    P(X = 1)P(Y = 1) = (1/3)(2/9) ≠ 0 = P(X = 1, Y = 1)
and we conclude that X and Y are not independent.
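• Aside (a sketch added to these notes, assuming numpy): checking every pair at once amounts to comparing the joint table against the outer product of its marginals.

    import numpy as np

    # rows are y = -1, 0, 1; columns are x = 1, 2, 3 (the table above)
    joint = np.array([[1/9, 2/9, 0.0],
                      [2/9, 1/9, 1/9],
                      [0.0, 1/9, 1/9]])

    p_x = joint.sum(axis=0)                          # marginal p.m.f. of X
    p_y = joint.sum(axis=1)                          # marginal p.m.f. of Y
    print(np.allclose(joint, np.outer(p_y, p_x)))    # False, so dependent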
• To conclude dependence we only needed to find one ordered pair satisfying P(X = x, Y = y) ≠ P(X = x)P(Y = y). • Important: Evidently if there are zeros in the joint probability mass function of (X, Y)T, then X and Y cannot be independent. • The converse is not true. For Example: Consider (X, Y)T jointly discrete with joint probability mass function given in the table below.

             Y
          -1     0     1
     1   2/10  1/10  1/10
 X   2   1/10  1/10  1/10
     3   1/10  1/10  1/10

Here
    P(X = 1, Y = −1) = 2/10 ≠ (4/10)(4/10) = P(X = 1)P(Y = −1)
and we conclude dependence, but there are no zeros in the joint p.m.f.
• We can extend the observation above to the jointly absolutely continuous case: if the support of two jointly absolutely continuous random variables X and Y is not a rectangle, then they are not independent. • Again, be careful here: it is not in general true that if the support of two jointly absolutely continuous random variables X and Y is a rectangle then they are independent. For Example: Suppose that X and Y are jointly absolutely continuous with joint probability density function given below:
    fX,Y(x, y) = { 4x²y + 2y⁵  for 0 < x < 1, 0 < y < 1
                 { 0           otherwise.
Are X and Y independent? It is left as an exercise to show that
    fX(x) = ∫_0^1 (4x²y + 2y⁵) dy = 2x² + 1/3 for 0 < x < 1
and
    fY(y) = ∫_0^1 (4x²y + 2y⁵) dx = (4/3)y + 2y⁵ for 0 < y < 1
but
    fXY(x, y) = 4x²y + 2y⁵ ≠ (2x² + 1/3)((4/3)y + 2y⁵) = fX(x)fY(y).
It follows that X and Y are dependent.
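• Aside (a one-point check added to these notes, in plain Python): evaluating both sides at, say, (x, y) = (1/2, 1/2) is already enough to conclude dependence.

    x, y = 1/2, 1/2
    f_joint = 4 * x**2 * y + 2 * y**5                      # 0.5625
    f_product = (2 * x**2 + 1/3) * ((4/3) * y + 2 * y**5)  # about 0.6076
    print(f_joint == f_product)                            # False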
• Let’s revisit the bivariate normal distribution. • We’ve seen that if (X, Y) ∼ N2((µX, µY)T, [σX²  σXY; σXY  σY²]) then it follows that
    X ∼ N(µX, σX²),    Y ∼ N(µY, σY²)
and
    Y | X = x ∼ N( µY + σYρ(x − µX)/σX , σY²(1 − ρ²) ).
• This immediately describes Y’s dependence on X via the parameter σXY. • Note for example that when σXY = 0, Y | X = x ∼ N(µY, σY²). • In fact we can combine these results in an important theorem. • Theorem: If (X, Y) ∼ N2((µX, µY)T, [σX²  σXY; σXY  σY²]) then X and Y are independent if and only if σXY = 0. Proof: By definition of independence and by the example above, it follows that X and Y are independent if and only if fXY(x, y) = fX(x)fY(y). The left side equals the right side if and only if σXY = 0. • It follows from this theorem that independent normal variables are automatically multivariate normal!
• Moving forward we start from a new definition. • Definition: A random vector (X1, X2, ..., Xn)T is called a sample if
    1. X1, X2, ..., Xn are independent and
    2. X1, X2, ..., Xn all have the same distribution.
• Terminology: In the situation where 1 and 2 above are satisfied we sometimes write that X1, X2, ..., Xn are i.i.d. (independent and identically distributed). • Using that notation, a sample is just a collection of i.i.d. random variables. For Example: The multivariate normal distribution three ways:
    1. (X1, X2, ..., Xn)T ∼ Nn(µ⃗, Σ).
    2. X1, X2, ..., Xn are independent with location parameters µk and spread parameters σk². Note that we could very easily prove here that (X, Y)T ∼ N2(µ⃗, Σ) are independent if and only if σXY = 0, but I’ll leave it as an exercise.
    3. X1, X2, ..., Xn are independent and identically distributed with location parameter µ and spread parameter σ².
3.8 Multidimensional Change of Variable Formula (n = 2)
• When X is a random variable with distribution PX (discrete or absolutely continuous) we’ve already considered the problem of finding the distribution, PY, of Y = h(X), where h : R → R. • Now we turn our attention to the same problem but with more random variables. If (X1, X2, ..., Xn) is a random vector with joint distribution {PX1,...,Xn(B) : B ∈ Bⁿ} and (Y1, Y2, ..., Ym) = (h1(X1, ..., Xn), h2(X1, ..., Xn), ..., hm(X1, ..., Xn)) is a new random vector where hk : Rⁿ → R for k = 1, 2, ..., m, then what is the distribution of (Y1, Y2, ..., Ym)? • How we will proceed:
    - For now we’ll only consider this problem in the situation where n = 2 and m = 2.
    - I will leave it to the homework problems for you to discover the situation where n = 2, m = 1, Y = h(X1, X2) = X1 + X2 and where X1 and X2 are independent. This leads, for example, to the definition of convolution. Seriously, check it out because I will assume that you know it.
    - Later on in chapter 4 we’ll consider general n, m = 1, in the special case where X1, X2, ..., Xn are independent and identically distributed.
• What the heck, I can’t wait. The following example is a sneak peek at that special situation (general n and m = 1 and X1, ..., Xn is a sample). For Example: Suppose X1, X2, ..., Xn is a sample from the uniform(0,1) distribution and denote by Y the random variable equal to the smallest of the X1, X2, ..., Xn. I.e. Y = min(X1, ..., Xn). Find the distribution of Y.
    fY(y) = d/dy FY(y) = d/dy P(Y ≤ y)
          = d/dy [1 − P(Y > y)]
          = d/dy [1 − P(min(X1, ..., Xn) > y)]
          = d/dy [1 − P(X1 > y, ..., Xn > y)]
          = d/dy [1 − P(X1 > y) ··· P(Xn > y)]
          = d/dy [1 − P(X1 > y)ⁿ]
          = d/dy [1 − (1 − y)ⁿ]
          = n(1 − y)ⁿ⁻¹ for 0 < y < 1
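• Aside (a simulation sketch added to these notes, assuming numpy; n and the test point are arbitrary): the empirical c.d.f. of the simulated minimum matches FY(y) = 1 − (1 − y)ⁿ.

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 5, 200_000
    mins = rng.uniform(0, 1, size=(reps, n)).min(axis=1)   # Y = min(X_1, ..., X_n)

    y = 0.25
    print((mins <= y).mean())       # empirical P(Y <= y)
    print(1 - (1 - y)**n)           # exact value, about 0.7627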
• Back to the situation where n = 2 and m = 2 and where X1 and X2 are jointly discrete, consider the following example: For Example: Suppose that X1 and X2 are two discrete random variables with joint probability mass function given in the table below.

             X1
          1    2    3
    -1   1/9  2/9   0
X2   0   2/9  1/9  1/9
     1    0   1/9  1/9
If Y1 = X1 + X2 and Y2 = 2X1 then find the joint distribution of Y1 and Y2. The solution to this problem involves a bunch of tedious calculations:

                  Y1 = X1 + X2
                0     1     2     3     4
            2  1/9
 Y2 = 2X1   4
            6

where for example
    P(Y1 = 0, Y2 = 2) = P({(x1, x2) ∈ supp(X1, X2) : x1 + x2 = 0, 2x1 = 2}) = P(X1 = 1, X2 = −1) = 1/9.
The rest are left as an exercise. • That last example played out a lot like the single variable case and hopefully it motivates the following theorem that looks a lot like the single variable case. • Theorem: Let X and Y be discrete random variables, with joint probability function pXY. Let Z = h1(X, Y) and W = h2(X, Y), where h1, h2 : R² → R are some functions. Then Z and W are also discrete, and their joint probability function pZW satisfies
    pZW(z, w) = Σ_{x,y : h1(x,y)=z, h2(x,y)=w} pXY(x, y)
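• Aside (a sketch added to these notes, in plain Python): the theorem is exactly the bookkeeping below, which finishes the tedious calculations of the example.

    from collections import defaultdict
    from fractions import Fraction

    # nonzero entries of the joint p.m.f. table above: (x1, x2) -> probability
    p_x1x2 = {(1, -1): Fraction(1, 9), (2, -1): Fraction(2, 9),
              (1, 0):  Fraction(2, 9), (2, 0):  Fraction(1, 9), (3, 0): Fraction(1, 9),
              (2, 1):  Fraction(1, 9), (3, 1):  Fraction(1, 9)}

    p_y1y2 = defaultdict(Fraction)          # Fraction() is 0, a fine default
    for (x1, x2), p in p_x1x2.items():
        p_y1y2[(x1 + x2, 2 * x1)] += p      # sum over the preimage of each (y1, y2)

    print(p_y1y2[(0, 2)])                   # 1/9, matching the hand calculation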
• As always the absolutely continuous case looks scarier but in my opinion is usually easier computationally. • Below is the 2-dimensional analogue of the change of variable formula. Here we replace X1 and X2 with X and Y only for notational convenience.
• Theorem: Let X and Y be jointly absolutely continuous, with joint p.d.f. fXY. Let Z = h1(X, Y) and W = h2(X, Y) where h1, h2 : R² → R are differentiable functions. Define the joint function h = (h1, h2) : R² → R² by h(x, y) = (h1(x, y), h2(x, y)). Assume that h is one-to-one, at least on the region {(x, y) : fXY(x, y) > 0}. Then Z and W are also jointly absolutely continuous, with joint p.d.f. fZW given by
    fZW(z, w) = fXY(h1⁻¹(z, w), h2⁻¹(z, w)) |det [ ∂h1⁻¹(z,w)/∂z   ∂h1⁻¹(z,w)/∂w
                                                   ∂h2⁻¹(z,w)/∂z   ∂h2⁻¹(z,w)/∂w ]|
where (h1⁻¹(z, w), h2⁻¹(z, w)) is the unique pair (x, y) such that h(x, y) = (h1(x, y), h2(x, y)) = (z, w).
For Example: Suppose that X and Y are a sample from the exponential(λ) distribution. Find the distribution of (X + Y)/2. If X and Y are a sample from the exponential(λ) distribution then it follows by definition:
    fXY(x, y) = fX(x)fY(y) = λ²e^{−λ(x+y)} for x > 0 and y > 0.
To make use of the multivariate change of variable formula we introduce a second new random variable in addition to the one given in the problem: define U = (X + Y)/2 and V = X. It follows that if u = h1(x, y) = (x + y)/2 and v = h2(x, y) = x then
    x = h1⁻¹(u, v) = v and y = h2⁻¹(u, v) = 2u − v
and
    det [ ∂h1⁻¹/∂u   ∂h1⁻¹/∂v ]  =  det [ ∂v/∂u        ∂v/∂v      ]  =  det [ 0   1 ]  =  −2.
        [ ∂h2⁻¹/∂u   ∂h2⁻¹/∂v ]         [ ∂(2u−v)/∂u   ∂(2u−v)/∂v ]         [ 2  −1 ]
Putting all this together into our formula,
    fUV(u, v) = fXY(h1⁻¹(u, v), h2⁻¹(u, v)) |det[ ··· ]| = λ²e^{−λ(v+(2u−v))} |−2| = 2λ²e^{−2λu},
with x > 0 and x = v implying that v > 0, and y > 0 and y = 2u − v implying that 2u − v > 0. We conclude that
    fUV(u, v) = { 2λ²e^{−2λu}  for 2u − v > 0, v > 0
                { 0            otherwise.
Lastly, to get the distribution of U alone, we integrate out our dummy variable from the joint p.d.f.:
    fU(u) = ∫_0^{2u} fUV(u, v) dv = ∫_0^{2u} 2λ²e^{−2λu} dv = 2λ²e^{−2λu} [v]_0^{2u} = 4λ²u e^{−2λu} for u > 0.
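• Aside (a simulation sketch added to these notes, assuming numpy; λ and the test point are arbitrary): we can compare the empirical c.d.f. of (X + Y)/2 with the integral of 4λ²u e^{−2λu} up to u0, which works out to 1 − e^{−2λu0}(1 + 2λu0).

    import numpy as np

    rng = np.random.default_rng(3)
    lam, reps = 1.5, 500_000
    u = (rng.exponential(1/lam, reps) + rng.exponential(1/lam, reps)) / 2

    u0 = 0.4
    print((u <= u0).mean())                            # empirical P(U <= u0)
    print(1 - np.exp(-2*lam*u0) * (1 + 2*lam*u0))      # exact, about 0.3374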