Primer of Genetic Analysis: A Problems Approach 3rd Ed.

CHAPTER FIVE Probability and Chi-Square

STUDY HINTS

The ability to determine the probability of an event or series of events is fundamental to many applications of genetic principles. Thinking in terms of probability is not easy at first, but with a set of guidelines and some practice, you should find that you soon have no difficulty. In the following section we will discuss some important terms and then summarize the general types of probability problems you might expect to encounter in genetics.

First, let us contrast two important phrases. Independent events are events that have no causal interrelationship. The conception of the first child in a family, for example, cannot biologically influence the fusion of sperm and egg at fertilization for a second child. Each fertilization is an independent event, and each probability for segregation or sex determination must be assessed independently. Mutually exclusive events, on the other hand, are related in that the occurrence of one eliminates the possibility that the other will occur. A normal child cannot be both a boy child and a girl child. Sex determination yields either of two mutually exclusive events.

Both of these ideas play a role in solving a probability problem in genetics. Depending upon the genotype of the parents, the probabilities of mutually exclusive events, for example, the birth of “normal” progeny as opposed to the birth of “affected” progeny, may be different. In a similar way, the number and combinations of independent events considered in a problem influence not only the answer but also the way in which you find it.

There are perhaps three main levels of complexity in common probability problems: (1) determining the likelihood of a single independent event; (2) determining the likelihood of a sequence of events in which the order either is set by the problem or is not important; and (3) determining the likelihood of a sequence of events in which several different orders must be considered and accounted for. The ways of calculating each type of probability are summarized in the following outline.

I. Individual independent events

A. Examples: A gamete carrying a dominant allele being formed; one child being a boy; a heterozygote being produced in a monohybrid cross.

B. Calculation: Determine the proportion of times that such an event is expected to occur in repeated trials (e.g., 1/2 for the probability of a gamete carrying a dominant allele being formed in a heterozygote or 1 in a homozygote).

II. Sequence of independent events where order is set or irrelevant

A. Examples: a family of three children, boy–boy–boy (all one class, so order irrelevant); a family of four children, boy–girl–girl–boy (order is set, or specified in the problem).

B. Calculation: Multiply the individual probabilities (e.g., for a family of three boys, 1/2 · 1/2 · 1/2 = 1/8).

III. Sequence of events in which different orders must be pooled

A. Examples: a six-child family composed of four girls and two boys in any order; a seven-child family composed of at least two affected children.

B. Calculation: Use the probability formula or some expansion of it.

1. For a single sequence,

$$\text{Probability} = \frac{n!}{s!\,t!}\,p^{s}q^{t}$$

where n is the number of individuals in the sequence (number of children in the family), p is the probability of the first event (e.g., being normal), q is the probability of the second event (e.g., being affected), s is the number of cases of the first event (e.g., number of normal children in a family), and t is the number of cases of the second event (e.g., number of affected children in the family). The value n! (read “n factorial,” that is, the product of all integers from 1 to n) divided by (s!)(t!) gives the number of different ways in which the sequence of events can occur. By definition, 0! = 1. For example, for a three-child family composed of one boy and two girls, n!/(s! t!) is 3!/(1! 2!), and, using the formula,

$$\frac{3 \cdot 2 \cdot 1}{(1)(2 \cdot 1)} = 3$$

That is, the single boy could be the first, or the second, or the third child.
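The formula and the one-boy, two-girl example above can be checked with a short Python sketch. The function name `sequence_probability` and the use of exact fractions are illustrative choices, not part of the text.

```python
from fractions import Fraction
from math import factorial


def sequence_probability(s, t, p, q):
    """Probability of s events of the first kind and t of the second, in any order."""
    n = s + t
    # n!/(s! t!) counts the different orders in which the events can occur
    orders = factorial(n) // (factorial(s) * factorial(t))
    return orders * Fraction(p) ** s * Fraction(q) ** t


# One boy and two girls in a three-child family, with p = q = 1/2
half = Fraction(1, 2)
print(sequence_probability(1, 2, half, half))  # 3/8
```

The three possible orders each have probability 1/8, which gives 3/8 in total.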

2. For combining several sequences: Note that one of the examples in Section III. A was a seven-child family composed of at least two affected children. Any family with two affected and five normal children, or three affected and four normal children, or four affected and three normal children, or any of the other possible combinations would fulfill this requirement. The probability can be calculated in either of two ways.

a. Using the probability formula, calculate the likelihood for each family, and add the figures together. This is an accurate but time-consuming method.

b. Expand the binomial expression for $(p + q)^n$, where p is the probability of the first event, q is the probability of the second, and n is the size of the sequence (e.g., size of the family). This method is faster and simpler than method (a). For example, for a seven-child family,

$$(p + q)^7 = p^7 + 7p^6q + \underline{21p^5q^2 + 35p^4q^3 + 35p^3q^4 + 21p^2q^5 + 7pq^6 + q^7}$$

If we let p be the probability of being normal and q be the probability of being affected, the six family makeups underlined in the equation (the terms from $21p^5q^2$ through $q^7$) have at least two affected children. Probabilities must include all possibilities and add up to 1. Note that P(at least two affected) = $1 - (p^7 + 7p^6q)$, where the terms in parentheses refer to seven-child families with 0 or 1 affected.

One way to expand the binomial is to use Pascal’s triangle to determine the coefficients for each term. Each line gives the coefficients for the expansion of $(p + q)^n$. Each line begins and ends with 1, and each interior number is the sum of the two numbers directly above it. The following four lines give the triangle for n = 1 through n = 4.

n = 1:        1   1
n = 2:      1   2   1
n = 3:    1   3   3   1
n = 4:  1   4   6   4   1

Thus, $(p + q)^4 = 1p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + 1q^4$
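The rule for building Pascal’s triangle can be sketched in a few lines of Python. The function name `pascal_row` is a hypothetical helper introduced here for illustration.

```python
def pascal_row(n):
    """Return the coefficients of (p + q)^n as a list of n + 1 integers."""
    row = [1]
    for _ in range(n):
        # Each new line starts and ends with 1; every interior entry is
        # the sum of the two entries directly above it.
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row


for n in range(1, 5):
    print(n, pascal_row(n))

print(pascal_row(7))  # [1, 7, 21, 35, 35, 21, 7, 1]
```

The last line reproduces the coefficients of the seven-child expansion given earlier.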

Once the probability of an event or set of events has been determined, the next step is often to test the fit of observed data to these expectations. Chi-square (χ²) tests are frequently used in genetics to test the significance of the deviation between observed and expected numbers.

We shall consider the use of χ² for testing two slightly different types of statistical fits. First, let us assume that you have counted the number of “normal” and “mutant” progeny that are produced by a cross between two heterozygous parents and found 137 normal and 63 mutant progeny. Your expectation is that these progeny will be produced in a ratio of 3 “normal” to 1 “mutant.” If you have counted, say, 200 progeny, then the expected number of normal ones is 3/4 of these, or 150 progeny. You would expect 50 to have the mutant phenotype.


               Normal   Mutant   Total
Observed (O)   137      63       200
Expected (E)   150      50       200


A χ² value is calculated by finding the deviation between the observed and expected numbers (O − E) for each class, squaring this deviation, dividing the square by the expected number for that class, and then summing the results over all classes. That is,

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

For the data just given, this calculation has the following results:


               Normal             Mutant            Total
Observed (O)   137                63                200
Expected (E)   150                50                200
O − E          −13                13                0
(O − E)²       169                169
(O − E)²/E     169/150 = 1.127    169/50 = 3.380

χ² = 1.127 + 3.380 = 4.507


For this type of χ², the number of degrees of freedom (d.f.) is 1 less than the number of classes. Here there are two classes (normal and mutant), so d.f. = 1. Looking at the table of critical χ² values at the end of this book (Table R.1), you see that for 1 degree of freedom you would expect to find a deviation of 3.84 or larger by chance alone only 1 time in 20 (a probability of .05). Since the deviation in our data is greater than this but not as great as the deviation one might find once in 100 tests (.01), we can conclude that there is a significant difference between the observed and expected, and .01 < p < .05.

Thus we should reject our initial hypothesis that these progeny would be produced in simple Mendelian proportions. The actual outcome could have been due to differences in viability between classes, errors in classification, or the fact that the trait might have a complex genetic basis. The initial hypothesis, or null hypothesis, can only be rejected by our statistical test; it cannot be proved. (The null hypothesis is discussed in more detail in Chapters 8 and 9.)
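The goodness-of-fit calculation for the 137:63 data can be sketched in Python as follows. The function name `chi_square` and the constant name are illustrative. The critical value 3.84 is the one quoted above for 1 degree of freedom at the .05 level.

```python
def chi_square(observed, ratio):
    """Sum of (O - E)^2 / E, with expected numbers derived from a ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_05_DF1 = 3.84  # value quoted in the text for d.f. = 1, p = .05

chi2 = chi_square([137, 63], [3, 1])
print(round(chi2, 3))          # 4.507
print(chi2 > CRITICAL_05_DF1)  # True, so the 3:1 hypothesis is rejected
```

Passing the expected ratio rather than the expected numbers keeps the observed and expected totals equal automatically.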

A χ² test can also be used to test whether two sets of data are independent. This is called a contingency χ² test and is easily calculated by substituting the values from a table of data, laid out as in the following table, into the formula given after it.

            Class A   Class B   Totals
Group 1     a         b         a + b
Group 2     c         d         c + d
Totals      a + c     b + d     a + b + c + d = n

$$\chi^2 = \frac{\left[\,|ad - bc| - \tfrac{1}{2}n\,\right]^2 n}{(a + b)(a + c)(c + d)(b + d)}$$

where the vertical bars around the term |ad − bc| signify the absolute value (the positive magnitude) of the difference between ad and bc. For example, consider the calculations for the following set of data from two replicate experiments.

              Normal   Mutant   Totals
Replicate 1   78       72       150
Replicate 2   44       41       85
Totals        122      113      235

$$\chi^2 = \frac{\left[\,|(78)(41) - (72)(44)| - \tfrac{1}{2}(235)\,\right]^2 (235)}{(150)(122)(85)(113)} = .0102$$

Thus we have no statistical basis for rejecting the null hypothesis that the two replicates are producing normal and mutant progeny in the same proportions.


The 1/2 in the formula just given is a correction factor (Yates’s correction factor) that is typically included in χ² tests in which the number of observations in any of the categories is less than five. It is also frequently used when the number of classes is small, as in our contingency χ². We could have also used it in the earlier χ², in which case χ² = 4.167, and p is still less than .05.
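Both corrected values quoted in this section can be reproduced with a brief Python sketch. The function names are hypothetical, and the contingency function simply evaluates the formula as printed above.

```python
def chi_square_yates(observed, expected):
    """Goodness-of-fit chi-square with 1/2 subtracted from each |O - E|."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


def contingency_chi_square(a, b, c, d):
    """2 x 2 contingency chi-square, following the formula given in this section."""
    n = a + b + c + d
    numerator = (abs(a * d - b * c) - n / 2) ** 2 * n
    denominator = (a + b) * (a + c) * (c + d) * (b + d)
    return numerator / denominator


print(round(chi_square_yates([137, 63], [150, 50]), 3))   # 4.167
print(round(contingency_chi_square(78, 72, 44, 41), 4))   # 0.0102
```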

One key point to remember is that χ² tests take the sample size into consideration. Thus one cannot do a χ² analysis of percentages. An illustration of this is included in problems 14 and 15 at the end of the chapter.

Finally, remember that statistical tests evaluate only the likelihood that a given set of data is inconsistent with a stated hypothesis. One cannot prove that a particular hypothesis is correct, since proof would require observing all possible examples. One can only reject a hypothesis (that is, the null hypothesis) by marshaling evidence against it. If the null hypothesis is rejected, then by elimination we have supported the alternative.

IMPORTANT TERMS

Chi-square (χ²)

Contingency test

Degrees of freedom

Hypothesis

Independent events

Mutually exclusive events

Null hypothesis

Yates’s correction factor

PROBLEM SET 5

This problem set concentrates on applying the basic rules of probability to genetics questions and on testing hypotheses by using the χ² test of goodness of fit. The first 12 questions involve probability, and the last 5 are χ² problems.

1. The sex-linked trait hemophilia is found in the history of a particular family. A phenotypically normal couple produces a son who is a “bleeder.” The probability that the next child will be a bleeder is

(a) 0,

(b) 3/4,

(c) 1/2,

(d) 1/4,

(e) none of the above.


2. A married student, talking to the teacher after a lecture on probability, asks the following question. “My husband and I have been married for almost five years, and we have a family of three children. If I tell you that at least two of them are girls, what is the probability that all of our children are girls?” What should the teacher answer?


3. The gene pool of a human population has five different alleles at an autosomal locus. The number of different genotypes that are possible at this locus is

(a) 5,

(b) 10,

(c) 15,

(d) 20,

(e) 25,

(f) none of the above.


4. A trihybrid pea plant, having the genotype AaBbC1C2, is self-fertilized. All loci are unlinked. There is complete dominance at the A and B loci but incomplete dominance at the C locus. The fraction of progeny that will be phenotypically different from the parent is

(a) 1/8,

(b) 7/8,

(c) 37/64,

(d) 23/32,

(e) none of the above.


5. Red flower color is dominant (R –) to white (rr) in garden peas. In a cross of Rr to rr, 4,400 progeny are recovered. The probability that these 4,400 plants will consist of 2,200 red- and 2,200 white-flowered plants is

(a) 100 percent,

(b) 50 percent,

(c) 25 percent,

(d) much less than 25 percent.


6. Phenylketonuria, a metabolic disorder in humans, is inherited as an autosomal recessive trait. A husband and wife, both heterozygous for this gene, plan to have six children. What is the probability that four of the offspring will be normal and two will have phenylketonuria?


7. Heterozygotes (carriers) for the autosomal sickle cell anemia gene occur in the U.S. black population with a frequency of about 1 in 10. If two phenotypically normal people from the general black population marry, what is the probability that their first child will have sickle cell anemia?

(a) 1/10,

(b) 1/40,

(c) 1/100,

(d) 1/400.


8. A deleterious trait, inherited as an autosomal recessive (b), has a penetrance of 60 percent. In a cross of Bb and Bb, the expected frequency of individuals showing the deleterious trait is

(a) .15,

(b) .25,

(c) .40,

(d) .45,

(e) .60,

(f) none of the above.


9. A woman whose mother was heterozygous for the retinal-cancer mutation retinoblastoma (R), a dominant allele with 90 percent penetrance, marries a man who is heterozygous for the mutation. Assuming that all other people in the pedigree are homozygous normal, what is the probability that their first child will suffer from retinal cancer?


10. A thirty-year-old woman with four children starts showing the unmistakable symptoms of Huntington’s chorea, the rare autosomal dominant trait that killed the American folk singer Woody Guthrie.

(a) What is the probability that two of her four children carry the gene for Huntington’s chorea?


(b) What is the probability that the woman’s father carries the gene?


(c) What is the probability that the woman’s mother carries the gene?


(d) What is the probability that the woman’s cousin on her father’s side of the family carries the gene?



11. What proportion of all five-child families from a cross of Ss and Ss will be expected to include at least one child who is homozygous ss?

(a) 1/32,

(b) 781/1,024,

(c) 243/1,024,

(d) 31/32,

(e) none of the above.


12. Consider all six-child families produced by parents that include one heterozygous for a common recessive autosomal condition and one homozygous for the condition (Rr × rr). What proportion of all such six-child families will include at least three affected children?


Chi-Square Problems

13. A testcross of a monohybrid gray mouse to an albino strain results in 64 gray and 48 albino progeny. Test the goodness of fit of these data to a 1:1 ratio, using the χ² test.


14. Self-fertilization of several phenotypically tall pea plants results in the production of 100 progeny: 79 tall and 21 short plants. Test the goodness of fit of these data to the hypothesis that the cross should yield progeny in the expected ratio of 3 tall:1 short.


15. In the same experiment as in problem 14, assume that 1,000 plants are produced, including 790 tall and 210 short (the same frequencies). Test the goodness of fit to the same 3:1 ratio, and compare your results to those you obtained in problem 14. Although the proportions are the same, is the chi-square value the same?


16. A pure-breeding, tall pea plant with white flowers is crossed with a pure-breeding, short plant with red flowers. The F1 plants are tall, with red flowers. When allowed to fertilize themselves, these produce the following F2: 326 tall, red; 104 tall, white; 117 short, red; and 29 short, white. Explain these data, indicating genotypes of the parents, and the F1, and the different F2 phenotypes. What are the expected numbers of the various F2 classes? Test the goodness of fit between the data and your hypothesis, using χ².


17. A pure-breeding plant with colored flowers is crossed with a pure-breeding plant with colorless flowers. The F1 are all colored, and when self-fertilized these produce an F2 consisting of 196 colored and 92 with colorless blooms. Two alternative hypotheses might be proposed. First, this could be a monohybrid cross, with colored flowers dominant to colorless flowers. A second explanation is that this is a dihybrid cross, with flower pigment dependent on two dominant complementary genes that are not linked. In this second alternative, only the A–B– genotype would produce colored flowers. Test the goodness of fit of the data to the two expected ratios. Which hypothesis fits the data better?


ANSWERS TO PROBLEM SET 5

1.

(d) 1/4. Since hemophilia is sex-linked, the father of this family must be genotypically normal, or he would have hemophilia. A sex-linked recessive trait is passed from the mother to her sons. Thus the mother must be a heterozygote. If she were homozygous, she too would have hemophilia. The mating and all possible offspring are summarized as follows. (Let H represent the normal allele and h represent the hemophilia-causing allele. The father’s Y chromosome does not carry an allele for this locus.)

Hh × HY
↓
Daughters: 1/2 HH, 1/2 Hh (all phenotypically normal)
Sons: 1/2 HY (normal), 1/2 hY (hemophilia)

All female children will inherit the H allele from the father and will be either HH or Hh and are phenotypically normal. Females will comprise 1/2 of all children, on the average. Half of the male offspring will inherit the h allele from their mothers and will have hemophilia. Since being male and inheriting the h allele are independent events, the probabilities are multiplied: 1/2 · 1/2 = 1/4.

2. 1/4. The purpose of this problem is to help you see that it is critically important to determine which of all possible events (three-child families in this case) are consistent with the available data (at least two girls in this case). Families with two or more boys are not relevant to the problem. A three-child family is the outcome of three independent events in which there is a 50 percent probability of the child’s being a girl and a 50 percent probability of its being a boy. The possible makeups of three-child families, listed in birth order (B = boy, G = girl), are as follows:

BBB   BBG   BGB   GBB   BGG*   GBG*   GGB*   GGG*

The four families that include at least two girls (marked with an asterisk) constitute the base sample that must be considered. In only one are all three children girls. This is an example of conditional probability.
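The conditional probability in this answer can be confirmed by enumerating every birth order, as in the following Python sketch. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely birth orders for three children (B = boy, G = girl)
families = list(product("BG", repeat=3))

# Keep only the families consistent with the information given
at_least_two_girls = [f for f in families if f.count("G") >= 2]
all_girls = [f for f in at_least_two_girls if f.count("G") == 3]

print(len(families), len(at_least_two_girls), len(all_girls))  # 8 4 1
print(Fraction(len(all_girls), len(at_least_two_girls)))       # 1/4
```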

3.

(c) 15. Each individual genotype must contain only 2 of these alleles. The homozygotes total 5. In addition, there are 10 heterozygotes. Remember that A1A2 is genotypically the same as A2A1. Numbering the alleles from 1 to 5, we get the following genotypes:

Homozygotes: A1A1, A2A2, A3A3, A4A4, A5A5
Heterozygotes: A1A2, A1A3, A1A4, A1A5, A2A3, A2A4, A2A5, A3A4, A3A5, A4A5
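The count of 15 can be checked by listing every unordered pair of alleles, for instance with the Python sketch below. The variable names are illustrative.

```python
from itertools import combinations_with_replacement

alleles = [1, 2, 3, 4, 5]

# Unordered pairs, so that A1A2 and A2A1 are counted once
genotypes = list(combinations_with_replacement(alleles, 2))
homozygotes = [g for g in genotypes if g[0] == g[1]]
heterozygotes = [g for g in genotypes if g[0] != g[1]]

print(len(homozygotes), len(heterozygotes), len(genotypes))  # 5 10 15
```

For n alleles the same count is n(n + 1)/2, which is 15 when n = 5.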

4.

(d) 23/32. The three loci will assort independently. To answer this question, consider each locus separately. First, in a cross of Aa × Aa, 3/4 will show the dominant phenotype like the parents. The same is true of the B locus. Since the C locus displays incomplete dominance, however, only 1/2 of the offspring will be heterozygotes like the parent. The fraction of all progeny that will be phenotypically the same as the parent will be the product of the individual probabilities: 3/4 · 3/4 · 1/2 = 9/32. The proportion that will be phenotypically different is the remainder of all combinations: 1 − 9/32 = 23/32. You can also calculate the probability of being different, but if you choose to approach the problem this way, remember that one can be different in some of the genes but like the parents for others. Thus, there are many phenotype combinations that are different in some way. All of these must be counted.

5.

(d) Much less than 25 percent. The 1:1 ratio is only the expectation. Equal numbers of dominant and recessive phenotypes from this cross would not occur very often, however, because of the sampling variation that always exists in a chance event. Chi-square tests are used to compare the observed and expected data.

6. The probability is .297. For this type of problem, one should use the probability formula. Since the parents are both heterozygotes, the probability of the child’s having a normal phenotype is 3/4, and the probability of having the mutant phenotype is 1/4. In a family of six children, n = 6. The stipulated makeup of four normal and two affected children means that s = 4 and t = 2. Substituting into the probability formula, we have

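$$\text{Probability} = \frac{n!}{s!\,t!}\,p^{s}q^{t}$$

$$\text{Probability} = \frac{6!}{4!\,2!}\left(\frac{3}{4}\right)^{4}\left(\frac{1}{4}\right)^{2} = \left(\frac{6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{4 \cdot 3 \cdot 2 \cdot 1 \cdot 2 \cdot 1}\right)\left(\frac{81}{4{,}096}\right) = \frac{1{,}215}{4{,}096} = .297$$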

7.

(d) 1/400. For a child to have sickle cell anemia, both normal parents must be heterozygotes. The probability that the mother is a carrier is 1/10. The same is true for the father. If both are carriers, the probability that the child is a homozygote for sickle cell anemia is 1/4. These are independent events, and all probabilities should be multiplied: 1/10 · 1/10 · 1/4 = 1/400.

8.

(a) .15. There is a 1/4 probability of a bb individual being produced by the cross of Bb × Bb. If the individuals are bb, there is only a 60 percent probability that they will express the trait. These are independent events, and the frequencies are therefore multiplied: 1/4 · 6/10 = 6/40 = .15.

9. The answer is 45/80 (that is, 9/16). The woman has a 1/2 chance of having inherited the retinoblastoma mutation from her mother. Since the man she marries is known to be a heterozygote, there is a 3/4 probability that their child will have at least one dominant mutant allele (R–) if both parents are carriers, and a 1/2 probability if the woman is rr. If the child inherits the mutation, there is only a 9/10 chance that it will express the trait. Within each case the events are independent, so their probabilities are multiplied. The two cases are mutually exclusive, so their probabilities are added:

(1/2 chance that the woman is a heterozygote
× 3/4 chance that the child will inherit the R allele if both parents are carriers
× 9/10 chance that the child will express the trait)

+ (1/2 chance that the woman is rr
× 1/2 chance that the child will inherit the R allele from its Rr father
× 9/10 chance that the child will express the trait)

= (1/2 · 3/4 · 9/10) + (1/2 · 1/2 · 9/10) = 27/80 + 18/80 = 45/80

10.

(a) 3/8. The mother is a heterozygote for this dominant trait, since it is rare, and it is therefore unlikely that both of her parents were carriers (the only way in which she could be a homozygote). There is, therefore, a 1/2 chance that a child will be a carrier and a 1/2 chance that it will be normal. Finding the proper solution involves substituting these probabilities and family composition into the probability formula:

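$$\text{Probability} = \frac{n!}{s!\,t!}\,p^{s}q^{t}$$

$$\text{Probability} = \frac{4!}{2!\,2!}\left(\frac{1}{2}\right)^{2}\left(\frac{1}{2}\right)^{2} = \left(\frac{4 \cdot 3 \cdot 2 \cdot 1}{2 \cdot 1 \cdot 2 \cdot 1}\right)\left(\frac{1}{16}\right) = \frac{3}{8}$$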

(b) 1/2.

(c) 1/2. The answer to both (b) and (c) is the same. The woman in this problem carries a dominant trait that she inherited from either her mother or her father. Since it is an autosomal trait, there is an equal probability that it was her mother or her father who carried it.

(d) 1/8. There is a 50 percent probability that the mutant allele will be transmitted in each generation. There is a 50 percent chance that one of the father’s parents carries the mutant, which means that his parents have a 50 percent chance of being Rr × rr. The probability that the R allele will be passed on to the cousin’s parent is 50 percent. The probability that the R allele will be passed on to the cousin is also 50 percent. Thus the probability that the cousin carries the gene is 1/2 · 1/2 · 1/2 = 1/8.

11. 781/1,024. All of the five-child families from such a mating will include at least one homozygous ss child except the family composed only of children with the dominant phenotype. Since the cross is between two heterozygotes, there is a 3/4 chance that a child will have the dominant phenotype. Because the birth of each child is an independent event and order is irrelevant (since they are all the same), you would simply multiply probabilities to determine the proportion of completely normal families for this mating: $(3/4)^5$ = 243/1,024. All of the rest have at least one affected (ss) individual: 1 − (243/1,024) = 781/1,024.
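An "at least" probability of this kind can be computed either by summing the relevant binomial terms or by subtracting the excluded terms from 1. The Python sketch below sums the terms directly. The helper name `at_least` is hypothetical.

```python
from fractions import Fraction
from math import comb


def at_least(k, n, q):
    """Probability of k or more affected children among n independent births."""
    q = Fraction(q)   # probability that a child is affected
    p = 1 - q         # probability that a child is normal
    return sum(comb(n, t) * p ** (n - t) * q ** t for t in range(k, n + 1))


# At least one ss child among five, with q = 1/4
print(at_least(1, 5, Fraction(1, 4)))  # 781/1024

# The same helper sums a tail of several terms, e.g. at least three
# affected among six children when q = 1/2
print(at_least(3, 6, Fraction(1, 2)))  # 21/32
```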

12. 21/32. This problem requires that the proportion of several different family makeups be calculated. The most efficient way to do this is to use the expanded binomial for six-child families.

$$(p + q)^6 = p^6 + 6p^5q + 15p^4q^2 + \underline{20p^3q^3 + 15p^2q^4 + 6pq^5 + q^6}$$

The four families that have three or more affected individuals are underlined. In order to determine what proportion of all six-child families have at least three affected individuals, one must substitute the probability that the child will be normal (p) from the type of mating. Here the mating is between an affected individual and a heterozygote, so p = 1/2 and q = 1/2. Substituting in the underlined portion of the binomial gives

$$\text{Probability} = 20\left(\frac{1}{2}\right)^{3}\left(\frac{1}{2}\right)^{3} + 15\left(\frac{1}{2}\right)^{2}\left(\frac{1}{2}\right)^{4} + 6\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)^{5} + \left(\frac{1}{2}\right)^{6} = \frac{20}{64} + \frac{15}{64} + \frac{6}{64} + \frac{1}{64} = \frac{42}{64} = \frac{21}{32}$$

13. Set up your χ² test as shown in the following table. If you pay attention to the sums of the first two rows, you will be less likely to make a mathematical error, and if you do all the intermediate steps it will be easier to detect a mistake.


                                  Gray    Albino   Total   Notes
Observed                          64      48       112     totals should be the same
Expected (1:1)                    56      56       112
Deviation (Observed − Expected)   +8      −8       0       should always be zero
(Deviation)²                      64      64
(Deviation)²/Expected             1.14    1.14

$$\chi^2 = \sum \frac{(\text{Deviation})^2}{\text{Expected}} = 1.14 + 1.14 = 2.28$$


With two classes, gray and albino, there is 2 − 1 = 1 degree of freedom. The χ² table at the end of this book (Table R.1) shows that a χ² value of 2.28 corresponds to a probability above the .05 cutoff. You would expect, on the basis of chance alone, deviations as large or larger from 10 percent to 50 percent of the time. This is not sufficiently rare to require us to reject the null hypothesis that the data reflect a 1:1 ratio.

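14. The χ² calculations on the left-hand side of the table are done for a 3:1 expected ratio, with a sample size of 100. For a better contrast, we have also put the calculations required for problem 15 on the right-hand side of the table. For problem 14, the χ² value of .85 with 1 degree of freedom gives a p value of between .5 and .1. We can conclude that the data fit the expectation, in that the deviation is not large enough to compel us to reject the null hypothesis.

                                  Problem 14 (n = 100)       Problem 15 (n = 1,000)
                                  Tall    Short   Total      Tall    Short   Total
Observed                          79      21      100        790     210     1,000
Expected (3:1)                    75      25      100        750     250     1,000
Deviation (Observed − Expected)   +4      −4      0          +40     −40     0
(Deviation)²                      16      16                 1,600   1,600
(Deviation)²/Expected             .21     .64                2.13    6.40
χ²                                .21 + .64 = .85            2.13 + 6.40 = 8.53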



15. The calculations for this χ² test are given in the answer for problem 14 (right-hand table), so that you can contrast them readily with those for the same proportions but different sample size. In this case, the χ² value of 8.53 gives a p value of much less than .01. In other words, you would expect deviations this large or larger by chance in less than 1 in 100 such data sets. We can therefore reject the null hypothesis that the observed and expected frequencies are the same. Thus one clearly cannot use percentages in calculating χ²; sample size is very important.
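The effect of sample size in problems 14 and 15 can be seen by running the same calculation on both data sets, as in this Python sketch. The function name `chi_square` is illustrative.

```python
def chi_square(observed, ratio):
    """Sum of (O - E)^2 / E, with expected numbers derived from a ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


small = chi_square([79, 21], [3, 1])     # problem 14, 100 plants
large = chi_square([790, 210], [3, 1])   # problem 15, 1,000 plants

print(round(small, 2))  # 0.85
print(round(large, 2))  # 8.53
```

Multiplying every count by 10 multiplies χ² by 10 as well, which is why the same proportions can be consistent with 3:1 in one sample and not in the other.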

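16. Tall and red seem to be dominant to short and white, and the cross is a dihybrid, with each parent contributing one dominant and one recessive trait. The F2 shows segregation and independent assortment generating a 9:3:3:1 expected ratio, as indicated in the following table.

                                  Tall, red   Tall, white   Short, red   Short, white   Total
Observed                          326         104           117          29             576
Expected (9:3:3:1)                324         108           108          36             576
Deviation (Observed − Expected)   +2          −4            +9           −7             0
(Deviation)²                      4           16            81           49
(Deviation)²/Expected             .01         .15           .75          1.36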



Chi-square is therefore 2.27. The number of degrees of freedom is 4 (the number of classes) minus 1, or 3. The probability is therefore between .9 and .5, and we conclude that the data are consistent with the hypothesis.

17. If this is a monohybrid cross, the 288 F2 plants are expected to include 3/4 · 288 = 216 colored and 1/4 · 288 = 72 colorless plants. For the second explanation, only A–B– plants are colored. This gives 9/16 · 288 = 162 colored. All others are colorless; 7/16 · 288 = 126.


                                  Colored   Colorless   Total
Observed                          196       92          288
Expected (3:1)                    216       72          288
Deviation (Observed − Expected)   −20       +20         0
(Deviation)²                      400       400
(Deviation)²/Expected             1.85      5.56

$$\chi^2 = \sum \frac{(\text{Deviation})^2}{\text{Expected}} = 1.85 + 5.56 = 7.41$$



                                  Colored   Colorless   Total
Observed                          196       92          288
Expected (9:7)                    162       126         288
Deviation (Observed − Expected)   +34       −34         0
(Deviation)²                      1,156     1,156
(Deviation)²/Expected             7.14      9.17

$$\chi^2 = \sum \frac{(\text{Deviation})^2}{\text{Expected}} = 7.14 + 9.17 = 16.31$$


The smaller χ² value occurs with the 3:1 ratio, so this hypothesis gives a better fit between the observed and the expected. Still, both are significantly different, so in practice you would want to repeat the experiment and consider additional alternative hypotheses.


