A Rank-Sum Test for Significance of Difference in Copy Variation

Student: Kendrick Hardison (Francis Marion University)
Mentor: Scott Linder (Department of Mathematics and Computer Science)

Clinicians screen for the presence of a disease by examining a handful of gene locations known to be closely associated with it. Measurements taken at these locations are compared to the corresponding measurements from a large number of controls. Here we examine the sampling distribution of a rank sum statistic obtained by comparing the measurements of the case to the corresponding measurements of the controls. We derive the exact distribution, compare it to an approximate distribution proposed in the literature, and demonstrate that using the approximate distribution overstates the test's power and understates the Type 1 error risk.


Copy number variation (CNV) results from duplications and deletions of genomic DNA and is known to correlate with a number of genetic diseases. Typically, a subject being screened for a particular disease will have measurements taken at k sections of DNA, and these measurements are compared to those from a collection of N controls. Here we describe a rank sum statistic useful for determining whether the subject is at risk for disease.

We derive the exact distribution of this statistic and compare it to an approximate distribution proposed in the literature. We demonstrate that using the approximate distribution of the rank sum statistic results in a higher-than-nominal Type 1 error rate and an exaggeration of power.
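
As a rough illustration of the kind of comparison described above, the following Python sketch computes an exact null distribution for one plausible form of the statistic and sets it beside a normal approximation. Everything specific here is an assumption made for illustration, not the construction used in this work: the statistic is taken to be the sum, over the k sections, of the subject's rank among the N controls plus the subject at that section; ranks are assumed independent across sections and free of ties, so each rank is uniform on 1, ..., N+1 under the null; and the normal approximation stands in for "an approximate distribution" and may differ from the one proposed in the literature. The function names are hypothetical.

```python
import math


def exact_rank_sum_pmf(k, n_controls):
    """Exact null PMF of the sum of k independent ranks.

    Assumption (illustrative): each rank is uniform on 1..n_controls+1,
    i.e. the subject's rank among the controls plus itself, with no ties.
    Returns a dict mapping each possible sum to its probability.
    """
    m = n_controls + 1
    pmf = {0: 1.0}
    for _ in range(k):
        nxt = {}
        # Convolve the running distribution with one more uniform rank.
        for s, p in pmf.items():
            for r in range(1, m + 1):
                nxt[s + r] = nxt.get(s + r, 0.0) + p / m
        pmf = nxt
    return pmf


def exact_upper_tail(k, n_controls, t):
    """P(rank sum >= t) under the exact null distribution above."""
    pmf = exact_rank_sum_pmf(k, n_controls)
    return sum(p for s, p in pmf.items() if s >= t)


def normal_upper_tail(k, n_controls, t):
    """Normal approximation to P(rank sum >= t), with continuity correction.

    Assumption (illustrative): this is a generic normal approximation,
    not necessarily the approximate distribution from the literature.
    """
    m = n_controls + 1
    mean = k * (m + 1) / 2          # mean of k uniform ranks on 1..m
    var = k * (m * m - 1) / 12      # variance of k uniform ranks on 1..m
    z = (t - 0.5 - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    k, n_controls = 5, 20
    for t in (80, 90, 100):
        print(t, exact_upper_tail(k, n_controls, t),
              normal_upper_tail(k, n_controls, t))
```

Comparing the two tail probabilities at a candidate critical value shows how far an approximation drifts from the exact null probability for small k, which is the kind of discrepancy that affects the true Type 1 error rate.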