EXAMINATION OF NOVEL STATISTICAL DESIGNS FOR PHASE II AND PHASE III CLINICAL TRIALS

Ayanbola Olajumoke Ayanlowo


EXAMINATION OF NOVEL STATISTICAL DESIGNS FOR PHASE II AND PHASE III CLINICAL TRIALS

by

Ayanbola Olajumoke Ayanlowo

David T. Redden, PhD, CHAIR
Christopher Coffey, PhD
Gary Cutter, PhD
Charles Katholi, PhD
David Kimberlin, MD
Sharina Person, PhD

A DISSERTATION

Submitted to the graduate faculty of The University of Alabama at Birmingham, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

BIRMINGHAM, ALABAMA

2007

EXAMINATION OF NOVEL STATISTICAL DESIGNS FOR PHASE II AND PHASE III CLINICAL TRIALS

Ayanbola Olajumoke Ayanlowo

BIOSTATISTICS

ABSTRACT

Clinical trials to determine the usefulness of a new treatment are usually conducted in four (4) phases. Each phase is designed to answer a distinct research question about the usefulness of the new treatment. Phase I trials determine the safe dose range of the new treatment and identify the possible side effects and other treatment-associated toxicity issues in a small group of healthy people. In phase II trials, the efficacy and safety of the new treatment are investigated in a larger group of individuals, usually from the diseased population of interest. Phase III trials, conducted in large groups of people, further investigate the effectiveness of the new treatment, monitor its side effects, compare it to an established form of treatment for the disease, and collect information that helps determine the safe use of the new treatment. Phase IV trials, also called post-marketing surveillance studies, are conducted after the treatment has been marketed to gather more information on the side effects in various subgroups and any side effects associated with long-term use. Several statistical designs have been proposed to answer the research questions posed at each phase. In the first paper of this dissertation, we propose a method for designing phase II trials that allows for early termination of the trial for lack of efficacy. The second paper proposes a two-stage adaptive procedure for a phase III trial when there is statistical evidence that the effect of the treatment might differ depending on the characteristics of the individuals being investigated.

The third paper presents a modification of the two-stage adaptive procedure proposed in paper 2. The modified procedure allows for early termination of any stratum of the trial that shows little evidence of treatment efficacy at the beginning of the second stage. A conditional power approach is implemented in all three designs to: 1) terminate early a trial, or an arm of the trial, due to lack of evidence of efficacy in papers 1 and 3, respectively; and 2) control the type I error rate at the end of stage 2, conditioned on a statistically significant covariate by treatment interaction, for the procedures proposed in papers 2 and 3.

DEDICATION

To the Almighty God who has given me the grace and wisdom to come this far in my career. To my father, Chief J.M. Ayanlowo, who has supported me tirelessly through all my educational pursuits.

ACKNOWLEDGMENTS

I sincerely thank my mentor and advisor, Dr. David T. Redden, who has worked tirelessly with me to bring this dissertation to fruition. I appreciate his tenacity and yet openness to new ideas. I extend heartfelt gratitude to my committee members, Dr. Chris Coffey, Dr. Gary Cutter, Dr. Charles Katholi, Dr. David Kimberlin, and Dr. Sharina Person, whose insight and contributions to my dissertation work are immeasurable. I am thankful to Dr. O. Dale Williams for providing me an opportunity to jump-start my career as a biostatistician even as a student, and for the financial resources for most of my doctoral training. I am grateful to the faculty, staff, and students of the University of Alabama Department of Biostatistics for their contribution to my career and for giving me a chance to excel. Special thanks to Dr. George Howard and Dr. David Allison for their indispensable support throughout my doctoral studies. I am thankful to Mrs. Yhenneko Taylor, Mr. & Dr. (Mrs.) Ameko, Ms. Tina Dube, and Ms. Tamekia Jones for their support, prayers, and encouragement throughout my graduate studies. I am grateful to the faculty and staff of the Division of Preventive Medicine for their support throughout my doctoral studies. I am also grateful to the Biostatistics group at the Division of Preventive Medicine for giving me the opportunity to tackle practical issues that arise in the practice of clinical trials early in my career.

Thank you to the entire Ayanlowo, Ogundiran, and Elegbe families for their tireless support and contribution towards the success of my graduate studies. I am forever indebted to them. Finally, I would like to thank my husband, Olusegun Elegbe, Esq., who has given me enormous support during my doctoral studies.

TABLE OF CONTENTS

                                                                          Page

ABSTRACT ... ii
DEDICATION ... iv
ACKNOWLEDGMENTS ... v
LIST OF TABLES ... ix
LIST OF FIGURES ... xii
INTRODUCTION ... 1
    Phase I Clinical Trials ... 1
    Phase II Clinical Trials ... 3
        Phase II clinical trial designs ... 4
    Phase III Clinical Trials ... 8
        Randomization ... 9
        Response Adaptive Randomized Designs ... 11
        Covariate by Treatment Interactions ... 14
        Phase III Designs Accounting for Covariate by Treatment Interactions ... 15
    Phase IV Clinical Trials ... 19
    The Concept of Conditional Power ... 19
    Summary of Research Objectives ... 27
STOCHASTICALLY CURTAILED PHASE II CLINICAL TRIALS ... 29
A TWO-STAGE CONDITIONAL POWER ADAPTIVE DESIGN ADJUSTING FOR TREATMENT BY COVARIATE INTERACTION ... 55
AN EFFICIENT CLINICAL TRIAL DESIGN INVESTIGATING TREATMENT BY COVARIATE INTERACTION ... 95
CONCLUSION
    Concluding remarks on paper one
    Concluding remarks on paper two
    Concluding remarks on paper three
    Future research

TABLE OF CONTENTS (Continued)

GENERAL LIST OF REFERENCES
APPENDIX: COMPUTER PROGRAMS

LIST OF TABLES

Table                                                                     Page

STOCHASTICALLY CURTAILED PHASE II CLINICAL TRIALS

I    Comparison of type I error rates and expected sample size under H0 (p ≤ p0) for the simple binomial test with stochastic curtailment and Simon's minimax and optimal designs

II   Comparison of type I error rates and expected sample size under H0 (p ≤ p0) for Simon's minimax and optimal designs and Simon's designs enhanced with stochastic curtailment ... 41

A TWO-STAGE CONDITIONAL POWER ADAPTIVE DESIGN ADJUSTING FOR TREATMENT BY COVARIATE INTERACTION

I    Global type I error rate and stage 2 strata type I error rates under equal and adaptive allocation in the second stage using the uncorrected critical value, z = 1.96, for three effect size designs ... 66

II   Simulated second stage critical value (c2), the expected sample size under the null hypothesis of no covariate by treatment interaction, global type I error rates, and stage 2 strata type I error rate at the end of the trial under an equal allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type A interaction ... 75

III  Simulated second stage critical value (c2), the expected sample size under the null hypothesis of no covariate by treatment interaction, global type I error rates, and stage 2 strata type I error rate at the end of the trial under an adaptive allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type A interaction ... 76

IV   Simulated second stage critical value (c2), the expected sample size under the null hypothesis of no covariate by treatment interaction, global type I error rates, and stage 2 strata type I error rate at the end of the trial under an equal allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type B interaction ... 77

LIST OF TABLES (Continued)

V    Simulated second stage critical value (c2), the expected sample size under the null hypothesis of no covariate by treatment interaction, global type I error rates, and stage 2 strata type I error rate at the end of the trial under an adaptive allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type B interaction ... 78

VI   Simulated global power and stage 2 strata power at the end of the trial under an equal allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type A interaction ... 79

VII  Simulated global power and stage 2 strata power at the end of the trial under an adaptive allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type A interaction ... 80

VIII Simulated global power and stage 2 strata power at the end of the trial under an equal allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type B interaction ... 81

IX   Simulated global power and stage 2 strata power at the end of the trial under an adaptive allocation scheme in the second stage of the design using c2 for three effect size designs, varying total sample size N1, with a type B interaction ... 82

AN EFFICIENT CLINICAL TRIAL DESIGN INVESTIGATING TREATMENT BY COVARIATE INTERACTION

I    Simulated second stage average critical values (c1, c2), the expected sample size under the null hypothesis of no covariate by treatment interaction, stratum-wise type I error rates (αS), and experiment-wise type I error rates (αE) at the end of the trial using c1, c2 for three effect size designs for the single trial design, the Cohen-Simon design, the parallel 2-group sequential design, and the proposed design

II   Simulated experiment-wise power and the expected sample size under the alternative hypothesis of a treatment effect (δ = 0.10) at the end of the trial using c1, c2 for effect sizes 0.20 vs. (marginal effect size 0.20 vs. 0.30) for varying threshold values λ for the proposed design, the single trial design, the Cohen-Simon design, and the parallel 2-group sequential design

LIST OF TABLES (Continued)

III  Simulated experiment-wise power and the expected sample size under the alternative hypothesis of a treatment effect (δ = 0.125) at the end of the trial using c1, c2 for effect sizes 0.25 vs. (marginal effect size 0.25 vs. ) for varying threshold values λ for the proposed design, the single trial design, the Cohen-Simon design, and the parallel 2-group sequential design

IV   Simulated experiment-wise power and the expected sample size under the alternative hypothesis of a treatment effect (δ = 0.15) at the end of the trial using c1, c2 for effect sizes 0.35 vs. (marginal effect size 0.35 vs. 0.50) for varying threshold values λ for the proposed design, the single trial design, the Cohen-Simon design, and the parallel 2-group sequential design

LIST OF FIGURES

Figure                                                                    Page

STOCHASTICALLY CURTAILED PHASE II CLINICAL TRIALS

1    Plots illustrating the effect of the threshold value (θ) upon: (a) observed α; (b) observed power; and (c) average sample size (N) for 3 different stochastically curtailed phase II designs with three effect sizes

2    Graph of conditional power for a simulated study testing H0: p ≤ 0.2 versus H1: p ≥ 0.4, for a treatment whose true response proportion (p) is

A TWO-STAGE CONDITIONAL POWER ADAPTIVE DESIGN ADJUSTING FOR TREATMENT BY COVARIATE INTERACTION

1    Outline of proposed design

2    Plots of two types of treatment by covariate interaction with 2 treatments d1 and d2 and a covariate g with 2 levels, 1 and 2, for effect sizes 0.25 vs.

AN EFFICIENT CLINICAL TRIAL DESIGN INVESTIGATING TREATMENT BY COVARIATE INTERACTION

1    Outline of proposed design

INTRODUCTION

A clinical trial is a prospective research study designed to answer specific questions about the effect and worth of a new treatment, or of new ways of using an established treatment, in human beings. Clinical trials to determine the worth of a new treatment or drug are usually conducted in four (4) phases. Each phase is designed to answer a distinct research question about the utility of the new treatment or drug. The following sections describe each phase of a clinical trial. More details are provided for phase II and phase III trials, which are relevant to the methods discussed in this dissertation. This is not intended to be exhaustive, but to provide a basis for some of the issues specific to each phase of a clinical trial and the solutions that have been provided in the literature, particularly for phase II and III clinical trials.

1.1 Phase I Clinical Trials

In phase I trials, the new drug or treatment is usually tested in a small group of healthy individuals to determine its chemical activity and pharmacologic actions in humans and a safe dose range. The possible side effects and other toxicity issues associated with the new treatment or drug are also identified in phase I. A phase I trial is usually the first time that the new treatment is tested on humans. In the literature, several statistical designs that have been proposed for phase I clinical trials focus on estimating the maximum tolerable dose (MTD) or finding a safe dose range for the new treatment or drug.

The maximum tolerable dose is defined as the largest dose that can be given before unacceptable toxicity is experienced by patients [1,2,3]. The MTD is often estimated by applying a dose escalation/de-escalation rule to determine the dose each cohort of patients receives. The standard dose escalation rule has been defined by a Fibonacci series in which the increments of dose for succeeding levels are 100%, 67%, 50%, and 40%, followed by 33% for all subsequent levels [4]. Patients are entered into the study in groups of three. The first cohort of three patients is treated at a starting dose level determined from animal studies, usually one tenth of the lethal dose (LD10, i.e., the dose that is lethal to 10% of animals), and observed for severe toxicity for one course of the new treatment or drug before more patients are entered into the study. If none of the three patients experiences dose-limiting toxicity (DLT), the next group of three patients is treated at the next higher dose level. If at most one of the six patients experiences DLT, the third cohort of patients is entered into the study and treated at the 3rd higher dose from LD10. If at any time during the trial one of the three patients in a cohort experiences DLT, the next cohort of patients is treated at the same dose as the immediately previous cohort. If two or all three patients in a cohort experience DLT, then the next cohort of patients is treated at the next lower dose, unless six patients have already been treated at that dose. As a rule, if the occurrence of DLT is greater than 1/6 (i.e., more than one in six patients experiences DLT), the MTD is said to have been exceeded and the next cohort of patients is treated at the next lower dose. The highest dose at which the occurrence of DLT is less than 1/6 is considered the MTD. Estimating the MTD using the standard design described above can take a long time; as such, in recent years many authors have proposed other methods that estimate the MTD and also investigate possible variability among patients treated at a dose level without experiencing dose-limiting toxicity.
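A minimal simulation sketch of the escalation rule just described may help make the mechanics concrete. The dose levels, true DLT probabilities, and simplifications (for example, stopping as soon as a dose with two or more DLTs is observed) are our own illustrative assumptions, not part of the standard protocol above.

```python
import random

def three_plus_three(dlt_probs, seed=0):
    """Simulate a simplified version of the standard 3+3 escalation rule.

    dlt_probs: hypothetical true DLT probability at each dose level.
    Returns the index of the estimated MTD, or None if even the lowest
    dose appears too toxic."""
    rng = random.Random(seed)
    treated = [0] * len(dlt_probs)   # patients treated at each dose
    dlts = [0] * len(dlt_probs)      # DLTs observed at each dose
    level = 0
    while 0 <= level < len(dlt_probs):
        dlts[level] += sum(rng.random() < dlt_probs[level] for _ in range(3))
        treated[level] += 3
        if treated[level] == 3 and dlts[level] == 1:
            continue                  # expand the same dose to 6 patients
        if dlts[level] <= 1:
            level += 1                # 0/3 or at most 1/6 DLTs: escalate
        else:
            level -= 1                # more than 1 in 6 DLTs: MTD exceeded
            break
    # MTD: highest tried dose with a DLT rate of at most 1 in 6.
    for d in range(min(level, len(dlt_probs) - 1), -1, -1):
        if treated[d] > 0 and dlts[d] / treated[d] <= 1 / 6:
            return d
    return None

# Five hypothetical dose levels with increasing toxicity.
print(three_plus_three([0.05, 0.10, 0.20, 0.35, 0.55]))
```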

Collins et al [5] suggested accelerating the dose escalation by using the area under the concentration versus time curve at the estimated LD10 in mice as the target exposure in humans. Storer [6] proposed the use of a single patient per dose level until the first dose-limiting toxicity is observed. Storer also proposed the use of a logistic regression model fitted to the dose level versus DLT occurrence data to estimate the MTD. Expanded details of these methods are beyond the scope of this dissertation.

1.2 Phase II Clinical Trials

After the estimation of the maximum tolerable dose of the new treatment/drug in a phase I trial, the drug is moved to a phase II trial to determine its effectiveness in treating the disease of interest. Phase II clinical trials are often aimed at accomplishing two objectives: 1) to evaluate the therapeutic efficacy and safety of a new treatment/drug in the treatment of diseases/impairments, and 2) to estimate the true response rate of the new treatment. Phase II studies are often conducted in a larger group of individuals, compared to the size of phase I trials, typically from the diseased population of interest. Other common short-term side effects and risks of the new treatment or drug are also ascertained at this phase of the drug discovery process. For ethical and efficiency reasons, a desirable feature of a phase II trial is to quickly stop the trial when the treatment shows unacceptably low therapeutic activity [7]. This feature of a phase II clinical trial motivates the phase II clinical designs proposed in the first paper of this dissertation. It is also beneficial to stop a trial and move to a phase III comparative trial when there is clear evidence of efficacy.

Several statistical designs have been developed that meet these requirements of phase II clinical trials [7,8,9,10]; in the following section we discuss a few of these designs and the designs proposed in the first paper of this dissertation.

1.2.1 Phase II Clinical Designs

The oldest phase II design is a frequentist design proposed by Gehan in 1961 [11]. Gehan proposed this design to deal with some of the medical difficulties associated with setting up a trial to test the effectiveness of a new chemotherapeutic (cancer treatment) agent. To ensure that the probability of passing a new chemotherapeutic agent to a phase III comparative trial and the probability of the agent being abandoned after the phase II trial are controlled to some degree for various true levels of effectiveness (response rates) of the agent, Gehan proposed a two-stage design that rejects a potentially ineffective treatment early when no success is observed among the n1 patients in the first stage of the design. The number of patients required for the first stage is determined such that

    β1 = Pr(0 patients responding to the new treatment) = (1 − p)^n1.    (1)

Equation (1) states that the probability of rejecting a new treatment which has a true response rate of p, after n1 consecutive failures, is equal to β1 [8]. The response rate is the estimated proportion of patients expected to respond to the new treatment. Gehan's design is generally used to test the null hypothesis

    H0: The drug is unlikely to be effective in x percent of the patients or more

versus

    HA: The drug could be effective in x percent of patients or more.

In most applications of Gehan's design, investigators test the hypothesis with x = 20%, which implies testing a therapeutic effectiveness of 0.20. This implies that we will have enough evidence to reject H0 and conclude that the treatment is effective if and only if more than 20% of the patients respond to the treatment. If we set β1 = 0.05 (the probability of failing to reject H0 given that the treatment is effective) and p = 0.20 (a 20% response rate), then according to equation (1) we will need at least 14 patients (n1) in the first stage of our trial. If none of the 14 patients responds to the treatment, the trial will be terminated; that is, we will fail to reject H0 and conclude that the treatment is ineffective, because if the true response rate were at least 0.20, then at least one response would have been expected among the first 14 patients. If at least one patient responds to the treatment, the trial proceeds to a second stage. The number of patients to be included in the second stage depends upon the number of successes (x1) observed in the first stage and upon the precision desired for the final estimate of the response rate (p). If the first stage consists of 14 patients, then the second stage will consist of between 1 and 11 patients if a standard error of 0.10 is desired, and between 45 and 86 patients if a standard error of 0.05 is desired. The final estimate of the response rate is given by p̂ = (x1 + x2)/N, where N = n1 + n2 is the total number of patients enrolled in the trial and x2 is the number of successes observed in the second stage of the trial.
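As a check on the arithmetic above, the first-stage size of Gehan's design can be obtained by solving equation (1) for n1. The sketch below is our own illustration; the function name and rounding convention are not Gehan's.

```python
import math

def gehan_stage1_size(p, beta1):
    """Smallest n1 such that (1 - p)**n1 <= beta1, i.e. the probability of
    observing zero responses among n1 patients is at most beta1 when the
    true response rate is p (equation (1))."""
    return math.ceil(math.log(beta1) / math.log(1.0 - p))

print(gehan_stage1_size(p=0.20, beta1=0.05))   # 14 patients
# Chance of at least one response in 14 patients when the true rate is 0.05:
print(1.0 - (1.0 - 0.05) ** 14)                # roughly 0.51
```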

A limitation of Gehan's design is that while it controls the probability of rejecting an effective treatment (β), it fails to control the probability of accepting an ineffective treatment (α); as such, for a new treatment with a true response rate of 0.05, there is a 51% chance of obtaining at least one response among the first 14 patients. Consequently, the first stage of 14 patients would not effectively screen out potentially ineffective drugs with a true response rate of 5%. To deal with this limitation of Gehan's design, Simon [7] proposed two phase II clinical trial designs, the minimax and optimal designs, each with two stages. Simon's optimal design minimizes the expected number of patients exposed to a treatment whose true response rate is less than or equal to the null hypothesized value p0. The minimax design minimizes the maximum total sample size (N) required for the trial. The usual hypothesis tested by both designs is

    H0: p ≤ p0  vs.  HA: p ≥ p1,

where p is the true proportion responding to the new treatment (response rate), p0 is the greatest proportion of response which is deemed clinically ineffective, and p1 is the smallest proportion of response which is deemed clinically effective. In both designs, the number of patients in each stage is not specified by the investigators but is the result of a minimization constraint utilizing a pre-specified overall type I error rate (α, the probability of concluding an ineffective drug is effective) and type II error rate (β, the probability of failing to recognize an effective treatment). The minimization constraint under the optimal design ensures that the probability of rejecting an ineffective treatment at the end of the first stage is high, and as such the design does not permit a large second stage. Because the minimax design limits the maximum sample size, and hence the maximum duration, of the trial, it is usually preferred over the optimal design in situations where the patient accrual rate is low.
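To make the minimization constraint concrete, the sketch below performs a small brute-force search for Simon two-stage designs; the search ranges, helper names, and design parameters are our own illustrative choices, and a production search would scan a much larger grid far more efficiently.

```python
import numpy as np
from scipy.stats import binom

def prob_declare_effective(n1, r1, n, r, p):
    """P(continue past stage 1 and exceed r total responses | response rate p):
    the drug is abandoned after stage 1 if X1 <= r1, otherwise it is declared
    effective at the end of the trial if X1 + X2 > r."""
    x1 = np.arange(r1 + 1, n1 + 1)
    return float(np.sum(binom.pmf(x1, n1, p) * binom.sf(r - x1, n - n1, p)))

def search_designs(p0, p1, alpha, beta, n_max=35):
    """All (n1, r1, n, r, EN0) satisfying the type I and type II constraints."""
    found = []
    for n in range(2, n_max + 1):
        for n1 in range(1, n):
            for r1 in range(0, n1):
                for r in range(r1, n):
                    if prob_declare_effective(n1, r1, n, r, p0) > alpha:
                        continue
                    if prob_declare_effective(n1, r1, n, r, p1) < 1 - beta:
                        continue
                    pet0 = binom.cdf(r1, n1, p0)        # P(early stop | p0)
                    en0 = n1 + (1 - pet0) * (n - n1)    # expected N under H0
                    found.append((n1, r1, n, r, en0))
    return found

# Unoptimized toy search; this takes a little while to run.
designs = search_designs(p0=0.10, p1=0.30, alpha=0.05, beta=0.20)
print(min(designs, key=lambda d: d[4]))  # "optimal": smallest expected N under H0
print(min(designs, key=lambda d: d[2]))  # "minimax": smallest maximum N (ties not broken)
```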

A drawback of both of these designs, which are commonly used in practice, is that under certain combinations of p0, p1, and type I and type II error rates, they can fail to provide an opportunity to terminate a trial early when there is a long series of failures at the beginning of the trial [8], thereby exposing more patients than necessary to a potentially ineffective treatment. This drawback of Simon's optimal and minimax designs justifies the need for a design, like the one proposed in paper one of this dissertation, that provides more opportunities to terminate a trial when the treatment is potentially ineffective. Paper 1 of this dissertation presents greater detail on Simon's minimax and optimal designs. In the first paper of this dissertation, Ayanlowo and Redden [12], we propose three alternative phase II clinical trial designs that incorporate stochastic curtailment rules to accomplish the first objective of a phase II clinical trial. Stochastic curtailment is a sequential monitoring approach to clinical trials that allows unplanned interim analyses to be carried out at unspecified times during the trial. Stochastic curtailment rules allow for calculation of the probability of rejecting H0 at the end of the trial given the current number of observed responses and assuming either H0 or H1 is true [9,13]. Specifically, we use a stochastic curtailment rule based on conditional power in the development of our proposed designs; a brief literature review of conditional power is presented in section 1.5. In paper 1, we compare and contrast the properties of the three proposed designs, 1) the stochastically curtailed (SC) binomial test, 2) the SC Simon's optimal design, and 3) the SC Simon's minimax design, with those of Simon's minimax and Simon's optimal designs. For each of these designs, we compare and contrast the number of opportunities for study termination, the expected sample size of the trial under the null hypothesis (p ≤ p0), and the effective type I and type II errors. We also present graphical tools for monitoring phase II clinical trials with stochastic curtailment using conditional power.
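As an illustration of the kind of curtailment rule involved, the following is a generic sketch of conditional power for a single-arm binomial trial; it is not the exact rule developed in paper 1, and the trial numbers are hypothetical.

```python
from scipy.stats import binom

def conditional_power(x_obs, n_obs, n_total, r_cut, p_assumed):
    """P(total responses exceed r_cut at the planned end of the trial),
    given x_obs responses among the first n_obs patients and assuming the
    remaining n_total - n_obs patients respond with probability p_assumed."""
    return binom.sf(r_cut - x_obs, n_total - n_obs, p_assumed)

# Hypothetical trial: reject H0 if more than 12 of 43 patients respond;
# 2 responses have been seen after 15 patients; assume p1 = 0.40 from here on.
print(conditional_power(x_obs=2, n_obs=15, n_total=43, r_cut=12, p_assumed=0.40))
```

A stochastically curtailed design would stop such a trial for futility whenever this probability falls below a pre-specified threshold.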

We do not consider the estimation of the true response rate of the new treatment in the development of the proposed designs. Details of our proposed designs are presented in paper 1 of this dissertation.

1.3 Phase III Clinical Trials

Phase III trials are usually randomized controlled trials, conducted in large groups of patients, to further investigate the effectiveness of the new treatment, further monitor its side effects, compare it to an established form of treatment for the disease or to a placebo, and collect information that can help determine the safe use of the new treatment. A randomized controlled trial is a trial in which patients are allocated to either a control group (standard treatment) or an intervention group (new treatment), with the aim of unbiased assignment of patients and, consequently, minimization of possible differences between patients assigned to the different groups. At the beginning of a phase III clinical trial, the characteristics of patients assigned to the control group and the intervention group must be sufficiently similar so that differences in the outcome of interest may be reasonably attributed to the new treatment. The outcome of interest is usually a measurable effect of the new treatment; for instance, in a phase III trial of a new therapeutic agent for cancer, the outcome of interest could be a reduction in the size of the tumor or the time to remission of the cancer. The balance of the characteristics of the patients assigned to either the control group or the intervention group at baseline is a consequence of randomization. We present an overview of randomization in the next section. Some design issues related to phase III clinical trials and relevant to the designs proposed in the second and third papers of this dissertation are discussed in sections 1.3.2 and 1.3.3.

1.3.1 Randomization

Randomization in its simplest form reduces the potential for bias in treatment assignment within a trial. Various randomization schemes have been proposed, most of which aim to reduce or eliminate different sources of bias in a clinical trial. The simplest type of randomization is equal allocation (simple coin toss) randomization, also referred to as complete randomization. This form of randomization assumes that, given a sequence of treatment assignments t1, t2, ..., tn, the assignments are independently and identically distributed as Bernoulli random variables with probability of success p = Pr(ti = 1) = 0.5, where p is constant throughout the trial and ti = 1 if treatment A is assigned and ti = 0 otherwise. Another frequently used randomization scheme is blocked randomization, in which patients are grouped into blocks for the purpose of randomization to either of the two treatment arms [14]. In blocked randomization, blocks of k patients are created and each block of patients is assigned to a randomly selected combination of treatment assignments based on the block size. For instance, with a block size of k = 4 patients, there are only six (4!/(2!2!)) possible combinations of treatment assignment to treatment A or B: ABAB, ABBA, BABA, BAAB, BBAA, AABB. One of these combinations of treatment assignments is randomly selected and the block of 4 patients is assigned accordingly. Note that the block size has to be divisible by the number of treatment arms in the trial.
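A minimal sketch of the permuted-block scheme just described; the block size, arm labels, and function name are illustrative assumptions.

```python
import itertools
import random

def permuted_block_randomization(n_patients, block_size=4, arms=("A", "B"), seed=1):
    """Assign patients in blocks within which every arm appears equally often,
    so the allocation never drifts far from balance."""
    base_block = [arm for arm in arms for _ in range(block_size // len(arms))]
    # All distinct orderings of one block (six orderings when block_size = 4).
    orderings = sorted(set(itertools.permutations(base_block)))
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_patients:
        assignments.extend(rng.choice(orderings))   # pick one block at random
    return assignments[:n_patients]

print("".join(permuted_block_randomization(12)))    # balanced within each block of 4
```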

Blocked randomization controls for potential trends in recruitment or changes in the diseased population over time, and ensures that at every point during the trial the imbalance in patient assignment is never large and that at certain points the numbers of patients assigned to each treatment group are equal. Other forms of randomization have been proposed in the literature. A commonly suggested alternative to equal allocation randomization is adaptive randomization, in which the probability of assignment to treatment A changes throughout the course of the trial. Adaptive randomization schemes can update the probability of future treatment assignment based either on pre-specified covariates (covariate adaptive randomized designs, CARDs) or on patients' responses (response adaptive randomized designs, RARDs). A covariate is usually a specific characteristic of the patients in the trial; for example, the gender, educational level, or age of the patients could be considered covariates. Covariate adaptive randomized designs are useful when there is a need to ensure balance between treatment arms with respect to certain known covariates. Response adaptive randomized designs are beneficial when ethical issues make it unfavorable to allocate an equal number of patients to each treatment arm. Recently, the idea of randomizing a patient to a treatment arm based on the patient's predicted probability of treatment response, after adjusting for the effect of a pre-specified covariate, has been proposed [15]. Brief details of the response adaptive randomized design and of Rosenberger et al's covariate adjusted response adaptive design are presented in sections 1.3.2 and 1.3.4, respectively. Details of covariate adaptive randomized designs are beyond the scope of this dissertation.

1.3.2 Response Adaptive Randomized Designs

Response adaptive randomized designs update the probability of future assignment to a treatment arm based on the responses of all patients already treated. The main goal of response adaptive randomized designs is to assign more patients to the superior treatment. The argument for such a design is that more patients will benefit from the trial by having fewer allocations to the inferior treatment, as determined by the information accrued so far in the trial. Because the determination of which treatment is the better of the two is based on all the information accrued about the treatment effect up to time ti, the big issue that arises in the analysis of such data is correlation among treatment allocations. A commonly referenced response adaptive randomized design for binary responses is the randomized play-the-winner rule [16]. The design is usually described in terms of an urn model. The trial begins with an initial urn composition depending on the choice of the randomized play-the-winner (RPW) allocation rule. The RPW allocation rule is guided by the choice of υ and γ, where υ is the number of balls representing each treatment in the urn at the beginning of the trial, which is termed the initial urn composition, and γ is the number of balls used to update the urn as the trial proceeds. Assume we have two treatments, A and B, indexed by a red and a blue ball respectively, and we choose to use an RPW(1,1) rule; the trial proceeds as follows. The trial begins with 1 red and 1 blue ball in the urn; this is termed the initial composition of the urn. The 1st patient is assigned to either of the two treatments with a probability of 0.5 each. Assume the patient is randomized to treatment A by drawing, with replacement, the red ball. Suppose a success is observed; the urn is then updated with 1 red ball.

The probability of randomizing the 2nd patient to treatment A or B is then 2/3 or 1/3, respectively. Assume the 2nd patient is randomized to treatment B; if we observe a failure, the urn is updated with 1 red ball, whereas if the 2nd patient responds to treatment B, the urn is updated with a blue ball. The urn is continuously updated until the nth patient is assigned to one of the two treatment arms. On average, the rule places more patients on the potentially better treatment. Delayed responses and staggered entries can be modeled with various distributions from the exponential family. Trials that utilize the RPW(υ,γ) rule in their randomization analyze the data using permutation tests, which can be complicated and complex. An undesirable property of the RPW(υ,γ) rule is that, for pA + pB > 3/2, the variance of the proportion of patients allocated to treatment A, NA(n)/n, depends on the initial composition of the urn [17]; pA and pB are the response rates for treatments A and B, respectively. The smaller the initial composition, the larger the variation in the proportion of patients assigned to the better treatment. This property makes the RPW(υ,γ) rule unattractive in practice because of the difficulty in selecting the initial urn composition. This dependence of the outcome of the randomization on the initial composition of the urn when using the RPW(υ,γ) rule led to the ambiguity in the results of the well-known ECMO trial.
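A minimal simulation of the RPW(υ, γ) urn scheme described above; the response probabilities, sample size, and seed below are hypothetical.

```python
import random

def rpw_trial(p_a, p_b, n_patients, u=1, g=1, seed=0):
    """Simulate a randomized play-the-winner RPW(u, g) allocation: the urn
    starts with u balls per arm; a success adds g balls of the same color,
    a failure adds g balls of the opposite color."""
    rng = random.Random(seed)
    urn = {"A": u, "B": u}
    assignments = []
    for _ in range(n_patients):
        # Drawing with replacement: choose an arm with probability
        # proportional to its current number of balls.
        arm = "A" if rng.random() < urn["A"] / (urn["A"] + urn["B"]) else "B"
        assignments.append(arm)
        success = rng.random() < (p_a if arm == "A" else p_b)
        if success:
            urn[arm] += g
        else:
            urn["B" if arm == "A" else "A"] += g
    return assignments

alloc = rpw_trial(p_a=0.8, p_b=0.2, n_patients=12, seed=3)
print(alloc.count("A"), "of 12 patients assigned to treatment A")
```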

The ECMO trial was a prospective controlled randomized phase II trial of the use of extracorporeal membrane oxygenation (ECMO), conducted at the University of Michigan Medical Center, Ann Arbor, in 1984 [18]. The trial had two treatment arms, the extracorporeal membrane oxygenation arm and a conventional ventilator therapy arm. The trial involved 12 newborns with respiratory failure, aged up to 1 week, with a birth weight > 2 kg, who had a high mortality risk (≥ 90%). The primary outcome was death from respiratory failure, with a secondary outcome of lung recovery or existence of BPD. The investigators chose to use the RPW(1,1) rule in the randomization of the trial participants to either of the two treatments because: 1) patient outcome was known soon after randomization, as most of these infants die within the first week of life; 2) it was a reasonable approach to the scientific/ethical dilemma of unnecessarily withholding an effective treatment from trial participants; and 3) it was anticipated that most ECMO patients would survive and most control patients (those on conventional treatment) would die, so significance could be reached with a modest number of patients. They planned to cease randomization whenever 10 balls of one type had been added to the urn. After randomization ceased, all additional patients would be assigned to only the treatment represented by the most balls in the urn, that is, the treatment that gave the better results. The study was terminated with 11 patients assigned to ECMO, all of whom survived, and 1 patient assigned to conventional ventilator treatment, who died. Based on the results of the trial, the investigators concluded that ECMO allows lung rest and improves survival compared to conventional ventilator therapy in newborn infants with severe respiratory failure. Given that these emphatic results were based on a trial involving only 12 patients, with a seemingly biased patient-treatment allocation, there was an outcry from the medical research community over the credibility of the trial results and conclusion. In retrospect, the investigators should have allowed the initial composition of the urn to be larger than 1 ball each, which would probably have resulted in a different patient-treatment allocation than that observed in the trial. Note that the initial composition of the urn and the response rates of patients assigned to treatments A and B determine the rate of deviation from the 0.50 probability of treatment assignment.

In recent years, other trials of the use of ECMO in newborn infants, conducted using conventional forms of randomization (i.e., equal allocation, complete block randomized designs, etc.), have supported some of the conclusions from the initial ECMO trial.

1.3.3 Covariate by Treatment Interactions

In phase III clinical trials, there is the possibility for the effect of the new treatment to differ across different levels of a covariate. When the treatment effect differs depending on the level of a covariate, an interaction is said to exist between the covariate and the treatment. The treatment effect is measured as the difference between the response rate of treatment A and the response rate of treatment B. Two kinds of covariate by treatment interaction are distinguished in the literature: qualitative interaction and quantitative interaction. A variation in the magnitude, but not the direction, of the treatment effect is called a quantitative interaction [19]. A qualitative interaction occurs when the direction of the treatment effect differs by the level of the covariate. Another kind of interaction that could occur in practice is one that causes the treatment effect to occur solely in one subset of the population, or one level of the covariate. This could be the case when a subset of the patients with a common attribute does not respond well to either of the two treatments, while the other subset of patients responds better to one of the treatments compared to the other. Accurately determining the true effect of the new treatment in the presence of an interaction between the treatment and a covariate is an important goal of a phase III clinical trial design. To our knowledge, few statistical designs have been proposed in the literature that allow the effect of an interaction to be adjusted for during the course of the trial; usually the effect of the interaction is accounted for during the analyses at the planned end of the trial.

The limited literature related to possibly accounting for the effect of a treatment by covariate interaction during the implementation of a phase III clinical trial motivated the designs proposed in papers 2 and 3 of this dissertation. In section 1.3.4, we briefly review two designs that have been proposed in the literature to account for a covariate by treatment interaction during the implementation of the trial.

1.3.4 Phase III Designs Accounting for Covariate by Treatment Interactions

Rosenberger et al [15] proposed a method that allows investigators to incorporate the possible existence of an interaction between the treatment and a covariate into the design and randomization of the trial. They achieve this by adjusting for the possible effects of the treatment, covariate(s), and interaction(s) during randomization, under the assumption that a true interaction exists. This approach is considered an adaptive randomization method. Rosenberger and his colleagues suggest updating the probability of future treatment assignment using the estimates of the treatment, covariate, and covariate by treatment interaction effects obtained from fitting a standard logistic regression model to the information accrued as the trial is conducted. They allow for a burn-in period before the adaptive randomization procedure begins (i.e., before fitting the first logistic regression model), but do not conduct a formal test of interaction. The duration of the burn-in period is arbitrarily determined by the first time the logistic regression algorithm converges. Whereas the paper by Rosenberger et al proposes an interesting approach to adjusting for the possible effect of a covariate by treatment interaction, there are multiple unresolved issues in their approach.

These issues include: 1) no justification of the overall sample size of the trial and no discussion of the consequences of assuming a true covariate by treatment interaction; 2) an arbitrary starting point for adaptive allocation (i.e., the first time the logistic regression algorithm converges); and 3) no formal statistical test of the covariate by treatment interaction. Thall and Wathen [20] proposed a two-stage design similar to Rosenberger's procedure but using a Bayesian framework. Thall and Wathen update the probability of treatment assignment based on the posterior distribution of the parameter of interest. Both approaches allow for continuous and categorical covariates. To resolve some of the issues presented by Rosenberger et al's design, in the second paper of this dissertation we propose a two-stage adaptive procedure which allows for a formal test of the interaction. The proposed design is intended to provide an alternative to Rosenberger's design. The first stage of our design utilizes an equal randomization scheme, and its information is used to conduct a test of interaction. The statistical significance of the test of interaction determines the design of the second stage of the procedure. If the test of interaction is not significant, enough additional patients are accrued to provide 80% power to detect a treatment effect of the specified magnitude. A test of treatment effect adjusting for the effect of the covariate is conducted after accrual of the additional patients. If a statistically significant test of interaction is observed at the specified α-level, the trial proceeds to a stage 2 in which the patients within each subset of the covariate are considered separately. In stage 2, the trial accrues n2 patients within each subset of the covariate, and the final stage 2 test statistic for the test of treatment effect is conditioned on the significance of the interaction. For trials with a significant interaction, we implement a conditional power approach at the beginning of the second stage to control the overall type I error rate while maintaining adequate power to detect the hypothesized treatment effect (θ = δ) within each covariate stratum.
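For intuition, a formal test of interaction with a binary outcome and a two-level covariate can be carried out as a large-sample Wald test on the difference between the stratum-specific treatment effects. This is a generic sketch with hypothetical stage 1 counts; it is not necessarily the exact test statistic used in paper 2.

```python
from math import sqrt
from scipy.stats import norm

def interaction_z_test(successes, totals):
    """Wald z-test of a treatment by covariate interaction for binary data.
    Inputs are dicts keyed by (treatment, stratum); the interaction estimate
    is the difference between the stratum-specific response-rate differences."""
    p = {k: successes[k] / totals[k] for k in successes}
    est = (p[("A", 1)] - p[("B", 1)]) - (p[("A", 2)] - p[("B", 2)])
    var = sum(p[k] * (1 - p[k]) / totals[k] for k in p)
    z = est / sqrt(var)
    return z, 2 * norm.sf(abs(z))          # statistic and two-sided p-value

# Hypothetical stage 1 data: the treatment appears to work in stratum 1 only.
successes = {("A", 1): 45, ("B", 1): 25, ("A", 2): 27, ("B", 2): 26}
totals = {key: 100 for key in successes}
print(interaction_z_test(successes, totals))
```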

Cohen and Simon in 1997 [21] proposed a two-stage procedure that determines the size of its first stage based on the assumption that no treatment by covariate interaction exists. The information from stage 1 is used to conduct a formal test of interaction. If the interaction is not statistically significant at the specified significance level, the trial is terminated at the end of the first stage and the significance of the marginal effect size is tested. If the interaction is statistically significant, the trial proceeds to consider each level of the covariate separately, with the possibility of accruing more patients in the second stage. Under the Cohen and Simon [21] design, the sample size of the second stage is the same as the sample size of the first stage. Cohen and Simon [21] test the main effect within a stratum immediately after a significant test of interaction is observed. The test is conducted using boundaries derived from a group sequential design which assumes that an interim test within a stratum is conducted at N/2 and a final test at N. If the interim test statistic exceeds the interim boundary, the trial ends within the stratum and a significant treatment effect is reported. For any stratum in which the interim test statistic does not exceed the critical value, accrual continues until N individuals are enrolled. After accrual of N individuals, the final test is conducted using the final prespecified critical value. The third paper of this dissertation presents a two-stage adaptive procedure similar to the design proposed by Cohen and Simon [21]. The procedure is a modification of the procedure proposed by Ayanlowo and Redden [22], paper 2 of this dissertation. Like the design proposed in the second paper of this dissertation, this procedure uses the information in its first stage to conduct a formal test of interaction. If a non-significant test of interaction is observed, the trial is terminated and a test of treatment effect adjusting for a covariate effect is immediately conducted.

For trials which observe a statistically significant test of interaction at the specified α-level, as in the Cohen-Simon design, the trial is split into j strata at the beginning of the second stage, where j is the number of levels of the covariate of interest. At the beginning of the second stage, a test of significance of the treatment effect is conducted within each stratum using the information from stage 1. If the test of treatment effect is statistically significant within any stratum, that stratum (or strata) is terminated at the beginning of the second stage, with no accrual occurring within the stratum in stage 2. For any stratum (or strata) in which a non-significant test of treatment effect is observed, we compute the probability of rejecting the null hypothesis of no treatment effect at the end of the second stage, conditional upon the information observed in the first stage and the assumption that the treatment effect is δ (i.e., the conditional power). If the conditional power falls below a pre-specified threshold value λ (the conditional power threshold) within any stratum, that stratum (or strata) is terminated before accrual begins in the second stage. Details of this procedure are outlined in paper 3 of this dissertation. We examine the statistical properties of the proposed procedure under different treatment effect sizes and various values of the conditional power threshold (λ). Under the assumption that the null hypothesis is true (no treatment effects within any stratum), we compare and contrast the proposed design, the Cohen-Simon design, and two other approaches with regard to experiment-wise type I error rates, stratum-specific type I error rates, and expected sample size. Details of the two other approaches are presented in paper 3 of this dissertation. Under the assumption that the alternative hypothesis is true (treatment effects exist within only one stratum), we compare and contrast the four approaches with regard to experiment-wise power, stratum-specific power, and expected sample size.

1.4 Phase IV Clinical Trials

Phase IV trials, also called post-marketing surveillance studies, are conducted after the treatment has been approved by the FDA or other regulatory agencies to gather more information on side effects that are rare, on side effects in various subgroups, and on any side effects associated with long-term use. Such side effects detected in phase IV trials may result in the withdrawal or restricted use of the treatment or drug. Sometimes, in phase IV trials, the treatment or drug may be tried on slightly different patient populations than those studied in earlier phases of the drug discovery process. In the United States, phase IV studies are often mandated by the drug regulatory authority, the Food and Drug Administration (FDA). Most phase IV trial designs are observational in nature and often suffer from under-reporting of events.

1.5 The Concept of Conditional Power

All three papers contained in this dissertation utilize a conditional power approach to stop early either a trial or an arm of a trial due to lack of evidence of treatment efficacy. In the first paper of this dissertation, Ayanlowo and Redden [12], we propose methods that use conditional power to provide multiple early stopping rules in a phase II trial to declare a new treatment ineffective. In the second and third papers of this dissertation, we utilize the definition of conditional power under the null hypothesis to control the overall type I error rate of our proposed two-stage adaptive designs.

Additionally, in paper 3, we implement a stopping rule based on conditional power to terminate early a subset of the trial that shows little evidence of a treatment effect at the beginning of the second stage of the design, for trials that observe a significant test of interaction. In this section, we briefly review the concept of conditional power. Conditional power is a stochastic curtailment procedure proposed by Lan, Simon and Halperin in 1989 [13]. Lan et al defined conditional power as the conditional probability of rejecting the null hypothesis of no treatment effect at the planned end of the trial, given the observed information, under the assumption that the alternative hypothesis (θ = δ) is true. Papers 1 and 3 provide formulas for the computation of conditional power under the hypothesized alternative. Other authors have considered the computation of conditional power using some value of the parameter of interest other than that pre-specified under the alternative hypothesis. Pepe and Anderson [10] considered conditional power computations under the current estimate of the treatment effect, θ = δ̂, where δ̂ is estimated from the observed data at the time of data monitoring. Lan et al [13] discussed the use of conditional power as a tool for stopping a trial early, before its planned end. They considered stopping the trial early for futility if the conditional power under the alternative hypothesis was below a pre-specified threshold value λ; this implies that, given the current data, the probability that the null hypothesis would be rejected at the planned end of the trial, assuming the hypothesized treatment effect under the alternative were true for the remainder of the trial, is too small. Lan and his colleagues also considered stopping the trial early when there is ample evidence, based on the conditional power at the current time, that the null hypothesis would be rejected at the planned end of the trial even if the null hypothesis were assumed to be true for the remainder of the trial. Lan and Wittes [23] proposed using a B-value instead of the usual Z test statistic in the computation of conditional power.

The B-value is a transformation of the Z-value observed at the time of the conditional power calculation and is simply computed as B = Z_n √(n/N), where n is the number of observations at the time of the conditional power computation, Z_n is the observed value of the z-statistic, and N is the planned total sample size of the trial; n/N is the proportion of the total planned information observed at the time of data monitoring [23]. The original work on conditional power by Lan et al [13] and Lan and Wittes [23] was based on binary outcomes, but conditional power is a statistical tool that is adaptable to other kinds of outcomes. For instance, P.K. Andersen [24] considered the use of conditional power in aiding the decision on whether to continue a clinical trial in which a time-to-event outcome, the survival times of patients in two treatment arms, is compared. Andersen based the conditional power computation in this context on the current estimates of accrual rates, dropout rates (necessary to determine censoring of the data), and death rates within each of the treatment arms, using a variation of the log-rank statistic. Andersen defined conditional power as the probability that the log-rank test statistic at the planned end of the trial will fall within the rejection region, conditional on the death rates and the total observation time at the current time, under the assumption that the relative hazard (risk ratio) of dying remained at θ = θA (the originally postulated clinically relevant difference between the two treatment arms) and that the estimated accrual rate was constant during the remainder of the trial. The observation time is defined as the patient time in months, i.e., the duration of a patient in the trial measured in calendar months. The hazard function was assumed to be constant over time in each of the two treatment arms (i.e., the survival times are exponentially distributed), and all patients that had no events at the time of the conditional power computation were censored.
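A minimal sketch of a conditional power calculation in this B-value framework for a one-sided test; the drift parameter theta is the expected value of the final Z statistic under the assumed treatment effect, and the interim numbers below are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def conditional_power(z_n, n, N, theta, alpha=0.025):
    """Conditional power for a one-sided test in the Lan-Wittes B-value
    framework; theta is the drift, i.e. the expected final Z-statistic
    under the assumed treatment effect (theta = 0 gives conditional
    power under H0)."""
    t = n / N                       # fraction of planned information observed
    b = z_n * sqrt(t)               # B-value: B(t) = Z_n * sqrt(n / N)
    z_crit = norm.ppf(1 - alpha)
    return norm.sf((z_crit - b - theta * (1 - t)) / sqrt(1 - t))

# Hypothetical interim look: Z = 1.1 after 60 of 120 planned patients, in a
# design powered so that the expected final Z under the alternative is 2.8.
print(conditional_power(z_n=1.1, n=60, N=120, theta=2.8))   # under H1
print(conditional_power(z_n=1.1, n=60, N=120, theta=0.0))   # under H0
```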

The conditional power approach as proposed by Andersen was used in aiding the decision of whether to continue accrual after the predesigned end of a trial of testosterone treatment of men with alcoholic cirrhosis [25]. This trial was part of the Copenhagen Study Group for Liver Diseases, conducted in Copenhagen, Denmark. The testosterone study was a double-blinded controlled clinical trial with a treatment arm and a placebo arm. The trial was designed to study the effect of testosterone treatment on the survival of males with alcoholic cirrhosis. Alcoholic cirrhosis is a permanent scarring of the liver caused by alcohol abuse. An accrual rate of about five patients per month within each group was expected, with an assumed mortality rate of 0.02 per month for the control group. A reduction in mortality rate of about 4% (exp(-0.4) = 0.7, θ = 0.7) for the testosterone arm was the postulated clinically relevant difference, and 80% power to detect this difference at a 5% significance level was used in the sample size calculations for the trial. At the predesigned end of the trial, an estimated relative hazard of 1.36 (λT/λC) was observed, with an estimated 95% confidence interval of (0.68, 2.71). Given that the estimated 95% confidence interval of the relative hazard included the value 1, the risk of dying from alcoholic cirrhosis was not statistically significantly different between the males on testosterone treatment and those on placebo. This meant that at the predesigned end of the trial, the testosterone treatment showed little or no evidence of improving survival of males with alcoholic cirrhosis. Based on this observed information, the trial investigators, who were blinded to the direction of the effect (i.e., they did not know whether the estimated relative hazard of 1.36 was the hazard ratio of treatment to control, λT/λC, or that of control to treatment, λC/λT), considered extending the trial for another 12 months.

To aid the decision of whether or not to extend the trial, conditional power computations were conducted using the approach proposed by Andersen [24]. The conditional power was computed based on the currently observed number of deaths within each group and the total observation time (i.e., patient time in months, assumed to be equal in each group), under the assumption that the postulated clinically relevant difference of θ = 0.7 would still be detectable if the trial were extended for another 12 months. The conditional power of declaring the clinically relevant difference of θ = 0.7 significant at the end of the 12 months was less than 5%. Given that the conditional power was less than 5% and that the observed hazard ratio of 1.36 was in the opposite direction than anticipated, the group decided to stop accrual for the trial. Henderson et al in 1989 [26] proposed an algorithm for the computation of conditional power for a time-to-event outcome. Henderson et al's algorithm is a modification of an algorithm proposed by Halperin and Brown [27] that computes the unconditional power for comparing survival curves. In the development of their algorithm, Henderson and his colleagues relaxed the constant-hazard assumption of Andersen's conditional power approach. They achieved this by using arbitrary survival curves, without a distributional assumption on the form of the survival curves. They computed conditional power using the log-rank test, Gehan's generalization of the Wilcoxon test, and a modified Kolmogorov-Smirnov test. Henderson et al's algorithm was used in a VA cooperative study on valvular disease, entitled Prognosis and Outcome Following Heart Valve Replacement [28], to decide whether or not to extend the trial beyond its planned end. The VA study was a randomized controlled trial with 2 treatment arms. One arm was to receive a tissue valve (the Hancock porcine heterograft valve) and the other a mechanical valve (the Bjork-Shiley spherical disc valve).

The outcomes of the trial were death from all causes and heart-valve complications, which were either fatal or nonfatal. 575 patients requiring either an aortic or a mitral valve were randomized to receive either the tissue or the mechanical valve. Patients were enrolled into the study and randomized between 1977 and 1982 and were followed up by clinic visits every 6 months, for an average follow-up of 5 years. In 1988, after an average follow-up of 7.5 years, based on the conditional power estimates using the modified Kolmogorov-Smirnov test and the log-rank test, the trial was extended for another 5 years. The conditional power estimates using Gehan's modified Wilcoxon test were too conservative for all the scenarios considered in the decision to extend the trial. Lin et al [29] examined the computation of conditional power for the two-sample weighted log-rank test for survival data in the presence of censoring. Bautista et al [30] also considered the computation of conditional power for time-to-event outcomes using a B-value transformation of the log-rank test. The Z_n in this framework is the observed log-rank statistic, and the proportion of the total planned information observed at the time of data monitoring is defined in terms of the exposure time. Other authors have also proposed extension of a clinical trial based on conditional power. Proschan and Hunsberger [31] proposed a flexible two-stage method that uses conditional power to extend a trial. The conditional power at the end of the first stage is used to determine the sample size (n2) of the extension. Proschan and Hunsberger [31] protect the overall type I error rate by using an increasing conditional error function that specifies the amount of conditional type I error allowed for the second stage, conditioned on the value of the stage 1 test statistic.

Their design also incorporates the possibility of extending the trial for small non-significant p-values. Li et al [32] proposed a modification to Proschan and Hunsberger's method that determines the sample size of the second stage using conditional power, and adjusts the final-stage critical value (c 2) to protect the overall type I error rate without specifying a form of the conditional error function. They achieve this by directly applying the definition of the conditional type I error. Denne [33] proposed a 2-stage group sequential procedure, an application of Proschan and Hunsberger's method, that allows for the re-estimation of the sample size required in stage 2 based on the information observed about the nuisance parameter (σ 2) in stage 1. Denne's procedure uses the definition of the conditional power under H 0 to determine the appropriate critical value that prevents inflation of the overall trial type I error rate. Denne's procedure is discussed at length in the second paper of the dissertation. Several procedures or designs based on conditional power have also been proposed in the literature. Lan [34] proposed a two-stage procedure that fixes the size of the second stage (n 2) to ensure sufficient conditional power at θ = δ̂, the current estimate of the treatment effect. To protect against type I error rate inflation, Lan incorporates a rule for early stopping to accept H 0 that ensures that the type I error rate is controlled over a range of design parameters. Betensky [35, 36] has also proposed procedures for early stopping to accept H 0 based on conditional power. Betensky [35] considered the use of a linear combination of the current estimate of the treatment effect θ = δ̂ and z(1-γ), where z(1-γ) is the 100(1-γ)th percentile of the standard normal distribution, instead of the value of θ under the pre-specified alternative hypothesis, in the computation of the conditional power. The procedure proposed by Betensky [36] is a modification of the work by Pepe and Anderson [10].

In this paper, Betensky [36] proposed the use of conditional power in the computation of lower boundaries for the O'Brien-Fleming group sequential test and the repeated significance tests. The conditional power computation was based on the linear combination of the current estimate of the treatment effect proposed by Betensky [35]. The aim of Betensky's paper [36] was to provide boundaries that aid the decision to stop early in favor of the null hypothesis. Spiegelhalter et al [37] and Brannath and Bauer [38] have also considered procedures based on conditional power. Proschan [39] discussed the computation of conditional power in a multi-armed trial using the classic Fisher least significant difference procedure. Conditional power is a useful tool for communicating information about an ongoing clinical trial to the physicians involved in the trial. For instance, it can be used to illustrate the effects of slow accrual and to aid the decision to stop the trial before its planned end if the conditional power is low [9]. Since conditional power computations are dependent on the value of the parameter of interest θ, the total sample size assumed and the observed test statistic, it is imperative to specify in the protocol for the clinical trial, at the design/planning phase, details of the parameters that will be used in the computation of the conditional power during the interim analyses that involve these computations. The threshold value for the conditional power by which early stopping in favor of the null hypothesis or the alternative hypothesis is determined should also be specified in the protocol. Reasonable justifications for the choice of these threshold values should be stated. The test statistic used in the computation of the conditional power should be the same test statistic intended to be used for inference of the treatment effect at the planned end of the trial.

The total sample size specified at the beginning of the trial should be strictly adhered to during the computation of the conditional power, except when the intent of the conditional power calculations is to re-estimate the total sample size of the trial. Any change in the total sample size used for the conditional power computation should be adequately justified in an amendment to the original protocol.

1.6 Summary of Research Objectives

Below is a synopsis of the research objectives/aims of each paper of the dissertation.

Paper 1:
1) Develop a phase II clinical trial design that provides more opportunities, compared to existing phase II designs, for early termination of the trial due to lack of evidence of efficacy.
2) The developed design should seek to provide smaller expected sample sizes compared to Simon's minimax and optimal designs.

Paper 2:
1) Propose a design that incorporates the possible existence of a covariate by treatment interaction into the implementation of a phase III clinical trial with a dichotomous outcome and two treatments. We are specifically interested in an interaction between the treatments and a primary covariate with two levels.
2) The design should provide an opportunity to conduct a formal test of the covariate by treatment interaction.
3) Incorporate a method into the design that controls the overall trial type I error rate at the end of the trial at the usual 5% for trials that observe a significant test of interaction.

Paper 3:
1) Develop an efficient two-stage design that modifies the two-stage design proposed in paper 2 to provide an opportunity to terminate early any stratum that shows little evidence of a treatment effect after a statistically significant test of covariate by treatment interaction has been observed.
2) The modification of the design should also provide an opportunity to terminate early any stratum that shows evidence of a statistically significant treatment effect at the beginning of the second stage of the design.
3) Compare and contrast the statistical properties of a single stage design, a parallel 2-group sequential trial design, and a design proposed by Cohen and Simon [21] to the properties of our developed design.

41 29 STOCHASTICALLY CURTAILED PHASE II CLINICAL TRIALS by AYANLOWO, A.O AND REDDEN, D.T Statistics in Medicine Copyright 2006 by John Wiley & Sons, Ltd. Used by permission Format adapted for dissertation

SUMMARY

Phase II trials often test the null hypothesis H 0: p ≤ p 0 versus H 1: p ≥ p 1, where p is the true unknown proportion responding to the new treatment, p 0 is the greatest response proportion which is deemed clinically ineffective, and p 1 is the smallest response proportion which is deemed clinically effective. In order to expose the fewest number of patients to an ineffective therapy, phase II clinical trials should terminate early when the trial fails to produce sufficient evidence of therapeutic activity (i.e., if p < p 0). Simultaneously, if a treatment is highly effective (i.e., if p ≥ p 1), the trial should declare the drug effective in the fewest patients possible to allow for advancement to a Phase III comparative trial. Several statistical designs, including Simon's minimax and optimal designs, have been developed that meet these requirements. In this paper, we propose three alternative designs that rely upon stochastic curtailment based on conditional power. We compare and contrast the properties of the three approaches: 1) stochastically curtailed (SC) binomial tests, 2) stochastically curtailed (SC) Simon's optimal design, and 3) SC Simon's minimax design to those of Simon's minimax and Simon's optimal designs. For each of these designs we compare and contrast the number of opportunities for study termination, the expected sample size of the trial under the null hypothesis (p < p 0), and the effective Type I and Type II errors. We also present graphical tools for monitoring phase II clinical trials with stochastic curtailment using conditional power.

Keywords: Stochastic curtailment; conditional power; phase II trials; efficacy; response proportion.

1. INTRODUCTION

Phase II clinical trials are aimed at evaluating the therapeutic efficacy of investigational agents in the treatment of diseases/impairments [1]. The therapeutic efficacy of new treatments in phase II clinical trials is often evaluated in terms of a Bernoulli random variable reflecting the participants' response or non-response to therapy. Phase II trials test the null hypothesis H 0: p ≤ p 0 versus H 1: p ≥ p 1, where p is the true proportion responding to the new treatment, p 0 is the largest response proportion such that the drug is deemed clinically ineffective and p 1 is the smallest response proportion such that the drug is deemed clinically effective. A phase II trial should stop quickly when the treatment shows unacceptably low therapeutic activity. Likewise, in situations when the drug is effective, the trial should declare the drug effective in the fewest patients possible to allow for advancement to a Phase III comparative trial. To minimize the total number of patients required for a Phase II trial, several statistical designs have been proposed. Simon [2] proposed two different two-stage designs. Simon's optimal design minimizes the expected number of patients exposed to a treatment whose true response proportion is the null hypothesized value p 0. Alternatively, Simon's minimax design minimizes the maximum total sample size n required for the trial. In both designs, the number of patients in each stage of the design is not specified by the investigators, but is a result of a minimization constraint utilizing a pre-specified overall type I error rate (probability of concluding an ineffective drug is effective) and type II error rate (probability of failing to recognize an effective treatment). The minimization constraint under

44 32 Simon s optimal design tries to ensure that the probability of rejecting an ineffective treatment at the end of the first stage is high, thus eliminating the use of the second stage. The minimax design is usually preferred over the optimal design in cases where patient accrual is expected to be slow and the difference in expected sample sizes under the null for both designs is small. A drawback of both of these designs is that under certain combinations of p 0, p 1, type I and type II error rates they can fail to provide an opportunity to early terminate a trial when there is a long series of failures at the beginning of the trial [3]. For example, the optimal design for p 0 = 0.20, p 1 = 0.35, with α = β = 0.10, where α is the type I error rate and β is the type II error rate, requires a sample size at the first stage of n 1 = 27 patients with early termination when five or fewer successes have been observed. Under this design, the earliest an investigator can terminate the study, assuming no responders, is after 22 consecutive failures. An alternative approach is to incorporate stochastic curtailment rules, which is a sequential monitoring approach to clinical trials that allows for unplanned interim analyses to be carried out at unspecified times during the trial [4]. The stochastic curtailment procedure allows for calculation of the probability of rejecting H 0 at the end of the trial given the current number of observed responses and assuming either H 0 or H 1 is true [5,6]. In literature, three main approaches to stochastic curtailment have been proposed: the predictive power approach, the parameter free approach and the conditional power approach. The predictive power approach utilizes a prior distribution to represent likely values of the true unknown response proportion. The data from the phase II trial is then used

to update the prior distribution to get a posterior distribution under which the predictive probability is computed. The application of this approach is discussed by Herson [7], Choi et al [8], and Spiegelhalter et al [9]. Other applications of Bayesian methods in the design of phase II clinical trials have been considered by Sylvester [10], Thall and Simon [11], and most recently by Jung et al [12]. The parameter free approach uses a bootstrap method to estimate the conditional joint distribution of the sequence of test statistics {Z 1,..., Z k} given the information level I k at a discrete time k. The information level (I k) is defined as the reciprocal of the estimated variance of the response proportion observed up to time k. The conditional joint distribution is used to compute the conditional probability of rejecting the null hypothesis at the end of the trial given the observed information level. This method was originally proposed by Jennison [14], and has been further developed by Xiong [13], Tan and Xiong [15], and Tan et al [16]. The third approach to stochastic curtailment is the conditional power approach, which calculates the probability of rejecting the null hypothesis at the end of the trial, conditional upon the responses of the first k individuals and the assumption that H 1 is true. These conditional probabilities can be used to terminate the trial early due to lack of evidence of treatment efficacy and the low probability of rejecting H 0 at the end of the trial. Similar procedures have been considered by Lan et al [17], Jennison and Turnbull [18], Pepe and Anderson [19] and Betensky [20, 21]. Because of the dependency of the predictive power upon a prior distribution and the complexities of the parameter free approach, we use the conditional power approach in the development of our methods.

In this paper, we compare and contrast the properties of stochastically curtailed binomial trials to Simon's minimax and optimal designs. We also consider two approaches to enhance Simon's minimax and optimal designs using stochastic curtailment rules. For each of these three alternative approaches (stochastically curtailed binomial trials, stochastically curtailed Simon's minimax design, and stochastically curtailed Simon's optimal design), we compare and contrast the number of opportunities for stopping due to lack of efficacy, type I error control (α), statistical power (1-β) and the expected sample size under the null hypothesis to those of Simon's original minimax and optimal designs.

2. OVERVIEW OF SIMON'S TWO-STAGE DESIGNS

Simon's approach to designing Phase II clinical trials is to specify the parameters p 0, p 1, α, β, and then determine the two-stage design that satisfies α and β while minimizing the expected sample size under the null hypothesis (p ≤ p 0). The expected sample size under the null hypothesis is given by E(N) = n 1 + (1-PET)n 2, where n 1 and n 2 are the numbers of patients enrolled in the first and second stage respectively. PET (the probability of early termination, i.e. terminating after the first stage) is the probability of observing r 1 or fewer responses out of the n 1 patients in the first stage, given the true response proportion p, and is computed as

PET = \sum_{t=0}^{r_1} \binom{n_1}{t} p^t (1-p)^{n_1 - t},

which is the cumulative distribution function of a binomial random variable. If r 1 or fewer responses are observed out of the first n 1 patients, the trial is terminated after the first stage due to lack of evidence of treatment efficacy. The treatment is declared ineffective at the end of the second stage if r = r 1 + r 2 or fewer responses are observed, where r 2 is the number of responses observed in the second stage. The probability of declaring ineffective a treatment whose true proportion of response is p is given by

PET + \sum_{x = r_1 + 1}^{\min(n_1,\, r)} \binom{n_1}{x} p^x (1-p)^{n_1 - x} \sum_{t=0}^{r - x} \binom{n_2}{t} p^t (1-p)^{n_2 - t}.

The designs are iteratively optimized over n 1, n 2, r 1, and r such that the expected sample size under the null hypothesis is minimized. Simon's optimal design minimizes the expected number of patients exposed to a treatment whose true proportion of response is at most the null hypothesized value p 0. Alternatively, Simon's minimax design minimizes the maximum total sample size n required for the trial. Simon's designs do not include stopping rules for early termination in the case of a highly efficacious treatment (p ≥ p 1).
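For concreteness, the quantities just defined can be evaluated directly. The short R sketch below (the function names are ours, not from the dissertation) computes PET, the expected sample size under the null, and the probability of declaring the treatment ineffective, using the p0 = 0.2 versus p1 = 0.4 optimal design discussed in Section 4 (n1 = 13, r1 = 3, n2 = 30, r = 12).

  # Quantities from Section 2 for a Simon two-stage design (n1, r1, n2, r)
  pet <- function(p, n1, r1) pbinom(r1, n1, p)                 # probability of early termination
  expected_n <- function(p, n1, n2, r1) n1 + (1 - pet(p, n1, r1)) * n2

  # Probability the treatment is declared ineffective when the true response proportion is p
  p_declare_ineffective <- function(p, n1, n2, r1, r) {
    x <- (r1 + 1):min(n1, r)
    pet(p, n1, r1) + sum(dbinom(x, n1, p) * pbinom(r - x, n2, p))
  }

  pet(0.2, 13, 3)                                  # chance of stopping after stage 1 under p = p0
  expected_n(0.2, 13, 30, 3)                       # expected sample size under the null
  1 - p_declare_ineffective(0.2, 13, 30, 3, 12)    # type I error rate of the design
  p_declare_ineffective(0.4, 13, 30, 3, 12)        # type II error rate under p = p1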

3. OVERVIEW OF STOCHASTIC CURTAILMENT

Let r k denote the number of observed responses among the first k patients, and r n-k denote the number of observed responses among the future n-k patients. Because r k and r n-k are both sums of independent Bernoulli trials, r k and r n-k are distributed Binomial(k, p) and Binomial(n-k, p) respectively. To design a trial that tests the hypotheses H 0: p ≤ p 0 vs. H 1: p ≥ p 1 and utilizes the conditional power approach of stochastic curtailment, we first determine the total sample size n and a critical value r c at which to reject H 0 at the end of the study, for given values of p 0 and p 1 and a priori specified type I error control (α) and power (1-β). The conditional power after observing the responses of the first k individuals is then defined as

C_k = \Pr(\text{reject } H_0 \mid r_k, \text{ under } H_1) = \Pr(r_{n-k} \ge r_c - r_k \mid r_k,\, p \ge p_1) = 1 - \sum_{t=0}^{r_c - r_k - 1} \binom{n-k}{t} p_1^t (1 - p_1)^{(n-k)-t}.

C k represents the conditional probability of rejecting the null hypothesis (i.e. declaring the drug effective) at the end of the trial given the number of observed responses out of the first k individuals, assuming the alternative hypothesis is true; it is also the probability of observing enough responses out of the future n-k patients to reject H 0. The test statistic r k is computed after all subjects enrolled up to the k-th individual have been observed. The trial is terminated early for lack of evidence of efficacy when C k falls below a threshold θ [7]. If the trial reaches full accrual and does not terminate early, C k is either zero or one, and the treatment is declared ineffective if the total number of observed responses out of the n patients is less than r c. In this paper, like Simon [2], we do not consider early termination of a Phase II trial in order to declare a drug effective. We do implement stochastic curtailment rules to terminate the study due to lack of efficacy when the conditional power to declare the drug effective is less than θ. To find these stopping rules, we sequentially examined all combinations of initial responses (r 1) out of k patients and calculated the conditional power (C k) of observing sufficient responses (r n-k) within the remaining n-k patients to allow rejection of the null hypothesis, over a range of designs where p 0, p 1, α, and β were a priori specified. Simulations of size 1000 were used to estimate average sample size under the null hypothesis, type I error rate, and power for all the designs using stochastic curtailment. The simulations under the null hypothesis were done under the assumption that the treatment has a true response proportion of exactly p 0. Similarly, the power calculations were done under the assumption that the treatment has a true response proportion of exactly p 1.
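The search for stopping rules described above can be sketched in R as follows. This is an illustrative reconstruction, not the authors' code (the SAS program in the Appendix performs the analogous search), and the helper names are ours; the example call uses the n = 35, r c = 12 design of Section 4 with θ = 0.05.

  # Conditional power C_k of eventually rejecting H0, given r_k responses in the
  # first k patients, evaluated at the alternative p1 (Section 3)
  cond_power <- function(r_k, k, n, r_c, p1) {
    still_needed <- r_c - r_k
    if (still_needed <= 0) return(1)          # already have enough responses
    if (still_needed > n - k) return(0)       # r_c can no longer be reached
    1 - pbinom(still_needed - 1, n - k, p1)
  }

  # Earliest stopping point for each running number of responses:
  # the smallest k at which C_k drops below the threshold theta
  stopping_rules <- function(n, r_c, p1, theta = 0.05) {
    rules <- NULL
    for (r_k in 0:(r_c - 1)) {
      ks <- max(r_k, 1):(n - 1)
      cp <- sapply(ks, function(k) cond_power(r_k, k, n, r_c, p1))
      if (any(cp < theta))
        rules <- rbind(rules, c(responses = r_k, stop_at = min(ks[cp < theta])))
    }
    rules
  }

  stopping_rules(n = 35, r_c = 12, p1 = 0.4, theta = 0.05)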

We also considered the properties of the designs when the treatment has a true proportion of response less than p 0 or greater than p 1. We examined the benefit of incorporating stochastic curtailment in simple binomial test designs as well as in Simon's minimax and optimal designs. Since the first stage of Simon's two-stage designs is positioned to minimize expected sample size, assuming the null hypothesis is true, under the constraints of α and β, we implemented stochastic curtailment rules only within the second stage of Simon's designs.

4. SIMULATION RESULTS

Table 1 illustrates the properties of stochastic curtailment in simple binomial designs. For example, in a simple binomial design to test between p 0 ≤ 0.2 and p 1 ≥ 0.4, a design of 35 patients requiring 12 responses to conclude p ≥ 0.4 provides 80% power with a Type I error below .05. This design provides no opportunities to terminate the trial early due to lack of evidence of efficacy. Using the stochastically curtailed design to terminate if the conditional power is less than θ = .05 provides 10 opportunities to end the trial due to lack of evidence of efficacy and low probability of future rejection of H 0. The first stopping rule terminates the trial if 0 out of the first 16 patients respond. The rationale for this stopping rule is that if the trial were allowed to continue after observing 0 responses out of the first 16 patients, then rejection of H 0 would only occur if at least 12 responses were observed in the future 19 patients. The probability of observing this outcome, assuming H 1 (p ≥ 0.4) is true, falls below the threshold value of 0.05. Other opportunities to stop the trial early occur after 18, 20, 22, 24, 26, 27, 29, and 31 patients have been observed.
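As a quick check of the first stopping rule, the conditional power for 0 responses among the first 16 of 35 patients can be evaluated with the cond_power helper sketched above, or written directly with pbinom; it falls below the 0.05 threshold, consistent with the rationale just given.

  cond_power(r_k = 0, k = 16, n = 35, r_c = 12, p1 = 0.4)   # below theta = 0.05
  1 - pbinom(11, 19, 0.4)                                   # the same probability, written directly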

Overall, using stochastic curtailment under the simple binomial design provides an estimated power of 78%, type I error of .03, and average sample size of 25.5, as compared to a type I error of 0.046, power of 80% and average sample size of for Simon's minimax design. Simon's optimal design for the same parameters provides power of 80%, type I error of .05, and expected sample size under the null of If θ is set to .10, the stochastically curtailed binomial trial has an expected sample size under the null of which is more comparable to the expected sample sizes for Simon's minimax designs. However, increasing the conditional probability threshold to .10 has little effect on both the power and the type I error rate. Table 2 illustrates the benefit of supplementing Simon's minimax and optimal Phase II designs with stochastic curtailment. Under Simon's optimal design to test between p ≤ 0.2 and p ≥ 0.4, a design of 43 patients is required to provide 80% power with a Type I error of .05. The optimal design terminates early if 3 or fewer responses occur in the first 13 patients, and this design has an expected sample size under the null of The probability of rejecting a treatment with a true proportion of response of p = 0.2 after the first stage of the optimal design for testing H 0: p ≤ 0.2 versus H 1: p ≥ 0.4 is For this design, the conditional probability of declaring the drug effective at the end of the trial if in fact the true proportion of response of the treatment is 0.4, given that we have seen only 3 responses out of the first 13 patients (i.e. the probability of observing at least 10 responses within the next 30 patients given that the true response rate of the drug is 0.4), is Implementing stochastic curtailment within the second stage of this design provides 6 more opportunities to cease the trial early due to lack of evidence of efficacy: 4 out of 30, 5 out of 32, 6 out of 34, 7 out of 35, 8 out of 37 and 9 out of 39.

At each of these six stopping rules, the conditional probability of observing sufficient evidence to declare the drug effective at the end of the trial, if in fact the true proportion of response of the treatment is 0.4, is less than 0.05. The estimated properties of the design using stochastic curtailment with θ of 0.05 are a Type I error rate of 0.044, power of 80.2%, and expected sample size under the null of For θ of 0.10 the Type I error rate remains at with a power of 80.1% and an expected sample size of The slight reduction in expected sample size for θ = 0.10 shows that the utility of the stochastically curtailed Simon's optimal design is dependent on the choice of θ.

[Table 1]

[Table 2]

54 42 5. DISCUSSION Within this paper, we have estimated the properties of the stochastically curtailed (SC) binomial tests, Simon s minimax and optimal designs, and stochastically curtailed (SC) Simon s minimax & optimal designs. For θ =.05, we note that the SC binomial approach provides expected sample sizes under the null hypothesis much closer to Simon s minimax approach. However, at θ =.05, Simon s optimal design has superior statistical characteristics over the SC binomial design. It is noteworthy that the statistical properties of the stochastically curtailed binomial designs are highly dependent upon the conditional probability threshold θ. As θ is increased, expected sample size under the null hypothesis decreases and becomes more comparable to the expected sample size of Simon s minimax design. However, increasing θ decreases statistical power as illustrated in Figure 1. Figure 1 presents the effect of the choice of the threshold value (θ) on the observed power, observed alpha, and the average sample size of the trial respectively for simulated studies with three different effect sizes. The benefit of stochastically curtailed binomial designs depends completely upon an appropriate choice of threshold. Based upon our simulation results, we recommend that the theta value be chosen between.05 and.10 in order to maintain reasonable statistical power at the end of the trial while exposing the fewest number of patients to a potentially ineffective treatment. A theta value >.10 will reduce overall statistical power of the trial and a theta value <.05 increases the number of patients that are exposed to a potentially ineffective treatment. Because Simon s approach is a two-stage design with the first stage evaluation positioned to minimize expected sample size under the constraints of α and β, we implemented stochastic curtailment rules within only the second phase of the Simon s designs.

We hoped to further decrease expected sample size when the drug/therapy is ineffective. As can be seen from Table 2, supplementing Simon's design with stochastic curtailment did not greatly decrease the expected sample size under the null hypothesis. The question must be raised: why did stochastic curtailment not decrease expected sample sizes more under the Simon optimal design? The answer lies within the way Simon's designs are optimized and the choice of thresholds for the stochastically curtailed design. Simon's approach iteratively searches over all possible designs and chooses the design that meets the design criteria and has the smallest expected sample size. Inherent in this iterative approach is the choice of an optimal stage one design that has high probability of early termination when the drug is ineffective. Given that the stage one rule in Simon's designs is highly effective at early termination when the drug is ineffective, the stochastic curtailment rules implemented in the second stage terminate very few trials early except when θ is large (.10 or greater). Given that most of the sample size savings occur within stage one, it is not surprising that the SC Simon optimal designs did not substantially lower expected sample size under the null hypothesis. We note that for small choices of θ (≤ 0.05), the SC designs provide little improvement over Simon's designs. However, we believe the SC designs are potentially useful and likely beneficial in situations where investigators are inclined to use Simon's minimax design. In slowly accruing trials, such as trials of rare cancers, or trials involving quickly observed outcomes, such as headache or analgesic trials, SC designs maintain adequate type I error control and statistical power, and provide more opportunities to terminate the trial early due to lack of evidence of efficacy. In situations where investigators choose to use the SC designs, we recommend using graphical procedures such as Figure 2 to monitor the trial.

These figures are useful for investigators using SC designs given the large number of stopping rules that can be generated under stochastic curtailment designs. Figure 2 presents the graph of the conditional power of rejecting the null hypothesis based upon the number of responses observed and assuming the alternative hypothesis is true, for a simulated study testing H 0: p ≤ 0.2 vs. H 1: p ≥ 0.4, for a treatment whose true response proportion (p) is 0.2. Such a visualization of the conditional power provides a useful tool in explaining the rationale to terminate the trial or to continue, and helps the investigators to ascertain the most appropriate stopping rule at which to terminate the trial early due to lack of efficacy. This is the point on the graph at which the conditional power decreases below the pre-specified threshold value (θ), indicating very little chance of ever rejecting the null hypothesis even if the trial were continued to the end. For the design shown in Figure 2 using θ = .10, this point corresponds to observing three (3) successes or less out of the first 20 patients. If researchers preferred using θ = 0.05, the point would correspond to four (4) successes or less out of the first 24.
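A monitoring plot in the spirit of Figure 2 can be produced as follows. This R sketch reuses the cond_power helper defined earlier and simulates one trial with a true response proportion of 0.2 under the 35-patient design; the seed and plotting details are arbitrary.

  set.seed(1)
  n <- 35; r_c <- 12; p1 <- 0.4; theta <- 0.10
  responses <- rbinom(n, 1, 0.2)
  ck <- sapply(1:n, function(k) cond_power(sum(responses[1:k]), k, n, r_c, p1))
  plot(1:n, ck, type = "s", xlab = "Patients observed (k)",
       ylab = expression(C[k]), ylim = c(0, 1))
  abline(h = theta, lty = 2)   # stop for futility the first time C_k drops below theta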

[Figure 1]

[Figure 2]

59 47 6. CONCLUSION We note the advantages of the stochastically curtailed phase II design are: 1) the design presents the investigator with many opportunities to early terminate the trial due to lack of efficacy. 2) The design helps to ensure that adequate statistical power is maintained at the end of the trial. 3) The design will also ensure that the trial meets the ethical imperative of exposing the fewest patients to a potentially ineffective treatment. One main disadvantage of the stochastically curtailed design is the possible complexity introduced into the trial by the frequent stopping of the trial to observe the responses of all patients accrued up to time k. For this reason, we do not recommend the use of a stochastically curtailed design in trials with fast accrual and trials with outcomes that can only be observed over a long period of time. Under this situation, we recommend the use of Simon s optimum design. Within this paper, we compare and contrast the properties of stochastically curtailed (SC) binomial tests, Simon s optimal and minimax designs, and stochastically curtailed (SC) Simon optimal and minimax designs. Under a large conditional probability threshold and a large difference between p 0 and p 1, stochastically curtailed binomial designs provide statistical properties that are most comparable to Simon s minimax designs. This fact leads us to conclude that stochastically curtailed designs may be most useful in trials that would employ a minimax design, such as trials with slow accrual, trials of rare diseases, or trials involving quickly observed responses. In any situation that an investigator would use the Simon s minimax design, stochastically curtailed designs are a viable option. We further conclude that enhancing Simon s design with stochastic curtailment slightly decreases the expected sample sizes exposed to potentially ineffective treatments

60 48 with minimal loss of power. In light of these findings, more research into the design and conduct of the phase II trials is warranted. The use of the predictive power and the parameter free approach in the design of phase II trials also merit further research.

REFERENCES

1. Gehan, E.A. The determination of the number of patients required in a preliminary and a follow-up trial of a new chemotherapeutic agent. J Chronic Dis 1961; 13.
2. Simon, R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials 1989; 10.
3. Kramar, A., Potvin, D. and Hill, C. Multistage designs for phase II clinical trials: statistical issues in cancer research. British Journal of Cancer 1996; 74.
4. Jennison, C. and Turnbull, B.W. Group sequential methods with applications to clinical trials. Chapman and Hall/CRC: Boca Raton, 2000.
5. Lan, K.K. and Wittes, J. The B-value: a tool for monitoring data. Biometrics 1988; 44.
6. Leung, D.H., Wang, Y. and Amar, D. Early stopping by using stochastic curtailment in a three-arm sequential trial. Applied Statistics 2003; 52.
7. Herson, J. Predictive probability early termination for phase II clinical trials. Biometrics 1979; 35.
8. Choi, S.C., Smith, P.J. and Becker, D.P. Early decision in clinical trials when treatment differences are small. Controlled Clinical Trials 1985; 6.

9. Spiegelhalter, D.J., Freedman, L.S. and Blackburn, P.R. Monitoring clinical trials: conditional or predictive power? Controlled Clinical Trials 1986; 7.
10. Sylvester, R. A Bayesian approach to the design of Phase II clinical trials. Biometrics 1988; 44.
11. Thall, P. and Simon, R. Practical Bayesian guidelines for Phase IIB clinical trials. Biometrics 1994; 50.
12. Jung, S.H., Lee, T., Kim, K. and George, S. Admissible two-stage designs for phase II cancer clinical trials. Statistics in Medicine 2004; 23.
13. Xiong, X. A class of sequential conditional probability ratio tests. J Amer Statist Assoc 1995; 90.
14. Jennison, C. Bootstrap tests and confidence intervals for a hazard ratio when the number of observed failures is small, with applications to group sequential survival studies. Computing Sciences and Statistics, 23. Springer-Verlag: New York, 1992.
15. Tan, M. and Xiong, X. Continuous and group sequential conditional probability ratio tests for Phase II clinical trials. Statistics in Medicine 1996; 15.
16. Tan, M., Xiong, X. and Kutner, M.H. Clinical trial designs based on sequential conditional probability ratio tests and reverse stochastic curtailing. Biometrics 1998; 54.
17. Lan, K.K.G., Simon, R. and Halperin, M. Stochastically curtailed tests in long-term clinical trials. Communications in Statistics - Sequential Analysis 1982; 1.

18. Jennison, C. and Turnbull, B.W. Statistical approaches to interim monitoring of medical trials: a review and commentary. Statistical Science 1990; 5.
19. Pepe, M.S. and Anderson, G.L. Two-stage experimental designs: early stopping with a negative result. Applied Statistics 1992; 41.
20. Betensky, R. Conditional power calculations for early acceptance of H0 embedded in sequential tests. Statistics in Medicine 1997; 16.
21. Betensky, R. Early stopping to accept H0 based on conditional power: approximations and comparisons. Biometrics 1997; 53.

APPENDIX
SIMULATION OF DESIGN USING SAS

* The following code (all prior to the macro) provides a stochastic curtailment
  design using 35 people, alpha = .1, power = .81, to detect the difference
  Ho p = .2 versus p = .4;
data sc;
  do i = 1 to 39;
    do j = 0 to i - 1;
      need = 8 - j;
      remainder = 40 - i;
      if need <= 0 then cond_prob = 1;
      if need > remainder then cond_prob = 0;
      if need <= remainder then cond_prob = 1 - cdf('binom', need - 1, .25, remainder);
      output;
    end;
  end;
run;

* select all rules that meet the rule: stop the study because there is only a 5% chance of rejecting;
data keep;
  set sc;
  if 0 < cond_prob <= .05;
run;

proc sort data = keep;
  by j i;
run;

* select the first chance to stop;
data keep;
  set keep;
  by j;
  if first.j;
* The proc print displays the stopping rules;
proc print;
run;

data keep;
run;

* macro to simulate the properties of the design. Early stopping rules will affect Type I and Type II errors;
%macro sim(trial);
data trial;
  trial = &trial;
  call streaminit(trial);
  success = 0;
  complete = 0;
  reject = 0;

  stop = 0;
  do i = 1 to 40;
    success = success + RAND('BERNOULLI', .1);
    if i = 23 and success = 0 then do; stop = 1; output; end;
    if i = 26 and success = 1 and stop = 0 then do; stop = 2; output; end;
    if i = 29 and success = 2 and stop = 0 then do; stop = 3; output; end;
    if i = 31 and success = 3 and stop = 0 then do; stop = 4; output; end;
    if i = 34 and success = 4 and stop = 0 then do; stop = 5; output; end;
    if i = 37 and success = 5 and stop = 0 then do; stop = 6; output; end;
    if i = 40 and stop = 0 then do;
      complete = 1;
      if success >= 8 then reject = 1;
      output;
    end;
  end;
run;

data keep;
  set keep trial;
  if success ne .;
run;
%mend;

%macro doit;
  %let trial = 0;
  %do i = 1 %to 1000;
    %let trial = %eval(&trial + 1);
    %sim(&trial);
  %end;
%mend;
%doit;

proc print data = keep;
run;
proc freq data = keep;
  table stop complete reject;
run;
proc univariate data = keep;
  var i success;
  histogram;
run;

data new;
  do i = 1 to 10 by 0.05;
    c = ((i*sqrt(150)) - (3.257*sqrt(75)) - (75*0.15/sqrt( )))/sqrt(75);
    cond_power = 1 - cdf('normal', c);
    d = (-0.84*sqrt(75)*sqrt( )) + (3.247*sqrt(75)*sqrt( )) + (75*.15);
    v = sqrt(150)*sqrt( );
    f = d/v;

    output;
  end;
run;
proc print data = new;
run;

data new;
  c = ((2.032*sqrt(119)) - (2.125*sqrt(75)) - ((119-75)*0.2/(sqrt(0.48))))/(sqrt(119-75));
  cond_power = 1 - cdf('normal', c);
run;
proc print data = new;
run;

67 55 A TWO STAGE CONDITIONAL POWER ADAPTIVE DESIGN ADJUSTING FOR TREATMENT BY COVARIATE INTERACTION by AYANLOWO, A.O AND REDDEN, D.T Submitted to Contemporary Clinical Trials Format adapted for dissertation

68 56 SUMMARY During the design and planning phase of clinical trials, researchers often assume that no covariate by treatment interaction exists. This assumption has led to many trials being underpowered to detect such interactions and perhaps inaccurate interpretation of treatment effects. We propose a two-stage adaptive design that incorporates the likely existence of a treatment by covariate interaction into the design and implementation of the clinical trial. The information in stage 1 is used to test for the presence of the covariate by treatment interaction. A statistically significant interaction influences how the second stage of the trial will be implemented, thereby aiding in the full understanding and consequently, an accurate interpretation of the treatment effect. We examine the statistical properties of the proposed design using a binary outcome under different types of covariate by treatment interactions and treatment allocation schemes. A conditional power approach is used to prevent inflation of the overall trial type I error rate while maintaining adequate statistical power conditional on the statistically significant interaction. Keywords: Conditional power; treatment effect; covariate by treatment interaction; adaptive design.

69 57 1. INTRODUCTION Clinical trials are often designed under the assumption of no covariate by treatment interaction. Despite the assumption of no treatment by covariate interaction used in planning the trial, many researchers at the end of the trial are interested in tests of interaction. However, these tests of interaction are usually underpowered. Furthermore, if a significant test of interaction is observed, sample sizes conditional upon covariate levels may be inadequate to allow inferences of treatment effects within each covariate stratum. Within this paper, we propose a two stage adaptive design for binary outcomes that allows for a test of interaction at the end of the first stage. Below, we review methods that motivated and are used within our design. Adaptive clinical trial designs which base future decisions, such as continuing to a second stage or terminating early on the information accrued so far, must address the possibility of an inflated type I error rate and the loss of statistical power. Several methods have been proposed for controlling type I error rates while maintaining adequate power for adaptive designs. Several of these approaches are based on the concept of conditional power. Ayanlowo and Redden in 2007 [1] proposed a method that uses conditional power to provide multiple early stopping rules in a phase II clinical trial to declare a new treatment ineffective. Lan [2] proposed a two-stage procedure that fixes the size of the second stage (n 2 ) to ensure sufficient conditional power atθˆ, the current estimate of treatment effect. To protect against a type I error rate inflation, Lan incorporates a rule for early stopping to accept H 0 that ensures that the type I error rate is controlled over a

70 58 range of design parameters. Betensky [3, 4], Lan et al [5], Pepe and Anderson [6] have also proposed procedures for early stopping to accept H 0 based on conditional power. Proschan and Hunsberger [7] proposed a flexible method that uses conditional power to extend a trial. The conditional power at the end of the first stage is used to determine the sample size (n 2 ) of the extension. Proschan and Hunsberger protect the type I error rate by using an increasing conditional error function that specifies the amount of conditional type I error rate needed for the second stage conditioned on the value of the stage 1 test statistic to determine the appropriate critical value c 2. Their design also incorporates the possibility of extending the trial for small non-significant p-values. Li et al [8] proposed a modification to Proschan and Hunsberger s method that determines the sample size of the second stage using conditional power, and adjusts the final stage 2 critical value (c 2 ) to protect the overall type-1 error rate without specifying a form of the conditional error function. They achieve this by directly applying the definition of the conditional type-i error. Denne [9] proposed a 2-stage group sequential procedure, an application of Proschan and Hunsberger s method, that allows for the re-estimation of the sample size required in stage 2 based on the information observed about the nuisance parameter (σ 2 ) in stage 1. Denne s procedure uses the definition of the conditional power under H 0 to determine the appropriate critical value that prevents inflation of the overall trial type I error rate. Denne s procedure is discussed at length in section 2.1. For trials in which researchers are willing to assume the existence of a covariate by treatment interaction at the beginning of the trial, Rosenberger et al [10] proposed a method that allows investigators to incorporate the existence of a covariate by treatment interaction into the design and randomization of the trial. They achieve this by adjusting

71 59 for the possible effects of the treatment, covariate(s) and interaction(s) during randomization, under the assumption that a true interaction exists. Such an approach falls under the umbrella of methods called adaptive randomization. Rosenberger and his colleagues suggest updating the probability of future treatment assignment using the estimates of the treatment, covariate and covariate by treatment interaction effects from information accrued as the trial is conducted. They allow for a burn-in period before the adaptive randomization procedure begins, but do not conduct a formal test of interaction. Thall and Wathen [11] proposed a two-stage design similar to the Rosenberger s procedure but using a Bayesian framework. Thall and Wathen update the probability of treatment assignment based on the posterior distribution of the parameter of interest. Both approaches allow for continuous and categorical covariates. In this paper, we propose a two stage adaptive procedure, an alternative to the Rosenberger et al s procedure. Unlike the Rosenberger et al s procedure our proposed design allows for a formal test of the interaction. The first stage of our design utilizes an equal randomization scheme and its information is used to conduct a test of interaction. The trial proceeds to stage 2 if the test of interaction is statistically significant at the specified α-level. If the test of interaction is not significant, we accrue additional patients to have 80% power to detect the treatment effect of the specified magnitude. A test of treatment effect adjusting for the effect of the covariate is conducted immediately after accrual of the additional patients. If the test of interaction is statistically significant, the trial continues to enroll n 2 patients and the final stage 2 test statistic for the test of treatment effect is conditioned on the significance of the interaction. Thus the probability of making a type I error at the end of the trial is high. We implement a conditional power

approach at the beginning of the second stage to control the overall type I error rate while maintaining adequate power to detect the hypothesized treatment effect (θ) within each covariate stratum. Details of our procedure are outlined in section 2.3. In section 3, using simulations, we examine the statistical properties of our design under different types of covariate by treatment interactions and patient randomization schemes (equal allocation and adaptive allocation). Specifically, we examine the overall type I error rate, the overall power, and the expected sample size of the design under the null of no covariate by treatment interaction. We also investigate the effect of the timing of the test for the covariate by treatment interaction on the overall type I error rate, overall power and the expected sample size under the null of no interaction. We assume a fixed sample trial for all our simulations.

2. METHODS

2.1. Review of Denne's Procedure

Denne [9] developed a flexible method of extending a trial based upon the conditional power approach proposed by Proschan and Hunsberger [7]. Proschan and Hunsberger [7] use the significance of the treatment effect at the planned end of the study to determine the size of the extension (possibly none) and the critical value necessary for accruing the additional patients while controlling the probability of making a type I error. Denne [9] considers this conditional power approach in the development of a method to control the type I error rate in a 2-stage group sequential error spending design, where there is the possibility of increasing the sample size needed in the second stage of the design based on the information observed in the first stage about the nuisance parameter (σ 2).

The procedure uses a non-decreasing error spending function, f(t), specified at the beginning of the trial to determine the amounts of type I error, α 1 and α 2, to be spent at the first and second stage respectively (α 1 + α 2 = α): α 1 = f(γ 1), α 2 = α - f(γ 1), where γ 1 = n 1/n r is the ratio of the sample size per treatment arm at the interim analysis (n 1) to the re-estimated total sample size per treatment arm (n r). For simplicity, Denne considered a one-sided alternative for the null hypothesis of no difference in the probability of response:

H_0: p_{d_1} - p_{d_2} \le \theta \quad \text{vs.} \quad H_A: p_{d_1} - p_{d_2} > \theta.

If more than one analysis is planned, the test of H 0 vs. H A at the k-th analysis is based on the test statistic

z_k = \frac{\sqrt{n_k}\,(\hat{p}_{1k} - \hat{p}_{0k})}{\sqrt{2\hat{\sigma}_1^2}},   (2.1)

where \hat{\sigma}_1^2 = (\hat{p}_{11} + \hat{p}_{01})(2 - (\hat{p}_{11} + \hat{p}_{01}))/4; the normal approximation to the binomial is used to estimate the variance of the difference of the two proportions, and \hat{p}_{i1} is the proportion of success observed within the i-th treatment arm, i = 1, 2, after stage 1. The conditional power is then defined as the probability of rejecting the null hypothesis at the end of the second stage given the observed information from stage 1, under the assumption that either the null or the alternative hypothesis is true. The conditional power is thus computed as

CP_\theta(n, c_2 \mid z_1) = 1 - \Phi\!\left(\frac{c_2\sqrt{n_2 + n_1} - z_1\sqrt{n_1} - n_2\theta/\sqrt{2\hat{\sigma}_1^2}}{\sqrt{n_2}}\right),   (2.2)

where z_1 = \sqrt{n_1}(\hat{p}_{11} - \hat{p}_{01})/\sqrt{2\hat{\sigma}_1^2} is the test statistic at the end of stage 1, n = n 2 + n 1, c 2 is the critical value used to determine significance at the final analysis of the trial, and \Phi(\cdot) is the cumulative distribution function of a standard normal.
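A small R sketch of equations (2.1) and (2.2) as reconstructed above follows; the function and argument names are ours, and the interim data in the usage example are made up.

  # Stage-1 test statistic (2.1) and conditional power (2.2)
  z_stage1 <- function(p11, p01, n1) {
    sigma2 <- (p11 + p01) * (2 - (p11 + p01)) / 4
    sqrt(n1) * (p11 - p01) / sqrt(2 * sigma2)
  }
  cond_power_denne <- function(z1, n1, n2, c2, theta, sigma2) {
    1 - pnorm((c2 * sqrt(n1 + n2) - z1 * sqrt(n1) - n2 * theta / sqrt(2 * sigma2)) / sqrt(n2))
  }

  # Illustrative interim data: response proportions 0.45 vs 0.30 with 100 patients per arm
  p11 <- 0.45; p01 <- 0.30; n1 <- 100
  z1 <- z_stage1(p11, p01, n1)
  sigma2 <- (p11 + p01) * (2 - (p11 + p01)) / 4
  cond_power_denne(z1, n1 = 100, n2 = 100, c2 = 1.96, theta = 0, sigma2 = sigma2)  # under H0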

In his paper, Denne shows that for a choice of c 2 that satisfies equation (2.3) below, the procedure ensures that the probability of a type I error conditional on z 1 given n is exactly what it would have been had the total sample size per treatment arm (n) not changed after re-estimation, i.e. n r = n:

CP_{\theta=0}(n, c_2 \mid z_1) = CP_{\theta=0}(n_r, \tilde{c}_2 \mid z_1),   (2.3)

where \tilde{c}_2 is the critical value required for significance of the stage 2 test statistic in a 2-stage group sequential error spending trial, obtained by spending α 2 (the remaining type I error in stage 2) in the usual way. Using the definition of conditional power under H 0, and some algebraic manipulation, equation (2.3) can be re-written as

c_2 = \tilde{c}_2\sqrt{\frac{\gamma_2 - \gamma_1}{\gamma_2(1 - \gamma_1)}} + z_1\sqrt{\frac{\gamma_1}{\gamma_2}}\left(1 - \sqrt{\frac{\gamma_2 - \gamma_1}{1 - \gamma_1}}\right), \qquad \gamma_2 = n/n_r.   (2.4)

From equation (2.4) it is clear that the critical value c 2 depends on the proportional increase in n r, γ 2 and z 1 for a fixed value of γ 1. Denne asserts that the flexibility of this approach lies in its ability to maintain the type I error rate at α regardless of the method used to re-estimate the new total sample size (n r) conditional on z 1. This implies that the approach is insensitive to the type of dependence that exists between n r and z 1. This flexibility allows us to choose a c 2 that achieves a specific conditional power, as defined in (2.2), under the null hypothesis for a fixed sample size n, given z 1, while still maintaining a type I error rate of α.
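Equation (2.4) translates directly into code. The following R sketch (our own naming) also checks the boundary case γ2 = 1, in which c2 should reduce to the usual error-spending critical value c̃2.

  # Adjusted final critical value from equation (2.4), as reconstructed above
  c2_from_denne <- function(c2_tilde, z1, gamma1, gamma2) {
    ratio <- (gamma2 - gamma1) / (1 - gamma1)
    c2_tilde * sqrt(ratio / gamma2) + z1 * sqrt(gamma1 / gamma2) * (1 - sqrt(ratio))
  }

  c2_from_denne(c2_tilde = 2.0, z1 = 1.2, gamma1 = 0.5, gamma2 = 1)    # returns 2.0
  c2_from_denne(c2_tilde = 2.0, z1 = 1.2, gamma1 = 0.4, gamma2 = 0.8)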

The final test statistic z, for which c 2 is the critical value, is then computed as a weighted sum of the standardized z-statistics from the two stages:

z = \sqrt{\frac{\gamma_1}{\gamma_2}}\, z_1 + \sqrt{\frac{\gamma_2 - \gamma_1}{\gamma_2}}\, z_2,   (2.5)

where z_2 = \sqrt{n_2}(\hat{p}_{12} - \hat{p}_{02})/\sqrt{2\hat{\sigma}_1^2} and \hat{p}_{i2} is the proportion of success observed within the i-th treatment arm from the n 2 observations in the second stage. Under H 0, z 2 is distributed as a standard normal random variable for all values of z 1; as such, z 2 is independent of z 1. Using Denne's procedure, the conditional probability of a type I error given z 1 is simply written as

1 - \Phi\!\left(\frac{\tilde{c}_2 - z_1\sqrt{\gamma_1}}{\sqrt{1 - \gamma_1}}\right).   (2.6)

2.3 Proposed design

Our design is described as follows: assume we have two treatments (d 1 and d 2) with a dichotomous outcome (X) and a covariate (g) with j levels, j = 1, 2; and the trial is planned for a total sample size N. The first n 1 patients are randomized to either treatment arm, regardless of their covariate levels, as in Rosenberger et al.'s [10] design, using an equal allocation scheme. After observing the outcome of the first n 1 patients, we conduct a formal test of covariate by treatment interaction using the following test statistic:

Q_{BD} = \sum_{j=1}^{2}\sum_{h=1}^{2}\sum_{i=1}^{2} \frac{(n_{jhi} - m_{jhi})^2}{m_{jhi}},   (2.7)

where n_{jhi} is the observed cell count in the h-th outcome level and the i-th treatment category of the table representing the j-th level of the covariate, and m_{jhi} is the expected count for the cell corresponding to the observed count n_{jhi}, h = 1, 2; i = 1, 2; j = 1, 2.
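The stage 1 interaction test of equation (2.7) is a test of homogeneity of the odds ratio across the two covariate strata. The sketch below assumes the DescTools package and its BreslowDayTest() function are available; the cell counts are purely illustrative.

  library(DescTools)

  # 2 x 2 x 2 array: outcome (response / none) by treatment (d1 / d2) by covariate level (g1 / g2)
  stage1 <- array(c(32, 48, 18, 62,     # stratum g1: d1 32/80 responders, d2 18/80
                    15, 65, 30, 50),    # stratum g2: d1 15/80 responders, d2 30/80
                  dim = c(2, 2, 2),
                  dimnames = list(outcome = c("response", "none"),
                                  treatment = c("d1", "d2"),
                                  covariate = c("g1", "g2")))

  bd <- BreslowDayTest(stage1)
  bd$statistic > qchisq(0.95, df = 1)   # proceed to stage 2 if the interaction test is significant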

Q BD is the Breslow-Day test for homogeneity of the odds ratio across the levels of the covariate (Breslow and Day [12]). If Q BD > χ² 1,α, where χ² 1,α is the 100(1-α)-th percentile of a central chi-squared distribution with 1 degree of freedom, the trial proceeds to the second stage; otherwise, accrual continues to N* using an equal allocation scheme. N* is the total sample size needed to detect a treatment effect of the specified effect size with about 80% power. After accrual of the N* patients in a trial that does not proceed to the second stage, an appropriate method is used to conduct the test of treatment effect adjusting for the effect of the covariate. If the trial proceeds to the second stage, the trial design changes. In stage 2 the trial is split into j strata, depending on the levels of the covariate. This implies that the n 1 patients from stage 1 are now split into j different strata depending on their covariate levels, and thus j trials will be run concurrently during the second stage. The remaining N - N 1 patients are then randomized to either of the 2 treatment arms within each of the j strata using either an adaptive allocation or an equal allocation scheme. Since the trial only proceeds to stage 2 if the test of interaction is statistically significant, we know that the final test statistic is conditioned on the significance of the test of interaction, and thus we would expect an overall trial type I error rate inflation. Table I shows the magnitude of the type I error rate inflation of the design when the significance of the test of interaction is ignored and we naively use the uncorrected critical value, z = 1.96, which is the usual critical value for a z-test with α = 0.05 and a two-sided alternative. To maintain an overall trial type I error rate of 5% at the end of the trial for our design, we implement the conditional power approach as proposed by Denne [9] at the beginning of the second stage to determine the appropriate critical value c 2.

The next section describes our simulated designs and provides detail of how Denne's procedure is incorporated into our design to help protect the overall trial type I error rate. Figure 1 presents an outline of the proposed design.

[Figure 1: Outline of the proposed design. Accrue N1 patients using equal allocation, then perform an interim analysis to test for interaction. If the interaction is significant, accrue n2g1 and n2g2 patients within the two covariate strata (using either adaptive or equal allocation) and test the treatment effect within each stratum; if it is not significant, accrue up to N* and test the treatment effect adjusting for the covariate.]

[Figure 2: Plots of two types of treatment by covariate interaction (types A and B) with 2 treatments d1, d2 and a covariate g with 2 levels, for effect sizes 0.25 vs. 0.40; panels show the response rate by covariate level for each treatment.]
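The decision flow of Figure 1 can be sketched end-to-end as a single simulated trial. Everything in the following R code is illustrative: the sample sizes and response probabilities are made up, a logistic-regression interaction term stands in for the Breslow-Day statistic of (2.7), and equal allocation is used in stage 2; the test sketched is a one-sided comparison of d1 versus d2 within each stratum.

  set.seed(123)
  alpha <- 0.05
  N1 <- 160; N <- 320                            # illustrative stage-1 and total sample sizes
  p_true <- rbind(g1 = c(d1 = 0.40, d2 = 0.20),  # a type A interaction
                  g2 = c(d1 = 0.20, d2 = 0.40))

  make_data <- function(n_cell, p) {             # n_cell patients per covariate-by-treatment cell
    out <- NULL
    for (g in rownames(p)) for (t in colnames(p))
      out <- rbind(out, data.frame(g = g, trt = t, y = rbinom(n_cell, 1, p[g, t])))
    out
  }
  z_prop <- function(y_a, y_b) {                 # within-stratum statistic in the form of (2.1)
    pa <- mean(y_a); pb <- mean(y_b); m <- length(y_a)   # equal per-arm sizes assumed
    s2 <- (pa + pb) * (2 - (pa + pb)) / 4
    sqrt(m) * (pa - pb) / sqrt(2 * s2)
  }

  # Stage 1: equal allocation, then the interim test of interaction
  dat1 <- make_data(N1 / 4, p_true)
  fit <- glm(y ~ g * trt, family = binomial, data = dat1)
  p_int <- summary(fit)$coefficients["gg2:trtd2", "Pr(>|z|)"]

  if (p_int >= alpha) {
    # No interaction: continue accrual to N* (omitted) and test treatment adjusted for covariate
    summary(glm(y ~ g + trt, family = binomial, data = dat1))$coefficients["trtd2", ]
  } else {
    # Significant interaction: stage 2 run as two concurrent stratum-specific trials
    dat2 <- make_data((N - N1) / 4, p_true)
    for (s in c("g1", "g2")) {
      n1 <- N1 / 4; n2 <- (N - N1) / 4; n <- n1 + n2   # per-arm sizes within the stratum
      z1 <- z_prop(dat1$y[dat1$g == s & dat1$trt == "d1"], dat1$y[dat1$g == s & dat1$trt == "d2"])
      z2 <- z_prop(dat2$y[dat2$g == s & dat2$trt == "d1"], dat2$y[dat2$g == s & dat2$trt == "d2"])
      zj <- sqrt(n1 / n) * z1 + sqrt(n2 / n) * z2                     # equation (2.5)
      c2 <- qnorm(1 - alpha / 2) * sqrt(n2 / n) + z1 * sqrt(n1 / n)   # equation (3.1)
      cat(sprintf("stratum %s: z = %.2f, c2 = %.2f, reject H0: %s\n", s, zj, c2, zj >= c2))
    }
  }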


3. DESIGN ANALYSIS AND SIMULATION

For the trial that proceeds to stage 2, the information from the first stage of the trial is split into j strata. We compute z 1j from equation (2.1) within each stratum using the information from the patients within that stratum. The nuisance parameter (σ 2) is also estimated within each stratum using \hat{\sigma}_1^2. After the N - N 1 patients in stage 2 have been randomized, by their covariate type, to either of the two treatment arms, we compute z 2j using (2.1) based only on the information from the second stage. The stage 2 randomization can be done using either adaptive or equal allocation. Our simulations use the randomized play-the-winner rule (Wei & Durham [13]), with the initial composition of the urn based on the information from stage 1; the urn is updated by one ball after the response of each patient in stage 2 is observed. A patient is randomized by picking a ball from the urn with replacement under this rule. The final test statistic (z j) within each stratum is then computed as a weighted sum of z 1j and z 2j using equation (2.5). The final test statistic (z j) is used to conduct the test for treatment effect within each of the strata. To obtain an appropriate critical value (c 2j) within each stratum for z j that protects the overall trial type I error rate at about 5%, we compute c 2j using the definition of conditional power in equation (2.2) under H 0: θ = 0. In this paper, we assume a fixed sample size trial, i.e. the value of n 2 is determined prior to the beginning of the trial and computed as n - n 1, so that for our design n r = n; n 2g1 and n 2g2 (the numbers of patients accrued within each stratum in the second stage) are assumed to be equal. Future research will investigate the design implications of re-estimating n 2g. after stage 1.
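A minimal sketch of the randomized play-the-winner urn described above follows; the initial ball counts and response probabilities are placeholders (in the design they would come from the stage 1 data), and the function name is ours.

  rpw_stage2 <- function(n2, start_a = 1, start_b = 1, p_a = 0.4, p_b = 0.2) {
    urn <- c(A = start_a, B = start_b)
    trt <- character(n2); y <- integer(n2)
    for (i in seq_len(n2)) {
      trt[i] <- sample(c("A", "B"), 1, prob = urn)        # draw with replacement
      y[i]   <- rbinom(1, 1, if (trt[i] == "A") p_a else p_b)
      winner <- if (y[i] == 1) trt[i] else setdiff(c("A", "B"), trt[i])
      urn[winner] <- urn[winner] + 1                      # add one ball after each observed response
    }
    data.frame(trt = trt, y = y)
  }

  set.seed(7)
  table(rpw_stage2(100)$trt)   # allocation drifts toward the better-performing arm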

Given that the conditional probability of a type I error given z 1 can be written as (2.6), replacing \tilde{c}_2 = c_2 and setting the conditional probability equal to α, straightforward algebra gives

c_2 = z_{1-\alpha/2}\sqrt{\frac{n - n_1}{n}} + z_1\sqrt{\frac{n_1}{n}},   (3.1)

where z_{1-\alpha/2} is the 100(1-α/2)-th percentile of the standard normal distribution. c 2 given by (3.1) is similar to the critical value for the second stage test statistic using Proschan and Hunsberger's [7] method. The difference is that in Proschan and Hunsberger's procedure n 2 is estimated at the end of the first stage, while for our design n 2 is fixed at the beginning of the trial and 1-α/2 is not an increasing function of z 1. To determine the critical value (c 2) that ensures a type I error rate of at most 5% at the end of the second stage, in our simulations the conditional probability of a type I error given z 1 was set at α = 5%. Based upon the asymptotic results provided by Rosenberger [10], we utilize c 2 as the critical value under both the equal allocation and the adaptive allocation designs. To study the properties of our proposed design, we simulated the design using two types of covariate by treatment interactions, shown in Figure 2 (type A and type B interactions). A type A interaction occurs when the direction of the effect of the two treatments is reversed depending on the covariate level. This could occur in a trial of two active treatments or drugs where patients' response to treatment depends on their gender; for instance, males respond better to treatment A than to treatment B, while the converse is the case in females. A type B interaction occurs when one of the treatments, say treatment B, has the same effect regardless of the level of the covariate, but the effect of treatment A differs depending on the level of the covariate. This could be the case when males do not respond well to either of the two treatments and females respond better to one of the treatments compared to the other. For the critical values (c 2) shown in Tables II and III we simulated 100,000 replicates of a trial using our proposed design for varying effect sizes.
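Equation (3.1), as reconstructed above, in R; z1, n1 and n are the stage-1 statistic and the per-arm sample sizes within a stratum.

  # Critical value of equation (3.1) for the weighted within-stratum statistic z_j
  c2_stratum <- function(z1, n1, n, alpha = 0.05) {
    qnorm(1 - alpha / 2) * sqrt((n - n1) / n) + z1 * sqrt(n1 / n)
  }
  c2_stratum(z1 = 2.2, n1 = 40, n = 80)   # a larger stage-1 statistic pushes the critical value up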

Although critical values were computed in the two strata, Tables II and III report only the higher of the two critical values. Type I error rates and power were simulated using 30,000 replicates. Type I error rate is defined as the probability of rejecting H 0 of no treatment effect when the null hypothesis is true. In our design this error can occur either in the j strata for trials continuing to the second stage or after stage 1 if the trial fails to proceed beyond the first stage. In our simulations, we consider the type I error rate in two ways: 1) the global type I error rate and 2) the stage 2 strata type I error rate. The global type I error rate is the type I error that occurs in either the j strata for trials continuing to the second stage or after stage 1 if the trial fails to proceed beyond the first stage, when the null hypothesis of no treatment effect is true. The stage 2 strata type I error rate is the conditional probability of rejecting H 0 of no treatment effect in the j strata given that the trial proceeded to stage 2, when the null hypothesis of no treatment effect is true. Tables II and III present these results for the two types of interaction under the equal allocation scheme. Tables IV and V present the results under adaptive allocation for the two types of interaction. Statistical power is defined in a similar way but computed under the alternative hypothesis of a treatment effect (θ) equal to δ; Tables VI-IX present these results. Since the design allows for a formal test of interaction, we were interested in investigating how the timing of the test of interaction affected the properties of the design. To this end, we simulated the design using varying total sample sizes (N 1) at the interim analysis. All simulations were conducted using the R statistical computing software (R Development Core Team, [14]). The sample sizes N 1 were determined using the O'Brien [15] UnifyPower program for the specified effect sizes, α, and power at the interim analysis.

For instance, for an effect size of 0.20 versus 0.40 with α = 0.05 and 80% power, UnifyPower gives a total stage 1 sample size (N_1) of 160 to detect an interaction of type A (p_d1g1 = 0.40, p_d1g2 = 0.20, p_d2g1 = 0.20, p_d2g2 = 0.40). To detect a type B interaction of the same magnitude (p_d1g1 = 0.40, p_d1g2 = 0.20, p_d2g1 = 0.20, p_d2g2 = 0.20) with about 70% power, a total stage 1 sample size of 570 is needed. For α = 0.15 these sample sizes decrease to 120 and 440, with about 80% and 70% power to detect an effect size of 0.20 versus 0.40 for a type A and a type B interaction, respectively. The total study sample size (N) reported in the tables was computed by doubling the total stage 1 sample size (N_1) required to test the covariate by treatment interaction at the usual 5% significance level for the specified effect size. For instance, a total stage 1 sample size (N_1) of 570 is needed to detect a type B interaction for an effect size of 0.20 versus 0.40 using a 5% significance level with 70% power to detect the interaction, which implies that the total study sample size (N) is 1140. For a type A interaction, the total sample size (N) also corresponds to the total sample size needed to conduct two parallel studies with two treatment arms each at the specified effect size, significance level and power. For instance, the total sample size needed to detect an effect size of 0.25 versus 0.40 with 80% power at a 5% significance level is about 300 per study. A total sample size of 600 would then be needed to conduct two studies (one for each covariate level) for an effect size of 0.25 versus 0.40 at a 5% significance level with 80% power. The simulation results are presented in the next section.
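As a quick check on the per-study arithmetic above, a standard two-sample proportions calculation (base R's power.prop.test, not the UnifyPower program used for the interaction sample sizes) reproduces the figure of roughly 300 patients per study.

```r
# Per-group sample size for detecting 0.25 versus 0.40 with two-sided
# alpha = 0.05 and 80% power: roughly 150 per arm, i.e. about 300 per
# two-arm study and about 600 for two parallel studies (one per stratum).
power.prop.test(p1 = 0.25, p2 = 0.40, sig.level = 0.05, power = 0.80)
```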

4. SIMULATION RESULTS

Tables II and III present the stage 2 critical values (c_2), the global type I error rates and the stage 2 strata type I error rates of our design under a fixed total study sample size (N) for a type A interaction under an equal allocation and an adaptive allocation scheme, respectively. The stage 2 critical values (c_2), the global type I error rates and the stage 2 strata type I error rates under a fixed total study sample size (N) for a type B interaction under an equal allocation and an adaptive allocation scheme are presented in tables IV and V. In table II, for a type A interaction with an effect size of 0.20 versus 0.40, a total stage 1 sample size (N_1) of 160 is needed for the test of covariate by treatment interaction using a 5% significance level. If the test of covariate by treatment interaction is not statistically significant, accrual for the trial stops for this design because N* (the total sample size needed to detect a treatment effect for the specified effect size with about 80% power) is 160. To maintain a global type I error rate of 5% at the end of the trial for the 0.20 versus 0.40 effect size, under both the equal allocation and adaptive allocation designs, the critical value reported in tables II and III is required within each stratum, with an additional 160 patients accrued during the second stage of the trial. This implies that 80 patients of each covariate level are accrued in the second stage of the trial. For this design the stage 2 strata type I error rate is about 2.13%. If a significance level of 10% is used for the test of covariate by treatment interaction, a total stage 1 sample size (N_1) of 128 is needed at the interim analysis. If the interaction is not significant, an additional 32 patients are accrued; if the interaction is significant, an additional 192 patients (96 of each covariate level) are accrued in the second stage. The tabulated critical value within each stratum maintains a global type I error rate of about 5% at the end of the trial when a significance level of 10% is used at the interim analysis, and the stage 2 strata type I error rate is about 2.3%. The critical value decreases further if a significance level of 15% is used at the interim analysis.

This design requires a total stage 1 sample size (N_1) of 105, with an additional 55 patients accrued if the interaction is not significant. If the test of interaction is significant, an additional 215 patients are accrued in the second stage. The stage 2 strata type I error rate for this design is about 2.8%. For the effect size of 0.20 versus 0.40, the expected sample sizes under the null hypothesis of no covariate by treatment interaction are 168, 176 and 184 for interim analysis type I error rates of 5%, 10% and 15%, respectively. Similar results are shown in table III under the adaptive allocation scheme. In tables IV and V, for the same effect size of 0.20 versus 0.40, a total stage 1 sample size (N_1) of 570 is needed for the test of covariate by treatment interaction to detect a type B interaction with about 70% power. If the interaction is significant, an additional 570 patients are accrued in the second stage. The critical value reported in tables IV and V is required within each stratum to maintain a global type I error rate of about 5% under both the equal allocation and the adaptive allocation designs; the stage 2 strata type I error rate is 1.7%. The total stage 1 sample size (N_1) decreases to 440 and 360 for interim analyses with significance levels of 10% and 15%, respectively. When a significance level of 10% is used at the interim analysis, the tabulated critical value within each stratum maintains a global type I error rate of about 5% with a 1.9% stage 2 strata type I error rate; when a significance level of 15% is used, the corresponding critical value maintains a global type I error rate of 5% with a stage 2 strata type I error rate of about 2.16%. For N_1 = 440 (a 10% significance level at the interim analysis) an additional 700 patients are accrued in the second stage of the design, while a design with N_1 = 360 (a 15% significance level at the interim analysis) accrues an additional 780 patients in its second stage. The expected sample sizes under the null hypothesis of no covariate by treatment interaction for the 0.20 versus 0.40 effect size under a type B interaction are 598.5, 627 and 655.5 for interim analysis type I error rates of 5%, 10% and 15%, respectively.
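The expected sample sizes quoted above follow from weighting the two accrual paths by the probability that the interaction test is significant under the null hypothesis of no interaction, which equals the interim significance level. The short R illustration below reproduces the type A figures; the formula itself and the value N* = 570 used to reproduce the type B figures are inferences from the reported numbers, not statements taken from the text.

```r
# E[N] under the null hypothesis of no interaction: with probability alpha.int
# the interaction test is significant and the full second stage (N - N1) is
# accrued; otherwise accrual continues only up to N*, the sample size needed
# for the overall treatment-effect test.
expected_N <- function(N1, N, Nstar, alpha.int) {
  N1 + (1 - alpha.int) * pmax(Nstar - N1, 0) + alpha.int * (N - N1)
}

# Type A, effect size 0.20 versus 0.40 (N = 320, N* = 160):
expected_N(N1 = c(160, 128, 105), N = 320, Nstar = 160,
           alpha.int = c(0.05, 0.10, 0.15))   # 168, 176, 184

# The type B values are reproduced with N = 1140 if N* is taken to be 570,
# an inference from the reported expected sizes rather than a stated figure:
expected_N(N1 = c(570, 440, 360), N = 1140, Nstar = 570,
           alpha.int = c(0.05, 0.10, 0.15))   # 598.5, 627, 655.5
```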

Tables VI–IX show the estimated global power and stage 2 strata power of our design for a fixed sample size trial under type A and type B interactions, respectively. In tables VI and VII, for a type A interaction with an effect size of 0.20 versus 0.40, the global power to detect a treatment effect at the end of the trial is about 49.0% under both allocation schemes when a significance level of 5% is used at the interim analysis for the test of interaction; the stage 2 strata power for this design is about 58%. The global power increases to about 56.09% with a stage 2 strata power of about 68.4% if a significance level of 10% is used at the interim analysis, and to about 60.8% with a stage 2 strata power of about 75% if a 15% significance level is used. In table VIII, for a type B interaction with an effect size of 0.20 versus 0.40, there is about 92.6% global power to detect a treatment effect at the end of the trial, with a stage 2 strata power of 98.4%, when a significance level of 5% is used at the interim analysis. The global power increases slightly to about 93.6% with a stage 2 strata power of 99.5% if an interim analysis significance level of 10% is used, and further to about 93.72%, with a stage 2 strata power of about 99.6%, for a significance level of 15% at the interim analysis. The relatively high power to detect a treatment effect at the end of the trial is due to the high fixed total study sample size used for these designs in our simulations.

A sample size re-estimation technique can be employed at the beginning of the second stage to determine the number of additional patients, if any, needed to achieve about 80% power to detect the treatment effect at the end of the trial. Future research will explore this sample size re-estimation approach.
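One way such a re-estimation could work is sketched below under simplifying assumptions: the stage 2 statistic is compared with the critical value c_2(n_2) from equation (3.1), its mean under assumed response probabilities is approximated by a normal drift, and the smallest n_2 achieving the target conditional power is returned. The function name and all inputs are illustrative; this is a sketch of the general idea, not the procedure the dissertation proposes to develop.

```r
# Hedged sketch of conditional-power sample size re-estimation: search for the
# smallest additional stage 2 sample size n2 (total over both arms within a
# stratum) whose conditional power, against the critical value c2(n2) from
# equation (3.1), reaches the target. pA and pB are the assumed response
# probabilities for the two arms; z1 and n1 summarize the first stage.
reestimate_n2 <- function(z1, n1, pA, pB, alpha = 0.05,
                          target = 0.80, n2.max = 2000) {
  for (n2 in seq(2, n2.max, by = 2)) {
    c2 <- (qnorm(1 - alpha / 2) * sqrt(n1 + n2) - z1 * sqrt(n1)) / sqrt(n2)
    # approximate mean of the stage 2 z-statistic with n2/2 patients per arm
    drift <- (pA - pB) /
      sqrt(pA * (1 - pA) / (n2 / 2) + pB * (1 - pB) / (n2 / 2))
    if (1 - pnorm(c2 - drift) >= target) {
      return(c(n2 = n2, c2 = c2, cond.power = 1 - pnorm(c2 - drift)))
    }
  }
  NA_real_  # target conditional power not attainable within n2.max
}

# Illustrative call with hypothetical interim results:
reestimate_n2(z1 = 1.2, n1 = 160, pA = 0.40, pB = 0.20)
```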
