How to cite this article: Rosas-Peralta M, Santos-Martínez LE, Magaña-Serrano JA, Valencia-Sánchez JS, Garrido-Garduño M, Pérez-Rodríguez G. [Methodology for superiority versus equivalence and non-inferior clinical studies. A practical review]. Rev Med Inst Mex Seguro Soc. 2016 May-Jun;54(3):344-53.
REVIEW ARTICLES
Received: March 2nd 2015
Accepted: March 4th 2015
Martin Rosas-Peralta,a,g Luis Efrén Santos-Martínez,b José Antonio Magaña-Serrano,c Jesús Salvador Valencia-Sánchez,d Martin Garrido-Garduño,e Gilberto Pérez-Rodríguezf
aInvestigación en Salud
bJefatura del Departamento de Hipertensión Pulmonar y Función Ventricular Derecha
cDivisión de Enseñanza
dDirección de Enseñanza e Investigación
eDirección Médica
fDirección General
gAcademia Nacional de Medicina, A.C.
a,b,c,d,e,fHospital de Cardiología, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social
Ciudad de México, México
Communication with: Martin Rosas-Peralta
Email: mrosas_peralta@hotmail.com
Physicians should always remember that a negative result in a superiority trial never proves that the therapies under investigation are equivalent; more often, there is a considerable risk of a type 2 (false negative) error. Equivalence and non-inferiority studies demand high standards to provide reliable results. Physicians should above all be aware that the equivalence margin may be too large to be clinically meaningful, and that a claim of equivalence can be misleading if the study has not been conducted to a sufficiently high standard. In addition, physicians should be somewhat skeptical of reports that do not include the basic requirements of information: the definition and justification of the equivalence margin, the calculation of the sample size based on that margin, the presentation of both analyses (intention-to-treat and per protocol), and confidence intervals for the results. Equivalence and non-inferiority studies are indicated in certain situations. With strict adherence to their specific methodology, such studies can provide important new knowledge.
Keywords: Clinical trial; Randomized controlled trial; Therapeutic equivalency
The randomized clinical trial (RCT) is generally accepted as the best method for comparing the effects of treatments.1,2 Very often the goal of an RCT is to show that a new treatment is superior to placebo or to an established therapy, i.e., a superiority study is planned and carried out. Sometimes the goal of an RCT is instead to show that a new therapy is not superior but equal to, or not worse than, an established therapy, i.e., the RCT is planned and carried out as an equivalence study or a non-inferiority study.3 Since these trials have different objectives, they differ significantly in various methodological aspects.4 Awareness of these methodological differences is generally quite limited. For example, it is a fairly common belief that failure to find a significant difference between treatments in a superiority study implies that the therapies have the same effect or are equivalent.5-10 However, this conclusion is incorrect because of a considerable risk of overlooking a clinically relevant effect due to insufficient sample size.
The CONSORT Statement (Consolidated Standards of Reporting Trials), which includes a checklist and a flow chart, is a guide developed to help authors improve the reporting of randomized controlled trials; it was most recently updated in 2010.1 Its main focus is the individually randomized trial with two parallel groups assessing the possible superiority of one treatment over the other. The CONSORT Statement has been extended to other trial designs, such as cluster randomized trials, and recommendations for reporting equivalence trials were formulated in 2006. The purpose of this paper is to review the methodology of the different types of studies, with special reference to differences in planning, implementation, analysis, and presentation of evidence, and to the extension of the CONSORT Statement.1 In this context the most important statistical concepts will be examined, and some of the important points are illustrated with examples.
The estimation of sample size and power of an RCT
An important aspect in the planning of any RCT is the estimation of the number of patients required, i.e., the sample size. In this regard, the various types of studies differ.1,2,11 A superiority study aims to demonstrate the superiority of a new therapy compared with placebo or an established therapy. The following description applies to a superiority study; the characteristics that differ for an equivalence or non-inferiority study will be described later. To estimate the sample size it is necessary to consider some important questions, such as: How much better is the new therapy expected to be than the standard therapy? This additional effect of the treatment in question versus the reference treatment is called the minimal clinically relevant difference. It is often denoted by the Greek letter Δ (delta) (Figure 1).
Figure 1 Illustration of the factors influencing the sample size of a study. The difference in effect found in a study is subject to random variation. This variation is illustrated by normal bell-shaped distribution curves for a difference of zero, corresponding to the null hypothesis (H0), and a difference of Δ, corresponding to the alternative hypothesis (HΔ). The areas under the curves indicate the probability of a given difference being compatible with H0 or HΔ, respectively. If the difference is close to zero, H0 would be accepted. The further the difference lies from zero, the less likely H0 becomes. If the probability of H0 becomes very small (less than the specified type 1 error risk 2α, with α lying in each tail of the curve), H0 would be rejected. The sample distribution curves overlap to some extent. A large overlap results in a considerable risk of interpretation error; in particular, the risk of type 2 error can be substantial, as shown in the figure. An important issue is therefore to reduce the risk of type 2 error, β (and increase the power, 1 − β), to a reasonable level. Three ways to do this are shown in panels B, C, and D, each compared with the reference situation. B) An isolated increase of 2α will decrease β and increase power; conversely, an isolated decrease of 2α will increase β and decrease power. C) Isolated narrowing of the sample distribution curves (by increasing the sample size 2N or decreasing the variance of the difference S2) will decrease β and increase power; conversely, isolated widening of the sample distribution curves (by decreasing the sample size or increasing the variance of the difference) will increase β and decrease power. D) An isolated increase of Δ (a larger therapeutic effect) will decrease β and increase power; conversely, an isolated decrease of Δ (a smaller therapeutic effect) will increase β and decrease power.
How much will the observed difference in effect between the two groups vary due to random factors? Like any other measurement of a biological treatment effect, the difference is subject to considerable "random" variation, which needs to be determined and taken into account. The magnitude of this variation is described in statistical terms by the standard deviation S or the variance S2 (Figure 1C), which can be obtained from a pilot study or from similar previously published studies.
The study should estimate as accurately as possible the real difference between the effects of the treatments. However, due to random variation, the outcome of the study may deviate from the real difference and give erroneous results. If, for example, the null hypothesis H0 of no difference were true, the analysis could in some cases still show a significant difference. This type of error is called a type 1 error (false positive) (Figure 1), and it would have the consequence of introducing an ineffective therapy.
If instead the alternative hypothesis HΔ of a difference Δ (delta) were true, the analysis could in some cases fail to show a significant difference. This type of error is a type 2 error (false negative) (Figure 1), which would have the consequence of rejecting an effective therapy. One therefore has to specify the risks of type 1 and type 2 errors that would be acceptable for the analysis. Ideally, these risks would be near zero, but this would require extremely large studies.
Very often the acceptable risk of a type 1 error is specified at 5%. Here α denotes the risk of a type 1 error in one direction only, i.e., a deviation from H0 either upward or downward. However, in many situations it is of interest to detect both beneficial and harmful effects of the new therapy compared with the control treatment, i.e., one tests "two-sided" or "two-tailed", for a difference in either direction (Figure 1). The specified risk of type 1 error is then 2α (α up + α down), i.e., 2α = 5%.
The risk of a type 2 error, β (beta), is usually specified between 10 and 20% in clinical studies. Since a given value of Δ is always either above or below zero (H0), the risk of a type 2 error will always be one-sided. Thus, the lower the value of β, the greater its complement 1 − β, the probability of accepting HΔ when it is in fact true.
The quantity 1 − β is called the power (or potency) of the test, since it is the probability of finding Δ if that difference really exists. From the values given to Δ, S2, α, and β, the necessary number (N) of patients in each group can be estimated using the following relatively simple formula:

N = (Z2α + Zβ)2 × S2 / Δ2
where Z2α and Zβ are the standardized normal deviates corresponding to the specified values of 2α (Table I, left side) and β (Table I, right side), respectively. If for some reason the difference is to be tested in only one direction (a "one-tailed" test), Z2α should be replaced by Zα in the formula, applying the right side of Table I. The formula is approximate, but in most cases it gives a good estimate of the required number of patients. For a study with two parallel groups of equal size, the total sample size is 2N.
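As an illustrative sketch (not part of the original article), the formula can be computed directly in Python; scipy.stats.norm supplies the Z-values of Table I, and the function name n_per_group and its defaults are ours:

```python
from scipy.stats import norm

def n_per_group(delta, s2, two_alpha=0.05, beta=0.20):
    """Approximate patients per group for a two-arm superiority RCT."""
    z_2a = norm.isf(two_alpha / 2)  # Z_2alpha, e.g. 1.96 for 2alpha = 0.05
    z_b = norm.isf(beta)            # Z_beta,   e.g. 0.84 for beta = 0.20
    return (z_2a + z_b) ** 2 * s2 / delta ** 2

# Example with the values of case 1 below: Delta = 0.20, S^2 = 0.48
print(n_per_group(0.20, 0.48))  # approx. 94 per group, total 2N approx. 188
```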
Table I Standardized Normal Distribution (adapted for this work) (see note below)

| Z2α | 2α (two-tail probability) | Zα or Zβ | α or β (one-tail probability) | Zα or Zβ | α or β (one-tail probability) |
|------|-------|------|-------|-------|--------|
| 3.09 | 0.002 | 3.09 | 0.001 | -0.25 | 0.60 |
| 2.58 | 0.01 | 2.58 | 0.005 | -0.39 | 0.65 |
| 2.33 | 0.02 | 2.33 | 0.010 | -0.52 | 0.70 |
| 1.96 | 0.05 | 1.96 | 0.025 | -0.67 | 0.75 |
| 1.64 | 0.1 | 1.64 | 0.05 | -0.84 | 0.80 |
| 1.28 | 0.2 | 1.28 | 0.10 | -1.04 | 0.85 |
| 1.04 | 0.3 | 1.04 | 0.15 | -1.28 | 0.90 |
| 0.84 | 0.4 | 0.84 | 0.20 | -1.64 | 0.95 |
| 0.67 | 0.5 | 0.67 | 0.25 | -1.96 | 0.975 |
| 0.52 | 0.6 | 0.52 | 0.30 | -2.33 | 0.990 |
| 0.39 | 0.7 | 0.39 | 0.35 | -2.58 | 0.995 |
| 0.25 | 0.8 | 0.25 | 0.40 | -3.09 | 0.999 |
| 0.13 | 0.9 | 0.13 | 0.45 | -3.29 | 0.9995 |
| 0.00 | 1.0 | 0.00 | 0.50 | -3.72 | 0.9999 |
The total area below the normal distribution curve is one. The area under a given part of the curve gives the probability of an observation falling in that part. The height of the curve gives the probability density, which is highest at the center and decreases toward the tails. The normal distribution is symmetrical, i.e., the probability from Z to +∞ (right side of the table) is the same as that from −Z to −∞. The right side of the table gives the one-tailed probability from a given Z-value on the x-axis to +∞. The left side of the table gives the two-tailed probability, i.e., the sum of the probability from a given positive Z-value to +∞ and the probability from the corresponding negative Z-value to −∞.
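The entries of Table I can be checked with any statistical package; for example, in Python (scipy), under the same conventions:

```python
from scipy.stats import norm

print(norm.isf(0.05 / 2))  # Z_2alpha for 2alpha = 0.05 -> 1.96 (left side)
print(norm.isf(0.05))      # Z_alpha for alpha = 0.05   -> 1.64 (right side)
print(norm.isf(0.20))      # Z_beta for beta = 0.20     -> 0.84 (right side)
print(2 * norm.sf(1.96))   # two-tail probability of Z = 1.96 -> approx. 0.05
```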
The values used for 2α, β, and Δ should always be determined by the investigator, not the statistician. The values chosen should take into account the disease, its stage, the efficacy, and side effects of the control therapy as well as an estimate of the amount of additional effect that can reasonably be expected from the new therapy.
If, for example, the disease is fairly benign with a relatively good prognosis, and the new therapy is more expensive and may have more side effects than a quite effective control treatment, one would specify relatively higher values of Δ and β and a lower value of 2α, because the new therapy would only be of interest if it is substantially better than the control treatment.
If, however, the disease is aggressive, and the new therapy is cheaper or may have fewer side effects than a not very effective control therapy, one would specify relatively lower values of Δ and β and a higher value of 2α, because the new therapy would be of interest even if it is only slightly better than the control treatment.
As mentioned above, 2α is usually specified at 5% or 0.05, but values of 0.10 or 0.01 can be justified in certain situations. The value of β is usually specified between 0.10 and 0.20, but in special situations a higher or lower value can be justified. The value of Δ should be decided on clinical grounds as the relevant therapeutic gain of the new therapy, taking into account the disease and its prognosis, the effectiveness of the control therapy, and what can reasonably be expected from the new therapy. Preliminary data from experimental studies or historical observations can guide the choice of the magnitude of Δ. While it is often tempting to specify a relatively large Δ, since fewer patients are then required, Δ should never be larger than is biologically reasonable or clinically important; it is always unethical to conduct studies with unrealistic goals. Figure 1 illustrates how the risk of type 2 error, β, and therefore also the power (1 − β), change with 2α, N, S2, and Δ. Thus, β is reduced and the power 1 − β increases if 2α is increased (Figure 1B), if the sample size is increased (Figure 1C), or if Δ increases (Figure 1D). The estimated sample size should be increased in proportion to the expected loss of patients to follow-up due to dropouts and withdrawals.
An important indicator of confidence in the result of an RCT is the width of the confidence interval (CI) of the difference in effect between the therapies investigated.1,2
The narrower the confidence interval, the more reliable the result. In general, the width of this interval is determined by the sample size: a large sample size results in a narrow confidence interval. Typically, a 95% confidence interval is estimated. This interval will, on average, include the real difference in 95 out of 100 similar studies. This is illustrated in Figure 2, in which 100 study samples of the same size are drawn at random from the same population. Note that in 5 of the 100 samples, the 95% confidence interval of the difference in effect D does not include the real difference found in the population, i.e., an error of 5% or 0.05 is accepted.
Figure 2 Illustration of the variation of confidence limits in random samples (computer simulation). A) Ninety-five percent confidence intervals of 100 random samples of the same size from the same population, aligned according to the real value in the population. In five of the samples, the 95% confidence interval does not include the real value in the population. B) The same confidence intervals arranged according to their values. C) When the confidence intervals are ranked according to their average, their variation in relation to the real value of the population is clearly seen again. This presentation corresponds to how researchers see the world: they investigate samples in order to extrapolate the results to the population. However, the potential inaccuracy of extrapolating from a sample to the population is evident (especially if the confidence interval is wide). Therefore, it is important to keep confidence intervals rather narrow. This means making relatively larger studies.
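The experiment behind Figure 2 is easy to reproduce. The following is a minimal simulation sketch in Python with numpy; the population parameters chosen here are hypothetical, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
true_diff, sd, n, reps = 0.15, 0.5, 60, 100  # hypothetical population values

covered = 0
for _ in range(reps):
    # one 'study': difference in means between two random samples of size n
    a = rng.normal(0.0, sd, n)
    b = rng.normal(true_diff, sd, n)
    d = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if d - 1.96 * se <= true_diff <= d + 1.96 * se:
        covered += 1

print(covered, "of", reps, "95% CIs include the true difference")  # approx. 95
```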
When the confidence intervals are aligned according to their average (Figure 2C), the variation in relation to the real value in the population becomes even clearer. If the simulation is carried out on a much larger scale, the probability distribution of the true difference in the population, given the result of a study sample, follows a normal distribution, as shown in Figure 3.2
Figure 3 A) Histogram showing the distribution of the true difference in the population relative to the difference D found in the study sample (computer simulation of 10,000 samples). B) The normal probability curve of the true difference in the population relative to the difference D found in a study sample. The 95% confidence interval (95% CI) is shown
One can see that the probability of the actual difference in the population is highest at the difference D found in the sample, and decreases for higher and lower values. The figure also illustrates the 95% confidence interval, which is the interval that includes the middle 95% of the total area below the normal probability curve. This interval can be calculated from the difference D and its standard error (SED). To be more certain that the real difference is included, a 99% confidence interval can be calculated instead; it would be broader, since it must include the middle 99% of the total probability area.
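In code, the interval is simply D ± Z × SED. A minimal sketch (the function name is ours), anticipating the numbers of case 1 below:

```python
from scipy.stats import norm

def conf_int(d, sed, level=0.95):
    """Two-sided confidence interval for a difference D with standard error SED."""
    z = norm.isf((1 - level) / 2)
    return d - z * sed, d + z * sed

print(conf_int(0.15, 0.09))              # 95% CI: approx. (-0.026, 0.326)
print(conf_int(0.15, 0.09, level=0.99))  # 99% CI is broader: approx. (-0.082, 0.382)
```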
If the 95% confidence interval of the difference includes zero, there is no significant difference in effect between the two therapies. However, this does not mean that the effects of the therapies can be concluded to be the same. There may still be a real difference in effect between the therapies that the RCT has not been able to detect, due to insufficient sample size or power.
The risk of overlooking some difference in the effect of Δ between therapies is the risk of making an error type 2 or β. In some cases, this risk can be substantial. An example of this pattern is shown in case 1.
Case 1. It is known that, in previously untreated cases of myocarditis due to Coxsackie virus B, genotype 1, treatment with interferon and ribavirin for three months induces a sustained virologic response in about 40%. We want to test whether a new treatment regimen can increase the sustained response in these patients to 60%, with a power (1 − β) of 80%. The risk of type 1 error (2α) should be 5%. The number of patients required for this study must be calculated. For a comparison of proportions, as in this study, the variance of the difference (S2) is equal to p1(1 − p1) + p2(1 − p2), where p1 and p2 are the proportions with response in the comparison groups. Thus we have:
Using N = (Z2α + Zβ)2 × [p1(1 − p1) + p2(1 − p2)] / Δ2 one obtains:

N = (1.96 + 0.84)2 × (0.40 × 0.60 + 0.60 × 0.40) / 0.202 = 7.84 × 0.48 / 0.04 ≈ 94
Therefore, the required number of patients (2N) would be 188.
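The same figure can be reproduced step by step (an illustrative sketch; the variable names are ours):

```python
p1, p2 = 0.40, 0.60                 # expected response proportions
delta = p2 - p1                     # Delta = 0.20
s2 = p1 * (1 - p1) + p2 * (1 - p2)  # variance of the difference = 0.48
z_2a, z_b = 1.96, 0.84              # 2alpha = 0.05, beta = 0.20 (Table I)

n = (z_2a + z_b) ** 2 * s2 / delta ** 2
print(round(n), 2 * round(n))       # 94 per group, 188 in total
```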
However, due to various difficulties, only 120 patients (60 in each group) could be recruited. Solving the general sample size formula for Zβ gives:

Zβ = Δ × √(N / S2) − Z2α

Using this formula, the power of the test with the smaller number of patients can be estimated as follows:

Zβ = 0.20 × √(60 / 0.48) − 1.96 = 0.20 × 11.18 − 1.96 = 0.28
Using the right side of Table I with interpolation, β becomes 0.39. So with this limited number of patients, the power 1 − β is now only 0.61 or 61% (a clear reduction from the intended 80%). This seriously reduced power markedly decreases the chances of demonstrating a significant treatment effect. A post hoc power calculation such as this can be used to explain why a superiority study is inconclusive, but it can never be used to support a claim of equivalence based on the negative result of a superiority study.
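The post hoc power calculation can be verified as follows (a sketch; norm.sf gives the one-tailed probability beyond Zβ, replacing the table lookup):

```python
import math
from scipy.stats import norm

n, s2, delta, z_2a = 60, 0.48, 0.20, 1.96
z_b = delta * math.sqrt(n / s2) - z_2a  # sample size formula solved for Z_beta
beta = norm.sf(z_b)                     # one-tailed probability beyond Z_beta
print(round(z_b, 2), round(beta, 2), round(1 - beta, 2))  # 0.28 0.39 0.61
```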
The result of the study was as follows: a sustained virological response was found in 26 out of 60 patients (0.43 or 43%) in the control group and in 35 out of 60 (0.58 or 58%) in the new therapy group. The difference D is 0.15 or 15%, but it is not statistically significant (p > 0.10). A simple approximate formula for the standard error of the difference is:

SED = √[p1(1 − p1)/N1 + p2(1 − p2)/N2] = √(0.43 × 0.57/60 + 0.58 × 0.42/60) ≈ 0.09
The 95% confidence interval for D is D ± Z2α × SED = 0.15 ± 1.96 × 0.09, or −0.026 to 0.326 (−2.6% to 32.6%), which is quite broad, since it includes both zero and Δ. The risk of a type 2 error for a 20% effect (corresponding to Δ) can be estimated as follows:

Zβ = (Δ − D) / SED = (0.20 − 0.15) / 0.09 ≈ 0.56
Using the right side of Table I with interpolation, β becomes 0.29. Thus, the risk of having overlooked an effect of 20% is 29%. This is a consequence of the small number of patients and the reduced power of the study. This situation corresponds to that shown in Figure 4. As seen in the figure, a negative RCT such as this does not rule out that the real difference may be as large as Δ, since the risk of type 2 error (β) of overlooking an effect of Δ is substantial.
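For completeness, the whole analysis of case 1 can be verified in one sketch (exact values differ slightly from the article's, which rounds SED to 0.09):

```python
import math
from scipy.stats import norm

p1, n1 = 26 / 60, 60  # control group: 0.43
p2, n2 = 35 / 60, 60  # new therapy group: 0.58
d = p2 - p1           # D = 0.15
sed = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(round(sed, 2))  # approx. 0.09

# 95% CI: approx. -0.027 to 0.327 (-0.026 to 0.326 with SED rounded to 0.09)
print(round(d - 1.96 * sed, 3), round(d + 1.96 * sed, 3))

z_b = (0.20 - d) / sed         # risk of missing a Delta = 0.20 effect
print(round(norm.sf(z_b), 2))  # beta approx. 0.29
```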
Figure 4 Illustration of the risk β of type 2 error (gray area) in an RCT showing a difference D in effect that is not significant, since the zero (0) difference lies between the lower limit (L) and the upper limit (U) of the 95% confidence interval. The risk of a type 2 error for the value Δ is substantial
The purpose of an equivalence study is to establish that the effects of the compared treatments are the same.12-17 Exactly equivalent effects would mean a Δ of zero; as seen from the sample size formula above, this would imply division by zero, which is not possible, and division by a very small Δ would result in an unrealistically large sample size. Therefore, as a manageable compromise, the goal of an equivalence study is to determine whether the difference between the effects of the two therapies lies within a small range from −Δ to +Δ.
An equivalence study is relevant if the new therapy is simpler, associated with fewer side effects, or less expensive, even if it is not expected to have a greater therapeutic effect than the control treatment. It is crucial to specify an appropriate size for Δ.14,17
This is not simple. One must decide how much poorer an effect of the new therapy, relative to the control therapy, one is willing to accept. Therefore, the value of Δ must be specified as small, and in any case smaller than the smallest value that could represent a clinically significant difference. As a rule, Δ should be specified at no more than half the value that would be used in a superiority study.13 Equivalence between the therapies is shown if the confidence interval for the difference in effect between them lies entirely between −Δ and +Δ.13 Figure 5 illustrates the conclusions that can be drawn from the position of the confidence limits for the difference in effect found in a study.
Figure 5 Examples of observed treatment differences (new therapy minus control therapy) with 95% confidence intervals and the conclusions that can be drawn. A) The new therapy is significantly better than the control treatment; however, the magnitude of the effect may not be clinically important. B-D) The therapies may be considered of equivalent effect. E-F) The result is inconclusive. G) The new therapy is significantly worse than the control treatment, but the magnitude of the difference may not be clinically important. H) The new therapy is significantly worse than the control treatment.
It is crucial to understand that in an equivalence study the roles of the null and alternative hypotheses are reversed: the relevant null hypothesis is that a difference of at least Δ exists, and the aim of the study is to disprove this in favor of the alternative hypothesis that there is no difference.13 Although this is the mirror image of the situation in a superiority study, the method for estimating the sample size turns out to be similar in the two types of analysis, although Δ has different meanings in superiority and equivalence studies.
Case 2. In the same patients as described in case 1, one wants to compare, in an RCT of therapeutic equivalence, the current regimen of interferon and ribavirin (with a sustained response of 40%) with a new low-cost therapeutic regimen with fewer side effects.
The number of patients required for this study must be calculated. The power (1 − β) of the test should be 80%, and the risk of type 1 error (2α) should be 5%. The therapies are considered equivalent if the confidence interval for the difference in the proportion with sustained response falls entirely within the range of ±0.10 (±10%). Therefore, Δ is specified at 0.10. So we have:
Using the same expression for the variance of the difference (S2) as in case 1, the result is:

N = (1.96 + 0.84)2 × (0.40 × 0.60 + 0.40 × 0.60) / 0.102 = 7.84 × 0.48 / 0.01 ≈ 376
Therefore, the required number of patients (2N) would be 752.
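The calculation behind this figure, as a sketch:

```python
p = 0.40               # expected response under both regimens
s2 = 2 * p * (1 - p)   # variance of the difference = 0.48
delta, z_2a, z_b = 0.10, 1.96, 0.84

n = (z_2a + z_b) ** 2 * s2 / delta ** 2
print(round(n), 2 * round(n))  # 376 per group, 752 in total
```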
The study was carried out, and a sustained virological response was found in 145 out of 372 patients (0.39 or 39%) in the control group and in 156 out of 380 (0.41 or 41%) in the new therapy group. The difference D was 0.02 or 2%, not statistically significant (p > 0.50). The standard error of the difference was:

SED = √(0.39 × 0.61/372 + 0.41 × 0.59/380) ≈ 0.036
The 95% confidence interval for D was D ± Z2α × SED = 0.02 ± 1.96 × 0.036, or −0.050 to 0.091 (−5.0% to 9.1%). Since this confidence interval falls entirely within the specified interval for Δ (−0.1 to 0.1), the effects of the two therapies were considered equivalent. This situation corresponds to B or C in Figure 5.
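The equivalence decision rule (both confidence limits within ±Δ) can be checked as follows (a sketch; small rounding differences from the article's figures are expected):

```python
import math

p1, n1 = 145 / 372, 372  # control group: 0.39
p2, n2 = 156 / 380, 380  # new therapy group: 0.41
d = p2 - p1
sed = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = d - 1.96 * sed, d + 1.96 * sed
delta = 0.10
print(round(lo, 3), round(hi, 3))  # approx. -0.049, 0.091
print(-delta < lo and hi < delta)  # True -> equivalent within +/-Delta
```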
As in this example, the sample size required in an equivalence study will often be at least four times that of a corresponding superiority study. Therefore, the resources needed will be greater.
The non-inferiority study, which is related to the equivalence study, is not intended to show equivalence, but only to show that the new therapy is not worse than the reference therapy. Thus, the non-inferiority study is designed to show that the difference in effect (new therapy minus control therapy) is not below −Δ. Non-inferiority of the new therapy is demonstrated if the lower confidence limit for the difference in effect between the therapies lies above −Δ; the position of the upper confidence limit is not of primary interest. The non-inferiority study is therefore designed as a one-sided study, and for that reason the number of patients needed is always smaller than for a corresponding equivalence study, as illustrated in the following case.
Case 3. We want to carry out the study described in case 2, not as an equivalence study but as a non-inferiority study. Thus, the test should be one-sided instead of two-tailed. The only difference is that Zα should be used instead of Z2α. For α = 0.05 we obtain Zα = 1.64 (right side of Table I). Thus we obtain:

N = (1.64 + 0.84)2 × 0.48 / 0.102 = 6.15 × 48 ≈ 295
Therefore the required number of patients (2N) would be 590.
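Again as a sketch, the only change from case 2 is the one-sided Zα:

```python
s2, delta = 0.48, 0.10
z_a, z_b = 1.64, 0.84  # one-sided alpha = 0.05, so Z_alpha replaces Z_2alpha

n = (z_a + z_b) ** 2 * s2 / delta ** 2
print(round(n), 2 * round(n))  # 295 per group, 590 in total
```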
The study was carried out, and a sustained virologic response was found in 114 out of 292 patients (0.39 or 39%) in the control group and in 125 out of 298 (0.42 or 42%) in the new therapy group. The difference D is 0.03 or 3%, not statistically significant (p > 0.50). The standard error of the difference is:

SED = √(0.39 × 0.61/292 + 0.42 × 0.58/298) ≈ 0.040
The lower one-sided 95% confidence limit is D − Zα × SED = 0.03 − 1.64 × 0.040 = −0.036 (−3.6%). Since this lower confidence limit is above the specified limit of −Δ = −0.1, the effect of the new therapy is not inferior to that of the control therapy. If the two-tailed 95% confidence interval is estimated instead (which some recommend even for a non-inferiority study),18 we obtain D ± Z2α × SED = 0.03 ± 1.96 × 0.040, or −0.048 to 0.108 (−4.8% to 10.8%). The lower confidence limit is still above −0.1, but the upper confidence limit is above 0.1 (the upper limit of equivalence; see case 2). Therefore, the new therapy may even be slightly better than the control treatment. The risk of a type 2 error of overlooking an effect of 0.1 (10%) can be estimated as follows: Zβ = (Δ − D) / SED = (0.10 − 0.03) / 0.04 = 1.75. Using the right side of Table I with interpolation, β becomes 0.04, i.e., a rather small risk.
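The non-inferiority decision rule (one-sided lower confidence limit above −Δ) in sketch form:

```python
import math

p1, n1 = 114 / 292, 292  # control group: 0.39
p2, n2 = 125 / 298, 298  # new therapy group: 0.42
d = p2 - p1
sed = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lower = d - 1.64 * sed   # one-sided 95% lower confidence limit
delta = 0.10
# approx. -0.037 (article: -0.036 with rounded D and SED); True -> not inferior
print(round(lower, 3), lower > -delta)
```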
Since the objective of an equivalence or non-inferiority study is different, there is not the same incentive to eliminate factors that could obscure a difference between treatments as there is in a superiority study. Thus, in some cases a finding of equivalence may be due to deficiencies of the study, such as a small sample size, lack of double-blinding, lack of concealed randomization, incorrect drug dosages, the effects of concomitant medication, or the spontaneous recovery of patients without medical intervention.19
Both the equivalence study and the non-inferiority study should reflect as closely as possible the methods used in the superiority studies that previously evaluated the effect of the control therapy versus placebo. In particular, it is important that the inclusion and exclusion criteria defining the patient population, the blinding, the randomization, the dosing schedule of the standard treatment, the use of concomitant medication (and other interventions), the primary response variable, and the timing of measurements are the same as in the previous superiority studies that evaluated the reference therapy used in the comparison. In addition, patient compliance, the duration of follow-up, and the magnitude of patient losses (and the reasons for them) should not differ from those of the previous superiority studies.
An important point in the analysis of equivalence and non-inferiority studies lies in the choice between intention-to-treat and per protocol analysis. In a superiority study, in which the goal is to decide whether two treatments are different, intention-to-treat analysis is generally conservative: the inclusion of protocol violators and withdrawals usually tends to make the results of the two treatment groups more similar. However, for an equivalence or non-inferiority study this effect is no longer conservative: any blurring of the difference between the treatment groups will increase the likelihood of finding equivalence or non-inferiority.
A per protocol analysis compares patients according to the treatment actually received, including only those who met the entry criteria and correctly followed the protocol. In a superiority study, this approach may tend to enhance any difference between the treatments rather than diminish it, since the noise from protocol violators and withdrawals is removed. In an equivalence or non-inferiority study, both types of analysis should be performed, and equivalence or non-inferiority can be established only if both analyses support it. To ensure the best possible quality of analysis, it is important to collect complete follow-up data on all randomized patients, regardless of whether they are later found to have failed the entry criteria, stop the study medication prematurely, or violate the protocol in some other way.20 Such a rigorous approach to data collection allows maximum flexibility for subsequent analyses and therefore provides a more solid basis for decisions.
The most common problem in reported equivalence or non-inferiority studies is that they are planned and analyzed as if they were superiority studies, and the lack of a statistically significant difference is taken as evidence of equivalence.7-10 There is therefore a need for a better understanding of how equivalence and non-inferiority studies should be planned, analyzed, and reported.
A recent study reported on the quality of published equivalence studies.21 One of the findings was that some studies had been planned as superiority studies but were presented as equivalence studies after failing to demonstrate superiority, without including a margin of equivalence. The study also found that one-third of the reports that included a sample size calculation omitted elements necessary to reproduce it; one-third described a confidence interval whose size was not in accordance with the type 1 error risk used in the sample size calculation; and half used statistical tests that did not take the margin into account. In addition, only 20% of the studies surveyed met the four basic requirements: a defined equivalence margin, a sample size calculation based on that margin, both intention-to-treat and per protocol analyses, and a confidence interval for the result. Only 4% of the studies justified the margin used, which is essential.
An extension of the CONSORT Statement on the reporting of RCTs addresses equivalence and non-inferiority studies.1,18,22-24 It includes the description of the reasons for adopting an equivalence or non-inferiority design, how the study hypotheses were incorporated into the design, the choice of participants, interventions (especially the reference treatment) and outcomes, the statistical methods (including the calculation of sample size), and the way in which the design affects interpretation and conclusions.18

Conflict of interest statement: The authors have completed and submitted the form translated into Spanish for the declaration of potential conflicts of interest of the International Committee of Medical Journal Editors, and none were reported in relation to this article.