Equivalence and agreement in validation studies: A practical methodological review

Silvina Dell’Era https://orcid.org/0000-0001-9186-6229
Vanina Pagotto https://orcid.org/0000-0003-0309-2660

Keywords

Methods, Validation Study, Data Accuracy, Evaluation Study, Data Analysis

Abstract

Accurate statistical analysis is essential in validation studies of instruments that measure continuous variables against a reference standard. This article describes statistical approaches for evaluating equivalence between measurement instruments, combining graphical methods with statistical tests. The approaches are illustrated with a study that assessed the accuracy of a physical activity tracking wristband (Xiaomi Mi Band 4) for counting steps during different activities in patients with chronic respiratory diseases, using a video-based reference method as the comparator. Confidence intervals were compared against predefined equivalence zones, two one-sided tests (TOST) were applied, and both group-level and individual-level indicators of agreement were calculated, including the mean error (ME), mean percentage error (MPE), mean absolute percentage error (MAPE), and root mean squared error (RMSE). Common errors are also discussed, such as the inappropriate use of scatter plots or correlation coefficients to assess accuracy. The article concludes that selecting appropriate statistical methods is key to ensuring clinical and methodological validity in studies of equivalence between instruments that measure continuous variables and a reference method.
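As an illustration of the indicators named above, the following minimal Python sketch computes ME, MPE, MAPE, and RMSE for paired measurements and runs a paired TOST against a symmetric equivalence margin. It is not the authors' analysis code. The function names, the sign convention (device minus reference), the example step counts, and the margin of 10 steps are all hypothetical. The TOST uses a large-sample normal approximation so that only the standard library is needed; with small samples, a t distribution would be the more usual choice.

```python
import math
from statistics import NormalDist, mean, stdev


def agreement_metrics(device, reference):
    """Return ME, MPE, MAPE and RMSE for paired measurements.

    Assumed convention: error = device - reference, and percentage
    errors are expressed relative to the reference value.
    """
    errors = [d - r for d, r in zip(device, reference)]
    pct_errors = [100 * e / r for e, r in zip(errors, reference)]
    return {
        "ME": mean(errors),                      # signed bias, original units
        "MPE": mean(pct_errors),                 # signed bias, percent
        "MAPE": mean(abs(p) for p in pct_errors),  # individual-level error, percent
        "RMSE": math.sqrt(mean(e * e for e in errors)),
    }


def tost_paired(device, reference, margin, alpha=0.05):
    """Paired TOST with equivalence zone (-margin, +margin).

    Normal approximation (an assumption made here for simplicity).
    """
    diffs = [d - r for d, r in zip(device, reference)]
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    z = NormalDist()
    # H0 (lower): true mean difference <= -margin
    p_lower = 1 - z.cdf((d_bar + margin) / se)
    # H0 (upper): true mean difference >= +margin
    p_upper = z.cdf((d_bar - margin) / se)
    p_value = max(p_lower, p_upper)
    # The (1 - 2*alpha) confidence interval gives the same decision:
    # equivalence is declared when it lies inside the equivalence zone.
    half_width = z.inv_cdf(1 - alpha) * se
    return {
        "p_value": p_value,
        "ci": (d_bar - half_width, d_bar + half_width),
        "equivalent": p_value < alpha,
    }


# Hypothetical step counts: wristband vs. video-based reference.
device_steps = [98, 102, 95, 110, 100, 97, 105, 99]
reference_steps = [100, 100, 100, 100, 100, 100, 100, 100]

metrics = agreement_metrics(device_steps, reference_steps)
result = tost_paired(device_steps, reference_steps, margin=10)
print(metrics)
print(result)
```

In this sketch ME and MPE describe group-level bias, where over- and under-counts can cancel out, while MAPE and RMSE reflect the size of individual errors regardless of sign. The equivalence margin has to be defined before the analysis on clinical or methodological grounds; the value used here is only a placeholder.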
