{"id":2665,"date":"2023-02-23T23:28:04","date_gmt":"2023-02-24T05:28:04","guid":{"rendered":"https:\/\/baylor.ai\/?p=2665"},"modified":"2023-02-24T13:19:43","modified_gmt":"2023-02-24T19:19:43","slug":"how-to-show-that-your-model-is-better-a-step-by-step-guide-to-statistical-hypothesis-testing","status":"publish","type":"post","link":"https:\/\/lab.rivas.ai\/?p=2665","title":{"rendered":"How to Show that Your Model is Better: A Basic Guide to Statistical Hypothesis Testing"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Do you need help determining which machine learning model is superior? This post presents a step-by-step guide using basic statistical techniques and a real case study! \ud83e\udd16\ud83d\udcc8 #AIOrthoPraxy #MachineLearning #Statistics #DataScience<\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"637\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-11-1024x637.png\" alt=\"\" class=\"wp-image-2708\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-11-1024x637.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-11-300x187.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-11-768x478.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-11-863x537.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-11-174x108.png 174w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-11.png 1254w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">When employing Machine Learning to address problems, our choice of a model plays a crucial role. 
Evaluating models can be straightforward when performance disparities are substantial, for example, when comparing two large language models (LLMs) on a masked language modeling (MLM) task with perplexities of 71.01 and 28.56, respectively. However, when the differences among models are small, a sound analysis is needed to discern whether one model is genuinely superior to the others, and that can prove challenging.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This tutorial presents a step-by-step guide to determining whether one model is superior to another. Our approach relies on basic statistical techniques and real datasets. Our study compares four models on six datasets using one metric, standard accuracy; other contexts may use different numbers of models, metrics, or datasets. We will work with the tables below, which show the properties of the datasets and the performance of two baseline models and two of our proposed models. Our hypothesis to be tested is that the proposed models are better.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-9-1024x236.png\" alt=\"\" class=\"wp-image-2689\" width=\"709\" height=\"163\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9-1024x236.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9-300x69.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9-768x177.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9-1536x354.png 1536w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9-863x199.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9-469x108.png 469w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-9.png 1796w\" sizes=\"auto, (max-width: 709px) 100vw, 709px\" \/><figcaption class=\"wp-element-caption\">Summary of performance measured with standard 
accuracy<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-5-1024x244.png\" alt=\"\" class=\"wp-image-2680\" width=\"584\" height=\"139\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5-1024x244.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5-300x71.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5-768x183.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5-1536x365.png 1536w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5-863x205.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5-454x108.png 454w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-5.png 1900w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><figcaption class=\"wp-element-caption\">Summary of the main properties of the datasets considered in this tutorial.<\/figcaption><\/figure>\n\n\n\n<p>\nOne of the primary purposes of statistics is hypothesis testing. Statistical inference involves taking a sample from a population and determining how well the sample represents the population. 
In hypothesis testing, we formulate a null hypothesis, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-a4fc152da9c0802275c766010d183a54_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#72;&#95;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"22\" style=\"vertical-align: -3px;\"\/>, and an alternative hypothesis, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-51f986d4b344b048e3204caeef9c1839_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#72;&#95;&#65;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"25\" style=\"vertical-align: -3px;\"\/>, based on the problem (comparing models). Both hypotheses must be concise, mutually exclusive, and exhaustive. For example, we could say that our null hypothesis is that <em>the models perform equally<\/em>, and the alternative could mean that <em>the models perform differently<\/em>.\n<\/p>\n\n\n\n\n<h4 class=\"wp-block-heading\">Why is the ANOVA test not a good alternative?<\/h4>\n\n\n\n<p>\nThe ANOVA (Analysis of Variance) test is a parametric test that compares the means of multiple groups. In our case, we have four models to compare with six datasets. The null hypothesis for ANOVA is that all <em>the means are equal<\/em>, and the alternative hypothesis is that <em>at least one of the means is different<\/em>. 
If the <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-e88067785874f6c6d6c6162b76fdeee7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#112;&#45;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"23\" style=\"vertical-align: -4px;\"\/>value of the ANOVA test is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that at least one of the means is different, i.e., at least one model performs differently than the others. However, ANOVA may not always be the best choice for comparing the performance of different models.\n<\/p>\n<p>\nOne reason for this is that ANOVA assumes that the data follows a normal distribution, which may not always be the case for real-world data. Additionally, ANOVA does not take into account the difficulty of classifying certain data points. For example, in a dataset with a single numerical feature and binary labels, all models may achieve 100% accuracy on the training data. However, if the test set contains some mislabeled points, the models may perform differently. In this scenario, ANOVA would not be appropriate because it does not account for the difficulty of classifying certain data points.\n<\/p>\n<p>\nAnother issue with ANOVA is that it assumes that the variances of the groups being compared are equal. This assumption may not hold for datasets with different levels of noise or variability. In such cases, alternative statistical tests like the Friedman test or the Nemenyi test may be more appropriate.\n<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Friedman test<\/h4>\n\n\n\n<p>\nThe Friedman test is a non-parametric test that compares multiple models. 
In our example, we want to compare the performance of <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d834642aefb9c5f91547741a6d8377ad_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#107;&#61;&#52;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"42\" style=\"vertical-align: 0px;\"\/> different models, namely two baseline models (Glorot normal and Glorot uniform), Gabor randomized, and Gabor repeated, on <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-4b46308f4b39d89a8e494591f92a7743_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#78;&#61;&#54;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"49\" style=\"vertical-align: 0px;\"\/> datasets. First, the test ranks the models on each dataset, with the best-performing model receiving a rank of 1 and tied models sharing the average of their ranks, and then averages each model&#8217;s ranks across all datasets. The Friedman test then evaluates the null hypothesis, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-a4fc152da9c0802275c766010d183a54_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#72;&#95;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"22\" style=\"vertical-align: -3px;\"\/>, <em>that all models are equally effective<\/em>, in which case their average ranks should be equal. 
The test statistic is calculated as follows: \n<p class=\"ql-center-displayed-equation\" style=\"line-height: 64px;\"><span class=\"ql-right-eqno\"> (1) <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-e2518869970055a72a2524df63957827_l3.png\" height=\"64\" width=\"282\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#98;&#101;&#103;&#105;&#110;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125; &#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#50;&#32;&#78;&#125;&#123;&#107;&#40;&#107;&#43;&#49;&#41;&#125;&#92;&#108;&#101;&#102;&#116;&#91;&#92;&#115;&#117;&#109;&#95;&#123;&#106;&#61;&#49;&#125;&#94;&#123;&#107;&#125;&#32;&#82;&#95;&#123;&#106;&#125;&#94;&#123;&#50;&#125;&#45;&#92;&#102;&#114;&#97;&#99;&#123;&#107;&#40;&#107;&#43;&#49;&#41;&#94;&#123;&#50;&#125;&#125;&#123;&#52;&#125;&#92;&#114;&#105;&#103;&#104;&#116;&#93; &#92;&#101;&#110;&#100;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\nwhere <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-dae6bae3dcdac4629730754352c5e329_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#82;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"14\" style=\"vertical-align: 0px;\"\/> is the average ranking of each model.\n<\/p>\n<p>\nThe test result can be used to determine whether there is a statistically significant difference between the performance of the models by making sure that <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-b45c3b6e6cfdfef39219e916a1249b29_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;\" 
title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"22\" style=\"vertical-align: -5px;\"\/> is not less than the critical value for the <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-2510519bbe1660dfdffb4195c7287343_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#70;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"14\" style=\"vertical-align: 0px;\"\/> distribution at the chosen significance level <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-8f0b6b1a01f8fcc2f95be0364c090397_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#97;&#108;&#112;&#104;&#97;\" title=\"Rendered by QuickLaTeX.com\" height=\"8\" width=\"11\" style=\"vertical-align: 0px;\"\/>. However, since <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-b45c3b6e6cfdfef39219e916a1249b29_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"22\" style=\"vertical-align: -5px;\"\/> can be overly conservative, we also calculate the <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-b6dc7df4c1d17b62573b3463ef749fc0_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#70;&#95;&#70;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"22\" style=\"vertical-align: -3px;\"\/> statistic as follows:\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 45px;\"><span class=\"ql-right-eqno\"> (2) <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-7ce8de30f9d84c15ab87823cdf82b3e3_l3.png\" height=\"45\" width=\"169\" 
class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#98;&#101;&#103;&#105;&#110;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125; &#70;&#95;&#123;&#70;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#40;&#78;&#45;&#49;&#41;&#32;&#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;&#125;&#123;&#78;&#40;&#107;&#45;&#49;&#41;&#45;&#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;&#125;&#46; &#92;&#101;&#110;&#100;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\nBased on the critical value, <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-7943788e853c1ceb2a6a5762ba71521c_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#70;&#95;&#123;&#70;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"22\" style=\"vertical-align: -3px;\"\/>, and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-b45c3b6e6cfdfef39219e916a1249b29_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"20\" width=\"22\" style=\"vertical-align: -5px;\"\/>, we evaluate <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-a4fc152da9c0802275c766010d183a54_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#72;&#95;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"22\" style=\"vertical-align: -3px;\"\/>; \nonce the null hypothesis is rejected, we apply a posthoc test. For this, we use the Nemenyi test to establish whether models differ significantly in their performance.\n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We will start the process of getting this test done by ranking the data. 
First, we can load the data and verify it with respect to the table shown earlier.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\nimport numpy as np\n\ndata = &#91;&#91;0.8937, 0.8839, 0.9072, 0.9102],\n        &#91;0.8023, 0.8024, 0.8229, 0.8238],\n        &#91;0.7130, 0.7132, 0.7198, 0.7206],\n        &#91;0.5084, 0.5085, 0.5232, 0.5273],\n        &#91;0.2331, 0.2326, 0.3620, 0.3952],\n        &#91;0.5174, 0.5175, 0.5307, 0.5178]]\n\nmodel_names = &#91;'Glorot N.', 'Glorot U.', 'Random G.', 'Repeated G.']\n\ndf = pd.DataFrame(data, columns=model_names)\n\nprint(df.describe())  #&lt;- use averages to verify if matches table<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>       Glorot N.  Glorot U.  Random G.  Repeated G.\ncount   6.000000   6.000000   6.000000     6.000000\nmean    0.611317   0.609683   0.644300     0.649150\nstd     0.240422   0.238318   0.206871     0.200173\nmin     0.233100   0.232600   0.362000     0.395200\n25%     0.510650   0.510750   0.525075     0.520175\n50%     0.615200   0.615350   0.625250     0.623950\n75%     0.779975   0.780100   0.797125     0.798000\nmax     0.893700   0.883900   0.907200     0.910200<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we rank the models and get their averages like so:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>data = df.rank(1, method='average', ascending=False)\nprint(data)\nprint(data.describe())<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>   Glorot N.  Glorot U.  Random G.  Repeated G.\n0        3.0        4.0        2.0          1.0\n1        4.0        3.0        2.0          1.0\n2        4.0        3.0        2.0          1.0\n3        4.0        3.0        2.0          1.0\n4        3.0        4.0        2.0          1.0\n5        4.0        3.0        1.0          2.0\n\n       Glorot N.  Glorot U.  Random G.  
Repeated G.\ncount   6.000000   6.000000   6.000000     6.000000\nmean    3.666667   3.333333   1.833333     1.166667\nstd     0.516398   0.516398   0.408248     0.408248\nmin     3.000000   3.000000   1.000000     1.000000\n25%     3.250000   3.000000   2.000000     1.000000\n50%     4.000000   3.000000   2.000000     1.000000\n75%     4.000000   3.750000   2.000000     1.000000\nmax     4.000000   4.000000   2.000000     2.000000<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">With this information, we can expand our initial results table to show the rankings by dataset and the average rankings across all datasets for each model.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"224\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-10-1024x224.png\" alt=\"\" class=\"wp-image-2696\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10-1024x224.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10-300x66.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10-768x168.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10-1536x336.png 1536w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10-863x189.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10-480x105.png 480w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-10.png 1966w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>\nNow that we have the rankings, we can proceed with the statistical analysis and do the following:\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 67px;\"><span class=\"ql-right-eqno\"> (3) <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-b7fd261f286621cdf8d57b028898841d_l3.png\" height=\"67\" width=\"409\" 
class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#98;&#101;&#103;&#105;&#110;&#123;&#97;&#108;&#105;&#103;&#110;&#42;&#125; &#92;&#99;&#104;&#105;&#95;&#123;&#70;&#125;&#94;&#123;&#50;&#125;&#38;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#49;&#50;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#54;&#125;&#123;&#52;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#53;&#125;&#92;&#108;&#101;&#102;&#116;&#91;&#92;&#108;&#101;&#102;&#116;&#40;&#51;&#46;&#54;&#54;&#94;&#50;&#43;&#51;&#46;&#51;&#51;&#94;&#50;&#43;&#49;&#46;&#56;&#51;&#94;&#50;&#43;&#49;&#46;&#49;&#54;&#94;&#50;&#92;&#114;&#105;&#103;&#104;&#116;&#41;&#45;&#92;&#102;&#114;&#97;&#99;&#123;&#52;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#53;&#94;&#50;&#125;&#123;&#52;&#125;&#92;&#114;&#105;&#103;&#104;&#116;&#93;&#32;&#92;&#110;&#111;&#110;&#117;&#109;&#98;&#101;&#114;&#32;&#92;&#92; &#38;&#61;&#49;&#53;&#46;&#51;&#54;&#52;&#32;&#92;&#110;&#111;&#110;&#117;&#109;&#98;&#101;&#114;&#32; &#92;&#101;&#110;&#100;&#123;&#97;&#108;&#105;&#103;&#110;&#42;&#125;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 37px;\"><span class=\"ql-right-eqno\"> (4) <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-50355cf67311cf809c4017704b4a564f_l3.png\" height=\"37\" width=\"225\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#98;&#101;&#103;&#105;&#110;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125; &#70;&#95;&#123;&#70;&#125;&#61;&#92;&#102;&#114;&#97;&#99;&#123;&#53;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#49;&#53;&#46;&#51;&#54;&#52;&#125;&#123;&#54;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#51;&#45;&#49;&#53;&#46;&#51;&#54;&#52;&#125;&#61;&#50;&#57;&#46;&#49;&#52;&#51;&#32;&#92;&#110;&#111;&#110;&#117;&#109;&#98;&#101;&#114; &#92;&#101;&#110;&#100;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125;\" 
title=\"Rendered by QuickLaTeX.com\"\/><\/p>\nThe critical value of the F distribution at <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-37cbad7dd6cbfce0029e7398590b30b4_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#48;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"66\" style=\"vertical-align: 0px;\"\/> is 5.417. Because the statistics we obtained exceed this critical value, we reject <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-a4fc152da9c0802275c766010d183a54_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#72;&#95;&#48;\" title=\"Rendered by QuickLaTeX.com\" height=\"15\" width=\"22\" style=\"vertical-align: -3px;\"\/> with 99% confidence.\n<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The critical value can be obtained from any <a href=\"http:\/\/www.socr.ucla.edu\/Applets.dir\/F_Table.html\">table of the F distribution<\/a>. 
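As a cross-check on the hand calculation, the following is a minimal sketch in plain Python that recomputes the average ranks, the Friedman statistic of equation (1), and the F statistic of equation (2) from the accuracy table used in this post. The helper names <code>rank_row<\/code> and <code>friedman<\/code> are illustrative assumptions and not part of any library, and the critical value 5.417 is copied from the text rather than computed. Because the sketch keeps the average ranks unrounded, its statistics can differ slightly from the hand-computed values.

```python
# Accuracy table from this post: one row per dataset, one column per model
# (Glorot N., Glorot U., Random G., Repeated G.).
accuracies = [
    [0.8937, 0.8839, 0.9072, 0.9102],
    [0.8023, 0.8024, 0.8229, 0.8238],
    [0.7130, 0.7132, 0.7198, 0.7206],
    [0.5084, 0.5085, 0.5232, 0.5273],
    [0.2331, 0.2326, 0.3620, 0.3952],
    [0.5174, 0.5175, 0.5307, 0.5178],
]


def rank_row(row):
    """Rank one dataset's scores (hypothetical helper).

    The highest score gets rank 1; tied scores share the average rank.
    """
    order = sorted(row, reverse=True)
    ranks = []
    for value in row:
        first = order.index(value) + 1         # best position held by this value
        ties = order.count(value)              # how many entries share the value
        ranks.append(first + (ties - 1) / 2)   # average rank over the tie group
    return ranks


def friedman(scores):
    """Return (average ranks, chi-square statistic, F statistic)."""
    n = len(scores)        # N: number of datasets
    k = len(scores[0])     # k: number of models
    ranked = [rank_row(row) for row in scores]
    avg = [sum(row[j] for row in ranked) / n for j in range(k)]
    # Equation (1): Friedman chi-square statistic from the average ranks.
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)
    # Equation (2): the less conservative F statistic.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return avg, chi2, f_stat


avg_ranks, chi2_f, f_f = friedman(accuracies)

# Critical value quoted in the text for alpha = 0.01 with df1 = 3 and df2 = 15.
CRITICAL_F = 5.417

print("average ranks:", [round(r, 3) for r in avg_ranks])
print("chi2_F:", round(chi2_f, 3))
print("F_F:", round(f_f, 3))
print("reject H0 at alpha = 0.01:", f_f > CRITICAL_F)
```

If SciPy is installed, <code>scipy.stats.friedmanchisquare<\/code> offers a library implementation of the chi-square form of the test; it expects one sequence of measurements per model.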
In an F-distribution table, the degrees of freedom listed across columns (denoted as <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-1d2e7e142a0ce1327582b5cb6361276d_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#100;&#102;&#95;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"21\" style=\"vertical-align: -4px;\"\/>) are <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-1a4c925a5ad5141d0726a09015ceadcd_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#107;&#45;&#49;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"39\" style=\"vertical-align: 0px;\"\/>, that is, the number of models minus one; the degrees of freedom listed down the rows (denoted as <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-1416fb0e5b5884c08c0d05834759783a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#100;&#102;&#95;&#50;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"22\" style=\"vertical-align: -4px;\"\/>) are <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-afc4a158f6b06019e14697ece2186a0a_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#40;&#107;&#45;&#49;&#41;&#92;&#116;&#105;&#109;&#101;&#115;&#40;&#78;&#45;&#49;&#41;\" title=\"Rendered by QuickLaTeX.com\" height=\"19\" width=\"135\" style=\"vertical-align: -5px;\"\/>, that is, the number of models minus one times the number of datasets minus one. 
In our case  this is <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-d0af8a723266a07fc441628a86186ce7_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#100;&#102;&#95;&#49;&#61;&#51;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"55\" style=\"vertical-align: -4px;\"\/> and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-c0e8f03596d4493d006ef62d8f86272b_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#100;&#102;&#95;&#50;&#61;&#49;&#53;\" title=\"Rendered by QuickLaTeX.com\" height=\"17\" width=\"63\" style=\"vertical-align: -4px;\"\/>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Nemenyi Test<\/h4>\n\n\n\n<p>\nThe Nemenyi test is a post-hoc test that compares multiple models after a significant result from Friedman&#8217;s test. The null hypothesis for Nemenyi is that <em>there is no difference between any two models<\/em>, and the alternative hypothesis is that <em>at least one pair of models is different<\/em>.\n<\/p>\n<p>\nThe formula for Nemenyi is as follows:\n\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 43px;\"><span class=\"ql-right-eqno\"> &nbsp; <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-31ddfa7633974f9a4c569740f85c93b8_l3.png\" height=\"43\" width=\"156\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#91;&#67;&#68;&#32;&#61;&#32;&#113;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#125;&#32;&#92;&#115;&#113;&#114;&#116;&#123;&#92;&#102;&#114;&#97;&#99;&#123;&#107;&#40;&#107;&#43;&#49;&#41;&#125;&#123;&#54;&#78;&#125;&#125;&#92;&#93;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\n\nwhere <img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-027e3d6cd537a86181b2483bb5054825_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#113;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"17\" style=\"vertical-align: -4px;\"\/> is the critical value, based on the Studentized range statistic divided by &#8730;2, at the chosen significance level and <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-3422b6bb5c160593658b7c39425d9880_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#107;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"9\" style=\"vertical-align: 0px;\"\/> is the number of groups (here, models). The <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-027e3d6cd537a86181b2483bb5054825_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#113;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#125;\" title=\"Rendered by QuickLaTeX.com\" height=\"12\" width=\"17\" style=\"vertical-align: -4px;\"\/> value can be obtained from the following table:\n<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-6-1024x181.png\" alt=\"\" class=\"wp-image-2681\" width=\"425\" height=\"75\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-6-1024x181.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-6-300x53.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-6-768x136.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-6-863x153.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-6-480x85.png 480w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-6.png 1118w\" sizes=\"auto, (max-width: 425px) 100vw, 425px\" \/><figcaption 
class=\"wp-element-caption\">Critical values for the Nemenyi test, which is conducted following the Friedman test, with two-tailed results.<\/figcaption><\/figure>\n\n\n\n<p>\nThus, for our particular case study, the critical differences are:\n<a name=\"id2403180684\"><\/a><p class=\"ql-center-displayed-equation\" style=\"line-height: 43px;\"><span class=\"ql-right-eqno\"> (5) <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-de8531cfeab3dc71f0f47c4d1bc62fa0_l3.png\" height=\"43\" width=\"253\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#98;&#101;&#103;&#105;&#110;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125; &#67;&#68;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#48;&#53;&#125;&#61;&#50;&#46;&#53;&#54;&#57;&#32;&#92;&#115;&#113;&#114;&#116;&#123;&#92;&#102;&#114;&#97;&#99;&#123;&#52;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#53;&#125;&#123;&#54;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#54;&#125;&#125;&#32;&#61;&#32;&#49;&#46;&#57;&#49;&#53;&#32;&#92;&#110;&#111;&#110;&#117;&#109;&#98;&#101;&#114; &#92;&#101;&#110;&#100;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\n<p class=\"ql-center-displayed-equation\" style=\"line-height: 43px;\"><span class=\"ql-right-eqno\"> (6) <\/span><span class=\"ql-left-eqno\"> &nbsp; <\/span><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-c2a82293ea84fd14e06554d4459fec78_l3.png\" height=\"43\" width=\"254\" class=\"ql-img-displayed-equation quicklatex-auto-format\" alt=\"&#92;&#98;&#101;&#103;&#105;&#110;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125; 
&#67;&#68;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#49;&#48;&#125;&#61;&#50;&#46;&#50;&#57;&#49;&#32;&#92;&#115;&#113;&#114;&#116;&#123;&#92;&#102;&#114;&#97;&#99;&#123;&#52;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#53;&#125;&#123;&#54;&#32;&#92;&#99;&#100;&#111;&#116;&#32;&#54;&#125;&#125;&#32;&#61;&#32;&#49;&#46;&#55;&#48;&#56;&#32;&#92;&#110;&#111;&#110;&#117;&#109;&#98;&#101;&#114; &#92;&#101;&#110;&#100;&#123;&#101;&#113;&#117;&#97;&#116;&#105;&#111;&#110;&#42;&#125;\" title=\"Rendered by QuickLaTeX.com\"\/><\/p>\nSince the difference in average rank between the randomized Gabor and the baseline Glorot normal is 1.83, which is greater than the <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-92f13c6e5c064aa8ba160493e6540e49_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#67;&#68;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#49;&#48;&#125;&#61;&#49;&#46;&#55;&#48;&#56;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"138\" style=\"vertical-align: -3px;\"\/>, we conclude that the randomized Gabor model is significantly better at <em>\u03b1<\/em> = 0.10. Similarly, since the difference in average rank between the repeated Gabor and the baseline Glorot uniform is 2.17, which is greater than the <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/lab.rivas.ai\/wp-content\/ql-cache\/quicklatex.com-2811e07f978e247c19bd4b4bbb8bb85f_l3.png\" class=\"ql-img-inline-formula quicklatex-auto-format\" alt=\"&#67;&#68;&#95;&#123;&#92;&#97;&#108;&#112;&#104;&#97;&#61;&#48;&#46;&#48;&#53;&#125;&#61;&#49;&#46;&#57;&#49;&#53;\" title=\"Rendered by QuickLaTeX.com\" height=\"16\" width=\"137\" style=\"vertical-align: -3px;\"\/>, we conclude that the repeated Gabor model is significantly better at <em>\u03b1<\/em> = 0.05. 
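The same comparison can be sketched in a few lines of plain Python. This is an illustrative sketch, assuming the average ranks obtained earlier and the values 2.569 and 2.291 used in equations (5) and (6); the function names <code>critical_difference<\/code> and <code>significant_pairs<\/code> are hypothetical helpers, not a library API.

```python
import math
from itertools import combinations

# Average ranks from the ranking step (rank sums 22, 20, 11, 7 over N = 6 datasets).
avg_ranks = {
    "Glorot N.": 22 / 6,
    "Glorot U.": 20 / 6,
    "Random G.": 11 / 6,
    "Repeated G.": 7 / 6,
}

# q_alpha values for k = 4 models, as used in equations (5) and (6).
Q_ALPHA = {0.05: 2.569, 0.10: 2.291}
N_DATASETS = 6


def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


def significant_pairs(ranks, cd):
    """Return (model_a, model_b, rank difference) for pairs that differ by more than cd."""
    found = []
    for (name_a, rank_a), (name_b, rank_b) in combinations(ranks.items(), 2):
        diff = abs(rank_a - rank_b)
        if diff > cd:
            found.append((name_a, name_b, diff))
    return found


k = len(avg_ranks)
cd_005 = critical_difference(Q_ALPHA[0.05], k, N_DATASETS)
cd_010 = critical_difference(Q_ALPHA[0.10], k, N_DATASETS)
pairs_005 = significant_pairs(avg_ranks, cd_005)
pairs_010 = significant_pairs(avg_ranks, cd_010)

for alpha, cd, pairs in ((0.05, cd_005, pairs_005), (0.10, cd_010, pairs_010)):
    print(f"alpha = {alpha}: CD = {cd:.3f}")
    for name_a, name_b, diff in pairs:
        print(f"  {name_a} vs {name_b}: rank difference {diff:.3f}")
```

A pair is reported only when its rank difference exceeds the critical difference, which is the decision rule of the Nemenyi test.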
Yes, there is sufficient statistical evidence to show that <strong>our model is better with high confidence<\/strong>.\n<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Things we would like to see in papers <\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">First of all, it would be nice to have a complete table that includes the results of the statistical tests as part of the caption or as a footnote, like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"248\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-8-1024x248.png\" alt=\"\" class=\"wp-image-2688\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8-1024x248.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8-300x73.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8-768x186.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8-1536x372.png 1536w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8-863x209.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8-445x108.png 445w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-8.png 1996w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Second of all, graphics always help! A simple and visually appealing diagram is a powerful way to represent post hoc test results when comparing multiple classifiers. The figure below, which illustrates the data analysis from the table above, displays the average ranks of methods along the top line of the diagram. 
To facilitate interpretation, the axis is oriented so that the best ranks appear on the right side, which enables us to perceive the methods on the right as superior.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/baylor.ai\/wp-content\/uploads\/2023\/02\/image-7-1024x441.png\" alt=\"\" class=\"wp-image-2685\" width=\"450\" height=\"193\" srcset=\"https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-7-1024x441.png 1024w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-7-300x129.png 300w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-7-768x330.png 768w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-7-863x371.png 863w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-7-251x108.png 251w, https:\/\/lab.rivas.ai\/wp-content\/uploads\/2023\/02\/image-7.png 1218w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><figcaption class=\"wp-element-caption\">Comparison of all models against each other with the Nemenyi test. Models not significantly different at <em>\u03b1<\/em> = 0.10 or <em>\u03b1<\/em> = 0.05 are connected.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">When comparing all the algorithms against each other, the groups of algorithms that are not significantly different are connected with a bold solid line. Such an approach clearly highlights the most effective models while also providing a robust analysis of the differences between models. Additionally, the critical difference is shown above the graph, further enhancing the visualization of the analysis results. 
Overall, this simple yet powerful diagrammatic approach provides a clear and concise representation of the performance of multiple classifiers, enabling more informed decision-making in selecting the best-performing model.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Main Sources<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The statistical tests are based on this paper:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Dem\u0161ar, Janez. &#8220;Statistical comparisons of classifiers over multiple data sets.&#8221;&nbsp;<em>Journal of Machine Learning Research<\/em>&nbsp;7 (2006): 1-30.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">The case study is based on the following research:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Rai, Mehang. &#8220;On the Performance of Convolutional Neural Networks Initialized with Gabor Filters.&#8221; Thesis, Baylor University, 2021.<\/p>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>Do you need help determining which machine learning model is superior? This post presents a step-by-step guide using basic statistical techniques and a real case study! \ud83e\udd16\ud83d\udcc8 #AIOrthoPraxy #MachineLearning #Statistics #DataScience When employing Machine Learning to address problems, our choice of a model plays a crucial role. 
Evaluating models can be straightforward when performance disparities &hellip; <a href=\"https:\/\/lab.rivas.ai\/?p=2665\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">How to Show that Your Model is Better: A Basic Guide to Statistical Hypothesis Testing<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[3,4,5],"class_list":["post-2665","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-ai-ethics-standards","tag-ai-lab","tag-ai-orthopraxy"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/2665","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2665"}],"version-history":[{"count":38,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/2665\/revisions"}],"predecessor-version":[{"id":4702,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=\/wp\/v2\/posts\/2665\/revisions\/4702"}],"wp:attachment":[{"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2665"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&
post=2665"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lab.rivas.ai\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2665"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}