[PYTHON] Mann-Whitney U-test anti-pattern

The Mann-Whitney U test is a convenient nonparametric way to test whether two groups of data come from the same population... or is it? Below I list some examples of cases where it goes wrong, written down off the top of my head (this is just a personal memo, so please forgive typos and the like).

What is the Mann-Whitney U test?

Overview

See [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%B3%E3%83%BB%E3%83%9B%E3%82%A4%E3%83%83%E3%83%88%E3%83%8B%E3%83%BC%E3%81%AEU%E6%A4%9C%E5%AE%9A). If you want to know more, please refer to the original paper or a [book](https://www.amazon.co.jp/Introduction-Nonparametric-Statistics-Chapman-Statistical/dp/03767194848).

The important point is this: for each element of one group, count how many elements of the other group are smaller than it, and use the sum of these counts (below) as the test statistic.

T_U = \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} I(X_i < Y_j)

Consequently, if the two groups have the same number of observations ($N_x = N_y = N$) and are drawn from the same population, the statistic will be approximately $T_U \approx N^2/2$, its expected value under the null. We will exploit this to construct anti-patterns.
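As a quick sanity check of this definition (a minimal sketch of my own, not part of the original memo), the double sum can be computed by brute force and compared against SciPy. Note that recent versions of `scipy.stats.mannwhitneyu` report the U statistic of the first sample, which counts the pairs with $X_i > Y_j$, so it equals $N_x N_y - T_U$ when there are no ties.

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=200)
y = rng.normal(loc=1, scale=10, size=200)

# Brute-force double sum: number of pairs (i, j) with X_i < Y_j.
T_U = np.sum(x[:, None] < y[None, :])

# SciPy's statistic for the first sample counts the X_i > Y_j pairs,
# so with no ties: statistic == N_x * N_y - T_U.
U1 = mannwhitneyu(x, y).statistic
print(T_U, x.size * y.size - U1)  # these two numbers should agree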

Examples that work

First, an example where the test works: two normal distributions with different means and standard deviations. A total of 1000 p-values were computed; with the parameterization below, even the 75th percentile of the p-values is 0.0125, which is sufficiently small.

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu, norm, uniform
from IPython.display import display

N = 1000
iter = 1000
np.random.seed(seed=1145141919)

display(pd.Series([mannwhitneyu(norm.rvs(loc=0,scale=1,size=N), norm.rvs(loc=1,scale=10,size=N)).pvalue for i in range(iter)]).describe())
count   1000.0000
mean       0.0212
std        0.0563
min        0.0000
25%        0.0001
50%        0.0011
75%        0.0125
max        0.4742
dtype: float64

Anti-patterns

We construct anti-patterns keeping in mind that $T_U \approx N^2/2$ is indistinguishable from the identical-population case.

Same distribution

For comparison with other cases.

Uniform distribution

N = 1000
iter = 1000
np.random.seed(seed=1145141919)

display(pd.Series([mannwhitneyu(uniform.rvs(loc=0,scale=1,size=N), uniform.rvs(loc=0,scale=1,size=N)).pvalue for i in range(iter)]).describe())
count   1000.0000
mean       0.2583
std        0.1438
min        0.0001
25%        0.1361
50%        0.2613
75%        0.3835
max        0.5000
dtype: float64

Normal distributions

N = 1000
iter = 1000
np.random.seed(seed=1145141919)

display(pd.Series([mannwhitneyu(norm.rvs(loc=0,scale=1,size=N), norm.rvs(loc=0,scale=1,size=N)).pvalue for i in range(iter)]).describe())
count   1000.0000
mean       0.2524
std        0.1426
min        0.0001
25%        0.1259
50%        0.2572
75%        0.3713
max        0.5000
dtype: float64

Normal distributions with the same mean but very different standard deviations

The standard deviation of one group is 100 times that of the other. Since the distribution with the small standard deviation is concentrated near the center of the distribution with the large standard deviation, most of its samples occupy the middle ranks of the merged data, so roughly $N/2$ of the wide-distribution values fall below each of them. In the extreme case this gives $T_U \approx \frac{N}{2} \sum_{j=1}^{N} 1 = N^2/2$, exactly the value expected for identical populations.

N = 1000
iter = 1000
np.random.seed(seed=1145141919)

display(pd.Series([mannwhitneyu(norm.rvs(loc=0,scale=1,size=N), norm.rvs(loc=0,scale=100,size=N)).pvalue for i in range(iter)]).describe())
count   1000.0000
mean       0.2211
std        0.1498
min        0.0000
25%        0.0844
50%        0.2090
75%        0.3511
max        0.4996
dtype: float64
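To see this mechanism directly, one can look at the U statistic itself instead of the p-value (a small check of my own, not in the original memo); with $N = 1000$ it should land near $N^2/2 = 500000$, the value expected for identical populations.

N = 1000
np.random.seed(0)
x = norm.rvs(loc=0, scale=1, size=N)
y = norm.rvs(loc=0, scale=100, size=N)
# Recent SciPy reports the U statistic of the first sample; by symmetry its
# expectation here is N**2 / 2 = 500000, the same as for identical populations.
print(mannwhitneyu(x, y).statistic)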

A uniform distribution lying between the components of a mixture of uniform distributions

Consider a mixture of two uniform distributions whose supports do not overlap, together with a third uniform distribution whose support lies entirely between them. We test this mixture against the middle uniform distribution. The same argument applies to, for example, a one-dimensional Gaussian mixture versus a distribution localized near the valley between its modes.
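Why this fails can be spelled out in one line (my own addition, using the notation introduced above): a sample $X_i$ from the middle uniform distribution is smaller than a sample $Y_j$ from the mixture exactly when $Y_j$ comes from the upper component, which happens with probability $1/2$, so

E[T_U] = \sum_{i=1}^{N} \sum_{j=1}^{N} P(X_i < Y_j) = N^2 \cdot \frac{1}{2} = \frac{N^2}{2},

which is again the value expected for identical populations.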

Below, uniform data on the interval $[0, M) \cup [3M, 4M)$ and uniform data on the interval $[M, 3M)$ are generated and tested against each other. Sampling from the mixture distribution was performed as follows. The reference page is here.

def get_mixture_uniform(sample_size, M):
    # Equal-weight mixture of U[0, M) and U[3M, 4M).
    distributions = [
        {"type": np.random.uniform, "kwargs": {"low": 0, "high": M}},
        {"type": np.random.uniform, "kwargs": {"low": 3 * M, "high": 4 * M}},
    ]
    coefficients = np.array([0.5, 0.5])

    # Draw sample_size values from each component...
    num_distr = len(distributions)
    data = np.zeros((sample_size, num_distr))
    for idx, distr in enumerate(distributions):
        data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])
    # ...then, for each row, keep the value from a randomly chosen component.
    random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)
    samples = data[np.arange(sample_size), random_idx]
    return samples
N = 1000
M = 10

display(pd.Series([mannwhitneyu(np.random.uniform(low=M, high=3*M, size=N), get_mixture_uniform(N, M)).pvalue for i in range(iter)]).describe())
count   1000.0000
mean       0.2287
std        0.1555
min        0.0000
25%        0.0909
50%        0.2194
75%        0.3784
max        0.5000
dtype: float64
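As in the unequal-variance case, the U statistic itself makes the failure visible (again my own check, reusing the get_mixture_uniform helper defined above); it should come out near $N^2/2 = 500000$.

N = 1000
M = 10
np.random.seed(0)
x = np.random.uniform(low=M, high=3 * M, size=N)  # middle uniform
y = get_mixture_uniform(N, M)                     # non-overlapping mixture
# Expected to be close to N**2 / 2 = 500000, the identical-population value.
print(mannwhitneyu(x, y).statistic)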
