[PYTHON] Underfitting and overfitting


Underfitting: the model lags behind the data and fails to capture its underlying logic. Overfitting: the model fits the training data too closely and deviates from the essential pattern.

Library import

import numpy as np 
from sklearn.model_selection import train_test_split

/* train_test_split: as the name suggests, a function for splitting data into training and test sets */

Creating data to divide

a = np.arange(1,101)
a

Output of the created data

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

You can see that the values from 1 to 100 are stored as an array. /* The point here is that the data is an array, which makes the later step of splitting it in two easier. */

b = np.arange(501,601) 
b
array([501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513,
       514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526,
       527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539,
       540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
       553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565,
       566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578,
       579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591,
       592, 593, 594, 595, 596, 597, 598, 599, 600])

The data to split has now been created in both a and b.

Data split

train_test_split(a)  # you can also try (b) to check that it splits the same way
[array([87, 32, 90,  1,  2,  8, 51, 73, 22, 95,  4, 57, 27, 58, 48, 99, 96,  
        74, 72, 29, 76, 64,  3, 12, 53,  6, 18, 16, 65, 66, 63, 46, 39, 17, 
        91, 25, 15, 78, 83, 19, 45, 68, 33, 98, 97, 14, 44, 86, 80, 34, 70,
        47, 54, 93, 94, 85, 42, 60, 92, 41, 61, 71, 89, 23, 21, 11, 84, 13,
        82, 59, 49, 79, 36, 55,  5]),
 array([ 24,  56,  40,   9,  69,  75,  10,  28,  38,  30,  62,  67, 100,
         88,  37,  20,   7,  31,  77,  43,  35,  26,  81,  52,  50])]

Divide the two arrays (objects) into four

a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=365)
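A quick check I am adding here (not in the original post): because b was built as a + 500, the row pairing survives the split. train_test_split applies the same shuffled indices to every array it is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a = np.arange(1, 101)
b = np.arange(501, 601)  # b[i] == a[i] + 500 by construction

a_train, a_test, b_train, b_test = train_test_split(
    a, b, test_size=0.2, random_state=365
)

# the same permutation is applied to a and b, so pairs stay aligned
paired = bool(np.all(b_train == a_train + 500) and np.all(b_test == a_test + 500))
```

This is why features and labels can be passed together without the shuffle breaking their correspondence.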

Check the result

Confirmation of shape

a_train.shape, a_test.shape
((80,), (20,))

Confirmation of data contents

a_train
array([ 25,  32,  99,  73,  91,  66,   3,  59,  94,   1,   8,  15,  90,
        54,  31,  20,  77,  82,  30,  35,  95,  42,  38,   7,  11,  50,
        21,  48,   2,  17,  10,  58,  68,  43,  41,  16,  88,  72,  79,
       100,  80,  39,  24,  86,  22,  23,  62,  76,  18,  47,  55,  26,
        60,  19,  71,  64,  51,  63,  65,  28,  12,  78,  13,  44,  75,
        87,  40,   4,  29,  49,  37,  57,  27,  74,   6,  45,  92,  34,
        53,  83])

/* If you print it at this point, you can see that the data has been shuffled. train_test_split shuffles the data by default. */
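Shuffling can also be turned off with the shuffle parameter (a small sketch I am adding): with shuffle=False the split becomes a plain cut, with the test set taken from the end of the array.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# with shuffle=False: first 70% becomes train, last 30% becomes test
tr, te = train_test_split(np.arange(10), test_size=0.3, shuffle=False)
```

Turning shuffling off is useful for time-series-like data, where mixing past and future rows would leak information.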

Confirmation of data contents

a_test
array([ 9, 69, 81, 56, 33, 93, 84, 61, 46, 89, 85, 67, 97,  5, 70, 36, 98,
       96, 14, 52])

/* Benefit of train_test_split: it can split arrays or matrices into random training and test subsets. */
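For classification labels, train_test_split also accepts a stratify argument (an extra sketch, not covered in the post) that keeps the class ratio the same in both halves; the toy data below is my own illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0] * 8 + [1] * 2)    # imbalanced labels (8:2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
# each half keeps the 8:2 class ratio: 4 zeros and 1 one per half
```

Without stratify, a random split of such a small imbalanced set can easily put both minority samples on the same side.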
