The data you handle is critical when working with machine learning libraries and tools; without data, you cannot even run a demo. Real data is best, but in many cases the kind of real data you need is simply not available. Some companies have recently published real datasets that are convenient for analysis, but they are often restricted to research use, and the terms of use may not fit your situation.
If you do not have data, you can create it yourself, so being able to generate dummy data freely is convenient. How you create it needs some thought depending on the purpose, which can roughly be divided into the following two:

- Performance measurement
- Data analysis and demos
For performance measurement, if you only need something simple such as measuring how fast all the data can be read, matching the data volume is often enough to satisfy the requirements. However, for more involved measurements, such as compression ratios or joining and filtering data with SQL, you need to pay attention to the cardinality of the data. If every value is identical and only the volume matches, the compression ratio can turn out abnormally good compared with real data, making the measured values useless.
Data analysis requires dummy data with somewhat more detailed scenarios than performance measurement does. Also, when showing the data in a demo you will be presenting analysis results, so a list of plain numbers and meaningless strings does not look good.
This time I would like to introduce a procedure for creating dummy data for data analysis with a little more care than simply filling it with uniform random numbers.
First, decide what attributes the customer data should have. I would like to create dummy data for the following items.
| Attribute | Data characteristics | Example |
|---|---|---|
| Customer ID | Unique value | 12345 |
| Customer name | Random Japanese name | Taro Tanaka |
| Age | Uniform distribution | 30 years old |
| Height | Normal distribution | 176 cm |
| Annual income | Lognormal distribution | 4.56 million yen |
| Car ownership flag | 0: does not own, 1: owns (4:6) | 1 |
| Marital status | 0: single, 1: married, 2: divorced (3:6:1) | 2 |
The list above is kept short for the sake of explanation; with around 20 to 30 attributes, I think you have a perfectly good dummy dataset.
This time I will create the dummy data in Python. I chose Python because NumPy can generate values from various distributions, and because the package called Faker is quite convenient. (The biggest reason is that it is my personal favorite language.)
Since the customer ID is a unique value, simply generating numbers in sequence is enough.
lang:python3.4.3
# Sequential customer IDs
for i in range(1000):
    print(i)
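As a small variation, you could also collect the IDs into a list in one step with a list comprehension; the starting offset below is only an illustrative assumption.

lang:python3.4.3
# Sequential customer IDs collected into a list; the starting offset
# of 10000 is only an illustrative assumption
customer_ids = [10000 + i for i in range(1000)]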
With the Faker package you can easily create dummy data such as names. The variety of data is smaller for some locales, but it can generate data for many different countries.
lang:python3.4.3
from faker import Faker

# Use the Japanese locale
fake = Faker('ja_JP')
fake.name()
There is a wide variety of generators, and you can also produce just surnames or email addresses. The Japanese locale has relatively few patterns, but the English one has quite a lot.
lang:python3.4.3
from faker import Faker

fake = Faker('ja_JP')
fake.last_name()  # surname only
fake.email()      # email address
Since this is the age of a customer, I generate uniformly distributed random values from 15 to 85.
lang:python3.4.3
import numpy as np

# Uniformly distributed age; the upper bound of randint is exclusive,
# so use 86 to include 85
np.random.randint(15, 86)
I am not sure in what cases height would actually be included in customer data, but I could not think of another attribute that is commonly described as normally distributed.
lang:python3.4.3
import numpy as np

# Normal distribution with mean 170 cm and standard deviation 6 cm
np.random.normal(170, 6)
Since the mean and variance differ between men and women, it would be better to generate the distribution twice, once for each. The Ministry of Health, Labour and Welfare publishes the mean and variance by gender and age in its survey results, so if you generate several normal distributions according to those figures, the dummy data becomes very realistic.
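As a rough sketch of that idea, the following draws a gender for each customer and then samples heights from two normal distributions. The means and standard deviations here are placeholder assumptions, not the published survey figures.

lang:python3.4.3
import numpy as np

# Sketch of gender-dependent heights; the means and standard deviations
# are placeholder assumptions, not official statistics
genders = np.random.choice(['M', 'F'], size=1000, p=[0.5, 0.5])
heights = np.where(genders == 'M',
                   np.random.normal(171, 6, size=1000),
                   np.random.normal(158, 5, size=1000))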
Although the average annual income is 4 million yen, the median is 2 million yen, so there is a large gap between the mean and the median. This is because a certain number of outliers pull the mean up; in the financial industry this is called a fat tail. When creating this kind of data, generate random numbers from a lognormal distribution.
lang:python3.4.3
import numpy as np

# Lognormal distribution; the arguments are the mean and sigma of the
# underlying normal distribution
np.random.lognormal(0, 1)
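To bring the values closer to the scenario above (a mean around 4 million yen with a much lower median), the parameters can be scaled. The choices below are rough assumptions picked for illustration, not values fitted to real income statistics.

lang:python3.4.3
import numpy as np

# Rough illustration only: mu sets the median to about 2 million yen,
# and sigma makes the mean about double the median (a fat tail)
mu = np.log(2000000)
sigma = np.sqrt(2 * np.log(2))
incomes = np.random.lognormal(mu, sigma, size=1000)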
Some data is managed as a yes/no flag, such as whether a customer has done something or owns something. Since it is basically stored as 0 or 1, several generation methods are possible, for example generating an integer and taking the remainder after dividing by 2. This time we create data with a 60% ownership rate, where 1 means the customer owns a car and 0 means they do not.
lang:python3.4.3
import numpy as np

# 1 = owns a car, 0 = does not; ownership rate is 60%
CarFlagList = [0, 1]
Weight = [0.4, 0.6]
np.random.choice(CarFlagList, p=Weight)
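If you want a whole column at once, np.random.choice also accepts a size argument, and checking the resulting ratio is a quick sanity test. The column length of 1000 is an arbitrary assumption.

lang:python3.4.3
import numpy as np

# Generate 1000 flags at once and check that the ratio is close to 0.6
flags = np.random.choice([0, 1], size=1000, p=[0.4, 0.6])
print(flags.mean())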
Unlike flags, some attributes take more than two states. You could express them with a combination of flags, but increasing the number of attributes unnecessarily makes the data hard to manage, so it is more common to simply hold multiple values in one column. This time we create marital status by choosing randomly from three values, 0 single, 1 married, 2 divorced, at a ratio of 3:6:1.
lang:python3.4.3
import numpy as np

# Marital status chosen at a 3:6:1 ratio
MarriageList = ["0 single", "1 married", "2 divorced"]
Weight = [0.3, 0.6, 0.1]
np.random.choice(MarriageList, p=Weight)
The random values can be created in almost the same way as the car ownership flag. Here the values are strings rather than numbers, and it is convenient that the same approach handles that flexibly as well.
Now that I have written a set of random number generation patterns, dummy data for each scenario can be created by combining them in a for loop, as sketched below. In one project I created dummy data with 70 columns, and the hardest part was deciding the weights for each of those 70 columns.
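As a minimal sketch of putting the pieces together, the following loop writes the attributes from the table above into a CSV file. The file name, row count, and income parameters are arbitrary assumptions.

lang:python3.4.3
import csv
import numpy as np
from faker import Faker

fake = Faker('ja_JP')

# Minimal sketch: combine the generators above into one CSV file
# (file name and row count are arbitrary assumptions)
with open('dummy_customers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['customer_id', 'name', 'age', 'height',
                     'income', 'car_flag', 'marriage_status'])
    for i in range(1000):
        writer.writerow([
            i,                                        # unique customer ID
            fake.name(),                              # Japanese name
            np.random.randint(15, 86),                # uniform age 15-85
            round(np.random.normal(170, 6), 1),       # normally distributed height
            int(np.random.lognormal(np.log(2000000),  # lognormal income
                                    np.sqrt(2 * np.log(2)))),
            np.random.choice([0, 1], p=[0.4, 0.6]),   # car ownership flag
            np.random.choice(["0 single", "1 married", "2 divorced"],
                             p=[0.3, 0.6, 0.1]),      # marital status
        ])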