I wrote an article "Significance of Interfaces Learned from Java Collection". Since this article focused on the Collection interface, this time we will focus on the implementation classes that are often used, and talk about the internal implementation, features, and usage of each implementation class.
This is a Collection class that you will probably use often.
HashMap is an implementation class of Map, not Collection, but since it is often used and has a high relationship with HashSet, I will explain it together with HashSet.
Class diagram
The role of the interface (Click here for details (http://qiita.com/frost_star/items/14a12d64ccbe85a8ac3f))
interface | role |
---|---|
List | An ordered group of elements. Basically allow duplication. |
Set | Group that does not allow duplicate elements(set).. The order depends on the implementation class. |
ArrayList, as the name implies, is an implementation of List by Array. It has an array internally, and stores, references, and inserts data in the array. Therefore, in order to know the characteristics of ArrayList, it is necessary to know the characteristics of the array.
An array reserves a contiguous area in memory. Its best feature is that it can be referenced by subscripts at high speed. Since the area is continuous, you can find the address you want to refer to by the following formula as long as you know the start address, subscript, and data size per one.
Address to refer to=Start address+Subscript x data size per one
Since the array needs to have such a continuous area, it cannot be changed from the originally determined number of elements. However, ArrayList allows you to add elements dynamically. ArrayList automatically reallocates an array when you add elements and run out of arrays. It is easy to say that it is re-secured, but in reality it is a very heavy process because it processes a new array with 1.5 times the number of elements as the original size and copies the data from the original array. Become. It is said that it is better to decide the initial capacity (constructor argument) in ArrayList because it reduces the frequency of executing this reallocation process by deciding the size of the array to be allocated first.
Also, arrays are very vulnerable to insertion. This is because the area where the data is stored is fixed, so the process of shifting the location cannot be performed. In ArrayList, the process of inserting data to an arbitrary location is implemented by the ʻadd` method, but this internal also reallocates the array, and the data after the insertion position is inserted by shifting the index and copying. We are processing to make room.
Have you ever heard of a data structure called a linear list? LinkedList is an implementation of List based on the structure of a linear list.
A linear list is a data structure that treats data and links (references to the next element) as one object (node), and can handle data columns by concatenating the nodes.
The advantage of this data structure is that you can access the elements by following the links within each node, as long as you know the root (reference to the first node). Therefore, it is not necessary for each node to exist in a contiguous area like an array. Also, when inserting data, you only need to change the references of the nodes before and after, so you do not need a large-scale copy process like ArrayList.
The downside of linear lists is slow random access. For example, in order to access the 2nd element, follow the link from the root semi-repeatingly like [Root]-> [0th element]-> [1st element]-> [2nd element]. I need to go. In LinkedList, in order to speed up random access as much as possible, we have devised such things as making the link bidirectional and holding a reference to the last element. However, the more elements there are, the slower the random access is inevitable.
Also, because each node has a reference as a field in addition to data, it uses more memory than an ArrayList with the same number of elements.
HashSet is an implementation class of Set, unlike the previous two Lists. That is, it does not allow duplicate elements and cannot be randomly accessed. Also, HashSet does not preserve order.
Not allowing duplicates means that when you add an element, you need to determine if the element already exists in the Set. HashSet makes good use of arrays, linear lists, and hash values to achieve fast existence verification.
The hash value is a value calculated from the original data by calculation based on a specific formula. The same hash value can be calculated from the same data, but it is designed so that if the data is slightly different, the values will be significantly different. Also, even if it is irreversible and the hash value can be calculated from the data, the data cannot be restored from the hash value. The hash value itself is widely used in the information processing world such as authentication, validity check, and encryption.
The hash value in Java is a value for identifying an instance, and is an int type integer calculated by the hashCode
method.
The hashCode
method is defined as an Object type.
Based on the characteristic that "the same hash value can be calculated from the same data of hash value", the same hash value must be returned between instances where the ʻequals` method returns true, and conversely, if the data is different, it is the same as much as possible. It should not be a value.
HashSet realizes high-speed existence confirmation by making good use of this hash value.
HashSet reserves an array of size s
when instantiated.
When storing an instance ʻe, HashSet first finds the hash value with ʻe.hashCode ()
and then calculates where it should be stored.
Find the remainder (division remainder) of ʻe.hashCode ()and
s and store ʻe
in that location.
array[ e.hashCode() % s ] = e;
Since the storage location is calculated from ʻe.hashCode ()% s`, it is not necessary to search the array one by one when checking the existence, and the hash value of the given instance is calculated and the instance is there. You can confirm the existence by checking if there is any.
That is just an ideal theory.
Actually, the size s
of the array is a small number compared to the hash value, so the phenomenon that the data already exists in the place where you tried to store it occurs.
This is called a collision.
In case of a collision, HashSet stores data in a data structure that has a link to the next element like a linear list when storing data.
Then, when a collision occurs, the data is connected as the next element of the element that already exists.
This makes it possible to check the existence by searching only the groups that have the same value of ʻe.hashCode ()% s`, even if it is not a single reference.
The more data you have, the more likely you are to collide.
For example, if s = 10
and 11 data are stored, a collision will definitely occur (pigeonhole principle).
Therefore, when the number of data increases, the array is re-allocated with a large capacity and the data is reinserted.
The data reinsertion here is not a simple copy, but the data structure is not disrupted because the data reinsertion is performed so that the correspondence between the array subscript and ʻe.hashCode ()% s` is not disrupted.
HashSet determines the storage location based on the value of hashCode
.
Therefore, the performance of the hashCode
expression is directly linked to the collision probability of the HashSet.
In an extreme case, if the contents of hashCode
are processed to always return a constant likereturn 0;
, a collision will occur every time data is stored, so the search performance will be lower than the Linked List.
Therefore, it is important to override the appropriate hashCode method for the class of the element you are storing.
However, there is a high possibility of collision with Oreore hashCode.
The best is to use the ʻObjects.hashCode` method.
Objects.hashCode(Field 1,Field 2,Field 3);
Since the argument is a variable argument, you can pass data of multiple Object types. However, since the final hash value is calculated using the hash value obtained by hashCode of each field, it is necessary to override the hashCode method in each field class as well.
So far, I've talked about the internal implementation of HashSet, but it's actually a lie. As I wrote in Another article, the internal implementation of HashSet is actually realized by HashMap. So, the internal implementation we've talked about so far was actually the internal implementation of HashMap. However, since the implementation of internal processing only depends on HashMap, the behavior is the same for both.
As for HashMap, HashMap is an implementation class of Map that holds values in two data pairs, Key and Value. Key corresponds to the data part of HashSet explained earlier. Since the data is stored by the hash value of the Key instance, it is possible to search the data from the Key at high speed. Value is simply the value associated with Key and is stored with Key. In HashSet, by setting a static value as Value, it is implemented using HashMap without consuming extra memory.
In summary, let's compare the performance of each implementation class in order notation. If you don't understand the order notation, * O * (n) is slower than * O * (1).
Implementation class | add to | Insert/Delete | Search | Random access | memory usage |
---|---|---|---|---|---|
ArrayList | O(1)※ | O(n) | O(n) | O(1) | Few |
LinkedList | O(1) | O(1) | O(n) | O(n) | During ~ |
HashSet | O(1)※ | O(1) | O(1) | impossible | Many |
*: Resize may occur
By comparison, you can see the characteristics of each implementation class. For example, if there are many inserts, LinkedList, if there are many random accesses, ArrayList, etc. Which class is suitable depends on the processing content, so select an appropriate implementation class.
Recommended Posts