WHY

The demand to get only arbitrary information from structured data (HTML, Json, XML) is also needed for anomaly detection, chlorler, and data analysis. There are also methods of manually tuning and acquiring by supervised machine learning, but the method introduced this time is a method of focusing on the structure of structured data, calculating the difference in the structure, and acquiring only arbitrary information. I'd like to introduce_______

The following points can be achieved by this method.

--Highly versatile information extraction is possible --Since teacher data is not required, the creation cost is low. --Can also be applied to preprocessing for teacher data creation

WHAT

This section describes the specific procedure.

--Prepare the source of information (HTML, JSON, etc.) that you want to get. --Compare the page with each source to calculate the distance. --Get the ones that are close together.

This is the procedure. For distance calculation, describe from Tree Edit Distance, which is intuitively easy to understand, and then describe PQ Edit Distance, which is an approximate calculation method.

"PQ Edit Distance" is good when dealing with a large amount of data or when you want to make a comparison focusing on the overall structural difference, and "Tree Edit Distance" is good when you want to compare individual elements and when you want to handle a small amount of data.

In the author's environment, there is a difference of about 100 times in calculation time, so when dealing with a large amount of data (it depends on the environment, it cannot be said unconditionally), "PQ Edit Distance" is selected.

Tree Edit Distance

Structured data works well with tree structures. At this time, the Tree Edit Distance comes to mind immediately. Tree Edit Distance is a technique mainly used to calculate the distance between tree structures.

There are three operations required to make a tree structure of structural data and measure the difference in the structure.

add to

Delete

Replace

First, assume a simple array S and think as follows.

gamma is the replacement cost H as the first element T for the rest of the elements

Operation from above Replacement: The replacement cost between the first elements and the remaining elements are recursively calculated and the total value of the distances is returned. Delete: Recursively calculate the cost of deleting the first element and the cost of the array to be compared with the remaining elements, and return the total value of the distance. Insert: The additional cost between the first elements and the cost calculation other than the additional cost of the array to be compared with the original array are recursively returned.

It is done recursively because d is inside.

d(S_1, S_2) = min(\gamma(H(S_1),H(S_2)) + d(T(S_1),T(S_2)), \\
                  \gamma(H(S_1),\vec{0}) + d(T(S_1),S_2), \\
                  \gamma(\vec{0},H(S_2)) + d(S_1,T(S_2)), 
)

Next, let's think about trees. In the case of a tree, you should think about the root and child nodes, but if there are multiple child nodes, multiple trees will be created when the root is removed, and the above description cannot be made. Think of a tree as a collection of ordered trees. If you delete the root node, you can decompose it into the forest of the child node of the rightmost tree and the forest of other nodes.

Replacement: The total value of the replacement distance cost calculation of the rightmost tree, the forest distance calculation of the child node of the rightmost tree, and the distance calculation of the other forests. Delete: The total value of the deletion distance cost calculation of the rightmost tree and the distance between the forest of the child node of the rightmost tree after deletion and the forest to be compared with other forests Addition: The total value of the additional distance cost calculation of the rightmost tree and the distance between the original forest and the forest of the child node of the rightmost tree to be compared and the other forests to be compared.

Figure

d(F_1, F_2) = min(\gamma(R(F_1),R(F_2)) + d(C(F_1),C(F_2))+ d(L(F_1),L(F_2)), \\
                  \gamma(R(F_1),\vec{0}) + d(L(F_1) + C(F_1),F_2), \\
                  \gamma(\vec{0},R(F_2)) + d(F_1, L(F_2) + C(F_2)),
)

When comparing and calculating all nodes, the calculation cost of ʻO (n ^ 2) `is incurred. Even if the distance cost is memorized and calculated by dynamic programming, the distance calculation cost will be high. Therefore, it is conceivable to use an approximate method without performing a strict distance calculation.

PQ Edit Distance

PQ Edit Distance is a method in which the calculation cost converges at ʻO (nlogn) and the calculation space is also suppressed to ʻO (n). n is the number of nodes in the tree.

PQ Edit Distance creates what is called a profile from a tree structure without considering the replace, insert, and delete operations. This profile covers the patterns that the tree structure can take and is a method to measure the degree of matching of the patterns. Profile creation depends on P and Q parameters.

There are two essential parts.

--Creating a PQ profile --Distance calculation

For creating PQ profiles, pq-gram treats trees and subtrees with a special structure.

Normal tree

Screen Shot 2016-11-29 at 3.12.28 PM.png

Trees handled by PQ (P: 2, Q: 3)

Since the values of P and Q depend on the structure of the tree, the values of P and Q are currently manually tuned to select the best one. I would like to know if this can be set automatically.

Screen Shot 2016-11-29 at 3.11.07 PM.png

Create a profile by performing recursive processing on roots and leaves as shown below.

Definition 6: Define the distance of pq grams as follows

\Delta^{p,q}(\vec{T_1}, \vec{T_2}) = 1 - 2\frac{P^{p,q}(\vec{T_1}) \cap P^{p,q}(\vec{T_2})}{P^{p,q}(\vec{T_1}) \cup P^{p,q}(\vec{T_2})}

The reason for doubling is that the maximum match rate is half that of the union of two trees.

Difference from Tree Edit Distance

In the case of PQ, the difference is grasped by the distance between profiles. Although it is an approximate method, it has some advantages over Tree Edit Distance.

Screen Shot 2016-11-28 at 3.41.56 PM.png

In the figure, in the case of Tree Edit Distance, the edit distance is 2 for both T'and T''. This is because T'has not enough "g" and "k" elements, and T "has not enough" c "and" e "elements, so the editing distance is 2 compared to T. However, T'and T'' are very different in structure. It is not very good that these distances are treated as the same. In the case of PQ, the difference is clear. This is the difference between Tree Edit Distance, which measures only individual differences, and PQ Edit Distance, which measures the overall difference.

HOW

We will introduce the algorithm using the above definition and introduce an implementation example.

algorithm

Consider applying it to the following tree structures.

First, prepare two shift registers.

anc: p (for parent node) sib: q (for child nodes)

The register process is to take out the old one and insert the new one.

Example

shift((a,b,c),x) return (b,c,x)

Create a pq-gram by combining anc and sib.

Based on these, the profile of PQ gram is returned.

anc step down sib shifts from left to right

The flow of the specific algorithm is shown in the figure.

Screen Shot 2016-11-28 at 12.49.25 PM.png

The target trees for this processing are as follows.

Screen Shot 2016-11-28 at 1.05.23 PM.png

What you want to do in this process is to create a profile. Profile processing can be described recursively, so it will be as follows.

pg-GRAM-PROFILE(T, p, q)
    P: empty relation with schema(labels)
    anc: shift register of size p (filled with *)
    P = PROFILE(T, p, q, P, root(T), anc)
    return P

Initialize the profile first. Fill the anc with "*". Pass the tree you want to create a profile for this time, the set p and q values, the profile, the root node of the tree, and anc to the PROFILE function.

PROFILE(T, p, q, P, r, anc)
    anc:= shift(anc, l(r))
    sib: shift register of size q (filled with *)
    if r is a leaf then
        P := P U (anc ○ sib)
    else
        for each child c (from left to right) of r do
            sib := shift(sib, l(c))
            P := P U (anc ○ sib)
            P := PROFILE(T, p, q, P, c, anc)
        for k := 1 to q-1
            sib := shift(sib, *)
            P := P U (anc ○ sib)
    return P

Insert the label r in anc. (R is the root node at first, and the node changes due to recursive processing) Initialize sib. If r is a leaf, create a union of profiles concatenated with anc and sib If not, processing is performed for each child node of the node. In the case of the first processing, processing is performed for each child node (a, b, c) with a as the root node.

Screen Shot 2016-11-28 at 1.34.19 PM.png

The first anc is fixed at *, a Insert the second into sib t *, *, a. The part we are paying attention to here is the following part. Concatenate anc and sib to create *, a, *, *, a and add it to PROFILE to create a union.

Since it is a recursive process, if there is a child node, we will focus on the following part and process it.

Screen Shot 2016-11-28 at 2.41.11 PM.png

When processing for ʻe with anc as ʻa, a

anc：a, a sib：*, *, e

Screen Shot 2016-11-28 at 2.42.24 PM.png

When anc is processed as ʻa, e`

anc：a, e sib：*, *, *

Since it finally reaches the leaves, the union is added so far. Repeat this recursively to create a PQ profile.

Measure the distance of the tree structure between this profile and the distance defined earlier.

The values of p and q are hyperparameters and depend on the tree structure. It is necessary to set the most appropriate parameters according to the structure of the tree.

Implementation method

--Creating a Node for tree structure data --Shift Register for editing --PROFILE for edit distance calculation

This can be achieved with a simple implementation method.

Creating a Node for tree structure data

Structured data needs to be converted to a tree structure for handling. I created it simply with the code below

Initialization process to set a label on Node Processing to add a child node to Node


class NodePQ(object):

    def __init__(self, label):
        self.label = label
        self.children = list()

    def addkid(self, node, before=False):
        if before:
            self.children.insert(0, node)
        else:
            self.children.append(node)
        return node

Shift Register for editing

Enter the description of Shift Register.

--Prepare a list initialized with the size of the argument * --The list is combined with a conch. --The oldest element is removed by shift processing and the element given by the argument is added.

class ShiftRegister(object):

    def __init__(self, size):
        self.register = list()
        for i in range(size):
            self.register.append("*")

    def concatenate(self, reg):
        temp = list(self.register)
        temp.extend(reg.register)
        return temp

    def shift(self, el):
        self.register.pop(0)
        self.register.append(el)

PROFILE

This is the part that creates anc and sib and recursively creates PROFILE.

--Set ShiftRegister for ancestors in the initialization part. --Create a PROFILE using the set ancestors. --Sort the created PROFILE


class ProfilePQ(object):

    def __init__(self, root, p=2, q=3):
        super(ProfilePQ, self).__init__()
        ancestors = ShiftRegister(p)
        self.list = list()

        self.profile(root, p, q, ancestors)
        self.sort()

This part is an important process for PQ distance.

--Set the root label given as an argument to ancestors. --Set shift registers for q size in siblings. --When the number of children becomes 0, concatenate ancestors and siblings to create a profile. --Add an element to siblings for root children and click. This process is recursive until there are no more children. --When you go to the youngest child, fill in the siblings under it with "*"

    def profile(self, root, p, q, ancestors):
        ancestors.shift(root.label)
        siblings = ShiftRegister(q)

        if(len(root.children) == 0):
            self.append(ancestors.concatenate(siblings))
        else:
            for child in root.children:
                siblings.shift(child.label)
                self.append(ancestors.concatenate(siblings))
                self.profile(child, p, q, copy.deepcopy(ancestors))
            for i in range(q-1):
                siblings.shift("*")
                self.append(ancestors.concatenate(siblings))

It will be the distance calculation part. This ʻedit_distance calculates the distance between PROFILE and the other given as an argument. The match rate of the Node element created by ʻintersection is calculated.


    def edit_distance(self, other):
        with Bench():
            union = len(self) + len(other)
            return 1.0 - 2.0*(self.intersection(other)/union)

    def intersection(self, other):
        intersect = 0.0
        i = j = 0
        while i < len(self) and j < len(other):
            intersect += self.gram_edit_distance(self[i], other[j])
            if self[i] == other[j]:
                i += 1
                j += 1
            elif self[i] < other[j]:
                i += 1
            else:
                j += 1
        return intersect

    def gram_edit_distance(self, gram1, gram2):
        distance = 0.0
        if gram1 == gram2:
            distance = 1.0
        return distance

This is the end of the implementation, but you need to test it to make sure it works properly.

About Test

The following process checks if the Profiles are equal.

--Check length and return False if different, or False if the first element is different --Get elements from profile and return False if each element does not match at least one

    
def checkProfileEquality(self, profile1, profile2):
        if len(profile1) != len(profile2) or len(profile1[0]) != len(profile2[0]):
            return False
        for gram1 in profile1:
            contains = False
            for gram2 in profile2:
                if gram1 == gram2:
                    contains = True
                    break
            if contains == False:
                return False
        return True

Initial settings for p and q. Random value setting

    def setUp(self):
        self.p = 2
        self.q = 3
        num_random = 10
        self.trees = list()
        self.profiles = list()

Create a tree for testing


        # Construct one-node trees
        self.small_tree1 = NodePQ("a")
        self.small_tree2 = NodePQ("b")
        self.trees.append(self.small_tree1)
        self.trees.append(self.small_tree2)

        self.small_profile1 = [['*','a','*','*','*']]
        self.small_profile2 = [['*','b','*','*','*']]

Create a random tree and create a profile from that tree.

        # Construct a few randomly generated trees
        for i in range(0, num_random):
            depth = random.randint(1, 10)
            width = random.randint(1, 5)
            self.trees.append(randtree(depth=depth, width=width, repeat=4))

        # Generate Profiles
        for tree1 in self.trees:
            self.profiles.append(ProfilePQ(tree1, self.p, self.q))

Create a random tree by the following process.

def randtree(depth=2, alpha='abcdefghijklmnopqrstuvwxyz', width=2, repeat=2):
    labels = [''.join(x) for x in itertools.product(alpha, repeat=repeat)]
    random.shuffle(labels)
    labels = (x for x in labels)
    root = NodePQ("root")
    p = [root]
    c = list()
    for x in range(depth-1):
        for y in p:
            for z in range(random.randint(1,1+width)):
                n = NodePQ(next(labels))
                y.addkid(n)
                c.append(n)
        p = c
        c = list()
    return root

I am calculating if the distance calculation is correct

    def testProfileCreation(self):
        small_tree1_equality = self.checkProfileEquality(self.profiles[0], self.small_profile1)
        small_tree2_equality = self.checkProfileEquality(self.profiles[1], self.small_profile2)

        self.assertEqual(small_tree1_equality, True)
        self.assertEqual(small_tree2_equality, True)

We check whether the calculation is correct even if the distance between A and B and the distance between B and A are reversed.


def testSymmetry(self):
        """x.edit_distance(y) should be the same as y.edit_distance(x)"""
        for profile1 in self.profiles:
            for profile2 in self.profiles:
                self.assertEqual(profile1.edit_distance(profile2), profile2.edit_distance(profile1))

Check if the distance calculation is between 0 and 1

    def testEditDistanceBoundaries(self):
        for profile1 in self.profiles:
            for profile2 in self.profiles:
                self.assertTrue(profile1.edit_distance(profile2) <= 1.0 and profile1.edit_distance(profile2) >= 0

Summary

The advantage of this method is that it can be used universally for structured data. Since it is a method that can be used if there is a data set that can be converted to a tree structure such as HTML, json, XML, etc., it has a wide range of applications such as extraction of necessary data and abnormality detection, so I would appreciate it if you could try it.

reference

Tree Edit Distance and application to natural language processing Approximate Matching of Hierarc hical Data Using pq -Grams

[PYTHON] How to get only the data you need from a structured data set using a versatile method