[Python3 / ElementTree] If XPath lookups are not finding your elements, check the XML hierarchy carefully (a note to self)

~~I followed a tutorial I found online and it didn't work, so I am posting this as a reminder.~~ [Postscript 2019/11/22] Embarrassingly, this turned out to be my own misunderstanding, and the problem described at the start never actually occurred. I am leaving the article up because the behavior of ".//" may still be of some use, and as a lesson to myself.

What happened

I had to handle XML (encoded in EUC-JP) in Python 3, and was working with a file like the one below.

test.xml (the file I was actually handling had more tags than this, plus many attribute values and so on)


<root><tag>
    <hoge>
        <hogehoge>aaa</hogehoge>
    </hoge>
    <fuga>bbb</fuga>
    <fugo>ccc</fugo>
・
(Omitted)
・
</tag>
<tag2>
・
(Omitted)
・
</tag2>
</root>

The access methods to the elements are as follows.

test.py


import xml.etree.ElementTree as Et


def test():
    # ElementTree.parse() seems to support only Unicode input,
    # so open the file with its encoding and pass the string from read() to fromstring()
    with open(r'HogePath/FugaPath/test.xml', 'r', encoding='euc_jp') as f:
        root: Et.Element = Et.fromstring(f.read())

    # Get the text "aaa" of hogehoge under hoge
    print(root[0][0][0].text)


if __name__ == '__main__':
    test()

As the XML grew more complicated, I wanted to fetch elements with XPath instead, so I looked up how to do it online.

I finally realized that I was wrong in the first place (a lesson to myself)


    print(root.findall('./hoge/hogehoge')[0].text)

I wrote the line above, but it failed with "IndexError: list index out of range". In fact, the result of the following call is itself an empty list... ~~In other words, the element does not seem to be retrieved correctly.~~ **← This is wrong!**

    print(root.findall('./hoge/hogehoge'))

**The example here is simple enough that the cause is obvious, but at the time I overlooked the "tag" element that sits between root and hoge. What follows is the half-forced workaround I used to get past that oversight.**

(Wrong) solution (it does work, though)

I assumed there was a problem with how the path was specified near the root of the XPath expression.
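To make the oversight concrete, here is a minimal sketch using a cut-down copy of test.xml with the omitted parts left out. The variable names `XML_TEXT`, `missing`, and `found` are only for illustration. It shows that "./hoge/hogehoge" looks for hoge directly under root, while the element actually sits one level deeper, under tag.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of test.xml; the parts marked (Omitted) are simply left out.
XML_TEXT = """<root><tag>
    <hoge>
        <hogehoge>aaa</hogehoge>
    </hoge>
    <fuga>bbb</fuga>
    <fugo>ccc</fugo>
</tag>
<tag2>
</tag2>
</root>"""

root = ET.fromstring(XML_TEXT)

# The direct children of root are tag and tag2, not hoge.
print([child.tag for child in root])

# "./hoge" is resolved relative to root, so nothing matches.
missing = root.findall('./hoge/hogehoge')
print(missing)

# Writing out the tag level reaches the element.
found = root.findall('./tag/hoge/hogehoge')
print(found[0].text)
```

With the full path written out, indexing `[0]` no longer raises IndexError, because the list is not empty.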

    print(root.findall('.//hoge/hogehoge')[0].text)

~~Changing the leading part to ".//" as shown above solved it.~~ **← It seems this only solved it by brute force** ~~I wonder if I should just remember it as being like a URL... I haven't worked out the principle behind it, so I'd be glad if someone could explain.~~

[Added 2019/11/22] As @LOZTPX explained in the comments:

// represents the set of all descendants of the starting node, omitting the node path.

I see! So even though I had overlooked the level above, ".//" was picking up the element without my being aware of "tag"... I learned something from this.
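One consequence of that explanation is that ".//" can match more than intended. The sketch below uses a hypothetical document: the hoge under tag2 and its text "zzz" are invented for illustration, since the real contents of tag2 are omitted in this article. It compares the descendant search with the fully written path.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the hoge/hogehoge under tag2 is an assumption
# made for this sketch, not part of the original test.xml.
XML_TEXT = """<root>
<tag><hoge><hogehoge>aaa</hogehoge></hoge></tag>
<tag2><hoge><hogehoge>zzz</hogehoge></hoge></tag2>
</root>"""

root = ET.fromstring(XML_TEXT)

# ".//" searches every descendant of root, so both hoge elements match.
anywhere = [e.text for e in root.findall('.//hoge/hogehoge')]
print(anywhere)

# The explicit path only matches the one under tag.
only_tag = [e.text for e in root.findall('./tag/hoge/hogehoge')]
print(only_tag)
```

So ".//" is convenient when the exact depth does not matter, but the explicit path is safer when the same tag name can appear in several places.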

Reflection

Take a proper look at the contents of the XML file you are handling.
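One way to do that check, as a rough sketch: list the path of every element before writing any XPath. The helper `tag_paths` below is a hypothetical name introduced here, not part of ElementTree, and the inline XML is a shortened stand-in for test.xml.

```python
import xml.etree.ElementTree as ET


def tag_paths(element, prefix='.'):
    """Yield a path string, relative to element, for every descendant."""
    for child in element:
        path = f'{prefix}/{child.tag}'
        yield path
        # Recurse so that deeper levels are listed as well.
        yield from tag_paths(child, path)


# Shortened stand-in for test.xml.
root = ET.fromstring(
    '<root><tag><hoge><hogehoge>aaa</hogehoge></hoge>'
    '<fuga>bbb</fuga></tag><tag2/></root>'
)

paths = list(tag_paths(root))
for path in paths:
    print(path)
```

Each printed path can be passed to `root.findall()` as it is, which makes a missing level such as tag easy to spot.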
