Speed comparison of Wiktionary full-text processing in F# and Python

I previously wrote the code that reads the full text from the Wiktionary dump in Python; here I port it to F# and compare the processing speed.

This is a series of articles.

  1. Search for efficient Wiktionary processing
  2. Compare Wiktionary processing speed with F# and Python ← This article

The scripts for this article are posted in the following repositories.

Overview

When I first started processing the Wiktionary dump, I intended to use F#, but the .NET Framework cannot handle bzip2 out of the box, so I implemented it in Python instead. After parallelizing, processing finished in a little over a minute, so Python felt fast enough.

Still, I was curious how fast F# would be, so I filled in the missing library and tried it.

The environment used for the measurement is as follows.

Number of lines

An external library is required to work with bzip2 in the .NET Framework.

First, read every line of the dump and count them. The table lists the code for each language and the time required.

| Language | Code | Time required |
|---|---|---|
| Python | python/research/countlines.py | 3m34.911s |
| F# | fsharp/research/countlines.fsx | 2m40.270s |
| command | `bzcat FILE.xml.bz2 \| wc -l` | 2m32.203s |

F# approaches the speed of the command pipeline.

#r "AR.Compression.BZip2.dll"
open System
open System.IO
open System.IO.Compression

let target = "enwiktionary-20200501-pages-articles-multistream.xml.bz2"
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    use bs = new BZip2Stream(fs, CompressionMode.Decompress)
    use sr = new StreamReader(bs)
    while not (isNull (sr.ReadLine())) do
        lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)

Stream split

Next, read the multistream bzip2 file one stream at a time.

To avoid the overhead of copying each stream into a byte array, mask the FileStream and read it directly, using the SubStream created in the following article.
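The actual implementation is in that article; as a rough idea, SubStream is a fixed-length, read-only window over a parent stream, so that BZip2Stream sees exactly one compressed stream. A minimal sketch (my own simplified version, not the article's code):

```fsharp
open System
open System.IO

// Sketch: a fixed-length, read-only window over a parent stream.
type SubStream(parent: Stream, length: int) =
    inherit Stream()
    let mutable remaining = length
    override x.CanRead = true
    override x.CanSeek = false
    override x.CanWrite = false
    override x.Length = int64 length
    override x.Position
        with get () = int64 (length - remaining)
        and set (_: int64) = raise (NotSupportedException())
    override x.Flush() = ()
    override x.Read(buffer, offset, count) =
        // Never read past the end of the window.
        let count = min count remaining
        if count = 0 then 0
        else
            let read = parent.Read(buffer, offset, count)
            remaining <- remaining - read
            read
    override x.Seek(_, _) = raise (NotSupportedException())
    override x.SetLength(_) = raise (NotSupportedException())
    override x.Write(_, _, _) = raise (NotSupportedException())
```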

Use the stream length data (streamlen.tsv) generated in the previous article.

| Language | Code | Time required |
|---|---|---|
| Python | python/research/countlines-BytesIO.py | 3m37.827s |
| F# | fsharp/research/countlines-split.fsx | 5m23.122s |

Apparently there is non-negligible overhead in starting each BZip2Stream read, and this version is quite slow. SubStream shaves off what it can, but it does not come close to making up the difference.

#r "AR.Compression.BZip2.dll"
#load "StreamUtils.fsx"
open System
open System.IO
open System.IO.Compression
open StreamUtils

let target, slen =
    use sr = new StreamReader("streamlen.tsv")
    sr.ReadLine(),
    [|  while not <| sr.EndOfStream do
        yield sr.ReadLine() |> Convert.ToInt32 |]

let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    for length in slen do
        use ss = new SubStream(fs, length)
        use bs = new BZip2Stream(ss, CompressionMode.Decompress)
        use sr = new StreamReader(bs)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)

Overhead avoidance

When parallelizing in Python, I read streams in groups of 10 to reduce the overhead of interprocess communication; I take the same approach here. For comparison, I also try groups of 100.

| Language | Code | Time required | Remarks |
|---|---|---|---|
| F# | fsharp/research/countlines-split-10.fsx | 2m50.913s | every 10 streams |
| F# | fsharp/research/countlines-split-100.fsx | 2m40.727s | every 100 streams |

This is much faster. Since groups of 100 win, subsequent steps split every 100 streams.

```fsharp
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    // Decompress 100 consecutive bzip2 streams per BZip2Stream instance.
    for lengths in Seq.chunkBySize 100 slen do
        use ss = new SubStream(fs, Array.sum lengths)
        use bs = new BZip2Stream(ss, CompressionMode.Decompress)
        use sr = new StreamReader(bs)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```

`Seq.chunkBySize` splits the lengths into groups of 100 elements, and `Array.sum` gives each group's combined byte length.
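As a quick illustration of those two functions (toy values, not from the dump):

```fsharp
// chunkBySize groups elements; Array.sum then combines each group.
Seq.chunkBySize 3 [1; 2; 3; 4; 5; 6; 7]
|> Seq.map Array.sum
|> Seq.toList  // [6; 15; 7]
```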

String conversion

In Python, converting each stream to a string and using StringIO was faster. Try the same approach in F#.

| Language | Code | Time required | Remarks |
|---|---|---|---|
| Python | python/research/countlines-StringIO.py | 3m18.568s | per stream |
| F# | fsharp/research/countlines-split-string.fsx | 7m50.915s | per stream |
| F# | fsharp/research/countlines-split-string-10.fsx | 3m55.453s | every 10 streams |
| F# | fsharp/research/countlines-split-string-100.fsx | 3m23.417s | every 100 streams |

This method is rejected because it is not fast in F#.

```fsharp
open System.Text  // for Encoding

let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    for length in slen do
        let text =
            // Decompress the stream into memory and convert it to a string.
            use ss = new SubStream(fs, length)
            use bs = new BZip2Stream(ss, CompressionMode.Decompress)
            use ms = new MemoryStream()
            bs.CopyTo(ms)
            Encoding.UTF8.GetString(ms.ToArray())
        use sr = new StringReader(text)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```

Text extraction

Extract the contents of the XML `<text>` tag and count its lines.

Common part


```fsharp
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    // Skip the first stream (the XML header) and stop before the last one.
    fs.Seek(int64 slen.[0], SeekOrigin.Begin) |> ignore
    for lengths in Seq.chunkBySize 100 slen.[1 .. slen.Length - 2] do
        for _, text in getPages(fs, Array.sum lengths) do
            for _ in text do
                lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```

I try several methods below; each one supplies its own version of the getPages function.

String processing

Parse the tags line by line with string processing.

| Language | Code | Time required | Remarks |
|---|---|---|---|
| Python | python/research/countlines-text.py | 4m06.555s | startswith |
| F# | fsharp/research/countlines-text-split-StartsWith.fsx | 4m42.877s | StartsWith |
| F# | fsharp/research/countlines-text-split-slice.fsx | 4m14.069s | slice |
| F# | fsharp/research/countlines-text-split.fsx | 4m05.507s | Substring |

The .NET Framework's StartsWith is slow, apparently because it does culture-sensitive work such as Unicode normalization.

Using Substring to do the equivalent of StartsWith finally makes it about as fast as Python.
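For reference (not benchmarked in this article): the slow path is the culture-sensitive default overload, and asking for an ordinal comparison should avoid it. A minimal sketch:

```fsharp
open System

// Ordinal comparison skips the culture-sensitive path of the
// one-argument StartsWith overload.
let inline startsWithOrdinal (value: string) (target: string) =
    target.StartsWith(value, StringComparison.Ordinal)
```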

StartsWith


```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use bs = new BZip2Stream(ss, CompressionMode.Decompress)
    use sr = new StreamReader(bs)
    let mutable ns, id = 0, 0
    while sr.Peek() <> -1 do
        let mutable line = sr.ReadLine().TrimStart()
        if line.StartsWith "<ns>" then
            ns <- Convert.ToInt32 line.[4 .. line.IndexOf('<', 4) - 1]
            id <- 0
        elif id = 0 && line.StartsWith "<id>" then
            id <- Convert.ToInt32 line.[4 .. line.IndexOf('<', 4) - 1]
        elif line.StartsWith "<text " then
            let p = line.IndexOf '>'
            // Skip empty elements such as <text ... />.
            if line.[p - 1] = '/' then () else
            if ns <> 0 then
                // Not an article: skip to the closing tag.
                while not <| line.EndsWith "</text>" do
                    line <- sr.ReadLine()
            else
                line <- line.[p + 1 ..]
                // Yield the text contents lazily, line by line.
                yield id, seq {
                    while not <| isNull line do
                        if line.EndsWith "</text>" then
                            if line.Length > 7 then yield line.[.. line.Length - 8]
                            line <- null
                        else
                            yield line
                            line <- sr.ReadLine() }}
```

Next, define a replacement function for StartsWith and substitute it in.

Slice


```fsharp
let inline startsWith (target:string) (value:string) =
    target.Length >= value.Length && target.[.. value.Length - 1] = value
```

Substring


```fsharp
let inline startsWith (target:string) (value:string) =
    target.Length >= value.Length && target.Substring(0, value.Length) = value
```

XML parser

Compare parsing the XML into a tree against parsing it without building a tree.

XML parsing requires a root element. As we have already seen, going through strings slows F# down, so instead the streams themselves are concatenated, using the ConcatStream created in the following article.
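The actual ConcatStream lives in that article; roughly, it presents a sequence of streams as one read-only, forward-only stream. A minimal sketch (my own simplified version; a full implementation would also dispose any remaining streams in Dispose):

```fsharp
open System
open System.IO

// Sketch: read each stream to exhaustion, then move on to the next.
type ConcatStream(streams: Stream seq) =
    inherit Stream()
    let mutable rest = List.ofSeq streams
    override x.CanRead = true
    override x.CanSeek = false
    override x.CanWrite = false
    override x.Length = raise (NotSupportedException())
    override x.Position
        with get () = raise (NotSupportedException())
        and set (_: int64) = raise (NotSupportedException())
    override x.Flush() = ()
    override x.Read(buffer, offset, count) =
        match rest with
        | [] -> 0
        | s :: tail ->
            let read = s.Read(buffer, offset, count)
            if read > 0 then read
            else
                // Current stream exhausted: dispose it and continue.
                s.Dispose()
                rest <- tail
                x.Read(buffer, offset, count)
    override x.Seek(_, _) = raise (NotSupportedException())
    override x.SetLength(_) = raise (NotSupportedException())
    override x.Write(_, _, _) = raise (NotSupportedException())
```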

With a tree

| Language | Code | Time required | Remarks |
|---|---|---|---|
| Python | python/research/countlines-text-xml.py | 5m50.826s | ElementTree.fromstring |
| F# | fsharp/research/countlines-text-split-doc.fsx | 6m21.588s | XmlDocument |

F# is slower than Python.

```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    // Wrap the fragment in a root element so the XML parser accepts it.
    use cs = new ConcatStream([
        new MemoryStream("<pages>"B)
        new BZip2Stream(ss, CompressionMode.Decompress)
        new MemoryStream("</pages>"B) ])
    use xr = XmlReader.Create(cs)
    let doc = XmlDocument()
    doc.Load(xr)
    for page in doc.ChildNodes.[0].ChildNodes do
        let ns = Convert.ToInt32(page.SelectSingleNode("ns").InnerText)
        if ns = 0 then
            let id = Convert.ToInt32(page.SelectSingleNode("id").InnerText)
            let text = page.SelectSingleNode("revision/text").InnerText
            use sr = new StringReader(text)
            yield id, seq {
                while sr.Peek() <> -1 do
                    yield sr.ReadLine() }}
```

Without a tree

See the following articles for various Python parsers.

F# uses the pull-based XmlReader.

| Language | Code | Time required | Remarks |
|---|---|---|---|
| Python | python/research/countlines-text-xmlparser.py | 6m46.163s | XMLParser |
| Python | python/research/countlines-text-xmlpull.py | 6m04.553s | XMLPullParser |
| Python | python/research/countlines-text-xmliter.py | 6m29.298s | ElementTree.iterparse |
| F# | fsharp/research/countlines-text-split-reader.fsx | 3m17.916s | XmlReader |
| F# | fsharp/research/countlines-text-reader.fsx | 3m16.122s | XmlReader (undivided) |

In the .NET Framework, XmlDocument is itself built on top of XmlReader, and using XmlReader alone is overwhelmingly faster. The remaining steps use only the XmlReader method.

The situation should be the same in Python, but there the tree construction appears to be heavily optimized, so building the tree is actually faster.

```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use cs = new ConcatStream([
        new MemoryStream("<pages>"B)
        new BZip2Stream(ss, CompressionMode.Decompress)
        new MemoryStream("</pages>"B) ])
    use xr = XmlReader.Create(cs)
    let mutable ns, id = 0, 0
    while xr.Read() do
        if xr.NodeType = XmlNodeType.Element then
            match xr.Name with
            | "ns" ->
                if xr.Read() then ns <- Convert.ToInt32 xr.Value
                id <- 0
            | "id" ->
                // Only the first <id> per page is the page id.
                if id = 0 && xr.Read() then id <- Convert.ToInt32 xr.Value
            | "text" ->
                if ns = 0 && not xr.IsEmptyElement && xr.Read() then
                    yield id, seq {
                        use sr = new StringReader(xr.Value)
                        while sr.Peek() <> -1 do
                            yield sr.ReadLine() }
            | _ -> () }
```

Language table

From the code so far, the common parts are factored out.

Create a table of which entries contain data for which languages (`output1.tsv`). Language names are normalized into a separate table (`output2.tsv`).

```fsharp
// `results` (id, language pairs) and `langs` (language name → id)
// are produced by the per-method code shown below.
do
    use sw = new StreamWriter("output1.tsv")
    sw.NewLine <- "\n"
    for id, lid in results do
        fprintfn sw "%d\t%d" id lid
do
    use sw = new StreamWriter("output2.tsv")
    sw.NewLine <- "\n"
    for kv in langs do
        fprintfn sw "%d\t%s" kv.Value kv.Key
```

Within an article, each language section begins with a line of the form ==Language name==, so these lines are detected and associated with the page id. They must be distinguished from lower-level headings, which use the form ===Item name===.
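For example, an entry's wikitext is structured roughly like this (illustrative):

```
==English==
===Etymology===
===Noun===
==Finnish==
===Noun===
```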

As we have already seen, StartsWith is slow. I compare checking the start of each line one character at a time against using a regular expression.

XML parsing uses string processing in Python and XmlReader in F#.

| Language | Code | Time required | Remarks |
|---|---|---|---|
| Python | python/research/checklang.py | 4m26.421s | startswith |
| F# | fsharp/research/checklang-StartsWith.fsx | 3m43.965s | StartsWith |
| Python | python/research/checklang-ch.py | 4m30.566s | character by character |
| F# | fsharp/research/checklang.fsx | 3m24.302s | character by character |
| Python | python/research/checklang-re.py | 5m9.869s | regular expressions |
| F# | fsharp/research/checklang-re.fsx | 3m46.270s | regular expressions |

In F#, checking character by character is fast, but it is cumbersome to implement and not very flexible. Regular expressions are about as fast as StartsWith; considering flexibility, they seem the safe choice.

StartsWith


```fsharp
            for line in text do
                if line.StartsWith "==" && not <| line.StartsWith "===" then
                    // Strip "==" and any trailing "=" to get the language name.
                    let lang = line.[2..].Trim()
                    let mutable e = lang.Length - 1
                    while e > 0 && lang.[e] = '=' do e <- e - 1
                    let lang = lang.[..e].Trim()
```

Character by character


```fsharp
            for line in text do
                if line.Length >= 3 && line.[0] = '=' && line.[1] = '=' && line.[2] <> '=' then
```

Regular expressions


```fsharp
            // r is defined as in the parallel version below:
            // let r = Regex "^==([^=].*)=="
            for line in text do
                let m = r.Match line
                if m.Success then
                    let lang = m.Groups.[1].Value.Trim()
```

Parallelization

Parallelization is done using multithreading in F # and multiprocessing in Python.

| Language | Code | Time required | Remarks |
|---|---|---|---|
| Python | python/research/checklang-parallel.py | 1m16.566s | startswith |
| F# | fsharp/research/checklang-parallel.fsx | 1m03.941s | character by character |
| Python | python/research/checklang-parallel-re.py | 1m19.372s | regular expressions |
| F# | fsharp/research/checklang-parallel-re.fsx | 1m07.009s | regular expressions |

The margin is close. Python gains more from parallelization.

Character by character


```fsharp
// Each task opens its own FileStream so the workers can read in parallel.
let getlangs(pos, length) = async { return [
    use fs = new FileStream(target, FileMode.Open, FileAccess.Read)
    fs.Seek(pos, SeekOrigin.Begin) |> ignore
    for id, text in MediaWikiParse.getPages(fs, length) do
        for line in text do
            if line.Length >= 3 && line.[0] = '=' && line.[1] = '=' && line.[2] <> '=' then
                let lang = line.[2..].Trim()
                let mutable e = lang.Length - 1
                while e > 0 && lang.[e] = '=' do e <- e - 1
                yield id, lang.[..e].Trim() ]}
```

Regular expressions


```fsharp
let getlangs(pos, length) = async { return [
    let r = Regex "^==([^=].*)=="
    use fs = new FileStream(target, FileMode.Open, FileAccess.Read)
    fs.Seek(pos, SeekOrigin.Begin) |> ignore
    for id, text in MediaWikiParse.getPages(fs, length) do
        for line in text do
            let m = r.Match line
            if m.Success then
                yield id, m.Groups.[1].Value.Trim() ]}

// Chunk the (position, length) table, run the chunks in parallel,
// and flatten the results.
let results =
    sposlen.[1 .. sposlen.Length - 2]
    |> Seq.chunkBySize 100
    |> Seq.map Array.unzip
    |> Seq.map (fun (ps, ls) -> Array.min ps, Array.sum ls)
    |> Seq.map getlangs
    |> Async.Parallel
    |> Async.RunSynchronously
    |> List.concat
// Assign a sequential id to each language name as it first appears.
let langs = Dictionary<string, int>()
do
    use sw = new StreamWriter("output1.tsv")
    sw.NewLine <- "\n"
    for id, lang in results do
        let lid =
            if langs.ContainsKey lang then langs.[lang] else
            let lid = langs.Count + 1
            langs.[lang] <- lid
            lid
        fprintfn sw "%d\t%d" id lid
do
    use sw = new StreamWriter("output2.tsv")
    sw.NewLine <- "\n"
    for kv in langs do
        fprintfn sw "%d\t%s" kv.Value kv.Key
```

.NET Core and Mono

Measured with .NET Core and Mono on WSL1. Compare with the .NET Framework results on Windows.

In the table, Framework stands for the .NET Framework on Windows, while Core and Mono stand for .NET Core and Mono on WSL1.

| Code | Framework | Core | Mono | Remarks |
|---|---|---|---|---|
| fsharp/research/checklang.fsx | 3m24.302s | 3m25.545s | 4m22.330s | |
| fsharp/research/checklang-re.fsx | 3m46.270s | 3m42.882s | 4m51.236s | regular expressions |
| fsharp/research/checklang-parallel.fsx | 1m03.941s | 0m59.014s | 2m39.716s | parallel |
| fsharp/research/checklang-parallel-re.fsx | 1m07.009s | 1m06.136s | 3m28.074s | parallel, regular expressions |

.NET Core seems to be as fast as or a bit faster than the .NET Framework.

.NET Core normally expects you to create a project per executable; with this many scripts that was tedious, so I worked around it by writing a config file as described in the following article.

Impressions

Comparing the fastest method in each language, F# came out slightly ahead. What struck me was that a straight port of the same processing could still be faster in Python.

As I found in the following article, there is an overwhelming difference when it comes to character-level processing.
