[PYTHON] How much text MeCab can analyze at once

I happened to come across an [article](https://qiita.com/Sak1361/items/2519f29af82ffe965652#mecab%E3%81%AE%E3%83%90%E3%82%B0%E5%AF%BE%E7%AD%96) that described this behavior as a MeCab bug, so I looked into it. This is my memo.

Conclusion: the limit is **2,621,440 characters**.

You can see this from the value of **MAX_INPUT_BUFFER_SIZE**, defined in common.h.

#define NBEST_MAX 512
#define NODE_FREELIST_SIZE 512
#define PATH_FREELIST_SIZE 2048
#define MIN_INPUT_BUFFER_SIZE 8192
#define MAX_INPUT_BUFFER_SIZE (8192*640)
#define BUF_SIZE 8192

So the buffer can take at most 8192 x 640 = 5,242,880 bytes of input. Assuming 2 bytes per character, that is 8192 x 640 / 2 = **2,621,440 characters**. That's all.
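As a quick check of the arithmetic, here is a small Python sketch. The two constants are copied from the common.h excerpt; `max_chars` and `encoded_size` are hypothetical helper names I made up for illustration, not part of MeCab. The 2-bytes-per-character figure suits encodings such as EUC-JP; in UTF-8 most Japanese characters take 3 bytes, so the same buffer holds fewer characters.

```python
# Constants copied from the common.h excerpt.
MIN_INPUT_BUFFER_SIZE = 8192
MAX_INPUT_BUFFER_SIZE = 8192 * 640


def max_chars(bytes_per_char, buffer_size=MAX_INPUT_BUFFER_SIZE):
    """Whole characters that fit in buffer_size bytes at a fixed width per character."""
    return buffer_size // bytes_per_char


def encoded_size(text, encoding="utf-8"):
    """Number of bytes `text` occupies once encoded (hypothetical helper)."""
    return len(text.encode(encoding))


print(MAX_INPUT_BUFFER_SIZE)               # 5242880 bytes
print(max_chars(2))                        # 2621440 characters at 2 bytes each
print(max_chars(3))                        # 1747626 characters at 3 bytes each
print(encoded_size("形態素解析"))            # 15 bytes in UTF-8
print(encoded_size("形態素解析", "euc_jp"))  # 10 bytes in EUC-JP
```

In practice the number that matters is the encoded byte length of the line, so it is safer to measure bytes than to count characters.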

Bonus

size_t ibufsize = std::min(MAX_INPUT_BUFFER_SIZE,
                             std::max(param.get<int>
                                            ("input-buffer-size"),
                                            MIN_INPUT_BUFFER_SIZE));

  const bool partial = param.get<bool>("partial");
  if (partial) {
    ibufsize *= 8;
  }

  MeCab::scoped_array<char> ibuf_data(new char[ibufsize]);
  char *ibuf = ibuf_data.get();

  MeCab::scoped_ptr<MeCab::Tagger> tagger(model->createTagger());

  if (!tagger.get()) {
    WHAT_ERROR("cannot create tagger");
  }

  for (size_t i = 0; i < rest.size(); ++i) {
    MeCab::istream_wrapper ifs(rest[i].c_str());
    if (!*ifs) {
      WHAT_ERROR("no such file or directory: " << rest[i]);
    }

    while (true) {
      if (!partial) {
        ifs->getline(ibuf, ibufsize);
      } else {
        std::string sentence;
        MeCab::scoped_fixed_array<char, BUF_SIZE> line;
        for (;;) {
          if (!ifs->getline(line.get(), line.size())) {
            ifs->clear(std::ios::eofbit|std::ios::badbit);
            break;
          }
          sentence += line.get();
          sentence += '\n';
          if (std::strcmp(line.get(), "EOS") == 0 || line[0] == '\0') {
            break;
          }
        }
        std::strncpy(ibuf, sentence.c_str(), ibufsize);
      }
      if (ifs->eof() && !ibuf[0]) {
        return false;
      }
      if (ifs->fail()) {
        std::cerr << "input-buffer overflow. "
                  << "The line is split. use -b #SIZE option." << std::endl;
        ifs->clear();
      }
      const char *r = (nbest >= 2) ? tagger->parseNBest(nbest, ibuf) :
          tagger->parse(ibuf);
      if (!r)  {
        WHAT_ERROR(tagger->what());
      }
      *ofs << r << std::flush;
    }
  }

  return EXIT_SUCCESS;

#undef WHAT_ERROR
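The buffer sizing at the top of this excerpt, and the "The line is split" behavior, can be sketched in Python roughly as follows. This is a simplified model under my own assumptions: `effective_buffer_size` and `split_like_getline` are hypothetical names, not MeCab APIs, and the split assumes each getline call stores at most bufsize - 1 bytes (one byte is kept for the terminating NUL) before the stream state is cleared and reading resumes.

```python
# Constants copied from the common.h excerpt.
MIN_INPUT_BUFFER_SIZE = 8192
MAX_INPUT_BUFFER_SIZE = 8192 * 640


def effective_buffer_size(requested, partial=False):
    """Mirror std::min(MAX, std::max(requested, MIN)), then the x8 for partial mode."""
    size = min(MAX_INPUT_BUFFER_SIZE, max(requested, MIN_INPUT_BUFFER_SIZE))
    if partial:
        size *= 8
    return size


def split_like_getline(line, bufsize):
    """Cut one over-long line (bytes) into pieces of at most bufsize - 1 bytes."""
    step = bufsize - 1  # one byte of the buffer is reserved for the NUL terminator
    return [line[i:i + step] for i in range(0, len(line), step)]


print(effective_buffer_size(0))                   # 8192: raised to the minimum
print(effective_buffer_size(100000))              # 100000: inside the allowed range
print(effective_buffer_size(10**9))               # 5242880: capped at the maximum
print(effective_buffer_size(8192, partial=True))  # 65536: partial mode multiplies by 8

pieces = split_like_getline(b"a" * 20000, 8192)
print([len(p) for p in pieces])                   # [8191, 8191, 3618]
```

Because the cut is made on bytes, a piece can end in the middle of a multi-byte character, which may explain why an over-long line produces odd results rather than a clean error.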

From this code, you can see that the input handed to MeCab.Tagger.parse() is never larger than **MAX_INPUT_BUFFER_SIZE** (in the non-partial case). Next, let's look at string_buffer.h and [tagger.cpp](https://github.com/taku910/mecab/blob/3a07c4eefaffb4e7a0690a7f4e5e0263d3ddb8a3/mecab/src/tagger.cpp) to see how the Lattice is analyzed. (string_buffer.h: excerpt from lines 15-37)

bool StringBuffer::reserve(size_t length) {
  if (!is_delete_) {
    error_ = (size_ + length >= alloc_size_);
    return (!error_);
  }

  if (size_ + length >= alloc_size_) {
    if (alloc_size_ == 0) {
      alloc_size_ = DEFAULT_ALLOC_SIZE;
      ptr_ = new char[alloc_size_];
    }
    size_t len = size_ + length;
    do {
      alloc_size_ *= 2;
    } while (len >= alloc_size_);
    char *new_ptr = new char[alloc_size_];
    std::memcpy(new_ptr, ptr_, size_);
    delete [] ptr_;
    ptr_ = new_ptr;
  }

  return true;
}

This reserve, which allocates the memory, is called only in tagger.cpp. (tagger.cpp: excerpt from lines 733-741)

LatticeImpl::LatticeImpl(const Writer *writer)
    : sentence_(0), size_(0), theta_(kDefaultTheta), Z_(0.0),
      request_type_(MECAB_ONE_BEST),
      writer_(writer),
      ostrs_(0),
      allocator_(new Allocator<Node, Path>) {
  begin_nodes_.reserve(MIN_INPUT_BUFFER_SIZE);
  end_nodes_.reserve(MIN_INPUT_BUFFER_SIZE);
}

And the LatticeImpl constructor is (I think) run when a Lattice is instantiated. (tagger.cpp: excerpt from lines 227-239)

class LatticeImpl : public Lattice {
 public:
  explicit LatticeImpl(const Writer *writer = 0);
  ~LatticeImpl();

  // clear internal lattice
  void clear();

  bool is_available() const {
    return (sentence_ &&
            !begin_nodes_.empty() &&
            !end_nodes_.empty());
  }

From all this, Lattice analysis itself does not seem to have a hard limit: when memory runs short, the allocation is doubled and the existing contents are copied over with memcpy. (Of course it will fail once memory is exhausted, but the input appears to be cut off at MAX_INPUT_BUFFER_SIZE before that happens.)
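To make the doubling concrete, here is a small Python model of the growth rule in the StringBuffer::reserve excerpt. It only tracks the sizes, not the memory itself. The function name `reserve_model` and its tuple return value are my own, and the default of 8192 for DEFAULT_ALLOC_SIZE is an assumption, since its definition is not shown in the excerpts.

```python
def reserve_model(alloc_size, size, length, owns_buffer=True,
                  default_alloc_size=8192):  # assumed value for DEFAULT_ALLOC_SIZE
    """Return (ok, new_alloc_size) following the branches of the C++ excerpt."""
    needed = size + length
    if not owns_buffer:
        # The !is_delete_ branch: no growth, only report whether the data fits.
        return needed < alloc_size, alloc_size
    if needed >= alloc_size:
        if alloc_size == 0:
            alloc_size = default_alloc_size
        # do { alloc_size_ *= 2; } while (len >= alloc_size_);
        while True:
            alloc_size *= 2
            if needed < alloc_size:
                break
    return True, alloc_size


print(reserve_model(0, 0, 10))            # (True, 16384): first allocation, doubled once
print(reserve_model(16384, 16000, 100))   # (True, 16384): still fits, no growth
print(reserve_model(16384, 16000, 384))   # (True, 32768): exactly full, so it doubles
print(reserve_model(16384, 0, 100000))    # (True, 131072): doubled three times
print(reserve_model(8192, 8000, 500, owns_buffer=False))  # (False, 8192)
```

Note that the do-while form doubles at least once whenever growth is needed, so the capacity always ends up strictly larger than the requested total.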

Other references

mecab.h: definitions of structures such as Lattice.

libmecab.cpp: definitions of mutable_lattice() from [tagger.cpp](https://github.com/taku910/mecab/blob/3a07c4eefaffb4e7a0690a7f4e5e0263d3ddb8a3/mecab/src/tagger.cpp) and of functions associated with the model, such as mecab_model_new_lattice().

END.
