[PYTHON] Think about the next generation of Rack and WSGI

I have been thinking about an alternative to Rack and WSGI (the protocol specifications, not the libraries rack.rb and wsgiref.py). Please note that this may be disorganized, because I have only written down my ideas as they came.

I expect to revise this article several times. Comments are welcome.

Rack and WSGI overview

Ruby's Rack and Python's WSGI are specifications that abstract HTTP requests and responses.

For example in Rack:

class RackApp
  def call(env)    #env is a Hash object that represents the request
    status  = 200                             #Status code
    headers = {"Content-Type"=>"text/plain"}  #header
    body    = "Hello"                         #body
    return status, headers, [body]   #These three represent the response
  end
end

This is how Rack and WSGI abstract HTTP requests and responses: a request goes in as a hash, and a status, headers, and body come out.

This allows a web application to run on any application server (WEBrick, Unicorn, Puma, uWSGI, waitress) that supports Rack or WSGI. For example, you can easily switch between WEBrick or waitress, which are easy to use during development, and the fast Unicorn, Puma, or uWSGI in production.

Rack and WSGI are also designed to make it easy to add functionality by using the so-called decorator pattern. For example, you can add session handling or detailed error pages without changing your web application.

## Original Rack application
app = RackApp.new

## For example, add session functionality
require 'rack/session/cookie'
app = Rack::Session::Cookie.new(app,
        :key => 'rack.session', :path=>'/',
        :expire_after => 3600,
        :secret => '54vYjDUSB0z7NO0ck8ZeylJN0rAX3C')

## For example, show detailed errors only in the development environment
if ENV['RACK_ENV'] == "development"
  require 'rack/showexceptions'
  app = Rack::ShowExceptions.new(app)
end

Wrapper objects that add functionality to the original web application in this way are called "middleware" in Rack and WSGI. In the example above, Rack::Session::Cookie and Rack::ShowExceptions are middleware.
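To make the wrapping concrete, here is a minimal sketch of a middleware written without the rack gem. The class names HelloApp and RuntimeHeader and the header name X-Runtime are made up for illustration; the only thing taken from Rack is the call(env) convention.

```ruby
# A tiny Rack-style application: call(env) returns [status, headers, body].
class HelloApp
  def call(env)
    [200, {"Content-Type" => "text/plain"}, ["Hello"]]
  end
end

# A hypothetical middleware (not part of Rack): it wraps any object that
# responds to call(env) and adds one header to the response.
class RuntimeHeader
  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # merge returns a new hash, so the wrapped application's hash is not mutated.
    headers = headers.merge("X-Runtime" => format("%.6f", elapsed))
    [status, headers, body]
  end
end

app = RuntimeHeader.new(HelloApp.new)
status, headers, body = app.call({"REQUEST_METHOD" => "GET", "PATH_INFO" => "/"})
```

Because the middleware exposes the same call(env) interface as the application it wraps, middleware can be stacked in any number, and the server does not need to know how many layers there are.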

Problems with WSGI (Python)

WSGI is the specification that Rack was modeled on. Rack would not have been born without WSGI.

When WSGI first appeared, Java already had something similar, the Servlet. However, the Servlet specification was quite complicated and difficult to implement [^1]. Because the specification was so complicated, behavior also differed slightly between application servers, so in the end everyone checked the behavior by running Tomcat, the reference implementation, instead of reading the specification.

That is why WSGI, while sharing the idea behind Servlet, appeared as a completely different and very simple specification.

[^1]: Java and IBM are good at making things unnecessarily complicated.

Let's look at the specific code. Below is the WSGI sample code.

class WSGIApp(object):

  ## environ is a hash (dictionary) object representing the request
  def __call__(self, environ, start_response):
    status  = "200 OK"      #Strings, not numbers
    headers = [             #List of keys and values, not hashes
      ('Content-Type', 'text/plain'),
    ]
    start_response(status, headers)   #Start a response
    return [b"Hello World"]  #Return the body

If you look at this, you can see that it is quite different from Rack.

Now, in my opinion, the biggest problem with WSGI is probably the existence of the callback function start_response(). Because of it, beginners must first understand "functions that receive functions" (higher-order functions) in order to understand WSGI, which is a high barrier [^2].

[^2]: Advanced users who say "higher-order functions are easy to understand" fundamentally lack the ability to see where beginners stumble, so please go back to the world of functional languages, where there are no beginners. A great player is not necessarily a great coach. Someone who is good at every sport is not suited to teaching people who are hopeless at exercise.

Calling a WSGI application is also cumbersome because of start_response(). This is really troublesome.

## Unless you prepare something like this every time ...
class StartResponse(object):
  def __call__(self, status, headers):
    self.status = status
    self.headers = headers

## ... you cannot call a WSGI application
app = WSGIApp()
environ = {'REQUEST_METHOD': 'GET', ...(snip)... }
start_response = StartResponse()
body = app(environ, start_response)
print(start_response.status)
print(start_response.headers)

(In fact, a specification called Web3 (PEP-444) was once proposed to improve on WSGI (PEP-333) in this respect. Web3 abolished the callback function and, like Rack, was designed to return status, headers, and body. I personally had high hopes for it, but it was not adopted in the end. That's a pity.)

WSGI is also slightly annoying in that the response headers are a list of key/value pairs instead of a hash (dictionary) object, because you have to search the list every time you set a header.

## For example, given response headers like this
content = b"<h1>Hello</h1>"   # example body, assumed here so the snippet runs
resp_headers = [
  ('Content-Type', "text/html"),
  ('Content-Disposition', "attachment;filename=index.html"),
  ('Content-Encoding', "gzip"),
]
## you have to search the list to set a value
key = 'Content-Length'
val = str(len(content))
for i, (k, v) in enumerate(resp_headers):
  if k == key:   # or k.lower() == key.lower()
    break
else:
  i = -1
if i >= 0:   # overwrite if the header already exists
  resp_headers[i] = (key, val)
else:        # otherwise append it
  resp_headers.append((key, val))

This is a hassle. You could define a dedicated utility function, but it would have been better to use a hash (dictionary) object in the first place.

## With a hash (dictionary) object ...
resp_headers = {
  'Content-Type':        "text/html",
  'Content-Disposition': "attachment;filename=index.html",
  'Content-Encoding':    "gzip",
}
## ... setting a value is very easy!
## (assuming the case of the key names is unified)
resp_headers['Content-Length'] = str(len(content))

Problems with Rack (Ruby)

Rack (Ruby) is a specification designed with reference to WSGI (Python). Rack is very similar to WSGI, but has been improved to be simpler.

class RackApp
  def call(env)   #env is a hash object that represents the request
    status  = 200
    headers = {
      'Content-Type' => 'text/plain;charset=utf-8',
    }
    body    = "Hello World"
    return status, headers, [body]  #These three represent the response
  end
end

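The specific differences are as follows.

  • The status is an integer, not a string.
  • The response headers are a hash, not a list of key/value pairs.
  • There is no start_response() callback; the application simply returns the status, headers, and body.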

Now, in Rack the response headers are represented by a hash object. In that case, what about headers that can appear multiple times, such as Set-Cookie?

The Rack specification contains the following description.

The values of the header must be Strings, consisting of lines (for multiple header values, e.g. multiple Set-Cookie values) separated by "\n".

In other words, if a header value is a multi-line string, the header is treated as having appeared once per line.
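As a concrete illustration of that rule, the sketch below packs two cookies into one String and expands it into the header lines a server would write. The helper name header_lines and the cookie values are made up for this example.

```ruby
# Under the rule quoted above, two cookies are packed into one String.
headers = {
  "Content-Type" => "text/plain",
  "Set-Cookie"   => "a=1; Path=/\nb=2; Path=/",
}

# Hypothetical helper: expand the hash into "Name: value" lines,
# one line per newline-separated value.
def header_lines(headers)
  headers.flat_map do |name, value|
    value.split("\n").map { |v| "#{name}: #{v}" }
  end
end

lines = header_lines(headers)
lines.each { |line| puts line }
```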

But is this a good specification? With it, we have to check whether every response header value contains a newline character, which hurts performance.

headers.each do |k, v|
  v.split(/\n/).each do |s|   #← Double loop;-(
    puts "#{k}: #{s}"
  end
end

Instead, a rule such as "for headers that appear multiple times, make the value an array" seems better.

headers.each do |k, v|
  if v.is_a?(Array)     #← This is better
    v.each {|s| puts "#{k}: #{s}" }
  else
    puts "#{k}: #{v}"
  end
end

Alternatively, only the Set-Cookie header could be treated specially. In practice, Set-Cookie is about the only header that appears multiple times [^3], so this would not be a bad specification either.

set_cookie = "Set-Cookie"
headers.each do |k, v|
  if k == set_cookie     # ← special treatment only for Set-Cookie
    v.split(/\n/).each {|s| puts "#{k}: #{s}" }
  else
    puts "#{k}: #{v}"
  end
end

[^3]: I believe the Via header can also appear multiple times, but it is outside the scope of Rack and WSGI, so only Set-Cookie needs to be considered.

Another point is the close() method of the response body. The Rack and WSGI specifications state that if the response body object has a close() method, the application server calls close() when the response to the client is complete. This rule mainly assumes that the response body is a File object.

  def call(env)
    filename = "logo.png"
    headers = {'Content-Type'   => "image/png",
               'Content-Length' => File.size(filename).to_s}
    ## Open the file
    body = File.open(filename, 'rb')
    ## When the response is complete, the application server
    ## automatically calls close() on the opened file
    return [200, headers, body]
  end
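Seen from the application server's side, the rule could look roughly like the sketch below. The helper send_body and the class TrackedBody are hypothetical names used only to show the each()/close() contract; a real server writes to a socket rather than a StringIO.

```ruby
require "stringio"

# Hypothetical server-side helper: write every chunk of the body to `out`,
# then call close() on the body if it has one, even when each() raises.
def send_body(body, out)
  body.each { |chunk| out << chunk }
ensure
  body.close if body.respond_to?(:close)
end

# A body object that records whether close() was called.
class TrackedBody
  attr_reader :closed

  def initialize(chunks)
    @chunks = chunks
    @closed = false
  end

  def each(&block)
    @chunks.each(&block)
  end

  def close
    @closed = true
  end
end

out  = StringIO.new
body = TrackedBody.new(["Hello, ", "World"])
send_body(body, out)
```

A plain array of strings has no close() method, so the same helper works for it without any special case.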

But I think this can be handled simply by closing the file at the end of the each() method.

class AutoClose
  def initialize(file)
    @file = file
  end
  def each
    ## Reading line by line like this is not efficient
    #@file.each do |line|
    #  yield line
    #end
    ## It is more efficient to read in larger chunks
    while (s = @file.read(8192))
      yield s
    end
  ensure            #If you read all the files or if there is an error
    @file.close()   #Automatically close
  end
end

The rule "call close() if the body has it" is needed for the case where the each() method of the response body is never called. Personally, I think a general cleanup mechanism like teardown() in xUnit should have been considered, rather than a rule that only has File objects in mind (although I don't have a good idea for it either).

About the Environment object

In both Rack and WSGI, HTTP requests are represented as hash (dictionary) objects. This is called the Environment in the Rack and WSGI specifications.

Let's see what this looks like.

## Filename: sample1.ru

require 'rack'

class SampleApp
  ## Inspect Environment data
  def call(env)
    status = 200
    headers = {'Content-Type' => "text/plain;charset=utf-8"}
    body = env.map {|k, v| "%-25s: %s\n" % [k.inspect, v.inspect] }.join()
    return status, headers, [body]
  end
end

app = SampleApp.new

run app

When I ran this with rackup sample1.ru -E production -s puma -p 9292 and accessed http://localhost:9292/index?x=1 in a browser, I got the following result, for example. This is the contents of the Environment.

"rack.version"           : [1, 3]
"rack.errors"            : #<IO:<STDERR>>
"rack.multithread"       : true
"rack.multiprocess"      : false
"rack.run_once"          : false
"SCRIPT_NAME"            : ""
"QUERY_STRING"           : "x=1"
"SERVER_PROTOCOL"        : "HTTP/1.1"
"SERVER_SOFTWARE"        : "2.15.3"
"GATEWAY_INTERFACE"      : "CGI/1.2"
"REQUEST_METHOD"         : "GET"
"REQUEST_PATH"           : "/index"
"REQUEST_URI"            : "/index?x=1"
"HTTP_VERSION"           : "HTTP/1.1"
"HTTP_HOST"              : "localhost:9292"
"HTTP_CACHE_CONTROL"     : "max-age=0"
"HTTP_COOKIE"            : "_ga=GA1.1.1305719166.1445760613"
"HTTP_CONNECTION"        : "keep-alive"
"HTTP_ACCEPT"            : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
"HTTP_USER_AGENT"        : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9"
"HTTP_ACCEPT_LANGUAGE"   : "ja-jp"
"HTTP_ACCEPT_ENCODING"   : "gzip, deflate"
"HTTP_DNT"               : "1"
"SERVER_NAME"            : "localhost"
"SERVER_PORT"            : "9292"
"PATH_INFO"              : "/index"
"REMOTE_ADDR"            : "::1"
"puma.socket"            : #<TCPSocket:fd 14>
"rack.hijack?"           : true
"rack.hijack"            : #<Puma::Client:0x3fd60649ac48 @ready=true>
"rack.input"             : #<Puma::NullIO:0x007fac0c896060>
"rack.url_scheme"        : "http"
"rack.after_reply"       : []

(rack.hijack is a new feature introduced in Rack 1.5. For details, see the article on the Hijack API listed in the references.)

This environment contains three types of data.

The Environment is a collection of these items. Personally, I don't like this kind of specification; I wish the request headers, at least, were kept separate from everything else.

The reason for this specification is that it is based on the CGI specification. Young people today may not know CGI, but it was used very widely in the past. WSGI borrowed from the CGI specification to define its Environment, and Rack inherited it. So it may look strange to someone who does not know CGI. Someone might well ask, "Why is the User-Agent header renamed to HTTP_USER_AGENT? Why not just use the string User-Agent?"
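The CGI-style renaming can be sketched in a few lines. The function name env_key is made up for this example; the rule itself (uppercase, hyphens to underscores, HTTP_ prefix, with Content-Type and Content-Length as the two exceptions that get no prefix) comes from the CGI convention that Rack and WSGI inherited.

```ruby
# Convert an HTTP header name to the key used in the CGI-style Environment.
# Content-Type and Content-Length are the two exceptions: no HTTP_ prefix.
def env_key(header_name)
  key = header_name.upcase.tr("-", "_")
  return key if key == "CONTENT_TYPE" || key == "CONTENT_LENGTH"
  "HTTP_#{key}"
end

puts env_key("User-Agent")      # HTTP_USER_AGENT
puts env_key("Accept-Language") # HTTP_ACCEPT_LANGUAGE
puts env_key("Content-Type")    # CONTENT_TYPE
```

Note that the conversion loses information: the original capitalization and the difference between a hyphen and an underscore cannot be recovered from the Environment key.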

Problems with Environment objects

As we have seen, an Environment object is a hash object that contains dozens of elements.

From a performance standpoint, creating a hash object with dozens of elements on every request is undesirable in Ruby and Python, because it is quite expensive. For example, with Keight.rb, a framework 100 times faster than Ruby on Rails, **it may take longer to create the Environment object than to process the request**.

Let's actually check it with a benchmark script.

# -*- coding: utf-8 -*-
require 'rack'
require 'keight'
require 'benchmark/ips'

## Create an action class (a controller in MVC)
class API < K8::Action
  mapping '/hello',  :GET=>:say_hello
  def say_hello()
    return "<h1>Hello, World!</h1>"
  end
end

##Create a Rack application and assign an action class
mapping = [
    ['/api',   API],
]
rack_app = K8::RackApplication.new(mapping)

##Execution example
expected = [
  200,
  {"Content-Length"=>"22", "Content-Type"=>"text/html; charset=utf-8"},
  ["<h1>Hello, World!</h1>"]
]
actual = rack_app.call(Rack::MockRequest.env_for("/api/hello"))
actual == expected  or raise "assertion failed"

## Environment object that represents GET /api/hello
env = Rack::MockRequest.env_for("/api/hello")

##benchmark
Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 1)

  ##Create a new Environment object(make a copy)
  x.report("just copy env") do |n|
    i = 0
    while (i += 1) <= n
      env.dup()
    end
  end

  ##Create an Environment object to handle the request
  x.report("Keight (copy env)") do |n|
    i = 0
    while (i += 1) <= n
      actual = rack_app.call(env.dup)
    end
    actual == expected  or raise "assertion failed"
  end

  ##Reuse Environment objects to handle requests
  x.report("Keight (reuse env)") do |n|
    i = 0
    while (i += 1) <= n
      actual = rack_app.call(env)
    end
    actual == expected  or raise "assertion failed"
  end

  x.compare!
end

When I ran this, I got the following results, for example (Ruby 2.3, Keight.rb 0.2, OSX El Capitan):

Calculating -------------------------------------
       just copy env    12.910k i/100ms
   Keight (copy env)     5.523k i/100ms
  Keight (reuse env)    12.390k i/100ms
-------------------------------------------------
       just copy env    147.818k (± 8.0%) i/s -    735.870k
   Keight (copy env)     76.103k (± 4.4%) i/s -    381.087k
  Keight (reuse env)    183.065k (± 4.8%) i/s -    916.860k

Comparison:
  Keight (reuse env):   183064.5 i/s
       just copy env:   147818.2 i/s - 1.24x slower
   Keight (copy env):    76102.8 i/s - 2.41x slower

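From the last three lines we can see that:

  • Just copying the Environment object (about 148k i/s) is slower than handling a request with a reused Environment (about 183k i/s).
  • Handling a request with a freshly copied Environment (about 76k i/s) is 2.41 times slower than handling it with a reused one, so roughly half of the time goes to creating the Environment.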

In this situation, making the framework even faster will not make the application much faster. To get past this impasse, it seems necessary to improve the Rack specification itself.

Problems with decorator patterns

(TODO)

Think about the next generation of Rack and WSGI

Now, at last, the main subject.

To solve the problems described so far, I have considered an alternative to the current Rack and WSGI. It is, so to speak, "the ultimate Rack as I imagine it".

The new specification remains an abstraction of HTTP requests and responses. So I'll focus on how to abstract these two.

Also, the current Rack and WSGI partially inherit the CGI specification. However, CGI is an old specification that assumes data is passed through environment variables. It does not suit the present day, so let's forget about the CGI spec.

HTTP request

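HTTP requests are divided into the following elements:

  • Request method
  • Request path
  • Request headers
  • Query parameter
  • I/O-related objects
  • Other request information
  • Server information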

The request method can be an uppercase string or a Symbol. A Symbol seems better in terms of performance.

meth = :GET

The request path can be a string. Rack has to consider SCRIPT_NAME as well as PATH_INFO, but hardly anyone uses SCRIPT_NAME anymore, so only the equivalent of PATH_INFO is considered here.

path = "/index.html"

The request headers can be a hash object. I don't want to convert names like User-Agent → HTTP_USER_AGENT. However, HTTP/2 apparently uses lowercase header names, so I will probably follow that.

headers = {
  "host"       => "www.example.com",
  "user-agent" => "Mozilla/5.0 ....(snip)....",
  ....(snip)....,
}

The query parameter is either nil or a string. If the URL has no ?, it is nil; if it has one, it is a string (possibly an empty string).

query = "x=1"
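The nil versus empty-string distinction falls out naturally from String#split with a limit of 2. The sketch below assumes a hypothetical helper named split_target that receives the raw request target.

```ruby
# Split a request target into path and query.
# With a limit of 2, split keeps a trailing empty string, so "/index?"
# yields "" for the query, while "/index" yields nil.
def split_target(target)
  path, query = target.split("?", 2)
  [path, query]
end

p split_target("/index")      # ["/index", nil]
p split_target("/index?")     # ["/index", ""]
p split_target("/index?x=1")  # ["/index", "x=1"]
```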

I/O-related items (rack.input, rack.errors, and rack.hijack or puma.socket) should go in one array. These are just the equivalents of stdin, stderr, and stdout ... aren't they? Perhaps the socket can double as rack.input, but I'm not familiar enough with it, so I'll keep them separate here.

ios = [
  StringIO.new(),   # rack.input
  $stderr,          # rack.errors
  puma_socket,
]

The value of other request information changes for each request. This should be a hash object.

options = {
  http:       "1.1",    # HTTP_VERSION
  client:     "::1",    # REMOTE_ADDR
  protocol:   "http",   # rack.url_scheme
}

Finally, the server information does not change unless the application server changes. So once it has been created as a hash object, it can be reused.

server = {
  name:  "localhost".freeze,    # SERVER_NAME
  port:  "9292".freeze,         # SERVER_PORT
  'rack.version':       [1, 3].freeze,
  'rack.multithread':   true,
  'rack.multiprocess':  false,
  'rack.run_once':      false,
}.freeze

Consider a Rack application that receives these.

class RackApp
  def call(meth, path, headers, query, ios, options, server)
    input, errors, socket = ios
    ...
  end
end

Wow, that is seven arguments, which is a bit much. The first three (meth, path, and headers) are the core of the request, so I'll leave them as separate arguments; query and ios can probably be folded into options.

options = {
  query:    "x=1",     # QUERY_STRING
  #
  input:    StringIO.new,   # rack.input
  error:    $stderr,        # rack.errors
  socket:   puma_socket,    # rack.hijack or puma.socket
  #
  http:     "1.1",     # HTTP_VERSION
  client:   "::1",     # REMOTE_ADDR
  protocol: "http",    # rack.url_scheme
}

This will reduce the number of arguments from seven to five.

class RackApp
  def call(meth, path, headers, options, server)
    query  = options[:query]
    input  = options[:input]
    error  = options[:error]
    socket = options[:socket]   # or :output ?
    ...
  end
end

Well, I think this is acceptable.
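To check that the five-argument form carries the same information as today's Environment, here is a rough sketch of an adapter from a classic Rack Environment hash to the five values. The name split_env and the constant SERVER_KEYS are hypothetical, and the key choices simply follow the mapping in the comments above. One assumption to note: a classic Environment cannot distinguish "no ?" from an empty query, so the query is passed through as-is.

```ruby
# Keys that describe the server rather than a single request.
SERVER_KEYS = %w[rack.version rack.multithread rack.multiprocess rack.run_once].freeze

# Hypothetical adapter: classic Rack Environment -> the five proposed arguments.
def split_env(env)
  meth = env["REQUEST_METHOD"].to_sym
  path = env["PATH_INFO"]
  headers = {}
  env.each do |key, value|
    next unless key.start_with?("HTTP_") && key != "HTTP_VERSION"
    # HTTP_USER_AGENT -> "user-agent"
    headers[key.sub(/\AHTTP_/, "").downcase.tr("_", "-")] = value
  end
  options = {
    query:    env["QUERY_STRING"],
    input:    env["rack.input"],
    error:    env["rack.errors"],
    socket:   env["puma.socket"],
    http:     env["HTTP_VERSION"].to_s.sub("HTTP/", ""),
    client:   env["REMOTE_ADDR"],
    protocol: env["rack.url_scheme"],
  }
  server = {name: env["SERVER_NAME"], port: env["SERVER_PORT"]}
  SERVER_KEYS.each { |k| server[k.to_sym] = env[k] }
  [meth, path, headers, options, server.freeze]
end

env = {
  "REQUEST_METHOD" => "GET", "PATH_INFO" => "/index", "QUERY_STRING" => "x=1",
  "HTTP_VERSION" => "HTTP/1.1", "HTTP_HOST" => "localhost:9292",
  "HTTP_USER_AGENT" => "Mozilla/5.0", "REMOTE_ADDR" => "::1",
  "rack.url_scheme" => "http", "SERVER_NAME" => "localhost", "SERVER_PORT" => "9292",
  "rack.version" => [1, 3], "rack.multithread" => true,
  "rack.multiprocess" => false, "rack.run_once" => false,
}
meth, path, headers, options, server = split_env(env)
```

In a real server the server hash would be built once at startup and reused, rather than rebuilt per request as this adapter does.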

HTTP response

The HTTP response can still be represented by the status, headers, and body.

  def call(meth, path, headers, options, server)
    status  = 200
    headers = {"content-type"=>"application/json"}
    body    = '{"message":"Hello!"}'
    return status, headers, body
  end

However, I think the Content-Type header could be treated specially. In current Rack applications, the response headers very often contain nothing but Content-Type, as in {"Content-Type"=>"text/html"} or {"Content-Type"=>"application/json"}. So if Content-Type alone is treated specially and made a separate value, things become a little simpler.

  def call(meth, path, headers, options, server)
    ## Instead of this ...
    return 200, {"Content-Type"=>"text/plain"}, ["Hello"]
    ## ... this is more concise
    return 200, "text/plain", {}, ["Hello"]
  end

There are some other issues as well.

  • Is the status an integer or a string?
    Integers are fine, but it would be nice to have a way to specify a custom status. That does not have to be part of the Rack specification, though; a way to register one with each application server would be enough.
  • Is the header a hash or a list?
    It should be a hash by now.
  • What if there are multiple Set-Cookie headers?
    As already explained, header values should be allowed to be arrays of strings. In return, header values must not contain newline characters.
  • Should strings be allowed as the body?
    The current Rack spec requires the body to implement the each() method, so a string cannot be given directly as the body. Instead, the standard practice is to give an array of strings.

    However, since most responses return a single string as the body, wrapping it in an array every time is wasteful. If possible, the body should be "a string, or an object whose each() yields strings".
  • Should the body's close() method be called when the response is complete?
    This is a difficult problem. As mentioned earlier, this rule is unnecessary as long as the each() method is guaranteed to be called. However, there is no such guarantee, so the server would guarantee to call close() instead.

    But what is really desirable is a feature equivalent to teardown(). It's a pity that I can't come up with a concrete specification [^4].

[^4]: I thought rack.after_reply was exactly that, but it seems to be a Puma-specific feature.
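Putting these answers together, a response in the proposed shape can be converted mechanically to a classic Rack triple. The sketch below assumes the four-value form (status, content type, headers, body) discussed above; the function name to_rack_response is hypothetical.

```ruby
# Hypothetical converter from the response shape discussed above
# (status, content type, headers, body) to a classic Rack triple.
def to_rack_response(status, content_type, headers, body)
  rack_headers = {"Content-Type" => content_type}
  headers.each do |name, value|
    # An Array value means the header appears several times; classic Rack
    # expects those values joined with "\n".
    rack_headers[name] = value.is_a?(Array) ? value.join("\n") : value
  end
  # Classic Rack requires a body that responds to each(), so wrap a String.
  rack_body = body.is_a?(String) ? [body] : body
  [status, rack_headers, rack_body]
end

response = to_rack_response(200, "text/plain",
                            {"Set-Cookie" => ["a=1", "b=2"]}, "Hello")
p response
```

A converter like this would let applications written against the new shape run on existing Rack servers during a transition period.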

Decorator pattern

(TODO)

Event-driven and non-blocking I/O

(TODO)

HTTP/2 support

(TODO)

In conclusion

I would like to hear the opinions of experts.

References

WSGI related

  • PEP-0333 -- Python Web Server Gateway Interface v1.0.1
    • https://www.python.org/dev/peps/pep-3333/ (Revised version for Python3)
    • https://www.python.org/dev/peps/pep-0333/ (original)
    • WSGI specifications. Everything started here.
  • PEP-0444 -- Python Web3 Interface
    • https://www.python.org/dev/peps/pep-0444/#values-returned-by-a-web3-application
    • A specification proposed to be the successor to WSGI. Unfortunately, it was not adopted.

Rack related

  • Rack: a Ruby Webserver Interface
    • http://rack.github.io/ (Web site)
    • http://rubydoc.info/github/rack/rack/master/file/SPEC
    • Recent specifications. There is also a description of the hijacking API.
  • Rubyist Magazine: Rack spec (translation)
    • http://magazine.rubyist.net/?0033-TranslationArticle
    • Note that it is old, because it dates from the Rack 1.1 era.
  • About Rack 1.5 new feature "Hijack API"
    • http://kwatch.houkagoteatime.net/blog/2013/01/25/rack-hijacking-api/
    • To be honest, I have my doubts about this specification.

Related Links

Just recently, a question about HTTP/2 support came up on the Rack mailing list. There was some talk about Rack 2 in connection with it, so I looked through various things.

  • Wardrop/Rack-Next: For discussion and planning of the next ruby web interface.
    • https://github.com/Wardrop/Rack-Next
    • Expectations for the next version of Rack. I'll read it later.
  • Rack 2.0, or Rack for the Future
    • https://gist.github.com/raggi/11c3491561802e573a47
    • Lists the pros and cons of Rack.
  • the_metal: A spike for thoughts about Rack 2.0
    • https://github.com/tenderlove/the_metal
    • An experiment by Nokogiri author tenderlove to determine the minimum specifications for Request and Response objects. There are interesting opinions in the Issues.
  • halorgium/rack2: async hax for rack
    • https://github.com/halorgium/rack2
    • An attempt to make Rack asynchronous?
