Convert URL strings to a tags in Rails

While processing mail in Rails, I wanted to convert the URLs contained in a string into a tags.

After a little research, I found that `URI.extract` makes it easy to pull the URLs out of a string, so I expected this to be relatively easy to write. In fact there were quite a few traps and I got stuck, so I decided to write it up after reviewing it.

TL;DR: The final code ended up as follows. How I got there, and why this version works, are explained afterwards.

def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) { %{<a href='#{$&}' target='_blank'>#{$&}</a>} }
end

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com/fuga'
convert_url_to_a_element(text)
=> "url1: <a href='http://hogehoge.com/hoge' target='_blank'>http://hogehoge.com/hoge</a> url2: <a href='http://hogehoge.com/fuga' target='_blank'>http://hogehoge.com/fuga</a>"

Anti-pattern

First, the wrong way to write it. Even so, it handles text like the following without any problem, which is why I did not notice the trap right away.

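def convert_url_to_a_element(text)
  URI.extract(text, %w[http https]).uniq.each do |url|
    sub_text = "<a href='#{url}' target='_blank'>#{url}</a>"
    # gsub returns a new string, so the result must be assigned back
    text = text.gsub(url, sub_text)
  end
  text
end

text = 'url1: http://hogehoge.com url2: http://fugafuga.com'
convert_url_to_a_element(text)
=> "url1: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a> url2: <a href='http://fugafuga.com' target='_blank'>http://fugafuga.com</a>"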

`URI.extract` returns every URL-formatted substring of the text, as shown below.

text = 'url1: http://hogehoge.com url2: http://fugafuga.com'
URI.extract(text, %w[http https])
=> ["http://hogehoge.com", "http://fugafuga.com"]

The method then loops over these with each and replaces them one by one. However, if the text contains two URLs where one is contained in the other, such as the same domain with and without a path, this happens:

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
convert_url_to_a_element(text)
=> "url1: <a href='<a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>/hoge' target='_blank'><a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>/hoge</a> url2: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>"

The result is completely broken.

Cause

The cause is that the second replacement also hits the text that the first replacement had already turned into an a tag. In other words, this approach has a pitfall: **it breaks when one URL is a substring of another**, as can happen with two URLs on the same host.
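The double substitution can be traced step by step. The following is a minimal sketch using the same sample text; the variable names `urls` and `after_first` are made up for illustration. It shows that after the first replacement the shorter URL appears three times, and only one of those is the original url2.

```ruby
require 'uri'

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
urls = URI.extract(text, %w[http https])
# urls[0] is the longer URL, urls[1] the shorter one

# The shorter URL is a substring of the longer one.
prefix_problem = urls[0].include?(urls[1])

# State of the text after only the first replacement has run.
after_first = text.gsub(urls[0], "<a href='#{urls[0]}' target='_blank'>#{urls[0]}</a>")

# The shorter URL now occurs in the href, in the link text, and as url2,
# so a plain string gsub on it rewrites all three places.
occurrences = after_first.scan(urls[1]).size
```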

Countermeasure

Instead of looping with each over the strings returned by `URI.extract`, get a regular expression and pass it to gsub as the pattern. The text is then scanned only once, so already-replaced text is never matched again and double substitution cannot occur.

def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) { %{<a href='#{$&}' target='_blank'>#{$&}</a>} }
end
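As a quick check, the same method can be run against the input that broke the anti-pattern. This sketch repeats the method so that it is self-contained; the variable name `result` is only for illustration.

```ruby
require 'uri'

def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  # Each match is replaced once, in a single left-to-right scan.
  text.gsub(uri_reg) { %{<a href='#{$&}' target='_blank'>#{$&}</a>} }
end

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
result = convert_url_to_a_element(text)
# => "url1: <a href='http://hogehoge.com/hoge' target='_blank'>http://hogehoge.com/hoge</a> url2: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>"
```

Each URL is wrapped exactly once, even though the second URL is a prefix of the first.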

Supplementary memo

About URI.regexp

`URI.regexp` returns a regular expression that matches URL strings of the specified schemes. You could write such a pattern by hand, but this method builds it for you.

Looking at the return value, I would not want to write this from scratch.

URI.regexp(%w[http https])
=> /(?=(?-mix:http|https):)
        ([a-zA-Z][\-+.a-zA-Z\d]*):                           (?# 1: scheme)
        (?:
           ((?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)                    (?# 2: opaque)
        |
           (?:(?:
             \/\/(?:
                 (?:(?:((?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)?        (?# 3: userinfo)
                   (?:((?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))? (?# 4: host, 5: port)
               |
                 ((?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+)                 (?# 6: registry)
               )
             |
             (?!\/\/))                           (?# XXX: '\/\/' is the mark for hostport)
             (\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?                    (?# 7: path)
           )(?:\?((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?                 (?# 8: query)
        )
        (?:\#((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?                  (?# 9: fragment)
      /x
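The numbered comments in the pattern above (1: scheme, 4: host, 5: port, 7: path, and so on) correspond to capture groups, so the same regular expression can also pick a URL apart. The following is a minimal sketch, assuming the group numbering shown in that output; the sample URL and variable names are made up for illustration. Note that current Ruby documentation marks `URI.regexp` as obsolete, so treat this as an illustration of the API used in this article rather than a recommendation.

```ruby
require 'uri'

uri_reg = URI.regexp(%w[http https])

# Hypothetical sample text; only the URL part is matched.
m = uri_reg.match('see http://xxx.com:8080/hoge?x=1#top for details')

whole    = m[0] # "http://xxx.com:8080/hoge?x=1#top"
scheme   = m[1] # "http"
host     = m[4] # "xxx.com"
port     = m[5] # "8080"
path     = m[7] # "/hoge"
query    = m[8] # "x=1"
fragment = m[9] # "top"
```

In the gsub block used in this article, `$&` is the same value as `m[0]` here, that is, the whole matched URL.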

About gsub

gsub also accepts a string instead of a regular expression as its pattern. The anti-pattern simply passed each extracted URL string to gsub in a loop, so when one URL was contained in another, the text that had already been converted to an a tag was replaced again and the output was corrupted.

In hindsight this is obvious, but I struggled because I did not think of this approach. The gsub call itself looks like this:

text.gsub!(uri_reg) { %{<a href="#{$&}">#{$&}</a>} }
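As a variation on the call above, the block can receive the match as a parameter instead of reading `$&`, and the URL can be HTML-escaped before it is placed in the markup. This matters because the URL pattern allows characters such as `&` and `'`, and the latter would otherwise terminate a single-quoted href. The following is only a sketch: the method name `convert_url_to_escaped_a_element` is hypothetical, and it assumes that escaping just the matched URL is enough for your use case (the surrounding text is left untouched).

```ruby
require 'uri'
require 'erb'

# Hypothetical variant of convert_url_to_a_element: same single-pass gsub,
# but the matched URL is HTML-escaped before being interpolated.
def convert_url_to_escaped_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) do |url|
    escaped = ERB::Util.html_escape(url) # escapes & < > " and '
    %(<a href='#{escaped}' target='_blank'>#{escaped}</a>)
  end
end

escaped_result = convert_url_to_escaped_a_element('see http://hogehoge.com/?a=1&b=2')
# => "see <a href='http://hogehoge.com/?a=1&amp;b=2' target='_blank'>http://hogehoge.com/?a=1&amp;b=2</a>"
```

The non-bang gsub is used here so that the caller's string is not modified in place.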

About URI.extract

`URI.extract`, which I tried first, returns only the URL strings found in a text for the specified schemes. I did not use it in the final code, but it looks useful when you simply want the list of URLs.

text = 'aaaaa http://xxx.com/hoge bbbbb http://xxx.com'
URI.extract(text, %w[http https])
=> ["http://xxx.com/hoge", "http://xxx.com"]

Summary

If you want to convert URLs to a tags, it is better to match them with a regular expression and replace them with gsub in a single pass.
The regular expression itself can be obtained easily with `URI.regexp`.

There were twists and turns, but I think the code turned out well. If you know a better way to write it, please let me know.