While processing mail in Rails, I wanted to convert the URLs contained in a string into a tags.
After a little research I found that `URI.extract` easily pulls the URLs out of a string, so I expected this to be simple to write. In fact there were quite a few traps and I got stuck, so I am writing this up as a review.
TL;DR: the following code solved it. How I got here, and why this version works, is explained below.
def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) { %{<a href='#{$&}' target='_blank'>#{$&}</a>} }
end
text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com/fuga'
convert_url_to_a_element(text)
=> "url1: <a href='http://hogehoge.com/hoge' target='_blank'>http://hogehoge.com/hoge</a> url2: <a href='http://hogehoge.com/fuga' target='_blank'>http://hogehoge.com/fuga</a>"
First, the approach that does not work. Even this version handles text like the following without any problem, which is why I did not notice the trap right away.
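def convert_url_to_a_element(text)
  URI.extract(text, %w[http https]).uniq.each do |url|
    sub_text = "<a href='#{url}' target='_blank'>#{url}</a>"
    # Reassign the result: plain gsub returns a new string and leaves text unchanged
    text = text.gsub(url, sub_text)
  end
  text
end
text = 'url1: http://hogehoge.com url2: http://fugafuga.com'
convert_url_to_a_element(text)
=> "url1: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a> url2: <a href='http://fugafuga.com' target='_blank'>http://fugafuga.com</a>"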
By using `URI.extract`, you can get every URL-formatted string in the text, as shown below.
text = 'url1: http://hogehoge.com url2: http://fugafuga.com'
URI.extract(text, %w[http https])
=> ["http://hogehoge.com", "http://fugafuga.com"]
Each extracted URL is then replaced inside the `each` loop. However, run it on text that contains two URLs with the same domain name, as below:
text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
convert_url_to_a_element(text)
=> "url1: <a href='<a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>/hoge' target='_blank'><a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>/hoge</a> url2: <a href='http://hogehoge.com' target='_blank'>http://hogehoge.com</a>"
The output is badly broken.
The cause is that the second replacement also hit the text produced by the first a-tag conversion. In other words, this approach has a pitfall: **it breaks when the text contains two or more URLs with the same host name**, or more precisely, when one URL is a substring of another.
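To make the cause concrete, here is a minimal sketch that performs the two string replacements by hand and prints the intermediate result. The `anchor` lambda and the `long`/`short` variable names are hypothetical helpers introduced only for this illustration.

```ruby
# Reproduce the two string replacements by hand to see where the nesting starts.
text  = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
long  = 'http://hogehoge.com/hoge'
short = 'http://hogehoge.com'

# Hypothetical helper (not in the original code) that builds the a tag.
anchor = ->(url) { "<a href='#{url}' target='_blank'>#{url}</a>" }

# Pass 1: only the longer URL is wrapped.
step1 = text.gsub(long, anchor.call(long))

# The shorter URL now appears three times:
# in the new href, in the new link text, and as the bare url2.
occurrences = step1.scan(short).size

# Pass 2: all three occurrences are wrapped, including the two inside the first tag.
step2 = step1.gsub(short, anchor.call(short))

puts step1
puts occurrences
puts step2
```

After the first pass the string is still fine. The damage happens in the second pass, because a string pattern has no way of knowing that two of its matches sit inside a tag that was just generated.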
You can prevent the double substitution by obtaining a regular expression for URLs and passing it to `gsub` as the pattern, instead of looping with `each` over the strings returned by `URI.extract`.
def convert_url_to_a_element(text)
  uri_reg = URI.regexp(%w[http https])
  text.gsub(uri_reg) { %{<a href='#{$&}' target='_blank'>#{$&}</a>} }
end
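As a check, the sketch below runs the single-pass approach on the input that broke the first version. It differs from the code above in two ways, both of which are my assumptions rather than the original method. First, it builds the pattern with `URI::RFC2396_Parser#make_regexp`, because the Ruby documentation marks `URI.regexp` as obsolete. Second, it uses the block argument instead of `$&`. The constant name `HTTP_URL_PATTERN` is hypothetical.

```ruby
require 'uri'

# Assumption: build the pattern with URI::RFC2396_Parser#make_regexp
# instead of URI.regexp, which Ruby documents as obsolete.
HTTP_URL_PATTERN = URI::RFC2396_Parser.new.make_regexp(%w[http https])

def convert_url_to_a_element(text)
  # gsub scans the original text once; the tags it inserts are never rescanned.
  # The block argument holds the matched URL (the same value as $&).
  text.gsub(HTTP_URL_PATTERN) { |url| "<a href='#{url}' target='_blank'>#{url}</a>" }
end

text = 'url1: http://hogehoge.com/hoge url2: http://hogehoge.com'
result = convert_url_to_a_element(text)
puts result
```

Each URL is wrapped exactly once, because every match is taken from the original string rather than from the partially rewritten one.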
`URI.regexp` is a method that returns a regular expression matching URL strings of the specified schemes. You could write such a regular expression by hand, but this method builds it for you.
As the return value shows, this is not something I would want to write from scratch.
URI.regexp(%w[http https])
=> /(?=(?-mix:http|https):)
([a-zA-Z][\-+.a-zA-Z\d]*): (?# 1: scheme)
(?:
((?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*) (?# 2: opaque)
|
(?:(?:
\/\/(?:
(?:(?:((?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 3: userinfo)
(?:((?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))? (?# 4: host, 5: port)
|
((?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+) (?# 6: registry)
)
|
(?!\/\/)) (?# XXX: '\/\/' is the mark for hostport)
(\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 7: path)
)(?:\?((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: query)
)
(?:\#((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 9: fragment)
/x
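The `(?# ...)` comments in the pattern number its capture groups, so a match can also be taken apart by group. A small sketch, assuming the same pattern obtained through `URI::RFC2396_Parser#make_regexp`; the sample URL is made up for illustration.

```ruby
require 'uri'

pattern = URI::RFC2396_Parser.new.make_regexp(%w[http https])
m = pattern.match('see http://example.com:8080/a/b?x=1#top for details')

# Group numbers follow the (?# ...) comments in the pattern.
puts m[0]  # the whole matched URL
puts m[1]  # 1: scheme
puts m[4]  # 4: host
puts m[5]  # 5: port
puts m[7]  # 7: path
puts m[8]  # 8: query
puts m[9]  # 9: fragment
```

For simply wrapping URLs in a tags only the whole match is needed, which is why the method above uses `$&`.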
`gsub` also accepts a plain string instead of a regular expression as its pattern. The first version simply passed each extracted URL string to it in an `each` loop, so when two URLs shared a domain, the text already converted to an a tag was replaced again and the string was corrupted.
In hindsight that is obvious, but it took me a while to come up with the fix. With a regular expression and a block, `gsub!` replaces every match in a single pass:
text.gsub!(uri_reg) { %{<a href="#{$&}">#{$&}</a>} }
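For reference, here is a short sketch of the `gsub` forms involved: a string pattern, a regular-expression pattern, the block form, and the in-place `gsub!`. The sample strings and variable names are made up for illustration.

```ruby
text = 'a.b a-b'

# A String pattern is matched literally: '.' is just a dot.
literal = text.gsub('a.b', 'X')

# A Regexp pattern is interpreted: '.' matches any character.
by_regexp = text.gsub(/a.b/, 'X')

# With a block, the block's return value replaces each match.
# The block argument and $& both hold the matched text.
wrapped    = text.gsub(/a.b/) { |m| "[#{m}]" }
via_global = text.gsub(/a.b/) { "[#{$&}]" }

# gsub returns a new string; gsub! changes the receiver in place
# and returns nil when nothing was replaced.
buffer = text.dup
buffer.gsub!(/a.b/, 'X')
nothing = text.dup.gsub!('zzz', 'X')

puts literal, by_regexp, wrapped, via_global, buffer
p nothing
```

Since `gsub!` can return `nil`, it is safer to rely on the mutated receiver than on the return value.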
As for `URI.extract`, which I tried first: by specifying the schemes, it returns only the URL strings found in the text. I did not end up using it here, but it looks handy when you simply want to collect the URLs.
text = 'aaaaa http://xxx.com/hoge bbbbb http://xxx.com'
URI.extract(text, %w[http https])
=> ["http://xxx.com/hoge", "http://xxx.com"]
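The scheme list works as a filter. A small sketch with a made-up input that mixes several schemes (the URLs are illustrative only). Note that the Ruby documentation marks `URI.extract` as obsolete, so treat this as a demonstration rather than a recommendation.

```ruby
require 'uri'

text = 'site http://example.com/docs files ftp://example.com/pub secure https://example.com/login'

# Only URLs whose scheme is in the list are returned.
web_urls = URI.extract(text, %w[http https])
ftp_urls = URI.extract(text, %w[ftp])

p web_urls
p ftp_urls
```

The URLs in this input are separated by spaces on purpose. Characters such as `,` and `.` are valid inside a URL, so punctuation that directly follows a URL can end up as part of the match.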
There were some twists and turns, but I think the code turned out well. If you know a better way to write this, please let me know.