unicode - Why is it that "using anything but a UTF-8 decoder ... might be insecure" in a URL percent-decoding algorithm?
I am implementing a URL parser and have a question about the W3C URL spec (at http://www.w3.org/TR/2014/WD-url-1-20141209/). The section "2. Percent-encoded bytes" has the following algorithm (emphasis added):
To percent decode a byte sequence input, run these steps:

*Using anything but a UTF-8 decoder when input contains bytes outside the range 0x00 to 0x7F might be insecure and is not recommended.*

1. Let output be an empty byte sequence.
2. For each byte byte in input, run these steps:
   1. If byte is not '%', append byte to output.
   2. Otherwise, if byte is '%' and the next two bytes after byte in input are not in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66, append byte to output.
   3. Otherwise, run these substeps:
      1. Let bytePoint be the two bytes after byte in input, **decoded**, and then interpreted as a hexadecimal number.
      2. Append a byte whose value is bytePoint to output.
      3. Skip the next two bytes in input.
3. Return output.
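For reference, here is a minimal Python translation of the steps above, operating purely on bytes (the name percent_decode is my own, not from the spec):

```python
def percent_decode(input_bytes: bytes) -> bytes:
    # Byte values allowed after '%': 0x30-0x39, 0x41-0x46, 0x61-0x66.
    HEX_DIGITS = b"0123456789ABCDEFabcdef"
    output = bytearray()
    i = 0
    while i < len(input_bytes):
        byte = input_bytes[i]
        next_two = input_bytes[i + 1:i + 3]
        if byte != 0x25:  # substep 1: byte is not '%', copy as-is
            output.append(byte)
        elif len(next_two) < 2 or not all(b in HEX_DIGITS for b in next_two):
            # substep 2: '%' not followed by two ASCII hex digits, copy as-is
            output.append(byte)
        else:
            # substep 3: the two bytes are already known to be ASCII hex
            # digits, so interpreting them needs no general character decoder
            output.append(int(next_two, 16))
            i += 2  # skip the two hex-digit bytes
        i += 1
    return bytes(output)

print(percent_decode(b"a%2Fb"))   # b'a/b'
print(percent_decode(b"100%25"))  # b'100%'
print(percent_decode(b"%GG"))     # b'%GG' (not hex digits, copied as-is)
```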
In the original spec, the word "decoded" (in bold above) is a link to the UTF-8 decode algorithm. I assume that is the "UTF-8 decoder" referred to in the second sentence (italicized) above.
I understand that invalid sequences of UTF-8 bytes can cause security problems. However, in the step that uses the decoder (sub-substep 1), the bytes have already been verified to be valid ASCII hex digits by the preceding substep 2, so using a UTF-8 decoder here seems like security overkill.
Can anyone explain how using a decoder other than a UTF-8 decoder in this algorithm could possibly be insecure, when the decoder is only ever applied to byte values in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66? Or am I interpreting the spec incorrectly?
It seems to me that bytes outside the range 0x00 to 0x7F are copied to output as-is (either by substep 1 because they are not '%', or by substep 2 because they are not ASCII hex digits), and never end up in the decoder in this algorithm.
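As an illustration, Python's byte-level percent decoder (urllib.parse.unquote_to_bytes from the standard library) behaves exactly this way: non-ASCII bytes and malformed escapes pass through untouched, and percent escapes decode to raw bytes with no UTF-8 validation at all.

```python
from urllib.parse import unquote_to_bytes

# Non-ASCII input bytes are copied through untouched...
print(unquote_to_bytes(b"caf\xc3\xa9"))      # b'caf\xc3\xa9'
# ...and percent escapes yield raw bytes: %FF produces the lone
# byte 0xFF, which is not valid UTF-8 on its own.
print(unquote_to_bytes(b"caf%C3%A9%FF"))     # b'caf\xc3\xa9\xff'
```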