unicode - Why is it that "using anything but a utf-8 decoder...might be insecure" in a URL percent decoding algorithm?


I am implementing a URL parser and have a question about the W3C URL spec (at http://www.w3.org/tr/2014/wd-url-1-20141209/ ). Section "2. Percent-encoded bytes" has the following algorithm (emphasis added):

To percent decode a byte sequence input, run these steps:

Using anything but a UTF-8 decoder when the input contains bytes outside the range 0x00 to 0x7F might be insecure and is not recommended.

  1. Let output be an empty byte sequence.

  2. For each byte byte in input, run these steps:

    1. If byte is not '%', append byte to output.

    2. Otherwise, if byte is '%' and the next two bytes after byte in input are not in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66, append byte to output.

    3. Otherwise, run these substeps:

      1. Let bytePoint be the two bytes after byte in input, decoded, and then interpreted as a hexadecimal number.

      2. Append a byte whose value is bytePoint to output.

      3. Skip the next two bytes in input.

  3. Return output.
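For reference, the steps above can be sketched in Python roughly as follows. This is only an illustration of how I read the algorithm, not the spec's reference implementation: the name `percent_decode`, the `HEX_DIGITS` constant, and the decision to treat a '%' with fewer than two bytes after it as a literal '%' are my own assumptions.

```python
# Byte values 0x30-0x39, 0x41-0x46 and 0x61-0x66 (ASCII hex digits).
HEX_DIGITS = frozenset(b"0123456789ABCDEFabcdef")


def percent_decode(data: bytes) -> bytes:
    """Percent decode a byte sequence, following the quoted steps."""
    output = bytearray()                  # step 1
    i = 0
    while i < len(data):                  # step 2
        byte = data[i]
        pair = data[i + 1:i + 3]          # the next two bytes, if present
        if byte != 0x25:                  # 2.1: byte is not '%'
            output.append(byte)
        elif len(pair) < 2 or not all(b in HEX_DIGITS for b in pair):
            # 2.2: '%' not followed by two hex digits is copied as-is.
            # Assumption: a truncated escape at the end of input is
            # handled the same way.
            output.append(byte)
        else:
            # 2.3.1: decode the two bytes, then read them as hexadecimal.
            byte_point = int(pair.decode("utf-8"), 16)
            output.append(byte_point)     # 2.3.2
            i += 2                        # 2.3.3: skip the two hex digits
        i += 1
    return bytes(output)                  # step 3
```

In this sketch, `percent_decode(b"a%20b")` gives `b"a b"`, while `percent_decode(b"\xff%zz")` returns its input unchanged, because neither the 0xFF byte nor the malformed escape ever reaches the `decode` call.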

In the original spec, the word "decoded" (in bold above) is a link to the UTF-8 decoding algorithm. I assume this is the "UTF-8 decoder" referred to in the second sentence (italicized) above.

I understand that invalid sequences of UTF-8 bytes can cause security problems. However, in the step that uses the decoder, the bytes have already been verified to be valid ASCII hex digits by the preceding sub-substep 2, so it seems that using a UTF-8 decoder here is security overkill.

Can anyone explain how using anything other than a UTF-8 decoder in this algorithm could possibly be insecure, when the decoder is only used on byte values in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66? Or am I interpreting something incorrectly in the spec?

It seems to me that bytes outside the range 0x00 to 0x7F are copied to the output as-is (either in substep 1 because they are not %, or in sub-substep 2 because they are not ASCII hex digits), so they never end up in the decoder in this algorithm.
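To make the reasoning about the hex-digit ranges concrete, here is a small check that several ASCII-compatible decoders produce identical text for the only byte values that can reach the decoder. The codec list is an arbitrary selection from Python's standard library, and `HEX_BYTES` and `decoders_agree` are names I made up for this illustration. It only shows that these decoders agree on those 22 bytes. It does not settle what the spec's note is actually warning about.

```python
# The only byte values that pass the hex-digit check:
# 0x30-0x39, 0x41-0x46 and 0x61-0x66.
HEX_BYTES = (
    bytes(range(0x30, 0x3A))
    + bytes(range(0x41, 0x47))
    + bytes(range(0x61, 0x67))
)

# Assumption: a handful of ASCII-compatible codecs, chosen for illustration.
CODECS = ("utf-8", "ascii", "latin-1", "cp1252")


def decoders_agree(data: bytes) -> bool:
    """Return True if every codec in CODECS decodes data to the same text."""
    results = {data.decode(codec) for codec in CODECS}
    return len(results) == 1


print(decoders_agree(HEX_BYTES))  # True
```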

