java - How do I convert a Windows-1251 text to something readable? -

- March 15, 2014

I have a string, returned by the Jericho HTML parser, that contains Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.

How can I convert this string to something readable?

I tried this:

    import java.io.UnsupportedEncodingException;

    public class Program {
        public void run() throws UnsupportedEncodingException {
            final String windows1251String = getWindows1251String();
            System.out.println("String (Windows-1251): " + windows1251String);
            final String readableString = convertString(windows1251String);
            System.out.println("String (converted): " + readableString);
        }

        private String convertString(String windows1251String) throws UnsupportedEncodingException {
            return new String(windows1251String.getBytes(), "UTF-8");
        }

        private String getWindows1251String() {
            final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
            return new String(bytes);
        }

        public static void main(final String[] args) throws UnsupportedEncodingException {
            final Program program = new Program();
            program.run();
        }
    }

The variable bytes contains the data shown in the debugger; it is the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I copied and pasted that array here.

This doesn't work - readableString contains garbage.

How can I fix it, i.e. make sure that the Windows-1251 string is decoded properly?

Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.

Update 2: Another attempt:

Update 3 (30.07.2015 14:38): The texts I need to decode correspond to the texts in the drop-down list shown below.

Update 4 (30.07.2015 14:41): The encoding detector (see the code below) says that the encoding is not Windows-1251, but UTF-8.

    public static String guessEncoding(byte[] bytes) {
        String DEFAULT_ENCODING = "UTF-8";
        org.mozilla.universalchardet.UniversalDetector detector =
            new org.mozilla.universalchardet.UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        System.out.println("Detected encoding: " + encoding);
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }

(In light of the updates I deleted my original answer and started again.)

The text which appears to be

пїЅпїЅпїЅпїЅпїЅпїЅ

is an accurate Windows-1251 decoding of these byte values

-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67

(padded at either end with 32, which is a space).

So either

1) the text is garbage, or

2) the text is supposed to look like that, or

3) the encoding is not in fact Windows-1251.
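The decoding claim above can be checked directly. The following is only a minimal sketch; the class and method names are made up for illustration. It decodes the repeating three-byte unit -17, -65, -67 (0xEF 0xBF 0xBD) under both charsets and prints the resulting code points. Note that these three bytes are also the UTF-8 encoding of U+FFFD, the Unicode replacement character. That would fit the detector reporting UTF-8, and it may mean the original characters were already lost before this byte array was produced.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // The repeating three-byte unit from the question: 0xEF 0xBF 0xBD.
    static final byte[] UNIT = {-17, -65, -67};

    // Decode raw bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render a string as a list of code points, e.g. "U+043F U+0457".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Three Cyrillic letters when read as Windows-1251.
        System.out.println("as windows-1251: " + codePoints(decode(UNIT, cp1251)));
        // A single replacement character when read as UTF-8.
        System.out.println("as UTF-8:        " + codePoints(decode(UNIT, StandardCharsets.UTF_8)));
    }
}
```

If the second line shows a single U+FFFD, the bytes are most plausibly a UTF-8 encoded string that already contained replacement characters, in which case no choice of charset at this point can recover the Russian text.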

This line is notably wrong:

return new String(windows1251String.getBytes(), "UTF-8");

Extracting the bytes out of a string and constructing a new string from them is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 internally (and you don't normally even need to know or care about that). The only times other encodings come into play are when text data is stored outside of a String object - i.e. in your initial byte array. Conversion occurs when the String is constructed, and then it is done. There is no conversion from one String type to another - they are all the same.
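To illustrate the point, here is a hedged sketch of where decoding and encoding actually happen. The class name, method names and the sample bytes (the word "Привет" written in Windows-1251) are assumptions chosen for the example, not data from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeAtBoundary {
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decoding: raw bytes plus the charset they were written in -> String.
    static String decode(byte[] raw) {
        return new String(raw, CP1251);
    }

    // Encoding: String -> bytes in whatever charset the output needs.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Encoding and decoding with the same charset changes nothing
    // (for text that the charset can represent).
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Privet" in Cyrillic, encoded as Windows-1251.
        byte[] raw = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        String text = decode(raw);
        System.out.println("characters:  " + text.length());
        System.out.println("UTF-8 bytes: " + toUtf8(text).length);
        System.out.println("round trip unchanged: " + roundTrip(text, CP1251).equals(text));
    }
}
```

The charset is named exactly once on the way in and once on the way out. Applied to the question, that would mean handing the original bytes of the HTML file to the decoder rather than calling getBytes() on a String that has already been decoded.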

The fact that this

return new String(bytes);

does the same as this

return new String(bytes, "Windows-1251");

suggests that Windows-1251 is your platform's default encoding (which is further supported by your timezone being MSK).
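The default-encoding guess can be checked rather than inferred. A small sketch, with a made-up class and method name: it prints the JVM's default charset and compares the one-argument String constructor against an explicit charset for the same bytes. Identical output for one particular byte array is only a hint, since two charsets can agree on some inputs and differ on others.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // True if decoding with the platform default gives the same String
    // as decoding with the given candidate charset, for these bytes only.
    static boolean decodesLikeDefault(byte[] bytes, Charset candidate) {
        String viaDefault = new String(bytes);               // uses Charset.defaultCharset()
        String viaCandidate = new String(bytes, candidate);  // charset named explicitly
        return viaDefault.equals(viaCandidate);
    }

    public static void main(String[] args) {
        byte[] bytes = {-17, -65, -67};
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("matches windows-1251 for these bytes: "
                + decodesLikeDefault(bytes, Charset.forName("windows-1251")));
    }
}
```

Because the result depends on the machine and JVM settings, code that must behave the same everywhere should pass the charset explicitly instead of relying on the default.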
