java - How do I convert a Windows-1251 text to something readable?
I have a string, returned by the Jericho HTML parser, which contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.
How can I convert this string to something readable?
I tried this:
    import java.io.UnsupportedEncodingException;

    public class Program {
        public void run() throws UnsupportedEncodingException {
            final String windows1251String = getWindows1251String();
            System.out.println("String (Windows-1251): " + windows1251String);
            final String readableString = convertString(windows1251String);
            System.out.println("String (converted): " + readableString);
        }

        private String convertString(String windows1251String) throws UnsupportedEncodingException {
            return new String(windows1251String.getBytes(), "UTF-8");
        }

        private String getWindows1251String() {
            final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
            return new String(bytes);
        }

        public static void main(final String[] args) throws UnsupportedEncodingException {
            final Program program = new Program();
            program.run();
        }
    }
The variable bytes contains the data shown in my debugger; it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copied and pasted that array here.
This doesn't work - readableString contains garbage.
How can I fix it, i.e. make sure that the Windows-1251 string is decoded properly?
Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.
Update 2: Another attempt:
Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.
Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251, but UTF-8.
    public static String guessEncoding(byte[] bytes) {
        String DEFAULT_ENCODING = "UTF-8";
        org.mozilla.universalchardet.UniversalDetector detector =
            new org.mozilla.universalchardet.UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        System.out.println("Detected encoding: " + encoding);
        detector.reset();
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }
(In light of the updates I deleted my original answer and started again.)
The text which appears,
пїЅпїЅпїЅпїЅпїЅпїЅ
is an accurate Windows-1251 decoding of these byte values:
-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67
(padded at either end with 32, which is a space).
So either
1) the text is garbage, or
2) the text is supposed to look like that, or
3) the encoding is not Windows-1251.
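To check the claim about those byte values, here is a minimal sketch that decodes the repeated triple -17, -65, -67 (0xEF 0xBF 0xBD) under both candidate encodings and prints the resulting code points. The class name ReplacementCharDemo and its helper methods are made up for illustration, and the sketch assumes the JRE ships the windows-1251 charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the question's code.
final class ReplacementCharDemo {
    // The three-byte group repeated six times in the question's array.
    static final byte[] TRIPLE = {-17, -65, -67}; // 0xEF 0xBF 0xBD

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Renders each UTF-16 code unit as U+XXXX so the result does not
    // depend on the console's own encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Three Cyrillic letters, i.e. the "пїЅ" pattern seen in the output.
        System.out.println(codePoints(decode(TRIPLE, cp1251)));                  // U+043F U+0457 U+0405
        // A single character: U+FFFD REPLACEMENT CHARACTER.
        System.out.println(codePoints(decode(TRIPLE, StandardCharsets.UTF_8))); // U+FFFD
    }
}
```

Read as UTF-8, each triple is the replacement character U+FFFD. That fits the detector reporting UTF-8, and it may mean the original Russian characters were already lost before these bytes were produced, in which case no re-decoding of this array can recover them. This is an inference from the bytes alone, not something the question confirms.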
This line is notably wrong:
    return new String(windows1251String.getBytes(), "UTF-8");
Extracting the bytes out of a String and constructing a new String from them is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 internally (and you don't normally need to know or care about that). The only time other encodings come into play is when text data is stored outside of a String object, i.e. in your initial byte array. Conversion occurs when the String is constructed from those bytes, and then it's done. There is no conversion from one String type to another - they are all the same.
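As a sketch of where the encoding does belong, the example below names the charset once, at the point where raw bytes become a String, and again only if the text has to leave the String as bytes. The class TranscodeDemo, its method names, and the sample word are assumptions chosen for illustration; the sample bytes are the Windows-1251 encoding of "Привет", not data from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the question's code.
final class TranscodeDemo {
    // Assumes the JRE provides this charset (standard JDK builds do).
    static final Charset CP1251 = Charset.forName("windows-1251");

    // "Привет" as Windows-1251 bytes (illustrative sample).
    static final byte[] SAMPLE = {
        (byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2
    };

    // Decoding: the only place the source encoding matters.
    static String decodeCp1251(byte[] raw) {
        return new String(raw, CP1251);
    }

    // Transcoding bytes to bytes: decode with the source charset,
    // then encode with the target charset.
    static byte[] cp1251ToUtf8(byte[] raw) {
        return decodeCp1251(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = decodeCp1251(SAMPLE);
        System.out.println(text.equals("\u041f\u0440\u0438\u0432\u0435\u0442")); // true
        // Each of the six Cyrillic letters takes two bytes in UTF-8.
        System.out.println(cp1251ToUtf8(SAMPLE).length); // 12
    }
}
```

Once decodeCp1251 has returned, the String carries no trace of Windows-1251. Calling getBytes() on it and re-wrapping the result, as convertString does, can only damage the text.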
The fact that this
    return new String(bytes);
does the same as this
    return new String(bytes, "Windows-1251");
suggests that Windows-1251 is the platform's default encoding. (This is further supported by your timezone being MSK.)
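One way to confirm that guess is to ask the JVM directly. The sketch below prints the default charset and checks that the no-charset String constructor behaves the same as passing Charset.defaultCharset() explicitly. The class name DefaultCharsetDemo and the method sameAsDefault are made up for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of the question's code.
final class DefaultCharsetDemo {
    // new String(byte[]) decodes with the platform default charset, so
    // naming that charset explicitly should give an identical result.
    static boolean sameAsDefault(byte[] bytes) {
        String implicit = new String(bytes);
        String explicit = new String(bytes, Charset.defaultCharset());
        return implicit.equals(explicit);
    }

    public static void main(String[] args) {
        // On the asker's machine this would be expected to print windows-1251;
        // the value differs between systems and JVM settings.
        System.out.println("Default charset: " + Charset.defaultCharset());
        byte[] bytes = {32, -17, -65, -67, 32};
        System.out.println(sameAsDefault(bytes)); // true
    }
}
```

Because the default differs between machines, passing the charset explicitly wherever bytes are decoded or encoded keeps the behaviour the same everywhere.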