Fixing garbled text when reading HDFS file data in Java

A pitfall: garbled text when reading HDFS files through the Java API

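I wanted to write an endpoint that reads part of a file on HDFS so it can be previewed. After implementing it along the lines of blog posts I found online, I noticed that the output was sometimes garbled. For example, when reading a CSV file whose strings are separated by commas: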

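An English string such as aaa displays correctly.

A Chinese string such as “你好” displays correctly.

A mixed Chinese-English string such as “aaa你好” comes out garbled.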

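I went through a lot of blog posts, and the advice was nearly always the same: decode with charset xxx. Skeptical, I tried each suggestion in turn, and sure enough none of them worked.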

Approach

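HDFS stores a file as the raw bytes it was given and does not record or normalize the character encoding, and local files are quite likely to be encoded differently from one another. Uploading a local file simply ships its bytes, in whatever encoding they already have, to the file system. So when we read file data back, we face byte streams from different files in different encodings, and no single fixed charset can decode all of them correctly.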
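To make the problem concrete, here is a minimal sketch that encodes the same mixed string in two encodings and decodes each with both charsets. The class name FixedCharsetDemo is made up for illustration, and the sketch assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FixedCharsetDemo {

    /** Decodes bytes with the given charset; malformed input is replaced, not rejected. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String text = "aaa\u4f60\u597d";      // "aaa你好", escaped to be independent of source encoding

        byte[] gbkBytes = text.getBytes(gbk);
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Matching charset: the original text comes back.
        System.out.println(decode(gbkBytes, gbk));
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Mismatched charset: the ASCII prefix survives, the Chinese part is garbled.
        System.out.println(decode(gbkBytes, StandardCharsets.UTF_8));
        System.out.println(decode(utf8Bytes, gbk));
    }
}
```

Whichever fixed charset the reader picks, files written in the other encoding are decoded incorrectly, which is why the "just use charset xxx" advice only works for some files.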

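That leaves two possible solutions:

Fix the charset used on HDFS. For example, if I choose UTF-8, every file is converted on upload, i.e. its bytes are transcoded to UTF-8 before being stored. Decoding with UTF-8 when reading the data back is then unproblematic. The transcoding step, however, still has plenty of problems of its own and is not easy to implement.

Decode dynamically. Choose the decoding charset according to the encoding each file was written in. This leaves the file's original bytes untouched and will rarely produce garbled text.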
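For comparison, the first option (transcoding on upload) could look roughly like the sketch below. Utf8Transcoder and toUtf8 are hypothetical names, not part of any HDFS API, and the caller is assumed to already know the source charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Transcoder {

    /**
     * Re-encodes bytes from a known source charset into UTF-8.
     * If sourceCharset is wrong, the damage is written into the stored bytes.
     */
    static byte[] toUtf8(byte[] raw, Charset sourceCharset) {
        String decoded = new String(raw, sourceCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }
}
```

The conversion itself is one line, but it requires knowing the source charset of every uploaded file, so it runs into the same detection problem, and a wrong guess permanently alters the stored data. That is one reason the dynamic approach, which never touches the stored bytes, is the safer of the two.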

Having chosen dynamic decoding, the hard part is deciding which charset to decode with. The material below led me to a solution.
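Before turning to a detection library, it may help to see the simplest possible form of "deciding which charset to use": strict trial decoding with the JDK alone. This is not the method the article ends up using; it is a swapped-in heuristic, and StrictDecoder and its methods are names invented for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecoder {

    /** Returns the decoded text, or null if the bytes are not valid in the given charset. */
    static String tryDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Returns the first candidate that decodes the bytes without error, or null if none does. */
    static Charset firstValid(byte[] bytes, Charset... candidates) {
        for (Charset candidate : candidates) {
            if (tryDecode(bytes, candidate) != null) {
                return candidate;
            }
        }
        return null;
    }
}
```

A call such as StrictDecoder.firstValid(bytes, StandardCharsets.UTF_8, Charset.forName("GBK")) tries UTF-8 first because it is the stricter format: bytes that are valid UTF-8 are often also accepted by a GBK decoder, so the order of candidates matters. The heuristic only checks validity, not plausibility, which is why statistical detectors like the ones below tend to do better on real files.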

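Detecting the encoding of text (a byte stream) in Java

Requirement:

Detect the encoding of a given file or byte stream.

Implementation:

Based on jchardet

<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>

The code is as follows: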

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class DetectorUtils {

    private DetectorUtils() {
    }

    // Receives the detector's callback once it is confident about a charset
    static class ChineseCharsetDetectionObserver implements nsICharsetDetectionObserver {

        private boolean found;
        private String result;

        public ChineseCharsetDetectionObserver(boolean found, String result) {
            this.found = found;
            this.result = result;
        }

        @Override
        public void Notify(String charset) {
            found = true;
            result = charset;
        }

        public boolean isFound() {
            return found;
        }

        public String getResult() {
            return result;
        }
    }

    // Returns the detected charset, or the list of probable charsets if detection is inconclusive.
    // The input stream is closed before returning.
    public static String[] detectChineseCharset(InputStream in) throws IOException {
        nsDetector det = new nsDetector(nsPSMDetector.CHINESE);
        ChineseCharsetDetectionObserver detectionObserver =
                new ChineseCharsetDetectionObserver(false, StandardCharsets.UTF_8.name());
        det.Init(detectionObserver);

        try (BufferedInputStream imp = new BufferedInputStream(in)) {
            byte[] buf = new byte[1024];
            int len;
            boolean isAscii = true;
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Keep checking for pure ASCII until a non-ASCII byte shows up
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // Feed non-ASCII data to the detector; stop as soon as it is sure
                if (!isAscii && det.DoIt(buf, len, false)) {
                    break;
                }
            }
            det.DataEnd();

            if (isAscii) {
                return new String[] { "ASCII" };
            }
            if (detectionObserver.isFound()) {
                return new String[] { detectionObserver.getResult() };
            }
            return det.getProbableCharsets();
        }
    }
}

Test:

String file = "C:/3737001.xml";
String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
for (String charset : probableSet) {
    System.out.println(charset);
}

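So libraries for detecting the encoding of a byte stream already exist, and the plan becomes clear: read some of the file's bytes, detect the encoding with such a tool, and then decode accordingly.

The actual fix

pom

<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>

Note that the UniversalDetector class used in the code below is not part of jchardet. It ships in juniversalchardet, which was hosted on Google Code (hence "Google's tool"), so that dependency is needed as well. The version shown is an assumption; use whichever release your project already has.

<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>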

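Logic for reading part of a file from HDFS for preview

// fs (an org.apache.hadoop.fs.FileSystem), logger and getHdfsPath(...) belong to the
// enclosing class and are not shown in the original post.
// StringUtils is assumed to be the commons-lang3 one.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.mozilla.universalchardet.UniversalDetector;

// Read part of a file's data for preview
public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
    FSDataInputStream fileStream = openFile(filePath);
    return readFileWithLimit(fileStream, limit);
}

// Open the file and return its data stream (null if opening fails)
private FSDataInputStream openFile(String filePath) {
    FSDataInputStream fileStream = null;
    try {
        fileStream = fs.open(new Path(getHdfsPath(filePath)));
    } catch (IOException e) {
        logger.error("fail to open file:{}", filePath, e);
    }
    return fileStream;
}

// Read at most limit non-empty lines of file data
private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
    byte[] bytes = readByteStream(fileStream);
    String data = decodeByteStream(bytes);
    if (data == null) {
        return null;
    }
    // Accept both Windows (\r\n) and Unix (\n) line endings
    List<String> rows = Arrays.asList(data.split("\\r?\\n"));
    return rows.stream().filter(StringUtils::isNotEmpty)
            .limit(limit)
            .collect(Collectors.toList());
}

// Read the raw bytes from the file stream, closing the stream afterwards
private byte[] readByteStream(FSDataInputStream fileStream) {
    if (fileStream == null) {
        return null; // openFile failed and has already logged the error
    }
    byte[] buffer = new byte[1024 * 30];
    int len;
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    try (FSDataInputStream in = fileStream) {
        while ((len = in.read(buffer)) != -1) {
            stream.write(buffer, 0, len);
        }
    } catch (IOException e) {
        logger.error("read file bytes stream failed.", e);
        return null;
    }
    return stream.toByteArray();
}

// Decode the bytes using the detected encoding
private String decodeByteStream(byte[] bytes) {
    if (bytes == null) {
        return null;
    }
    String encoding = guessEncoding(bytes);
    String data = null;
    try {
        data = new String(bytes, encoding);
    } catch (UnsupportedEncodingException e) {
        logger.error("decode byte stream failed.", e);
    }
    return data;
}

// Guess the encoding with juniversalchardet's UniversalDetector, falling back to UTF-8
private String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (StringUtils.isEmpty(encoding)) {
        encoding = "UTF-8";
    }
    return encoding;
}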
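The plan was to read only "some" of the file's bytes, yet the code above reads the whole file before keeping a few lines, which can be costly for large files. One possible refinement is to cap the number of bytes read. The sketch below uses only the JDK and works on any InputStream (FSDataInputStream is one); PrefixReader and readPrefix are hypothetical names, not part of the Hadoop API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class PrefixReader {

    /** Reads at most maxBytes from the stream, or fewer if the stream ends first. */
    static byte[] readPrefix(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        try {
            while (remaining > 0) {
                // Never request more than the remaining budget
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n == -1) {
                    break; // end of stream
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

One caveat with this design: the cut-off can fall in the middle of a multi-byte character or a line, so the last line of a truncated preview may be incomplete and is best dropped before display.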
