Stripping the UTF-8 BOM Quickly

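Sooner or later you run into the UTF-8 BOM at work (simply "BOM" from here on). When a third-party tool cannot handle it, you have to strip it yourself. For example, SQL files exported from Alibaba Cloud start with a BOM, but Navicat does not support it, so the BOM has to go.

The test file used below is a 265 MB SQL file exported from Alibaba Cloud. It was already cached during the tests (the File system inputs figure reported by time was close to 0).

#### Stripping the BOM with sed

    sed -e '1s/^\xef\xbb\xbf//' file

Timing the sed approach with time:

    $ /usr/bin/time -v sed -e '1s/^\xef\xbb\xbf//' sqlResult_1601835.sql > /dev/null
    ...
    User time (seconds): 0.33
    System time (seconds): 0.11
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
    ...

User time is relatively high because sed processes every line, even though only the first line can contain the BOM, so CPU time is wasted.

sed also supports in-place updates (-i):

    $ /usr/bin/time -v sed -e '1s/^\xef\xbb\xbf//' sqlResult_1601835.sql -i
    ...
    User time (seconds): 1.31
    System time (seconds): 3.89
    Percent of CPU this job got: 71%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.32
    ...

This is slower because the result has to be written to a file. strace shows that sed implements the update by writing to a temporary file and then renaming it over the original:

    open("sqlResult_1601835.sql", O_RDONLY) = 3
    open("./sedGlXm60", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
    ...
    rename("./sedGlXm60", "sqlResult_1601835.sql")

#### Stripping the BOM with tail

    tail --bytes=+4 file

tail can skip the BOM and then copy the rest of the file verbatim, which avoids the unnecessary CPU work:

    $ /usr/bin/time -v tail --bytes=+4 sqlResult_1601835.sql > /dev/null
    ...
    User time (seconds): 0.01
    System time (seconds): 0.12
    Percent of CPU this job got: 96%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14
    ...

With tail, however, you have to redirect the output to a new file and replace the old file yourself.

#### strip-bom

To combine the strengths of sed and tail, I wrote strip-bom, which supports updating a file in place.

First, a test with redirected output:

    $ /usr/bin/time -v php strip-bom.phar sqlResult_1601835.sql > /dev/null
    ...
    User time (seconds): 0.11
    System time (seconds): 0.22
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.35
    ...

That is only about 20% faster than sed: User time went down, but System time went up. The tool copies in a loop, and each iteration is one read call and one write call, so I added a parameter that sets the block size of each read. A larger block means fewer iterations and fewer system calls, which makes it about 60% faster than sed:

    $ /usr/bin/time -v php strip-bom.phar -b 16384 sqlResult_1601835.sql > /dev/null
    ...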
    User time (seconds): 0.06
    System time (seconds): 0.12
    Percent of CPU this job got: 96%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.19

Testing the in-place update, which is about 30% faster than sed:

    $ /usr/bin/time -v php strip-bom.phar -i -b 16384 sqlResult_1601835.sql
    User time (seconds): 0.23
    System time (seconds): 0.67
    Percent of CPU this job got: 17%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.11

#### copy_file_range

Linux 4.5 added a new system call:

    ssize_t copy_file_range(int fd_in, loff_t *off_in,
                            int fd_out, loff_t *off_out,
                            size_t len, unsigned int flags);

It copies data directly between two file descriptors, usually with a single system call. So we can follow sed's approach: copy to a temporary file, then replace the old file. The implementation is in: Gist

Test:

    $ /usr/bin/time -v ./copy_file_range sqlResult_1601835.sql
    ...
    User time (seconds): 0.00
    System time (seconds): 2.47
    Percent of CPU this job got: 37%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.52

Even with fewer system calls it is only slightly faster than sed, and copying to a temporary file is still slower than strip-bom's in-place update.

#### Stripping the BOM with dos2unix

I had always assumed dos2unix only converted CRLF line endings. After reading Feng_Yu's comment I checked the man page and found that dos2unix does much more, including an option to remove the BOM (-r):

    $ /usr/bin/time -v dos2unix -r sqlResult_1601835.sql
    dos2unix: converting file sqlResult_1601835.sql to Unix format...
    Command being timed: "dos2unix -r sqlResult_1601835.sql"
    User time (seconds): 10.01
    System time (seconds): 0.90
    Percent of CPU this job got: 60%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.20

dos2unix works much like sed: it writes to a temporary file and then replaces the original, and like sed it processes every line, so its performance is poor.
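For readers who only have the standard tools, the manual step that the tail approach requires (redirect to a new file, then replace the old one) could be wrapped roughly as follows. This is a minimal sketch, not the strip-bom tool from this post: the function name `strip_bom_inplace` is hypothetical, and it assumes that `head -c`, `od`, `mktemp` and `tail -c` are available, as they are on a typical Linux system. Unlike a bare `tail --bytes=+4`, it first checks that the file really starts with EF BB BF, so a file without a BOM is left untouched.

```shell
#!/bin/sh
# strip_bom_inplace FILE
# Hypothetical helper (not part of sed, tail, or strip-bom): removes a leading
# UTF-8 BOM by copying to a temporary file and renaming it over the original.
strip_bom_inplace() {
    file=$1

    # Read the first three bytes as hex; a UTF-8 BOM is EF BB BF.
    head3=$(head -c 3 -- "$file" | od -An -tx1 | tr -d ' \n')
    [ "$head3" = "efbbbf" ] || return 0   # no BOM: leave the file untouched

    # Create the temporary file next to the original so the final mv is a
    # rename on the same filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1

    # Copy everything from byte 4 onward, then replace the original.
    if tail -c +4 -- "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

It would be called as `strip_bom_inplace sqlResult_1601835.sql`. Like `sed -i`, this writes a complete copy of the file, so it should not be expected to match an in-place rewrite for speed. Note also that `mktemp` creates the temporary file readable and writable by the owner only, so the replaced file ends up with those permissions unless they are restored afterwards.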