Text Files, Convert Character Encoding

Converts the character encoding (charset) of text files. For example, converts UTF-8 text to Shift_JIS or UTF-16. If multiple files are attached, all of them are converted according to the same rules.

Configs for this Auto Step
SelectConfA1
A1: Select FILE DATA for Original Text Files *
StrConfA2
A2: Set Original Charset (e.g. "UTF-8")#{EL}
StrConfB1
B1: Set New Charset (e.g. "UTF-16")#{EL}
SelectConfB2
B2: Select FILE DATA that stores New Text Files (append) *
Script
// GraalJS Script (engine type: 2)

//////// START "main()" /////////////////////////////////////////////////////////////////
main();
function main(){ 

//// == Config Retrieving ==
const filesPocketInput    = configs.getObject( "SelectConfA1" );  /// REQUIRED ///////////////
  let filesInput          = engine.findData( filesPocketInput );  // java.util.ArrayList
  if( filesInput        === null ) {
    throw new Error( "\n AutomatedTask UnexpectedFileError:" +
                     " No File {A1} is attached \n" );
  }else{
    engine.log( " AutomatedTask FilesArray {A1}: " +
                filesInput.size() + " files" );
  }
let   strInputCharset     = configs.get( "StrConfA2" );           // NotRequired /////////////
  if( strInputCharset   === "" ){
      strInputCharset     = "UTF-8";
  }
let   strOutputCharset    = configs.get( "StrConfB1" );           // NotRequired /////////////
  if( strOutputCharset  === "" ){
      strOutputCharset    = "UTF-8";
  }
const filesPocketOutput   = configs.getObject( "SelectConfB2" );  /// REQUIRED ///////////////
  let filesOutput         = engine.findData( filesPocketOutput ); // java.util.ArrayList
  if( filesOutput       === null ) {
    engine.log( " AutomatedTask FilesArray {B2}: (empty)" );
    filesOutput           = new java.util.ArrayList();
  }else{
    engine.log( " AutomatedTask FilesArray {B2}: " +
                filesOutput.size() + " files" );
  }


//// == Data Retrieving ==
// (Nothing. Retrieved via Expression Language in Config Retrieving)


//// == Calculating ==
const numFilesInput = filesInput.size() - 0;
const regMime = /\w+\/[-\w.+]+/;                                                // +++ in v2023★

for( let i = 0; i < numFilesInput; i++ ){
  const strInputFileName = filesInput.get(i).getName() + "";
  const strInputFileSize = filesInput.get(i).getLength() + " bytes";

  //  const strInputFileMime = filesInput.get(i).getContentType();              // --- in v2023★
  const arrInputFileMime = filesInput.get(i).getContentType().match( regMime ); // +++ in v2023★
  if( arrInputFileMime === null ){                                              // +++ in v2023★
    throw new Error( "\n AutomatedTask RuntimeError:" +
                     " MIME Type of the input could not be determined \n" );
  }
  const strInputFileMime = arrInputFileMime[0];                                 // +++ in v2023★

  let strInputText = "";
  let numLineCounter = 0;
  fileRepository.readFile( filesInput.get(i), strInputCharset, function(line) {
  // com.questetra.bpms.core.event.scripttask.FileRepositoryWrapper
  // https://questetra.zendesk.com/hc/ja/articles/360024574471-R2300#FileRepositoryWrapper
      strInputText += line + '\n';
      numLineCounter ++;
  });
  engine.log( " AutomatedTask FileLoaded: " + strInputFileName + " (" + strInputFileMime + ")" );
  engine.log( " AutomatedTask: " + strInputFileSize + " / " + numLineCounter + " lines" );
  engine.log( " AutomatedTask FileOutput as: " +
                strInputFileMime + "; charset=" + strOutputCharset );           // +++ in v2023★

  filesOutput.add(
    new com.questetra.bpms.core.event.scripttask.NewQfile(
      strInputFileName,
      strInputFileMime + "; charset=" + strOutputCharset,
      strInputText
    )
  );
}


//// == Data Updating ==
engine.setData( filesPocketOutput, filesOutput );

} //////// END "main()" /////////////////////////////////////////////////////////////////


/*
Notes:
- Use this to add a step to the workflow that automatically changes the encoding of text files.
    - The charset of the text files is converted when the process reaches this automated task.
- The output file keeps the same file name as the input file.
    - Line endings in the output are always `LF`.
- Converts according to the specified encodings.
    - There is no auto-detect feature.
    - If an encoding is not specified, `UTF-8` is used as the default.

APPENDIX:
- `UTF-8`
    - Compactly encodes all Unicode code points (more than 1 million) in 1 to 4 bytes each.
    - It became the most common character encoding on the web in 2008 and is used by 97% of web pages as of 2021.
    - UTF-8 is a superset of US-ASCII (single-byte characters), so it is upward compatible.
        - That is, ASCII files are also valid UTF-8 files.
        - Similarly, ASCII files are almost always valid Shift_JIS files (Shift_JIS traditionally assigns 0x5C and 0x7E to the yen sign and the overline).
- `UTF-16`
    - Encodes all Unicode code points (more than 1 million) in 2 or 4 bytes each.
    - Text with many Asian characters such as Japanese or Chinese is encoded more compactly than in UTF-8.
- Other encodings
    - `charset=UTF-16` (Unicode [characters around the world])
    - `charset=UTF-16BE` (Unicode [characters around the world])
    - `charset=UTF-16LE` (Unicode [characters around the world])
    - `charset=UTF-32` (Unicode [characters around the world])
    - `charset=x-UTF-32LE-BOM` (Unicode [characters around the world])
    - `charset=ISO-8859-1` (Western language characters)
    - `charset=Shift_JIS` (Japanese characters)
    - `charset=Big5` (Traditional Chinese characters)
    - `charset=GB2312` (Simplified Chinese EUC characters)
    - `charset=GBK` (Simplified Chinese GB characters)
    - `charset=KOI8-R` (Russian)
    - Note that "UTF8B (UTF-8 with BOM)" cannot be output. (A file format for Windows; the author prays for its eradication.)
    - https://docs.oracle.com/javase/9/intl/supported-encodings.htm
- Example Values of `.getContentType()` // +++ in v2023★
    - "text/plain"
    - "text/html;charset=UTF-8"
    - "application/json; q=0.2 charset=utf8"
    - "text/html; charset=ISO-8859-4"
- Content-Type
    - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
    - https://developer.mozilla.org/ja/docs/Web/HTTP/Headers/Content-Type
    - https://httpwg.org/specs/rfc9110.html#field.content-type
- Media Types
    - https://www.iana.org/assignments/media-types/media-types.xhtml


*/
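
As a rough illustration of the media-type step in the script above, the following sketch applies the same regular expression to the example `.getContentType()` values listed in the notes. The helper name `extractMime` is hypothetical and not part of the Auto Step; it only shows how any parameters after the media type (such as `charset=...`) are dropped before the new charset is appended.

```javascript
// Same pattern the script uses to pull the bare media type out of a Content-Type value.
const regMime = /\w+\/[-\w.+]+/;

// Hypothetical helper (not part of the Auto Step): returns the media type, or null if none is found.
function extractMime(contentType) {
  const matched = String(contentType).match(regMime);
  return matched === null ? null : matched[0];
}

const samples = [
  "text/plain",
  "text/html;charset=UTF-8",
  "application/json; q=0.2 charset=utf8",
  "text/html; charset=ISO-8859-4",
];
for (const sample of samples) {
  // Build the output Content-Type the same way the script does, here with UTF-16 as the new charset.
  console.log(sample + " -> " + extractMime(sample) + "; charset=UTF-16");
}
```

A value with no `type/subtype` part yields `null`, which corresponds to the case where the script throws its RuntimeError.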

Download

Freely modifiable JavaScript (ECMAScript) code. No warranty of any kind.
(Installing Addon Auto-Steps is available only on the Professional edition.)

Notes

  • Use this to add a step to the workflow that automatically changes the encoding of text files.
    • The charset of the text files is converted when the process reaches this automated task.
  • The output file keeps the same file name as the input file.
    • Line endings in the output are always LF.
  • Converts according to the specified encodings.
    • There is no auto-detect feature.
    • If an encoding is not specified, UTF-8 is used as the default.
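
The line-ending behaviour can be sketched as follows. This is a minimal stand-alone approximation, assuming that the line reader hands each line to the callback without its terminator (as the script's `line + '\n'` suggests). The name `rebuildWithLf` is hypothetical and does not exist in the Auto Step.

```javascript
// Hypothetical stand-in for the script's per-line callback loop.
function rebuildWithLf(text) {
  // Assumption: lines are split on CRLF, a lone CR, or LF, as a typical line reader does.
  const lines = text.split(/\r\n|\r|\n/);
  // A line reader does not report an extra empty line after a final terminator.
  if (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  let out = "";
  for (const line of lines) {
    out += line + "\n"; // every line is re-terminated with LF, as in the script
  }
  return out;
}

console.log(JSON.stringify(rebuildWithLf("first\r\nsecond\r\n")));
console.log(JSON.stringify(rebuildWithLf("first\nsecond")));
```

Because every line gets `\n` appended, a file whose last line had no terminator gains a trailing newline under this model.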

Appendix

  • UTF-8
    • Compactly encodes all Unicode code points (more than 1 million) in 1 to 4 bytes each.
    • It became the most common character encoding on the web in 2008 and is used by 97% of web pages as of 2021.
    • UTF-8 is a superset of US-ASCII (single-byte characters), so it is upward compatible.
      • That is, ASCII files are also valid UTF-8 files.
      • Similarly, ASCII files are almost always valid Shift_JIS files (Shift_JIS traditionally assigns 0x5C and 0x7E to the yen sign and the overline).
  • UTF-16
    • Encodes all Unicode code points (more than 1 million) in 2 or 4 bytes each.
    • Text with many Asian characters such as Japanese or Chinese is encoded more compactly than in UTF-8.
  • Other encodings
    • charset=UTF-16 (Unicode [characters around the world])
    • charset=UTF-16BE (Unicode [characters around the world])
    • charset=UTF-16LE (Unicode [characters around the world])
    • charset=UTF-32 (Unicode [characters around the world])
    • charset=x-UTF-32LE-BOM (Unicode [characters around the world])
    • charset=ISO-8859-1 (Western language characters)
    • charset=Shift_JIS (Japanese characters)
    • charset=Big5 (Traditional Chinese characters)
    • charset=GB2312 (Simplified Chinese EUC characters)
    • charset=GBK (Simplified Chinese GB characters)
    • charset=KOI8-R (Russian)
    • Note that “UTF8B (UTF-8 with BOM)” cannot be output. (A file format for Windows; the author prays for its eradication.)
    • https://docs.oracle.com/javase/9/intl/supported-encodings.htm
  • Example values of .getContentType() (added in v2023)
    • “text/plain”
    • “text/html;charset=UTF-8”
    • “application/json; q=0.2 charset=utf8”
    • “text/html; charset=ISO-8859-4”
  • Content-Type
  • Media Types
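
To make the size trade-off between UTF-8 and UTF-16 concrete, here is a small sketch that counts encoded bytes for a few sample strings. It assumes a runtime with the standard `TextEncoder` (for example Node.js or a browser), and it counts UTF-16 bytes without a byte order mark. `byteSizes` is a hypothetical helper, not part of the Auto Step.

```javascript
// Hypothetical helper: byte counts of a string in UTF-8 and in UTF-16 (no BOM).
function byteSizes(text) {
  // TextEncoder always produces UTF-8.
  const utf8 = new TextEncoder().encode(text).length;
  // JavaScript strings are sequences of 16-bit code units, so each unit is 2 bytes in UTF-16.
  const utf16 = text.length * 2;
  return { utf8, utf16 };
}

for (const sample of ["abc", "日本語", "😀"]) {
  const { utf8, utf16 } = byteSizes(sample);
  console.log(sample + ": UTF-8 " + utf8 + " bytes, UTF-16 " + utf16 + " bytes");
}
// abc: UTF-8 3 bytes, UTF-16 6 bytes
// 日本語: UTF-8 9 bytes, UTF-16 6 bytes
// 😀: UTF-8 4 bytes, UTF-16 4 bytes
```

ASCII-heavy text is smaller in UTF-8, Japanese text is smaller in UTF-16, and characters outside the Basic Multilingual Plane take 4 bytes in both.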
