TSV String, Filter by RegExp

Narrows TSV text to column matched rows only. A regular expression (RegExp) can be specified as the match condition, and the column value is matched for each row. It is also possible to store excluded rows (filtered rows).

Configs for this Auto Step

StrConfA1: A1: Set TSV Text *^#{EL}
StrConfB0: B0: Set ID of COLUMN to be Matched (eg: “0”) *^#{EL}
StrConfB1: B1: Set Matching Condition (REGEXP) for column value *^#{EL}
BoolConfB2: B2: Case Sensitive or Case should be Ignored
SelectConfC1: C1: Select STRING that stores Narrowed Text (update)
SelectConfC2: C2: Select STRING that stores Removed Lines (update)

Script (click to open)

// GraalJS Script (engine type: 2)

//////// START "main()" ////////////////////////////////////////////////////////////////
main();
function main(){ 

//// == Config Retrieving / 工程コンフィグの参照 ==
const strInput         = configs.get      ( "StrConfA1" );    /// REQUIRED
  if( strInput       === "" ){
    throw new Error( "\n AutomatedTask ConfigError:" +
                     " Config {A1: String} is empty \n" );
  }
const strColMatchId    = configs.get      ( "StrConfB0" );    /// REQUIRED
  if( strColMatchId  === "" ){
    throw new Error( "\n AutomatedTask ConfigError:" +
                     " Config {B0: String} is empty \n" );
  }
const strRegexp        = configs.get      ( "StrConfB1" );    /// REQUIRED
  if( strRegexp      === "" ){
    throw new Error( "\n AutomatedTask ConfigError:" +
                     " Config {B1: String} is empty \n" );
  }
const boolIgnoreCase   = configs.getObject( "BoolConfB2" );   /// TOGGLE
  // https://questetra.zendesk.com/hc/ja/articles/360024574471-R2300 "Boolean object"
const strPocketOutput  = configs.getObject( "SelectConfC1" ); // NotRequired
const strPocketRemoved = configs.getObject( "SelectConfC2" ); // NotRequired


//// == Data Retrieving / ワークフローデータの参照 ==
// (Nothing. Retrieved via Expression Language in Config Retrieving)


//// == Calculating / 演算 ==
const numColMatchId    = parseInt ( strColMatchId );

const arrInput         = strInput.split( '\n' );
const regConstructor   = boolIgnoreCase ?
                         new RegExp( strRegexp, 'i' ) : new RegExp( strRegexp );
// engine.log( " AutomatedTask RegExp: " + regConstructor.toString() );

let   arrOutput    = [];
let   arrRemoved   = [];

for ( let i = 0; i < arrInput.length; i++ ) {
  if ( regConstructor.test( arrInput[i].split( '\t', (numColMatchId + 1) )?.[ numColMatchId ] ?? "" ) ){
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Optional_chaining
    // #combining_with_the_nullish_coalescing_operator (Null 合体演算子と共に使う)
    arrOutput.push ( arrInput[i] );
  } else {
    arrRemoved.push ( arrInput[i] );
  }
}

let strOutput  = arrOutput?.join( '\n' ) ?? "";
let strRemoved = arrRemoved?.join( '\n' ) ?? "";
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Optional_chaining


//// == Data Updating / ワークフローデータへの代入 ==
if ( strPocketOutput !== null ) {
  engine.setData ( strPocketOutput,  strOutput );
}
if ( strPocketRemoved !== null ) {
  engine.setData ( strPocketRemoved, strRemoved );
}

} //////// END "main()" /////////////////////////////////////////////////////////////////



/*
Notes:
- When a process reaches this automated step, "TSV text" is automatically processed.
    - The column in each line of TSV text is tested, and only lines matching the condition are output.
    - Note that it is possible to store not only filtered-down lines, but also filtered-out lines.
- If the amount of calculation is too large, the process will be terminated in the middle.
    - A large number of Lines or complex RegExp that requires more than 30 secs will result in a termination.
- No line feed code is inserted at the end of the output string.
    - "Narrowed Text" (filtered down)
    - "Removed Lines" (filtered out)
- Various human steps can be automated (unmanned).
    - Filter by date
    - Remove invalid email addresses
    - Remove invalid zip codes
    - Extract URLs starting with "http://"
    - Extract only jp domain URLs
- Note that if the column ID is greater than the number of columns, an empty string is evaluated.
    - This does not result in an error.
    - If a regular expression is set to match the blank character, the test result will be true.

Appendix-en:
- Basic regular expressions: Groups
    - `(apple|orange)`
        - Either "apple" or "orange"
- Basic regular expressions: A character class
    - `[xyz]`
        - Any one of the enclosed characters, "x" "y" or "z"
    - `[a-c]`
        - Any one in the range "a" to "c"
    - `[^xyz]`
        - Any one that is neither "x" nor "y" nor "z"
    - `\d`
        - Any digit / `[0-9]`
    - `\w`
        - Any alphanumeric character / `[A-Za-z0-9_]`
    - `\t`
        - A horizontal tab
    - `.`
        - Any single character except line terminators
- Basic regular expressions: Assertions
    - `^`
        - Beginning
    - `$`
        - End
    - `\b`
        - A word boundary
    - `\B`
        - A non-word boundary
- Basic regular expressions: Quantifiers
    - `x*`
        - The preceding item "x" 0 or more times
    - `x+`
        - The preceding item "x" 1 or more times
    - `x?`
        - The preceding item "x" 0 or 1 times
    - `x{n}`
        - "n" occurrences of the preceding item "x"
    - `x{n,m}`
        - at least "n" and at most "m" occurrences
- RegExp Examples
    - Japanese postal code
        - `^(\d{3}-\d{4})|(\d{7})$`
        - `^([0-9]{3}-[0-9]{4})|([0-9]{7})$`
    - Gmail address
        - `^[a-zA-Z0-9\.\+]+@(gmail\.com|googlemail\.com)$`
        - `^[a-zA-Z0-9.+]+@(gmail\.com|googlemail\.com)$`
    - Secure HTTP
        - `^https:\/\/`
        - `^https://`
    - ".com/" or ".com" at the end of the line
        - `\.com/?$`
    - Date (2022-12-15 to 2023-01-06)
        - `(^2022-12-1[5-9]$)|(^2022-12-[2-3][0-9]$)|(^2023-01-0[1-6]$)`

Notes-ja:
- 案件がこの自動工程に到達した際、「TSV テキスト」が自動的に加工さます。
    - TSV テキスト各行のカラムが正規表現でテストされ、条件にマッチした行だけが出力されます。
    - なお、フィルタ抽出された行だけでなく、フィルタ除去された行を格納することも可能です。
- 計算量が多すぎる場合、途中終了します。
    - 行数が多い・正規表現が極めて複雑など、30秒を超えるような処理は途中で終了する可能性があります。
- 出力文字列の末尾に改行コードは挿入されません。
    - "絞り込まれたテキスト"
    - "除去された行"
- 様々なヒューマン工程を自動化（無人化）できます。
    - 日付を絞り込む
    - 不正なメールアドレスを除去する
    - 不正な郵便番号を除去する
    - "http://" 始まりのURLを抽出する
    - jpドメインのURLだけを抽出する
- なお、マッチ判定カラムのIDがカラム数より大きい場合、空文字 "" が評価されます。
    - エラーにはなりません。
    - 空文字にマッチする正規表現が設定されている場合、テスト結果は真となります。

Appendix-ja:
- 基本的な正規表現（グループ／Groups）
    - `(apple|orange)`
        - "apple" か "orange" のいずれか　／　Either "apple" or "orange"
- 基本的な正規表現（文字クラス／A character class）
    - `[xyz]`
        - "x" か "y" か "z" の 一文字　／　Any one of the enclosed characters, "x" "y" or "z"
    - `[a-c]`
        - "a" から "c" までのいずれか一文字　／　Any one in the range "a" to "c"
    - `[^xyz]`
        - "x" でも "y" でも "z" でもない一文字　／　Any one that is neither "x" nor "y" nor "z"
    - `\d`
        - 数字一文字　／　Any digit　／　`[0-9]`
    - `\w`
        - 半角英数字一文字　／　Any alphanumeric character　／　`[A-Za-z0-9_]`
    - `\t`
        - 水平タブ　／　A horizontal tab
    - `.`
        - 改行文字を除くあらゆる一文字　／　Any single character except line terminators
- 基本的な正規表現（アサーション言明／Assertions）
    - `^`
        - 先頭　／　Beginning
    - `$`
        - 末尾　／　End
    - `\b`
        - 区切り　／　A word boundary
    - `\B`
        - 区切り以外　／　A non-word boundary
- 基本的な正規表現（数量詞／Quantifiers）
    - `x*`
        - 直前アイテム "x" の0回以上の繰返　／　The preceding item "x" 0 or more times
    - `x+`
        - 直前アイテム "x" の1回以上の繰返　／　The preceding item "x" 1 or more times
    - `x?`
        - 直前アイテム "x" の0回か1回の出現　／　The preceding item "x" 0 or 1 times
    - `x{n}`
        - 直前アイテム "x" のn回の出現　／　"n" occurrences of the preceding item "x"
    - `x{n,m}`
        - 直前アイテム "x" がnからm回出現　／　at least "n" and at most "m" occurrences
- 正規表現 設定例 Example
    - 日本の郵便番号
        - `^(\d{3}-\d{4})|(\d{7})$`
        - `^([0-9]{3}-[0-9]{4})|([0-9]{7})$`
    - Gmailアドレス
        - `^[a-zA-Z0-9\.\+]+@(gmail\.com|googlemail\.com)$`
        - `^[a-zA-Z0-9.+]+@(gmail\.com|googlemail\.com)$`
    - セキュアHTTP
        - `^https:\/\/`
        - `^https://`
    - 行末の ".com/" もしくは ".com"
        - `\.com/?$`
    - 日付（2022-12-15 to 2023-01-06）
        - `(^2022-12-1[5-9]$)|(^2022-12-[2-3][0-9]$)|(^2023-01-0[1-6]$)`

Test Data for Debug:
date	email	zipcode	url
2022-12-14	example@gmail.com	604-0835	https://questetra.com/
2022-12-14	suzuki@example.com	604-0835	https://QUESTETRA.COM/
2022-12-15	sato@example.com	604-0835	https://questetra.com/
2022-12-18	example@gmail.com	604-0835	https://support.questetra.com/
2023-01-06	example@googlemail.com	604-0835	HTTP://QUESTETRA.COM/
2023-01-06	suzuki@example.com	6040-835	https://questetra.com/
2023-01-06	SATO@example.com	6040835	https://questetra.com/
2023-01-07	EXAMPLE@GMAIL.COM	604-0835	http://support.questetra.com/
*/

Download

tsv-string-filter-by-regexp-2023.xml
- 2023-01-24 (C) Questetra, Inc. (MIT License)

warning Freely modifiable JavaScript (ECMAScript) code. No warranty of any kind.
(Installing Addon Auto-Steps are available only on the Professional edition.)

Notes

When a process reaches this automated step, the TSV text is automatically processed.
- The column in each line of TSV text is tested, and only lines matching the condition are output.
- Note that it is possible to store not only filtered rows but also unfiltered rows.
If the calculation is too large, the process will be terminated in the middle.
- A large number of lines or complex RegExp that exceeds 30 seconds will result in a termination.
No line feed code is inserted at the end of the output string.
- “Narrowed Text” (filtered down)
- “Removed Lines” (filtered out)
Various human steps can be automated (unmanned).
- Filter by date
- Remove invalid email addresses
- Remove invalid zip codes
- Extract URLs starting with “http://”
- Extract only jp domain URLs
Note that if the column ID is greater than the number of columns, an empty string is evaluated.
- This does not result in an error.
- If a regular expression is set to match the blank character, the test result will be true.

Capture

Narrows down to just the column-matched lines from the TSV text. A regular expression (RegExp) can be specified as the match condition. Evaluates whether the column matches the regular expression. Also stores the filtered-out lines (removed lines).

Appendix

Basic regular expressions: Groups
- (apple|orange)
  - Either “apple” or “orange”
Basic regular expressions: A character class
- [xyz]
  - Any one of the enclosed characters, “x” “y” or “z”
- [a-c]
  - Any one in the range “a” to “c”
- [^xyz]
  - Any one that is neither “x” nor “y” nor “z”
- \d
  - Any digit / [0-9]
- \w
  - Any alphanumeric character / [A-Za-z0-9_]
- \t
  - A horizontal tab
- .
  - Any single character except line terminators
Basic regular expressions: Assertions
- ^
  - Beginning
- $
  - End
- \b
  - A word boundary
- \B
  - A non-word boundary
Basic regular expressions: Quantifiers
- x*
  - The preceding item “x” 0 or more times
- x+
  - The preceding item “x” 1 or more times
- x?
  - The preceding item “x” 0 or 1 times
- x{n}
  - “n” occurrences of the preceding item “x”
- x{n,m}
  - at least “n” and at most “m” occurrences
RegExp Examples
- Japanese postal code
  - ^(\d{3}-\d{4})|(\d{7})$
  - ^([0-9]{3}-[0-9]{4})|([0-9]{7})$
- Gmail address
  - ^[a-zA-Z0-9\.\+]+@(gmail\.com|googlemail\.com)$
  - ^[a-zA-Z0-9.+]+@(gmail\.com|googlemail\.com)$
- Secure HTTP
  - ^https:\/\/
  - ^https://
- “.com/” or “.com” at the end of the line
  - \.com/?$
- Date (2022-12-15 to 2023-01-06)
  - (^2022-12-1[5-9]$)|(^2022-12-[2-3][0-9]$)|(^2023-01-0[1-6]$)

TSV String, Filter by RegExp

TSV String, Filter by RegExp

Download

Notes

Capture

Appendix

See Also

Like this:

Related

Leave a ReplyCancel reply

Questetra BPM Suite

TSV String, Filter by RegExp

TSV String, Filter by RegExp

Download

Notes

Capture

Appendix

See Also

Share this:

Like this:

Related

Leave a ReplyCancel reply

Questetra BPM Suite

Discover more from Questetra Support