TSV String, Filter by RegExp

TSV String, Filter by RegExp

TSV String, Filter by RegExp

TSV 文字列, 正規表現フィルタ

Narrows TSV text to column matched rows only. A regular expression (RegExp) can be specified as the match condition, and the column value is matched for each row. It is also possible to store excluded rows (filtered rows).

Auto Step icon
Configs for this Auto Step
StrConfA1
A1: Set TSV Text *#{EL}
StrConfB0
B0: Set ID of COLUMN to be Matched (eg: “0”) *#{EL}
StrConfB1
B1: Set Matching Condition (REGEXP) for column value *#{EL}
BoolConfB2
B2: Case Sensitive or Case should be Ignored
SelectConfC1
C1: Select STRING that stores Narrowed Text (update)
SelectConfC2
C2: Select STRING that stores Removed Lines (update)
Script (click to open)
// GraalJS Script (engine type: 2)

//////// START "main()" ////////////////////////////////////////////////////////////////
main();
function main(){ 

//// == Config Retrieving / 工程コンフィグの参照 ==
const strInput         = configs.get      ( "StrConfA1" );    /// REQUIRED
  if( strInput       === "" ){
    throw new Error( "\n AutomatedTask ConfigError:" +
                     " Config {A1: String} is empty \n" );
  }
const strColMatchId    = configs.get      ( "StrConfB0" );    /// REQUIRED
  if( strColMatchId  === "" ){
    throw new Error( "\n AutomatedTask ConfigError:" +
                     " Config {B0: String} is empty \n" );
  }
const strRegexp        = configs.get      ( "StrConfB1" );    /// REQUIRED
  if( strRegexp      === "" ){
    throw new Error( "\n AutomatedTask ConfigError:" +
                     " Config {B1: String} is empty \n" );
  }
const boolIgnoreCase   = configs.getObject( "BoolConfB2" );   /// TOGGLE
  // https://questetra.zendesk.com/hc/ja/articles/360024574471-R2300 "Boolean object"
const strPocketOutput  = configs.getObject( "SelectConfC1" ); // NotRequired
const strPocketRemoved = configs.getObject( "SelectConfC2" ); // NotRequired


//// == Data Retrieving / ワークフローデータの参照 ==
// (Nothing. Retrieved via Expression Language in Config Retrieving)


//// == Calculating / 演算 ==
const numColMatchId    = parseInt ( strColMatchId );

const arrInput         = strInput.split( '\n' );
const regConstructor   = boolIgnoreCase ?
                         new RegExp( strRegexp, 'i' ) : new RegExp( strRegexp );
// engine.log( " AutomatedTask RegExp: " + regConstructor.toString() );

let   arrOutput    = [];
let   arrRemoved   = [];

for ( let i = 0; i < arrInput.length; i++ ) {
  if ( regConstructor.test( arrInput[i].split( '\t', (numColMatchId + 1) )?.[ numColMatchId ] ?? "" ) ){
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Optional_chaining
    // #combining_with_the_nullish_coalescing_operator (Null 合体演算子と共に使う)
    arrOutput.push ( arrInput[i] );
  } else {
    arrRemoved.push ( arrInput[i] );
  }
}

let strOutput  = arrOutput?.join( '\n' ) ?? "";
let strRemoved = arrRemoved?.join( '\n' ) ?? "";
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Optional_chaining


//// == Data Updating / ワークフローデータへの代入 ==
if ( strPocketOutput !== null ) {
  engine.setData ( strPocketOutput,  strOutput );
}
if ( strPocketRemoved !== null ) {
  engine.setData ( strPocketRemoved, strRemoved );
}

} //////// END "main()" /////////////////////////////////////////////////////////////////



/*
Notes:
- When a process reaches this automated step, "TSV text" is automatically processed.
    - The column in each line of TSV text is tested, and only lines matching the condition are output.
    - Note that it is possible to store not only filtered-down lines, but also filtered-out lines.
- If the amount of calculation is too large, the process will be terminated in the middle.
    - A large number of Lines or complex RegExp that requires more than 30 secs will result in a termination.
- No line feed code is inserted at the end of the output string.
    - "Narrowed Text" (filtered down)
    - "Removed Lines" (filtered out)
- Various human steps can be automated (unmanned).
    - Filter by date
    - Remove invalid email addresses
    - Remove invalid zip codes
    - Extract URLs starting with "http://"
    - Extract only jp domain URLs
- Note that if the column ID is greater than the number of columns, an empty string is evaluated.
    - This does not result in an error.
    - If a regular expression is set to match the blank character, the test result will be true.

Appendix-en:
- Basic regular expressions: Groups
    - `(apple|orange)`
        - Either "apple" or "orange"
- Basic regular expressions: A character class
    - `[xyz]`
        - Any one of the enclosed characters, "x" "y" or "z"
    - `[a-c]`
        - Any one in the range "a" to "c"
    - `[^xyz]`
        - Any one that is neither "x" nor "y" nor "z"
    - `\d`
        - Any digit / `[0-9]`
    - `\w`
        - Any alphanumeric character / `[A-Za-z0-9_]`
    - `\t`
        - A horizontal tab
    - `.`
        - Any single character except line terminators
- Basic regular expressions: Assertions
    - `^`
        - Beginning
    - `$`
        - End
    - `\b`
        - A word boundary
    - `\B`
        - A non-word boundary
- Basic regular expressions: Quantifiers
    - `x*`
        - The preceding item "x" 0 or more times
    - `x+`
        - The preceding item "x" 1 or more times
    - `x?`
        - The preceding item "x" 0 or 1 times
    - `x{n}`
        - "n" occurrences of the preceding item "x"
    - `x{n,m}`
        - at least "n" and at most "m" occurrences
- RegExp Examples
    - Japanese postal code
        - `^(\d{3}-\d{4})|(\d{7})$`
        - `^([0-9]{3}-[0-9]{4})|([0-9]{7})$`
    - Gmail address
        - `^[a-zA-Z0-9\.\+]+@(gmail\.com|googlemail\.com)$`
        - `^[a-zA-Z0-9.+]+@(gmail\.com|googlemail\.com)$`
    - Secure HTTP
        - `^https:\/\/`
        - `^https://`
    - ".com/" or ".com" at the end of the line
        - `\.com/?$`
    - Date (2022-12-15 to 2023-01-06)
        - `(^2022-12-1[5-9]$)|(^2022-12-[2-3][0-9]$)|(^2023-01-0[1-6]$)`

Notes-ja:
- 案件がこの自動工程に到達した際、「TSV テキスト」が自動的に加工さます。
    - TSV テキスト各行のカラムが正規表現でテストされ、条件にマッチした行だけが出力されます。
    - なお、フィルタ抽出された行だけでなく、フィルタ除去された行を格納することも可能です。
- 計算量が多すぎる場合、途中終了します。
    - 行数が多い・正規表現が極めて複雑など、30秒を超えるような処理は途中で終了する可能性があります。
- 出力文字列の末尾に改行コードは挿入されません。
    - "絞り込まれたテキスト"
    - "除去された行"
- 様々なヒューマン工程を自動化(無人化)できます。
    - 日付を絞り込む
    - 不正なメールアドレスを除去する
    - 不正な郵便番号を除去する
    - "http://" 始まりのURLを抽出する
    - jpドメインのURLだけを抽出する
- なお、マッチ判定カラムのIDがカラム数より大きい場合、空文字 "" が評価されます。
    - エラーにはなりません。
    - 空文字にマッチする正規表現が設定されている場合、テスト結果は真となります。

Appendix-ja:
- 基本的な正規表現(グループ/Groups)
    - `(apple|orange)`
        - "apple" か "orange" のいずれか / Either "apple" or "orange"
- 基本的な正規表現(文字クラス/A character class)
    - `[xyz]`
        - "x" か "y" か "z" の 一文字 / Any one of the enclosed characters, "x" "y" or "z"
    - `[a-c]`
        - "a" から "c" までのいずれか一文字 / Any one in the range "a" to "c"
    - `[^xyz]`
        - "x" でも "y" でも "z" でもない一文字 / Any one that is neither "x" nor "y" nor "z"
    - `\d`
        - 数字一文字 / Any digit / `[0-9]`
    - `\w`
        - 半角英数字一文字 / Any alphanumeric character / `[A-Za-z0-9_]`
    - `\t`
        - 水平タブ / A horizontal tab
    - `.`
        - 改行文字を除くあらゆる一文字 / Any single character except line terminators
- 基本的な正規表現(アサーション言明/Assertions)
    - `^`
        - 先頭 / Beginning
    - `$`
        - 末尾 / End
    - `\b`
        - 区切り / A word boundary
    - `\B`
        - 区切り以外 / A non-word boundary
- 基本的な正規表現(数量詞/Quantifiers)
    - `x*`
        - 直前アイテム "x" の0回以上の繰返 / The preceding item "x" 0 or more times
    - `x+`
        - 直前アイテム "x" の1回以上の繰返 / The preceding item "x" 1 or more times
    - `x?`
        - 直前アイテム "x" の0回か1回の出現 / The preceding item "x" 0 or 1 times
    - `x{n}`
        - 直前アイテム "x" のn回の出現 / "n" occurrences of the preceding item "x"
    - `x{n,m}`
        - 直前アイテム "x" がnからm回出現 / at least "n" and at most "m" occurrences
- 正規表現 設定例 Example
    - 日本の郵便番号
        - `^(\d{3}-\d{4})|(\d{7})$`
        - `^([0-9]{3}-[0-9]{4})|([0-9]{7})$`
    - Gmailアドレス
        - `^[a-zA-Z0-9\.\+]+@(gmail\.com|googlemail\.com)$`
        - `^[a-zA-Z0-9.+]+@(gmail\.com|googlemail\.com)$`
    - セキュアHTTP
        - `^https:\/\/`
        - `^https://`
    - 行末の ".com/" もしくは ".com"
        - `\.com/?$`
    - 日付(2022-12-15 to 2023-01-06)
        - `(^2022-12-1[5-9]$)|(^2022-12-[2-3][0-9]$)|(^2023-01-0[1-6]$)`

Test Data for Debug:
date	email	zipcode	url
2022-12-14	example@gmail.com	604-0835	https://questetra.com/
2022-12-14	suzuki@example.com	604-0835	https://QUESTETRA.COM/
2022-12-15	sato@example.com	604-0835	https://questetra.com/
2022-12-18	example@gmail.com	604-0835	https://support.questetra.com/
2023-01-06	example@googlemail.com	604-0835	HTTP://QUESTETRA.COM/
2023-01-06	suzuki@example.com	6040-835	https://questetra.com/
2023-01-06	SATO@example.com	6040835	https://questetra.com/
2023-01-07	EXAMPLE@GMAIL.COM	604-0835	http://support.questetra.com/
*/

Download

warning Freely modifiable JavaScript (ECMAScript) code. No warranty of any kind.
(Installing Addon Auto-Steps are available only on the Professional edition.)

Notes

  • When a process reaches this automated step, the TSV text is automatically processed.
    • The column in each line of TSV text is tested, and only lines matching the condition are output.
    • Note that it is possible to store not only filtered rows but also unfiltered rows.
  • If the calculation is too large, the process will be terminated in the middle.
    • A large number of lines or complex RegExp that exceeds 30 seconds will result in a termination.
  • No line feed code is inserted at the end of the output string.
    • “Narrowed Text” (filtered down)
    • “Removed Lines” (filtered out)
  • Various human steps can be automated (unmanned).
    • Filter by date
    • Remove invalid email addresses
    • Remove invalid zip codes
    • Extract URLs starting with “http://&#8221;
    • Extract only jp domain URLs
  • Note that if the column ID is greater than the number of columns, an empty string is evaluated.
    • This does not result in an error.
    • If a regular expression is set to match the blank character, the test result will be true.

Capture

Narrows down to just the column-matched lines from the TSV text. A regular expression (RegExp) can be specified as the match condition. Evaluates whether the column matches the regular expression. Also stores the filtered-out lines (removed lines).

Appendix

  • Basic regular expressions: Groups
    • (apple|orange)
      • Either “apple” or “orange”
  • Basic regular expressions: A character class
    • [xyz]
      • Any one of the enclosed characters, “x” “y” or “z”
    • [a-c]
      • Any one in the range “a” to “c”
    • [^xyz]
      • Any one that is neither “x” nor “y” nor “z”
    • \d
      • Any digit / [0-9]
    • \w
      • Any alphanumeric character / [A-Za-z0-9_]
    • \t
      • A horizontal tab
    • .
      • Any single character except line terminators
  • Basic regular expressions: Assertions
    • ^
      • Beginning
    • $
      • End
    • \b
      • A word boundary
    • \B
      • A non-word boundary
  • Basic regular expressions: Quantifiers
    • x*
      • The preceding item “x” 0 or more times
    • x+
      • The preceding item “x” 1 or more times
    • x?
      • The preceding item “x” 0 or 1 times
    • x{n}
      • “n” occurrences of the preceding item “x”
    • x{n,m}
      • at least “n” and at most “m” occurrences
  • RegExp Examples
    • Japanese postal code
      • ^(\d{3}-\d{4})|(\d{7})$
      • ^([0-9]{3}-[0-9]{4})|([0-9]{7})$
    • Gmail address
      • ^[a-zA-Z0-9\.\+]+@(gmail\.com|googlemail\.com)$
      • ^[a-zA-Z0-9.+]+@(gmail\.com|googlemail\.com)$
    • Secure HTTP
      • ^https:\/\/
      • ^https://
    • “.com/” or “.com” at the end of the line
      • \.com/?$
    • Date (2022-12-15 to 2023-01-06)
      • (^2022-12-1[5-9]$)|(^2022-12-[2-3][0-9]$)|(^2023-01-0[1-6]$)

See Also

TSV String, Filter by Keywords

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from Questetra Support

Subscribe now to keep reading and get access to the full archive.

Continue reading

Scroll to Top