Processing steps

Overview

Processing steps are specific methods used to extract data within Monitoring Tasks, such as xPath, JSON, and Regex. Each monitoring task may have multiple processing steps, determining how to extract data from different sources. In this section, you’ll learn how to set up, manage, and modify processing steps to ensure precise and efficient data extraction, ultimately improving your competitive intelligence strategy.

In detail

Processing steps are functions that run every single time you scrape a website/API. They help you bend and transform monitoring results into standardized data.

To build a Market Intelligence monitoring pipeline, Midesk offers many ways to process and extract the exact results you want.

A default view of the user interface with a processing step that extracts a title from a webpage.

There are three very flexible methods that can help you: xPath, JSON, and Regex. Which one to use depends on the type of data you work with. If you work with HTML elements and websites, xPath is your friend. If you work with APIs or object structures, the JSON method will help you tremendously. If you work with unstructured text, use Regex.

Processing Methods

Midesk offers multiple ways to process, mold, and extract exactly the content you want. There are three key methods you can use:

xPath

Use case: Structured HTML or XML processing

Description: XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).

You can use the xPath method to not only process traditional websites but also XML files and RSS.
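To give a rough idea of what an XPath query does, the sketch below pulls item titles out of a small RSS document with Python's standard library. The feed content is invented for this example, the code runs outside Midesk, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet, invented for this example.
RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)
# ".//item/title" selects every <title> that is a direct child of an <item>.
titles = [node.text for node in root.findall(".//item/title")]
print(titles)
```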

JSON

Use case: Structured API & Objects processing

Description: JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types. It is a very common data format with a diverse range of applications, such as serving as a replacement for XML in AJAX.

Regex

Use case: Unstructured text processing

Description: A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
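As an illustration of a search pattern applied to unstructured text, the following sketch uses Python's re module. The sample sentence and the pattern are made up for the example and are not taken from Midesk.

```python
import re

# Hypothetical unstructured text, invented for this example.
text = "Price: 1299 EUR (was 1499 EUR)"

# One or more digits followed by " EUR"; the parentheses capture the digits.
first = re.search(r"(\d+) EUR", text)
all_prices = re.findall(r"(\d+) EUR", text)

print(first.group(1))
print(all_prices)
```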

JSON Processing Functions

Transform

The transform function uses so-called "dot notation". Dot notation represents nested keys as plain text, with each key separated by a dot (.).
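The idea behind dot notation can be sketched in a few lines of Python. The helper name get_path and the nested sample object are assumptions made for this illustration. How Midesk resolves paths internally is not documented here.

```python
import json
from functools import reduce


def get_path(data, path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), data)


# Hypothetical nested finding, invented for this example.
finding = json.loads('{"address": {"city": "Hamburg", "country": "Germany"}}')
city = get_path(finding, "address.city")
print(city)
```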

Notation for simple elements

Finding in the previous step

{"city": "Hamburg", "country": "Germany"}

Expression

Location: ${city} (${country})

Output

[
  "Location: Hamburg (Germany)"
]

Notation for chunked elements

Finding in the previous step

[
  {
    "title": "Company",
    "text": "Strategy"
  },
  {
    "title": "Other company",
    "text": "Market Intelligence"
  }
]

Expression

Location: ${ title } SOME TEXT ${ text }

Output

[
  "Location: Company SOME TEXT Strategy",
  "Location: Other company SOME TEXT Market Intelligence"
]
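The two notations above can be approximated with the sketch below. It assumes that a single object produces one result and that a list of objects produces one result per element, as the outputs above suggest. The function name render is hypothetical, and this is not Midesk's implementation.

```python
import re

# Matches ${ key } with optional spaces around the key.
PLACEHOLDER = re.compile(r"\$\{\s*([\w.]+)\s*\}")


def render(expression, finding):
    """Fill each placeholder from one object, or from every object in a list."""
    items = finding if isinstance(finding, list) else [finding]
    return [
        PLACEHOLDER.sub(lambda m, item=item: str(item[m.group(1)]), expression)
        for item in items
    ]


simple = render("Location: ${city} (${country})",
                {"city": "Hamburg", "country": "Germany"})
chunked = render("Location: ${ title } SOME TEXT ${ text }",
                 [{"title": "Company", "text": "Strategy"},
                  {"title": "Other company", "text": "Market Intelligence"}])
print(simple)
print(chunked)
```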

Pipes

The transform method supports many pipes that help you work with the text. Pipes can be chained and perform operations from left to right. Pipes are denoted by a pipe | sign. Pipes may have one or two additional parameters which are separated by a colon : sign.

How to apply pipes

Finding in the previous step

{
  "city": "Hamburg",
  "country": "Germany"
}

Expression

Location: ${ city | upper } (${ country | upper })

Output

[
  "Location: HAMBURG (GERMANY)"
]
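Left-to-right chaining can be pictured with the following sketch. The PIPES table and the apply_pipes function are hypothetical names, only two pipes are included, and the parser is deliberately naive: it would break on arguments that themselves contain | or : characters.

```python
# Two sample pipes; the real list is much longer.
PIPES = {
    "upper": lambda value: value.upper(),
    "append": lambda value, suffix: value + suffix,
}


def apply_pipes(value, chain):
    """Apply 'pipe : arg' segments separated by '|' from left to right."""
    for segment in chain.split("|"):
        name, *args = [part.strip() for part in segment.split(":")]
        # Arguments are written in single quotes; drop the quotes.
        args = [arg.strip("'") for arg in args]
        value = PIPES[name](value, *args)
    return value


result = apply_pipes("Hamburg", "upper | append : '!'")
print(result)
```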

List of Pipes

after

The after pipe returns everything after the given value in a string. The entire string will be returned if the value does not exist within the string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

after : '123'

Output

Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

afterLast

The afterLast pipe returns everything after the last occurrence of the given value in a string. The entire string will be returned if the value does not exist within the string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

afterLast : '123'

Output

<a><i>Text in a span</i></a> - text $ Abc

append

The append pipe appends the given values to the string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

append : ' appended text'

Output

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc appended text

ascii

The ascii pipe will attempt to transliterate the string into an ASCII value.

Finding in the previous step

ûěščřžýáí

Expression

ascii

Output

uescrzyai
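A comparable transliteration can be sketched in Python by decomposing accented characters and dropping the accents. This is an approximation under stated assumptions: characters without a decomposition (for example "ß" or "ø") are simply dropped here, whereas a full transliterator would map them.

```python
import unicodedata


def to_ascii(s):
    # NFKD splits "č" into "c" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", s)
    # Encoding to ASCII with "ignore" drops the accent marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("ûěščřžýáí"))
```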

before

The before pipe returns everything before the given value in a string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

before : '123'

Output

some

beforeLast

The beforeLast pipe returns everything before the last occurrence of the given value in a string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

beforeLast : '123'

Output

some 123 Some random 123 text

between

The between pipe returns the portion of a string between two values.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

between : 'random' : '123'

Output

123 text

betweenFirst

The betweenFirst pipe returns the smallest possible portion of a string between two values.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

betweenFirst : '123' : '123'

Output

Some random
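The behaviour of the after, afterLast, before, beforeLast, and betweenFirst pipes can be reproduced with Python's str.partition and str.rpartition. The function names below are illustrative, and the final strip() mirrors the trimming that is applied to the end result.

```python
def after(s, needle):
    head, sep, tail = s.partition(needle)
    return tail if sep else s  # whole string when needle is missing


def after_last(s, needle):
    head, sep, tail = s.rpartition(needle)
    return tail if sep else s


def before(s, needle):
    return s.partition(needle)[0]


def before_last(s, needle):
    head, sep, tail = s.rpartition(needle)
    return head if sep else s


def between_first(s, start, end):
    # Smallest portion: first start marker, then the next end marker.
    return before(after(s, start), end)


text = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"
print(before(text, "123").strip())
print(between_first(text, "123", "123").strip())
```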

camel

The camel pipe converts the given string to camelCase.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

camel

Output

some123SomeRandom123Text123<a><i>TextInASpan</i></a>Text$Abc

cast

The cast pipe transforms provided value into one of the following types: text, date, number, boolean.

Finding in the previous step

Wed Sep 14 2022 19:29:50 GMT+0000

Expression

cast : 'text'

Output

Wed Sep 14 2022 19:29:50 GMT+0000

Finding in the previous step

Wed Sep 14 2022 19:29:50 GMT+0000

Expression

cast : 'number'

Output

Finding in the previous step

Wed Sep 14 2022 19:29:50 GMT+0000

Expression

cast : 'boolean'

Output

Finding in the previous step

Wed Sep 14 2022 19:29:50 GMT+0000

Expression

cast : 'date'

Output

2022-09-14
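For comparison, the date cast above can be reproduced in Python with an explicit format string. The format string is an assumption that fits this one sample; the formats Midesk accepts are not documented here.

```python
from datetime import datetime

raw = "Wed Sep 14 2022 19:29:50 GMT+0000"
# %a = weekday name, %b = month name, %z = UTC offset such as +0000.
parsed = datetime.strptime(raw, "%a %b %d %Y %H:%M:%S GMT%z")
as_date = parsed.date().isoformat()
print(as_date)
```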

contains

The contains pipe determines if the given string contains the given value. This pipe is case-sensitive.

Finding in the previous step

some text is here

Expression

contains : 'text'

Output

Finding in the previous step

some text is here

Expression

contains : 'NOT THERE'

Output

decode

The decode pipe converts HTML entities to their corresponding characters.

Finding in the previous step

&lt;a href=&#039;http://midesk.co&#039;&gt;&lt;/a&gt;

Expression

decode

Output

<a href='http://midesk.co'></a>
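The same entity decoding can be tried out with Python's standard html module, shown here only as a point of comparison.

```python
import html

encoded = "&lt;a href=&#039;http://midesk.co&#039;&gt;&lt;/a&gt;"
# unescape() turns named and numeric entities back into characters.
decoded = html.unescape(encoded)
print(decoded)
```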

endsWith

The endsWith pipe determines if the given string ends with the given value.

Finding in the previous step

some/path/

Expression

endsWith : '/'

Output

Finding in the previous step

some/path

Expression

endsWith : '/'

Output

finish

The finish pipe adds a single instance of the given value to a string if it does not already end with that value.

Finding in the previous step

some/path/

Expression

finish : '/'

Output

some/path/

Finding in the previous step

some/path

Expression

finish : '/'

Output

some/path/

is

The is pipe determines if a given string matches a given pattern. Asterisks may be used as wildcard values.
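One way to read the wildcard matching described above is as a regular expression in which each asterisk stands for "any characters". The helper name matches is hypothetical, and the sketch assumes the pattern must cover the whole string.

```python
import re


def matches(pattern, value):
    # Escape everything, then turn each escaped asterisk back into ".*".
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, value) is not None


print(matches("ran*", "random"))
print(matches("nomatch", "random"))
```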

Finding in the previous step

random

Expression

is : 'ran*'

Output

Finding in the previous step

random

Expression

is : 'random'

Output

Finding in the previous step

random

Expression

is : 'nomatch'

Output

isAscii

The isAscii pipe determines if a given string is an ASCII string.

Finding in the previous step

Midesk

Expression

isAscii

Output

Finding in the previous step

Česko

Expression

isAscii

Output

isUuid

The isUuid pipe determines if the given string is a valid UUID.

Finding in the previous step

4a601006-5d47-44ac-a316-bd49447fac61

Expression

isUuid

Output

Finding in the previous step

midesk

Expression

isUuid

Output

kebab

The kebab pipe converts the given string to kebab-case.

Finding in the previous step

Midesk is great

Expression

kebab

Output

midesk-is-great

lcfirst

The lcfirst pipe returns the given string with the first character lowercased.

Finding in the previous step

Midesk Is Great

Expression

lcfirst

Output

midesk Is Great

length

The length pipe returns the length of the given string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

length

Output

limit

The limit pipe truncates the given string to the specified length.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

limit : '5'

Output

some

lower

The lower pipe converts the given string to lowercase.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

lower

Output

some 123 some random 123 text 123 <a><i>text in a span</i></a> - text $ abc

ltrim

The ltrim pipe trims the left side of the string. This pipe is automatically applied to the end result.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

ltrim

Output

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

match

The match pipe will return the portion of a string that matches a given regular expression pattern.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

match : '/random (.*) text/'

Output

123 text 123 <a><i>Text in a span</i></a> -

matchAll

The matchAll pipe will return a collection containing the portions of a string that match a given regular expression pattern.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

matchAll : '/(.*) text/'

Output

["some 123 Some random 123 text 123 <a><i>Text in a span<\/i><\/a> -"]

md5

The md5 pipe uses a cryptographically broken but still widely used hash function producing a 128-bit hash value.

Finding in the previous step

Midesk

Expression

md5

Output

8d658da2a7bcb1d515993d51ed030c4b
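An MD5 digest can be computed locally with Python's hashlib, for example to compare values outside Midesk. The sketch assumes the input is encoded as UTF-8 before hashing.

```python
import hashlib

# hexdigest() returns the 128-bit hash as 32 hexadecimal characters.
digest = hashlib.md5("Midesk".encode("utf-8")).hexdigest()
print(digest)
```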

padBoth

The padBoth pipe pads both sides of a string with another string until the final string reaches the desired length.

Finding in the previous step

Midesk

Expression

padBoth : '10' : '_'

Output

__Midesk__

padLeft

The padLeft pipe pads the left side of a string with another string until the final string reaches the desired length.

Finding in the previous step

Midesk

Expression

padLeft : '10' : '_'

Output

____Midesk

padRight

The padRight pipe pads the right side of a string with another string until the final string reaches the desired length.

Finding in the previous step

Midesk

Expression

padRight : '10' : '_'

Output

Midesk____
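The three padding pipes can be mimicked in Python as follows. The sketch assumes a single-character pad string, and pad_both is a hypothetical helper that puts any odd leftover character on the right.

```python
def pad_both(s, length, pad="_"):
    missing = max(length - len(s), 0)
    left = missing // 2
    right = missing - left
    return pad * left + s + pad * right


# rjust pads on the left, ljust pads on the right.
padded_left = "Midesk".rjust(10, "_")
padded_right = "Midesk".ljust(10, "_")
print(pad_both("Midesk", 10), padded_left, padded_right)
```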

plural

The plural pipe converts a singular word string to its plural form. This function currently only supports the English language. You may provide an integer as a first parameter to retrieve the singular or plural form of the string:

Finding in the previous step

car

Expression

plural

Output

cars

Finding in the previous step

car

Expression

plural:1

Output

car

Finding in the previous step

car

Expression

plural:20

Output

cars

prepend

The prepend pipe prepends the given values onto the string.

Finding in the previous step

is great

Expression

prepend : 'Midesk '

Output

Midesk is great

remove

The remove pipe removes the given value or array of values from the string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

remove : 'e'

Output

som 123 Som random 123 txt 123 <a><i>Txt in a span</i></a> - txt $ Abc

replace

The replace pipe replaces every occurrence of a given value within the string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

replace : '123' : 'ABC'

Output

some ABC Some random ABC text ABC <a><i>Text in a span</i></a> - text $ Abc

replaceFirst

The replaceFirst pipe replaces the first occurrence of a given value in a string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

replaceFirst : '123' : 'ABC'

Output

some ABC Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

replaceLast

The replaceLast pipe replaces the last occurrence of a given value in a string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

replaceLast : '123' : 'ABC'

Output

some 123 Some random 123 text ABC <a><i>Text in a span</i></a> - text $ Abc
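The difference between replace, replaceFirst, and replaceLast can be checked with a short Python sketch. The helper names are illustrative only.

```python
def replace_first(s, old, new):
    # The third argument limits str.replace to one replacement.
    return s.replace(old, new, 1)


def replace_last(s, old, new):
    head, sep, tail = s.rpartition(old)
    return head + new + tail if sep else s


text = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"
print(text.replace("123", "ABC"))
print(replace_first(text, "123", "ABC"))
print(replace_last(text, "123", "ABC"))
```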

replaceMatches

The replaceMatches pipe replaces all portions of a string matching a pattern with the given replacement string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

replaceMatches : '/[^A-Za-z0-9]++/' : ''

Output

some123Somerandom123text123aiTextinaspaniatextAbc
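The same replacement can be reproduced with Python's re.sub. The pattern below uses a plain + instead of the possessive ++ shown above, because possessive quantifiers are only available in Python 3.11 and later. For this pattern the result is the same.

```python
import re

text = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"
# Remove every run of characters that are not ASCII letters or digits.
cleaned = re.sub(r"[^A-Za-z0-9]+", "", text)
print(cleaned)
```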

rtrim

The rtrim pipe trims the right side of the given string. This pipe is automatically applied to the end result.

Finding in the previous step

Midesk

Expression

rtrim

Output

Midesk

Finding in the previous step

Midesk/

Expression

rtrim : '/'

Output

Midesk

singular

The singular pipe converts a string to its singular form. This function currently only supports the English language.

Finding in the previous step

cars

Expression

singular

Output

car

Finding in the previous step

cheese

Expression

singular

Output

cheese

Finding in the previous step

the cups

Expression

singular

Output

the cup

slug

The slug pipe generates a URL friendly "slug" from the given string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

slug

Output

some-123-some-random-123-text-123-aitext-in-a-spania-text-abc

snake

The snake pipe converts the given string to snake_case.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

snake

Output

some123_some_random123_text123<a><i>_text_in_a_span</i></a>-_text$_abc

squish

The squish pipe removes all extraneous white space from a string, including extraneous white space between words.

Finding in the previous step

Midesk     is     great

Expression

squish

Output

Midesk is great
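In Python the same effect is commonly achieved by splitting on whitespace and joining with single spaces, as in this small sketch.

```python
def squish(s):
    # split() without arguments drops leading, trailing, and repeated whitespace.
    return " ".join(s.split())


print(squish("  Midesk     is     great "))
```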

start

The start pipe adds a single instance of the given value to a string if it does not already start with that value.

Finding in the previous step

/some/path/

Expression

start : '/'

Output

/some/path/

Finding in the previous step

some/path

Expression

start : '/'

Output

/some/path
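The start pipe, together with the finish pipe described earlier, can be sketched as two small Python functions. The sketch follows the wording of the descriptions: the value is added only when it is not already present at that end of the string.

```python
def start(s, prefix):
    return s if s.startswith(prefix) else prefix + s


def finish(s, cap):
    return s if s.endswith(cap) else s + cap


print(start("some/path", "/"))
print(finish("some/path", "/"))
```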

startsWith

The startsWith pipe determines if the given string begins with the given value.

Finding in the previous step

/some/path

Expression

startsWith : '/'

Output

Finding in the previous step

some/path

Expression

startsWith : '/'

Output

stripTags

The stripTags pipe strips HTML and PHP tags from the given string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

stripTags

Output

some 123 Some random 123 text 123 Text in a span - text $ Abc
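A rough Python equivalent collects only the text content while an HTML parser walks the string. The class name TextOnly and the helper strip_tags are hypothetical, and this sketch does not claim to match every edge case of the pipe.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(s):
    parser = TextOnly()
    parser.feed(s)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


text = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"
print(strip_tags(text))
```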

studly

The studly pipe converts the given string to StudlyCase.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

studly

Output

Some123SomeRandom123Text123<a><i>TextInASpan</i></a>Text$Abc

substr

The substr pipe returns the portion of the string specified by the given start and length parameters.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

substr : '10' : '5'

Output

ome r
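The start and length parameters correspond to a zero-based slice, which can be checked in Python as follows.

```python
text = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"
start, length = 10, 5
# Characters at positions 10 through 14 (zero-based).
part = text[start:start + length]
print(part)
```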

substrCount

The substrCount pipe returns the number of occurrences of a given value in the given string.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

substrCount : '123'

Output

title

The title pipe converts the given string to Title Case.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

title

Output

Some 123 Some Random 123 Text 123 <A><I>Text In A Span</I></A> - Text $ Abc

trim

The trim pipe trims the given string. This pipe is automatically applied to the end result.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

trim

Output

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

ucfirst

The ucfirst pipe returns the given string with the first character capitalized.

Finding in the previous step

midesk is great

Expression

ucfirst

Output

Midesk is great

upper

The upper pipe converts the given string to uppercase.

Finding in the previous step

midesk is great

Expression

upper

Output

MIDESK IS GREAT

wordCount

The wordCount pipe returns the number of words that a string contains.

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

wordCount

Output

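xPath

The xPath pipe evaluates the given XPath expression against the string and returns the content of the matching element.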

Finding in the previous step

some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

Expression

xPath://i

Output

Text in a span
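The query above can be reproduced with Python's ElementTree, shown here as a sketch. Wrapping the text in a root element is an assumption needed to make the fragment parseable, and ElementTree writes the //i query as .//i.

```python
import xml.etree.ElementTree as ET

finding = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"
# Wrap the fragment so that it has a single root element.
root = ET.fromstring(f"<root>{finding}</root>")
matches = [node.text for node in root.findall(".//i")]
print(matches)
```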