Processing steps

Overview

Processing steps are specific methods used to extract data within Monitoring Tasks, such as xPath, JSON, and Regex. Each monitoring task may have multiple processing steps, determining how to extract data from different sources. In this section, you’ll learn how to set up, manage, and modify processing steps to ensure precise and efficient data extraction, ultimately improving your competitive intelligence strategy.

In detail

Processing steps are functions that run every single time you scrape a website/API. They help you bend and transform monitoring results into standardized data.

To build a Market Intelligence monitoring pipeline, Midesk offers many ways to process monitoring results and extract the exact results you want.

Figure (Midesk Step XPath): A default view of the user interface with a processing step that extracts a title from a webpage.

Three flexible methods can help you: xPath, JSON, and Regex. Which one to use depends on the type of data you work with. If you work with HTML elements and websites, xPath is your friend. If you work with APIs or object structures, the JSON method will help you tremendously. If you work with unstructured text, use Regex.


Processing Methods

Midesk offers multiple ways to process, mold, and extract exactly the content you need. There are three key methods you can use:

xPath

Use case: Structured HTML or XML processing

Description: XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).

You can use the xPath method to not only process traditional websites but also XML files and RSS.
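As a rough illustration of what an xPath step does with an RSS feed, the sketch below runs a query against a small feed using Python's standard library. The feed content and the extract_titles helper are invented for this example, and Midesk's own XPath engine may support more of the language than xml.etree.ElementTree does.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet, invented for this illustration.
RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>"""


def extract_titles(xml_text):
    """Return the text of every <title> that sits directly inside an <item>."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; ".//item/title"
    # selects the title child of each item anywhere in the tree.
    return [node.text for node in root.findall(".//item/title")]


print(extract_titles(RSS))  # ['First post', 'Second post']
```

The channel-level title is skipped because the query only matches titles nested in an item.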

JSON

Use case: Structured API & Objects processing

Description: JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays. It is a very common data format with a diverse range of applications, such as serving as a replacement for XML in AJAX.
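To make the "API and objects" use case concrete, here is a minimal sketch of pulling values out of a JSON response in Python. The response body is made up for this example and does not come from any real API.

```python
import json

# Hypothetical API response, invented for this illustration.
RESPONSE = """{
  "company": {
    "name": "Example Corp",
    "offices": [{"city": "Hamburg"}, {"city": "Berlin"}]
  }
}"""

data = json.loads(RESPONSE)

# Objects become dictionaries and arrays become lists, so nested
# values are reached by chaining keys and iterating over lists.
name = data["company"]["name"]
cities = [office["city"] for office in data["company"]["offices"]]

print(name, cities)  # Example Corp ['Hamburg', 'Berlin']
```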

Regex

Use case: Unstructured text processing

Description: A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
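The following sketch shows the kind of extraction a Regex step is suited for: finding prices in free-form text. The sample text and the pattern are assumptions chosen for illustration. Note that Python patterns are written without the surrounding / delimiters.

```python
import re

# Hypothetical scraped text, invented for this illustration.
TEXT = "Price today: 1,299.00 EUR (was 1,499.00 EUR)"

# A digit, then any digits or thousands separators, a dot, two decimals.
PRICE = re.compile(r"\d[\d,]*\.\d{2}")

first_price = PRICE.search(TEXT).group(0)
all_prices = PRICE.findall(TEXT)

print(first_price)  # 1,299.00
print(all_prices)   # ['1,299.00', '1,499.00']
```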

JSON Processing Functions

Transform

Transform uses so-called “dot notation”, which lets you represent nested keys as text. Keys are separated by a dot (.).

Notation for simple elements

Finding in the previous step
{"city": "Hamburg", "country": "Germany"}
Expression
Location: ${city} (${country})
Output
[
  "Location: Hamburg (Germany)"
]

Notation for chunked elements

Finding in the previous step
[
  {
    "title": "Company",
    "text": "Strategy"
  },
  {
    "title": "Other company",
    "text": "Market Intelligence"
  }
]
Expression
Location: ${ title } SOME TEXT ${ text }
Output
[
  "Location: Company SOME TEXT Strategy",
  "Location: Other company SOME TEXT Market Intelligence"
]
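The placeholder behavior described above can be sketched in a few lines of Python. This is not Midesk's implementation: the transform and lookup functions are hypothetical, and the nested form ${address.city} is an assumption about how dot notation reaches nested keys.

```python
import re

# Matches ${ key } or ${ parent.child }, with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\$\{\s*([\w.]+)\s*\}")


def lookup(item, dotted_key):
    """Follow a dot-separated key through nested dictionaries."""
    value = item
    for part in dotted_key.split("."):
        value = value[part]
    return value


def transform(finding, expression):
    """Render the expression once per element; always returns a list."""
    items = finding if isinstance(finding, list) else [finding]
    results = []
    for item in items:
        rendered = PLACEHOLDER.sub(
            lambda m: str(lookup(item, m.group(1))), expression
        )
        results.append(rendered)
    return results


print(transform({"address": {"city": "Hamburg"}}, "City: ${address.city}"))
# ['City: Hamburg']
```

A single object yields a one-element list, while a list of objects yields one rendered string per element.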

Pipes

The transform method supports many pipes that help you work with the text. Pipes can be chained and perform operations from left to right. Pipes are denoted by a pipe | sign. Pipes may have one or two additional parameters which are separated by a colon : sign.

How to apply pipes

Finding in the previous step
{
  "city": "Hamburg",
  "country": "Germany"
}
Expression
Location: ${ city | upper } (${ country | upper })
Output
[
  "Location: HAMBURG (GERMANY)"
]
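To show how chaining works from left to right, here is a simplified sketch of a pipe evaluator. The PIPES registry and apply_pipes function are hypothetical and cover only four pipes. The parser is deliberately naive: it assumes parameters never contain | or : characters.

```python
# A tiny registry of pipes; each takes the current value plus parameters.
PIPES = {
    "upper": lambda s: s.upper(),
    "lower": lambda s: s.lower(),
    "append": lambda s, suffix: s + suffix,
    "replace": lambda s, old, new: s.replace(old, new),
}


def apply_pipes(value, chain):
    """Apply a chain such as "append : ' is great' | upper" left to right."""
    for step in chain.split("|"):
        # The first part is the pipe name, the rest are its parameters.
        name, *raw_args = [part.strip() for part in step.split(":")]
        # Drop the surrounding single quotes but keep spaces inside them.
        args = [arg.strip("'") for arg in raw_args]
        value = PIPES[name](value, *args)
    return value


print(apply_pipes("Midesk", "append : ' is great' | upper"))
# MIDESK IS GREAT
```

Swapping the order to "upper | append : ' is great'" would give "MIDESK is great", which is why the left-to-right rule matters.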

List of Pipes

after

The after pipe returns everything after the given value in a string. The entire string will be returned if the value does not exist within the string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
after : '123'
Output
Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

afterLast

The afterLast pipe returns everything after the last occurrence of the given value in a string. The entire string will be returned if the value does not exist within the string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
afterLast : '123'
Output
<a><i>Text in a span</i></a> - text $ Abc

append

The append pipe appends the given values to the string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
append : ' appended text'
Output
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc appended text

ascii

The ascii pipe will attempt to transliterate the string into an ASCII value.

Finding in the previous step
ûěščřžýáí
Expression
ascii
Output
uescrzyai

before

The before pipe returns everything before the given value in a string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
before : '123'
Output
some

beforeLast

The beforeLast pipe returns everything before the last occurrence of the given value in a string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
beforeLast : '123'
Output
some 123 Some random 123 text

between

The between pipe returns the portion of a string between two values.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
between : 'random' : '123'
Output
123 text

betweenFirst

The betweenFirst pipe returns the smallest possible portion of a string between two values.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
betweenFirst : '123' : '123'
Output
Some random
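The after, afterLast, before, beforeLast, between, and betweenFirst pipes are closely related, which a short Python sketch can make explicit. The function names are hypothetical. The sketch assumes that between runs from the first occurrence of the start value to the last occurrence of the end value, which is what the contrast with betweenFirst ("smallest possible portion") suggests. Results are stripped because trim is applied automatically to the end result.

```python
TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"


def after(s, needle):
    _, sep, tail = s.partition(needle)
    return tail if sep else s  # whole string if the value is missing


def after_last(s, needle):
    _, sep, tail = s.rpartition(needle)
    return tail if sep else s


def before(s, needle):
    head, sep, _ = s.partition(needle)
    return head if sep else s


def before_last(s, needle):
    head, sep, _ = s.rpartition(needle)
    return head if sep else s


def between(s, start, end):
    # Widest span: first start value up to the last end value.
    return before_last(after(s, start), end)


def between_first(s, start, end):
    # Narrowest span: first start value up to the next end value.
    return before(after(s, start), end)


print(between_first(TEXT, "123", "123").strip())  # Some random
print(between(TEXT, "random", "123").strip())     # 123 text
```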

camel

The camel pipe converts the given string to camelCase.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
camel
Output
some123SomeRandom123Text123<a><i>TextInASpan</i></a>Text$Abc

cast

The cast pipe transforms the provided value into one of the following types: text, date, number, boolean.

Finding in the previous step
Wed Sep 14 2022 19:29:50 GMT+0000
Expression
cast : 'text'
Output
Wed Sep 14 2022 19:29:50 GMT+0000
Finding in the previous step
Wed Sep 14 2022 19:29:50 GMT+0000
Expression
cast : 'number'
Output
0
Finding in the previous step
Wed Sep 14 2022 19:29:50 GMT+0000
Expression
cast : 'boolean'
Output
1
Finding in the previous step
Wed Sep 14 2022 19:29:50 GMT+0000
Expression
cast : 'date'
Output
2022-09-14
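For the date case, the sketch below shows one way a timestamp in this shape could be reduced to a calendar date. The cast_date helper and the fixed format string are assumptions made for this example. How Midesk parses other date formats is not covered here.

```python
from datetime import datetime

RAW = "Wed Sep 14 2022 19:29:50 GMT+0000"


def cast_date(value):
    """Parse a 'Wed Sep 14 2022 19:29:50 GMT+0000' style string to YYYY-MM-DD."""
    # %z consumes the +0000 offset that follows the literal "GMT".
    parsed = datetime.strptime(value, "%a %b %d %Y %H:%M:%S GMT%z")
    return parsed.date().isoformat()


print(cast_date(RAW))  # 2022-09-14
```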

contains

The contains pipe determines if the given string contains the given value. This pipe is case sensitive:

Finding in the previous step
some text is here
Expression
contains : 'text'
Output
1
Finding in the previous step
some text is here
Expression
contains : 'NOT THERE'
Output
0

decode

The decode pipe converts HTML entities to their corresponding characters.

Finding in the previous step
&lt;a href=&#039;http://midesk.co&#039;&gt;&lt;/a&gt;
Expression
decode
Output
<a href='http://midesk.co'></a>
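Entity decoding of this kind is available in Python's standard library, which makes for a quick way to check what a decode step should produce. This is only an equivalent sketch, not the code Midesk runs.

```python
import html

ENCODED = "&lt;a href=&#039;http://midesk.co&#039;&gt;&lt;/a&gt;"

# html.unescape converts named and numeric entities back to characters.
decoded = html.unescape(ENCODED)

print(decoded)  # <a href='http://midesk.co'></a>
```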

endsWith

The endsWith pipe determines if the given string ends with the given value.

Finding in the previous step
some/path/
Expression
endsWith : '/'
Output
1
Finding in the previous step
some/path
Expression
endsWith : '/'
Output
0

finish

The finish pipe adds a single instance of the given value to a string if it does not already end with that value.

Finding in the previous step
some/path/
Expression
finish : '/'
Output
some/path/
Finding in the previous step
some/path
Expression
finish : '/'
Output
some/path/

is

The is pipe determines if a given string matches a given pattern. Asterisks may be used as wildcard values.

Finding in the previous step
random
Expression
is : 'ran*'
Output
1
Finding in the previous step
random
Expression
is : 'random'
Output
1
Finding in the previous step
random
Expression
is : 'nomatch'
Output
0
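Wildcard matching of this kind resembles shell-style globbing, so a rough Python equivalent can be built on fnmatch. The is_match helper is hypothetical. Note that fnmatch also gives special meaning to ? and [...], which the is pipe may not do.

```python
from fnmatch import fnmatchcase


def is_match(value, pattern):
    """Return 1 if the whole value matches the pattern, else 0."""
    # fnmatchcase is case sensitive and treats * as "any characters".
    return 1 if fnmatchcase(value, pattern) else 0


print(is_match("random", "ran*"))     # 1
print(is_match("random", "random"))   # 1
print(is_match("random", "nomatch"))  # 0
```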

isAscii

The isAscii pipe determines if a given string is an ASCII string.

Finding in the previous step
Midesk
Expression
isAscii
Output
1
Finding in the previous step
Česko
Expression
isAscii
Output
0

isUuid

The isUuid pipe determines if the given string is a valid UUID.

Finding in the previous step
4a601006-5d47-44ac-a316-bd49447fac61
Expression
isUuid
Output
1
Finding in the previous step
midesk
Expression
isUuid
Output
0

kebab

The kebab pipe converts the given string to kebab-case.

Finding in the previous step
Midesk is great
Expression
kebab
Output
midesk-is-great

lcfirst

The lcfirst pipe returns the given string with the first character lowercased.

Finding in the previous step
Midesk Is Great
Expression
lcfirst
Output
midesk Is Great

length

The length pipe returns the length of the given string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
length
Output
75

limit

The limit pipe truncates the given string to the specified length.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
limit : '5'
Output
some

lower

The lower pipe converts the given string to lowercase.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
lower
Output
some 123 some random 123 text 123 <a><i>text in a span</i></a> - text $ abc

ltrim

The ltrim pipe trims the left side of the string. This pipe is automatically applied to the end result.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
ltrim
Output
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

match

The match pipe will return the portion of a string that matches a given regular expression pattern.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
match : '/random (.*) text/'
Output
123 text 123 <a><i>Text in a span</i></a> -

matchAll

The matchAll pipe will return a collection containing the portions of a string that match a given regular expression pattern.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
matchAll : '/(.*) text/'
Output
["some 123 Some random 123 text 123 <a><i>Text in a span<\/i><\/a> -"]
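The outputs of match and matchAll above are long because (.*) is greedy: it extends to the last place where " text" can still follow. The Python sketch below reproduces that and shows the lazy alternative (.*?). The helper names are hypothetical, and the sketch assumes the pipes return the first capture group when the pattern has one. Python patterns are written without the / delimiters.

```python
import re

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"


def match(s, pattern):
    """Return the first capture group if present, else the whole match."""
    found = re.search(pattern, s)
    if found is None:
        return ""
    return found.group(1) if found.groups() else found.group(0)


def match_all(s, pattern):
    """Return every non-overlapping match (the group, if there is one)."""
    return re.findall(pattern, s)


print(match(TEXT, r"random (.*) text"))   # greedy: runs to the last " text"
print(match(TEXT, r"random (.*?) text"))  # lazy: 123
print(match_all(TEXT, r"(.*) text"))
```

Because the last lowercase " text" sits near the end of the string, the greedy pattern in match_all consumes almost everything in one match and leaves nothing for a second one.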

md5

The md5 pipe uses a cryptographically broken but still widely used hash function producing a 128-bit hash value.

Finding in the previous step
Midesk
Expression
md5
Output
8d658da2a7bcb1d515993d51ed030c4b

padBoth

The padBoth pipe pads both sides of a string with another string until the final string reaches the desired length.

Finding in the previous step
Midesk
Expression
padBoth : '10' : '_' 
Output
__Midesk__

padLeft

The padLeft pipe pads the left side of a string with another string until the final string reaches the desired length.

Finding in the previous step
Midesk
Expression
padLeft : '10' : '_' 
Output
____Midesk

padRight

The padRight pipe pads the right side of a string with another string until the final string reaches the desired length.

Finding in the previous step
Midesk
Expression
padRight : '10' : '_' 
Output
Midesk____
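The three padding pipes map closely onto Python's built-in string methods, shown here as a quick way to reason about the expected result. One difference to keep in mind: the Python methods accept only a single fill character, and the split of an odd amount of padding in center may not match padBoth.

```python
NAME = "Midesk"

pad_both = NAME.center(10, "_")   # padding split across both sides
pad_left = NAME.rjust(10, "_")    # padding added on the left
pad_right = NAME.ljust(10, "_")   # padding added on the right

print(pad_both)   # __Midesk__
print(pad_left)   # ____Midesk
print(pad_right)  # Midesk____
```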

plural

The plural pipe converts a singular word string to its plural form. This function currently only supports the English language. You may provide an integer as a first parameter to retrieve the singular or plural form of the string:

Finding in the previous step
car
Expression
plural
Output
cars
Finding in the previous step
car
Expression
plural:1
Output
car
Finding in the previous step
car
Expression
plural:20
Output
cars

prepend

The prepend pipe prepends the given values onto the string.

Finding in the previous step
is great
Expression
prepend : 'Midesk '
Output
Midesk is great

remove

The remove pipe removes the given value or array of values from the string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
remove : 'e'
Output
som 123 Som random 123 txt 123 <a><i>Txt in a span</i></a> - txt $ Abc

replace

The replace pipe replaces all occurrences of a given value within the string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
replace : '123' : 'ABC' 
Output
some ABC Some random ABC text ABC <a><i>Text in a span</i></a> - text $ Abc

replaceFirst

The replaceFirst pipe replaces the first occurrence of a given value in a string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
replaceFirst : '123' : 'ABC' 
Output
some ABC Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

replaceLast

The replaceLast pipe replaces the last occurrence of a given value in a string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
replaceLast : '123' : 'ABC' 
Output
some 123 Some random 123 text ABC <a><i>Text in a span</i></a> - text $ Abc

replaceMatches

The replaceMatches pipe replaces all portions of a string matching a pattern with the given replacement string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
replaceMatches : '/[^A-Za-z0-9]++/' : '' 
Output
some123Somerandom123text123aiTextinaspaniatextAbc
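The four replace pipes differ only in which occurrences they touch. The sketch below restates them in Python with hypothetical helper names. The regular expression uses a plain + quantifier instead of the possessive ++ so that it runs on older Python versions; for this pattern the result is the same.

```python
import re

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"


def replace_first(s, old, new):
    return s.replace(old, new, 1)  # count=1 limits it to the first hit


def replace_last(s, old, new):
    head, sep, tail = s.rpartition(old)
    return head + new + tail if sep else s


def replace_matches(s, pattern, new):
    return re.sub(pattern, new, s)


print(TEXT.replace("123", "ABC"))  # every occurrence
print(replace_first(TEXT, "123", "ABC"))
print(replace_last(TEXT, "123", "ABC"))
print(replace_matches(TEXT, r"[^A-Za-z0-9]+", ""))
```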

rtrim

The rtrim pipe trims the right side of the given string. This pipe is automatically applied to the end result.

Finding in the previous step
Midesk  
Expression
rtrim
Output
Midesk
Finding in the previous step
Midesk/
Expression
rtrim : '/'
Output
Midesk

singular

The singular pipe converts a string to its singular form. This function currently only supports the English language.

Finding in the previous step
cars
Expression
singular
Output
car
Finding in the previous step
cheese
Expression
singular
Output
cheese
Finding in the previous step
the cups
Expression
singular
Output
the cup

slug

The slug pipe generates a URL friendly "slug" from the given string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
slug
Output
some-123-some-random-123-text-123-aitext-in-a-spania-text-abc

snake

The snake pipe converts the given string to snake_case.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
snake
Output
some123_some_random123_text123<a><i>_text_in_a_span</i></a>-_text$_abc

squish

The squish pipe removes all extraneous white space from a string, including extraneous white space between words.

Finding in the previous step
Midesk     is     great
Expression
squish
Output
Midesk is great
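In Python the same normalization is a common one-liner, shown here as a sketch of the behavior rather than Midesk's implementation. The squish function name simply mirrors the pipe.

```python
def squish(text):
    """Collapse runs of white space and trim both ends."""
    # split() with no arguments splits on any white space run
    # and discards empty pieces, so joining restores single spaces.
    return " ".join(text.split())


print(squish("  Midesk     is     great  "))  # Midesk is great
```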

start

The start pipe adds a single instance of the given value to a string if it does not already start with that value.

Finding in the previous step
/some/path/
Expression
start : '/'
Output
/some/path/
Finding in the previous step
some/path
Expression
start : '/'
Output
/some/path

startsWith

The startsWith pipe determines if the given string begins with the given value.

Finding in the previous step
/some/path
Expression
startsWith : '/'
Output
1
Finding in the previous step
some/path
Expression
startsWith : '/'
Output
0

stripTags

The stripTags pipe strips HTML and PHP tags from the given string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
stripTags
Output
some 123 Some random 123 text 123 Text in a span - text $ Abc
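A crude approximation of tag stripping is a regular expression that removes anything between angle brackets. The strip_tags helper below is a hypothetical sketch. It is fine for simple fragments like the one above but is not a robust HTML parser, for example when attribute values contain a > character.

```python
import re

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"

# "<", then one or more characters that are not ">", then ">".
TAG = re.compile(r"<[^>]+>")


def strip_tags(s):
    return TAG.sub("", s)


print(strip_tags(TEXT))
# some 123 Some random 123 text 123 Text in a span - text $ Abc
```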

studly

The studly pipe converts the given string to StudlyCase.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
studly
Output
Some123SomeRandom123Text123<a><i>TextInASpan</i></a>Text$Abc

substr

The substr pipe returns the portion of the string specified by the given start and length parameters.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
substr : '10' : '5' 
Output
ome r

substrCount

The substrCount pipe returns the number of occurrences of a given value in the given string.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
substrCount : '123'
Output
3
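The positions used by substr are zero-based, which is easy to verify with Python slicing. The sketch below, with a hypothetical substr helper, also checks the length and substrCount results for the same sample string.

```python
TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"


def substr(s, start, length):
    """Return `length` characters beginning at zero-based index `start`."""
    return s[start:start + length]


print(substr(TEXT, 10, 5))  # ome r
print(len(TEXT))            # 75
print(TEXT.count("123"))    # 3
```

Index 10 is the "o" of "Some", because counting starts at 0 with the first "s".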

title

The title pipe converts the given string to Title Case.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
title
Output
Some 123 Some Random 123 Text 123 <A><I>Text In A Span</I></A> - Text $ Abc

trim

The trim pipe trims the given string. This pipe is automatically applied to the end result.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
trim
Output
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc

ucfirst

The ucfirst pipe returns the given string with the first character capitalized.

Finding in the previous step
midesk is great
Expression
ucfirst
Output
Midesk is great

upper

The upper pipe converts the given string to uppercase.

Finding in the previous step
midesk is great
Expression
upper
Output
MIDESK IS GREAT

wordCount

The wordCount pipe returns the number of words that a string contains.

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
wordCount
Output
15

xPath

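The xPath pipe applies an XPath expression to the string and returns the content of the matching elements.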

Finding in the previous step
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
Expression
xPath://i
Output
Text in a span
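A comparable lookup can be sketched with Python's standard library. Two assumptions are made here: the fragment is wrapped in a made-up root element so that it parses as XML, and the query is written as .//i because ElementTree does not accept absolute paths such as //i on an element.

```python
import xml.etree.ElementTree as ET

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"


def xpath_text(fragment, query):
    """Return the text of every element in the fragment matching the query."""
    # Wrapping gives the parser the single root element that XML requires.
    root = ET.fromstring(f"<root>{fragment}</root>")
    return [node.text for node in root.findall(query)]


print(xpath_text(TEXT, ".//i"))  # ['Text in a span']
```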