Processing steps are specific methods used to extract data within Monitoring Tasks, such as xPath, JSON, and Regex. Each monitoring task may have multiple processing steps, determining how to extract data from different sources. In this section, you’ll learn how to set up, manage, and modify processing steps to ensure precise and efficient data extraction, ultimately improving your competitive intelligence strategy.
Processing steps are functions that run every single time you scrape a website/API. They help you bend and transform monitoring results into standardized data.
To build a Market Intelligence monitoring pipeline, Midesk offers several ways to process monitoring results and extract exactly the data you want.
A default view of the user interface with a processing step that extracts a title from a webpage.
There are three very flexible methods that can help you: xPath, JSON, and Regex. Which one to use depends on the type of data you work with. If you work with HTML elements and websites, xPath is your friend. If you work with APIs or object structures, the JSON method will help you tremendously. For unstructured text, use Regex.
Midesk offers multiple ways to process, mold, and extract exactly the content you need. There are three key methods you can use:
Use case: Structured HTML or XML processing
Description: XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).
You can use the xPath method to not only process traditional websites but also XML files and RSS.
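To give a feel for what an XPath query selects, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The RSS snippet and the variable names are made up for illustration and do not show how Midesk runs the query internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet, used only for illustration.
RSS = """<rss><channel>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(RSS)

# ".//item/title" selects every <title> element that is a child of an <item>.
titles = [node.text for node in root.findall(".//item/title")]
print(titles)  # ['First post', 'Second post']
```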
Use case: Structured API & Objects processing
Description: JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types. It is a very common data format with a diverse range of applications, such as serving as a replacement for XML in AJAX.
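As a small illustration of working with object structures, the Python sketch below parses a JSON document and reads nested values. The payload is a hypothetical API response invented for this example.

```python
import json

# Hypothetical API response, used only for illustration.
payload = '{"company": {"name": "Midesk", "tags": ["market", "intelligence"]}}'

data = json.loads(payload)

# Objects become dictionaries and arrays become lists.
name = data["company"]["name"]
first_tag = data["company"]["tags"][0]
print(name, first_tag)  # Midesk market
```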
Use case: Unstructured text processing
Description: A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
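The idea of a search pattern can be sketched with Python's re module. The sample text and the pattern are assumptions chosen for illustration. Regex dialects differ slightly between languages, so treat this as a sketch rather than the exact syntax Midesk expects.

```python
import re

# Hypothetical unstructured text, used only for illustration.
text = "Price: 129 EUR (was 149 EUR)"

# (\d+) captures one or more digits that are followed by " EUR".
first = re.search(r"(\d+) EUR", text)
all_prices = re.findall(r"(\d+) EUR", text)

print(first.group(1))  # 129
print(all_prices)      # ['129', '149']
```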
The transform method uses so-called “dot notation”, which lets you represent nested keys as text. Each key is separated by a dot (.) sign.
{"city": "Hamburg", "country": "Germany"}
Location: ${city} (${country})
[
"Location: Hamburg (Germany)"
]
[
{
"title": "Company",
"text": "Strategy"
},
{
"title": "Other company",
"text": "Market Intelligence"
}
]
Location: ${ title } SOME TEXT ${ text }
[
"Location: Company SOME TEXT Strategy",
"Location: Other company SOME TEXT Market Intelligence"
]
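The template behaviour shown in these examples can be approximated in a few lines of Python. This is only a sketch under assumptions: the function names render and lookup and the placeholder pattern are hypothetical, and the real transform method may handle missing keys and other edge cases differently.

```python
import re

# Matches placeholders such as ${city} or ${ address.city }.
PLACEHOLDER = re.compile(r"\$\{\s*([\w.]+)\s*\}")

def lookup(record, path):
    """Follow a dot-separated path through nested dictionaries."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value

def render(template, data):
    """Render the template once per record and return a list of strings."""
    records = data if isinstance(data, list) else [data]
    return [
        PLACEHOLDER.sub(lambda m, r=record: str(lookup(r, m.group(1))), template)
        for record in records
    ]

print(render("Location: ${city} (${country})", {"city": "Hamburg", "country": "Germany"}))
# ['Location: Hamburg (Germany)']
print(render("${ address.city }", {"address": {"city": "Hamburg"}}))
# ['Hamburg']
```

A single object yields one result and a list of objects yields one result per element, which mirrors the two examples.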
The transform method supports many pipes that help you work with the text. Pipes can be chained and perform operations from left to right. Pipes are denoted by a pipe | sign. Pipes may have one or two additional parameters which are separated by a colon : sign.
{
"city": "Hamburg",
"country": "Germany"
}
Location: ${ city | upper } (${ country | upper })
[
"Location: HAMBURG (GERMANY)"
]
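Left-to-right chaining with colon-separated parameters can be sketched as follows. The registry PIPES and the function apply_pipes are hypothetical names for this illustration. The naive splitting assumes that parameters contain no | or : characters, which the real parser presumably handles.

```python
# Hypothetical registry with a handful of pipes; the real set is much larger.
PIPES = {
    "upper": lambda s: s.upper(),
    "lower": lambda s: s.lower(),
    "append": lambda s, suffix: s + suffix,
    "replace": lambda s, old, new: s.replace(old, new),
}

def apply_pipes(value, chain):
    """Apply pipes such as "replace : 'a' : 'b' | upper" from left to right."""
    for step in chain.split("|"):
        # The first part is the pipe name, the rest are quoted parameters.
        name, *raw_args = [part.strip() for part in step.split(":")]
        args = [arg.strip("'") for arg in raw_args]
        value = PIPES[name](value, *args)
    return value

print(apply_pipes("Hamburg", "upper"))  # HAMBURG
print(apply_pipes("Hamburg", "replace : 'burg' : 'm' | upper | append : '!'"))  # HAMM!
```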
The after pipe returns everything after the given value in a string. The entire string will be returned if the value does not exist within the string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
after : '123'
Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
The afterLast pipe returns everything after the last occurrence of the given value in a string. The entire string will be returned if the value does not exist within the string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
afterLast : '123'
<a><i>Text in a span</i></a> - text $ Abc
The append pipe appends the given values to the string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
append : ' appended text'
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc appended text
The ascii pipe will attempt to transliterate the string into an ASCII value.
ûěščřžýáí
ascii
uescrzyai
The before pipe returns everything before the given value in a string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
before : '123'
some
The beforeLast pipe returns everything before the last occurrence of the given value in a string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
beforeLast : '123'
some 123 Some random 123 text
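The after, afterLast, before, and beforeLast pipes described above can be approximated with Python's str.partition and str.rpartition. The function names are hypothetical, and the sketch applies strip() to mimic the trimming of the end result. It returns the entire string when the value is missing, as stated for after and afterLast, and assumes the same for before and beforeLast.

```python
TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"

def after(s, value):
    """Everything after the first occurrence of value, or s if value is absent."""
    _, sep, tail = s.partition(value)
    return tail if sep else s

def after_last(s, value):
    """Everything after the last occurrence of value, or s if value is absent."""
    _, sep, tail = s.rpartition(value)
    return tail if sep else s

def before(s, value):
    """Everything before the first occurrence of value, or s if value is absent."""
    head, sep, _ = s.partition(value)
    return head if sep else s

def before_last(s, value):
    """Everything before the last occurrence of value, or s if value is absent."""
    head, sep, _ = s.rpartition(value)
    return head if sep else s

print(before(TEXT, "123").strip())       # some
print(before_last(TEXT, "123").strip())  # some 123 Some random 123 text
print(after_last(TEXT, "123").strip())   # <a><i>Text in a span</i></a> - text $ Abc
```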
The between pipe returns the portion of a string between two values.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
between : 'random' : '123'
123 text
The betweenFirst pipe returns the smallest possible portion of a string between two values.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
betweenFirst : '123' : '123'
Some random
The camel pipe converts the given string to camelCase.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
camel
some123SomeRandom123Text123<a><i>TextInASpan</i></a>Text$Abc
The cast pipe transforms the provided value into one of the following types: text, date, number, boolean.
Wed Sep 14 2022 19:29:50 GMT+0000
cast : 'text'
Wed Sep 14 2022 19:29:50 GMT+0000
Wed Sep 14 2022 19:29:50 GMT+0000
cast : 'number'
0
Wed Sep 14 2022 19:29:50 GMT+0000
cast : 'boolean'
1
Wed Sep 14 2022 19:29:50 GMT+0000
cast : 'date'
2022-09-14
The contains pipe determines if the given string contains the given value. This pipe is case-sensitive:
some text is here
contains : 'text'
1
some text is here
contains : 'NOT THERE'
0
The decode pipe converts HTML entities to their corresponding characters.
&lt;a href='http://midesk.co'&gt;&lt;/a&gt;
decode
<a href='http://midesk.co'></a>
The endsWith pipe determines if the given string ends with the given value.
some/path/
endsWith : '/'
1
some/path
endsWith : '/'
0
The finish pipe adds a single instance of the given value to a string if it does not already end with that value.
some/path/
finish : '/'
some/path/
some/path
finish : '/'
some/path/
The is pipe determines if a given string matches a given pattern. Asterisks may be used as wildcard values.
random
is : 'ran*'
1
random
is : 'random'
1
random
is : 'nomatch'
0
The isAscii pipe determines if a given string is an ASCII string.
Midesk
isAscii
1
Česko
isAscii
0
The isUuid pipe determines if the given string is a valid UUID.
4a601006-5d47-44ac-a316-bd49447fac61
isUuid
1
midesk
isUuid
0
The kebab pipe converts the given string to kebab-case.
Midesk is great
kebab
midesk-is-great
The lcfirst pipe returns the given string with the first character lowercased.
Midesk Is Great
lcfirst
midesk Is Great
The length pipe returns the length of the given string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
length
75
The limit pipe truncates the given string to the specified length.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
limit : '5'
some
The lower pipe converts the given string to lowercase.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
lower
some 123 some random 123 text 123 <a><i>text in a span</i></a> - text $ abc
The ltrim pipe trims the left side of the string. This pipe is automatically applied to the end result.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
ltrim
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
The match pipe will return the portion of a string that matches a given regular expression pattern.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
match : '/random (.*) text/'
123 text 123 <a><i>Text in a span</i></a> -
The matchAll pipe will return a collection containing the portions of a string that match a given regular expression pattern.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
matchAll : '/(.*) text/'
["some 123 Some random 123 text 123 <a><i>Text in a span<\/i><\/a> -"]
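The two results above can be reproduced with Python's re module, which helps when testing a pattern before pasting it into a processing step. The function name match is hypothetical, and Python patterns are written without the surrounding / delimiters. This sketch assumes that the first capture group is returned when the pattern contains one, which is what the match example suggests.

```python
import re

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"

def match(s, pattern):
    """Return the first capture group if the pattern has one, else the whole match."""
    m = re.search(pattern, s)
    if m is None:
        return ""
    return m.group(1) if m.groups() else m.group(0)

# (.*) is greedy, so it extends to the last " text" in the string.
print(match(TEXT, r"random (.*) text"))
# 123 text 123 <a><i>Text in a span</i></a> -

# findall returns every non-overlapping match as a list.
print(re.findall(r"(.*) text", TEXT))
# ['some 123 Some random 123 text 123 <a><i>Text in a span</i></a> -']
```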
The md5 pipe uses a cryptographically broken but still widely used hash function producing a 128-bit hash value.
Midesk
md5
8d658da2a7bcb1d515993d51ed030c4b
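For comparison, an MD5 digest can be computed locally with Python's hashlib. The sketch assumes the input is encoded as UTF-8 before hashing.

```python
import hashlib

# MD5 works on bytes, so the string is encoded first (UTF-8 is an assumption).
digest = hashlib.md5("Midesk".encode("utf-8")).hexdigest()

# The 128-bit hash is rendered as 32 hexadecimal characters.
print(len(digest))  # 32
```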
The padBoth pipe pads both sides of a string with another string until the final string reaches the desired length.
Midesk
padBoth : '10' : '_'
__Midesk__
The padLeft pipe pads the left side of a string with another string until the final string reaches the desired length.
Midesk
padLeft : '10' : '_'
____Midesk
The padRight pipe pads the right side of a string with another string until the final string reaches the desired length.
Midesk
padRight : '10' : '_'
Midesk____
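The three padding pipes correspond closely to Python's built-in str.center, str.rjust, and str.ljust, as this short sketch shows. When the padding cannot be split evenly, Python's placement of the extra character may differ from Midesk's.

```python
name = "Midesk"

# Pad to a total length of 10 with underscores.
pad_both = name.center(10, "_")
pad_left = name.rjust(10, "_")
pad_right = name.ljust(10, "_")

print(pad_both, pad_left, pad_right)  # __Midesk__ ____Midesk Midesk____
```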
The plural pipe converts a singular word string to its plural form. This function currently only supports the English language. You may provide an integer as a first parameter to retrieve the singular or plural form of the string:
car
plural
cars
car
plural:1
car
car
plural:20
cars
The prepend pipe prepends the given values onto the string.
is great
prepend : 'Midesk '
Midesk is great
The remove pipe removes the given value or array of values from the string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
remove : 'e'
som 123 Som random 123 txt 123 <a><i>Txt in a span</i></a> - txt $ Abc
The replace pipe replaces every occurrence of a given value within the string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
replace : '123' : 'ABC'
some ABC Some random ABC text ABC <a><i>Text in a span</i></a> - text $ Abc
The replaceFirst pipe replaces the first occurrence of a given value in a string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
replaceFirst : '123' : 'ABC'
some ABC Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
The replaceLast pipe replaces the last occurrence of a given value in a string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
replaceLast : '123' : 'ABC'
some 123 Some random 123 text ABC <a><i>Text in a span</i></a> - text $ Abc
The replaceMatches pipe replaces all portions of a string matching a pattern with the given replacement string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
replaceMatches : '/[^A-Za-z0-9]++/' : ''
some123Somerandom123text123aiTextinaspaniatextAbc
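The same cleanup can be tried out with Python's re.sub. The pattern here uses a plain + instead of the possessive ++, because possessive quantifiers are only available in newer Python versions. For this input the result is identical.

```python
import re

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"

# Remove every run of characters that are not ASCII letters or digits.
cleaned = re.sub(r"[^A-Za-z0-9]+", "", TEXT)
print(cleaned)  # some123Somerandom123text123aiTextinaspaniatextAbc
```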
The rtrim pipe trims the right side of the given string. This pipe is automatically applied to the end result.
Midesk
rtrim
Midesk
Midesk/
rtrim : '/'
Midesk
The singular pipe converts a string to its singular form. This function currently only supports the English language.
cars
singular
car
cheese
singular
cheese
the cups
singular
the cup
The slug pipe generates a URL friendly "slug" from the given string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
slug
some-123-some-random-123-text-123-aitext-in-a-spania-text-abc
The snake pipe converts the given string to snake_case.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
snake
some123_some_random123_text123<a><i>_text_in_a_span</i></a>-_text$_abc
The squish pipe removes all extraneous white space from a string, including extraneous white space between words.
Midesk is great
squish
Midesk is great
The start pipe adds a single instance of the given value to a string if it does not already start with that value.
/some/path/
start : '/'
/some/path/
some/path
start : '/'
/some/path
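The finish and start pipes follow the same "add it only if it is missing" logic, sketched below in Python. The function names are hypothetical. How repeated trailing or leading values such as "path//" are treated is not specified here, so the sketch leaves them untouched.

```python
def finish(s, cap):
    """Append cap unless the string already ends with it."""
    return s if s.endswith(cap) else s + cap

def start(s, prefix):
    """Prepend prefix unless the string already starts with it."""
    return s if s.startswith(prefix) else prefix + s

print(finish("some/path", "/"))   # some/path/
print(finish("some/path/", "/"))  # some/path/
print(start("some/path", "/"))    # /some/path
print(start("/some/path/", "/"))  # /some/path/
```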
The startsWith pipe determines if the given string begins with the given value.
/some/path
startsWith : '/'
1
some/path
startsWith : '/'
0
The stripTags pipe strips HTML and PHP tags from the given string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
stripTags
some 123 Some random 123 text 123 Text in a span - text $ Abc
The studly pipe converts the given string to StudlyCase.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
studly
Some123SomeRandom123Text123<a><i>TextInASpan</i></a>Text$Abc
The substr pipe returns the portion of the string specified by the given start and length parameters.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
substr : '10' : '5'
ome r
The substrCount pipe returns the number of occurrences of a given value in the given string.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
substrCount : '123'
3
The title pipe converts the given string to Title Case.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
title
Some 123 Some Random 123 Text 123 <A><I>Text In A Span</I></A> - Text $ Abc
The trim pipe trims the given string. This pipe is automatically applied to the end result.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
trim
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
The ucfirst pipe returns the given string with the first character capitalized.
midesk is great
ucfirst
Midesk is great
The upper pipe converts the given string to uppercase.
midesk is great
upper
MIDESK IS GREAT
The wordCount pipe returns the number of words that a string contains.
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
wordCount
15
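The xPath pipe evaluates the given XPath expression against the string and returns the content of the matching node.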
some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc
xPath://i
Text in a span
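A comparable lookup can be reproduced in Python with xml.etree.ElementTree. Because that parser needs a single root element, the sketch wraps the fragment in a hypothetical root tag, and the expression is written as .//i because ElementTree supports only relative paths. How Midesk parses fragments internally is not shown here.

```python
import xml.etree.ElementTree as ET

TEXT = "some 123 Some random 123 text 123 <a><i>Text in a span</i></a> - text $ Abc"

# Wrap the fragment so that it forms a well-formed XML document.
root = ET.fromstring("<root>" + TEXT + "</root>")

# ".//i" finds the first <i> element anywhere below the root.
node = root.find(".//i")
print(node.text)  # Text in a span
```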