Parse
type: "io.kestra.plugin.tika.Parse"
Parse a document and extract its content and metadata.
Examples
Extract text from a file.
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false
Extract text from an image using OCR.
id: tika_parse
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true
Properties
contentType
- Type: string
- Dynamic: ❌
- Required: ❌
- Default: XHTML
- Possible Values: TEXT, XHTML, XHTML_NO_HEADER
The content type of the extracted text.
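As a minimal sketch of how this property might be set, the flow below requests plain text instead of the default XHTML. The flow id is illustrative; the input and task layout are reused from the examples above.

```yaml
id: tika_parse_text
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # TEXT is one of the documented possible values; XHTML is the default.
    contentType: TEXT
```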
extractEmbedded
- Type: boolean
- Dynamic: ❌
- Required: ❌
- Default: false
Whether to extract embedded documents.
from
- Type: string
- Dynamic: ✔️
- Required: ❌
The file to parse.
Must be an internal storage URI.
ocrOptions
- Type: Parse-OcrOptions
- Dynamic: ❌
- Required: ❌
- Default: {strategy=NO_OCR}
Custom options for OCR processing.
You need to install Tesseract to enable OCR processing.
store
- Type: boolean
- Dynamic: ❌
- Required: ❌
- Default: true
Whether to store the parsed result in an Ion-serialized data file in Kestra internal storage.
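The sketch below shows one way the parsed content might be consumed inline when storage is disabled. It assumes that with `store: false` the parsed document is exposed under the `result` output, and that the core `io.kestra.plugin.core.log.Log` task is available in your Kestra version; the flow id and the `log_content` task id are illustrative.

```yaml
id: tika_parse_inline
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # Keep the parsed result in the task output instead of writing a file.
    store: false

  - id: log_content
    type: io.kestra.plugin.core.log.Log
    # Assumption: `result.content` holds the extracted text when store is false.
    message: "{{ outputs.parse.result.content }}"
```

Inline results travel with the execution context, so this layout is probably best kept for small documents.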
Outputs
result
- Type: Parse-Parsed
- Required: ❌
uri
- Type: string
- Required: ❌
- Format: uri
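As a rough illustration of the `uri` output, the flow below leaves `store` at its default of `true` and prints the internal storage URI of the stored file. It assumes that `uri` is the output populated when `store` is `true`, and that the core `io.kestra.plugin.core.log.Log` task is available; the flow id and the `log_uri` task id are illustrative.

```yaml
id: tika_parse_stored
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    # Default behaviour, written out explicitly for clarity.
    store: true

  - id: log_uri
    type: io.kestra.plugin.core.log.Log
    # Assumption: `uri` points at the stored file in Kestra internal storage.
    message: "{{ outputs.parse.uri }}"
```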
Definitions
io.kestra.plugin.tika.Parse-OcrOptions
Properties
enableImagePreprocessing
- Type: boolean
- Dynamic: ❌
- Required: ❌
Whether to enable image preprocessing.
Apache Tika preprocesses images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract, provided that the required dependencies are installed and this option is enabled.
language
- Type: string
- Dynamic: ✔️
- Required: ❌
Language used for OCR.
strategy
- Type: string
- Dynamic: ❌
- Required: ❌
- Default: NO_OCR
- Possible Values: AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
The OCR strategy to use.
You need to install Tesseract, along with a Tesseract language pack, to enable OCR processing.
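The OCR options above could be combined as in the following sketch. It assumes Tesseract is installed on the worker and that `language` accepts a Tesseract language code; `eng` is used here on the assumption that the English language pack is present. The flow id is illustrative.

```yaml
id: tika_parse_ocr
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    ocrOptions:
      # Run OCR only, skipping regular text extraction.
      strategy: OCR_ONLY
      # Assumption: a Tesseract language code with its pack installed.
      language: eng
      # Optional; relies on the preprocessing dependencies being installed.
      enableImagePreprocessing: true
```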
io.kestra.plugin.tika.Parse-Parsed
Properties
content
- Type: string
- Dynamic: ❓
- Required: ❓
embedded
- Type: object
- SubType: string
- Dynamic: ❓
- Required: ❓
metadata
- Type: object
- Dynamic: ❓
- Required: ❓
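As a sketch of how the fields of a parsed result might be read downstream, the flow below logs a single metadata entry. It assumes `store: false` so that `result` is available inline, that the core `io.kestra.plugin.core.log.Log` task exists in your Kestra version, and that the parsed metadata contains a `Content-Type` key (a common Tika metadata name, not confirmed by this page). The flow id and the `log_metadata` task id are illustrative.

```yaml
id: tika_parse_metadata
namespace: company.team

inputs:
  - id: file
    type: FILE

tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    extractEmbedded: true
    store: false

  - id: log_metadata
    type: io.kestra.plugin.core.log.Log
    # Assumption: the metadata object includes a `Content-Type` entry.
    message: "{{ outputs.parse.result.metadata['Content-Type'] }}"
```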