Component: Parser

The Parser component allows plugins to provide document parsing capabilities for LangBot. When a user uploads a document to a knowledge base, LangBot invokes the parser before the Knowledge Engine’s ingest step to extract structured text from files such as PDF, Word, and Markdown documents.

Relationship between Parser and KnowledgeEngine:

Parser is responsible for converting files to text (file → text)
KnowledgeEngine is responsible for indexing and retrieving text (text → chunks → vectors)

If the Knowledge Engine has native document parsing capabilities (that is, it declares the DOC_PARSING capability), users can choose between the Knowledge Engine’s built-in parsing and an external Parser plugin.

Adding a Parser Component

A single plugin can add any number of parsers. Execute the command lbp comp Parser in the plugin directory and follow the prompts to enter the parser configuration.

➜  MyParserPlugin > lbp comp Parser
Generating component Parser...
Parser name: pdf_parser
Parser description: A PDF document parser
Component Parser generated successfully.

This will generate pdf_parser.yaml and pdf_parser.py files in the components/parser/ directory. The .yaml file defines the parser’s basic information and supported MIME types, and the .py file is the handler for this parser:

➜  MyParserPlugin > tree
...
├── components
│   ├── __init__.py
│   └── parser
│       ├── __init__.py
│       ├── pdf_parser.py
│       └── pdf_parser.yaml
...

Manifest File: Parser

apiVersion: v1  # Do not modify
kind: Parser  # Do not modify
metadata:
  name: pdf_parser  # Parser name, used to identify this parser
  label:
    en_US: PDF Parser  # Parser display name, shown in LangBot's UI, supports multiple languages
    zh_Hans: PDF 解析器
  description:
    en_US: 'A PDF document parser'
    zh_Hans: 'PDF 文档解析器'
spec:
  supported_mime_types:  # Declare supported file MIME types
    - application/pdf
execution:
  python:
    path: pdf_parser.py  # Parser handler, do not modify
    attr: PdfParser  # Class name of the parser handler, consistent with the class name in pdf_parser.py

supported_mime_types

supported_mime_types declares the file types this parser supports. Common MIME types:

MIME Type	Description
`application/pdf`	PDF documents
`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Word documents (.docx)
`text/markdown`	Markdown files
`text/plain`	Plain text files
`text/html`	HTML files
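
As a rough illustration of how a declared list can be matched against an uploaded file's MIME type, the sketch below compares the two after light normalization. The helper name supports and the normalization rules (ignoring case and parameters such as charset) are assumptions made for this example, not LangBot's actual matching logic.

```python
def supports(supported_mime_types: list[str], mime_type: str) -> bool:
    """Return True if mime_type matches one of the declared types.

    Hypothetical helper: it ignores case and drops parameters such as
    "; charset=utf-8" before comparing.
    """
    normalized = mime_type.split(";", 1)[0].strip().lower()
    declared = {t.strip().lower() for t in supported_mime_types}
    return normalized in declared


# Types as they might appear under spec.supported_mime_types
declared_types = ["application/pdf", "text/markdown"]

print(supports(declared_types, "application/pdf"))
print(supports(declared_types, "Text/Markdown; charset=utf-8"))
print(supports(declared_types, "text/html"))
```

Declaring only the types the handler can really process keeps the parser out of the selector for files it would fail on.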

Plugin Handler

The following code is generated by default in components/parser/<parser_name>.py. The generated parse method only decodes the file as UTF-8 text, so you need to replace it with real parsing logic for your file format.

from langbot_plugin.api.definition.components.parser.parser import Parser
from langbot_plugin.api.entities.builtin.rag.models import (
    ParseContext,
    ParseResult,
    TextSection,
)


class PdfParser(Parser):
    """Parser component for extracting text from files."""

    async def parse(self, context: ParseContext) -> ParseResult:
        """Parse a file and extract structured text.

        Args:
            context: Contains file_content (bytes), mime_type, filename, and metadata.

        Returns:
            ParseResult with extracted text and optional structured sections.
        """
        # TODO: Implement parsing logic
        text = context.file_content.decode("utf-8", errors="replace")

        return ParseResult(
            text=text,
            sections=[
                TextSection(
                    content=text,
                    heading=context.filename,
                    level=0,
                ),
            ],
            metadata={
                "filename": context.filename,
                "mime_type": context.mime_type,
            },
        )

Parse Method

The parse method is called when a document is uploaded to a knowledge base (before Knowledge Engine ingestion):

async def parse(self, context: ParseContext) -> ParseResult:

ParseContext contains the following information:

class ParseContext(pydantic.BaseModel):
    file_content: bytes          # Raw file bytes (read by LangBot from storage)
    mime_type: str               # Detected MIME type of the file
    filename: str                # Original filename
    metadata: dict[str, Any]     # Extra metadata from FileObject
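
Because file_content arrives as raw bytes, a parser for text-based formats has to choose an encoding before it can do anything else. The sketch below is one possible approach, assuming that honoring a byte order mark and falling back to lenient UTF-8 is acceptable. The function name decode_text is hypothetical and is not part of the SDK.

```python
import codecs


def decode_text(data: bytes) -> str:
    """Decode raw file bytes to text (hypothetical helper).

    A byte order mark (BOM), if present, selects the encoding.
    Otherwise the bytes are treated as UTF-8, and undecodable bytes
    are replaced with U+FFFD, as in the generated scaffold.
    """
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the BOM from the decoded text.
        return data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16")
    return data.decode("utf-8", errors="replace")


print(decode_text(b"\xef\xbb\xbfhello"))
print(decode_text("héllo".encode("utf-16")))
```

Inside parse, such a helper would be applied to context.file_content. Binary formats such as PDF need a dedicated library instead.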

The parse method should return a ParseResult containing the parsing result:

class ParseResult(pydantic.BaseModel):
    text: str                            # Full extracted plain text
    sections: list[TextSection] = []     # Structured sections (optional)
    metadata: dict[str, Any] = {}        # Parsing metadata (e.g., page_count, language)

TextSection represents a section of text extracted from the document:

class TextSection(pydantic.BaseModel):
    content: str                   # Section text content
    heading: str | None = None     # Section heading
    level: int = 0                 # Nesting level
    page: int | None = None        # Source page number (for PDF, etc.)
    metadata: dict[str, Any] = {}  # Additional section metadata

Integration with KnowledgeEngine

When a user uploads a document, LangBot determines the parsing flow as follows:

If the user selects an external Parser plugin, LangBot first calls the Parser’s parse method, then passes the result to the Knowledge Engine’s ingest method via IngestionContext.parsed_content.
If the Knowledge Engine declares DOC_PARSING capability and the user does not select an external parser, the Knowledge Engine handles document parsing on its own.

KnowledgeEngine can check IngestionContext.parsed_content to determine whether pre-parsed content is available:

async def ingest(self, context: IngestionContext) -> IngestionResult:
    if context.parsed_content:
        # Use pre-parsed content from external Parser
        text = context.parsed_content.text
        sections = context.parsed_content.sections
    else:
        # No pre-parsed content: parse the document internally
        file_bytes = await self.plugin.get_knowledge_file_stream(context.file_object.storage_path)
        text = file_bytes.decode('utf-8')
        sections = []  # No structured sections are available on this path
    ...
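
Once text is available from either branch, the Knowledge Engine still has to perform the text → chunks step. The sketch below shows one simple way to do that, using fixed-size character windows with overlap. This strategy, the function name chunk_text, and the default sizes are assumptions for illustration. The source does not prescribe how a Knowledge Engine chunks text.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")

    step = size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # This window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", size=5, overlap=2))
```

When sections are available, chunking each section's content separately is a natural alternative, since it keeps chunk boundaries aligned with the document's structure.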

Cross-Plugin Parser Invocation

Before invoking a parser from another plugin, you can use self.plugin.list_parsers to discover the parsers currently available on the host:

parsers = await self.plugin.list_parsers(mime_type="application/pdf")
# Each item includes plugin_id, plugin_author, plugin_name, name, description, supported_mime_types

If the list is empty, no connected Parser plugin currently supports that MIME type. After obtaining plugin_author and plugin_name, you can call self.plugin.invoke_parser:

parser = parsers[0]
result = await self.plugin.invoke_parser(
    plugin_author=parser["plugin_author"],
    plugin_name=parser["plugin_name"],
    storage_path=context.file_object.storage_path,
    mime_type=context.file_object.metadata.mime_type,
    filename=context.file_object.metadata.filename,
    metadata={},
)
# result is a dict containing text, sections, metadata

Testing the Parser

After creation, execute the command lbp run in the plugin directory to start debugging. Then in LangBot:

Go to the “Knowledge Base” page
Select a knowledge base and enter document management
When uploading a file, select your plugin’s parser in the parser selector
After uploading, verify that the document is correctly ingested
