The Parser component allows plugins to provide document parsing capabilities for LangBot. When a user uploads a document to a knowledge base, LangBot invokes the parser before the Knowledge Engine’s ingest step to extract structured text from files such as PDF, Word, and Markdown documents.
Relationship between Parser and KnowledgeEngine:
  • Parser is responsible for converting files to text (file → text)
  • KnowledgeEngine is responsible for indexing and retrieving text (text → chunks → vectors)
If the Knowledge Engine already has native document parsing capabilities (that is, it declares the DOC_PARSING capability), users can choose between the Knowledge Engine’s built-in parsing and an external Parser plugin.

Adding a Parser Component

A single plugin can add any number of parsers. Execute the command lbp comp Parser in the plugin directory and follow the prompts to enter the parser configuration.
  MyParserPlugin > lbp comp Parser
Generating component Parser...
Parser name: pdf_parser
Parser description: A PDF document parser
Component Parser generated successfully.
This will generate pdf_parser.yaml and pdf_parser.py files in the components/parser/ directory. The .yaml file defines the parser’s basic information and supported MIME types, and the .py file is the handler for this parser:
  MyParserPlugin > tree
...
├── components
│   ├── __init__.py
│   └── parser
│       ├── __init__.py
│       ├── pdf_parser.py
│       └── pdf_parser.yaml
...

Manifest File: Parser

apiVersion: v1  # Do not modify
kind: Parser  # Do not modify
metadata:
  name: pdf_parser  # Parser name, used to identify this parser
  label:
    en_US: PDF Parser  # Parser display name, shown in LangBot's UI, supports multiple languages
    zh_Hans: PDF 解析器
  description:
    en_US: 'A PDF document parser'
    zh_Hans: 'PDF 文档解析器'
spec:
  supported_mime_types:  # Declare supported file MIME types
    - application/pdf
execution:
  python:
    path: pdf_parser.py  # Parser handler, do not modify
    attr: PdfParser  # Class name of the parser handler, consistent with the class name in pdf_parser.py

supported_mime_types

supported_mime_types declares the file types this parser supports. Common MIME types:
MIME Type | Description
application/pdf | PDF documents
application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents (.docx)
text/markdown | Markdown files
text/plain | Plain text files
text/html | HTML files
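To illustrate how declared MIME types relate to uploaded files, here is a minimal sketch that maps file extensions to the types in the table and checks them against a parser's supported_mime_types list. The EXTENSION_TO_MIME table and both helper functions are hypothetical and not part of LangBot, which performs its own MIME detection.

```python
from __future__ import annotations

from pathlib import PurePath

# Hypothetical lookup table built from the MIME types listed above.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".html": "text/html",
}


def guess_mime_type(filename: str) -> str | None:
    """Return the MIME type for a filename, or None if the extension is unknown."""
    return EXTENSION_TO_MIME.get(PurePath(filename).suffix.lower())


def is_supported(filename: str, supported_mime_types: list[str]) -> bool:
    """Check whether a file's guessed MIME type is among a parser's declared types."""
    mime_type = guess_mime_type(filename)
    return mime_type is not None and mime_type in supported_mime_types
```

A check like this can be handy in a plugin's own unit tests to confirm that sample files match what the manifest declares.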

Plugin Handler

The following code will be generated by default (components/parser/<parser_name>.py). You need to implement the parse method.
from langbot_plugin.api.definition.components.parser.parser import Parser
from langbot_plugin.api.entities.builtin.rag.models import (
    ParseContext,
    ParseResult,
    TextSection,
)


class PdfParser(Parser):
    """Parser component for extracting text from files."""

    async def parse(self, context: ParseContext) -> ParseResult:
        """Parse a file and extract structured text.

        Args:
            context: Contains file_content (bytes), mime_type, filename, and metadata.

        Returns:
            ParseResult with extracted text and optional structured sections.
        """
        # TODO: Implement parsing logic
        text = context.file_content.decode("utf-8", errors="replace")

        return ParseResult(
            text=text,
            sections=[
                TextSection(
                    content=text,
                    heading=context.filename,
                    level=0,
                ),
            ],
            metadata={
                "filename": context.filename,
                "mime_type": context.mime_type,
            },
        )

Parse Method

The parse method is called when a document is uploaded to a knowledge base (before Knowledge Engine ingestion):
async def parse(self, context: ParseContext) -> ParseResult:
ParseContext contains the following information:
class ParseContext(pydantic.BaseModel):
    file_content: bytes          # Raw file bytes (read by LangBot from storage)
    mime_type: str               # Detected MIME type of the file
    filename: str                # Original filename
    metadata: dict[str, Any]     # Extra metadata from FileObject
ParseResult carries the parsing result returned by the parser:
class ParseResult(pydantic.BaseModel):
    text: str                            # Full extracted plain text
    sections: list[TextSection] = []     # Structured sections (optional)
    metadata: dict[str, Any] = {}        # Parsing metadata (e.g., page_count, language)
TextSection represents a section of text extracted from the document:
class TextSection(pydantic.BaseModel):
    content: str                   # Section text content
    heading: str | None = None     # Section heading
    level: int = 0                 # Nesting level
    page: int | None = None        # Source page number (for PDF, etc.)
    metadata: dict[str, Any] = {}  # Additional section metadata
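As an illustration of how the sections list might be populated, the sketch below splits Markdown text into sections at its headings. It is not the SDK's implementation: the Section dataclass is a stand-in that mirrors only the content, heading, and level fields of TextSection so the example runs without LangBot installed, and split_markdown_sections is a hypothetical helper.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Section:
    """Stand-in mirroring the content/heading/level fields of TextSection."""

    content: str
    heading: str | None = None
    level: int = 0


# Matches ATX-style Markdown headings such as "## Title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(text: str) -> list[Section]:
    """Split Markdown text into one section per heading."""
    sections: list[Section] = []
    heading: str | None = None
    level = 0
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        # Skip an empty preamble that precedes the first heading.
        if heading is not None or content:
            sections.append(Section(content=content, heading=heading, level=level))

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, level = match.group(2), len(match.group(1))
            body = []
        else:
            body.append(line)
    flush()
    return sections
```

Inside a real parse method, each stand-in Section would be replaced by a TextSection built from the same three values. Note that this simple splitter also treats lines starting with # inside fenced code as headings.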

Integration with KnowledgeEngine

When a user uploads a document, LangBot determines the parsing flow as follows:
  • If the user selects an external Parser plugin, LangBot first calls the Parser’s parse method, then passes the result to the Knowledge Engine’s ingest method via IngestionContext.parsed_content.
  • If the Knowledge Engine declares DOC_PARSING capability and the user does not select an external parser, the Knowledge Engine handles document parsing on its own.
KnowledgeEngine can check IngestionContext.parsed_content to determine whether pre-parsed content is available:
async def ingest(self, context: IngestionContext) -> IngestionResult:
    if context.parsed_content:
        # Use pre-parsed content from external Parser
        text = context.parsed_content.text
        sections = context.parsed_content.sections
    else:
        # Parse the document internally
        file_bytes = await self.plugin.get_knowledge_file_stream(context.file_object.storage_path)
        text = file_bytes.decode('utf-8')
    ...
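Both branches above end with text in hand, and the next stage of the text → chunks → vectors flow is chunking. The following is a minimal sketch of that stage, assuming a simple character budget that prefers paragraph breaks. The chunk_text function and its max_chars parameter are hypothetical and not part of the LangBot SDK; a real Knowledge Engine may chunk very differently.

```python
from __future__ import annotations


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split a paragraph that exceeds the budget on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Merge short paragraphs into the current chunk while they fit.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

When pre-parsed sections are available, the same function could be applied to each section's content separately so that chunks never straddle a heading.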

Cross-Plugin Parser Invocation

KnowledgeEngine plugins can invoke parsers provided by other plugins via self.plugin.invoke_parser:
result = await self.plugin.invoke_parser(
    plugin_author="author_name",
    plugin_name="plugin_name",
    storage_path=context.file_object.storage_path,
    mime_type=context.file_object.metadata.mime_type,
    filename=context.file_object.metadata.filename,
    metadata={},
)
# result is a dict containing text, sections, metadata
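Because the result arrives as a plain dict rather than a ParseResult object, it may be worth reading it defensively. The sketch below assumes each entry in sections is itself a dict using the TextSection field names; that shape is an assumption, not something confirmed here, and summarize_parser_result is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def summarize_parser_result(result: dict[str, Any]) -> dict[str, Any]:
    """Summarize a parser result dict without assuming every key is present."""
    text = result.get("text") or ""
    sections = result.get("sections") or []
    # Assumption: each section is a dict with an optional "heading" key.
    headings = [
        section["heading"]
        for section in sections
        if isinstance(section, dict) and section.get("heading")
    ]
    return {
        "char_count": len(text),
        "section_count": len(sections),
        "headings": headings,
    }
```

A summary like this is useful for logging what a cross-plugin parser returned before handing the text to the indexing step.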

Testing the Parser

After creation, execute the command lbp run in the plugin directory to start debugging. Then in LangBot:
  1. Go to the “Knowledge Base” page
  2. Select a knowledge base and enter document management
  3. When uploading a file, select your plugin’s parser in the parser selector
  4. After uploading, verify that the document is correctly ingested