# Component: Parser

The Parser component lets plugins provide document parsing for LangBot. When a user uploads a document to a knowledge base, LangBot invokes the parser before the Knowledge Engine's `ingest` step to extract structured text from files such as PDF, Word, or Markdown documents.

Relationship between Parser and KnowledgeEngine:

* **Parser** is responsible for converting files to text (file → text)
* **KnowledgeEngine** is responsible for indexing and retrieving text (text → chunks → vectors)

If the Knowledge Engine has native document parsing (that is, it declares the `DOC_PARSING` capability), users can choose between the Knowledge Engine's built-in parsing and an external Parser plugin.

## Adding a Parser Component

A single plugin can add any number of parsers. Run `lbp comp Parser` in the plugin directory and follow the prompts to enter the parser configuration.

```bash theme={null}
➜  MyParserPlugin > lbp comp Parser
Generating component Parser...
Parser name: pdf_parser
Parser description: A PDF document parser
Component Parser generated successfully.
```

This generates `pdf_parser.yaml` and `pdf_parser.py` in the `components/parser/` directory. The `.yaml` file defines the parser's basic information and supported MIME types; the `.py` file is the handler for this parser:

```bash theme={null}
➜  MyParserPlugin > tree
...
├── components
│   ├── __init__.py
│   └── parser
│       ├── __init__.py
│       ├── pdf_parser.py
│       └── pdf_parser.yaml
...
```

## Manifest File: Parser

```yaml theme={null}
apiVersion: v1  # Do not modify
kind: Parser  # Do not modify
metadata:
  name: pdf_parser  # Parser name, used to identify this parser
  label:
    en_US: PDF Parser  # Parser display name, shown in LangBot's UI, supports multiple languages
    zh_Hans: PDF 解析器
  description:
    en_US: 'A PDF document parser'
    zh_Hans: 'PDF 文档解析器'
spec:
  supported_mime_types:  # Declare supported file MIME types
    - application/pdf
execution:
  python:
    path: pdf_parser.py  # Parser handler, do not modify
    attr: PdfParser  # Class name of the parser handler, consistent with the class name in pdf_parser.py
```

### supported\_mime\_types

`supported_mime_types` declares the file types this parser supports. Common MIME types:

| MIME Type                                                                 | Description            |
| ------------------------------------------------------------------------- | ---------------------- |
| `application/pdf`                                                         | PDF documents          |
| `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Word documents (.docx) |
| `text/markdown`                                                           | Markdown files         |
| `text/plain`                                                              | Plain text files       |
| `text/html`                                                               | HTML files             |
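
If a parser wants to double-check the incoming `mime_type` against the types it declares, a small guard along these lines could be used. This is a minimal sketch: `SUPPORTED_MIME_TYPES`, `normalize_mime_type`, and `is_supported` are hypothetical names and not part of the LangBot SDK, and the sketch assumes the MIME type may arrive with different casing or a `; charset=...` parameter.

```python
# Hypothetical constant mirroring `spec.supported_mime_types` in the manifest.
SUPPORTED_MIME_TYPES = frozenset({"application/pdf", "text/markdown"})


def normalize_mime_type(mime_type: str) -> str:
    """Lower-case a MIME type and drop parameters such as '; charset=utf-8'."""
    return mime_type.split(";", 1)[0].strip().lower()


def is_supported(mime_type: str, supported=SUPPORTED_MIME_TYPES) -> bool:
    """Return True if the normalized MIME type is in the supported set."""
    return normalize_mime_type(mime_type) in supported
```

A handler could call `is_supported(context.mime_type)` at the top of `parse` and fail early with a clear error for unexpected input.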

## Plugin Handler

The following code will be generated by default (`components/parser/<parser_name>.py`). You need to implement the `parse` method.

```python theme={null}
from langbot_plugin.api.definition.components.parser.parser import Parser
from langbot_plugin.api.entities.builtin.rag.models import (
    ParseContext,
    ParseResult,
    TextSection,
)


class PdfParser(Parser):
    """Parser component for extracting text from files."""

    async def parse(self, context: ParseContext) -> ParseResult:
        """Parse a file and extract structured text.

        Args:
            context: Contains file_content (bytes), mime_type, filename, and metadata.

        Returns:
            ParseResult with extracted text and optional structured sections.
        """
        # TODO: Implement parsing logic
        text = context.file_content.decode("utf-8", errors="replace")

        return ParseResult(
            text=text,
            sections=[
                TextSection(
                    content=text,
                    heading=context.filename,
                    level=0,
                ),
            ],
            metadata={
                "filename": context.filename,
                "mime_type": context.mime_type,
            },
        )
```
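
As one possible way to fill in the `TODO` for a Markdown parser, the text can be split at ATX headings (`#` to `######`) so that each heading starts a new section. The sketch below runs without the SDK: `MdSection` is a hypothetical stand-in for `TextSection` with only the `content`, `heading`, and `level` fields, and `split_markdown` is an illustrative helper. It does not treat `#` lines inside fenced code blocks specially.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MdSection:
    """Stand-in for TextSection, limited to the fields used here."""
    content: str
    heading: Optional[str] = None
    level: int = 0


# One to six '#' characters, whitespace, then the heading text.
_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_markdown(text: str) -> list[MdSection]:
    """Split Markdown text into sections, one per heading."""
    sections: list[MdSection] = []
    heading: Optional[str] = None
    level = 0
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far; skip an empty preamble.
        content = "\n".join(body).strip()
        if content or heading is not None:
            sections.append(MdSection(content=content, heading=heading, level=level))

    for line in text.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()
            heading, level, body = match.group(2), len(match.group(1)), []
        else:
            body.append(line)
    flush()
    return sections
```

Inside `parse`, each `MdSection` would map onto a `TextSection(content=..., heading=..., level=...)`, with the decoded file content passed to `split_markdown`.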

### Parse Method

The `parse` method is called when a document is uploaded to a knowledge base (before Knowledge Engine ingestion):

```python theme={null}
async def parse(self, context: ParseContext) -> ParseResult:
```

**ParseContext** contains the following information:

```python theme={null}
class ParseContext(pydantic.BaseModel):
    file_content: bytes          # Raw file bytes (read by LangBot from storage)
    mime_type: str               # Detected MIME type of the file
    filename: str                # Original filename
    metadata: dict[str, Any]     # Extra metadata from FileObject
```

**ParseResult** carries the parsing result returned by `parse`:

```python theme={null}
class ParseResult(pydantic.BaseModel):
    text: str                            # Full extracted plain text
    sections: list[TextSection] = []     # Structured sections (optional)
    metadata: dict[str, Any] = {}        # Parsing metadata (e.g., page_count, language)
```

**TextSection** represents a section of text extracted from the document:

```python theme={null}
class TextSection(pydantic.BaseModel):
    content: str                   # Section text content
    heading: str | None = None     # Section heading
    level: int = 0                 # Nesting level
    page: int | None = None        # Source page number (for PDF, etc.)
    metadata: dict[str, Any] = {}  # Additional section metadata
```
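
To illustrate how `page` and `metadata` might be populated, the sketch below turns a list of per-page strings (as a PDF library would typically produce) into the three parts of a result. It runs without the SDK: `PageSection` is a hypothetical stand-in for `TextSection`, `pages_to_result` is an illustrative helper, and 1-based page numbering is an assumption rather than something LangBot is known to require.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PageSection:
    """Stand-in for TextSection with the same field names."""
    content: str
    heading: Optional[str] = None
    level: int = 0
    page: Optional[int] = None
    metadata: dict[str, Any] = field(default_factory=dict)


def pages_to_result(pages: list[str]) -> dict[str, Any]:
    """Build text, sections, and metadata from per-page text."""
    sections = [
        PageSection(content=page_text.strip(), page=number)
        for number, page_text in enumerate(pages, start=1)  # assumed 1-based
        if page_text.strip()  # skip pages with no extractable text
    ]
    return {
        "text": "\n\n".join(section.content for section in sections),
        "sections": sections,
        "metadata": {"page_count": len(pages)},
    }
```

In a real handler the returned parts would be passed to `ParseResult(text=..., sections=..., metadata=...)`, with `TextSection` in place of the stand-in.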

## Integration with KnowledgeEngine

When a user uploads a document, LangBot determines the parsing flow as follows:

* If the user selects an external Parser plugin, LangBot first calls the Parser's `parse` method, then passes the result to the Knowledge Engine's `ingest` method via `IngestionContext.parsed_content`.
* If the Knowledge Engine declares `DOC_PARSING` capability and the user does not select an external parser, the Knowledge Engine handles document parsing on its own.

A Knowledge Engine can check `IngestionContext.parsed_content` to determine whether pre-parsed content is available:

```python theme={null}
async def ingest(self, context: IngestionContext) -> IngestionResult:
    if context.parsed_content:
        # Use pre-parsed content from external Parser
        text = context.parsed_content.text
        sections = context.parsed_content.sections
    else:
        # Parse the document internally
        file_bytes = await self.plugin.get_knowledge_file_stream(context.file_object.storage_path)
        text = file_bytes.decode('utf-8')
        sections = []  # No structured sections without an external Parser
    ...
```
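
Once `sections` is available, a Knowledge Engine could chunk per section so that each chunk keeps its heading and page. The following is a minimal sketch under stated assumptions: `ChunkSource` is a hypothetical stand-in for `TextSection`, `sections_to_chunks` is an illustrative helper, and fixed-size character slicing stands in for whatever splitter the engine actually uses.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ChunkSource:
    """Stand-in for TextSection, limited to the fields used here."""
    content: str
    heading: Optional[str] = None
    page: Optional[int] = None


def sections_to_chunks(sections: list[ChunkSource], max_chars: int = 500) -> list[dict[str, Any]]:
    """Slice each section into chunks of at most max_chars characters."""
    chunks: list[dict[str, Any]] = []
    for section in sections:
        text = section.content.strip()
        if not text:
            continue  # nothing to index for an empty section
        for start in range(0, len(text), max_chars):
            chunks.append(
                {
                    "text": text[start:start + max_chars],
                    "heading": section.heading,
                    "page": section.page,
                }
            )
    return chunks
```

Chunking inside section boundaries keeps a chunk from mixing text from two headings, at the cost of some short trailing chunks.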

## Cross-Plugin Parser Invocation

Before invoking a parser from another plugin, you can use `self.plugin.list_parsers` to discover the parsers currently available on the host:

```python theme={null}
parsers = await self.plugin.list_parsers(mime_type="application/pdf")
# Each item includes plugin_id, plugin_author, plugin_name, name, description, supported_mime_types
```

If the list is empty, no connected Parser plugin currently supports that MIME type.

After obtaining `plugin_author` and `plugin_name`, you can call `self.plugin.invoke_parser`:

```python theme={null}
parser = parsers[0]
result = await self.plugin.invoke_parser(
    plugin_author=parser["plugin_author"],
    plugin_name=parser["plugin_name"],
    storage_path=context.file_object.storage_path,
    mime_type=context.file_object.metadata.mime_type,
    filename=context.file_object.metadata.filename,
    metadata={},
)
# result is a dict containing text, sections, metadata
```
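
Indexing `parsers[0]` fails when the list is empty, so a caller may prefer a helper that returns `None` in that case and optionally favors a particular author. This is a sketch only: `select_parser` and its `preferred_author` parameter are hypothetical, and it relies solely on the dict keys listed above (`plugin_author`, `supported_mime_types`).

```python
from typing import Any, Optional


def select_parser(
    parsers: list[dict[str, Any]],
    mime_type: str,
    preferred_author: Optional[str] = None,
) -> Optional[dict[str, Any]]:
    """Pick a parser entry for mime_type, or None if nothing matches."""
    # Filter locally as well, so the helper also works on an unfiltered list.
    candidates = [
        parser for parser in parsers
        if mime_type in parser.get("supported_mime_types", [])
    ]
    if preferred_author is not None:
        for parser in candidates:
            if parser.get("plugin_author") == preferred_author:
                return parser
    # Fall back to the first match, if any.
    return candidates[0] if candidates else None
```

When the helper returns `None`, the caller can fall back to its own parsing instead of calling `invoke_parser`.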

## Testing the Parser

After creating the parser, run `lbp run` in the plugin directory to start debugging. Then in LangBot:

1. Go to the "Knowledge Base" page
2. Select a knowledge base and enter document management
3. When uploading a file, select your plugin's parser in the parser selector
4. After uploading, verify that the document is correctly ingested
