ingest step, extracting structured text from files such as PDF, Word, and Markdown documents.
Relationship between Parser and KnowledgeEngine:
- Parser is responsible for converting files to text (file → text)
- KnowledgeEngine is responsible for indexing and retrieving text (text → chunks → vectors)
DOC_PARSING capability), users can choose to use the Knowledge Engine’s built-in parsing or an external Parser plugin.
Adding a Parser Component
A single plugin can add any number of parsers. Execute the command lbp comp Parser in the plugin directory and follow the prompts to enter the parser configuration.
pdf_parser.yaml and pdf_parser.py files in the components/parser/ directory. The .yaml file defines the parser’s basic information and supported MIME types, and the .py file is the handler for this parser:
Manifest File: Parser
supported_mime_types
supported_mime_types declares the file types this parser supports. Common MIME types:
| MIME Type | Description |
|---|---|
| application/pdf | PDF documents |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents (.docx) |
| text/markdown | Markdown files |
| text/plain | Plain text files |
| text/html | HTML files |
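
As a rough illustration of how the MIME types above relate to uploaded files, the sketch below checks whether a file falls within a parser's declared types. The extension map and the helpers guess_mime_type and is_supported are hypothetical and not part of LangBot. How LangBot itself matches files to parsers is not described here.

```python
from pathlib import Path
from typing import List, Optional

# Extension-to-MIME mapping built from the table above (illustrative only).
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".html": "text/html",
}


def guess_mime_type(filename: str) -> Optional[str]:
    """Return the MIME type for a filename, or None if the extension is unknown."""
    return EXTENSION_TO_MIME.get(Path(filename).suffix.lower())


def is_supported(filename: str, supported_mime_types: List[str]) -> bool:
    """Check whether a file's MIME type is among a parser's declared types."""
    mime_type = guess_mime_type(filename)
    return mime_type is not None and mime_type in supported_mime_types
```

For example, is_supported("notes.md", ["text/markdown", "text/plain"]) is true, while a .png file is rejected by a parser that only declares application/pdf.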
Plugin Handler
The following code will be generated by default (components/parser/<parser_name>.py). You need to implement the parse method.
Parse Method
The parse method is called when a document is uploaded to a knowledge base (before Knowledge Engine ingestion):
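
The generated handler is not reproduced here, so the following is only a minimal sketch of what a parse implementation could look like for plain-text files. ParseInput, ParseOutput, and PlainTextParser are hypothetical stand-ins, not the LangBot SDK classes. The real base class, argument types, return type, and whether parse is async should be taken from the generated file.

```python
from dataclasses import dataclass


@dataclass
class ParseInput:
    """Hypothetical stand-in for whatever LangBot passes to parse."""
    filename: str
    mime_type: str
    content: bytes


@dataclass
class ParseOutput:
    """Hypothetical stand-in for the parse result."""
    text: str


class PlainTextParser:
    """Illustrative parser for text/plain and text/markdown files."""

    supported_mime_types = ["text/plain", "text/markdown"]

    def parse(self, file: ParseInput) -> ParseOutput:
        if file.mime_type not in self.supported_mime_types:
            raise ValueError(f"Unsupported MIME type: {file.mime_type}")
        # Decode leniently so a stray invalid byte does not abort the upload.
        text = file.content.decode("utf-8", errors="replace")
        # Normalize line endings so later chunking sees consistent text.
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        return ParseOutput(text=text.strip())
```

A parser for PDF or Word files would follow the same shape, with the decoding step replaced by a call into a document-extraction library of your choice.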
Integration with KnowledgeEngine
When a user uploads a document, LangBot determines the parsing flow as follows:
- If the user selects an external Parser plugin, LangBot first calls the Parser’s parse method, then passes the result to the Knowledge Engine’s ingest method via IngestionContext.parsed_content.
- If the Knowledge Engine declares the DOC_PARSING capability and the user does not select an external parser, the Knowledge Engine handles document parsing on its own.
IngestionContext.parsed_content to determine whether pre-parsed content is available:
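
A minimal sketch of that check follows, covering both the external-parser path and the built-in path. The IngestionContext class here is a hypothetical stand-in: only the parsed_content attribute is named in this document, and this sketch assumes it is None when no external parser ran. The file_bytes field, builtin_parse, and resolve_text are illustrative names, not LangBot APIs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IngestionContext:
    """Hypothetical stand-in; only parsed_content is named in the text above."""
    file_bytes: bytes
    parsed_content: Optional[str] = None  # assumed None when no external parser ran


def builtin_parse(file_bytes: bytes) -> str:
    """Placeholder for the engine's own parsing (the DOC_PARSING path)."""
    return file_bytes.decode("utf-8", errors="replace")


def resolve_text(context: IngestionContext) -> Tuple[str, str]:
    """Return (text, source), where source records which path produced the text."""
    if context.parsed_content is not None:
        # An external Parser plugin already converted the file to text.
        return context.parsed_content, "external_parser"
    # No pre-parsed content: fall back to the engine's built-in parsing.
    return builtin_parse(context.file_bytes), "builtin"
```

Checking for None explicitly, rather than for truthiness, keeps an empty parse result from silently triggering a second parse by the engine.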
Cross-Plugin Parser Invocation
KnowledgeEngine plugins can invoke parsers provided by other plugins via self.plugin.invoke_parser:
Testing the Parser
After creation, execute the command lbp run in the plugin directory to start debugging. Then in LangBot:
- Go to the “Knowledge Base” page
- Select a knowledge base and enter document management
- When uploading a file, select your plugin’s parser in the parser selector
- After uploading, verify that the document is correctly ingested
