ParseStudio Documentation
Entry Point
The entry point for the parsestudio library is the PDFParser module, which acts as the main interface for the library.
The PDFParser module initializes the parser with one of the following backends, which is then used to parse a PDF file:
- Docling: Advanced parser with multimodal capabilities.
- PyMuPDF: Lightweight and efficient.
- LlamaParse: AI-enhanced parser with advanced capabilities.
Each backend parser has its own strengths. Choose the one that best fits your use case.
Basic Usage
To run the parser, use the run method of the PDFParser module.
from parsestudio.parse import PDFParser
parser = PDFParser(parser="docling") # or "pymupdf" or "llama"
outputs = parser.run("path/to/pdf/file") # list with one ParserOutput per PDF
Documentation
The PDFParser module is initialized with the parser name and its parser_kwargs as arguments. The parser_kwargs are optional.
PDFParser
Parse PDF files using different parsers.
Source code in parsestudio/parse.py
__init__(parser='docling', parser_kwargs={})
Initialize the PDF parser with the specified backend.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
parser | str | The parser backend to use. Options are 'docling', 'llama', and 'pymupdf'. Defaults to 'docling'. | 'docling' |
parser_kwargs | dict | Additional keyword arguments to pass to the parser. Check the documentation of the parser for more information. | {} |
Raises:
Type | Description |
---|---|
ValueError | If an invalid parser is specified. |
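The constructor contract above (three accepted backend names, ValueError otherwise) can be mirrored on the caller's side. The following is a minimal sketch, not part of parsestudio: make_parser and KNOWN_BACKENDS are hypothetical names introduced here, and only the PDFParser(parser=..., parser_kwargs=...) call comes from the documentation.

```python
# Backend names listed in the documentation above.
KNOWN_BACKENDS = ("docling", "pymupdf", "llama")


def make_parser(name="docling", parser_kwargs=None):
    """Hypothetical helper: validate the backend name, then build a PDFParser."""
    if name not in KNOWN_BACKENDS:
        raise ValueError(
            f"Unknown parser {name!r}; expected one of {', '.join(KNOWN_BACKENDS)}"
        )
    # Imported lazily so the name check also works where parsestudio is absent.
    from parsestudio.parse import PDFParser

    # Build a fresh dict on every call instead of sharing one between calls.
    return PDFParser(parser=name, parser_kwargs=dict(parser_kwargs or {}))


try:
    make_parser("not-a-backend")
except ValueError as err:
    message = str(err)
    print(message)
```

Checking the name before construction gives an early, uniform error message; the library's own ValueError remains the authoritative check.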
run(pdf_path, modalities=['text', 'tables', 'images'], **kwargs)
Run the PDF parser on the given PDF file(s).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_path | str or List[str] | The path to the PDF file(s) to parse. | required |
modalities | List[str] | The modalities to extract from the PDF file(s). Defaults to ["text", "tables", "images"]. | ['text', 'tables', 'images'] |
**kwargs | | Additional keyword arguments to pass to the 'docling' parser. Check the documentation of the parser for more information. | {} |
Returns:
Type | Description |
---|---|
List[ParserOutput] | The parsed output(s) from the PDF file(s). |
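Because run accepts a list of paths and returns a list of outputs, a batch call can be paired back to its inputs. The sketch below is illustrative only: parse_batch is a hypothetical helper, _StubParser is a stand-in so the code runs without parsestudio installed (it returns plain dicts, not ParserOutput objects), and the same-order pairing is an assumption that the documentation does not state.

```python
from typing import Dict, List, Sequence


def parse_batch(parser, pdf_paths: Sequence[str], modalities: List[str]) -> Dict[str, object]:
    """Hypothetical helper: run one parser over several PDFs, keyed by path."""
    outputs = parser.run(list(pdf_paths), modalities=modalities)
    if len(outputs) != len(pdf_paths):
        raise RuntimeError("expected one output per input path")
    # Assumption: outputs come back in the same order as the input paths.
    return dict(zip(pdf_paths, outputs))


class _StubParser:
    """Stand-in for PDFParser so the sketch runs without parsestudio."""

    def run(self, pdf_path, modalities=("text", "tables", "images")):
        paths = [pdf_path] if isinstance(pdf_path, str) else list(pdf_path)
        return [{"source": p, "modalities": list(modalities)} for p in paths]


results = parse_batch(_StubParser(), ["a.pdf", "b.pdf"], modalities=["text", "tables"])
print(sorted(results))
```

With the real library, a PDFParser instance would be passed in place of _StubParser().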
Examples:
from parsestudio.parse import PDFParser
# Initialize the parser
parser = PDFParser(parser="docling")
# Parse the PDF file
outputs = parser.run("path/to/file.pdf")
print(len(outputs)) # Number of PDF files
# Output: 1
# Access text content
print(outputs[0].text)
# Output: text='Hello, World!'
# Access tables
print(outputs[0].tables)
# Output:
# [
#     TableElement(
#         markdown='| Column 1 | Column 2 |
#                   |----------|----------|
#                   | Value 1 | Value 2 |
#                   | Value 3 | Value 4 |',
#         dataframe=  Column 1 Column 2
#                   0 Value 1 Value 2
#                   1 Value 3 Value 4,
#         metadata=Metadata(page_number=1, bbox=[0, 0, 100, 100])
#     )
# ]
for table in outputs[0].tables:
    metadata = table.metadata
    markdown_table = table.markdown
    pandas_dataframe = table.dataframe
    print(metadata)
    print(markdown_table)
# Output:
# Metadata(page_number=1, bbox=[0, 0, 100, 100])
# | Column 1 | Column 2 |
# |----------|----------|
# | Value 1 | Value 2 |
# | Value 3 | Value 4 |
# Access images
print(outputs[0].images)
# Output:
# [
#     ImageElement(
#         image=<PIL.Image.Image image mode=RGB size=233x140 at 0x16E894E50>,
#         metadata=Metadata(page_number=1, bbox=[0, 0, 100, 100])
#     )
# ]
for image in outputs[0].images:
    metadata = image.metadata
    pil_image = image.image  # PIL image object
    print(metadata)
    pil_image.show()
# Output:
# Metadata(page_number=1, bbox=[0, 0, 100, 100])
# [Image shown]
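Since the example exposes each table as a pandas DataFrame and each image as a PIL image, the elements can be written to disk with those libraries' own to_csv and save methods. This is a rough sketch under those assumptions: export_elements and the file naming scheme are hypothetical, and the _FakeFrame and _FakeImage classes are stand-ins so the demo runs without pandas, Pillow, or parsestudio.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def export_elements(output, out_dir):
    """Hypothetical helper: write each table to CSV and each image to PNG."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, table in enumerate(output.tables):
        path = out_dir / f"table_p{table.metadata.page_number}_{i}.csv"
        table.dataframe.to_csv(path, index=False)  # pandas DataFrame.to_csv
        written.append(path)
    for i, image in enumerate(output.images):
        path = out_dir / f"image_p{image.metadata.page_number}_{i}.png"
        image.image.save(path)  # PIL Image.save, format taken from the suffix
        written.append(path)
    return written


class _FakeFrame:
    """Stand-in for a pandas DataFrame (demo only)."""

    def to_csv(self, path, index=True):
        Path(path).write_text("Column 1,Column 2\n")


class _FakeImage:
    """Stand-in for a PIL image (demo only)."""

    def save(self, path):
        Path(path).write_bytes(b"")


meta = SimpleNamespace(page_number=1, bbox=[0, 0, 100, 100])
fake_output = SimpleNamespace(
    tables=[SimpleNamespace(dataframe=_FakeFrame(), metadata=meta)],
    images=[SimpleNamespace(image=_FakeImage(), metadata=meta)],
)
with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in export_elements(fake_output, tmp)]
print(names)
```

With real results, export_elements(outputs[0], "some/output/dir") would be called on an element of the list returned by run.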
The run method of the PDFParser module returns a list of ParserOutput objects, one per input PDF, that contain the parsed data. Check the ParserOutput class in Schemas for more information.
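As an illustration of working with the parsed data, the sketch below counts tables and images per page using the metadata.page_number attribute shown in the example output. elements_per_page is a hypothetical helper, and the SimpleNamespace objects are stand-ins for a real ParserOutput so the code runs without parsestudio.

```python
from collections import Counter
from types import SimpleNamespace


def elements_per_page(output):
    """Hypothetical helper: count tables and images on each page."""
    counts = Counter()
    for table in output.tables:
        counts[(table.metadata.page_number, "table")] += 1
    for image in output.images:
        counts[(image.metadata.page_number, "image")] += 1
    return dict(counts)


def _element(page):
    # Stand-in element carrying only the metadata used above.
    return SimpleNamespace(metadata=SimpleNamespace(page_number=page))


fake_output = SimpleNamespace(
    tables=[_element(1), _element(2), _element(2)],
    images=[_element(1)],
)
page_counts = elements_per_page(fake_output)
print(page_counts)
```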