spaCy Layout: Process PDFs, Word documents and more with spaCy
This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy’s familiar Doc objects that let you access labelled text spans like sections or headings, and tables with their data converted to a pandas.DataFrame.
This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It’s also great for implementing chunking for RAG pipelines.
After initializing the spaCyLayout preprocessor with an nlp object for tokenization, you can call it on a document path to convert it to structured data. The resulting Doc object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.
import spacy
from spacy_layout import spaCyLayout
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
# Process a document and create a spaCy Doc object
doc = layout("./starcraft.pdf")
# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Tables in the document and their extracted data
print(doc._.tables)
# Markdown representation of the document
print(doc._.markdown)
# Layout spans for different sections
for span in doc.spans["layout"]:
# Document section and token and character offsets into the text
print(span.text, span.start, span.end, span.start_char, span.end_char)
# Section type, e.g. "text", "title", "section_header" etc.
print(span.label_)
# Layout features of the section, including bounding box
print(span._.layout)
# Closest heading to the span (accuracy depends on document structure)
print(span._.heading)
If you need to process larger volumes of documents at scale, you can use the spaCyLayout.pipe method, which takes an iterable of paths or bytes instead and yields Doc objects:
paths = ["one.pdf", "two.pdf", "three.pdf", ...]
for doc in layout.pipe(paths):
print(doc._.layout)
After you’ve processed the documents, you can serialize the structured Doc objects in spaCy’s efficient binary format, so you don’t have to re-run the resource-intensive conversion.
# Load the transformer-based English pipeline
# Installation: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
# Apply the pipeline to access POS tags, dependencies, entities etc.
doc = nlp(doc)
Tables and tabular data
Tables are included in the layout spans with the label "table" and under the shortcut Doc._.tables. They expose a layout extension attribute, as well as an attribute data, which includes the tabular data converted to a pandas.DataFrame.
for table in doc._.tables:
# Token position and bounding box
print(table.start, table.end, table._.layout)
# pandas.DataFrame of contents
print(table._.data)
By default, the span text is a placeholder TABLE, but you can customize how a table is rendered by providing a display_table callback to spaCyLayout, which receives the pandas.DataFrame of the data. This allows you to include the table figures in the document text and use them later on, e.g. during information extraction with a trained named entity recognizer or text classifier.
layout = spaCyLayout(nlp)
doc = layout("./starcraft.pdf")
print(doc._.layout)
for span in doc.spans["layout"]:
print(span.label_, span._.layout)
Attribute
Type
Description
Doc._.layout
DocLayout
Layout features of the document.
Doc._.pages
list[tuple[PageLayout, list[Span]]]
Pages in the document and the spans they contain.
Doc._.tables
list[Span]
All tables in the document.
Doc._.markdown
str
Markdown representation of the document.
Doc.spans["layout"]
spacy.tokens.SpanGroup
The layout spans in the document.
Span.label_
str
The type of the extracted layout span, e.g. "text" or "section_header". See here for options.
Span.label
int
The integer ID of the span label.
Span.id
int
Running index of layout span.
Span._.layout
SpanLayout | None
Layout features of a layout span.
Span._.heading
Span | None
Closest heading to a span, if available.
Span._.data
pandas.DataFrame | None
The extracted data for table spans.
dataclass PageLayout
Attribute
Type
Description
page_no
int
The page number (1-indexed).
width
float
Page with in pixels.
height
float
Page height in pixels.
dataclass DocLayout
Attribute
Type
Description
pages
list[PageLayout]
The pages in the document.
dataclass SpanLayout
Attribute
Type
Description
x
float
Horizontal offset of the bounding box in pixels.
y
float
Vertical offset of the bounding box in pixels.
width
float
Width of the bounding box in pixels.
height
float
Height of the bounding box in pixels.
page_no
int
Number of page the span is on.
classspaCyLayout
methodspaCyLayout.__init__
Initialize the document processor.
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
Argument
Type
Description
nlp
spacy.language.Language
The initialized nlp object to use for tokenization.
separator
str
Token used to separate sections in the created Doc object. The separator won’t be part of the layout span. If None, no separator will be added. Defaults to "\n\n".
attrs
dict[str, str]
Override the custom spaCy attributes. Can include "doc_layout", "doc_pages", "doc_tables", "doc_markdown", "span_layout", "span_data", "span_heading" and "span_group".
headings
list[str]
Labels of headings to consider for Span._.heading detection. Defaults to ["section_header", "page_header", "title"].
display_table
Callable[[pandas.DataFrame], str] | str
Function to generate the text-based representation of the table in the Doc.text or placeholder text. Defaults to "TABLE".
spaCy Layout: Process PDFs, Word documents and more with spaCy
This plugin integrates with Docling to bring structured processing of PDFs, Word documents and other input formats to your spaCy pipeline. It outputs clean, structured data in a text-based format and creates spaCy’s familiar
Doc
objects that let you access labelled text spans like sections or headings, and tables with their data converted to apandas.DataFrame
.This workflow makes it easy to apply powerful NLP techniques to your documents, including linguistic analysis, named entity recognition, text classification and more. It’s also great for implementing chunking for RAG pipelines.
📝 Usage
After initializing the
spaCyLayout
preprocessor with annlp
object for tokenization, you can call it on a document path to convert it to structured data. The resultingDoc
object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features.If you need to process larger volumes of documents at scale, you can use the
spaCyLayout.pipe
method, which takes an iterable of paths or bytes instead and yieldsDoc
objects:After you’ve processed the documents, you can serialize the structured
Doc
objects in spaCy’s efficient binary format, so you don’t have to re-run the resource-intensive conversion.spaCy also allows you to call the
nlp
object on an already createdDoc
, so you can easily apply a pipeline of components for linguistic analysis or named entity recognition, use rule-based matching or anything else you can do with spaCy.Tables and tabular data
Tables are included in the layout spans with the label
"table"
and under the shortcutDoc._.tables
. They expose alayout
extension attribute, as well as an attributedata
, which includes the tabular data converted to apandas.DataFrame
.By default, the span text is a placeholder
TABLE
, but you can customize how a table is rendered by providing adisplay_table
callback tospaCyLayout
, which receives thepandas.DataFrame
of the data. This allows you to include the table figures in the document text and use them later on, e.g. during information extraction with a trained named entity recognizer or text classifier.🎛️ API
Data and extension attributes
Doc._.layout
DocLayout
Doc._.pages
list[tuple[PageLayout, list[Span]]]
Doc._.tables
list[Span]
Doc._.markdown
str
Doc.spans["layout"]
spacy.tokens.SpanGroup
Span.label_
str
"text"
or"section_header"
. See here for options.Span.label
int
Span.id
int
Span._.layout
SpanLayout | None
Span._.heading
Span | None
Span._.data
pandas.DataFrame | None
dataclass PageLayout
page_no
int
width
float
height
float
dataclass DocLayout
pages
list[PageLayout]
dataclass SpanLayout
x
float
y
float
width
float
height
float
page_no
int
class
spaCyLayout
method
spaCyLayout.__init__
Initialize the document processor.
nlp
spacy.language.Language
nlp
object to use for tokenization.separator
str
Doc
object. The separator won’t be part of the layout span. IfNone
, no separator will be added. Defaults to"\n\n"
.attrs
dict[str, str]
"doc_layout"
,"doc_pages"
,"doc_tables"
,"doc_markdown"
,"span_layout"
,"span_data"
,"span_heading"
and"span_group"
.headings
list[str]
Span._.heading
detection. Defaults to["section_header", "page_header", "title"]
.display_table
Callable[[pandas.DataFrame], str] | str
Doc.text
or placeholder text. Defaults to"TABLE"
.docling_options
dict[InputFormat, FormatOption]
DocumentConverter
.spaCyLayout
method
spaCyLayout.__call__
Process a document and create a spaCy
Doc
object containing the text content and layout spans, available viaDoc.spans["layout"]
by default.source
str | Path | bytes | DoclingDocument
DoclingDocument
.Doc
Doc
object.method
spaCyLayout.pipe
Process multiple documents and create spaCy
Doc
objects. You should use this method if you’re processing larger volumes of documents at scale.sources
Iterable[str | Path | bytes]
Doc
Doc
object.