Class PDFNet.TextExtractor

Class Summary
Constructor Attributes	Constructor Name and Description
	PDFNet.TextExtractor(id) TextExtractor is used to analyze a PDF page and extract words and logical structure within a given region.

Method Summary
Method Attributes	Method Name and Description
	begin(page, clip_ptr, flags) Start reading the page.
<static>	PDFNet.TextExtractor.create() Creates a new TextExtractor object.
	destroy() Frees the native memory of the object.
	getAsText(dehyphen) Gets all words in the current selection as a single string.
	getAsXML(xml_output_flags) Gets the text content as an XML string.
	getFirstLine()
	getNumLines()
	getQuads(mtx, quads, quads_size) [CURRENTLY BUGGED]
	getRightToLeftLanguage()
	getTextUnderAnnot(annot) Get all the characters that intersect an annotation.
	getWordCount()
	setRightToLeftLanguage(rtl) Sets the directionality of the text extractor.

Class Detail

PDFNet.TextExtractor(id)

TextExtractor is used to analyze a PDF page and extract words and logical structure within a given region. The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:
- Converting PDF pages to text or XML for content repurposing.
- Searching PDF pages for specific words or keywords.
- Indexing large PDF repositories for search or content retrieval purposes (e.g. implementing a PDF search engine).
- Classifying or summarizing PDF documents based on their text content.
- Finding specific words for content editing purposes (such as splitting pages based on keywords).

The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:
- Normalize all text content to Unicode.
- Extract inferred logical structure (word by word, line by line, or paragraph by paragraph).
- Extract positioning information for every line, word, or glyph.
- Extract style information (such as the font, font size, and font styles) for every line, word, or glyph.
- Control the content analysis process. A number of options (such as removal of text obscured by images) are available to let the user direct the content recognition algorithms to meet their requirements.
- Offer utility methods to convert PDF page content to text, XML, or HTML.

Note: TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) or vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the pdftron.PDF.ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

Parameters:
id
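
The typical call sequence implied by the method list (create, begin, query, destroy) can be sketched as follows. This is a minimal sketch, not the library's official sample: it assumes that `PDFNet` is already initialized, that `page` is an already-loaded Page object, and that `clip_ptr` and `flags` may be omitted from begin() to read the whole page with default processing. The function name `extractPageText` is hypothetical.

```javascript
// Sketch only: `PDFNet` is the namespace from this reference and `page` is an
// already-loaded Page; obtaining both is outside the scope of this example.
async function extractPageText(PDFNet, page) {
  const extractor = await PDFNet.TextExtractor.create();
  try {
    // Assumption: clip_ptr and flags can be omitted to read the whole page.
    await extractor.begin(page);
    const wordCount = await extractor.getWordCount();
    const lineCount = await extractor.getNumLines();
    const text = await extractor.getAsText(true); // true = remove line-end hyphens
    return { wordCount, lineCount, text };
  } finally {
    // Free the native memory held by the extractor, even if a call above failed.
    await extractor.destroy();
  }
}
```

Wrapping the calls in try/finally is a design choice of this sketch so that destroy() runs even when extraction throws.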

Method Detail

begin(page, clip_ptr, flags)

Start reading the page.

Parameters:
{Page} page: Page to read.
{rect} clip_ptr: A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
{number} flags: A combination of ProcessingFlags used to control the text extraction algorithm.
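
Because `flags` is a single number, several processing flags have to be merged into one value before calling begin(). The sketch below assumes that ProcessingFlags are bit flags that combine with bitwise OR; that is an assumption, not something this page states. The helper names `combineFlags` and `beginWithOptions` are hypothetical, and the numeric flag values must be taken from the library's ProcessingFlags enumeration rather than hard-coded.

```javascript
// Combine individual flag values into the single number that begin() expects.
// Assumption: ProcessingFlags are bit flags, so they combine with bitwise OR.
function combineFlags(...flagValues) {
  return flagValues.reduce((acc, flag) => acc | flag, 0);
}

// `extractor`, `page`, and `clipRect` are supplied by the caller. `flagValues`
// is an array of numbers read from the library's ProcessingFlags enumeration.
async function beginWithOptions(extractor, page, clipRect, flagValues) {
  await extractor.begin(page, clipRect, combineFlags(...flagValues));
  // Only text inside clipRect is expected in the result.
  return extractor.getAsText(false);
}
```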

<static> {TextExtractor} PDFNet.TextExtractor.create()

Creates a new TextExtractor object.

Returns: {TextExtractor} A promise that resolves to a new TextExtractor object.

destroy()

Frees the native memory of the object.

{string} getAsText(dehyphen)

Gets all words in the current selection as a single string.

Parameters:
{boolean} dehyphen: If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.

Returns: {string} A promise that resolves to the extracted text as a string.
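
To make the effect of `dehyphen` concrete, the following standalone function illustrates the idea of merging a word that was split by a hyphen at a line end. It is an illustration only and is not the library's detection algorithm; the function name `dehyphenate` is hypothetical.

```javascript
// Illustration only: joins a word split by a hyphen at the end of a line.
// This is NOT the library's hyphen detection algorithm.
function dehyphenate(text) {
  // A hyphen that sits between a letter and a line break followed by another
  // letter is treated as a split word: remove the hyphen and the line break.
  return text.replace(/(?<=\p{L})-\r?\n(?=\p{L})/gu, '');
}
```

A purely textual rule like this one also merges genuine compounds that happen to break at the hyphen (for example "well-" followed by "known" on the next line), which is one reason hyphen detection is best left to the extractor.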

{string} getAsXML(xml_output_flags)

Gets the text content as an XML string.

Parameters:
{number} xml_output_flags: Flags controlling XML output. For more information, please see TextExtractor::XMLOutputFlags. XML output is encoded in UTF-8.

Returns: {string} A promise that resolves to the XML output as a string.

{textextractorline} getFirstLine()

Returns: {textextractorline} A promise that resolves to the first line of text on the selected page.

{number} getNumLines()

Returns: {number} A promise that resolves to the number of lines of text on the selected page.

getQuads(mtx, quads, quads_size)

[CURRENTLY BUGGED]

Parameters:
{matrix2d} mtx: The quadrilateral representing a tight bounding box for this word (in unrotated page coordinates).
{number} quads
{number} quads_size

{boolean} getRightToLeftLanguage()

Returns: {boolean} A promise that resolves to the directionality of the text extractor.

{string} getTextUnderAnnot(annot)

Get all the characters that intersect an annotation.

Parameters:
{Annot} annot: The annotation to intersect with.

Returns: {string} A promise that resolves to the text under the annotation as a string.
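
A common use of getTextUnderAnnot() is to collect the text covered by each annotation on a page. The sketch below assumes that the page must first be read with begin() and that the caller has already obtained the Annot objects; neither detail is stated on this page. The function name `textUnderAnnotations` and the `annots` array are hypothetical.

```javascript
// `extractor` is a TextExtractor, `page` a loaded Page, and `annots` an array
// of Annot objects that the caller has already retrieved from that page.
async function textUnderAnnotations(extractor, page, annots) {
  // Assumption: the page has to be read before annotations can be queried.
  await extractor.begin(page);
  const results = [];
  for (const annot of annots) {
    // Query sequentially so calls into the native extractor do not overlap.
    results.push(await extractor.getTextUnderAnnot(annot));
  }
  return results;
}
```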

{number} getWordCount()

Returns: {number} A promise that resolves to the number of words on the page.

setRightToLeftLanguage(rtl)

Sets the directionality of the text extractor. Must be called before processing of a page starts.

Parameters:
{boolean} rtl: If true, enables right-to-left mode, which reverses the directionality of the TextExtractor algorithm.
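
Since the directionality has to be set before page processing starts, the call order matters. The sketch below sets right-to-left mode first and only then reads the page; it assumes that begin() is the call that starts page processing and that `clip_ptr` and `flags` may be omitted. The function name `extractRightToLeft` is hypothetical.

```javascript
// `extractor` is a TextExtractor and `page` a loaded Page, both from the caller.
async function extractRightToLeft(extractor, page) {
  // Set the directionality first; changing it after begin() is too late.
  await extractor.setRightToLeftLanguage(true);
  await extractor.begin(page);
  // Read the setting back so the caller can confirm the mode that was used.
  const rtl = await extractor.getRightToLeftLanguage();
  const text = await extractor.getAsText(false);
  return { rtl, text };
}
```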

WebViewer HTML5 API Reference v2.2.1.49322
