
8.6. IM-ContentsSearch for Accel Platform

8.6.1. There are limitations on the maximum number of contents if the trial version license is used.

With the trial version license, up to 20,000 contents can be registered.

Note that the unit in which contents are created depends on the specification of each application that registers the content information.

If you register a license for IM-ContentsSearch for Accel Platform, an unlimited number of contents can be registered.

8.6.2. There are limitations on text extraction.

[*] Supported File Formats

The following file formats allow text extraction using the text extraction class that is provided as standard.

[List of File Formats from Which Text Can Be Extracted]

1. Plain text text/plain (txt)

2. HTML text/html (htm, html)

3. XML text/xml (xml)

4. PDF application/pdf (pdf)

5. Microsoft Office Word

・application/msword (doc)

・application/vnd.openxmlformats-officedocument.wordprocessingml.document (docx)

6. Microsoft Office PowerPoint

・application/vnd.ms-powerpoint (ppt)

・application/vnd.openxmlformats-officedocument.presentationml.presentation (pptx)

7. Microsoft Office Excel

・application/vnd.ms-excel (xls)

・application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (xlsx)

8. Microsoft Visio application/vnd.ms-visio.viewer (vsd)

9. ZIP archive application/zip (zip)
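As an illustration only, the list above can be written as a lookup table from file extension to MIME type. The names `SUPPORTED_FORMATS` and `mime_type_for` are hypothetical and are not part of the product API; the MIME types are copied from the list.

```python
# Extension-to-MIME-type table built from the list of supported formats.
# The table and function names are illustrative, not product identifiers.
SUPPORTED_FORMATS = {
    "txt": "text/plain",
    "htm": "text/html",
    "html": "text/html",
    "xml": "text/xml",
    "pdf": "application/pdf",
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt": "application/vnd.ms-powerpoint",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xls": "application/vnd.ms-excel",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "vsd": "application/vnd.ms-visio.viewer",
    "zip": "application/zip",
}


def mime_type_for(filename: str):
    """Return the listed MIME type for a file name, or None if it is not listed."""
    _, dot, extension = filename.rpartition(".")
    if not dot:
        return None  # no extension at all
    return SUPPORTED_FORMATS.get(extension.lower())
```

A caller could use `mime_type_for(name) is None` to skip files before attempting extraction.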

[*] Limitations on text extraction

The limitations on text extraction are described below for each file format.

1. Plain Text

The API automatically detects the character encoding of a plain text file by a statistical method.

However, the character encoding cannot be detected correctly if the text in the file is too short or if the file contains a mixture of character encodings. In that case, the text extracted from the file is corrupted and the search cannot be performed correctly.
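The following sketch shows why short text is hard to classify. It does not reproduce the statistical method used by the API; it only demonstrates that the same few bytes can decode without error under more than one encoding, so a detector has very little evidence to work with. The function name `candidate_decodings` is hypothetical.

```python
def candidate_decodings(data: bytes, encodings):
    """Return {encoding: text} for every encoding that decodes data without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out by the bytes themselves
    return results


# A two-character Japanese string encoded as Shift_JIS: only four bytes.
short_sample = "日本".encode("shift_jis")
results = candidate_decodings(short_sample, ["shift_jis", "cp1252", "utf-8"])
for name, text in results.items():
    print(name, repr(text))
```

More than one encoding survives here, and only one of the decoded strings is the intended text. Picking the wrong one produces the corrupted text described above.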

2. HTML

The API extracts text from the following parts of an HTML document.

・Text in the TITLE element inside the HEAD element

・Text in the BODY element

The API first tries to determine the HTML character encoding from the content-type specified in a META tag. If that fails, it detects the character encoding by the statistical method, as it does for plain text.

If the character encoding cannot be detected correctly, the text extracted from the file is corrupted.
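The behavior described for HTML can be sketched with the Python standard library. This is a simplified illustration under stated assumptions, not the extraction class itself: the names `sniff_meta_charset`, `TitleAndBodyText`, and `extract_html_text` are hypothetical, the fallback is a fixed encoding rather than statistical detection, and text inside SCRIPT or STYLE elements in the body is not filtered out.

```python
import re
from html.parser import HTMLParser

# Looks for charset=... inside a META tag near the start of the raw bytes.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes):
    """Return the charset named in a META tag, or None if none is found."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else None


class TitleAndBodyText(HTMLParser):
    """Collects text that appears inside TITLE or BODY."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if (self._in_title or self._in_body) and data.strip():
            self.parts.append(data.strip())


def extract_html_text(raw: bytes, fallback: str = "utf-8") -> str:
    charset = sniff_meta_charset(raw) or fallback
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:  # META named an encoding Python does not know
        text = raw.decode(fallback, errors="replace")
    parser = TitleAndBodyText()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)
```

If the META tag is missing or names the wrong encoding, the decoded string is already corrupted before any text is collected, which matches the limitation stated above.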

3. XML

The API extracts all text nodes in the XML document.
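Collecting every text node can be sketched with `xml.etree.ElementTree` from the Python standard library. The function name `extract_xml_text` is hypothetical, and joining the nodes with single spaces is a choice made for this example rather than documented behavior of the API.

```python
import xml.etree.ElementTree as ET


def extract_xml_text(xml_source: str) -> str:
    """Concatenate all text nodes of an XML document, in document order."""
    root = ET.fromstring(xml_source)
    pieces = (text.strip() for text in root.itertext())
    return " ".join(piece for piece in pieces if piece)
```

Element names and attribute values are not text nodes, so they do not appear in the result.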

4. PDF

Depending on the settings in the configuration file (<CONTENTS_PATH>/WEB-INF/conf/solr-extractor-config.xml), the API extracts text from PDF files using one of the following:

1. PDFBox (http://pdfbox.apache.org/)

2. The "pdftotext" command bundled with Xpdf (http://www.foolabs.com/xpdf/)

3. The freeware "xdoc2txt" (http://ebstudio.info/home/xdoc2txt.html)

Using pdftotext or xdoc2txt increases the range of PDF files from which text can be extracted.

If pdftotext or xdoc2txt is used, Xpdf or xdoc2txt.exe must be installed on every machine on which intra-mart Accel Platform runs, and the environment must be set up so that each command works correctly.

Note that xdoc2txt.exe works only on Microsoft Windows machines.

With any of the options above, text cannot be extracted from password-encrypted PDF files (PDF files that ask for a password when opened in a PDF viewer).
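To check that pdftotext is usable on a given machine, a small script along the following lines may help. It is a sketch that calls the command directly and is unrelated to how the product invokes it; the function names are hypothetical. It relies on the `-enc` option of pdftotext to select the output encoding and on `-` as the output file name to write to standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str, executable: str = "pdftotext"):
    """Build the argument list: UTF-8 output, written to standard output."""
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_path: str) -> str:
    """Run pdftotext on one file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    completed = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raises CalledProcessError if pdftotext exits with a non-zero status
    )
    return completed.stdout.decode("utf-8", errors="replace")
```

Running `extract_pdf_text` on a sample file on each server is one way to confirm that the command is installed and reachable before enabling it in the configuration.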

5. Microsoft Office Word

Depending on the settings in the configuration file, the API extracts text from Word files using one of the following:

1. Apache POI (http://poi.apache.org/)

2. xdoc2txt

With either option, text can be extracted from Word files (doc) of Office 2003 or earlier and from Word files (docx) of Office 2007.

If xdoc2txt is used, xdoc2txt.exe must be installed, and the environment must be set up so that it works correctly.

Note that xdoc2txt.exe works only on Microsoft Windows.

With either option, text cannot be extracted from password-protected Word files.
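One way to spot a password-protected docx, pptx, or xlsx file before attempting extraction is to look at its first bytes. This is a heuristic sketch and not something the product is documented to do: an unprotected Office Open XML file is a ZIP package, while a password-encrypted one is wrapped in an OLE2 compound file, so an OLE2 signature under one of these extensions suggests encryption (or a misnamed legacy file). The function name is hypothetical, and the check cannot detect protection on legacy doc, ppt, or xls files, which are always OLE2.

```python
# Signature at the start of every OLE2 compound file.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

OOXML_EXTENSIONS = (".docx", ".pptx", ".xlsx")


def looks_password_protected_ooxml(filename: str, header: bytes) -> bool:
    """Heuristic: an OOXML extension on a file that starts with the OLE2 signature."""
    has_ooxml_name = filename.lower().endswith(OOXML_EXTENSIONS)
    return has_ooxml_name and header.startswith(OLE2_MAGIC)
```

Passing the first eight bytes of the file as `header` is enough for this check.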

6. Microsoft Office PowerPoint

Depending on the settings in the configuration file, the API extracts text from PowerPoint files using one of the following:

1. Apache POI (http://poi.apache.org/)

2. xdoc2txt

With either option, text can be extracted from PowerPoint files (ppt) of Office 2003 or earlier and from PowerPoint files (pptx) of Office 2007.

If xdoc2txt is used, xdoc2txt.exe must be installed, and the environment must be set up so that it works correctly.

Note that xdoc2txt.exe works only on Microsoft Windows.

With either option, text cannot be extracted from password-protected PowerPoint files.

7. Microsoft Office Excel

Depending on the settings in the configuration file, the API extracts text from Excel files using one of the following:

1. Apache POI

2. xdoc2txt

With either option, text can be extracted from Excel files (xls) of Office 2003 or earlier and from Excel files (xlsx) of Office 2007.

If xdoc2txt is used, xdoc2txt.exe must be installed, and the environment must be set up so that it works correctly.

Note that xdoc2txt.exe works only on Microsoft Windows.

With either option, text cannot be extracted from password-protected Excel files.
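The platform constraint on xdoc2txt can be expressed as a small pre-flight check for deployment scripts. The sketch below is hypothetical: the function name, the string labels "poi" and "xdoc2txt", and the fallback to Apache POI are assumptions made for the example and do not correspond to values in the configuration file.

```python
import platform
import shutil
from typing import Optional


def choose_office_extractor(preferred: str,
                            system: Optional[str] = None,
                            xdoc2txt_path: Optional[str] = None) -> str:
    """Return "xdoc2txt" only when it can actually run here, otherwise "poi"."""
    system = system or platform.system()
    xdoc2txt_path = xdoc2txt_path or shutil.which("xdoc2txt")
    if preferred == "xdoc2txt" and system == "Windows" and xdoc2txt_path:
        return "xdoc2txt"
    return "poi"  # assumed fallback for this example
```

Called without the optional arguments, the function inspects the current machine. The arguments exist so that the decision can be checked for other environments.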

8. ZIP Archive

Text is extracted from each file in the ZIP archive, and the name of each file is also added to the extracted text.

File names in the ZIP archive are assumed to be encoded in Windows-31J (commonly known as Shift_JIS).

This assumption holds in most cases for ZIP files created on Japanese Windows.

If the file names in the ZIP archive are not encoded in Windows-31J, the extracted file names are corrupted.

Note that the corruption affects only the file names. The text extracted from the files themselves is not affected.

If the ZIP file includes encrypted files, the API extracts only their file names as text.

Encryption can be detected only when the archive is in a ZIP 2.0 compatible format.

ZIP archives (compressed folders) created on Windows are usually compatible with this format.
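The two ZIP-specific points, file name encoding and the encryption flag, can be inspected with the `zipfile` module from the Python standard library. This is an illustrative sketch, not the product's implementation, and the function names are hypothetical. By default `zipfile` decodes names that lack the UTF-8 flag (bit 0x800) as cp437, so the sketch reverses that step and reinterprets the bytes as cp932, Python's codec for Windows-31J. Bit 0x1 of the general purpose flags marks an encrypted entry.

```python
import io
import zipfile


def fix_legacy_name(name_as_cp437: str, name_encoding: str = "cp932") -> str:
    """Recover the raw name bytes and decode them with the assumed encoding."""
    return name_as_cp437.encode("cp437").decode(name_encoding, errors="replace")


def list_entries(zip_bytes: bytes, name_encoding: str = "cp932"):
    """Return (file name, is_encrypted) for every entry in the archive."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            name = info.filename
            if not info.flag_bits & 0x800:
                # No UTF-8 flag: treat the name as Windows-31J, as the text above assumes.
                name = fix_legacy_name(name, name_encoding)
            encrypted = bool(info.flag_bits & 0x1)
            entries.append((name, encrypted))
    return entries
```

If an archive was created with names in some other legacy encoding, `fix_legacy_name` returns corrupted names while the file contents remain readable, which mirrors the limitation described above.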
