TIKA - Extracting PDF

Given below is a program that extracts the content and metadata from a PDF.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfParse {
   // parse() can also throw SAXException, so it must be declared here
   public static void main(final String[] args) throws IOException, SAXException, TikaException {
      // Collects the body text; the default constructor caps output at 100,000 characters
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext pcontext = new ParseContext();

      // Parse the document with the PDF parser; try-with-resources closes the stream
      try (FileInputStream inputstream = new FileInputStream(new File("Example.pdf"))) {
         PDFParser pdfparser = new PDFParser();
         pdfparser.parse(inputstream, handler, metadata, pcontext);
      }

      // Print the extracted content of the document
      System.out.println("Contents of the PDF :" + handler.toString());

      // Print each metadata name together with its value
      System.out.println("Metadata of the PDF:");
      String[] metadataNames = metadata.names();
      for (String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

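Save the PdfParse program as PdfParse.java, then compile and run it from the command prompt with the following commands (this assumes the Tika JAR is already on your CLASSPATH):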

javac PdfParse.java
java PdfParse
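The loop in PdfParse prints metadata names in whatever order metadata.names() returns them, and the sample output on this page does not look sorted. If a stable, alphabetical listing is preferred, a small helper along the following lines could be used. This is only a sketch: the class name MetadataPrinter and the method sortedLines are hypothetical, and the helper deliberately takes a plain String[] and a lookup function so that it compiles without Tika.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Tika.
class MetadataPrinter {
    // Returns "name : value" lines sorted by name; the input array is left unmodified.
    static List<String> sortedLines(String[] names, Function<String, String> lookup) {
        String[] sorted = names.clone();
        Arrays.sort(sorted); // natural String order: uppercase letters sort before lowercase
        List<String> lines = new ArrayList<>();
        for (String name : sorted) {
            lines.add(name + " : " + lookup.apply(name));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in data copied from the sample output on this page.
        Map<String, String> sample = Map.of(
            "pdf:PDFVersion", "1.5",
            "Author", "Krishna Kasyap",
            "Content-Type", "application/pdf");
        String[] names = sample.keySet().toArray(new String[0]);
        for (String line : sortedLines(names, sample::get)) {
            System.out.println(line);
        }
    }
}
```

Inside PdfParse, the equivalent call would be MetadataPrinter.sortedLines(metadata.names(), metadata::get), since Metadata.get(String) returns the value as a String.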

Given below is a snapshot of Example.pdf.

The PDF we are passing has the following properties:

After compiling and running the program, you will get output like the following.

Output:

Contents of the PDF:
Apache Tika is a framework for content type detection and content extraction 
which was designed by Apache software foundation. It detects and extracts metadata 
and structured text content from different types of documents such as spreadsheets, 
text documents, images or PDFs including audio or video input formats to certain extent.
Metadata of the PDF:
dcterms:modified :     2014-09-28T12:31:16Z
meta:creation-date :     2014-09-28T12:31:16Z
meta:save-date :     2014-09-28T12:31:16Z
dc:creator :     Krishna Kasyap
pdf:PDFVersion :     1.5
Last-Modified :     2014-09-28T12:31:16Z
Author :     Krishna Kasyap
dcterms:created :     2014-09-28T12:31:16Z
date :     2014-09-28T12:31:16Z
modified :     2014-09-28T12:31:16Z
creator :     Krishna Kasyap
xmpTPg:NPages :     1
Creation-Date :     2014-09-28T12:31:16Z
pdf:encrypted :     false
meta:author :     Krishna Kasyap
created :     Sun Sep 28 05:31:16 PDT 2014
dc:format :     application/pdf; version = 1.5
producer :     Microsoft® Word 2013
Content-Type :     application/pdf
xmp:CreatorTool :     Microsoft® Word 2013
Last-Save-Date :     2014-09-28T12:31:16Z
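Most date fields in this output use ISO 8601 in UTC (2014-09-28T12:31:16Z), while the created field shows Sun Sep 28 05:31:16 PDT 2014. The two appear to describe the same instant, with the second rendered in US Pacific time. The sketch below checks this using only java.time. The class name MetadataDates and the zone ID America/Los_Angeles are assumptions made for illustration and are not taken from the original program.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Tika.
class MetadataDates {
    // Parses an ISO 8601 UTC timestamp such as "2014-09-28T12:31:16Z".
    static Instant parseIso(String value) {
        return Instant.parse(value);
    }

    // Renders the instant in the given zone using a "Sun Sep 28 05:31:16 PDT 2014" style layout.
    static String formatLikeDate(Instant instant, String zoneId) {
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
        return ZonedDateTime.ofInstant(instant, ZoneId.of(zoneId)).format(formatter);
    }

    public static void main(String[] args) {
        Instant created = parseIso("2014-09-28T12:31:16Z");
        // 12:31:16 UTC is 05:31:16 in a UTC-7 zone, matching the time of day in the created field.
        System.out.println(formatLikeDate(created, "America/Los_Angeles"));
    }
}
```

When comparing or storing these values, parsing the ISO 8601 fields is usually the safer choice, because the created layout depends on the time zone of the machine that produced it.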
