Extracting text from PDF files in Java with Apache Tika

Updated: April 21, 2025 08:39:11 Author: Xueliang Programming Notes

Apache Tika is a content analysis toolkit that can extract metadata and text content from various types of documents. Let’s take a look at how to use Apache Tika to extract text from PDF files.

Tika supports many file formats, including PDF, Word documents, Excel spreadsheets, PowerPoint presentations, HTML, XML, and image files. Its design goal is to provide a simple, consistent way to handle files in different formats.

Supported formats

File format	Underlying library	Tika parser class
XML		XMLParser
HTML	TagSoup	HtmlParser
MS Office compound documents (OLE2 up to 2007, OOXML from 2007 onward)	Apache POI	OfficeParser (OLE2), OOXMLParser (OOXML)
OpenDocument Format (OpenOffice)		OpenOfficeParser
Portable Document Format (PDF)	Apache PDFBox	PDFParser
Electronic Publication Format (digital books)		EpubParser
Rich Text Format		RTFParser
Compression and packaging formats	Apache Commons Compress	PackageParser, CompressorParser, and their subclasses
Text format		TXTParser
Feed and syndication formats		FeedParser
Audio formats		AudioParser, MidiParser, Mp3Parser (for MP3)
Image formats		JpegParser (for JPEG images)
Video formats	The Flash video parser uses a simple internal algorithm	MP4Parser, FLVParser
Java class files and JAR files		ClassParser, CompressorParser
Mbox format (email)		MboxParser
CAD formats		DWGParser
Font formats		TrueTypeParser
Executable programs and libraries		ExecutableParser

Main functions

Metadata Extraction: Tika can extract metadata information such as author, creation date, modification date, etc. from a file.

Text Extraction: Tika is able to parse files and extract text contents, which is very useful for applications that require full text search or natural language processing of documents.

Language Detection: Tika also has the ability to recognize the language used by the document.

MIME type detection: Determines the MIME type by the content of the file (for example, application/pdf or text/plain).
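To make "detection by content" concrete, here is a minimal sketch that guesses a type from the first bytes of a file (its "magic number") using only the JDK. It is not Tika's implementation, which relies on a much larger signature database plus other hints; the class name MagicByteSniffer and the two signatures it knows are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified illustration of content-based type detection; not how Tika does it internally. */
public class MagicByteSniffer {

    // Leading bytes ("magic numbers") of two well-known formats.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04}; // "PK" followed by 0x03 0x04

    /** Returns a MIME type guessed from the first bytes of the content. */
    public static String detect(byte[] content) {
        if (startsWith(content, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(content, ZIP_MAGIC)) {
            return "application/zip";
        }
        // Fallback for anything this sketch does not recognize.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(content, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] unknown = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader)); // application/pdf
        System.out.println(detect(unknown));   // application/octet-stream
    }
}
```

Because the decision is based on the bytes themselves, a PDF renamed to another extension is still recognized as a PDF, which is the main advantage over extension-based guessing.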

Use cases

Search Engine: When building enterprise-level search systems, you can use Tika to index unstructured data.

Data Analysis: Tika provides a powerful tool set for data analysis projects that require information to be collected from a large number of documents in different formats.

Document Management System: Tika helps build smarter document management solutions that automatically classify and tag uploaded files.

Security Audit: Check whether files passing in or out of the organization's boundaries contain sensitive information.
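As a rough illustration of the security audit case, the sketch below scans text that has already been extracted from a document for strings that look like email addresses. It uses only the JDK; the class name SensitiveTextScanner and the deliberately simple pattern are assumptions for illustration, and a real audit would need much broader and stricter rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for email-like strings in already-extracted text. */
public class SensitiveTextScanner {

    // Deliberately simple pattern for things that look like email addresses.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns every email-like substring found in the extracted text, in order. */
    public static List<String> findEmails(String extractedText) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(extractedText);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for text that a parser has already extracted from a document.
        String text = "Contact alice@example.com or bob@example.org for the report.";
        System.out.println(findEmails(text)); // [alice@example.com, bob@example.org]
    }
}
```

Keeping the scan separate from the extraction step means the same check can run on text from any supported format, not only PDF.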

How to use Apache Tika

1. Installation

You can add Tika to your Java project via Maven. Add the following dependencies to your pom.xml file:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.4.1</version> <!-- Adjust to the latest version -->
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <!-- In Tika 2.x the standard parsers, including the PDF parser, ship in this artifact -->
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.4.1</version> <!-- Same as above -->
</dependency>

2. Sample code

Here is a simple example of how to extract text from a PDF file using Tika:

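import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class TikaExample {
    public static void main(String[] args) {
        // The path to the PDF is passed as the first command-line argument
        try (FileInputStream input = new FileInputStream(new File(args[0]))) {
            // Create a Tika instance
            Tika tika = new Tika();

            // Get the MIME type of the file
            String mimeType = tika.detect(input);
            System.out.println("Detected MIME type: " + mimeType);

            // Detection consumed part of the stream, so rewind it to the start
            input.getChannel().position(0);

            // Prepare the parser; -1 means no limit on output size
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parse the PDF and collect its content
            PDFParser parser = new PDFParser();
            parser.parse(input, handler, metadata, context);

            // Output the result
            System.out.println("Extracted text:\n" + handler.toString());
            System.out.println("Metadata:");
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}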

This code first detects the MIME type of the given file, then uses a PDFParser object to parse the file and prints the extracted text along with the metadata Tika found.
