Extracting text and images from a PDF with Java

PDF files often carry a lot of useful content. To make better use of it, you need a tool that can extract the text and images from the PDF. The examples below show how to extract text and images from a PDF in Java.

Tools used:

- Free Spire.PDF for Java 2.4.4 (free version)

- IntelliJ IDEA / Eclipse

Importing the jar package:

--Method 1: Download the Free Spire.PDF for Java package from the official site and unzip it. The Spire.Pdf.jar file is in the lib folder under the extraction path. Add this jar to the project's libraries: in IntelliJ IDEA, open Project Structure with Shift + Ctrl + Alt + S; in Eclipse, use the equivalent project settings. Once added, the jar appears among the project's libraries.

--Method 2: Install from the Maven repository. See the installation guide (https://www.e-iceblue.com/Tutorials/Licensing/How-to-install-Spire.PDF-for-Java-from-Maven-Repository.html).

The source PDF used for testing is the data/Sample.pdf file loaded in the examples below.

Java code examples:

[Example 1] Extract the text content of a PDF

** Step 1: ** Import the required classes;

import com.spire.pdf.*;
import java.io.FileWriter;

** Step 2: ** Create a PdfDocument instance and load the source PDF file;

// Create a PdfDocument instance
PdfDocument doc = new PdfDocument();
// Load the PDF file
doc.loadFromFile("data/Sample.pdf");

** Step 3: ** Create a StringBuilder as a text buffer, then loop through every page of the document and append the text returned by extractText() to the buffer;

// Traverse the PDF; page indexes start at 0, so begin at 0 to include the first page
StringBuilder buffer = new StringBuilder();
for (int i = 0; i < doc.getPages().getCount(); i++) {
    PdfPageBase page = doc.getPages().get(i);
    // Append the text of the current page
    buffer.append(page.extractText());
}
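A page collection indexed from 0 (as the loop in Example 2 treats it) needs a loop that starts at 0 and stops before the page count; starting at 1 would skip the first page. The following is a minimal sketch of that accumulation pattern using only the JDK. A `List<String>` stands in for the per-page results of extractText(), and `PageTextJoiner` and `joinPages` are hypothetical names that are not part of Spire.PDF. Appending a line separator after each page is an optional addition so that text from adjacent pages does not run together.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spire.PDF.
final class PageTextJoiner {

    // pageTexts stands in for the text that extractText() returns for each page.
    static String joinPages(List<String> pageTexts) {
        StringBuilder buffer = new StringBuilder();
        // Start at 0 and stop before size() so every page is visited exactly once.
        for (int i = 0; i < pageTexts.size(); i++) {
            buffer.append(pageTexts.get(i));
            // Separate pages so the last line of one page does not merge with the next.
            buffer.append(System.lineSeparator());
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("Text of page 1", "Text of page 2", "Text of page 3");
        System.out.print(joinPages(pages));
    }
}
```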

** Step 4: ** Create a FileWriter, use write() to write the contents of the buffer to a text.txt file, then flush and close the writer to save it.

// Save the text; the output folder must already exist
String fileName = "output/text.txt";
FileWriter writer = new FileWriter(fileName);
writer.write(buffer.toString());
writer.flush();
writer.close();
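The save step can also be written so that the writer is closed even when write() throws and so that the file encoding is explicit. The following is a minimal sketch using only the JDK, not the method used above: `TextFileSaver` and `save` are hypothetical names, the choice of UTF-8 is an assumption, and the sample string stands in for buffer.toString().

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Spire.PDF.
final class TextFileSaver {

    static Path save(String text, Path target) throws IOException {
        // Create the parent folder (for example "output") if it does not exist yet.
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // try-with-resources flushes and closes the writer even if write() throws.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for buffer.toString() from Step 3.
        String extracted = "Sample text from page 1\nSample text from page 2\n";
        Path saved = save(extracted, Paths.get("output", "text.txt"));
        System.out.println("Wrote " + Files.size(saved) + " bytes to " + saved);
    }
}
```

In the example above, the call would be `TextFileSaver.save(buffer.toString(), Paths.get("output", "text.txt"))`.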

Text extraction result:

[Example 2] Extract the images in a PDF

** Step 1: ** Import the required classes;

import com.spire.pdf.*;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;

** Step 2: ** Create a PdfDocument instance and load the source PDF file;

        // Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        // Load the PDF file
        pdf.loadFromFile("data/Sample.pdf");

** Step 3: ** Loop through each page of the PDF, get the images on the current page with the extractImages() method, and save each image in PNG format.

        // Counter used to give every image a unique file name
        int index = 0;
        // Loop through the pages
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            // Get the current page
            PdfPageBase page = pdf.getPages().get(i);
            // Extract the images from the current page
            for (BufferedImage image : page.extractImages()) {
                // Specify the file path and name (no trailing space in the name)
                File output = new File("output/" + String.format("Image_%d.png", index++));
                // Save the image as a .png file
                ImageIO.write(image, "PNG", output);
            }
        }
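The saving part of this loop has two easy-to-miss failure points: ImageIO.write needs the output folder to exist, and it returns false rather than throwing when no writer can handle the image. The following is a minimal sketch of a more defensive save step using only the JDK. `ImageSaver` and `saveAll` are hypothetical names that are not part of Spire.PDF, and the blank images created in main stand in for the results of extractImages().

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spire.PDF.
final class ImageSaver {

    static List<File> saveAll(List<BufferedImage> images, File outputDir) throws IOException {
        // Create the output folder if it is missing.
        if (!outputDir.isDirectory() && !outputDir.mkdirs()) {
            throw new IOException("Cannot create folder " + outputDir);
        }
        List<File> written = new ArrayList<>();
        int index = 0;
        for (BufferedImage image : images) {
            // Image_0.png, Image_1.png, ... with no stray whitespace in the name.
            File output = new File(outputDir, String.format("Image_%d.png", index++));
            // ImageIO.write returns false when no suitable writer is found.
            if (!ImageIO.write(image, "PNG", output)) {
                throw new IOException("No PNG writer available for " + output);
            }
            written.add(output);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the images returned by extractImages().
        List<BufferedImage> images = Arrays.asList(
                new BufferedImage(40, 30, BufferedImage.TYPE_INT_RGB),
                new BufferedImage(20, 20, BufferedImage.TYPE_INT_RGB));
        for (File file : saveAll(images, new File("output"))) {
            System.out.println("Saved " + file.getPath());
        }
    }
}
```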

Image extraction result: