A review on knowledge and information extraction from PDF documents and storage approaches

Publication Type

Journal Article

Journal Name

Frontiers in Artificial Intelligence

Publication Date

1-1-2025

Abstract

Introduction: Automating the extraction of information from Portable Document Format (PDF) documents represents a major advancement in information extraction, with applications in domains such as healthcare, law, and biochemistry. However, existing solutions face challenges related to accuracy, domain adaptability, and implementation complexity. Methods: A systematic review of the literature was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology to examine approaches and trends in PDF information extraction and storage. Results: The review revealed three dominant methodological categories: rule-based systems, statistical learning models, and neural network-based approaches. Key limitations include the rigidity of rule-based methods, the lack of annotated domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models. Discussion: To overcome these limitations, a conceptual framework is proposed comprising nine core components: project manager, document manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework aims to improve the accuracy, adaptability, and usability of PDF information extraction systems.

Keywords

knowledge base, knowledge extraction, knowledge graphs, large language models, natural language processing

Recommended Citation

Atagong, S., Tonnang, H., Senagi, K., Wamalwa, M., Agboka, K., & Odindi, J. (2025). A review on knowledge and information extraction from PDF documents and storage approaches. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1466092
