Businesses rely on an inflow of documents to drive processes and make decisions. Many of these documents arrive combined into a single file. For example, a loan application may contain a driver’s license, paystub, W2, bank statement, and other document types within a single file. Handling many document types within a single file is complex, which makes these files difficult for businesses to process at scale.

Google Cloud is committed to solving these challenges with continued investment in our Document AI solutions suite, which offers machine learning products for document processing and insights. Document AI Workbench helps users quickly build ML models with world-class accuracy, trained for their specific use cases. In February 2023, Google launched the Custom Document Extractor (CDE) in General Availability (GA) to help users extract structured data from documents in production use cases. In March 2023, Google launched the Custom Document Classifier (CDC) in GA to help automatically classify document types. Today, Google announces the newest feature of Document AI Workbench, Custom Document Splitter (CDS), in GA to help users automatically split and classify multiple documents within a single file.

CDS provides tangible business value to customers by helping them sort and classify documents. For example, businesses can validate whether they have all the needed documents from an applicant. Furthermore, individually classified documents enable businesses to better automate downstream processes, including selecting the proper storage, analysis, or processing steps based on the document type. The efficiencies enabled by CDS help businesses lower their document processing time and cost.
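As a rough illustration of the completeness check and routing described above, the sketch below works on a list of document types that a splitter has already produced for one file. The type names, the required set, and the routing table are hypothetical placeholders, not part of Document AI.

```python
# Hypothetical document types that a loan application must contain.
REQUIRED_TYPES = {"drivers_license", "paystub", "w2", "bank_statement"}

# Hypothetical mapping from document type to a downstream processing step.
ROUTES = {
    "drivers_license": "identity-verification",
    "paystub": "income-extraction",
    "w2": "income-extraction",
    "bank_statement": "asset-review",
}


def missing_types(found_types, required=REQUIRED_TYPES):
    """Return the required document types that are absent, sorted by name."""
    return sorted(set(required) - set(found_types))


def route(found_types, routes=ROUTES, default="manual-review"):
    """Pair each classified document with a downstream step."""
    return [(doc_type, routes.get(doc_type, default)) for doc_type in found_types]


# Types reported for one submitted file (illustrative values).
found = ["paystub", "w2", "drivers_license", "utility_bill"]
print(missing_types(found))
print(route(found))
```

Unknown types fall back to a manual review step here, which is one possible design choice; a real pipeline might reject or quarantine them instead.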

Benefits of splitting and classification models in Document AI Workbench 

Document AI Workbench can save time and money by simplifying model training, from dataset management, to testing, to deployment. CDS helps businesses achieve higher automation rates to scale processes while lowering costs.

Sean Earley, VP of Delivery Services at Zencore said, “We completed a project for a large bank using Document AI Workbench to split, classify, and extract data from documents to automate Home Mortgage Disclosure Act reporting. Given the accuracy of the models we built, our client estimated increasing loan reporting coverage from 20% to 100% while eliminating thousands of errors per year, drastically reducing the operational cost of the bank’s compliance reporting procedures.”  

Fabian Beckmann, Manager Artificial Intelligence & Data at Deloitte Consulting GmbH said, “By leveraging Document AI’s Custom Document Splitter, our client, Commerzbank, a large European bank, can effortlessly segment customer submissions tailored to their back-office requirements, significantly diminishing the need for extra manual sorting or routing. This integration paves the way towards seamless automation within the Document AI pipeline, delivering substantial business benefits.”

According to Kaïs Albichari – ML Tribe Tech Lead, G Cloud at IT services firm Devoteam, “Custom Document Splitter (CDS) has helped one of our clients in the financial services industry save significant time and improve data accuracy. By identifying which parts of documents they can discard and which they retain for entity extraction, CDS has helped the company automate its document processing tasks. The implementation resulted in a more efficient and streamlined workflow, freeing employees to focus on other tasks. Devoteam’s G Cloud team helped the company implement CDS and achieve these benefits.”

Frank Neugebauer, a Google Cloud Insurance Solutions Consultant, worked with an insurance company and used CDS to create a model that splits and classifies millions of insurance documents with up to 98% accuracy. With this information, the insurer reports that it can better understand the nature of its unstructured data to inform business strategy, including the volume of specific document types to prioritize extraction work. The customer considers this level of insight unprecedented in its history.

How to use Custom Document Splitter

You can leverage a simple interface in the Google Cloud Console and a set of public APIs to prepare training data, create and evaluate models, deploy a model into production, and call an API endpoint to split and classify document types. You can follow the documentation for instructions to create, train, evaluate, deploy, and run predictions with models.

Import and prepare training data

To get started, import and label documents to train and evaluate an ML model. 

To quickly build a training dataset, import single documents, one document per file, and bulk label them with the relevant document type. You can import one folder or multiple folders at once and choose the correct document type per folder. As shown in the next image, one import could have a folder with 200 bank statements, another folder with 200 W2s, another folder with 200 paystubs, and so on, all of which are labeled at once during import. Up to 30,000 documents and 100,000 pages can be imported for training. This way, you can build a training dataset with hundreds of labeled documents per class in minutes. As always, if documents are already labeled using other tools, simply import the labels as JSON in the Document format.
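Before importing, it can help to check how a local folder-per-type layout maps to labels. The sketch below is a local pre-check only and does not call any Document AI API. It assumes a hypothetical layout in which each subfolder name is the intended document type, and it checks only the 30,000-document figure quoted above. The page limit is not checked, because counting pages would require a PDF library.

```python
from pathlib import Path

MAX_DOCUMENTS = 30_000  # document limit quoted in the text above


def build_manifest(root):
    """Map each subfolder name (used as the label) to the PDF files inside it."""
    manifest = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            manifest[folder.name] = sorted(p.name for p in folder.glob("*.pdf"))
    return manifest


def check_document_limit(manifest, limit=MAX_DOCUMENTS):
    """Return (total document count, whether the count is within the limit)."""
    total = sum(len(files) for files in manifest.values())
    return total, total <= limit
```

For a directory with subfolders such as bank_statement, w2, and paystub, build_manifest returns one entry per folder, and check_document_limit reports whether the combined count fits within the limit.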

You can initiate training with the click of a button. Once you have trained a model, you can use it to automatically label documents added to your dataset, letting you quickly build robust test and training datasets to evaluate and improve model performance.

To accurately evaluate a CDS model, import files that contain multiple document types within the same file and assign them to the test dataset. Then, use a simple interface to define document boundaries and types.

The ground truth you label in the test dataset is used to evaluate splitting and classification predictions from the CDS model.
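To make the comparison between ground truth and predictions concrete, the sketch below scores predicted segments against labeled ones using exact matching on page range and type. This is one simple way to score a splitter and is an assumption for illustration. It is not necessarily how Document AI Workbench computes its evaluation metrics, and the segment tuples are made-up examples.

```python
def score_segments(ground_truth, predicted):
    """Exact-match precision, recall, and F1 over segments.

    Each segment is a (start_page, end_page, doc_type) tuple. A prediction
    counts as correct only if all three fields match a labeled segment.
    """
    truth, pred = set(ground_truth), set(predicted)
    true_positives = len(truth & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative five-page file: the model misplaces one boundary,
# so two of its three segments no longer match the labels.
truth = [(0, 1, "bank_statement"), (2, 2, "w2"), (3, 4, "paystub")]
pred = [(0, 1, "bank_statement"), (2, 3, "w2"), (4, 4, "paystub")]
print(score_segments(truth, pred))
```

Exact matching is strict: a single misplaced boundary invalidates the segments on both sides of it, which is why one error in this example costs two matches.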

Going into production

Once a model meets accuracy targets, it’s time to deploy into production and call the API endpoint to split and classify document types.
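A minimal sketch of such a call with the google-cloud-documentai Python client library follows. It assumes the library is installed, credentials are configured, and a splitter processor is already deployed. The project ID, location, and processor ID arguments are placeholders you would supply. It also assumes the split results arrive as entities carrying a type, a confidence, and page references, which is how the public client library samples read splitter output.

```python
def pages_by_type(entities):
    """Turn splitter entities into (doc_type, confidence, [page indexes]) tuples."""
    segments = []
    for entity in entities:
        pages = [int(ref.page) for ref in entity.page_anchor.page_refs]
        segments.append((entity.type_, entity.confidence, pages))
    return segments


def split_and_classify(project_id, location, processor_id, file_path,
                       mime_type="application/pdf"):
    """Send one file to a deployed splitter processor and return its segments."""
    # Imported here so pages_by_type can be used without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    name = client.processor_path(project_id, location, processor_id)
    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type=mime_type)
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    return pages_by_type(result.document.entities)
```

Each returned tuple describes one detected document: its type, the model's confidence, and the zero-based page indexes it covers in the original file.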

Getting started with Document AI Workbench 

Custom Document Splitter is publicly available in GA and ready to help customers automate document splitting and classification.