Recently the CIA released the previously secret documents surrounding the investigation of the JFK assassination. In the words of the National Archives:
“John F. Kennedy was killed on November 22, 1963. Almost 30 years later, Congress enacted the President John F. Kennedy Assassination Records Collection Act of 1992. The Act mandated that all assassination-related material be housed in a single collection in the National Archives and Records Administration (NARA). The resulting Collection consists of more than 5 million pages of assassination-related records (approximately 2,000 cubic feet of records).”
We had heard lots of rumors of exciting never-seen-before information such as:
Even in lockdown, we don't have the time to read all 5 million pages, so we wanted to upload all of the documents into SharePoint and use enterprise search to help us find the answers.
We contacted the National Archives and they kindly gave us access to the bulk downloads of the documents.
In total there were 49,118 files containing a little over 5 million pages. The total size of the files was 33.8 GB.
Adding so many large files to SharePoint will certainly take some time, and we also needed to know whether SharePoint would be able to search the text within them.
We used Ocrato to run an audit of the files to see if they could be optimised before we migrated them into SharePoint. You can download a free copy of Ocrato, which allows for unlimited File Share and SharePoint audits. It took a few hours to analyze the files, and here is what we learnt.
So, we have discovered a few key facts by auditing the files:
It seems a good place to start would be to make the files as small as possible. That way, any further processing (OCR, de-duplication and migration) should be quicker.
We ran Ocrato hyper-compression against the files. This keeps the image quality the same but makes them much, much smaller (more details on file hyper-compression). Here is the before and after view of the files following hyper-compression.
The single step of hyper-compression reduced the total file size from 33.8 GB to 10.2 GB, saving us 70% on storage! Although this is impressive, it is worth noting that in a more typical set of business documents the reductions are often larger still.
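For anyone who wants to check the arithmetic, here is a minimal sketch of how the saving works out from the two totals quoted above. The function name `percent_saved` is our own, for illustration only, and is not part of Ocrato.

```python
def percent_saved(before_gb: float, after_gb: float) -> float:
    """Return the storage saving as a percentage of the original size."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    return (before_gb - after_gb) / before_gb * 100


# Totals taken from the before and after figures in this post.
saving = percent_saved(33.8, 10.2)
print(f"Saved {saving:.1f}% of the original storage")  # Saved 69.8% of the original storage
print(f"Compression ratio: {33.8 / 10.2:.1f}x")        # Compression ratio: 3.3x
```

A saving expressed this way can never exceed 100%, so for very compressible content the ratio (for example "3.3x smaller") is often the more telling number.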
Now that we have shrunk the file sizes, let's remove any duplicates. We want to reduce the number of files to migrate to our SharePoint search, and the initial audit identified around 1,400 duplicate files.
Just to be sure we aren't losing any information, we will run the de-duplicate report in "report only" mode initially. This report interrogates the files a little deeper than the audit report.
Having checked the report, we can see they are all genuine duplicate documents. In this case we will use Ocrato to simply delete the duplicates, but with normal business content these could be archived or replaced with a link to the original file (including within SharePoint).
Sure enough, there were 1,400 duplicate files in total, and removing them saved us another 500 MB. We have now reduced the storage from 33.8 GB to 9.7 GB. Of course, this also means fewer files to migrate over to SharePoint.
Again, although saving another 500 MB of storage is useful, in our experience with large organisations the proportion of duplicate files is usually between 15% and 25%. Please see file de-duplication for more details.
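To give a feel for what a "report only" pass involves, here is a minimal sketch that groups byte-identical files by their SHA-256 hash and deletes nothing. It is an illustration under our own assumptions, not how Ocrato works internally: it only catches exact copies, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large PDFs are not loaded whole."""
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def duplicate_report(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep only groups of 2 or more."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def reclaimable_bytes(report: dict[str, list[Path]]) -> int:
    """Bytes freed if every copy except the first in each group were removed."""
    return sum(p.stat().st_size for paths in report.values() for p in paths[1:])
```

Calling `duplicate_report(Path("some_folder"))` on a folder of your own returns the groups to review, and `reclaimable_bytes` estimates the saving before anything is archived, linked or deleted.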
Ocrato uses industry-leading optical character recognition (OCR) to convert scanned documents such as the JFK archives into searchable files. It adds a text layer to each page, can even read handwriting, and works with hundreds of languages and file types.
If we uploaded the PDF files directly into SharePoint, they would not be searchable because there is no text in the files; they are just images (more on SharePoint file OCR).
We have 5 million pages to read, so this is definitely a job to start at the end of the day, allowing some serious OCR processing to happen overnight.
Now that we have compressed, de-duplicated and run OCR against the JFK documents, it is time to upload them to SharePoint.
As there are around 50,000 documents we will use our favourite SharePoint FTP Client, SPFileZilla.
We added each folder as a new library to keep under the SharePoint item limit of 5,000. Once they are uploaded, we will ask SharePoint Online to re-index the libraries (Library Settings -> Advanced Settings -> Re-Index library) so we can get searching.
We can now search using any keywords we wish directly against the documents. As they have all been OCR'd, all of the text is now available to query against.
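If your files are not already split into conveniently sized folders, the same idea can be planned by count. The sketch below is an assumption-laden illustration rather than part of our actual upload: the file names are placeholders, and the total is simply the 49,118 files minus the roughly 1,400 duplicates removed earlier.

```python
from itertools import islice
from typing import Iterable, Iterator

ITEM_LIMIT = 5000  # the per-library item limit mentioned above


def batches(items: Iterable[str], size: int = ITEM_LIMIT) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Placeholder file names standing in for the de-duplicated archive.
files = [f"doc_{i:05d}.pdf" for i in range(49_118 - 1_400)]
libraries = list(batches(files))
print(len(libraries), len(libraries[-1]))  # 10 2718
```

Passing a smaller `size` leaves headroom in each library for files added later.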
The CIA did indeed try to hire the Mafia to kill Fidel Castro (PDF)
Oswald did attempt to join the KGB but was rejected for being a 'neurotic maniac' (PDF)
Candy Cane helped the CIA with their enquiries (PDF)
Ocrato was able to reduce the size of the files by around 70% (from 33.8 GB to 9.7 GB), remove the duplicates and use OCR to convert 5 million pages to a searchable format.