Stories from the edge. 

Life and limb with a dash of infosec and litigation support. 

Big Data. Understanding a culture of information generation.

Posted by Sid Newby

May 22, 2014 8:59:00 AM

We just attended a great summit in Phoenix where it was demonstrated that, with significant resources, 1.5 terabytes of client data could be processed in 24 hours with a fair level of ease. This represented about 1,500 custodians and 15 million individual text records, with full metadata extraction, text extraction, OCR where possible, and a full database index and optimization. The crowd was seriously impressed.

Call me a bully, but I wasn't impressed.

When we're creating 5 billion gigabytes of organic data every 48 hours, we're going to need a bigger pipe than that. Much bigger. The problem isn't with the code we saw demonstrated. It's much more diabolical than that.

While the process completed in just under 24 hours, 16 hours of that was SQL optimization. Database table concatenation, text indexing and general cleanup took twice as long as the extraction and processing itself. The problem isn't the process. The process is sound. Even awesome. The problem is the database.
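To put numbers on that, here is a quick back-of-the-envelope sketch in Python using only the figures quoted in this post. It assumes decimal units (1 terabyte = 1,000 gigabytes) and rounds "just under 24 hours" up to 24. The constant and function names are made up for illustration and are not anything shown at the summit.

```python
# Back-of-the-envelope arithmetic using the figures quoted in the post.
# Assumptions: decimal units (1 TB = 1000 GB) and a run time rounded to 24 hours.
DATA_GB = 1.5 * 1000               # 1.5 terabytes of client data
TOTAL_HOURS = 24                   # "just under 24 hours", rounded up
SQL_HOURS = 16                     # time spent on SQL optimization
RECORDS = 15_000_000               # individual text records
GLOBAL_GB_PER_48H = 5_000_000_000  # "5 billion gigabytes ... every 48 hours"


def pipeline_summary():
    """Break the demonstrated run into extraction time and database time."""
    extraction_hours = TOTAL_HOURS - SQL_HOURS
    return {
        "extraction_hours": extraction_hours,
        "sql_to_extraction_ratio": SQL_HOURS / extraction_hours,
        "sql_share_of_run": SQL_HOURS / TOTAL_HOURS,
        "gb_per_hour": DATA_GB / TOTAL_HOURS,
        "records_per_hour": RECORDS / TOTAL_HOURS,
        # Hypothetical throughput if the database step cost nothing at all.
        "gb_per_hour_without_sql": DATA_GB / extraction_hours,
    }


def global_gap(gb_per_hour):
    """How many times faster the quoted global data creation rate is."""
    global_gb_per_hour = GLOBAL_GB_PER_48H / 48
    return global_gb_per_hour / gb_per_hour


if __name__ == "__main__":
    summary = pipeline_summary()
    for name, value in summary.items():
        print(f"{name}: {value:,.2f}")
    print(f"global_gap: {global_gap(summary['gb_per_hour']):,.0f}x")
```

On these numbers the database work eats two thirds of the run, and dropping it entirely would only triple throughput, from 62.5 to 187.5 gigabytes per hour. That is still nowhere near the rate at which data is being created.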

The status quo in our industry right now is Microsoft SQL Server. While upcoming technologies like Hekaton promise to move SQL operations forward at breakneck speed by moving individual tables into memory, they're not a solution for big data.

There are some rebel entities in our industry that have stepped outside of the SQL sandbox and have built successful platforms on newer big-data technologies like Hadoop, an open source framework for distributed storage and processing from the Apache Software Foundation. Other upstarts are using even cooler open source implementations like RavenDB, a NoSQL document database. Open source technologies like Lucene, a full-text indexing and search library, allow developers to access document and message content for high-speed extraction without altering or even executing the file in question.

However, our industry is still fighting the uphill battle with Microsoft's limited .NET Framework and Microsoft's SQL Server. Understandably, it's everywhere. Chances are, your company's backup servers, antivirus, accounting platforms, your website and more are already running some implementation of SQL Server. And why not? It's easy to use, there are tons of connectors for it, it's well supported and extremely well documented. Right now, it's the path of least resistance.

However, one day soon it won't be enough.

Enjoy a video. Big data platforms like Hadoop are growing by leaps and bounds. Be excited. http://youtu.be/usPbkw-p9k0

Topics: Platinum Culture, Litigation Support Technology