WTF Is Unstructured Data, and Is It the Future of Data Management?
by Sarah O'Neill, on 26 August, 2021
Unstructured data was the reason I failed my science exams at school, apparently. Thanks Dr. Nandra! 🙃
But this is a different type of unstructured data than a sub-par bar chart. No, this unstructured data is good.
According to the latest figures, the volume of unstructured data is set to grow from 33 zettabytes in 2019 to 175 zettabytes, or 175 billion terabytes, by 2025. That's an unfathomable amount. I didn't even know there was such a thing as a zettabyte. If you told me Zetta Byte was a drag queen, I'd believe you.
Although there's a lot of nerdy speculation around unstructured data, most experts believe that 80% to 90% of the data in the world is unstructured. On top of this, 90% of this data has been created in the last two years alone. But only 0.5% is analysed and used today.
"Intuitively, [the average IT organisation] knows that much is unstructured and it grows by two digits, but they do not know exactly how much they have and how fast it is growing," says President and COO of Komprise, Krishna Subramanian.
But before we define unstructured data, we need to define structured data.
So, what is structured data?
Structured data is usually stored in a relational database, run by a relational database management system (RDBMS). It can be easily mapped into designated fields, be it postcodes, credit card numbers, or phone numbers. Data that conforms to this structure is easy to search for, by humans or software.
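To make that a bit more concrete, here's a minimal sketch of structured data using Python's built-in sqlite3 module. The table name, columns, and customer details are all made up for illustration; the point is only that every value sits in a designated field, so finding it is a one-line query.

```python
import sqlite3

# Hypothetical table and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, postcode TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ada", "SW1A 1AA", "020 7946 0000"),
        ("Grace", "M1 1AE", "0161 496 0000"),
    ],
)


def find_by_postcode(postcode):
    """Return the names of customers whose postcode matches exactly."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE postcode = ? ORDER BY name",
        (postcode,),
    )
    return [name for (name,) in rows]


print(find_by_postcode("M1 1AE"))
```

Because the postcode lives in its own column, the database doesn't have to guess where to look.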
So, unstructured data is the opposite. It doesn't fit into these predefined data models - it can't be neatly stored in the rows and columns of an RDBMS. Because of the many, many formats, it's difficult for conventional software to ingest, process and analyse. But simple content searches can be performed across textual unstructured data with the right tools.
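As a rough idea of what a "simple content search" might look like, here's a small Python sketch. The documents and the search function are hypothetical, and real tools do far more (indexing, ranking, handling non-text formats), so treat this as the bare minimum rather than how any particular product works.

```python
# Hypothetical free-text documents, invented purely for illustration.
documents = {
    "support_ticket.txt": "The app crashed when I uploaded a photo.",
    "review.txt": "Great service, but delivery was slow.",
    "call_notes.txt": "Customer reported a crash after the latest update.",
}


def search(keyword, docs):
    """Return the names of documents containing the keyword, ignoring case."""
    needle = keyword.lower()
    return sorted(name for name, text in docs.items() if needle in text.lower())


print(search("crash", documents))
```

Notice there are no fields to query here: the search has to read every word of every document, which is exactly why unstructured data is harder to work with at scale.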
And as a nice additional treat, let's have a quick look at semi-structured data. This form of data is largely unstructured, but uses internal tags and markings to separate its various data elements, so that they can be placed into pairings and hierarchies.
Email is an example - the metadata in an email allows analytics tools to classify it and search for keywords. Sensor data, social media data, markup languages, and data held in NoSQL databases are further examples: largely unstructured, but more searchable, and so may be considered semi-structured data.
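Here's a minimal sketch of the email case, using Python's standard email module. The message itself is made up, and the summarise helper is a hypothetical name. The headers come out as tidy key-value pairs, while the body is still just free text.

```python
from email import message_from_string

# A made-up message, invented purely for illustration.
raw_email = """\
From: alice@example.com
To: support@example.com
Subject: Refund request

Hi, I'd like a refund for my last order. Thanks!
"""


def summarise(raw):
    """Split an email into its tagged metadata and its free-text body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


print(summarise(raw_email))
```

The sender and subject can be filtered and sorted like structured fields; the body would still need something like a keyword search to make sense of it.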
Semi-structured data is basically the bridge between structured and unstructured. It doesn't have a predefined data model, and is a bit more complex than structured data, but is easier to store than unstructured data.
A further look into unstructured data
Unstructured data is basically information, in many different forms, that doesn't follow conventional data models, making it difficult to store and manage in a normal, mainstream database. Unstructured data is the most abundant form of data, because it can be anything - audio, sensor data, media, imaging, IoT, analytics, text, and so on.
Until recently, companies have largely been unable to tap into high-value data, like rich media, social network conversations, or customer interactions, because the tools needed to analyse it have only recently been developed and commercialised.
Organisations are increasingly relying on unstructured data for regulatory, analytical, and decision-making purposes. Plus, it supports machine learning and business intelligence.
What is unstructured data used for?
Because of its nature, unstructured data isn't entirely suited to the processing applications that handle structured data. Instead, it's primarily used for business intelligence and analytics. Companies analyse unstructured data to improve customer experience and develop impactful targeted marketing.
Analysing data from IT systems can highlight trends and limitations, and can flag system crashes and other issues. Unstructured data analytics can also aid compliance efforts by helping companies understand what their corporate documents and records contain.
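For a flavour of the IT systems case, here's a small Python sketch that tallies log lines by severity and pulls out the crashes. The log format (date, time, level, message) and the FATAL label are assumptions made for this example, not a standard that every system follows.

```python
from collections import Counter

# Hypothetical log lines in an assumed "date time LEVEL message" format.
log_lines = [
    "2021-08-26 09:00:01 INFO service started",
    "2021-08-26 09:05:12 ERROR database timeout",
    "2021-08-26 09:05:13 FATAL service crashed",
    "2021-08-26 09:06:00 INFO service restarted",
]


def count_levels(lines):
    """Count how many lines were logged at each severity level."""
    return Counter(line.split()[2] for line in lines)


def crash_lines(lines):
    """Return only the lines logged at the (assumed) FATAL level."""
    return [line for line in lines if line.split()[2] == "FATAL"]


print(count_levels(log_lines))
print(crash_lines(log_lines))
```

Run across months of logs rather than four lines, the same kind of tally is what starts to reveal trends.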
What are the pros (and cons) of unstructured data?
Well, unstructured data:
- Remains undefined until needed. When stored in its native format, unstructured data stays undefined until it's used. This adaptability increases the range of file formats that can be stored, which widens the data pool, so data scientists can prepare and analyse only the data they need.
- Has fast accumulation rates. The data can be collected quickly and easily, since there's no need to predefine the data.
- Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.
But on the other hand, a few challenges remain, including:
- Data silos. This is not a new challenge, even with new tech. The rise of hybrid and multi-cloud infrastructures has companies working to get data closer to users and applications, but is also creating a huge number of data silos. These are difficult to manage!
- Unmanaged cost. Dispersing data across multiple environments makes it harder to find, and harder to understand its value. Without that knowledge, it's difficult to manage cost or data placement strategies.
- Requires expertise. Thanks to its non-formatted nature, data science expertise is needed to work with unstructured data. This can alienate business users who may not fully understand specialised data topics.
- Requires specialised tools. Specialised tools are required to manipulate unstructured data, which limits product choices for data managers.
- Correct data placement. The value of data may change over time. Sorting data into primary, secondary and tertiary tiers gives clear targets for budgeting, with attendant impacts on all sorts of things, including latency, throughput and scalability.
What's the future of data?
Like everything else: Artificial Intelligence, and Machine Learning. Beep Boop.
AI and ML are leading the way in the future of data by enhancing BI and innovation.
When asked if the data management challenge would spur a whole new sector of startups, Krishna answered:
"Definitely. Analysts are beginning to recognize data management software as a new category. Beyond the use cases above, consider all the new types of data analytics companies getting funded, such as Snowflake, Databricks, and Apache Spark.
So many companies are coming to light right now to solve data management and data analytics issues at scale."
And are the big cloud providers responding to problems and opportunities with unstructured data growth?
"They are all offering more services to store data at different performance and price points. Amazon Elastic File System (Amazon EFS) and Azure Files were born to address the need for file storage in the cloud.
The major CSPs are investing in partners across many areas of unstructured data management, including migration and analytics."