The elephant in the data centre: what is dark data and why does it matter?

Iron Mountain

By Garry Valenzisi, Vice President & General Manager, Asia Pacific, Global Industries, Iron Mountain
Wednesday, 06 December, 2023

‘Dark data’, a term coined by Gartner, is defined as “the information assets organisations collect, process and store during regular business activities, but generally fail to use for other purposes”. Like dark matter, dark data is virtually invisible, yet it takes up huge amounts of space in data centres. That invisibility doesn’t mean we can ignore it. It’s worth taking a moment to think about the nature of dark data, its impact and what we might be able to do to improve things.

Personal footprint

Dark data is easiest to grasp and deal with at a personal level. For most of us it consists of unused photos and videos. In the old days, film was precious and development expensive, but now we can take 20 shots to get the one we want, and we can edit easily, creating more backup files in the process. In 2020, Google said it stored 4 trillion photos, with 28 billion new photos and videos uploaded each week. Google Photos is just one photo service, and those upload rates have no doubt grown in the last few years.

This personal dark data also creates a privacy issue. However secure a cloud service is, there is always the possibility that data such as ID photos, personal chat screengrabs and private files can be used by cybercriminals. The answer? Think before you shoot, tidy up caches and archives regularly, and be particularly careful not to leave sensitive files lying around.

Hidden losses

For businesses, the challenge is on a larger scale and affects the bottom line. Dark data consists of things like near-identical images or documents, IoT data sets, log files, and applications. This data takes up server space, and powering these servers takes up energy and equipment, which not only costs money, but can also mean significant emissions if low-carbon or renewable power is not being used. Dark data is also unstructured and unexplored, which brings with it privacy and compliance risks.
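As a small illustration of how redundant copies can be surfaced, the sketch below groups files by a hash of their content. It is a minimal example under stated assumptions: the function name `group_exact_duplicates` and the sample files are hypothetical, and hashing only catches byte-for-byte identical copies. Near-identical images or documents would need a similarity technique, which is not shown here.

```python
import hashlib
from collections import defaultdict

def group_exact_duplicates(files):
    """Group file names whose content is byte-for-byte identical.

    files: dict mapping a file name to its content as bytes (illustrative input).
    Returns a list of name groups; each group shares one SHA-256 digest.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Keep only digests seen more than once, i.e. actual duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Hypothetical sample data, not taken from the article.
sample = {
    "report_v1.pdf": b"quarterly figures",
    "report_copy.pdf": b"quarterly figures",
    "photo.jpg": b"\xff\xd8 image bytes",
}
duplicates = group_exact_duplicates(sample)
print(duplicates)
```

In practice the same idea would be applied to file contents read from storage in chunks, so that large files are never loaded into memory whole.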

No organisation is unaffected. Globally, estimated levels of commercial dark data vary by sector from 40 to 90%, so it’s extremely likely that the majority of your company’s data is dark. According to the World Economic Forum, companies generate 1.3 trillion gigabytes of dark data every day. Storing that data for a year using non-renewables generates as much CO₂ as three million flights from London to New York. So, if we’re interested in decarbonising — and we should be — we should tackle this issue.

Technology lag

For many businesses the level of dark data reflects a lack of data structuring processes. An organisation’s ability to collect data can outpace its ability to analyse it. In some cases a business may not even be aware the data is being collected.

Companies retain dark data for a multitude of reasons. Often it is stored for regulatory compliance and record keeping, but equally often the complexity of compliance, privacy and data discovery is the reason that these data lakes are allowed to build up. Some organisations believe that dark data could be useful to them in the future once they have acquired better analytic and business intelligence technology to process the information.

New tools and standards

There is good news here. The scale of the task may appear daunting for CIOs and CDOs, but artificial intelligence (AI) and machine learning (ML) have now advanced to the point that they can help automate the data structuring process. Only a tiny percentage of dark data needs to be reviewed at the outset by humans to kickstart the process. This can then be followed up with a reinforcement learning model to assess the relevance of remaining data and prioritise it. From then on, a virtuous cycle of tagging and analysis makes the process easier to manage.
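The review-then-prioritise loop described above can be sketched in a few lines. This is a simplified illustration, not the reinforcement learning approach itself: it swaps in a plain word-weight score learned from a small human-labelled seed set, and every name in it (`train_word_weights`, `prioritise`, the sample descriptions) is hypothetical.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_word_weights(labelled):
    """Learn a weight in [-1, 1] per word from human-reviewed examples.

    labelled: list of (description, is_relevant) pairs from the seed review.
    """
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labelled:
        (relevant if is_relevant else irrelevant).update(set(tokenize(text)))
    weights = {}
    for word in set(relevant) | set(irrelevant):
        r, i = relevant[word], irrelevant[word]
        weights[word] = (r - i) / (r + i)
    return weights

def score(text, weights):
    # Unknown words contribute nothing to the score.
    return sum(weights.get(word, 0.0) for word in set(tokenize(text)))

def prioritise(unlabelled, weights):
    """Order unreviewed items from most to least likely relevant."""
    return sorted(unlabelled, key=lambda t: score(t, weights), reverse=True)

# Hypothetical seed set: the small sample reviewed by humans.
seed = [
    ("customer contract renewal 2021", True),
    ("customer invoice archive", True),
    ("debug log temp cache", False),
    ("temp thumbnail cache", False),
]
weights = train_word_weights(seed)
queue = prioritise(
    ["temp debug dump", "customer contract scan", "misc notes"],
    weights,
)
print(queue)
```

Each item a reviewer confirms or rejects can be added back to the seed set and the weights retrained, which is one simple way to picture the cycle of tagging and analysis.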

Measurement would also help to benchmark progress. Considering the scale of the problem, there may be a case for setting standards for effective data use, such as a Data Usage Effectiveness (DUE) metric to sit alongside CUE (carbon), WUE (water) and PUE (power). This, or some similar metric, would be well worth working towards, and could also have value as a digital performance indicator. However, it may be too early to measure, while so much dark data remains invisible.
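No formula for DUE is given here, so the sketch below is one possible reading rather than a proposed standard: it assumes DUE is the share of stored gigabytes that were accessed within a chosen window. The `Dataset` fields, the 365-day window and the sample inventory are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    size_gb: float
    days_since_last_access: int

def data_usage_effectiveness(datasets, window_days=365):
    """Hypothetical DUE: fraction of stored gigabytes accessed within the window.

    Returns a value between 0.0 (nothing used) and 1.0 (everything used).
    """
    total = sum(d.size_gb for d in datasets)
    if total == 0:
        return 0.0
    used = sum(
        d.size_gb for d in datasets if d.days_since_last_access <= window_days
    )
    return used / total

# Hypothetical inventory, not real figures.
inventory = [
    Dataset("sales_reports", 200.0, 12),
    Dataset("iot_sensor_logs", 600.0, 900),
    Dataset("old_app_backups", 200.0, 1500),
]
due = data_usage_effectiveness(inventory)
print(f"DUE: {due:.0%}")
```

Under this definition a higher figure is better, the opposite of PUE, where values closer to 1.0 indicate less overhead. A measure like this can only cover data that appears in the inventory, which is why dark data that remains invisible makes early measurement difficult.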

Let’s talk

Whatever dark data means to you or your business, it is an ‘elephant in the room’ for data centres, and the more we talk about it the likelier we are to come up with incremental improvements. For individual data users there are things we can do to reduce single-use data. For organisations, it’s a bit more complicated, but approaches and tools are emerging. These should be discussed and shared.

As with energy efficiency, identifying and eliminating waste at source is the most obvious opportunity. According to IBM, 60% of data loses its value within milliseconds of being acquired, and any scheme to use data more effectively must first address the issue of collecting useless data. A robust approach to data gathering is the key here: assessing how data can be used, or whether it is usable at all.

The next step is structuring the data we keep. Structured data is not only more valuable, but easier to track and, if necessary, delete. By making data more visible, we should be able to reduce the environmental and financial burden of storage while using our valuable data to empower our organisations and serve our customers better.

Image credit: iStock.com/carloscastilla