Cybersecurity Lakehouses Best Practices Part 4: Data Normalization Strategies

In this four-part blog series, “Lessons learned from building Cybersecurity Lakehouses,” we discuss a number of challenges organizations face with data engineering when building out a Lakehouse for cybersecurity data, and offer some solutions, tips, tricks, and best practices that we have used in the field to overcome them.

In part one, we began with uniform event timestamp extraction. In part two, we looked at how to spot and handle delays in log ingestion. And in part three, we tackled how to parse semi-structured, machine-generated data. In this final part of the series, we discuss one of the most important aspects of cyber analytics: data normalization using a common information model.

By the end of this blog, you will have a solid understanding of some of the issues faced when normalizing data into a Cybersecurity Lakehouse and the techniques we can use to overcome them.

What is a Common Information Model (CIM)?

A Common Information Model (CIM) is required for cybersecurity analytics engines to facilitate effective communication, interoperability, and understanding of security-related data and events across disparate systems, applications, and devices within an organization.

Organizations have different systems and applications that generate logs and events in various structures and formats. A CIM provides a standardized model that defines common data structures, attributes, and relationships. This standardization allows analytics engines to normalize and harmonize data collected from disparate sources, making it easier to process, analyze, and correlate information effectively.

Why use a Common Information Model?

Organizations use a variety of security tools, applications, and devices from different vendors, which generate logs specific to their respective technologies. Normalizing data into a known set of structures with consistent and understandable naming conventions is crucial to enable data correlation, threat detection, and incident response functions.

As a working example, suppose we wanted to know which systems and applications user ‘Joe’ has successfully authenticated against within the last 30 days.

To answer this question without a single model to interrogate, an analyst would need to craft queries that search tens or hundreds of logs. Each log file reports the username and the authentication outcome (success or failure) under different field names with different values. The application field name may also differ, as may the event time. This is not a workable solution. Enter the Common Information Model and the normalization process!

Figure: Common Information Model

The image above shows how disparate logs from many sources filter events into event-specific tables, using known column names, allowing a single simple query to answer the question once data has been normalized.
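To make the idea concrete, here is a minimal sketch in plain Python of two differently shaped sources being normalized into one authentication table, followed by the single query for ‘Joe’. The source names, raw field names (`usr`, `login_name`, `status_code`, and so on), and target column names are all assumptions invented for illustration; they are not taken from any particular CIM or product.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw events from two sources; every field name here is invented.
vpn_logs = [
    {"usr": "joe", "result": "OK", "ts": "2024-05-01T10:00:00+00:00"},
    {"usr": "joe", "result": "DENIED", "ts": "2024-05-02T10:00:00+00:00"},
]
web_logs = [
    {"login_name": "Joe", "status_code": 200, "epoch": 1714644000},
    {"login_name": "ann", "status_code": 200, "epoch": 1714644000},
]


def normalize_vpn(event):
    """Map a hypothetical VPN event onto the shared column names."""
    return {
        "event_time": datetime.fromisoformat(event["ts"]),
        "user": event["usr"].lower(),
        "app": "vpn",
        "action": "Success" if event["result"] == "OK" else "Failure",
    }


def normalize_web(event):
    """Map a hypothetical web application event onto the shared column names."""
    return {
        "event_time": datetime.fromtimestamp(event["epoch"], tz=timezone.utc),
        "user": event["login_name"].lower(),
        "app": "webapp",
        "action": "Success" if event["status_code"] == 200 else "Failure",
    }


# One event-specific table with known column names.
authentication = [normalize_vpn(e) for e in vpn_logs] + [
    normalize_web(e) for e in web_logs
]


def successful_apps(table, user, now, days=30):
    """Return the apps a user authenticated against successfully in the window."""
    cutoff = now - timedelta(days=days)
    return sorted(
        {
            row["app"]
            for row in table
            if row["user"] == user
            and row["action"] == "Success"
            and row["event_time"] >= cutoff
        }
    )


now = datetime(2024, 5, 15, tzinfo=timezone.utc)
apps_for_joe = successful_apps(authentication, "joe", now)
```

Once both sources share `user`, `app`, `action`, and `event_time`, the question needs only one filter, regardless of how many sources feed the table.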

Things to consider when normalizing data

There are a number of conditions that should be accounted for when normalizing disparate data sources into a single CIM-compliant table:

Differing Column Types: Unifying disparate data sources and specific events into a CIM (event-driven) table may surface clashing data types.

Derived Fields: The normalization process often requires new fields to be derived from one or more source columns.

Missing Fields: Fields may unexpectedly not exist or may contain null values. Ensure the CIM caters to missing or null values.

Literal Fields: Data to support a target CIM field may need to be created, or the field may need to be set to a literal value such as “Success” or “Failure” to ensure a unified search capability. For example (where action="Success")

Schema Evolution: Both data and the CIM may evolve over time. Ensure you have a mechanism to provide backward compatibility, especially within the CIM tables, to cater for changes in data.

Enrichment: CIM data is often enriched with other context such as threat intelligence and asset information. Consider how to add this information to provide a comprehensive view of the events collected.
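Several of the conditions above can be sketched in a single normalization function. The following is a minimal illustration in plain Python, not a definitive implementation: the raw field names (`domain`, `account`, `port`, `outcome`, `src_ip`), the target column names, the set of values treated as success, and the asset inventory are all hypothetical. Schema evolution is not covered here.

```python
# Hypothetical asset inventory used for enrichment.
ASSETS = {"10.0.0.5": {"hostname": "hr-laptop-01", "owner": "hr"}}

# Hypothetical source values that should map to the literal "Success".
SUCCESS_VALUES = {"0", "ok", "accepted"}


def to_int(value):
    """Differing column types: coerce "443", 443, or 443.0 to int, else None."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def normalize(raw):
    """Normalize one raw event (a dict) into a hypothetical CIM row."""
    # Derived field: combine separate domain and account columns into one user.
    account = raw.get("account")
    domain = raw.get("domain")
    if account is None:
        user = None  # Missing field: keep an explicit null rather than failing.
    elif domain:
        user = f"{domain}\\{account}".lower()
    else:
        user = account.lower()

    # Literal field: map source-specific outcomes onto "Success" / "Failure".
    outcome = raw.get("outcome")
    if outcome is None:
        action = None
    else:
        action = "Success" if str(outcome).lower() in SUCCESS_VALUES else "Failure"

    # Enrichment: look up asset context by source IP, tolerating unknown hosts.
    asset = ASSETS.get(raw.get("src_ip"), {})

    return {
        "user": user,
        "dest_port": to_int(raw.get("port")),
        "action": action,
        "src_ip": raw.get("src_ip"),
        "src_hostname": asset.get("hostname"),
    }


full_row = normalize(
    {"domain": "CORP", "account": "Joe", "port": "443", "outcome": "OK", "src_ip": "10.0.0.5"}
)
sparse_row = normalize({"port": "n/a"})
```

Every output row has the same columns whether or not the source supplied them, so downstream queries never need to branch on the shape of the original log.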

Which model should I choose?

There are many common information models to choose from when building out a Cybersecurity Lakehouse, from open source models to vendor-specific, publicly available models. The decision on what to use depends primarily on your individual use case.

Some considerations are:

  • Are you augmenting Delta Lake with another SIEM or SOAR product? Does it make sense to adopt that product’s model for easier integration?
  • Are you only building a Cybersecurity Lakehouse for a specific use case? For instance, do you only want to analyze Microsoft endpoint data? In that case, does it make sense to align with the Microsoft ASIM model?
  • Are you building out a Lakehouse as your organization’s main cyber analytics platform? Does it make sense to align with an open source model like OCSF or OSSEM or build your own?

Ultimately, the choice is organization-specific, depending on your needs. Another consideration is the completeness of the model you choose. Models are generic and will likely require some adaptation to fit your needs; however, the model should largely support your data and requirements before you begin adopting it, as model changes after the fact are time-consuming.

Tips and best practices

Regardless of the model you choose, there are a few tips to ensure gaps do not exist in your overall security posture.

  • Most queries rely heavily on entities. Source host, destination host, source user, and application used are likely the most searched-for columns in any table. Ensure these are well-mapped and normalized.
  • Models often provide guidance on field coverage (mandatory, recommended, optional). Ensure at a minimum that mandatory fields are mapped and have data integrity checks applied for a consistent search environment.
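As a rough illustration of the second tip, the sketch below checks normalized rows against a field coverage map. The coverage levels assigned to each column, the column names, and the allowed `action` values are assumptions made for this example; a real model defines its own.

```python
# Hypothetical field coverage for an authentication table.
FIELD_COVERAGE = {
    "event_time": "mandatory",
    "user": "mandatory",
    "action": "mandatory",
    "app": "recommended",
    "src_host": "optional",
}

# Hypothetical set of literal values allowed in the action column.
ALLOWED_ACTIONS = {"Success", "Failure"}


def integrity_errors(row):
    """Return a list of problems found in one normalized row (empty if clean)."""
    errors = []

    # Every mandatory field must be present and non-empty.
    for field, level in FIELD_COVERAGE.items():
        if level == "mandatory" and row.get(field) in (None, ""):
            errors.append(f"missing mandatory field: {field}")

    # A populated action must use one of the agreed literal values.
    action = row.get("action")
    if action not in (None, "") and action not in ALLOWED_ACTIONS:
        errors.append(f"unexpected action value: {action}")

    return errors


good_row = {"event_time": "2024-05-01T10:00:00Z", "user": "joe", "action": "Success"}
bad_row = {"user": "joe", "action": "ok"}

good_errors = integrity_errors(good_row)
bad_errors = integrity_errors(bad_row)
```

Rows that fail such checks could be routed to a quarantine table for review instead of silently landing in the CIM table, which keeps searches over mandatory columns consistent.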


Common Information Model-based tables are a cornerstone of an effective cyber analytics platform. The model you adopt when building out a Cybersecurity Lakehouse is organization-specific, but any model should largely be suitable for your organization’s needs before you begin. Databricks has previously solved this problem for customers using the principles outlined in this blog.

Get in Touch

If you want to learn more about how Databricks cyber solutions can empower your organization to identify and mitigate cyber threats, contact [email protected] and check out our Lakehouse for Cybersecurity Applications webpage.
