What's the difference between EDM, IDM, and OCR classification techniques in Zscaler?

A walkthrough of various DLP techniques.

💡
This is part of an ongoing series on cybersecurity foundations. Check the cyber 101 article tag index from time to time for more content.

Zscaler has a number of techniques for performing DLP matching. In today's article I'd like to briefly step through them and explain how each approach is different.

  • Exact Data Match (EDM)
    • This technique is intended to examine/protect structured data. Typically this type of data is in a fixed format or schema (e.g. databases, spreadsheets, forms).
    • EDM takes a "fingerprint" (hash) of individual fields (columns) from an organization's specific, highly sensitive data records. The DLP engine then checks network traffic for an exact match of this "fingerprinted" data.
    • The goal is to identify when there is an exact occurrence of a sensitive record.
    • Example use case: ensuring that a specific set of active employee records, patient data, or a customer list isn't exfiltrated.
  • Indexed Document Matching (IDM)
    • This technique is intended to examine/protect unstructured data.
    • It works by creating a unique index (fingerprint) of an entire sensitive document (or set of documents). It then compares data in motion against the stored index, looking for a full or partial match.
    • For example, the DLP engine might trigger if 75% of a document's content matches the indexed version.
    • Because partial matches count, this approach is more flexible than EDM but also more prone to false positives.
  • Optical Character Recognition (OCR)
    • This technique extracts text from image files (e.g. screenshots, scanned documents, JPEGs, PNGs, or images embedded in other file types like Word).
    • It works by extracting text from the image and then passing that text through the standard DLP classification mechanisms (e.g. keyword matching, regular expressions, or EDM/IDM).
    • Here again, the possibility of false positives is higher because the accuracy of the extracted text depends on image quality.
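
To make the EDM/IDM distinction concrete, here is a minimal Python sketch of the two matching styles. It is an illustration only and not Zscaler's implementation: the function names, the use of SHA-256, the three-word "shingles", and the two-field minimum are all assumptions chosen for the example. The 75% threshold is the example figure from the list above.

```python
from __future__ import annotations

import hashlib
import re


def fingerprint(value: str) -> str:
    """Hash a normalized value. SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


# --- EDM style: exact matches against hashed fields of structured records ---

def build_edm_index(records: list[dict[str, str]]) -> list[set[str]]:
    """Store one set of field hashes per record, never the raw values."""
    return [{fingerprint(v) for v in record.values()} for record in records]


def edm_match(text: str, index: list[set[str]], min_fields: int = 2) -> bool:
    """True if at least `min_fields` fields of a single record appear in text."""
    tokens = (tok.strip(".,;:()\"'") for tok in text.split())
    seen = {fingerprint(tok) for tok in tokens if tok}
    return any(len(record & seen) >= min_fields for record in index)


# --- IDM style: partial overlap against an index of an unstructured document ---

def shingles(text: str, k: int = 3) -> set[str]:
    """Hash every run of k consecutive words (a simple document index)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {fingerprint(" ".join(words[i:i + k]))
            for i in range(len(words) - k + 1)}


def idm_overlap(indexed: set[str], text: str, k: int = 3) -> float:
    """Fraction of the indexed document's shingles found in the text."""
    if not indexed:
        return 0.0
    return len(indexed & shingles(text, k)) / len(indexed)


if __name__ == "__main__":
    # Hypothetical sample data, invented for this example.
    edm_index = build_edm_index([{"employee_id": "E10442", "ssn": "123-45-6789"}])
    print(edm_match("Please update E10442, SSN 123-45-6789.", edm_index))

    doc_index = shingles("the quarterly merger plan must remain confidential "
                         "until the board votes")
    overlap = idm_overlap(doc_index, "the quarterly merger plan must remain confidential")
    print(overlap >= 0.75)  # compare against the 75% example threshold
```

The EDM half answers a yes/no question about a specific record, while the IDM half returns a similarity score that has to be compared against a threshold. That threshold is where IDM gets its flexibility and also where its false positives come from.
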
💡
One other key point to mention: these techniques can be used in combination if desired. Zscaler can also apply them to both data in motion (network traffic) and data at rest (in cloud storage), depending on the product/module.

For more information on this topic, check out the following resources:

https://help.zscaler.com/unified/about-exact-data-match

https://help.zscaler.com/unified/about-indexed-document-match

https://help.zscaler.com/unified/configuring-dlp-advanced-settings