What's the difference between EDM, IDM, and OCR classification techniques in Zscaler?
A walkthrough of various DLP techniques.
💡
This is part of an ongoing series in cybersecurity foundations. Check the cyber 101 article tag index from time to time for more content.
Zscaler has a number of techniques for performing DLP matching. In today's article I'd like to briefly step through them and explain how each approach is different.
- Exact Data Match (EDM)
  - This technique is intended to examine and protect structured data, which typically lives in a fixed format or schema (e.g. databases, spreadsheets, forms).
  - EDM takes a "fingerprint" (hash) of individual fields (columns) from an organization's specific, highly sensitive data records. The DLP engine then checks network traffic for an exact match of this fingerprinted data.
  - The goal is to identify when there is an exact occurrence of a sensitive record.
  - Example use case: ensuring that a specific set of active employee records, patient data, or a customer list isn't exfiltrated.
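
To make the fingerprint-and-match idea concrete, here is a minimal Python sketch. It is not Zscaler's implementation: the function names, the SHA-256 hash, the tokenizer, and the `min_fields` threshold are all assumptions chosen for illustration, and the sample record is made up.

```python
import hashlib
import re

def fingerprint(value: str) -> str:
    """Normalize a field value and return its SHA-256 hex digest."""
    normalized = value.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def build_index(records):
    """Map each field fingerprint to the ids of the records that contain it.

    Only hashes are stored, so the index holds no plaintext values.
    """
    index = {}
    for record_id, record in enumerate(records):
        for value in record.values():
            index.setdefault(fingerprint(value), set()).add(record_id)
    return index

def find_matches(text, index, min_fields=2):
    """Return ids of records with at least `min_fields` distinct fields in `text`."""
    # Hypothetical tokenizer: keeps characters common in ids, SSNs, and emails.
    tokens = {t.strip(".,;:") for t in re.findall(r"[\w@.\-]+", text)}
    hits = {}
    for token in tokens:
        fp = fingerprint(token)
        for record_id in index.get(fp, ()):
            hits.setdefault(record_id, set()).add(fp)
    return sorted(rid for rid, fields in hits.items() if len(fields) >= min_fields)

# Made-up sample record, for illustration only.
SAMPLE_RECORDS = [
    {"employee_id": "E10042", "ssn": "123-45-6789", "email": "jane.doe@example.com"},
]

sample_index = build_index(SAMPLE_RECORDS)
print(find_matches("Please update E10042, SSN 123-45-6789, thanks.", sample_index))  # prints [0]
```

Requiring more than one field from the same record before triggering is one way to avoid alerting on a single common value; a real engine would make this configurable.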
 
- Indexed Document Match (IDM)
  - This technique is intended to examine and protect unstructured data.
  - It works by creating a unique index (fingerprint) of an entire sensitive document (or set of documents). It then compares the content of data in motion to the stored index, looking for a full or partial match.
  - So for example, the DLP engine might trigger if 75% of a document's content matches an indexed version.
  - Because partial matches are allowed, this approach is more flexible, but it can also produce false positives.
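
The partial-match idea can be sketched with hashed word shingles (overlapping runs of words). This is only an illustration under stated assumptions: the shingle size, the SHA-256 hash, and the function names are hypothetical, and the 0.75 default simply mirrors the 75% example above rather than any documented Zscaler setting.

```python
import hashlib

def shingles(text, size=3):
    """Return the set of hashed `size`-word shingles found in `text`."""
    words = text.lower().split()
    if not words:
        return set()
    # Short texts yield a single shingle made of all their words.
    count = max(len(words) - size + 1, 1)
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(count)
    }

def match_ratio(indexed, candidate):
    """Fraction of the indexed document's shingles that appear in the candidate."""
    if not indexed:
        return 0.0
    return len(indexed & candidate) / len(indexed)

def is_match(indexed, candidate, threshold=0.75):
    """Trigger when at least `threshold` of the indexed document is present."""
    return match_ratio(indexed, candidate) >= threshold

# Made-up sensitive document, for illustration only.
SENSITIVE = ("the quarterly revenue forecast for the new product line "
             "is strictly confidential")
sensitive_index = shingles(SENSITIVE)

print(is_match(sensitive_index, shingles("fyi " + SENSITIVE)))             # prints True
print(is_match(sensitive_index, shingles("the quarterly revenue forecast for the")))  # prints False
```

Lowering the threshold catches smaller excerpts at the cost of more false positives, which is the trade-off described above.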
 
- Optical Character Recognition (OCR)
  - This technique extracts text from image files (e.g. screenshots, scanned documents, JPEGs, PNGs, or images embedded in other file types like Word).
  - It works by extracting text from visual content and then passing that text through standard DLP classification mechanisms (keyword matching, regular expressions, or EDM/IDM).
  - Here again, false positives are more likely because the accuracy of the extracted text depends on image quality.
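
The extract-then-classify flow can be sketched as a two-stage pipeline. The version below is an assumption-laden illustration, not how Zscaler does it: the keyword list, the SSN regex, and the function names are hypothetical, and the OCR step assumes the open-source `pytesseract` and Pillow packages (plus the Tesseract binary) are installed. The OCR function is injectable so the classification stage can be exercised without any image.

```python
import re
from typing import Callable, List

# Hypothetical classifiers standing in for a real DLP dictionary or engine.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
KEYWORDS = ("confidential", "internal only")

def ocr_with_tesseract(image_path: str) -> str:
    """Extract text from an image (assumes pytesseract and Pillow are installed)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path))

def classify_text(text: str) -> List[str]:
    """Run the extracted text through simple regex and keyword checks."""
    findings = []
    if SSN_PATTERN.search(text):
        findings.append("ssn_pattern")
    lowered = text.lower()
    for keyword in KEYWORDS:
        if keyword in lowered:
            findings.append("keyword:" + keyword)
    return findings

def scan_image(image_path: str,
               ocr: Callable[[str], str] = ocr_with_tesseract) -> List[str]:
    """Stage 1: OCR the image. Stage 2: classify whatever text came out."""
    return classify_text(ocr(image_path))

# Usage with a stub OCR function, so no image or OCR engine is needed.
stub = lambda path: "CONFIDENTIAL\nSSN: 123-45-6789"
print(scan_image("screenshot.png", ocr=stub))  # prints ['ssn_pattern', 'keyword:confidential']
```

Because the classifiers only see what the OCR stage produced, a single misread character (for example `l23-45-6789`) changes the result, which is why image quality matters so much for this technique.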
 
 
💡
One other key point to mention: these techniques can be used in combination if desired. Zscaler can also apply them to both data in motion (network traffic) and data at rest (in cloud storage), depending on the product/module.
For more information on this topic, check out the following resources:
https://help.zscaler.com/unified/about-exact-data-match
https://help.zscaler.com/unified/about-indexed-document-match
https://help.zscaler.com/unified/configuring-dlp-advanced-settings