Understanding Duplicate Data

Voyager offers a default Saved Search for detecting duplicate data. Users can search for three types of duplicates, each identified by a different hash:

  1. An MD5 hash is computed from every byte of a file and acts as a digital fingerprint of that file. Two files with the same MD5 hash are considered exact duplicates. There is a very small possibility that two different files produce the same hash.

  2. A Schema Hash indicates that the datasets you are looking at have the same schema. The data itself and the row counts are not compared.

  3. A Content Hash combines the schema, the number of rows, and the first few hundred rows, and uses the combined data to build a hash. If multiple datasets produce the same hash value, the data has a good chance of being duplicate.
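
The three hash types above can be illustrated with a minimal Python sketch. This is not Voyager's implementation: the function names, the JSON encoding of the schema, the use of MD5 for the schema and content hashes, and the sample size of 200 rows are all assumptions made for illustration only.

```python
import hashlib
import json


def md5_of_bytes(data):
    """MD5 digest of raw file bytes; equal digests indicate byte-identical files."""
    return hashlib.md5(data).hexdigest()


def schema_hash(schema):
    """Hash only the (field name, field type) pairs; row data is ignored.

    The JSON encoding is an assumed canonical form, not Voyager's format.
    """
    canonical = json.dumps(schema, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def content_hash(schema, rows, sample_size=200):
    """Hash the schema, the row count, and the first sample_size rows together.

    sample_size=200 is an assumed stand-in for "the first few hundred rows".
    """
    payload = {
        "schema": schema,
        "row_count": len(rows),
        "sample": rows[:sample_size],
    }
    canonical = json.dumps(payload, separators=(",", ":"), default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Hypothetical datasets sharing one schema but differing in the second row.
schema = [("id", "int"), ("name", "str")]
rows_a = [[1, "north"], [2, "south"]]
rows_b = [[1, "north"], [2, "west"]]

# Same schema, so the schema hashes match even though the rows differ.
print(schema_hash(schema) == schema_hash([("id", "int"), ("name", "str")]))

# The differing row falls inside the sample, so the content hashes differ.
print(content_hash(schema, rows_a) == content_hash(schema, rows_b))

# With a one-row sample the difference is never read, so the hashes match.
print(content_hash(schema, rows_a, 1) == content_hash(schema, rows_b, 1))
```

The last comparison shows why a matching Content Hash means only a good chance of duplication: rows beyond the sampled portion do not influence the hash.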
