I'm beginning a proof of concept that will store a checksum of every file on every server on my LAN. I want to keep repeated data to a minimum and still be able to "reconstruct" and search for files based on file size, name, location and even the checksums (hashes) themselves. I've attached a CSV that should give you an indication of how I envisioned the DB being laid out. What I am curious about is whether it can be improved, and whether I can make any other optimizations to keep the DB as "light" as possible.
The reason it's designed the way it is should be obvious; however, the "path-to-hash" section still troubles me, because a lot of the same data will be repeated over and over in all of its fields.
I have hundreds of workstations and servers, so keeping repeated data and overall DB size to a minimum is an obvious design goal. As such, filenames are independent of the file hash (checksum), and ID numbers (in base36) are used instead. Exact copies of files will hash the same, even if the name is changed, so again hashes are independent of names. Roots, like partitions and drive letters, are also independent of the rest of the path. Paths are also reduced to base36 ID numbers. All of that seems well and good, but when cataloging file names, their hashes, their paths and which computer they are on, it seems like I could probably reduce this further. Have I already crossed the point of diminishing returns? Suggestions, questions and people telling me I'm crazy are welcome.
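To make the layout described above concrete, here is a rough sketch of one way it could look, assuming SQLite driven from Python. Every table, column and function name here (host, root, path, filename, hash, location, get_id, add_file, find_copies, to_base36) is made up for illustration and is not taken from the attached CSV. It also assumes the IDs are stored as plain integers and only rendered as base36 for display, which is a design choice of this sketch rather than something the layout requires.

```python
import hashlib
import sqlite3

# Hypothetical schema: one lookup table per repeated value, plus one
# mapping table ("location") that holds only integer IDs.
SCHEMA = """
CREATE TABLE host     (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE root     (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE path     (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE filename (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE hash     (id INTEGER PRIMARY KEY, digest TEXT NOT NULL UNIQUE,
                       size INTEGER NOT NULL);
CREATE TABLE location (
    host_id     INTEGER NOT NULL REFERENCES host(id),
    root_id     INTEGER NOT NULL REFERENCES root(id),
    path_id     INTEGER NOT NULL REFERENCES path(id),
    filename_id INTEGER NOT NULL REFERENCES filename(id),
    hash_id     INTEGER NOT NULL REFERENCES hash(id),
    PRIMARY KEY (host_id, root_id, path_id, filename_id)
);
"""

LOOKUP_TABLES = {"host", "root", "path", "filename"}


def get_id(conn, table, value):
    """Return the ID of value in a lookup table, inserting it on first sight."""
    if table not in LOOKUP_TABLES:  # table names cannot be bound as parameters
        raise ValueError(f"unknown lookup table: {table}")
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (value,))
    row = conn.execute(f"SELECT id FROM {table} WHERE name = ?", (value,))
    return row.fetchone()[0]


def get_hash_id(conn, digest, size):
    """Store each distinct digest once; size lives with the digest."""
    conn.execute("INSERT OR IGNORE INTO hash (digest, size) VALUES (?, ?)",
                 (digest, size))
    row = conn.execute("SELECT id FROM hash WHERE digest = ?", (digest,))
    return row.fetchone()[0]


def add_file(conn, host, root, path, filename, digest, size):
    """Record that one file with this content exists at one location."""
    conn.execute(
        "INSERT OR REPLACE INTO location VALUES (?, ?, ?, ?, ?)",
        (get_id(conn, "host", host), get_id(conn, "root", root),
         get_id(conn, "path", path), get_id(conn, "filename", filename),
         get_hash_id(conn, digest, size)),
    )


def find_copies(conn, digest):
    """List every (host, root, path, filename) holding the given content."""
    return conn.execute(
        """SELECT host.name, root.name, path.name, filename.name
           FROM location
           JOIN host     ON host.id = location.host_id
           JOIN root     ON root.id = location.root_id
           JOIN path     ON path.id = location.path_id
           JOIN filename ON filename.id = location.filename_id
           JOIN hash     ON hash.id = location.hash_id
           WHERE hash.digest = ?
           ORDER BY host.name""",
        (digest,),
    ).fetchall()


def to_base36(n):
    """Render a non-negative integer ID as base36 for display only."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out


# Example data is invented: two hosts holding the same content under different names.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
data = b"example contents"
digest = hashlib.sha256(data).hexdigest()
add_file(conn, "ws-001", "C:", "/Users/alice/docs", "report.txt", digest, len(data))
add_file(conn, "srv-002", "/dev/sda1", "/srv/share", "report-copy.txt", digest, len(data))
copies = find_copies(conn, digest)
```

In this sketch the renamed copy adds one filename row and one location row, but no new hash row, and the location table never stores anything wider than five integers per file. Whether a given database stores an integer ID more compactly than a base36 string depends on the engine, so that part is worth measuring rather than assuming.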