
Leetcode 609. Find Duplicate File in System: Solution Report

Posted by KenLee

Problem Information

Problem: 609. Find Duplicate File in System
Difficulty: Medium
Acceptance Rate: 52.7%
Problem Description:

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)

It means there are n files (f1.txt, f2.txt … fn.txt with content f1_content, f2_content … fn_content, respectively) in directory root/d1/d2/…/dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of group of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

directory_path/file_name.txt

Example 1:

Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:  
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Note:

  1. No order is required for the final output.
  2. You may assume the directory name, file name and file content only has letters and digits, and the length of file content is in the range of [1,50].
  3. The number of files given is in the range of [1,20000].
  4. You may assume no files or directories share the same name in the same directory.
  5. You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.

Follow-up beyond contest:

  1. Imagine you are given a real file system, how will you search files? DFS or BFS?
  2. If the file content is very large (GB level), how will you modify your solution?
  3. If you can only read the file by 1kb each time, how will you modify your solution?
  4. What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
  5. How to make sure the duplicated files you find are not false positive?

Problem Analysis

  The problem gives you a list of strings. Each string represents a directory, followed by the names of all files in that directory, separated by spaces. Each file name is followed by that file's content enclosed in parentheses.

  You need to find, among the given files, all files that share the same content. There may be multiple groups, and within each group every file has identical content. Output the full path (directory + file name) of every file in every group as a two-dimensional array.

Solution Approach

  This problem has two difficulties: string parsing, and designing a data structure for fast duplicate detection.

  Overall the problem is fairly easy. I use a hash table of type unordered_map<string, pair<string, int>> for fast duplicate detection. The key is the file content; the pair<string, int> value stores the path of the first file with that content and, once duplicates of that content have been found, the index of that content's group in the output array. Compared with merely counting how many files share each content, this saves the final pass over the map needed to assemble the result, which in expectation shaves off some constant-factor time.
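
  A minimal sketch of that bookkeeping is below (the helper name record and its signature are illustrative only; the complete solution is in the Source Code section at the end of this post):

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

// content -> (path of the first file with this content,
//             index of its group in `duplicate`, or -1 if no group has been created yet)
void record(const string& content, const string& fullPath,
            unordered_map<string, pair<string, int>>& fileHash,
            vector<vector<string>>& duplicate) {
    auto it = fileHash.find(content);
    if (it == fileHash.end()) {
        fileHash[content] = make_pair(fullPath, -1);       // first occurrence: no group yet
    } else if (it->second.second == -1) {
        // second occurrence: open a group containing the first path and this one
        it->second.second = static_cast<int>(duplicate.size());
        duplicate.push_back({it->second.first, fullPath});
    } else {
        duplicate[it->second.second].push_back(fullPath);  // later occurrences: append to the group
    }
}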

  What makes this problem genuinely interesting is the set of questions under "Follow-up beyond contest". In short: facing a real file system, where file contents can reach the GB level and the file can only be read about 1 KB at a time, how do you design an equivalent solution?

  My first reaction was: isn't this just Baidu Cloud (百度云盘)?? My idea: traverse with BFS, because BFS visits files in the same directory consecutively, which saves the cost of repeatedly switching directories. The basic approach stays the same, but since the files are GB-sized, the entire file content can no longer serve as the hash key. Instead I would use the file size plus a sampled MD5 as the duplicate check. The file size is simply the space the file occupies. Sampled MD5 means that, for a very large file, I take ten 1 KB samples at even intervals from the head of the file to its tail and compute the MD5 of that 10 KB of data; if two files have the same size and the same MD5, we treat them as the same file. With this design the bottleneck is not computation but file I/O, so the system only gets substantially faster if I/O gets faster. Note that this is not guaranteed to be correct, because MD5 values can collide; the extra file-size check lowers the collision probability to a fairly low level.
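
  A minimal sketch of that fingerprinting idea is below. It assumes OpenSSL's classic MD5_Init/MD5_Update/MD5_Final API (deprecated since OpenSSL 3.0 but still available) and standard <fstream> I/O; the function name make_fingerprint and the constants are illustrative, not part of any real system.

#include <openssl/md5.h>

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Cheap fingerprint for huge files: file size + MD5 over ten evenly spaced 1 KB samples.
// Equal fingerprints only mark *candidate* duplicates; a byte-by-byte comparison would
// still be needed to rule out false positives (follow-up question 5).
std::string make_fingerprint(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::uint64_t size = static_cast<std::uint64_t>(in.tellg());

    MD5_CTX ctx;
    MD5_Init(&ctx);
    const int kSamples = 10;
    const std::uint64_t kChunk = 1024;            // read 1 KB per sample, as in follow-up question 3
    std::vector<char> buf(kChunk);
    for (int i = 0; i < kSamples; ++i) {
        // offsets spread evenly from the head of the file to its tail
        std::uint64_t offset = (size > kChunk) ? (size - kChunk) * i / (kSamples - 1) : 0;
        in.seekg(static_cast<std::streamoff>(offset));
        in.read(buf.data(), static_cast<std::streamsize>(std::min(kChunk, size)));
        MD5_Update(&ctx, buf.data(), static_cast<size_t>(in.gcount()));
    }
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5_Final(digest, &ctx);

    // key = "<size>:<hex digest>"; files with equal keys fall into the same candidate bucket
    static const char* kHex = "0123456789abcdef";
    std::string key = std::to_string(size) + ":";
    for (unsigned char b : digest) { key += kHex[b >> 4]; key += kHex[b & 0xf]; }
    return key;
}

  One could also bucket files by size alone first and compute the sampled MD5 only for sizes that occur more than once, so the expensive I/O (the bottleneck noted above) is spent only on plausible duplicates.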

Result Analysis

  • Time complexity: $O(N)$, where $N$ is the number of files in the given input paths; every file must be visited at least once
  • Space complexity: $O(N)$; in the worst case every file path appears once in the duplicate array
  • Runtime: 79 ms
  • Beats: 99.14% of submissions

Source Code

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // content -> (path of the first file with this content,
        //             index of its group in `duplicate`, or -1 if no group yet)
        unordered_map<string, pair<string, int>> fileHash;
        vector<vector<string>> duplicate;
        for (const auto& p : paths) {
            // the directory is everything before the first space
            size_t nowSpace = p.find(' ');
            string dir = p.substr(0, nowSpace);
            while (nowSpace != string::npos) {
                // parse one "name.txt(content)" token
                size_t left = p.find('(', nowSpace + 1);
                size_t right = p.find(')', left + 1);
                string fileName = p.substr(nowSpace + 1, left - nowSpace - 1);
                string content = p.substr(left + 1, right - left - 1);
                // check the hash map
                auto it = fileHash.find(content);
                if (it == fileHash.end()) {
                    // content seen for the first time: remember its path, no group yet
                    fileHash[content] = make_pair(dir + "/" + fileName, -1);
                } else {
                    // duplicate content
                    int index = it->second.second;
                    if (index == -1) {
                        // second occurrence: open a new group seeded with the first file's path
                        index = static_cast<int>(duplicate.size());
                        it->second.second = index;
                        duplicate.push_back({it->second.first});
                    }
                    duplicate[index].push_back(dir + "/" + fileName);
                }
                // advance to the next file token (npos ends the loop)
                nowSpace = p.find(' ', right + 1);
            }
        }
        return duplicate;
    }
};