Linux: Find duplicate files easily!
One very common task (especially when you run out of disk space) is to find duplicate files so you can compress them, delete them, and so on. In this article I will show you how to find those duplicates and create a CSV file of them, which can help you decide what to do next. Let's start!
The find command
The command we will use is find. find lists files and directories of the file system matching criteria you give it. In our example, since we want to keep things simple, we will not use any special criteria except listing only files and not directories.
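If you did want to narrow the search, find accepts many other criteria. As a purely hypothetical example (not used in the rest of this article), the following lists only files larger than 10 MB under /home:

# hypothetical example: restrict the search to large files only
sudo find /home -type f -size +10M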
The following command lists all files under /:

sudo find / -type f

Still, the command is not complete; it just lists the files but does not indicate whether a file is a duplicate or not. Here comes the md5sum command.
The md5sum command
MD5 is a hashing algorithm. Without getting into any maths: if the MD5 hash of a file is the same as the MD5 hash of another file, then for all practical purposes the two files have the same contents.
root@vmi1900762:~# md5sum a.txt
6f5902ac237024bdd0c176cb93063dc4 a.txt
root@vmi1900762:~# md5sum b.txt
6f5902ac237024bdd0c176cb93063dc4 b.txt
How to combine find and md5sum
Now we need to combine find and md5sum. To do this we will use the -exec option of find; -exec runs a command on each found result and inserts the path of the found file as a parameter.
find / -type f -exec md5sum {} +

-exec md5sum {} +: calculates the MD5 hash of each found result
Running the command will produce results like this:
f9fcec8ed448bde3fcbe1b3aab50dddb /root/gemi_jsons/active/94520.json
459ad26a19b2319012e97dd69e24ea6f /root/gemi_jsons/active/0.json
0d43d55085ffe0ecbae6fe400f3c5b17 /root/gemi_jsons/active/100.json
4c2684dfec7e80f9167527d26dd215aa /root/gemi_jsons/active/1000.json
59d075e66f032085fee250356ffe2329 /root/gemi_jsons/active/10000.json
d43bf33c4659f4a57c63920e6669f522 /root/gemi_jsons/active/10020.json
fd2898b91c6252fbe9c61a59eed617bc /root/gemi_jsons/active/10040.json
70805844b241f8f9b102255d9bcfc69f /root/gemi_jsons/active/10060.json
0ad1f9531584e39b46197b7a59895fe8 /root/gemi_jsons/active/10080.json
b2f872877515455ba9af8e64eb2d450f /root/gemi_jsons/active/10100.json
Making the output into a proper CSV
Now we need to format the output a bit so that it conforms to CSV; awk can help us with this. All the awk part does is print the second column (the path) first, then a comma, then the first column (the hash).
find / -type f -exec md5sum {} + | awk '{print $2 "," $1}'
/root/gemi_jsons/active/94520.json,f9fcec8ed448bde3fcbe1b3aab50dddb
/root/gemi_jsons/active/0.json,459ad26a19b2319012e97dd69e24ea6f
/root/gemi_jsons/active/100.json,0d43d55085ffe0ecbae6fe400f3c5b17
/root/gemi_jsons/active/1000.json,4c2684dfec7e80f9167527d26dd215aa
/root/gemi_jsons/active/10000.json,59d075e66f032085fee250356ffe2329
/root/gemi_jsons/active/10020.json,d43bf33c4659f4a57c63920e6669f522
/root/gemi_jsons/active/10040.json,fd2898b91c6252fbe9c61a59eed617bc
Adding a file modification timestamp
It's nice so far, but how can we decide which file is the original and which is the duplicate? Assuming that the older files are the originals, we can add an extra column that indicates each file's modification time.
find / -type f -exec sh -c 'for file; do md5=$(md5sum "$file" | awk "{print \$1}"); mod_date=$(stat -c %y "$file" | cut -d" " -f1,2); echo "$file,$md5,\"$mod_date\""; done' sh {} +

Don't be afraid! It might look complicated, but it's not; I will explain everything.
- -exec sh -c '...' sh {} +: for each batch of found files, start a new shell and run the quoted commands, with the found file paths passed in as arguments
- for file;: loop over each of those file paths
- md5=$(md5sum "$file" | awk "{print \$1}");: take the hash part of the md5sum output and store it in the variable md5
- mod_date=$(stat -c %y "$file" | cut -d" " -f1,2);: get the modification date and time of the file
- echo "$file,$md5,\"$mod_date\"";: print the file path, hash and modification date as one CSV line
- done' sh {} +: done closes the loop; sh becomes the name ($0) of the inner shell, and {} + is replaced by the found file names, which the inner shell receives as its arguments
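If the one-liner is still hard to follow, here is a minimal sketch of the same logic written as a small script. It assumes GNU md5sum and stat; note that the -exec form above handles unusual file names more robustly than this read loop.

#!/bin/sh
# Sketch: print path,md5,modification date for every file under the given directory
# Assumes GNU coreutils (md5sum, stat -c %y)
target="${1:-/}"
find "$target" -type f | while IFS= read -r file; do
    md5=$(md5sum "$file" | awk '{print $1}')
    mod_date=$(stat -c %y "$file" | cut -d" " -f1,2)
    echo "$file,$md5,\"$mod_date\""
done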
Running this command will generate output like this
/root/gemi_jsons/active/12460.json,f66ba72d9490dc019e7c2e65bcd2f42e,"2024-05-25 22:33:32.000000000"
/root/gemi_jsons/active/12480.json,220e07790afbd5c4a42af85ff3170de6,"2024-05-25 22:33:44.000000000"
/root/gemi_jsons/active/12500.json,fef287625d23ce856b35cbe8e3571368,"2024-05-25 22:33:56.000000000"
/root/gemi_jsons/active/12520.json,16e28b18e301083dbe2368e4164ee422,"2024-05-25 22:34:07.000000000"
/root/gemi_jsons/active/12540.json,d9a253fc5a01a849032ced6e30227e0e,"2024-05-25 22:34:18.000000000"
/root/gemi_jsons/active/12560.json,cd9a0ad77060efe462702b0e85e0299c,"2024-05-25 22:34:29.000000000"
/root/gemi_jsons/active/12580.json,5be9ebfaec9d17d4ebe0e20d13489d9a,"2024-05-25 22:34:40.000000000"
/root/gemi_jsons/active/1260.json,6f07e80934d6fc275cae55c4c7e08cc3,"2024-05-25 20:25:12.000000000"
Filter out only duplicates, sorted by last modification date
Since the one-liner produces a big listing of files, it would be very difficult to spot duplicates by eye. We need a way to print only the duplicates, sorted by their modification date, which helps us decide which files are the originals. To do this, we will feed the CSV file into an sqlite3 database and run a query that prints only the rows whose MD5 hash appears more than once.
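The import below expects the one-liner's output in a file named output.csv, so first redirect it there, for example (adjust the starting directory and permissions as needed):

# save the path,md5,mod_date listing to output.csv for the sqlite3 import below
sudo find / -type f -exec sh -c 'for file; do md5=$(md5sum "$file" | awk "{print \$1}"); mod_date=$(stat -c %y "$file" | cut -d" " -f1,2); echo "$file,$md5,\"$mod_date\""; done' sh {} + > output.csv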
# Create the SQLite database and table
sqlite3 files.db "CREATE TABLE files (path TEXT, md5 TEXT, mod_date TEXT);"
# Import the CSV data into the SQLite table
sqlite3 files.db <<EOF
.mode csv
.import output.csv files
EOF
# Query the database for duplicates and sort the results
sqlite3 files.db <<EOF
.mode csv
.output duplicates_sorted.csv
SELECT path, md5, mod_date
FROM files
WHERE md5 IN (
SELECT md5
FROM files
GROUP BY md5
HAVING COUNT(md5) > 1
)
ORDER BY md5, mod_date;
EOF

- The first sqlite3 command creates the database and the files table
- The second imports the CSV file into the files table
- The third one does the actual query that finds the duplicates; the results can be found in duplicates_sorted.csv
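As an optional extra, a query like the following against the same files.db gives a quick overview of how many copies exist for each duplicated hash:

# optional extra: count how many copies exist per duplicated hash
sqlite3 files.db "SELECT md5, COUNT(*) AS copies FROM files GROUP BY md5 HAVING COUNT(*) > 1 ORDER BY copies DESC;"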
An example run generated this output
cat duplicates_sorted.csv
/home/user/documents/file3.txt,0cc175b9c0f1b6a831c399e269772661,"2024-08-03 16:00:00"
/home/user/documents/file4.txt,0cc175b9c0f1b6a831c399e269772661,"2024-08-03 17:00:00"
/home/user/documents/file1.txt,d41d8cd98f00b204e9800998ecf8427e,"2024-08-03 14:00:00"
/home/user/documents/file2.txt,d41d8cd98f00b204e9800998ecf8427e,"2024-08-03 15:00:00"
Things to note
- Delete any CSV files used as input or output by the one-liner or the SQL query before re-running, otherwise you might end up with wrong results; the same applies to the database file (see the example after this list)
- Run the one-liner with the appropriate permissions
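For example, before a fresh run you could clear the artifacts left over from the previous one (using the file names assumed in this article):

# clear artifacts from a previous run
rm -f output.csv files.db duplicates_sorted.csv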
Conclusion
In this article we saw how to find duplicate files and sort them by modification date. find, in combination with other tools, is powerful and can help you accomplish day-to-day tasks with ease!