Linux: Find duplicate files easily!

One very common task (especially when you run out of disk space) is to find duplicate files and compress them, delete them, etc. In this article I will show you how you can find those duplicates and create a CSV file listing them, which can help you decide what to do next. Let's start!

The find command

The command we will use is find, which lists files and directories of the filesystem based on some criteria. In our example, since we want to keep things simple, we will not use any special criteria except for listing only files and not directories. This command lists all files under /:

sudo find / -type f
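If scanning the whole filesystem is too slow for a first try, the same command works on any directory tree; the path below is just an example:

```shell
# Restrict the search to a single directory tree, e.g. your home
# directory, instead of the whole filesystem:
find "$HOME" -type f
```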

Still, the command is not complete: it just lists the files but does not indicate whether a file is a duplicate or not! Here comes the md5sum command.

The md5sum command

MD5 is a hashing algorithm; without getting into any maths, if the MD5 hash of one file is the same as that of another file, then the files have identical contents (barring the unlikely case of a hash collision):

root@vmi1900762:~# md5sum a.txt 
6f5902ac237024bdd0c176cb93063dc4  a.txt 
root@vmi1900762:~# md5sum b.txt 
6f5902ac237024bdd0c176cb93063dc4  b.txt

How to combine find and md5sum

Now we need to combine find and md5sum. To do this we will use the -exec option of find, which runs a command upon each found result, inserting the path of the found file as a parameter:

find / -type f -exec md5sum {} +
  • -exec md5sum {} + : calculates the MD5 hash of each found file; the + passes many files to a single md5sum invocation instead of spawning one process per file

Running the command will produce results like this:

f9fcec8ed448bde3fcbe1b3aab50dddb  /root/gemi_jsons/active/94520.json 
459ad26a19b2319012e97dd69e24ea6f  /root/gemi_jsons/active/0.json 
0d43d55085ffe0ecbae6fe400f3c5b17  /root/gemi_jsons/active/100.json 
4c2684dfec7e80f9167527d26dd215aa  /root/gemi_jsons/active/1000.json 
59d075e66f032085fee250356ffe2329  /root/gemi_jsons/active/10000.json 
d43bf33c4659f4a57c63920e6669f522  /root/gemi_jsons/active/10020.json 
fd2898b91c6252fbe9c61a59eed617bc  /root/gemi_jsons/active/10040.json 
70805844b241f8f9b102255d9bcfc69f  /root/gemi_jsons/active/10060.json 
0ad1f9531584e39b46197b7a59895fe8  /root/gemi_jsons/active/10080.json 
b2f872877515455ba9af8e64eb2d450f  /root/gemi_jsons/active/10100.json

Making the output a proper CSV

Now we need to format the output a bit so that it conforms to the CSV format; awk can help us with this. All awk does here is print the second column first and insert a comma between the two columns:

find / -type f -exec md5sum {} + | awk '{print $2 "," $1}' 
/root/gemi_jsons/active/94520.json,f9fcec8ed448bde3fcbe1b3aab50dddb 
/root/gemi_jsons/active/0.json,459ad26a19b2319012e97dd69e24ea6f 
/root/gemi_jsons/active/100.json,0d43d55085ffe0ecbae6fe400f3c5b17 
/root/gemi_jsons/active/1000.json,4c2684dfec7e80f9167527d26dd215aa 
/root/gemi_jsons/active/10000.json,59d075e66f032085fee250356ffe2329 
/root/gemi_jsons/active/10020.json,d43bf33c4659f4a57c63920e6669f522 
/root/gemi_jsons/active/10040.json,fd2898b91c6252fbe9c61a59eed617bc

Adding a file modification timestamp

It's nice so far, but how can we decide which is the original file and which is the duplicate? Assuming that the older files are the originals, we can add an extra column indicating the file modification time:

find / -type f -exec sh -c 'for file; do md5=$(md5sum "$file" | awk "{print \$1}"); mod_date=$(stat -c %y "$file"| cut -d" " -f1,2); echo "$file,$md5,\"$mod_date\""; done' sh {} +

Don't be afraid! It might look complicated, but it's not, and I will explain everything!

  • -exec sh -c '...' sh {} + : for each batch of found files, start a new shell and run some commands with the found files as arguments
  • for file; do : loop over the files passed to the shell and execute the following commands on each one
  • md5=$(md5sum "$file" | awk "{print \$1}") : get the hash part of the md5sum output and store it in the variable md5
  • mod_date=$(stat -c %y "$file" | cut -d" " -f1,2) : get the modification time of the file, keeping only the date and time
  • echo "$file,$md5,\"$mod_date\"" : print the file path, hash, and modification date, separated by commas
  • done' sh {} + : done closes the loop, sh becomes the shell's name ($0), and {} is replaced by the found file paths, which the loop receives as its arguments

Running this command will generate output like this

/root/gemi_jsons/active/12460.json,f66ba72d9490dc019e7c2e65bcd2f42e,"2024-05-25 22:33:32.000000000" 
/root/gemi_jsons/active/12480.json,220e07790afbd5c4a42af85ff3170de6,"2024-05-25 22:33:44.000000000" 
/root/gemi_jsons/active/12500.json,fef287625d23ce856b35cbe8e3571368,"2024-05-25 22:33:56.000000000" 
/root/gemi_jsons/active/12520.json,16e28b18e301083dbe2368e4164ee422,"2024-05-25 22:34:07.000000000" 
/root/gemi_jsons/active/12540.json,d9a253fc5a01a849032ced6e30227e0e,"2024-05-25 22:34:18.000000000" 
/root/gemi_jsons/active/12560.json,cd9a0ad77060efe462702b0e85e0299c,"2024-05-25 22:34:29.000000000" 
/root/gemi_jsons/active/12580.json,5be9ebfaec9d17d4ebe0e20d13489d9a,"2024-05-25 22:34:40.000000000" 
/root/gemi_jsons/active/1260.json,6f07e80934d6fc275cae55c4c7e08cc3,"2024-05-25 20:25:12.000000000"
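Before moving to the next step, redirect the one-liner's output to a file. I'm using output.csv here because that is the filename the sqlite3 import expects, and discarding stderr so that "Permission denied" messages don't pollute the CSV:

```shell
# Same one-liner as above, but saving the result to output.csv.
# stderr is discarded so error messages don't end up in the file.
sudo find / -type f -exec sh -c 'for file; do
  md5=$(md5sum "$file" | awk "{print \$1}")
  mod_date=$(stat -c %y "$file" | cut -d" " -f1,2)
  echo "$file,$md5,\"$mod_date\""
done' sh {} + > output.csv 2>/dev/null
```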

Filter out only duplicates, sorted by last modification time

Since the one-liner creates a big listing of files, it is very difficult to identify duplicates in it, so we need a way to print only duplicates, sorted by their modification date, which can help us decide which are the original files. To do this we will feed the CSV file into an sqlite3 database and perform a query that prints only duplicates, grouped by their MD5 hash.

# Create the SQLite database and table 
sqlite3 files.db "CREATE TABLE files (path TEXT, md5 TEXT, mod_date TEXT);" 
 
# Import the CSV data into the SQLite table 
sqlite3 files.db <<EOF 
.mode csv 
.import output.csv files 
EOF 
 
# Query the database for duplicates and sort the results 
sqlite3 files.db <<EOF 
.mode csv 
.output duplicates_sorted.csv 
SELECT path, md5, mod_date 
FROM files 
WHERE md5 IN ( 
    SELECT md5 
    FROM files 
    GROUP BY md5 
    HAVING COUNT(md5) > 1 
) 
ORDER BY md5, mod_date; 
EOF
  • The first SQL command creates the table
  • The second imports the CSV file into the files table
  • The third one does the actual query that finds the duplicates; the results can be found in duplicates_sorted.csv

An example run generated this output

cat duplicates_sorted.csv 
/home/user/documents/file3.txt,0cc175b9c0f1b6a831c399e269772661,"2024-08-03 16:00:00" 
/home/user/documents/file4.txt,0cc175b9c0f1b6a831c399e269772661,"2024-08-03 17:00:00" 
/home/user/documents/file1.txt,d41d8cd98f00b204e9800998ecf8427e,"2024-08-03 14:00:00" 
/home/user/documents/file2.txt,d41d8cd98f00b204e9800998ecf8427e,"2024-08-03 15:00:00"
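As an aside, if you don't want to install sqlite3, a plain awk pass over the CSV can also isolate the duplicates. This is a sketch assuming the three-column output.csv produced earlier, with the MD5 hash in the second column:

```shell
# Group rows by the md5 column (field 2) and print only the groups
# that contain more than one row; the final sort orders the output
# by hash and then by modification date, like the SQL query does.
awk -F, '{count[$2]++; rows[$2] = rows[$2] $0 "\n"}
         END {for (h in count) if (count[h] > 1) printf "%s", rows[h]}' output.csv \
  | sort -t, -k2,2 -k3,3
```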

Things to note

  • Delete all CSV files used as input or output by the one-liner or the CSV query before each run, otherwise you might end up with wrong results; the same applies to the database file
  • Run the one-liner with the appropriate permissions
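For example, assuming the filenames used throughout this article, a cleanup before each run looks like this:

```shell
# Remove leftovers from a previous run so stale rows don't end up
# in the new results (-f ignores files that don't exist yet).
rm -f output.csv duplicates_sorted.csv files.db
```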

Conclusion

In this article we saw how to find duplicate files and sort them by modification date. find, in combination with other tools, is powerful and can help you accomplish day-to-day tasks with ease!