Recently I needed to deduplicate some files, so I threw together a small md5-based script.
Script
The script dedupicate.sh:
#!/bin/bash
WORK_DIR=/path/to/dir
file_num=0
del_num=0
OIFS=$IFS
IFS=$'\n'                      # split only on newlines so filenames with spaces survive
cd "$WORK_DIR" || exit 1
# Pass 1: record the md5 of every file and count the files.
while read -r line; do
    md5sum "$line" | awk '{print $1}' >>"$WORK_DIR/.temp.txt"
    ((file_num = file_num + 1))
done <<<"$(ls "$WORK_DIR")"
# Pass 2: for every md5 that occurs more than once, keep the first
# matching file and delete the rest.
while read -r line; do
    if_del="false"
    while read -r name; do
        md5_val=$(md5sum "$name" | awk '{print $1}')
        if [[ $md5_val == "$line" && $if_del == "false" ]]; then
            if_del="true"
        elif [[ $md5_val == "$line" && $if_del == "true" ]]; then
            rm "$name"
            ((del_num = del_num + 1))
        fi
    done <<<"$(ls "$WORK_DIR")"
done <<<"$(sort "$WORK_DIR/.temp.txt" | uniq -d)"
IFS=$OIFS
rm "$WORK_DIR/.temp.txt"
echo "This directory had $file_num files in total; $del_num duplicate files were removed."
The efficiency is not great. 😮💨 With only a few files it's tolerable, but once there are a lot it gets painfully slow, since the inner loop re-runs md5sum on every file for every duplicate hash. I'll look for a better way to write it another day. 🥱
——2022.5.7——
I found a version written by a much more capable author here and tweaked it a bit:
#!/bin/bash
WORK_DIR=/path/to/dir
del_num=0
cd "$WORK_DIR" || exit 1
# Hash every regular file in this directory exactly once, sorted by md5.
find . -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort >all.txt
# Keep only the first entry for each md5 (compare just the first 32 chars, the hash).
uniq -w 32 all.txt >uniq.txt
# Lines in all.txt but not in uniq.txt are the duplicates; the filename
# starts at column 35 (32-char hash plus a 2-character separator).
while IFS= read -r line; do
    if [ -n "$line" ]; then
        rm "$line"
        ((del_num = del_num + 1))
    fi
done <<<"$(comm -23 all.txt uniq.txt | cut -c 35-)"
rm all.txt uniq.txt
echo "This directory now has $(ls -l | grep -c "^-") files; $del_num duplicate files were removed."
This feels much better: every file is hashed only once, and sort, uniq, and comm take care of spotting the duplicates.
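For reference, the same idea also fits in a single pipeline. The sketch below is my own variant rather than anything from the linked post; WORK_DIR is a placeholder as above, and it assumes filenames contain no newlines:

#!/bin/bash
# Single-pass sketch (not from the original post): hash each file once,
# let awk print every file whose md5 has already been seen, then delete
# those. Assumes filenames contain no newlines.
WORK_DIR=/path/to/dir
cd "$WORK_DIR" || exit 1
find . -maxdepth 1 -type f -print0 \
    | xargs -0 md5sum \
    | awk 'seen[$1]++ { print substr($0, 35) }' \
    | while IFS= read -r f; do
          rm -- "$f"
      done

Whichever copy md5sum happens to hash first is the one that gets kept.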
OVER