Give a Score 0-100 for Similarity of Files

by: Frederick Gotham - Mon, 11 Oct 2021 22:52 UTC

On my hard disk, I have 350 files of various sizes (all less than a megabyte).

When I'm supplied with a new file, I want to compare it to all my other 350 files. 9 times out of 10, I will have an exact match to one of my 350 files.

Checking for an identical file among my 350 is the easy part. Also I can stop the search as soon as I find an identical file. I can even compare pre-computed sha256 sums to speed things up further.
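A minimal sketch of the exact-match step in Python, assuming the stored files all sit in one directory. The helper names `sha256_of`, `build_index` and `find_exact` are made up for illustration and do not come from the post.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_index(directory):
    """Map digest -> path for every regular file in `directory` (computed once)."""
    return {sha256_of(p): p for p in Path(directory).iterdir() if p.is_file()}


def find_exact(index, new_file):
    """Return the stored path whose contents equal `new_file`, or None."""
    return index.get(sha256_of(new_file))
```

With the index built once, each new file costs one hash and one dictionary lookup, and only the files that come back as None need the slower similarity scoring.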

The other 10% of the time, the new file will be very very similar to one of the 350 files. In these cases, I want to compare the new file to each of the 350 files, and I want to give each of the 350 files a score from 0 to 100 as to how similar it is to the new file.

Once I have all these scores, I want to pick out the file that is the best match.

I know of two common Linux programs that can help me here.
1) 'diff' is commonly used for text files and source code, and it operates on a line by line basis
2) 'cmp' is used for binary files and just points out individual bytes that don't match

My 350 files contain text; however, it is XML and there are no newlines in it. But since XML contains a lot of open angle brackets (i.e. <), all I have to do is put a newline before every <, like this:

sed 's/</\n</g' my_file.xml

After making this alteration, these files will be well-suited to be compared with 'diff'.

The best command I can find so far to get a score of how similar two files are is as follows:

diff -y --suppress-common-lines file1 file2 | wc -l

This command will give me an integer specifying how many lines are different in the two files.

Has anyone got any better idea?
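One way to turn the line-based comparison described above into a 0-100 score is sketched below in Python. It is only an illustration: it swaps the `diff` command for the standard `difflib` module, and the function names `xml_lines`, `similarity_score` and `best_match` are hypothetical.

```python
import difflib


def xml_lines(text):
    """Split XML text into pseudo-lines by starting a new line before each '<'."""
    return [line for line in text.replace("<", "\n<").splitlines() if line]


def similarity_score(a, b):
    """Score two XML strings from 0 (no lines in common) to 100 (identical)."""
    matcher = difflib.SequenceMatcher(None, xml_lines(a), xml_lines(b), autojunk=False)
    return round(100 * matcher.ratio())


def best_match(new_text, candidates):
    """Return (name, score) for the candidate most similar to `new_text`.

    `candidates` is assumed to be a dict mapping a file name to its text.
    """
    scored = ((name, similarity_score(new_text, text)) for name, text in candidates.items())
    return max(scored, key=lambda pair: pair[1])
```

Unlike a raw count of differing lines, the ratio is normalized by the total number of lines, so scores stay comparable between small and large files.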

Re: Give a Score 0-100 for Similarity of Files

by: Stefan Ram - Tue, 12 Oct 2021 00:51 UTC

Frederick Gotham <cauldwell.thomas@gmail.com> writes:
>Has anyone got any better idea?

Compress each file individually yielding file sizes s0 and
s1 for the compressed files. Now, compress the concatenation
yielding s. The ratio (s0+s1)/s might measure the similarity
if an appropriate algorithm was chosen for the compression.
Try different compression programs with different settings.

See also:
Normalized Compression Distance (NCD)
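A rough Python sketch of the compression idea follows. The choice of `lzma` as the compressor is an assumption made here (its large dictionary can see across two concatenated files of up to a megabyte each); the post leaves the compressor open. The function names are hypothetical. `ncd` uses the usual Normalized Compression Distance formula, and `size_ratio` is the (s0+s1)/s ratio from the post.

```python
import lzma


def csize(data):
    """Compressed size in bytes; lzma is an arbitrary choice of compressor."""
    return len(lzma.compress(data))


def ncd(x, y):
    """Normalized Compression Distance: near 0 for near-identical inputs,
    near 1 for unrelated ones."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def size_ratio(x, y):
    """The (s0 + s1) / s ratio: close to 2 when similar, close to 1 when unrelated."""
    return (csize(x) + csize(y)) / csize(x + y)


def score(x, y):
    """Map NCD onto a 0-100 similarity score, clamping compressor overhead."""
    return round(100 * (1 - min(1.0, max(0.0, ncd(x, y)))))
```

The inputs are `bytes` objects, for example the result of `Path(name).read_bytes()`. Fixed container overhead in the compressed output skews the numbers for very small files, so the scores are best used for ranking candidates rather than as absolute values.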

Re: Give a Score 0-100 for Similarity of Files

by: Stefan Ram - Tue, 12 Oct 2021 01:35 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:
>Normalized Compression Distance (NCD)

Or one can choose n numerical properties p[i] of the files
(like the file size), normalize them to the range [0, 1]
and then calculate the distance between files f and g as

d( f, g )= sqrt( sum over i of:( p[i]( f )-p[i]( g ))^2 )

. The choice of the p[i] can be guided by knowledge of the
purpose of the measurement.

See also: normalized euclidean distance (NED), string
similarity measures, edit distance, bag distance, N-gram
measures, Jaro variant, Smith-Waterman distance, Editex,
and Syllable alignment.
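The property-vector distance above might look like this in Python. The three properties in `features` (file size, number of '<' characters, number of distinct byte values) are placeholders chosen here for illustration; the post only names file size as an example. Normalization is done by min-max scaling over the whole set of files, which is one possible reading of "normalize them to the range [0, 1]".

```python
import math


def features(data):
    """Hypothetical numerical properties p[i] of one file's bytes."""
    return [len(data), data.count(b"<"), len(set(data))]


def normalize(rows):
    """Min-max scale each property column to [0, 1] across all files."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1 for c in columns]  # avoid dividing by zero
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def distance(p, q):
    """Euclidean distance d(f, g) between two normalized property vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Two different files can share every chosen property and so get distance 0, which is why the choice of properties matters so much for this approach.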

Re: Give a Score 0-100 for Similarity of Files

by: Mark Bluemel - Tue, 12 Oct 2021 05:40 UTC

On Monday, 11 October 2021 at 23:52:30 UTC+1, Frederick Gotham wrote:
> Has anyone got any better idea?

What is your C question here?

You'd probably be more on-topic at stack exchange's code review site...

Re: Give a Score 0-100 for Similarity of Files

by: Malcolm McLean - Tue, 12 Oct 2021 09:36 UTC

However, the real answer depends on what you are doing, how the files differ, and what you consider to be a big change.
For example, "Prussia" and "Russia" differ by only one letter, but that is as big a difference as "Italy" and "France",
while "The United Kingdom" and "Great Britain" mean almost the same thing yet have hardly a letter in common. Obviously, an
algorithm would have to be geography-aware to pick this up. It's not always possible to come up with a perfect
measure.
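To make the point concrete, here is a small Python sketch of plain edit (Levenshtein) distance applied to those names. Comparing case-insensitively is an assumption made here so that "Prussia" and "Russia" really are one edit apart.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


pairs = [("Prussia", "Russia"), ("Italy", "France"),
         ("The United Kingdom", "Great Britain")]
for left, right in pairs:
    print(left, "/", right, "->", levenshtein(left.lower(), right.lower()))
```

A purely textual metric ranks "Prussia" and "Russia" as the closest pair of the three, which is the opposite of what a geography-aware measure would say.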

Re: Give a Score 0-100 for Similarity of Files

by: Chris M. Thomasson - Tue, 12 Oct 2021 20:40 UTC

On 10/11/2021 6:35 PM, Stefan Ram wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>> Normalized Compression Distance (NCD)
>
> Or one can choose n numerical properties p[i] of the files
> (like the file size), normalize them to the range [0, 1]
> and then calculate the distance between files f and g as
>
> d( f, g )= sqrt( sum over i of:( p[i]( f )-p[i]( g ))^2 )
>
> . The choice of the p[i] can be guided by knowledge of the
> purpose of the measurement.
>
> See also: normalized euclidean distance (NED), string
> similarity measures, edit distance, bag distance, N-gram
> measures, Jaro variant, Smith-Waterman distance, Editex,
> and Syllable alignment.
>
>

God, it's been ages since I did anything close to that. FWIW, I remember
compressing two files and adding their compressed sizes together. Then I
did what you said: concatenated the two original files, compressed the
result, and compared the concatenated compressed size against the sum
of the two individually compressed file sizes.
