Sets are collections of unique values.
To be able to eliminate the multiplicates, the set
needs to know when two values are equal.
Sets use the standard Clojure equality semantics, meaning that if two values are =
, then it removes the multiplicates.
Example:
(= :foo :foo) ;; #=> true #{:foo :bar :foo} ;; #=> #{:foo :bar} ;; One of the two :foo values is removed because they are equal (`=`)
But what if you want to have a set with custom equality semantics?
Say you have a list of person entities, where each person has a :person/id
attribute that identifies the person. When you put the person entities in a set, you want entities with unique :person/id
s, no matter what other data the entity contains.
Example:
#{{:person/id 1 :person/firstName "Mikko"} {:person/id 2 :person/firstName "John" :person/lastName "Doe"} {:person/id 1 :person/lastName "Koski"}} ;; #=> #{#:person{:id 1, :firstName "Mikko"} ;; #:person{:id 1, :lastName "Koski"} ;; #:person{:id 2, :firstName "John", :lastName "Doe"}} ;; Note that there are two entities with `:person/id` 1, which is not what we want.
How could we create a set
that uses the :person/id
attribute to decide if two values are equal?
sorted-set-by
Sets are unordered data structures, but you can create ordered sets with sorted-set
. Using sorted-set-by
, you can have a sorted set with your own comparator for sorting.
The sorted-set
has another interesting property: In addition to the sort order, the comparator also defines the equality semantics, meaning that if the comparator thinks that two values are equal (i.e., the comparator returns 0
), then the set threats them as equal values, that is, it removes multiplicates.
So for example, we can make a comparator that compares only the :person/id
attribute and return 0
for two maps with the same :person/id
, regarless of the other keys and values in the map:
(defn compare-id [x y] (compare (:person/id x) (:person/id y))) (compare-id {:person/id 1 :person/firstName "Mikko"} {:person/id 1 :person/lastName "Koski"}) ;; #=> 0 ;; Zero means the values are equal from comparator point-of-view
And if we use compare-id
as the comparator with sorted-set-by
, we get the result we wanted:
(def s (sorted-set-by compare-id {:person/id 1 :person/firstName "Mikko"} {:person/id 2 :person/firstName "John"} {:person/id 1 :person/lastName "Koski"})) s ;; #=> #{#:person{:id 1, :firstName "Mikko"} #:person{:id 2, :firstName "John"}}
disj
, contains?
and get
?Yes, they do.
These functions are documented as follows in Clojure reference:
Sets support 'removal' with disj, as well as contains? and get, the latter returning the object that is held in the set which compares equal to the key, if found:
Here's an example that demonstrates that everything works as documented:
(disj s {:person/id 1 :person/lastName "Koski"}) ;; #=> #{#:person{:id 2, :firstName "John"}} (contains? s {:person/id 1 :person/lastName "Koski"}) ;; #=> true (get s {:person/id 1 :person/lastName "Koski"}) ;; #=> #:person{:id 1, :firstName "Mikko"}
clojure.set
operators?So now that we've set up our set of person entities with custom equality by :person/id
and tested that the essential functions disj
, contains?
and get
work as expected, the next question is, do the set operations provided by the clojure.set
namespace work with sets with custom equality?
The answer is yes, but with one caveat:
Do not mix different set types.
If we look at the source code of clojure.set, we can see that the functions don't create new sets for the return values; instead, they reuse the ones given as parameters. This is good news because we don't want to lose our custom equality semantics after using the clojure.set
operations.
However, the functions in clojure.set
use count
of the given sets to determine which one will be used as a base for the return value. You shouldn't rely on this logic, so it's safest to use two sets of the same type.
Let's take an example.
First, let's call clojure.set/intersection
with a sorted set of two values and a normal unsorted set of one value:
(def result-when-sorted-set-has-more-values (clojure.set/intersection (sorted-set-by compare-id {:person/id 1 :person/firstName "Mikko"} {:person/id 2 :person/firstName "John"}) #{{:person/id 2 :person/lastName "Doe"}})) result-when-sorted-set-has-more-values ;; #=> #{#:person{:id 2, :lastName "Doe"}} (sorted? result-when-sorted-set-has-more-values) ;; #=> false (contains? result-when-sorted-set-has-more-values {:person/id 2}) ;; #=> false
In the example above, the returned set has the value we expected but is not sorted, nor does it use the equality by :person/id
.
Next, let's have a sorted set of one value and an unsorted set of two values:
(def result-when-unsorted-set-has-more-values (clojure.set/intersection (sorted-set-by compare-id {:person/id 2 :person/firstName "John"}) #{{:person/id 1 :person/firstName "Mikko"} {:person/id 2 :person/lastName "Doe"}})) result-when-unsorted-set-has-more-values ;; #=> #{} (sorted? result-when-unsorted-set-has-more-values) ;; #=> true (contains? result-when-unsorted-set-has-more-values {:person/id 2}) ;; #=> false
Now we did got back a sorted set, but the set didn't have any values, which is not what we wanted.
To get the correct result and correct set type, we have to use the same set type for both arguments:
(def result-when-both-sets-are-sorted (clojure.set/intersection (sorted-set-by compare-id {:person/id 2 :person/firstName "John"}) (sorted-set-by compare-id {:person/id 1 :person/firstName "Mikko"} {:person/id 2 :person/lastName "Doe"}))) result-when-both-sets-are-sorted ;; #=> #{#:person{:id 2, :firstName "John"}} (sorted? result-when-both-sets-are-sorted) ;; #=> true (contains? result-when-both-sets-are-sorted {:person/id 2}) ;; #=> true
Using the sorted-set-by
with custom equality semantics was a fun thing to learn, but I'm still unsure if there are any real world use-cases out there where this would be actually useful.
In the examples above, I had two sets of person entities with slightly different attributes, and the aim was to eliminate the duplicates. In reality, I'd probably do that with something like distinct-by
(if it existed), or I'd use group-by
and then map the vals
with merge
to get distinct entities with the attributes merged, which is probably what I'd want anyway:
(->> [{:person/id 1 :person/firstName "Mikko"} {:person/id 2 :person/firstName "John"} {:person/id 1 :person/lastName "Koski"}] (group-by :person/id) vals (map #(apply merge %))) ;; #=> (#:person{:id 1, :firstName "Mikko", :lastName "Koski"} ;; #:person{:id 2, :firstName "John"})
But maybe if I'd need to do actual set operations (difference
, intersection
etc.) with the two sets of entities, then I'd maybe use the sorted-set-by
with custom equality. Maybe.
What do you think? Is there any real-world use for this?
Please let me know and comment on Mastodon!