how to use distinct in-memory filtering and distinct server-side filtering

Distinct Operations in Entity Framework Core

In SQL, the DISTINCT keyword allows us to filter out duplicate values, ensuring that each record appears only once based on the specified column(s). However, when dealing with entire records, using DISTINCT on all columns can result in unexpected outcomes since each record typically has a unique identifier (like an Id). To effectively use DISTINCT, we need to determine the specific column or combination of columns we want to filter by. For example, consider the following query:

SELECT DISTINCT City FROM Customers;

The above query retrieves a unique list of customer locations.. Now let's see how we can achieve the same with the help of LINQ?

LINQ already has a Distinct method that we can use but, unfortunately not all LINQ methods could be translated to SQL. For that reason the code examples below are separated in two sections - LINQ queries that can only be executed in memory and LINQ queries where SQL server execution is possible.

In application memory

1. Using Distinct and IEqualityComparer for in-memory filtering

In the approach you will need a IEqualityComparer class that will describe how two Customer records are compared by one another.

var context = new SupermarketContext();
using (context)
{
    var customers = context.Customer.ToList().Distinct(new CustomerComparer());           
}

 public class CustomerComparer : IEqualityComparer<Customer>
 {
     public bool Equals(Customer x, Customer y)
     {
         if (x == null || y == null)
             return false;

         return x.City == y.City;
     }

     public int GetHashCode(Customer obj)
     {
         if (obj == null)
             return 0;

         // Combine hash codes for Name and City properties
         int hashCity = obj.City == null ? 0 : obj.City.GetHashCode();

         return hashCity;
     }
 }

2. Using DistinctBy for in-memory filtering

DistinctBy is designed for use with one or multiple properties but cannot be translated directly into a SQL query.

var context = new SupermarketContext();
using (context)
{
    var customers = context.Customer.ToList().DistinctBy(x => x.City);
}

You can even use DistinctBy with multiple properties like so

var context = new SupermarketContext();
using (context)
{
    var customers = context.Customer.ToList().DistinctBy(x => new { x.City, x.Name });
}

Server-Side Execution

1. Using Select plus Distinct for Server-Side Filtering

By combining Select and Distinct you can accomplish server side filtering. It will work with both single property and multiple properties.

var queryString = context.Customer
                 .Select(c => new { c.City, c.Name })
                 .Distinct()
                 .ToQueryString();
SELECT DISTINCT [c].[City], [c].[Name] FROM [Customer] AS [c]

The result of this operation will be a list of anonymous type objects having two properties City and Name.

2. Using GroupBy plus Select for Server-Side Filtering

Now let's say that you don’t want just the City and Name but the entire Customer object. Consider this example table:

Id

Name

City

Address

1

Alice Smith

New York

123 Main St

2

Bob Johnson

San Francisco

789 Pine Street

3

Eve Adams

Seattle

456 Elm St

4

John Doe

Seattle

143 Spruce St

5

John Doe

Seattle

64 Strawberry St

Running

var distinctCustomers = context.Customer
  .Select(c => new { c.City, c.Name })
  .Distinct();

will group records 4 and 5 into one consolidated result. If the goal is to take one customer representation for each distinct group, then we need to decide which exact customer will be that, based on some other criteria. I decided that I want the last customer to represent the group.

var distinctQuery = context.Customer
    .GroupBy(x => new { x.Name, x.City })
    .Select(g => g.OrderBy(x=>x.Id).Last()).ToQueryString();
SELECT [c3].[Id], [c3].[Address], [c3].[City], [c3].[Name]
FROM (
    SELECT [c].[Name], [c].[City]
    FROM [Customer] AS [c]
    GROUP BY [c].[Name], [c].[City]
) AS [c1]
LEFT JOIN (
    SELECT [c2].[Id], [c2].[Address], [c2].[City], [c2].[Name]
    FROM (
        SELECT [c0].[Id], [c0].[Address], [c0].[City], [c0].[Name], ROW_NUMBER() OVER(PARTITION BY [c0].[Name], [c0].[City] ORDER BY [c0].[Id] DESC) AS [row]
        FROM [Customer] AS [c0]
    ) AS [c2]
    WHERE [c2].[row] <= 1
) AS [c3] ON [c1].[Name] = [c3].[Name] AND [c1].[City] = [c3].[City]

3. Direct SQL Query Execution

To pull the last customer from each distinct group we have to order them by Id descending and take the first from each group.

SELECT * FROM
	(SELECT *, ROW_NUMBER() OVER (PARTITION BY [City] ORDER BY [Id] DESC) AS ROW_NUM FROM [dbo].[Customer]) AS City_Partition
WHERE City_Partition.ROW_NUM = 1
var customers = context.Customer
.FromSql($"SELECT * FROM
	(SELECT *, ROW_NUMBER() OVER (PARTITION BY [City] ORDER BY [Id] DESC) AS ROW_NUM FROM [dbo].[Customer]) AS City_Partition
WHERE City_Partition.ROW_NUM = 1")
.ToList();