I recently needed to identify duplicate Contact records in a Salesforce.com org that contained over 40K Contacts. The amount of duplicates were not horrible (~5% of their Contacts) considering the sales reps had not been trained to watch out for actions that would spawn duplicate records. However the situation was bound to get worse without some training, action and follow up by leadership.
How did this happen? I suspect that the duplicates just creeped in over time from the following:
Poor initial User training – i.e. not using "Search" before creating a new Contact or Lead record and not being diligent about checking for a duplicate in the Lead conversion process)
Buying Data.com and Hoover leads en masse – e.g. buying all the Contact record for a location without checking to see if they already exist in Salesforce
How to identify how many duplicate Contact records are in your Salesforce org using Excel:
The way I accomplished this using a common tool like MS Excel, is to start with an Accounts and Contacts report. I select the minimum amount of fields to do the data work: Account Owner, Account ID, Account Name, Contact Owner, Contact ID, and Contact Email Address. I use the Email address as a unique ID for a Contact, figuring the Name and Phone fields were unreliable as Users are going to type in variations of person name (Jonathan, John, Johnny, etc) and mix and match direct lines, cell phone numbers, home phone numbers, and a company’s main line for the Contact phone field.
Once I have the report written in Salesforce, I dump the file to an Excel CSV file so I can start the data work. Once I’ve pulled up the file in Excel, I sort the Email address column data A-Z. I do this so I can delete Contact records without an email address that show up at the bottom of the spreadsheet. Depending on your version of Excel, you might need to work with smaller subsets of the data. For example, I have learned that if you need to run a pivot table (which is what we are about to do), you’ll need to use 10,000 or less records at time using the 2003/2007 version of Excel. The reason I sorted the list of email addresses alphabetically is so I can easily keep track of how much data I grabbing at a time (i.e. email addresses that start with "A" through email addresses that start with "D," then E- M, etc.).
I’ll then cut my <10,000 record subset and paste it into a new worksheet (you’ll need the column headers as well), select all the records and start a pivot table in a new tab. In the Pivot Table wizard you’ll drag the Email address column into the "Row Labels" section and then drag the Contact ID into the "Values" section (see screen shot below). That will give you two columns – the email address in one column and a count of how many times that email address was found in the spreadsheet. If you see a value higher that "1″ – that’s a duplicate record!
At this point I highlight the columns in the pivot table, open a new tab and do "Paste Special > Values Only" and then sort the data by the Contact ID column so the values greater that 1 (the duplicate email addresses) are at the top of the list.
You can then copy and paste the email address from your spreadsheet into the Search box in Salesforce and start merging your duplicate Contacts using the native wizard.
If you have a TON of Contacts to go through or you’re not familiar with the data, you’re probably going to have to rely on the Contact Owner (Salesforce User) to make the decisions needed to merge the Contact records. Once you have you list of duplicate email addresses in a tab, you can add another column to do a "vlookup" function in Excel and pull in the Contact Owner from your original Accounts and Contacts report that you exported to a CSV. I’m not going to go over how to do a "vlookup" in Excel – if you don’t know how, find an Excel guru in your office! Then you can parse and farm out the duplicate emails to the right Users to research and merge.
NOTE: Matching on 15 character vs 18 character Record IDs in Excel
Excel is NOT case sensitive when matching records using the vlookup function. For example, Excel will think these two record IDs are the same, when in fact they are different records: 003A0000004T6Og vs. 003A0000004t6Og. If you run a report from within Salesforce and request the record ID you will get a 15 character record ID, and since Excel is not case sensitive, you do not get a unique key to match on. If you use the Data Loader to export your records you will get an 18 character Record ID. This WILL be a unique key that you can use in Excel.
Also, there is now a native formula in Salesforce that allows you to do the conversion: CASESAFEID(id) – it converts a 15-character ID into a case insensitive 18-character ID.