Data cleaning – selection of data from two data sources that contain duplicate data

Posted: October 6, 2012 in MSSQL
Tags:

Task: to create a SSIS package to delete duplicates by means of Fuzzy logic



The process of creating a SSIS solution consists of the following steps:
1. creating of a SSDT (SQL Server Data Tools) SSIS project
2. defining of connection to data sources
3. defining of connection to saving duplicate records
4. defining of connection to saving error messages
5. defining of connection to a database
6. creating of a table in a target database
7. creating of a Data Flow diagram
8. removing of exact duplicates
9. removing of “similar” duplicates by means of Fuzzy logic
10. saving of retrieved duplicates

Deleting duplicates by means of Fuzzy logic

Figure 1: Deleting duplicates by means of Fuzzy logic


Source code: SSIS Package (SQL Server 2012 – SSDT)
Note: do not forget to customize the Connection Managers in solution setting

Additional references:
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s