DQSIntroWebinar v6

Report 3 Downloads 146 Views
GETTING STARTED with SQL Server 2012 Data Quality Services (DQS)

Your Presenters Cleaner Data in SQL Server

Ben Gallagher Systems Development Director

Josh Buckler Business Development Director

The Challenge of SQL Data Quality

Today’s Agenda I.

Creating Knowledge Bases and Domains

II. Using DQS to Clean Your Database III. Fuzzy Matching IV. DQS Performance & Hardware Requirements

V. Summary VI. Questions

Join the Conversation on Twitter with @helpIT via #DQSIntro

CREATING KNOWLEDGE BASES & DOMAINS

Overview of Knowledge Bases • Knowledge Bases – knowledge about Data Domains

• Allow DQS to make intelligent decisions about data • Knowledge Base examples: • name and address data – Domains for firstname, lastname • Product data – Domains for models, descriptions • Create Knowledge Bases and populate them before processing

Overview of Domains • Domains allow incoming data to be: • Normalized to improve matching (e.g. replace special characters with spaces/NULLs) • Properly cased • Spell checked • Common errors corrected • Apply synonyms (e.g. Bob, Bobby, Rob & Robby = Robert)

Overview of Domains

• Term Based Relations allow find and replace • Domains can be combined to create Composite Domains • Composite Domains allow us to define cross-domain rules • Each Domain can only be used in a single Composite Domain

Knowledge Discovery • Knowledge Base Discovery automatically generates Knowledge Bases from data • Easy to create Knowledge Bases for product names, first names, zips, cities, states • Knowledge Bases then used to validate incoming data • Manual review and addition of valid values to Knowledge Base

Summary • Useful for improving relatively well structured data such as departments, product names • Knowledge Discovery feature is a good way to get started loading values into your Knowledge Base

• Can be time consuming for large Knowledge Bases • Industry-specific Knowledge Bases will emerge over time

• Some requirements more suited to third party solutions e.g. address standardization using USPS data

USING DQS TO CLEANSE YOUR DATABASE

Summary • Create a Data Quality Project using Knowledge Bases

• Existing Data Quality Projects cannot access a new or republished knowledgebase • DQS provides a clean, easy to use interface to create projects • DQS processing results can be previewed within the DQS UI

FUZZY MATCHING

Overview of Fuzzy Matching • A Knowledge Base can have a Matching Policy • Can create multiple rules and apply matching weights to Domain fields • Matching is exact or fuzzy (not phonetic) so won’t match e.g. Deighton and Dayton • Grading of matches can be a very iterative process

Summary • DQS matches data within a SQL Server table

• Grades matches based on similarity between fields in a pair of records, modified by a weight for that field • No easy way to identify overlap between 2 different tables • Can’t match a single record e.g. a new record to an existing database

PERFORMANCE AND HARDWARE REQUIREMENTS

DQS Performance

*Reproduced from Data Quality Services Performance Best Practices (April 2012)

DQS Hardware Requirements

SUMMARY

Summary • DQS is driven through knowledge bases created from sample data sets • These are evolved through manual tuning as you process data • DQS can make simple corrections and normalizations • DQS fuzzy matching works best when data is most granular and well structured • Contact data is typically more complex and of variable standard • Processing large volumes of data using DQS is resource intensive

DQS and the Data Quality Firewall

Josh Buckler Director of Business Development 866.332.7132 [email protected]

www.helpIT.com