US 20080140696A1
(19) United States (12) Patent Application Publication (10) Pub. No.: US 2008/0140696 A1 Mathuria (54)
(43) Pub. Date:
SYSTEM AND METHOD FOR ANALYZING DATA SOURCES TO GENERATE METADATA
(75) Inventor:
Jun. 12, 2008
Publication Classi?cation
(51)
Janak Mathuria, Fairfax, VA (U S)
Int. Cl.
G06F 17/30
(200601)
Correspondence Address_
(52)
ROTHWELL, FIGG, ERNST & MANBECK, RC. 1425 K STREET, N.W., SUITE 800
(57)
WASHINGTON DC 20005 ’
A system and method are provided for generating metadata
(73) Assigneez
US. Cl. ............................... .. 707/102; 707/E17.044
ABSTRACT
Pantheon systems, Inc” Potomac’
relating to an enterprise management system including at
MD (Us)
least one data source having one or more of tables and col umns. Constraints existing on at least one of the tables and
(21)
App1_ NO;
11/634,902
columns in the data source are inferred based on data in the tables and columns. Metadata that includes information on
(22)
Filed;
Dec. 7, 2006
the inferred constraints is generated.
Identify the data sources (e.g., data base instances, documents, flat tiles, spread \f“ 310-1
sheets) to be analyzed
i I ObtalnIpermrssron/connectron information for each data source
w 8102
Store identification and permission/ connection information in
metadata repository
Step 10
w 510'?’
Patent Application Publication
Jun. 12, 2008 Sheet 1 0f 25
US 2008/0140696 A1
FIG.1 Step 10
Step 150
Step 160
Obtain data
lnter single
sources
column
p
Identity tables
and columns
_
I
Step 140
step 30 l _
—> column unique
relationshi 5
Step 20 i
Inter multi
keys step 170
l
Inter not null
Inter multi
and unique keys
column
_
relationships
ldentlfy expliclt constraints
Step 130
Step 40 l
Step 180
Execute
Conceptual
Generate user
cardinality
graph ot data
triendly terms or
quer'es
model
pseudo-names Step 50 ¢ _
t Step 120
Step 190 .
Generate
Estlmate
Ident'fY ‘ind.
cardinality
probability for
anaIYZe 'nd'ces
queries
inferences
Step 60 l .
Ste 110
p
T
Ste 200
p
‘
Expand
Accept/discard
'denmy arid
potential keys
inferences
anaWZe Views
based on names
Step 70 l
Step 100T
Obtain and
Inter potential
analyze
keys
procedural code
Step BOl Obtain and analyze client
queries
Step 90 T Propagate ——> names
Patent Application Publication
Jun. 12, 2008 Sheet 2 0f 25
US 2008/0140696 A1
Identify the data sources (e.g., data base instances, documents, ?at tiles, spread w 3104
sheets) to be analyzed
I Obtainlpermrssron/connectron information for each data source
w 8102
Store identification and permission! connection information in
metadata repository
Step 10
FIG.2
w 5193
Patent Application Publication
Jun. 12, 2008 Sheet 3 0f 25
Identity tables of interest for each data source
US 2008/0140696 A1
w 820-1
For each table, identify column information (name and definitions);
w 820-2
Store table and column names and definitions in metadaia repository
\-/\ 820-3
Step 20
FIG.3
Patent Application Publication
Jun. 12, 2008 Sheet 4 0f 25
US 2008/0140696 A1
Identity explicitly defined conetraints for m 3304 each table and column of interest
Store name, type, SQL statement in
__/-__ S302
repository
Normalize SQL
~/" 830-3
Store normalized SQL in metadata repository
‘P 3304
Step 30
FIG.4
Patent Application Publication
840-1
select a column
V
US 2008/0140696 A1
Jun. 12, 2008 Sheet 5 of 25
/\/
340-2 Does col.
No———
name contain
markers
Yes
i
ll
840-3
parse column
840-4
name into a
Break column
plurality of words using markers
syllables
name up by l
V
generate permutations of the column name by applying ontologies to
340-5
transfer individuals words or groups of
words V
store alternative names for each
column in metadata repository
Yes
a
340-7
NO END
Step 40
FIGS
s4o-s
Al
A/
Patent Application Publication
Jun. 12, 2008 Sheet 6 0f 25
obtain all indices for each table of interest
mark each indexed column or groups of columns as
US 2008/0140696 A1
w 850-1
J‘ 850-2
potential keys
Store information in
metadata repository
Step 50
FIG.6
w 550,3
Patent Application Publication
Jun. 12, 2008 Sheet 7 0f 25
Obtain View names and definitions (SQL) from data
US 2008/0140696 A1
m $60-1
base catalog tables
Parse SQL
w 860-2
ll
Extract table/column from the w
it
Store View Name, definition, normalize SQL, and table/
m 3604
column in metadata repository V
Convert to user-friendly names w 860-5
Step 60
FIG.7
Patent Application Publication
Jun. 12, 2008 Sheet 8 0f 25
US 2008/0140696 A1
step 70 obtain from the catalog m procedure code (e.g., triggers,
S704
PL/SQL, etc.) along with SQL definition
8702/»
normalize SOL
$70.3“
Store normalized SQL in metadata
repository
FIG.8
Patent Application Publication
Jun. 12, 2008 Sheet 9 of 25
US 2008/0140696 A1
step 80
identity queries from any source (txt tiles, report definitions, SQL statements,
w 8804
INSERT, etc.)
normalize SQL
Store normalized SQL in metadata
repository
FIG.9
w 880-2
s/‘S80-3
Patent Application Publication
Jun. 12, 2008 Sheet 10 0f 25
US 2008/0140696 A1
step 90 From normalize SQL stored in
repository, identity alternative names used for columns and tables
J‘ 890-1
I Add aliases as alternative names
w 89%
l Store information in
metadata repository
FIG.1O
w 890-3
Patent Application Publication
Jun. 12, 2008 Sheet 11 0f 25
US 2008/0140696 A1
step 100
analyze the SQL statements to
identify potential keys (e.g., group by, join)
w 8100-1
Store identification of potential keys in metadata repository w 8100-2
FIG. 11
Patent Application Publication
Jun. 12, 2008 Sheet 12 of 25
US 2008/0140696 A1
step 110
for each potential key, compare column names to column names and “p 31104
aliases in data repository
t for each match (single- or multi column), identity match as additional \f‘ 8110-2
potential key
Store additional potential keys in
metadata repository
FIG.12
w 3110-3
Patent Application Publication
Jun. 12, 2008 Sheet 13 0f 25
US 2008/0140696 A1
step 120
For each table of interest, determine the cardinality of each column
wS120-1
For each table of interest, determine the range of values $51202 for each column
Store cardinality and data $51204; ranges in metadata repository
FIG.13
Patent Application Publication
Jun. 12, 2008 Sheet 14 of 25
US 2008/0140696 A1
step 140 For each column, select cardinalities $81404 and ranges
Inter NOT NULL,
unique key constraints based on $31402
cardinalities and ranges
Store inferred constraints in
metadata repository
FIG.14
~/"S140-3
Patent Application Publication
Jun. 12, 2008 Sheet 15 0f 25
US 2008/0140696 A1
step 150
r
get single column instance from repository
wStSO-l
S150~2 —NO
Are all rules met? YES execute queries to determine
_
single-column table relationships
” S150 3
Storing relationships in metadata repository
w 3150-4
3150-5 More Cols.?
NO
@ FIG.15
Patent Application Publication
Jun. 12, 2008 Sheet 16 0f 25
US 2008/0140696 A1
step 160
7»
get single column instance from repository
~/-~S160-1
8160-2 —NO
Are all rules met?
YES execute queries to determine multi-column or m 5160-3
composite unique key constraints
YES v
Storing constraints in metadata repository
8160-5 More Cots?
NO
@ FIG.16
w $160.4
Patent Application Publication
Jun. 12, 2008 Sheet 17 of 25
US 2008/0140696 A1
step 170
7
get multi-column unique key constraints
w 8170-1
8170-2 -— No
Are all rules
YES execute queries to determine
f 81703
multi-column table relationships
YES V
Storing relationships in metadata repository
8170-5 More Cols? NQ END
FIG.17
w $170-4
Patent Application Publication
Jun. 12, 2008 Sheet 18 0f 25
US 2008/0140696 A1
step 180
Query table and column definitions, including table relationships, stored in
*j" 3180-1
metadata repository
Identify hierarchical relationships between the tables based on said definitions and
w 8180-2
relationships
generate an entity relationship diagram for each table based on the hierarchical
relationships
FIG.18
m 3180-3
Patent Application Publication
Jun. 12, 2008 Sheet 19 0f 25
US 2008/0140696 A1
step 190
———i For each inference in the
metadata repository, identity
w S1904
the related tables.
t
identify parent tables that do
not have probabilities assigned
y
V
8190-6
8190-7
t
obtain number of expected rows from user for all or a
w 81902
subset of the tables
For each table, if child exists
with a known probability, assign
the maximum probability for child tables to parent
it
Query number of rows in
each table using stored
w S190 - 3
‘V
8190-8
\
cardinalities
For each table having an
interred relationship, assign probability to all inferred constraints for that table
tr
calculate probability by dividing current number of rows by
w 8190-4
expected number
store probability in the
metadata repository
J“ 8190-5
l
FIG.19