metadata repository

Report 6 Downloads 286 Views
US 20080140696A1

(19) United States (12) Patent Application Publication (10) Pub. No.: US 2008/0140696 A1 Mathuria (54)

(43) Pub. Date:

SYSTEM AND METHOD FOR ANALYZING DATA SOURCES TO GENERATE METADATA

(75) Inventor:

Jun. 12, 2008

Publication Classi?cation

(51)

Janak Mathuria, Fairfax, VA (U S)

Int. Cl.

G06F 17/30

(200601)

Correspondence Address_

(52)

ROTHWELL, FIGG, ERNST & MANBECK, RC. 1425 K STREET, N.W., SUITE 800

(57)

WASHINGTON DC 20005 ’

A system and method are provided for generating metadata

(73) Assigneez

US. Cl. ............................... .. 707/102; 707/E17.044

ABSTRACT

Pantheon systems, Inc” Potomac’

relating to an enterprise management system including at

MD (Us)

least one data source having one or more of tables and col umns. Constraints existing on at least one of the tables and

(21)

App1_ NO;

11/634,902

columns in the data source are inferred based on data in the tables and columns. Metadata that includes information on

(22)

Filed;

Dec. 7, 2006

the inferred constraints is generated.

Identify the data sources (e.g., data base instances, documents, flat tiles, spread \f“ 310-1

sheets) to be analyzed

i I ObtalnIpermrssron/connectron information for each data source

w 8102

Store identification and permission/ connection information in

metadata repository

Step 10

w 510'?’

Patent Application Publication

Jun. 12, 2008 Sheet 1 0f 25

US 2008/0140696 A1

FIG.1 Step 10

Step 150

Step 160

Obtain data

lnter single

sources

column

p

Identity tables

and columns

_

I

Step 140

step 30 l _

—> column unique

relationshi 5

Step 20 i

Inter multi

keys step 170

l

Inter not null

Inter multi

and unique keys

column

_

relationships

ldentlfy expliclt constraints

Step 130

Step 40 l

Step 180

Execute

Conceptual

Generate user

cardinality

graph ot data

triendly terms or

quer'es

model

pseudo-names Step 50 ¢ _

t Step 120

Step 190 .

Generate

Estlmate

Ident'fY ‘ind.

cardinality

probability for

anaIYZe 'nd'ces

queries

inferences

Step 60 l .

Ste 110

p

T

Ste 200

p



Expand

Accept/discard

'denmy arid

potential keys

inferences

anaWZe Views

based on names

Step 70 l

Step 100T

Obtain and

Inter potential

analyze

keys

procedural code

Step BOl Obtain and analyze client

queries

Step 90 T Propagate ——> names

Patent Application Publication

Jun. 12, 2008 Sheet 2 0f 25

US 2008/0140696 A1

Identify the data sources (e.g., data base instances, documents, ?at tiles, spread w 3104

sheets) to be analyzed

I Obtainlpermrssron/connectron information for each data source

w 8102

Store identification and permission! connection information in

metadata repository

Step 10

FIG.2

w 5193

Patent Application Publication

Jun. 12, 2008 Sheet 3 0f 25

Identity tables of interest for each data source

US 2008/0140696 A1

w 820-1

For each table, identify column information (name and definitions);

w 820-2

Store table and column names and definitions in metadaia repository

\-/\ 820-3

Step 20

FIG.3

Patent Application Publication

Jun. 12, 2008 Sheet 4 0f 25

US 2008/0140696 A1

Identity explicitly defined conetraints for m 3304 each table and column of interest

Store name, type, SQL statement in

__/-__ S302

repository

Normalize SQL

~/" 830-3

Store normalized SQL in metadata repository

‘P 3304

Step 30

FIG.4

Patent Application Publication

840-1

select a column

V

US 2008/0140696 A1

Jun. 12, 2008 Sheet 5 of 25

/\/

340-2 Does col.

No———

name contain

markers

Yes

i

ll

840-3

parse column

840-4

name into a

Break column

plurality of words using markers

syllables

name up by l

V

generate permutations of the column name by applying ontologies to

340-5

transfer individuals words or groups of

words V

store alternative names for each

column in metadata repository

Yes

a

340-7

NO END

Step 40

FIGS

s4o-s

Al

A/

Patent Application Publication

Jun. 12, 2008 Sheet 6 0f 25

obtain all indices for each table of interest

mark each indexed column or groups of columns as

US 2008/0140696 A1

w 850-1

J‘ 850-2

potential keys

Store information in

metadata repository

Step 50

FIG.6

w 550,3

Patent Application Publication

Jun. 12, 2008 Sheet 7 0f 25

Obtain View names and definitions (SQL) from data

US 2008/0140696 A1

m $60-1

base catalog tables

Parse SQL

w 860-2

ll

Extract table/column from the w

it

Store View Name, definition, normalize SQL, and table/

m 3604

column in metadata repository V

Convert to user-friendly names w 860-5

Step 60

FIG.7

Patent Application Publication

Jun. 12, 2008 Sheet 8 0f 25

US 2008/0140696 A1

step 70 obtain from the catalog m procedure code (e.g., triggers,

S704

PL/SQL, etc.) along with SQL definition

8702/»

normalize SOL

$70.3“

Store normalized SQL in metadata

repository

FIG.8

Patent Application Publication

Jun. 12, 2008 Sheet 9 of 25

US 2008/0140696 A1

step 80

identity queries from any source (txt tiles, report definitions, SQL statements,

w 8804

INSERT, etc.)

normalize SQL

Store normalized SQL in metadata

repository

FIG.9

w 880-2

s/‘S80-3

Patent Application Publication

Jun. 12, 2008 Sheet 10 0f 25

US 2008/0140696 A1

step 90 From normalize SQL stored in

repository, identity alternative names used for columns and tables

J‘ 890-1

I Add aliases as alternative names

w 89%

l Store information in

metadata repository

FIG.1O

w 890-3

Patent Application Publication

Jun. 12, 2008 Sheet 11 0f 25

US 2008/0140696 A1

step 100

analyze the SQL statements to

identify potential keys (e.g., group by, join)

w 8100-1

Store identification of potential keys in metadata repository w 8100-2

FIG. 11

Patent Application Publication

Jun. 12, 2008 Sheet 12 of 25

US 2008/0140696 A1

step 110

for each potential key, compare column names to column names and “p 31104

aliases in data repository

t for each match (single- or multi column), identity match as additional \f‘ 8110-2

potential key

Store additional potential keys in

metadata repository

FIG.12

w 3110-3

Patent Application Publication

Jun. 12, 2008 Sheet 13 0f 25

US 2008/0140696 A1

step 120

For each table of interest, determine the cardinality of each column

wS120-1

For each table of interest, determine the range of values $51202 for each column

Store cardinality and data $51204; ranges in metadata repository

FIG.13

Patent Application Publication

Jun. 12, 2008 Sheet 14 of 25

US 2008/0140696 A1

step 140 For each column, select cardinalities $81404 and ranges

Inter NOT NULL,

unique key constraints based on $31402

cardinalities and ranges

Store inferred constraints in

metadata repository

FIG.14

~/"S140-3

Patent Application Publication

Jun. 12, 2008 Sheet 15 0f 25

US 2008/0140696 A1

step 150

r

get single column instance from repository

wStSO-l

S150~2 —NO

Are all rules met? YES execute queries to determine

_

single-column table relationships

” S150 3

Storing relationships in metadata repository

w 3150-4

3150-5 More Cols.?

NO

@ FIG.15

Patent Application Publication

Jun. 12, 2008 Sheet 16 0f 25

US 2008/0140696 A1

step 160



get single column instance from repository

~/-~S160-1

8160-2 —NO

Are all rules met?

YES execute queries to determine multi-column or m 5160-3

composite unique key constraints

YES v

Storing constraints in metadata repository

8160-5 More Cots?

NO

@ FIG.16

w $160.4

Patent Application Publication

Jun. 12, 2008 Sheet 17 of 25

US 2008/0140696 A1

step 170

7

get multi-column unique key constraints

w 8170-1

8170-2 -— No

Are all rules

YES execute queries to determine

f 81703

multi-column table relationships

YES V

Storing relationships in metadata repository

8170-5 More Cols? NQ END

FIG.17

w $170-4

Patent Application Publication

Jun. 12, 2008 Sheet 18 0f 25

US 2008/0140696 A1

step 180

Query table and column definitions, including table relationships, stored in

*j" 3180-1

metadata repository

Identify hierarchical relationships between the tables based on said definitions and

w 8180-2

relationships

generate an entity relationship diagram for each table based on the hierarchical

relationships

FIG.18

m 3180-3

Patent Application Publication

Jun. 12, 2008 Sheet 19 0f 25

US 2008/0140696 A1

step 190

———i For each inference in the

metadata repository, identity

w S1904

the related tables.

t

identify parent tables that do

not have probabilities assigned

y

V

8190-6

8190-7

t

obtain number of expected rows from user for all or a

w 81902

subset of the tables

For each table, if child exists

with a known probability, assign

the maximum probability for child tables to parent

it

Query number of rows in

each table using stored

w S190 - 3

‘V

8190-8

\

cardinalities

For each table having an

interred relationship, assign probability to all inferred constraints for that table

tr

calculate probability by dividing current number of rows by

w 8190-4

expected number

store probability in the

metadata repository

J“ 8190-5

l

FIG.19