Introduction to Data Science, Week 2 Notes

Report 6 Downloads 51 Views
Introduction to Data Science, Week 2 Notes

0

Page 1 of 7

Danny King, 2013

Introduction to Data Science, Week 2 Notes Different types of database structure/organisation have benefits or weaknesses for different types of activities (e.g. tradeoffs between read and write speeds). When choosing the data storage mechanism you should keep in mind: how is the data organised on disk? What kinds of queries are best suited to the organisation (and what aren’t)? How hard is it to update the data/add new data? What happens when you encounter unanticipated queries – do you need to reorganise the data and how hard is that? The key part of relational databases is that the data is independent of the application code – activities of users and applications should remain unaffected when the internal representation of data is changed. Tabular data has an algebraic structure which allows reasoning and manipulation independently of physical data representation. Programs you write to manipulate data are more robust than they would be without the relational model. SQL is converted into relational algebra by relational DBMSs so that they can use algebraic simplification to optimise the equation and eliminate unnecessary tasks, making the computation more efficient, which is especially important over large data sets. Algebraic closure in this context means that any operation (on a table) will return a table so you can chain operations. Sets can’t have duplicate elements and bags can.

Union bags

Danny King [email protected] www.dannyking.eu

Union sets

0

Introduction to Data Science, Week 2 Notes

Page 2 of 7

Danny King, 2013

Intersection

e.g.:

Projection: remove all columns that aren’t explicitly listed and remove all duplicates that remain (for sets) Bag semantics

Bag semantics

Danny King [email protected] www.dannyking.eu

Introduction to Data Science, Week 2 Notes

0

Page 3 of 7

Danny King, 2013

Cross product: all possible combinations of R1 with R2. Size of a cross product is |R1| x |R2|

Equijoin is a join with an equality condition. Join is: for every record in R1 find a corresponding record in R2 that satisfies some condition. Theta join is simply a join where the condition can be anything (so equijoin is a special case of theta join where the condition is equality).

Danny King [email protected] www.dannyking.eu

Introduction to Data Science, Week 2 Notes

0

Page 4 of 7

Danny King, 2013

When dealing with understanding a complicated query, look for the FROM clause. In this case, there are 3 nested FROM clauses (closure allows this – operating on new tables returned by each query). You can save any of these statements as a view (i.e. give it a name), then you can refer to it as if it were a table.

Sometimes it helps to draw out complex conditions graphically to work out what is happening.

There are three types of user-defined functions (UDFs). Aggregate functions only appear in the select clause and is always associated with a GROUP BY – a common example is a concatenate function. Table functions appear in the from clause – usually the most complicated – commonly they are used to generate sequence of integers (1, 2, ..., 4, 5) in a temporary table. Scalar functions can appear pretty much anywhere an attribute can – e.g. if in a SELECT clause, a WHERE clause, in a JOIN condition, etc.

Danny King [email protected] www.dannyking.eu

Introduction to Data Science, Week 2 Notes

0

Page 5 of 7

Danny King, 2013

Declarative languages: you specify the answer you want but don’t specify anything about how to get it. Relational Algebra expressions are not declarative (you do specify an order) but SQL is. With SQL you just specify the properties that must be true of the result and you let the database figure out the right way to compute it. The SQL is the ‘what’ not the ‘how.’

Logical Independence Physical independence is the ability to isolate/protect the application from physical changes in the organisation of the data. So if things were rearranged on disk, we don’t have to rewrite the application. This is what databases provide. Logical data independence is the ability to isolate/protect the application form changes to the logical structure of the database – e.g. adding a new column. If your application doesn’t care about that column it shouldn’t have to be re-written. This logical data independence is provided by views (all relational databases have this concept). This allows us to separate the people who are administering the database from the ones who are actually accessing it. A view is just a query with a name (saved in the database) which you can access as if it were a real table. No matter how you want to organise your tables views allows you to provide a different perspective on the data (e.g. different structure, different column names) for users or other queries to use. Access views exactly as if they were a table. The database pulls all these together and works out the best way to deal with them all in one go, so you don’t need to worry about having a stack of queries or views – the database will translate the whole thing into one big query.

Danny King [email protected] www.dannyking.eu

Introduction to Data Science, Week 2 Notes

0

Page 6 of 7

Danny King, 2013

Indexes allow you to find needles in the haystack of database data. Queries will always finish regardless of database size as long as it fits on disk: you don’t need to worry about amount of memory(RAM).

Written Notes

Danny King [email protected] www.dannyking.eu

Introduction to Data Science, Week 2 Notes Danny King, 2013

Danny King [email protected] www.dannyking.eu

0

Page 7 of 7