BT5110
Mini-project
INSTRUCTIONS 1. Some of the reference code is available in the file mini-project.sql (it can be found in IVLE Workbin (Files)/Project/Mini Project). 2. Submit your group’s report (one report per group; any group member submits the report) to the folder ”/Project/Mini Project/Submissions” in IVLE Workbin (Files) as one PDF file by Friday 20 September 2017 at 17:00. 3. The PDF file should be named groupXX.pdf, where XX is your group number as in IVLE Project. 4. Make sure to write the student number of each member of your group on the front page of the report. 5. After the deadline and until Friday 6 October 2017 at 17:00, you can submit late reports to the folder ”/Project/Mini Project/Late Submissions” in IVLE Workbin (Files) (penalties apply). 6. Submit only one PDF file per team (only one document will be considered additional files will be ignored). 7. Present your answer in two tables (one table for the query results, one table for the eight queries and the eight performance results). add additional comments if any separately. 8. Make sure to clearly highlight the answer from other discussions and comments. 9. Keep your code as you may be asked to demonstrate that you can reproduce the results that you are reporting.
BT5110 In 1988, Fabian Pascal, a database designer and programmer (and prolific blogger on database issues, see http://www.dbdebunk.com, published the article ”SQL Redundancy and DBMS Performance” in the journal Database Programming & Design (V1, N12). He compared and discussed the plan and performance of seven equivalent SQL queries with different database management systems. For the experiment he proposed a schema and a synthetic instance on which the seven queries are executed. At the time, the different systems could or could not execute all the queries and the performances significantly differed among and within individual systems while one would expect the DBMS optimizer to choose the same optimal execution plan for these queries. In this assignment, we propose to replay Fabian Pascal’s experiment with PostgreSQL. The schema consists of a table employee and a table payroll. The table employee records information about employees of a fictitious company. Employees have an employee identifier, a first name and a last name, an address recorded as a street address, a city, a state and a zip code. The table payroll records, for each employee, her bonus and salary. The following SQL creates the tables employee and payroll with the domains in Fabian Pascals original article. CREATE TABLE employee ( empid CHAR( 9 ) , lname CHAR( 1 5 ) , fname CHAR( 1 2 ) , a d d r e s s CHAR( 2 0 ) , c i t y CHAR( 2 0 ) , s t a t e CHAR( 2 ) , ZIP CHAR( 5 ) );
CREATE TABLE p a y r o l l ( empid CHAR( 9 ) , bonus INTEGER, s a l a r y INTEGER );
The query that we are studying finds the identifier and the last name of the employees earning a salary of $199170. This query in English can be expressed in SQL in (infinitely) many different ways.
Page 2
BT5110 Question 1 [20 marks] (a) Create and populate the employee and payroll tables. PLPGSQL is procedural code that can be executed by the PostgreSQL server directly. You may use or modify the following PLPGSQL function to generate random string of upper case alphabetical characters of a fixed length. CREATE o r REPLACE FUNCTION r a n d o m s t r i n g ( l e n g t h INTEGER) RETURNS TEXT AS $$ DECLARE c h a r s TEXT [ ] := ’ {A, B, C, D, E , F , G, H, I , J , K, L ,M, N, O, P , Q, R, S , T, U, V,W, X, Y, Z} ’ ; r e s u l t TEXT := ’ ’ ; i INTEGER := 0 ; BEGIN IF l e n g t h < 0 then RAISE EXCEPTION ’ Given l e n g t h c ann ot be l e s s than 0 ’ ; END IF ; FOR i IN 1 . . l e n g t h LOOP r e s u l t := r e s u l t | | c h a r s [1+ random ( ) ∗ ( a r r a y l e n g t h ( c h a r s , 1 ) − 1 ) ] ; END LOOP; RETURN r e s u l t ; END; $$ LANGUAGE p l p g s q l ;
You may use or modify the following SQL DML code to insert data into the two tables. INSERT INTO employee SELECT g : : t e x t AS empid , r a n d o m s t r i n g ( 1 5 ) AS lname , r a n d o m s t r i n g ( 1 2 ) AS fname , ’ 500 ORACLE PARKWAY’ AS a d d r e s s , ’REDWOOD SHORES ’ AS c i t y , ’CA ’ AS s t a t e , ’ 94065 ’ AS z i p FROM g e n e r a t e s e r i e s (0 , 9999) g ; INSERT INTO p a y r o l l ( empid , bonus , s a l a r y ) SELECT p e r . empid , 0 a s bonus , 99170 + ROUND( random ( ) ∗ 1 0 0 0 ) ∗ 1 0 0 AS s a l a r y FROM employee p e r ;
Print the result of the first query below. For example:
empid 327 2532 4313 5144
lname BLDBMBDKYHJXSWU UTZKTYHLCLFRPZB BEMENOIQDNTVH LKZDENXQFEAMLVH
Page 3
(1)
BT5110 (b) Consider the following query: “Finds the identifier and the last name of the employees earning a salary of $199170” . Propose a query in SQL for each of the following requirements. 1. 2. 3. 4. 5. 6.
Write the query as a simple query. Write the query as a simple query with DISTINCT. Write the query as a query with INNER . Write the query as an uncorrelated nested query with IN. Write the query as a correlated nested query with EXISTS. Write the query as a correlated nested query with a scalar subquery comparing 199170 with = ALL. 7. Write the query as a double negative nested query.
Page 4
(7)
BT5110 (c) For each of your query, find the execution time minus the planning time as reported with EXPLAIN ANALYZE. Calculate the average of the execution time minus the planning time over 1000 executions of the query. Tabulate the results for each of the seven queries above. You can use or modify the following stored function to calculate the average of the execution time minus the planning time . CREATE OR REPLACE FUNCTION t e s t (TEXT) RETURNS TEXT AS $$ DECLARE r RECORD; s TEXT; t TEXT; p TEXT; a NUMERIC := 0 ; MAX INTEGER : = 1 0 0 0 ; BEGIN FOR i IN 1 . .MAX LOOP FOR r i n EXECUTE ’EXPLAIN ANALYZE ’ | | $1 LOOP t := c o n c a t ( t , r : : TEXT) ; END LOOP; s := r e g e x p r e p l a c e ( t , ’ . ∗ E x e c u t i o n time : ( . ∗ ? ) ms . ∗ ’ , ’ \1 ’ ) ; p:= r e g e x p r e p l a c e ( t , ’ . ∗ P l a n n i n g time : ( . ∗ ? ) ms . ∗ ’ , ’ \1 ’ ) ; a := a + ( s : : NUMERIC − p : : NUMERIC − a ) / i ; END LOOP; RETURN ROUND( a , 2 ) ; END; $$ LANGUAGE p l p g s q l ;
The procedure above is used as follows. SELECT ∗ FROM t e s t ( ’SELECT ∗ FROM employee ; ’ ) ;
Page 5
(7)
BT5110 (d) Propose and evaluate the most non-trivially inefficient version of the query “Finds the identifier and the last name of the employees earning a salary of $199170” that you can write. The marking for this question is competitive. The queries will be ranked based on their interest, their planning time and the average of the execution time minus the planning time. Present the main findings by combining the queries and performance results int one table. For example:
Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
SQL SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT
... ... ... ... ... ... ... ...
Page 6
Average Time 2.24ms ... ... ... ... ... ... ...
(5)