Teach Yourself SQL in 21 Days, Second Edition

- Day 1 -
Introduction to SQL

A Brief History of SQL

The history of SQL begins in an IBM laboratory in San Jose, California, where SQL was developed in the late 1970s. The initials stand for Structured Query Language, and the language itself is often referred to as "sequel." It was originally developed for IBM's DB2 product (a relational database management system, or RDBMS, that can still be bought today for various platforms and environments). In fact, SQL makes an RDBMS possible. SQL is a nonprocedural language, in contrast to the procedural or third-generation languages (3GLs) such as COBOL and C that had been created up to that time.

NOTE: Nonprocedural means what rather than how. For example, SQL describes what data to retrieve, delete, or insert, rather than how to perform the operation.

The characteristic that differentiates a DBMS from an RDBMS is that the RDBMS provides a set-oriented database language. For most RDBMSs, this set-oriented database language is SQL. Set oriented means that SQL processes sets of data in groups.

Two standards organizations, the American National Standards Institute (ANSI) and the International Standards Organization (ISO), currently promote SQL standards to industry. The ANSI-92 standard is the standard for the SQL used throughout this book. Although these standard-making bodies prepare standards for database system designers to follow, all database products differ from the ANSI standard to some degree. In addition, most systems provide some proprietary extensions to SQL that extend the language into a true procedural language. We have used various RDBMSs to prepare the examples in this book to give you an idea of what to expect from the common database systems. (We discuss procedural SQL--known as PL/SQL--on Day 18, "PL/SQL: An Introduction," and Transact-SQL on Day 19, "Transact-SQL: An Introduction.")

A Brief History of Databases

A little background on the evolution of databases and database theory will help you understand the workings of SQL. Database systems store information in every conceivable business environment. From large tracking databases such as airline reservation systems to a child's baseball card collection, database systems store and distribute the data that we depend on. Until the last few years, large database systems could be run only on large mainframe computers. These machines have traditionally been expensive to design, purchase, and maintain. However, today's generation of powerful, inexpensive workstation computers enables programmers to design software that maintains and distributes data quickly and inexpensively.

Dr. Codd's 12 Rules for a Relational Database Model

The most popular data storage model is the relational database, which grew from the seminal paper "A Relational Model of Data for Large Shared Data Banks," written by Dr. E. F. Codd in 1970. SQL evolved to service the concepts of the relational database model. Dr. Codd defined 13 rules, oddly enough referred to as Codd's 12 Rules, for the relational model:

0. A relational DBMS must be able to manage databases entirely through its relational capabilities.

1. Information rule-- All information in a relational database (including table and column names) is represented explicitly as values in tables.

2. Guaranteed access--Every value in a relational database is guaranteed to be accessible by using a combination of the table name, primary key value, and column name.

3. Systematic null value support--The DBMS provides systematic support for the treatment of null values (unknown or inapplicable data), distinct from default values, and independent of any domain.

4. Active, online relational catalog--The description of the database and its contents is represented at the logical level as tables and can therefore be queried using the database language.

5. Comprehensive data sublanguage--At least one supported language must have a well-defined syntax and be comprehensive. It must support data definition, manipulation, integrity rules, authorization, and transactions.

6. View updating rule--All views that are theoretically updatable can be updated through the system.

7. Set-level insertion, update, and deletion--The DBMS supports not only set-level retrievals but also set-level inserts, updates, and deletes.

8. Physical data independence--Application programs and ad hoc programs are logically unaffected when physical access methods or storage structures are altered.

9. Logical data independence--Application programs and ad hoc programs are logically unaffected, to the extent possible, when changes are made to the table structures.

10. Integrity independence--The database language must be capable of defining integrity rules. They must be stored in the online catalog, and they cannot be bypassed.

11. Distribution independence--Application programs and ad hoc requests are logically unaffected when data is first distributed or when it is redistributed.

12. Nonsubversion--It must not be possible to bypass the integrity rules defined through the database language by using lower-level languages.

Most databases have had a "parent/child" relationship; that is, a parent node would contain file pointers to its children. (See Figure 1.1.)

Figure 1.1.
Codd's relational database management system.

This method has several advantages and many disadvantages. In its favor is the fact that the physical structure of data on a disk becomes unimportant. The programmer simply stores pointers to the next location, so data can be accessed in this manner. Also, data can be added and deleted easily. However, different groups of information could not be easily joined to form new information. The format of the data on the disk could not be arbitrarily changed after the database was created. Doing so would require the creation of a new database structure.

Codd's idea for an RDBMS uses the mathematical concepts of relational algebra to break down data into sets and related common subsets.

Because information can naturally be grouped into distinct sets, Dr. Codd organized his database system around this concept. Under the relational model, data is separated into sets that resemble a table structure. This table structure consists of individual data elements called columns or fields. A single set of a group of fields is known as a record or row. For instance, to create a relational database consisting of employee data, you might start with a table called EMPLOYEE that contains the following pieces of information: Name, Age, and Occupation. These three pieces of data make up the fields in the EMPLOYEE table, shown in Table 1.1.

Table 1.1. The EMPLOYEE table.

Name Age Occupation

Will Williams 25 Electrical engineer

Dave Davidson 34 Museum curator

Jan Janis 42 Chef

Bill Jackson 19 Student

Don DeMarco 32 Game programmer

Becky Boudreaux 25 Model

The six rows are the records in the EMPLOYEE table. To retrieve a specific record from this table, for example, Dave Davidson, a user would instruct the database management system to retrieve the records where the NAME field was equal to Dave Davidson. If the DBMS had been instructed to retrieve all the fields in the record, the employee's name, age, and occupation would be returned to the user. SQL is the language that tells the database to retrieve this data. A sample SQL statement that makes this query is

SELECT *
FROM EMPLOYEE

Remember that the exact syntax is not important at this point. We cover this topic in much greater detail beginning tomorrow.

Because the various data items can be grouped according to obvious relationships (such as the relationship of Employee Name to Employee Age), the relational database model gives the database designer a great deal of flexibility to describe the relationships between the data elements. Through the mathematical concepts of join and union, relational databases can quickly retrieve pieces of data from different sets (tables) and return them to the user or program as one "joined" collection of data. (See Figure 1.2.) The join feature enables the designer to store sets of information in separate tables to reduce repetition.

Figure 1.2.
The join feature.

Figure 1.3 shows a union. The union would return only data common to both sources.

Figure 1.3.
The union feature.

Here's a simple example that shows how data can be logically divided between two tables. Table 1.2 is called RESPONSIBILITIES and contains two fields: NAME and DUTIES.

Table 1.2. The RESPONSIBILITIES table.

Name Duties

Becky Boudreaux Smile

Becky Boudreaux Walk

Bill Jackson Study

Bill Jackson Interview for jobs

It would be improper to duplicate the employee's AGE and OCCUPATION fields for each record. Over time, unnecessary duplication of data would waste a great deal of hard disk space and increase access time for the RDBMS. However, if NAME and DUTIES were stored in a separate table named RESPONSIBILITIES, the user could join the RESPONSIBILITIES and EMPLOYEE tables on the NAME field. Instructing the RDBMS to retrieve all fields from the RESPONSIBILITIES and EMPLOYEE tables where the NAME field equals Becky Boudreaux would return Table 1.3.

Table 1.3. Return values from retrieval where NAME equals Becky Boudreaux.

Name Age Occupation Duties

Becky Boudreaux 25 Model Smile

Becky Boudreaux 25 Model Walk

More detailed examples of joins begin on Day 6, "Joining Tables."

Designing the Database Structure

The most important decision for a database designer, after the hardware platform and the RDBMS have been chosen, is the structure of the tables. Decisions made at this stage of the design can affect performance and programming later during the development process. The process of separating data into distinct, unique sets is called normalization.

Today's Database Landscape

Computing technology has made a permanent change in the ways businesses work around the world. Information that was at one time stored in warehouses full of filing cabinets can now be accessed instantaneously at the click of a mouse button. Orders placed by customers in foreign countries can now be instantly processed on the floor of a manufacturing facility. Although 20 years ago much of this information had been transported onto corporate mainframe databases, offices still operated in a batch-processing environment. If a query needed to be performed, someone notified the management information systems (MIS) department; the requested data was delivered as soon as possible (though often not soon enough).

In addition to the development of the relational database model, two technologies led to the rapid growth of what are now called client/server database systems. The first important technology was the personal computer. Inexpensive, easy-to-use applications such as Lotus 1-2-3 and Word Perfect enabled employees (and home computer users) to create documents and manage data quickly and accurately. Users became accustomed to continually upgrading systems because the rate of change was so rapid, even as the price of the more advanced systems continued to fall.

The second important technology was the local area network (LAN) and its integration into offices across the world. Although users were accustomed to terminal connections to a corporate mainframe, now word processing files could be stored locally within an office and accessed from any computer attached to the network. After the Apple Macintosh introduced a friendly graphical user interface, computers were not only inexpensive and powerful but also easy to use. In addition, they could be accessed from remote sites, and large amounts of data could be off-loaded to departmental data servers.

During this time of rapid change and advancement, a new type of system appeared. Called client/server development because processing is split between client computers and a database server, this new breed of application was a radical change from mainframe-based application programming. Among the many advantages of this type of architecture are

Reduced maintenance costs
Reduced network load (processing occurs on database server or client computer)
Multiple operating systems that can interoperate as long as they share a common network protocol
Improved data integrity owing to centralized data location

In Implementing Client/Server Computing, Bernard H. Boar defines client/server computing as follows:

Client/server computing is a processing model in which a single application is partitioned between multiple processors (front-end and back-end) and the processors cooperate (transparent to the end user) to complete the processing as a single unified task. Implementing Client/Server Computing A client/server bond product ties the processors together to provide a single system image (illusion). Shareable resources are positioned as requestor clients that access authorized services. The architecture is endlessly recursive; in turn, servers can become clients and request services of other servers on the network, and so on and so on.

This type of application development requires an entirely new set of programming skills. User interface programming is now written for graphical user interfaces, whether it be MS Windows, IBM OS/2, Apple Macintosh, or the UNIX X-Window system. Using SQL and a network connection, the application can interface to a database residing on a remote server. The increased power of personal computer hardware enables critical database information to be stored on a relatively inexpensive standalone server. In addition, this server can be replaced later with little or no change to the client applications.

A Cross-Product Language

You can apply the basic concepts introduced in this book in many environments--for example, Microsoft Access running on a single-user Windows application or SQL Server running with 100 user connections. One of SQL's greatest benefits is that it is truly a cross-platform language and a cross-product language. Because it is also what programmers refer to as a high-level or fourth-generation language (4GL), a large amount of work can be donehigher-level language 4GL (fourth-generation) language fourth-generation (4GL) language in fewer lines of code.

Early Implementations

Oracle Corporation released the first commercial RDBMS that used SQL. Although the original versions were developed for VAX/VMS systems, Oracle was one of the first vendors to release a DOS version of its RDBMS. (Oracle is now available on more than 70 platforms.) In the mid-1980s Sybase released its RDBMS, SQL Server. With client libraries for database access, support for stored procedures (discussed on Day 14, "Dynamic Uses of SQL"), and interoperability with various networks, SQL Server became a successful product, particularly in client/server environments. One of the strongest points for both of theseSQL Server powerful database systems is their scalability across platforms. C language code (combined with SQL) written for Oracle on a PC is virtually identical to its counterpart written for an Oracle database running on a VAX system.

SQL and Client/Server Application Development

The common thread that runs throughout client/server application development is the use client/server computing of SQL and relational databases. Also, using this database technology in a single-user business application positions the application for future growth.

An Overview of SQL

SQL is the de facto standard language used to manipulate and retrieve data from these relational databases. SQL enables a programmer or database administrator to do the following:

Modify a database's structure
Change system security settings
Add user permissions on databases or tables
Query a database for information
Update the contents of a database

NOTE: The term SQL can be confusing. The S, for Structured, and the L, for Language, are straightforward enough, but the Q is a little misleading. Q, of course, stands for "Query," which--if taken literally--would restrict you to asking the database questions. But SQL does much more than ask questions. With SQL you can also create tables, add data, delete data, splice data together, trigger actions based on changes to the database, and store your queries within your program or database.

Unfortunately, there is no good substitute for Query. Obviously, Structured Add Modify Delete Join Store Trigger and Query Language (SAMDJSTQL) is a bit cumbersome. In the interest of harmony, we will stay with SQL. However, you now know that its function is bigger than its name.

The most commonly used statement in SQL is the SELECT statement (see Day 2, "Introduction to the Query: The SELECT Statement"), which retrieves data from the database and returns the data to the user. The EMPLOYEE table example illustrates a typical example of a SELECT statement situation. In addition to the SELECT statement, SQL provides statements for creating new databases, tables, fields, and indexes, as well as statements for inserting and deleting records. ANSI SQL also recommends a core group of data manipulation functions. As you will find out, many database systems also have tools for ensuring data integrity and enforcing security (see Day 11, "Controlling Transactions") that enable programmers to stop the execution of a group of commands if a certain condition occurs.

Popular SQL Implementations

This section introduces some of the more popular implementations of SQL, each of which has its own strengths and weaknesses. Where some implementations of SQL have been developed for PC use and easy user interactivity, others have been developed to accommodate very large databases (VLDB). This sections introduces selected key features of some implementations.

NOTE: In addition to serving as an SQL reference, this book also contains many practical software development examples. SQL is useful only when it solves your real-world problems, which occur inside your code.

Microsoft Access

We use Microsoft Access, a PC-based DBMS, to illustrate some of the examples in this text. Access is very easy to use. You can use GUI tools or manually enter your SQL statements.

Personal Oracle7

We use Personal Oracle7, which represents the larger corporate database world, to demonstrate command-line SQL and database management techniques. (These techniques are important because the days of the standalone machine are drawing to an end, as are the days when knowing one database or one operating system was enough.) In command-line RÊl, simple stand+[cedilla]one SQL statements are entered into Oracle's SQL*Plus tool. This tool then returns data to the screen for the user to see, or it performs the appropriate action on the database.

Most examples are directed toward the beginning programmer or first-time user of SQL. We begin with the simplest of SQL statements and advance to the topics of transaction management and stored procedure programming. The Oracle RDBMS is distributed with a full complement of development tools. It includes a C++ and Visual Basic language library (Oracle Objects for OLE) that can link an application to a Personal Oracle database. It also comes with graphical tools for database, user, and object administration, as well as the SQL*Loader utility, which is used to import and export data to and from Oracle.

NOTE: Personal Oracle7 is a scaled-down version of the full-blown Oracle7 server product. Personal Oracle7 allows only single-user connections (as the name implies). However, the SQL syntax used on this product is identical to that used on the larger, more expensive versions of Oracle. In addition, the tools used in Personal Oracle7 have much in common with the Oracle7 product.

We chose the Personal Oracle7 RDBMS for several reasons:

It includes nearly all the tools needed to demonstrate the topics discussed in this book.
It is available on virtually every platform in use today and is one of the most popular RDBMS products worldwide.
A 90-day trial copy can be downloaded from Oracle Corporation's World Wide Web server (http://www.oracle.com).

Figure 1.4 shows SQL*Plus from this suite of tools.

Figure 1.4.
Oracle's SQL*Plus.

TIP: Keep in mind that nearly all the SQL code given in this book is portable to other database management systems. In cases where syntax differs greatly among different vendors' products, examples are given to illustrate these differences.

Microsoft Query

Microsoft Query (see Figure 1.5) is a useful query tool that comes packaged with Microsoft's Windows development tools, Visual C++, and Visual Basic. It uses the ODBC standard to communicate with underlying databases. Microsoft Query passes SQL statements to a driver, which processes the statements before passing them to a database system.

Figure 1.5.
Microsoft Query.

Open Database Connectivity (ODBC)

ODBC is a functional library designed to provide a common Application Programming Interface (API) to underlying database systems. It communicates with the database through a library driver, just as Windows communicates with a printer via a printer driver. Depending on the database being used, a networking driver may be required to connect to a remote database. The architecture of ODBC is illustrated in Figure 1.6.

Figure 1.6.
ODBC structure.

The unique feature of ODBC (as compared to the Oracle or Sybase libraries) is that none of its functions are database-vendor specific. For instance, you can use the same code to perform queries against a Microsoft Access table or an Informix database with little or no modification. Once again, it should be noted that most vendors add some proprietary extensions to the SQL standard, such as Microsoft's and Sybase's Transact-SQL and Oracle's PL/SQL.

You should always consult the documentation before beginning to work with a new data source. ODBC has developed into a standard adopted into many products, including Visual Basic, Visual C++, FoxPro, Borland Delphi, and PowerBuilder. As always, application developers need to weigh the benefit of using the emerging ODBC standard, which enables you to design code without regard for a specific database, versus the speed gained by using a database specific function library. In other words, using ODBC will be more portable but slower than using the Oracle7 or Sybase libraries.

SQL in Application Programming

SQL was originally made an ANSI standard in 1986. The ANSI 1989 standard (often called SQL-89) defines three types of interfacing to SQL within an application program:

Module Language-- Uses procedures within programs. These procedures can be called by the application program and can return values to the program via parameter passing.
Embedded SQL--Uses SQL statements embedded with actual program code. This method often requires the use of a precompiler to process the SQL statements. The standard defines statements for Pascal, FORTRAN, COBOL, and PL/1.
Direct Invocation--Left up to the implementor.

Before the concept of dynamic SQL evolved, embedded SQL was the most popular way to use SQL within a program. Embedded SQL, which is still used, uses static SQL--meaning that the SQL statement is compiled into the application and cannot be changed at runtime. The principle is much the same as a compiler versus an interpreter. The performance for this type of SQL is good; however, it is not flexible--and cannot always meet the needs of today's changing business environments. Dynamic SQL is discussed shortly.

The ANSI 1992 standard (SQL-92) extended the language and became an international standard. It defines three levels of SQL compliance: entry, intermediate, and full. The new features introduced include the following:

Connections to databases
Scrollable cursors
Dynamic SQL
Outer joins

This book covers not only all these extensions but also some proprietary extensions used by RDBMS vendors. Dynamic SQL allows you to prepare the SQL statement at runtime. Although the performance for this type of SQL is not as good as that of embedded SQL, it provides the application developer (and user) with a great degree of flexibility. A call-level interface, such as ODBC or Sybase's DB-Library, is an example of dynamic SQL.

Call-level interfaces should not be a new concept to application programmers. When using ODBC, for instance, you simply fill a variable with your SQL statement and call the function to send the SQL statement to the database. Errors or results can be returned to the program through the use of other function calls designed for those purposes. Results are returned through a process known as the binding of variables.

Summary

Day 1 covers some of the history and structure behind SQL. Because SQL and relational databases are so closely linked, Day 1 also covers (albeit briefly) the history and function of relational databases. Tomorrow is devoted to the most important component of SQL: the query.

Q&A

Q Why should I be concerned about SQL?

A Until recently, if you weren't working on a large database system, you probably had only a passing knowledge of SQL. With the advent of client/server development tools (such as Visual Basic, Visual C++, ODBC, Borland's Delphi, and Powersoft's PowerBuilder) and the movement of several large databases (Oracle and Sybase) to the PC platform, most business applications being developed today require a working knowledge of SQL.

Q Why do I need to know anything about relational database theory to use SQL?

A SQL was developed to service relational databases. Without a minimal understanding of relational database theory, you will not be able to use SQL effectively except in the most trivial cases.

Q All the new GUI tools enable me to click a button to write SQL. Why should I spend time learning to write SQL manually?

A GUI tools have their place, and manually writing SQL has its place. Manually written SQL is generally more efficient than GUI-written SQL. Also, a GUI SQL statement is not as easy to read as a manually written SQL statement. Finally, knowing what is going on behind the scenes when you use GUI tools will help you get the most out of them.

Q So, if SQL is standardized, should I be able to program with SQL on any databases?

A No, you will be able to program with SQL only on RDBMS databases that support SQL, such as MS-Access, Oracle, Sybase, and Informix. Although each vendor's implementation will differ slightly from the others, you should be able to use SQL with very few adjustments.

Workshop

The Workshop provides quiz questions to help solidify your understanding of the material covered, as well as exercises to provide you with experience in using what you have learned. Try to answer the quiz and exercise questions before checking the answers in Appendix F, "Answers to Quizzes and Exercises."

Quiz

1. What makes SQL a nonprocedural language?

2. How can you tell whether a database is truly relational?

3. What can you do with SQL?

4. Name the process that separates data into distinct, unique sets.

Exercise

Determine whether the database you use at work or at home is truly relational.

Casa de Bender

Name	Age	Occupation
Will Williams	25	Electrical engineer
Dave Davidson	34	Museum curator
Jan Janis	42	Chef
Bill Jackson	19	Student
Don DeMarco	32	Game programmer
Becky Boudreaux	25	Model

Name	Duties
Becky Boudreaux	Smile
Becky Boudreaux	Walk
Bill Jackson	Study
Bill Jackson	Interview for jobs