Ultra-fast Searching with Indexes

FILEFLEX PROGRAMMER MANUAL

CHAPTER 13
Ultra-fast Searching with Indexes

FileFlex provides a number of very powerful searching mechanisms. Each is used to accomplish a certain kind of searching task. Briefly, the FileFlex search mechanisms are:

Indexed search: An index is an ordered list of information. Searching an index involves moving instantly to a specific location in the ordered list of information. This type of search is blazingly fast.
Relational query: FileFlex 3 permits databases to be related to each other through relational joins. A relational query searches for data across databases according to the defined structure of the relation. These searches may take longer than a straight indexed search, although a special query acceleration technology (similar to FoxPro's Rushmore) can often make for exceptionally fast relational queries.
Iterative query: FileFlex implements a DBQuery function that scans one record at a time from the current record until it finds a match. Used on records in physical order, or in context with an indexed search, DBQuery allows a second level of evaluation and data search.
Full-text retrieval: FileFlex implements an interative full-text searching mechanism called DBFindMemo. This function scans through records in physical order until a memo field is found that contains the search string.

In this chapter, we discuss indexed searching.

Index Basics

FileFlex excels at index files. Letting FileFlex use indexes is like giving a prized racehorse the chance to just let it all out, throttling up a perfectly tuned race car, or punching a fighter jet into afterburner. When we say that FileFlex can locate any record in a sea of billions of records faster than the blink of an eye, we're talking FileFlex compound, compact indexes.

Whenever possible, we recommend you use indexed-based searching. Indexes work by algorithm, rather than brute force. FileFlex looks at the string you're search for, does a mathematical calculation that basically tells it how far into the file to move, and boom, it's on the record you need.

By contrast, both DBQuery and the full-text search DBFindMemo scan on a record by record basis. This means that if the data you're looking for is at the end of the file, DBQuery and DBFindMemo must individually check all the preceding records prior to finding the match. DBSeek (the interface to indexes) just does a calculation and whammo! You're on the record.

A relational query can work in either way. If the proper acceleration conditions have been set up, a relational query will use existing indexes and find data as quickly as in a straight indexed search. Otherwise, searches will iterate over the virtual, combined data file represented by the relation.

Indexes are not without their price, however. Since indexes can't check every record and rely instead on complex offset calculations, they don't support on-the-fly queries. This means that the user at search time can't enter a search expression. But if you want to pre-create a search where you find all companies beginning with "S", or all sales people in Texas earning commissions over 20%, you certainly can do it.

If you've used FileFlex 2 or earlier, you'll find that FileFlex 3 indexes are vastly more capable. Indexes are ideal for most queries you'll need. Need to find someone's address? Construct an index combining last name and first name, do a DBSeek, and--poof!--you're on the record. Need to find everyone in the Southeast Region who's booked over $1.25 million and who hasn't gotten a recent raise? Use an index (or construct a relational query).

Sort Orders and Tags

Indexes also take disk space, often quite a lot (although FileFlex 3's .CDX format takes a heck of a lot less space than FileFlex 2's .NDX format). Each index is it's own file and that file contains the complete data of the field it's indexing, as well as some overhead space used internally by FileFlex. Each index file can contain a number of sort orders, called tags (a FoxPro term).

Some history: In FileFlex 2, each sort order (field "LASTNAME", ascending order) was also it's own file, and thereby called an index. In FileFlex 3, each sort order is called a tag. Many tags can exist inside a single index file. If you're moving from FileFlex 2, what you called an index is now called a tag. And the .CDX index file is really a container holding a whole bunch of tags.

This means that instead of having a whole bunch of .NDX index files, each containing their own sort order, you can now have one .CDX file, containing a whole bunch of sort order tags. Makes for a lot less clutter.

Tags are named. If you have a sort order in an index file that is based on zip code, you can name the tag "BYZIP". You could also name it "BYGEORGE", but that would be confusing. If you have a sort order that indexes based on last name and first name, you can name the tag "LASTFIRST". This makes it easy to keep track of tags in the index file.

Automatic "Structural" Indexes

It gets even cooler. FileFlex 3 introduces the concept of automatic databases (in FoxPro, these would be called "structural databases"). An automatic database is a database (a .DBF file), that, when opened, also automatically opens its associated index file.

Let's think this through for a second. Let's assume you have a typical mailing list. You've got fields called first, last, company, street, city, state, and zip. You might want to have a variety of sort orders: BYZIP, BYLAST, BYLASTFIRST, and so forth. You'd create the database, called MAILLIST.DBF. If you use DBCreateAuto to create the mailing list database, you'd also create the index file, called MAILLIST.CDX. This would contain the sort orders we've listed.

Now, when you open the database using DBUse, you'd automatically open the index .CDX file at the same time. Accessing the database in zip code order would be a matter of just two commands (of course, normally you'd check result codes):

put DBUse("MAILLIST") into id
put DBSelectIndex("BYZIP") into tagid

It's way easier to create automatic databases and indexes using FoxPro than it is writing the DBCreateAuto function. Since FileFlex uses native FoxPro format, you can use Visual FoxPro's very adequate GUI interface to define your fields, indexes, sort orders, and the like.

Note: It is possible to use DBSeek, DBQuery, and the relational functions incorrectly and get dog-poor performance. These tools provide you with the capabilities. But it's up to you to design something that works efficiently. Just because you've got a hot database engine is no excuse for poor application design.

Index Files Supported

FileFlex supports the use and updating of high-performance FoxPro-native .CDX index files only. Other index file architectures such as dBASE-III (e.g., FileFlex 2) indexes cannot be used. However, FileFlex can reindex a file using the FoxPro index file structures. This makes it easy to use files which have been indexed using other methods.

When Indexes are Updated Automatically

Any time you have opened one or more index files related to an open database file and you make changes to that database file, FileFlex automatically updates those indexes to reflect the new file contents.

Opening and Using Index Files

Among the functions you may wish to perform on an index file are the following:

open a specific index file
check an open index file to be sure it matches the file's contents
select from among two or more open index files to make one current
close an open index file

Opening an Index File

Use the FileFlex DBUseIndex function to open an index file for use. Supply the index file's name as an argument. Assign the result of this function to a variable (usually global) because you'll need to refer to the index file's ID in other scripts and handlers. Here's an example of the use of this function:

put DBUseIndex("STARS") into StarIndex

Note that the name of the index file should include any extension the file might have. In the Macintosh environment, extensions are normally omitted, but in DOS environments, the file extension ".CDX" will be used for FoxPro-native index files.

We regularly get tech support calls from customers who get index file errors. The most common reason is that the database the index file indexes must be open and selected (DBUse and DBSelect) before executing a DBUseIndex. One good practice is to do all your DBUse and DBUseIndex calls at the start of your application's life (for instance, in a startMovie handler). Use DBUse to open a data file and assign an ID to a global variable. Then call DBUseIndex repeatedly, opening all the index files for that data file, again assigning the IDs to globals. Repeat this until all your data and index files are open.

Selecting a Sort Order Tag

You must inform FileFlex which sort order it should use at any given time. You can specify the current sort order tag with the DBSelectIndex function. Supply the name of the sort order tag as an argument:

put DBSelectIndex("BYZIP") into dbResult

Only one sort order can be active at any time. It dictates the order in which record retrieval takes place. However, all open index files are updated each time a record is added or permanently deleted. Also, as you may have guessed, you need to be sure all your tags have unique names, even across index files. There's no way to tell FileFlex to select this-and-such sort order in that-and-such index file. If STARS.CDX and PEOPLE.CDX both have a "BYZIP" tag, FileFlex will select the first one it finds.

If you're jumping between databases as well as indexes, don't forget to execute a DBSelect for the appropriate database file prior to calling DBSelectIndex on it's associated indexes.

Finding a Record by Tag

One of the main reasons for using indexes is to enable the database to find a record by the specific content of a specific field, namely, the one on which the tag is based. The other reason is to keep the file in a specific order, or at least to give the appearance of retrieving information in that specific order. Remember that FileFlex never physically reorders your data. Tags are the FileFlex equivalent of sort...but an equivalent that lets you switch between an unlimited number of sort orders instantly.

To locate a record which matches a specific expression in the indexed field, using the currently active sort order, use the FileFlex DBSeek function, supplying a parameter containing the expression to be matched in the indexed field:

put DBSeek("Fred Jones") into foundRec

The expression passed to DBSeek should contain the value for which you wish to search the indexed field. If this command fails to find the record (i.e., the search moves beyond the end of the file), the function returns a failure code. You can then retrieve the contents of the located record.

Once you have found a record that matches the index key given as an argument, subsequent uses of the function will move the current record pointer to succeeding records in the file until a failure code is returned, indicating no more records match the criterion.

Building a Seek Expression

Seek expressions can be complex. One important thing to know is that the seek expression must match the exact number of characters as the data stored. So, in the example above with Fred Jones, DBSeek would have indicated an exact match only if the field being searched was defined to contain exactly ten characters and the contents was "Fred Jones".

But what if you're not looking for a name that exactly fits the field width? The short answer is you need to pad out the string to the right number of characters. Let's assume we're searching a name field defined as 15 characters wide. We'd need the following search expression to get an exact match ('#' indicates a space character for readability purposes):

put DBSeek("Fred Jones#####") into foundRec

You can use DBSuperPadString to pad out the string to the requisite number of spaces.

One other thing to be aware of: the values in the index are case-specific. If you want to find FRED JONES, you'll need to pass FRED JONES to DBSeek. There are a whole bunch of intrinsic functions available for creating indexes, so you can create an upper-case only index.

Creating a New Index

As we've said before, the very best way to create an index and it's associated tags is to use FoxPro. But there are two other supplied functions, DBCreateIndex and DBCreateAuto that will also create index files for you.

If you have a database file open and you wish to use programming to index it on a field for which no index is presently available, you can create a new index file with the FileFlex DBCreateIndex function. This function requires four arguments, as explained below, and follows this format:

put DBCreateIndex(index-file, num-tags, tags, overwrite-flag) into dbResult

The four arguments required are:

index-file: the name of the index file to be created (including a path name if you wish to place it somewhere specific)
num-tags: the number of tags (sort orders) you want to create in the index file
tags: a container describing each tag (more on that below)
overwrite-flag: a logical value that is set to "false" if you want to overwrite an existing file of the same name if one exists, "true" if you wish to be prohibited from doing so

The tags container contains a series of "tag records". Each record is a single line in the container, in the following format (each argument separated by a comma):

tagname,index-expr,filter-expr,order,uniq-flag

The fields of a tag record are constructed as follows:

tagname: the name of the sort order (i.e., "BYLAST"). Keep it to ten alphanumeric characters or less.
index-expr: the definition of the sort order (more on expressions below)
filter-expr: the definition of how you want to select or choose data in the index; use "0" to ignore filtering
order: determines if the list is sorted in ascending or descending order. For ascending order, this parameter must be "A"; for descending, it must be "D"
uniq-flag: if "0", duplicate fields are allowed, if non-"0", duplicates are not allowed

Creating an Index Expression

The index expression is used to create the sort order key for indexes in FileFlex. The simplest index expression is simply the name of a field to sort on. For example, if you wanted to create an index in zip code order, you could use an index expression consisting of only "ZIPCODE" (the name of a field) as the index expression.

FileFlex will permit you to index on multiple fields. To create such an index, simply list the names of all the fields you wish to use in the index field group as a single string, with a plus sign concatenating them, as in "TERRITORY+SALES".

Fields should be listed from primary key to secondary key. Indexing will take place in major-minor order from left to right. In the above example, the file will end up being sorted by the field called TERRITORY and, within TERRITORY, by SALES.

Expressions can be much more powerful by adding the use of intrinsic functions. To create an index using an intrinsic function, simply include it inside the string defining the index expression, as in "UPPER(SALES)".

The above would index the field SALES, but would sort as though all the letters in the SALES field were converted to upper case.

And yes, you can actually do bizarre index expressions like:

"UPPER(LAST)+UPPER(FIRST)+STR(AGE)"

Intrinsic functions are defined in more detail in Understanding Expressions.

Creating a Filter Expression

Filter expressions are used to narrow down the data that's represented in a search. The idea is that a given piece of data must pass through the filter represented by an expression in order to be returned in a search.

How would this be used? Let's assume you want to find everyone located in New Jersey. You could create a filter expression like "STATE = 'NJ'" and then only records containing NJ would be returned by the search. Filter expressions are very much like query expressions used in DBQuery or in relational queries. The key is the expression must be able to evaluate to either true or false.

Checking an Index File for Currency

If you suspect that an index file may not be "in sync" with its database file, you can confirm its currency with the DBCheckIndex function. This function returns an error code if its contents do not match those of the database file. Otherwise, it returns the normal result code of 0.

DBCheckIndex works differently in FileFlex 3 than in previous releases. In FileFlex 3, DBCheckIndex checks ALL open index files and reports an error if any one of them isn't up to date. There isn't a way to to determine which file isn't up to date, other than closing all index files and opening one at a time to check currency.

Reindexing a File

There are times that an index file and its associated data file can become unsynchronized. You can determine if a particular index file is synchronized with the database in use by using the FileFlex DBCheckIndex function described earlier. If you find that it is not synchronized, you can use the FileFlex function DBReindex with the following syntax:

put DBReindex(indexID) into dbResult

Note that this operation uses an existing and previously opened index file and replaces it with the newly generated index.

DBReindex will not reindex an index opened as an automatic database.

From a performance perspective, it's often not wise to do a DBCheckIndex and then a DBReindex if the index is out of sync. DBCheckIndex will scan each record until it finds a mismatch, and this takes time. It's often easier just to run a DBReindex and go for a cup of coffee, go out to lunch, or go home for the night (depending on the size of the database).

Closing an Index File

Once you have opened an index file for use, you must also close it as part of cleaning up after your application. The FileFlex DBCloseIndex function closes a designated index file:

put DBCloseIndex(StarIndex) into dbResult

To close an index file use the DBCloseIndex function along with its associated ID (returned by the DBUseIndex function) to close it.

Discuss this chapter on the FileFlex Boards.

Casa de Bender