Dear All (this is a letter I wrote to a client close to
going live with a new data warehouse),
Some general comments first, and then some specifics with
regard to our project.
Here are a few bits and pieces that I've picked up over
the years, that may be useful. The following represents my opinions
only - other people may (will!) have other notions. I would say the
following two rules apply:
1. Only featurize the product
when it is necessary i.e. too slow.
2. Otherwise don't.
How that applies to us is simple:
our priority is to get the product out of the door, and therefore anything
we can/should do at this point will necessarily involve quick-fixes and
bolt-ons. There will certainly be no design or other fundamental
changes at this point. Anyway:
- Analyze full-volume test runs
for data loads. One nifty way is to put the mapping dependencies into
Microsoft Project as dependent tasks, and then use the numbers from Run
Time Audit Browser for entry into the task time fields. Let Project do
the computations, and you get an instant critical path dependency chart
with all of the hotspots identified for you.
- Mapping default is to
run Set-based, with failover to Row-based. If you know that a given
mapping should run Set-based, set that to be the option with no
failover. The code size and auditing requirements will be cut down
considerably. If you need to log errors, see if you can use the
Set-based failover to Row-based (Target Only). This is a compromise
where a cursor will be used to rapidly extract all the data from the
source, but will then use a row-based mode to insert data into the
target, where errors are more likely to occur. This is used if you want
to run as fast as you can, but must extensively monitor the results.
- Always set Parallel Row Code
to True where possible, as this will cause a row-based mapping to be
generated as a table function i.e. a pipelined and parallelized process
that can significantly (sometimes dramatically) improve performance.
However, note the limitation on table functions below.
- In general, switch off all
auditing and logging completely. That is, Default Audit Level set to
None. No REDO logs. If it bombs, it bombs. Use Audit Levels for
debugging purposes only.
- An option for clean data with
rollback is to set Bulk Size = Commit Size = a big value (perhaps
10,000). However, if row-based is being used, Oracle's recommendation
is to leave bulk size at 50 and set commit frequency to the same.
Obviously, if rollback is disabled, there is no value to a small (or
indeed any) commit size, so if you have REDO set to NOLOGGING, don't
bother changing the defaults..
- Update-only mappings will run
row-based and are very expensive. However, you can force them set-based
by abusing the MERGE statement. Force parallelism and pipelining
wherever possible. Don't do a table scan and an subset insert, followed
by the same kind of thing for a different subset. Use multi-table
inserts instead. Try never to scan a table more than once. Be aware
of how the generated SQL affects performance (e.g. a splitter can result
in multiple SELECT statements, i.e. multiple passes of the same source
table - so use a table function instead, which is much, much faster).
Be aware that table functions only work within a single schema, not
across multiple schemas.
- Always use big complicated SQL
over lots of itty-bitty mappings from a simple physical table to a
simple physical table. There is a difference of opinion in he OWB
community over this. Nonetheless, I would contend that disk I/O costs a
hundred times more than pipelined processes in memory.
- Ensure that cost-based
optimisation is used with fully ANALYZEd tables. But never ANALYZE the
repository.
- Switch off indexes on final load targets, and
re-enable them afterwards (do not want to lots of itty-bitty
processing). If necessary and warranted by volume/load times, do a
'strip & pin' on the dimensions prior to fact load. (Strip key fields
from dimensions, strip out only current records, flip them upside down,
index them, pin them in memory, and use these for ultra-fast FK lookup
during fact table load, then re-enable the indexes.). Note that the FK
lookup must
take place, as the constraints are not enabled, and thus this is the
only way to ensure referential integrity. The degree of data
cleanliness will dictate what is practicable in terms of data load
strategy.
- Generally speaking, if
dimension tables (or vertical strips of dimension tables) are small
enough, pin them in memory all the time; it will really speed up OLAP
processing.
- For bottleneck
processes, look at the query plan/explain plan. You can run the OWB
code in SQLplus or TOAD or whatever. Add hints if appropriate (n.b. OWB
will not validate the syntax of hints). Be aware of how OWB sequences
its generated code. This is important for data loads, but perhaps even
more important when trying to resolve slow response times for OLAP
processing and reporting.
- Be willing to vertically split
tables, particularly dimension tables. Sometimes, a dimension is used
online only for selection on a few fields, key lookup, and use of its
hierarchy in OLAP work. In that case, consider splitting the dimension
into a keys-for-online table and a text-for-reporting table. You can
always look at the dimension as a whole via a View.
- Be willing to horizontally
split tables, particularly fact tables. Sometimes, a fact table
actually contains several distinct facts, which are not retrieved or
used together. There is then an overhead in skipping the non-relevant
records. These should really be in separate fact tables. If they are
required to be reported on together for summaries or aggregated reports
only, then join them in the creation of the snapshot tables, not at the
base transaction level.
- Infrastructure is vital
but often neglected. However, it's not my intention to go into any
depths on hardware or networking; a few word on database organisation
are appropriate, though. While I realise that the NT project does use
multiple schemas, it may be worth mentioning for future reference that
I've not personally needed that level of control, and have found a
single schema and five tablespaces (Temp, Dimensions, Facts, Staging,
and Summaries) to be all that is required. In direct opposition to
'received wisdom', I never split indexes and data into separate
tablespaces . . . I combine them deliberately, to avoid I/O
contention. (The split is indeed valid for OLTP, but not for
DW/BI). It is critical to note that
all of this is predicated on a massively parallel and striped disk
subsystem (e.g. a 64K stripe across 10
or 12 disks, mirrored in a RAID 1+0 configuration). Otherwise, things
will rapidly get more complex for the DBAs and their physical allocation
of tablespaces. On the other hand, with that disk configuration as a
given, then I would say that the simpler the database organisation, the
easier it is to manage. But this general comment does not apply to
partitioning in particular.
- Partitioning may indeed be complex, but it
offers one of the keys to data performance.
Broadly, all fact tables should be partitioned. 'Monster' dimensions
should probably also be partitioned. The usual way is by date (often a
week or a month), but that does not apply in this case. The other main
way is by hash, where an internal computation splits the data into
chunks, based on some field within the data. This allows for parallel
processing, and is what we did for the Australian Stock Exchange with
200 partitions. The important thing here is to choose a random a key as
possible. There are more complicated combinations available (List,
Composite), but I've never found a need for them - this is just as well,
because OWB only supports range and hash. As far as this project is
concerned, I would say that the facts have dates, therefore look good
candidates for range partitioning. The Supporters and Organisation
dimensions could be candidates for hash partitions.
- Note that Partition Exchange Loading is easy
with range partitions, but impossible with hash. (IF PEL is required,
then this will not be a problem if the split of types is as I have
suggested). I would suggest that PEL be considered. Its main advantage
is that using it avoids the need to rebuild the indexes and constraints
in the big partitioned table - data is loaded, indexes created, and
constraints are maintained in a physical staging table that is much
smaller. Once complete, the staging table and the relevant partition
are 'rotated' in one go. Each fact table load becomes independent of
all others, no matter the size of the overall fact tables. This is
much faster than working with a single big
table. Note that there are constraints: single tablespace, local
indexes, range partitioning, naming conventions, and so on. The
following are comments from the OWB user manual: "The target table must
be partitioned by one DATE column. All partitions must be created in the
same tablespace. All tables are created in the same tablespace." and
"All
new data must to be loaded into the same partition in a target table.
For example, if the target table is partitioned by day, then the daily
data should be loaded into one partition.".
An additional point is that OWB does not automatically create partitions
at runtime. All partitions used in PEL must be created outside of the
OWB environment. PEL is a trade-off. Speed versus complexity.
- Another possibility for
parallelism is the load from computer to computer across a DBLink. It's
entirely feasible to create a partitioned staging table to receive the
source data, and run multiple processes in parallel. This is one way
out of a narrow load window.
- Materialized Views and Query
Rewrite. A whole topic of its own, and one that is not relevant to the
NT project, as far as I am aware. One word of warning, however. I've
spent months researching this one aspect of Oracle data warehousing, and
would comment that while it can be extraordinary valuable, it can also
be a black hole with no easy way out.
- Disabling and
re-enabling indices has been mentioned. It can also be appropriate to
drop and re-create indexes (usually done in pre- and post-mapping
processes). Typically, this is done for smaller non-partitioned
tables. Large partitioned tables usually have local indexes, which are
constructed so that they reflect the structure of the underlying table.
They are equipartitioned with the underlying table, meaning that they
are partitioned on the same columns as the underlying table, create the
same number of partitions or subpartitions, and have the same partition
bounds as corresponding partitions of the underlying table. When the
partition strategy is date based on range, typically older partitions
(and of course their indices) get 'frozen' when they are no longer
current. This is where PEL comes into its own.
- Indexes. These are generally not required on
tables during data load, except for PK/FK lookups. Hence they would not
normally be seen very much in the staging area. They are used
extensively for slicing and dicing the data during retrieval. Hence we
see lots of them in the presentation layer. Bitmaps versus b-trees? I
like bitmaps even for data of fairly high cardinality (say a few
thousand values, maybe). Which to use when, is a judgement call.
There is a need to always examine the logs and plans. This is one of
the items that should periodically be re-examined. Indexes can be
global or local. As far as partitioned fact tables go, I only use local
indexes. I cannot think of a situation where a global index had to be
used. OWB does not currently support Index Organized Tables (IOTs).
Mostly, there is no call for them. But they are extremely useful in the
stripped dimension role, so I will build them by hand in a post-mapping
- they are
that useful in fact table loads, and in facilitating fast online access
using Discoverer or whatever.
- Some tables can be simply ignored as far as
any efficiency requirements are concerned. They should simply be pinned
in RAM and forgotten about. They are:
- TRANSACTIONS_DIM, CHANNELS_DIM, REGIONS_DIM,
FEEDBACK_REASONS_DIM, and DATE_DIM. It's entirely possible that
CAMPAIGNS_DIM should also be treated in the same way. That table
(and possibly REGIONS) should be checked when we have a 100% clean
data load, to determine how best to configure it.
- At this time, I see the two fact tables,
FEEDBACK_FACT and LOYALTY_FACT as prime candidates for Range
partitioning on Date - specifically DATE_FK.
- Additionally, I see
SUPPORTERS_DIM and CLUB_MEMBS_DIM as candidates for Hash paritioning.
Whilst SUPPORTER_ID and CLUB_PARTY_ID (the natural keys) look like
strong possibilities, I would like to know more about the
random/non-random nature of their generation.
I would suggest that given our timeframes, some or all of
the following steps may now be appropriate.
- Put the mappings into Project, while the data
migration load is being performed.
- When it is done, put the numbers into Project, and
get out the PERT chart that shows the hot spots.
- Pick 'n' mix from the various available techniques
ones most likely to deliver quick fixes.
- Test, and plug the new numbers
back in. This may well generate a very different picture of what to do
next.
- In the first instance, partition the above tables
with an initial cut. For this we shall use 10 hash partitions for the
dimension tables (giving about 2 million rows per partition - a number
that has worked well in previous experience), and monthly range
partitions for the facts. We will use the naming conventions as
described for Partition Exchange Loading. That way, it will be simple
for us to bolt-on PEL if its use is indicated as performance testing
continues to progress.
- Pin the other tables and check effect on data
load/retrieval times.
- Again, re-test and adjust
accordingly.
Regards,
Donna