One of the main concerns of Ingres and Vector customers whose systems need to undergo a significant change, such as major version upgrades, OS migration, major software releases, increased volumes of usage, etc, is how to properly test the database under realistic conditions in a test environment.
Some customers have heard of the Query, Recording and Playback (QRP) tool, developed by Actian Professional Services, at user meetings. The aim of the tool is to test the Ingres or Vector database in near-to production conditions by recording and then playing back a production workload in a test environment. The rest of this post describes the tool and its usage in more detail.
The Query Recording piece of QRP uses the “SC930” trace point, which produces a recording of all queries run against the database during the time the trace point was turned on. The output from SC930 is essentially a series of text files, containing the queries, parameters, datatypes, data, etc. – i.e. the load we are aiming to test against. Usage of the trace point, and details of the output format that it produces, are further described here:
This trace point – which are usually undocumented ways of amending the behaviour of Ingres and Vector – is now officially supported and is documented in the product manuals. In fact, extra syntax has been added in the latest releases to make it even easier to work with (look for “SET SERVER_TRACE” in the manuals), so you can use it yourselves for ad-hoc query analysis. For example, you can now enable query tracing only for specific users running queries against a specific database that take longer than a certain time period to run, making its use for diagnostics much easier – see here for more details for Actian X and here for Vector v5.
Using the SC930 trace point means that the query recording part of the operation can work on older versions of Ingres and Vector that don’t yet support server_trace.
Playback is a separate set of tools, written initially by the Professional Services team at Actian, described in the Actian Github repository here (with binary releases also published to Github but the leader source code maintained in SVN here), which is designed to replay each query against your test system in the same order and relative timing that it was recorded in.
“Query” in this context means any SQL statement that can be run against the Ingres or Vector database. In fact, even if the “Query” is invalid SQL, Playback will try and replay it and if it is invalid, it will fail just like it would have when it was recorded.
Playback is, in the main, made up of 2 core pieces.
- Preprocess reformats the SC930 output files so that they are split into individual sessions and every session, connection and query is ordered by a timestamp. This was done to ensure we could play each session back as a separate thread at the correct time, regardless of what else is happening to other threads.
- Preprocess is written in C# and can run with MONO or .Net environments.
- This only needs to be run once per recording.
- The output files can be run on one server and played back elsewhere.
- A basic playback job opens individual files produced by preprocess and submits a separate thread for each session in the same order and timing. The threads should then run all queries in that file in the same order and timing.
- Playback is written in C calling standard Ingres Embedded SQL library functions, which is arguably the fastest way to submit a query to an Ingres or Vector database either locally or over a network.
- This can be run as many times as you choose.
Playback can help test the consequence of adding load to the system, changes in hardware, such as reducing cores or memory, increasing data volumes or users and also taking performance baselines then testing upgrades or patches. For example, we can run playback at twice the speed to test the impact of greater contention for system resources. Greater load can unearth issues with system, application, database configuration, etc. so that these can be addressed before the cut-over.
Earlier I said the tool could re-create “near-to production conditions”, as this is never going to be perfect. Firstly, hardware inconsistencies even on servers that have apparently identical specs can cause differences. Then the database can be accessed in a variety of ways, and using a variety of languages, each with their own quirks. There are also some interesting aspects with the timing of query submission(s) in a contentious environment. Also, often the application is housed on one or many separate servers or client machines that arrive over the network in a host of ways.
Because of this, formal benchmarking that delivers consistently repeatable results is not straightforward. We have found that when running multiple playbacks of the same data files you can get different runtime results. Each session should start in the same order as the original session, at the same interval from the overall start time and each query should be in the same order as the original query and all things being equal, should start at the same interval (to the start time), however, this is rarely the case on big playback jobs.
After many hours working with the tool we believe that even very small changes, which are often out of our control, can have a significant influence on the overall runtime. Areas like a “slow” or faulty disk, shared resources in a VM environment, minor differences in memory/core speeds, etc. are all factors. Remember that the Playback queries already run in 1/10 and 1/100 of a second on powerful boxes and at times 100+ threads can be running concurrently. If a query ever gets out of sync with other threads, this can lead to unpredictable results, such as queries acting on empty tables (as the query ran before the data was loaded), or long running queries where the query started before the optimizer stats were generated. Actian recommends that 3-4 runs are performed and an average taken, when running performance tests.
Playback also has features to measure individual query performance, row counts, sessions and number of queries. The query performance stats can obviously help identify long running queries, although because of the previous paragraph, you should only focus on areas that re-occur consistently across multiple sessions or replays. A set of CSV files are produced by every playback job, which can be loaded into a spreadsheet to help drill down on this information.
Actian’s Services team has used the tool in quite a few customer engagements, and had very good results with it. One recent example is using it with a customer who was doing a platform migration project to move to a new and larger server, where we were able to use the QRP tool to configure the new server to better support the actual production workload, and hence produced a much smoother go-live experience for the users than would have otherwise been possible, and a better experience that had ever been delivered in previous upgrade projects.
In summary QRP is really useful and quite straightforward to use, though it is a Community-led project, rather than a licensed product, so some hands-on engineering will be needed (e.g. to build it from the latest source code), so you should have someone with development skills on hand during the setup. There is some good info around it on the Actian Community pages, such as pre-requisites and platform availability, which I recommend you read before attempting it.
Detailed information on Playback setup and usage, can be found here:
Of course, new features and bug fix contributions from our Community are very welcome ! Many improvements would be possible, such as building a user interface to monitor progress as it happens for instance – and I’m sure other features will suggest themselves as you use it.
If you need any in depth help then contact me or one of the team at email@example.com. We would be very happy to assist you in getting started with the tool, or in running an upgrade project using it and hence in delivering a smooth and predictable outcome.