Evaluation techniques for interactive systems

Ravindu Senal Fernando
16 min read · Dec 27, 2020

What is evaluation?

Evaluation is the process of assessing and testing designs and systems to ensure that they actually behave as we expect and meet user requirements.

Evaluation should not be thought of as a single phase in the design process. Ideally, evaluation should occur throughout the design life cycle, with the results of the evaluation feeding back into modifications to the design.

A broad distinction can be made between evaluation by the designer or a usability expert, without direct involvement by users, and evaluation that studies actual use of the system. The former is particularly useful for assessing early designs and prototypes; the latter normally requires a working prototype or implementation.

Goals of evaluation

Evaluation has three main goals:

  • To assess the extent and accessibility of the system’s functionality
  • To assess users’ experience of the interaction
  • To identify any specific problems with the system.

The system’s functionality is important in that it must accord with the user’s requirements. In other words, the design of the system should enable users to perform their intended tasks more easily. This includes not only making the appropriate functionality available within the system, but making it clearly reachable by the user in terms of the actions that the user needs to take to perform the task. It also involves matching the use of the system to the user’s expectations of the task.

Evaluation through expert analysis

A number of methods have been proposed to evaluate interactive systems through expert analysis. These depend upon the designer, or a human factors expert, taking the design and assessing the impact that it will have upon a typical user. The basic intention is to identify any areas that are likely to cause difficulties because they violate known cognitive principles, or ignore accepted empirical results. These methods can be used at any stage in the development process from a design specification, through storyboards and prototypes, to full implementations, making them flexible evaluation approaches. They are also relatively cheap, since they do not require user involvement. However, they do not assess actual use of the system, only whether or not a system upholds accepted usability principles.

There are three main approaches to expert analysis:

  • Cognitive walkthrough
  • Heuristic evaluation
  • Model-based evaluation

Cognitive walkthrough

In a cognitive walkthrough, one or more evaluators work through a series of tasks and ask a set of questions from the perspective of the user.

The focus of the cognitive walkthrough is on understanding the system’s learnability for new or infrequent users. The cognitive walkthrough was originally designed as a tool to evaluate walk-up-and-use systems like postal kiosks, automated teller machines (ATMs), and interactive exhibits in museums where users would have little or no training. However, the cognitive walkthrough has been employed successfully with more complex systems like CAD software and software development tools to understand the first experience of new users.

To do a walkthrough, you need four things:

  1. A specification or prototype of the system. It doesn’t have to be complete, but it should be fairly detailed. Details such as the location and wording for a menu can make a big difference.
  2. A description of the task the user is to perform on the system. This should be a representative task that most users will want to do.
  3. A complete, written list of the actions needed to complete the task with the proposed system.
  4. An indication of who the users are and what kind of experience and knowledge the evaluators can assume about them.

At each step, evaluators try to answer the following questions:

  1. Is the effect of the action the same as the user’s goal at that point?
  2. Will users see that the action is available?
  3. Once users have found the correct action, will they know it is the one they need?
  4. After the action is taken, will users understand the feedback they get?

Heuristic evaluation

A heuristic is a guideline or general principle or rule of thumb that can guide a design decision or be used to critique a decision that has already been made.

Heuristic evaluation can be performed on a design specification so it is useful for evaluating early design. But it can also be used on prototypes, storyboards and fully functioning systems. It is therefore a flexible, relatively cheap approach. Hence it is often considered a discount usability technique.

The general idea behind heuristic evaluation is that several evaluators independently critique a system to come up with potential usability problems. It is important that there be several of these evaluators and that the evaluations be done independently.
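
Once the independent evaluations are done, the individual problem lists have to be combined. A minimal sketch of one way to do that, assuming each evaluator's findings are recorded as a set of short problem labels (the labels and the `merge_findings` helper are hypothetical, not part of any standard tool):

```python
from collections import Counter

def merge_findings(findings_by_evaluator):
    """Count how many evaluators independently reported each problem."""
    counts = Counter()
    for problems in findings_by_evaluator.values():
        counts.update(set(problems))  # set(): count a problem once per evaluator
    # Most widely reported first, then alphabetical for a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical findings from three evaluators working independently
findings = {
    "evaluator_1": {"no undo on delete", "jargon in error message"},
    "evaluator_2": {"no undo on delete", "inconsistent button labels"},
    "evaluator_3": {"no undo on delete", "jargon in error message"},
}

merged = merge_findings(findings)
for problem, n in merged:
    print(f"{n}/{len(findings)} evaluators: {problem}")
```

A problem found by only one evaluator is not necessarily unimportant; the count is just a convenient way to order the merged list for discussion.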

Model-based evaluation

A third expert-based approach is the use of models. Certain cognitive and design models provide a means of combining design specification and evaluation into the same framework.

The GOMS (goals, operators, methods and selection rules) model predicts user performance with a particular interface and can be used to filter particular design options. Similarly, lower-level modeling techniques such as the keystroke-level model provide predictions of the time users will take to perform low-level physical tasks.
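
To give a feel for how a keystroke-level prediction works, here is a minimal sketch. It assumes a task can be written as a sequence of operator codes and uses commonly cited average operator times; the exact values, the `predict_time` helper and the example task are assumptions for illustration, and real times vary with the user and the device.

```python
# Commonly cited average operator times in seconds (assumed values)
OPERATOR_TIMES = {
    "K": 0.28,  # press a key or button (average typist)
    "P": 1.10,  # point at a target with a mouse
    "H": 0.40,  # move a hand between keyboard and mouse
    "M": 1.35,  # mentally prepare for the next action
}

def predict_time(operators):
    """Sum the operator times for a sequence such as 'MHPK'."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# Hypothetical task: think, reach for the mouse, point at a menu, click,
# point at a menu item, click
menu_selection = "MHPKPK"
predicted = predict_time(menu_selection)
print(f"Predicted time: {predicted:.2f} s")
```

Two candidate designs can be compared by writing out the operator sequence for the same task in each and comparing the predicted totals.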

Evaluation through user participation

User participation in evaluation tends to occur in the later stages of development, when there is at least a working prototype of the system in place. This may range from a simulation of the system’s interactive capabilities, without its underlying functionality, to a fully implemented system.

Evaluation through user participation can take place in two kinds of setting:

  • Laboratory studies
  • Field studies

Laboratory studies

Users are taken out of their normal work environment to take part in controlled tests, often in a specialist usability laboratory.

A well-equipped usability laboratory may contain sophisticated audio/visual recording and analysis facilities, two-way mirrors, instrumented computers and the like, which cannot be replicated in the work environment. In addition, the participant operates in an interruption-free environment. However, the lack of context (for example, filing cabinets, wall calendars, books or interruptions) and the unnatural situation may mean that one accurately records a situation that never arises in the real world.

There are, however, some situations where laboratory observation is the only option, for example, if the system is to be located in a dangerous or remote location, such as a space station. Also some very constrained single-user tasks may be adequately performed in a laboratory. Finally, and perhaps most commonly, we may deliberately want to manipulate the context in order to uncover problems or observe less used procedures, or we may want to compare alternative designs within a controlled context. For these types of evaluation, laboratory studies are appropriate.

Field studies

This type of evaluation takes the designer or evaluator out into the user’s work environment in order to observe the system in action.

High levels of ambient noise, greater levels of movement and constant interruptions, such as phone calls, all make field observation difficult. However, the very ‘open’ nature of the situation means that you will observe interactions between systems and between individuals that would have been missed in a laboratory study. The context is retained and you are seeing the user in his ‘natural environment’. In addition, some activities, such as those taking days or months, are impossible to study in the laboratory.

On balance, field observation is to be preferred to laboratory studies as it allows us to study the interaction as it occurs in actual use. Even interruptions are important as these will expose behaviors such as saving and restoring state during a task. However, we should remember that even in field observations the participants are likely to be influenced by the presence of the analyst and/or recording equipment, so we always operate at a slight remove from the natural situation, a sort of Heisenberg uncertainty principle.

Empirical methods: experimental evaluation

One of the most powerful methods of evaluating a design or an aspect of a design is to use a controlled experiment. This provides empirical evidence to support a particular claim or hypothesis. It can be used to study a wide range of different issues at different levels of detail.

Any experiment has the same basic form. The evaluator chooses a hypothesis to test, which can be determined by measuring some attribute of participant behavior. A number of experimental conditions are considered which differ only in the values of certain controlled variables. Any changes in the behavioral measures are attributed to the different conditions. Within this basic form there are a number of factors that are important to the overall reliability of the experiment, which must be considered carefully in experimental design. These include the participants chosen, the variables tested and manipulated, and the hypothesis tested.


The choice of participants is vital to the success of any experiment. In evaluation experiments, participants should be chosen to match the expected user population as closely as possible. Ideally, this will involve experimental testing with the actual users but this is not always possible. If participants are not actual users, they should be chosen to be of a similar age and level of education as the intended user group. Their experience with computers in general, and with systems related to that being tested, should be similar, as should their experience or knowledge of the task domain. It is no good testing an interface designed to be used by the general public on a participant set made up of computer science undergraduates: they are simply not representative of the intended user population.

A second issue relating to the participant set is the sample size chosen. Often this is something that is determined by pragmatic considerations: the availability of participants is limited or resources are scarce. However, the sample size must be large enough to be considered to be representative of the population, taking into account the design of the experiment and the statistical methods chosen.


There are two main types of variables:

  • Those that are ‘manipulated’ or changed (independent variables)
  • Those that are measured (dependent variables)

Independent variables are those elements of the experiment that are manipulated to produce different conditions for comparison. Examples of independent variables in evaluation experiments are interface style, level of help, number of menu items and icon design.

Dependent variables, on the other hand, are the variables that can be measured in the experiment; their value is ‘dependent’ on the changes made to the independent variable. The dependent variable must be measurable in some way, it must be affected by the independent variable, and, as far as possible, unaffected by other factors. Common choices of dependent variable in evaluation experiments are the time taken to complete a task, the number of errors made, user preference and the quality of the user’s performance.


A hypothesis is a prediction of the outcome of an experiment. It is framed in terms of the independent and dependent variables, stating that a variation in the independent variable will cause a difference in the dependent variable.

The aim of the experiment is to show that this prediction is correct. This is done by disproving the null hypothesis, which states that there is no difference in the dependent variable between the levels of the independent variable. The statistical measures described below produce values that can be compared with various levels of significance.
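
As an illustration of testing against the null hypothesis, here is a minimal sketch of an exact permutation test. This is a technique swapped in because it needs no external libraries; it is not a test this article prescribes. It assumes the dependent variable is task completion time and the independent variable is the interface design, with two independent groups. The data and the `permutation_p_value` helper are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    splits = 0
    # Under the null hypothesis the group labels are interchangeable, so
    # every way of splitting the pooled data is equally likely.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits

# Hypothetical task times in seconds for two interface designs
design_a = [31, 29, 35, 33, 30]
design_b = [42, 39, 44, 40, 41]

p = permutation_p_value(design_a, design_b)
print(f"p = {p:.4f}")
```

The resulting p value is the proportion of relabelings that produce a difference at least as large as the one observed. If it falls below the chosen significance level (0.05 is a common convention), the null hypothesis is rejected. Exhaustive enumeration is only practical for small samples.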

Experimental design

In order to produce reliable and generalizable results, an experiment must be carefully designed.

The first phase in experimental design then is to choose the hypothesis. In doing this you are likely to clarify the independent and dependent variables, in that you will have identified what you are going to manipulate and what change you expect.

The next step is to decide on the experimental method that you will use. There are two main methods:

  • between-subjects
  • within-subjects.

In a between-subjects design, each participant is assigned to a different condition. There are at least two conditions: the experimental condition, in which the variable has been manipulated, and the control, which is identical to the experimental condition except for this manipulation. This control serves to ensure that it is the manipulation that is responsible for any differences that are measured. There may, of course, be more than two groups, depending on the number of independent variables and the number of levels that each variable can take.

The advantage of a between-subjects design is that any learning effect resulting from the user performing in one condition and then the other is controlled: each user performs under only one condition. The disadvantages are that a greater number of participants are required, and that significant variation between the groups can negate any results.
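
One common way to reduce the risk of systematic variation between the groups is to assign participants to conditions at random. A minimal sketch, assuming participants are identified by simple labels (the labels, the seed and the `assign_between_subjects` helper are hypothetical):

```python
import random

def assign_between_subjects(participants, conditions, seed=None):
    """Randomly split participants into near-equal groups, one per condition."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, participant in enumerate(shuffled):
        # Deal participants out in turn so group sizes differ by at most one
        groups[conditions[i % len(conditions)]].append(participant)
    return groups

participants = [f"P{n}" for n in range(1, 9)]
groups = assign_between_subjects(participants, ["control", "experimental"], seed=42)
for condition, members in groups.items():
    print(condition, members)
```

Each participant appears in exactly one group, which is what keeps learning effects from carrying over between conditions.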

The second experimental design is within-subjects (or repeated measures). Here each user performs under each different condition. This design can suffer from transfer of learning effects, but this can be lessened if the order in which the conditions are tackled is varied between users; for example, group A does the first condition followed by the second, and group B does the second condition followed by the first. Within-subjects is less costly than between-subjects, since fewer users are required, and it can be particularly effective where learning is involved. There is also less chance of effects from variation between participants.
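
Varying the order of conditions between users can be sketched as follows. This assumes a small number of conditions, so that every possible ordering can be used; the condition names and the `counterbalanced_orders` helper are hypothetical.

```python
from itertools import permutations

def counterbalanced_orders(participants, conditions):
    """Assign condition orders by cycling through every possible ordering."""
    orders = list(permutations(conditions))
    return {
        participant: orders[i % len(orders)]
        for i, participant in enumerate(participants)
    }

participants = ["P1", "P2", "P3", "P4"]
schedule = counterbalanced_orders(participants, ["menu", "command line"])
for participant, order in schedule.items():
    print(participant, "->", " then ".join(order))
```

With two conditions and an even number of participants, each order is used equally often. The number of orderings grows quickly with the number of conditions, so full counterbalancing like this only suits small designs.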

The choice of experimental method will depend on the resources available, how far learning transfer is likely or can be controlled, and how representative the participant group is considered to be. A popular compromise, in cases where there is more than one independent variable, is to devise a mixed design where one variable is placed between-groups and one within-groups.

Statistical measures

The first two rules of statistical analysis are to look at the data and to save the data. It is easy to carry out statistical tests blindly when a glance at a graph, histogram or table of results would be more instructive. In particular, looking at the data can expose outliers, single data items that are very different from the rest. Outliers are often the result of a transcription error or a freak event not connected to the experiment.
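
Alongside a graph or histogram, a quick numerical check can flag candidate outliers for a closer look. The sketch below uses the 1.5 × interquartile range rule, which is a common convention and an assumption here rather than something this article prescribes. The data are hypothetical.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical task times in seconds; 310 looks like a mistyped 31
task_times = [31, 29, 35, 33, 30, 32, 34, 310]
outliers = find_outliers(task_times)
print("Check these values against the original records:", outliers)
```

A flagged value should be traced back to the raw data before it is corrected or excluded, which is one more reason to keep that data.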

Saving the data is important, as we may later want to try a different analysis method. It is all too common for an experimenter to take some averages or otherwise tabulate results, and then throw away the original data. At worst, the remaining statistics can be useless for statistical purposes, and, at best, we have lost the ability to trace back odd results to the original data, as, for example, we want to do for outliers.

Our choice of statistical analysis depends on the type of data and the questions we want to answer. It is worth having important results checked by an experienced statistician, but in many situations standard tests can be used.

Observational techniques

A popular way to gather information about actual use of a system is to observe users interacting with it. Usually they are asked to complete a set of predetermined tasks, although, if observation is being carried out in their place of work, they may be observed going about their normal duties. The evaluator watches and records the users’ actions (using a variety of techniques, described below). Simple observation is seldom sufficient to determine how well the system meets the users’ requirements since it does not always give insight into their decision processes or attitude. Consequently users are asked to elaborate their actions by ‘thinking aloud’. In this section we consider some of the techniques used to evaluate systems by observing user behavior.

Think aloud and cooperative evaluation

Think aloud is a form of observation where the user is asked to talk through what he is doing as he is being observed; for example, describing what he believes is happening, why he takes an action, what he is trying to do.

A variation on think aloud is known as cooperative evaluation [240] in which the user is encouraged to see himself as a collaborator in the evaluation and not simply as an experimental participant. As well as asking the user to think aloud at the beginning of the session, the evaluator can ask the user questions if his behavior is unclear, and the user can ask the evaluator for clarification if a problem arises. This more relaxed view of the think aloud process has a number of advantages:

  • the process is less constrained and therefore easier to learn to use by the evaluator
  • the user is encouraged to criticize the system
  • the evaluator can clarify points of confusion at the time they occur and so maximize the effectiveness of the approach for identifying problem areas

Protocol analysis

Methods for recording user actions include the following:

  • Paper and pencil: This is primitive, but cheap, and allows the analyst to note interpretations and extraneous events as they occur.
  • Audio recording: This is useful if the user is actively ‘thinking aloud’. However, it may be difficult to record sufficient information to identify exact actions in later analysis, and it can be difficult to match an audio recording to some other form of protocol.
  • Video recording: This has the advantage that we can see what the participant is doing.
  • Computer logging: It is relatively easy to get a system automatically to record user actions at a keystroke level, particularly if this facility has been considered early in the design.
  • User notebooks: The participants themselves can be asked to keep logs of activity/problems. This will obviously be at a very coarse level: at most, records every few minutes and, more likely, hourly or less. It also gives us ‘interpreted’ records, which have advantages and problems. The technique is especially useful in longitudinal studies, and also where we want a log of unusual or infrequent tasks and problems.
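
Of the methods above, computer logging is the easiest to illustrate in code. Here is a minimal sketch of an in-memory action log, assuming the application calls `record` whenever the user does something; the `ActionLog` class and the example actions are hypothetical.

```python
import time

class ActionLog:
    """Minimal in-memory log of timestamped user actions."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()  # session start; timestamps are relative to it
        self.events = []

    def record(self, action, detail=""):
        elapsed = self._clock() - self._start
        self.events.append((round(elapsed, 3), action, detail))

log = ActionLog()
log.record("keypress", "Ctrl+S")
log.record("click", "File > Export")

for elapsed, action, detail in log.events:
    print(f"{elapsed:8.3f}s  {action:<10} {detail}")
```

Timestamps relative to the start of the session make it easier to line the log up with an audio or video recording of the same session.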

Post-task walkthroughs

Often data obtained via direct observation lack interpretation. We have the basic actions that were performed, but little knowledge as to why. Even where the participant has been encouraged to think aloud through the task, the information may be at the wrong level.

A walkthrough attempts to alleviate these problems, by reflecting the participants’ actions back to them after the event. The transcript, whether written or recorded, is replayed to the participant who is invited to comment, or is directly questioned by the analyst. This may be done straightaway, when the participant may actually remember why certain actions were performed, or after an interval, when the answers are more likely to be the participant’s post hoc interpretation. (In fact, interpretation is likely even in the former case.) The advantage of a delayed walkthrough is that the analyst has had time to frame suitable questions and focus on specific incidents. The disadvantage is a loss of freshness.

Query techniques

Query techniques can be useful in eliciting detail of the user’s view of a system. They embody the philosophy that states that the best way to find out how a system meets user requirements is to ‘ask the user’. They can be used in evaluation and more widely to collect information about user requirements and tasks. The advantage of such methods is that they get the user’s viewpoint directly and may reveal issues that have not been considered by the designer.

There are two main types of query techniques:

  • Interviews
  • Questionnaires


Interviewing users about their experience with an interactive system provides a direct and structured way of gathering information. Interviews have the advantages that the level of questioning can be varied to suit the context and that the evaluator can probe the user more deeply on interesting issues as they arise. An interview will usually follow a top-down approach, starting with a general question about a task and progressing to more leading questions (often of the form ‘why?’ or ‘what if?’) to elaborate aspects of the user’s response.

Interviews can be effective for high-level evaluation, particularly in eliciting information about user preferences, impressions and attitudes. They may also reveal problems that have not been anticipated by the designer or that have not occurred under observation. When used in conjunction with observation they are a useful means of clarifying an event.

To be more effective the interview should be planned in advance with a set of central questions prepared. Each interview is then structured around these questions. This helps to focus the purpose of the interview, which may, for instance, be to probe a particular aspect of the interaction.


An alternative method of querying the user is to administer a questionnaire. This is clearly less flexible than the interview technique, since questions are fixed in advance and it is likely that the questions will be less probing. However, it can be used to reach a wider participant group, it takes less time to administer, and it can be analyzed more rigorously. It can also be administered at various points in the design process, including during requirements capture, task analysis and evaluation, in order to get information on the user’s needs, preferences and experience.
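
Because the questions are fixed, questionnaire answers are straightforward to summarize. A minimal sketch, assuming one question answered on a five-point rating scale (1 = strongly disagree, 5 = strongly agree); the scale, the responses and the `summarize_ratings` helper are hypothetical.

```python
from collections import Counter
from statistics import median

def summarize_ratings(responses, scale=(1, 2, 3, 4, 5)):
    """Return the median rating and how many users chose each scale point."""
    counts = Counter(responses)
    return {
        "median": median(responses),
        "distribution": {point: counts.get(point, 0) for point in scale},
    }

# Hypothetical answers to "The system was easy to use"
responses = [4, 5, 3, 4, 2, 4, 5]
summary = summarize_ratings(responses)
print(summary)
```

The median and the full distribution are used here because rating-scale answers are ordered categories, and the distribution shows whether users agree with one another or are split.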

Evaluation through monitoring physiological responses

One of the problems with most evaluation techniques is that we are reliant on observation and the users telling us what they are doing and how they are feeling. What if we were able to measure these things directly? Interest has grown recently in the use of what is sometimes called objective usability testing, ways of monitoring physiological aspects of computer use. Potentially this will allow us not only to see more clearly exactly what users do when they interact with computers, but also to measure how they feel. The two areas receiving the most attention to date are eye tracking and physiological measurement.

Eye tracking for usability evaluation

Eye tracking has been possible for many years, but recent improvements in hardware and software have made it more viable as an approach to measuring usability.

Eye movements are believed to reflect the amount of cognitive processing a display requires and, therefore, how easy or difficult it is to process. So measuring not only where people look, but also their patterns of eye movement, may tell us which areas of a screen they are finding easy or difficult to understand. Eye movement measurements are based on fixations, where the eye retains a stable position for a period of time, and saccades, where there is rapid ballistic eye movement from one point of interest to another. There are many possible measurements related to usability evaluation including:

  • Number of fixations: The more fixations, the less efficient the search strategy.
  • Fixation duration: Longer fixations may indicate difficulty with a display.
  • Scan path: The sequence of fixations, which indicates areas of interest, search strategy and cognitive load.
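
These three measurements can be computed from a list of fixations once the eye tracker's output has been processed. A minimal sketch, assuming each fixation has already been labeled with the screen area it fell on and its duration in milliseconds; the data and the `fixation_metrics` helper are hypothetical.

```python
from statistics import mean

def fixation_metrics(fixations):
    """Return the fixation count and mean duration for each screen area."""
    durations_by_area = {}
    for area, duration_ms in fixations:
        durations_by_area.setdefault(area, []).append(duration_ms)
    return {
        area: {"count": len(durations), "mean_duration_ms": mean(durations)}
        for area, durations in durations_by_area.items()
    }

# Hypothetical fixations in the order they occurred: (area, duration in ms)
fixations = [
    ("menu", 180),
    ("menu", 220),
    ("search box", 450),
    ("menu", 200),
    ("search box", 510),
]

metrics = fixation_metrics(fixations)
scan_path = [area for area, _ in fixations]  # order in which areas were visited

print(metrics)
print(" -> ".join(scan_path))
```

In this made-up data the search box attracts fewer but much longer fixations than the menu, which is the kind of pattern that might prompt a closer look at that part of the display.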

Eye tracking for usability is still very new and equipment is prohibitively expensive for everyday use. However, it is a promising technique for providing insights into what really attracts the eye in website design and where problem areas are in system use. More research is needed to interpret accurately the meaning of the various eye movement measurements, as well as to develop more accessible and robust equipment.

Physiological measurements

Emotional response is closely tied to physiological changes. These include changes in heart rate, breathing and skin secretions. Measuring these physiological responses may therefore be useful in determining a user’s emotional response to an interface.

Physiological measurement involves attaching various probes and sensors to the user. These measure a number of factors:

  • Heart activity, indicated by blood pressure, volume and pulse. These may respond to stress or anger.
  • Activity of the sweat glands, indicated by skin resistance or galvanic skin response (GSR). These are thought to indicate levels of arousal and mental effort.
  • Electrical activity in muscle, measured by the electromyogram (EMG). These appear to reflect involvement in a task.
  • Electrical activity in the brain, measured by the electroencephalogram (EEG). These are associated with decision making, attention and motivation.

One of the problems with applying these measurements to interaction events is that it is not clear what the relationship between these events and measurements might be.