1. Introductory remarks

One of the recurrent topics in online discussions on sales forecasting and demand planning is the idea of the “one-number forecast”, that is, a common view of the future on which multiple plans and decisions can be made, from different functions of an organisation. In principle, this is yet another idea around the notion that we must break information silos within organisations. This topic has been hotly debated in practice, but academia has been somewhat silent. There are good reasons for this. I would argue that a key reason is that many colleagues in forecasting, econometrics and machine learning are predominantly focused on algorithmic and modelling research questions, rather than the organisational context within which forecasts are generated. This naturally results in a different research focus. Given my background in strategic management, I always like to ponder on the organisational aspects of forecasting – even though I admit that most of my research revolves around algorithmic and modelling questions!

Over the years I have written a lot about the benefits of multiple temporal aggregation, either in the form of MAPA or Temporal Hierarchies, in terms of achieving both accuracy gains and importantly alignment between short- and long-term forecasts, as well as allowing operational information to pass on seamlessly to strategic decision makers and vice-versa. Yet, this still leaves the cross-sectional aspect of the forecasting problem (for example, different product categories, market segments, etc.) somewhat disconnected. Keeping these two streams of forecasting research disconnected has left the so-called one-number forecast beyond our modelling capabilities and to the sphere of organisational and process design and management (see S&OP, IBP, or other various names of the idea that people should… talk – hint: works in every aspect of life!).

Over the years, with colleagues, I have approached the problem from various aspects, but I think I finally have a practical solution to this that I am happy to recommend!

2. Why bother?

So what is the limitation of traditional forecasting? Why do we need this more integrated thinking?

  • Forecasts built for different functions/decisions are typically based on different information, so they are bound to differ and to ignore much of the potentially available information.
  • Statistically speaking, given different target forecast horizons or different input information, typically a different model is more appropriate. The resulting forecasts are bound to differ as well.
  • Different functions/decisions need forecasts at different frequencies. Operational decisions (for example, inventory management) are taken at a different frequency and speed from tactical/strategic ones (for example, the location and capacity of a new warehouse), and the forecasts need to account for that.
  • Forecasts that differ will provide misaligned decisions, which will result in organisational friction, additional costs and lost opportunities.
  • Different forecasts give a lot of space for organisational politics: my forecast is better than yours! This is often resolved top-down, which eliminates important information that in principle is available to the organisations. Organisational politics and frictions are a leading reason for silos.
  • A quite simple argument: if you have many different forecasts about the same thing that do not agree, most, if not all, are wrong. (Yes, statistically speaking all forecasts are wrong, but practically speaking many are just fine and safe to use!)

What is wrong with using cross-sectional hierarchical forecasting (bottom-up, top-down, etc.) to merge forecasts together?

  • First, none of bottom-up, top-down or middle-out are the way to go. The optimal combination (or MinT) methodology is more meaningful and eliminates the need for a modelling choice that is not grounded in any theoretical understanding of the forecasting problem.
  • Cross-sectional hierarchical forecasting can indeed provide coherent forecasts (that is, forecasts on different levels, such as SKU sales and product category sales, that add up perfectly), but it does so for a single time instance. Let’s make this practical. Having coherent forecasts at the bottom level, where say weekly forecasts of SKUs are available, is meaningful. As we go to higher levels of the hierarchy, is there any value in weekly total sales of the company? More importantly, apart from the statistical convenience of such a forecast, is there any meaningful information that senior management can add on a weekly basis (or would they bother)?

What is wrong with using temporal hierarchical forecasting to merge forecasts together?

  • This is the complementary problem to cross-sectional forecasting. Now we have the temporal dynamics captured. We merge together information that is relevant for short-term forecasting, but also long-term, to gain benefits in both. However, while SKU sales at a weekly frequency are useful, annual SKU sales are much less so. At annual buckets you probably need product group or even more aggregate figures.

What is wrong with building super-models that get all the information in one go and produce outputs for everything?

  • That should sound dodgy! If this was a thing, then this blog wouldn’t really exist…
  • On a more serious note, statistically speaking this is a very challenging problem, both in terms of putting down the equations for such a model and in estimating meaningful parameters. Economics has failed repeatedly in doing this for the macro-economy, and there are well understood and good reasons why our current statistical and mathematical tools fail at that. I underline current because research is ongoing!
  • From an organisational point of view, that would require a data integration maturity, as well as almost no silos between functions and teams, so as to be able to get all the different sources of data in the model in a continuous and reliable form. Again, this is not theoretically impossible, but my experience in working with some of the leading companies in various sectors is that we are not there yet.

So getting true one-number forecasts is more difficult than one would like. Is it worth the effort?

  • If different functions/decision makers have the same view about the future, they are better informed and naturally will tend to take more aligned decisions, with all the organisational and financial benefits.
  • If such forecasts were possible, it would enable overcoming many of the organisational silos in a data-driven way. Between innovating human organisations and behaviours or statistics, it is somewhat easy to guess which one is easier!

3. How to build one-number forecasts?

Let me say upfront that:

  • It is all about bringing together the cross-sectional and temporal hierarchies. There is a benefit to this: both are mature enough to offer substantial modelling flexibility and therefore they do not restrict our forecasting toolbox, including statistics, machine learning, managerial expertise or “expert” judgement.
  • It can be done in a modular fashion, so forecasts do not need to be fully blended from the onset, but as the organisation gains more maturity, then more functions can contribute to the one-number forecast, so that we move towards the ultimate goal in practical and feasible steps.
  • Therefore, what follows can be implemented within the existing machinery of business forecasting (please don’t ask me how to do this in Excel! It can be done, but why?).
  • For anyone interested, this is the relevant paper (and references within), but quite readily I admit that papers are often not written to be… well, accessible. I hope that readers of my academic work will at least feel that I try to put some effort to make my work accessible to varying degrees of success!

3.1. The hierarchical forecasting machinery

We need to start with the basics of how to blend forecasts for different items together. Suppose we have a hierarchy of products, such as the one in Figure 1. This is a fairly generic hierarchy, which we could imagine describes sales of SKUs XX, XY, YX, YY and YZ. These can be grouped together in product groups X and Y, which in turn can be aggregated to Total sales. This hierarchy also implies that there are some coherence constraints: Total = X + Y, X = XX + XY, Y = YX + YY + YZ. This is surely true for past sales and it should be true for any forecasts.

Figure 1. Total sales can be broken into product groups X and Y, which in turn contain SKUs XX and XY, and YX, YY and YZ, respectively.

This restriction, that the coherence should be true for forecasts, is very helpful in giving us a mechanism to blend forecasts. This is not a top-down or bottom-up problem. The reason is that each level is relevant to different parts of an organisation and they have different information available. SKU level sees the detailed demand and interaction with customers, at high frequency. This is very relevant, for example, for inventory management. Product/brand level sees the aggregate demand patterns, the perception of a brand, etc., which is very relevant, for example, to marketing. Budgeting and financial operations would be very interested in the total level. All these different functions will most probably have different information sets available and should be using different models (or “expert guessing”) to build forecasts, primarily because these models are required to give different types of outputs, for different time scales. For example, inventory planners need forecasts and uncertainties to plan safety stocks and orders, typically for short horizons. Marketing needs elasticities of promotions and pricing and potentially longer-term forecasts. Financial operations need even longer horizons and forecasts expressed in monetary terms, rather than product units. Therefore, it is not just about making numbers match, but it is about bringing different organisational views together. Top-down and bottom-up fail completely on this aspect.

So how are we to bring different views together? Let us abstract the problem a bit (because using X, XX, and so on, was very specific for my taste!). The hierarchy in Figure 1 can be written as the following matrix S.
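
Working from the coherence constraints stated for Figure 1 (Total = X + Y, X = XX + XY, Y = YX + YY + YZ), S can be written out as below. The ordering is an assumption made here for illustration: rows are taken as Total, X, Y, XX, XY, YX, YY, YZ and columns as the SKUs XX, XY, YX, YY, YZ.

```latex
S =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\  % Total
1 & 1 & 0 & 0 & 0 \\  % X
0 & 0 & 1 & 1 & 1 \\  % Y
1 & 0 & 0 & 0 & 0 \\  % XX
0 & 1 & 0 & 0 & 0 \\  % XY
0 & 0 & 1 & 0 & 0 \\  % YX
0 & 0 & 0 & 1 & 0 \\  % YY
0 & 0 & 0 & 0 & 1     % YZ
\end{bmatrix}
```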

The structure of S codifies the hierarchy. Each column corresponds to a bottom level time series and each row to a node/level of the hierarchy. We place 1 when a bottom-level (column) element contributes to that level and 0 otherwise. With this structure, if one would take the SKU sales and pass them through this S (for summing) matrix, then the outcome would be the sales for all levels of the hierarchy (rows).
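
To make the summing step concrete, here is a minimal sketch in Python with NumPy. The sales figures are made up for illustration, and the row/column ordering (Total, X, Y, then the five SKUs) is an assumption rather than anything prescribed by the method.

```python
import numpy as np

# Summing matrix for the hierarchy in Figure 1.
# Rows: Total, X, Y, XX, XY, YX, YY, YZ. Columns: SKUs XX, XY, YX, YY, YZ.
S = np.array([
    [1, 1, 1, 1, 1],  # Total
    [1, 1, 0, 0, 0],  # X
    [0, 0, 1, 1, 1],  # Y
    [1, 0, 0, 0, 0],  # XX
    [0, 1, 0, 0, 0],  # XY
    [0, 0, 1, 0, 0],  # YX
    [0, 0, 0, 1, 0],  # YY
    [0, 0, 0, 0, 1],  # YZ
])

# Made-up SKU sales for one period, in the column order of S.
sku_sales = np.array([10.0, 20.0, 5.0, 15.0, 30.0])

# Passing the bottom level through S gives the sales at every node.
all_levels = S @ sku_sales
print(dict(zip(["Total", "X", "Y", "XX", "XY", "YX", "YY", "YZ"], all_levels)))
```

Replacing `sku_sales` with bottom-level forecasts gives bottom-up forecasts for the whole hierarchy with the same one-line product.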

If instead of sales we had forecasts for the bottom level, using S we can produce bottom-up forecasts for the complete hierarchy. Likewise, if we had forecasts for only the top level (Total) then we could use the mapping in S to come up with a way to disaggregate forecasts to the lower levels. A couple of paragraphs above I argued that we need forecasts at all levels. If we do that, the same S will help us understand how much our forecasts disagree: how decoherent they are. Skipping the mathematical derivations, it has been shown that the following equation can take any raw decoherent forecasts and reconcile them, by attempting to minimise the reconciliation errors, that is how much the forecasts disagree.
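
In the notation commonly used for this result, the reconciliation equation reads as follows. The symbols for the forecast vectors are assumptions introduced here: \hat{y} stacks the initial forecasts for all nodes and \tilde{y} the reconciled, coherent ones.

```latex
\tilde{\mathbf{y}} = S \, G \, \hat{\mathbf{y}}
```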

I am using matrix notation to avoid writing massive messy formulas. What the above says is: give me all your initial forecasts and I will multiply them with a matrix G that contains some combination weights, the S matrix that maps the hierarchy and I will give you back coherent forecasts. This is fairly easy if G is known. Before we go into the estimation of G there are some points useful to stress, which are typically not given enough attention in hierarchical forecasting:

  • Hierarchical forecasting is merely a forecast combination exercise, where we linearly combine (independent) forecasts of different levels.
  • Combinations of forecasts are desirable. Statistically, they typically lead to more accurate forecasts (this is why hierarchical forecasting often relates to accuracy gains), and they also substantially mitigate the model selection problem, as it is okay to get some models wrong.
  • That the forecasts can be independent is a tremendous advantage for practice. At each node/level we can produce forecasts separately, based on different information and forecasting techniques, matching the requirements as needed. Statistically speaking, independent forecasts may not be theoretically elegant, but they are certainly much simpler to specify and estimate, so quite useful for practice!
  • There is no need to aggregate/disaggregate. Hierarchical forecasting directly produces forecasts for all levels.

Let us return to the estimation of G. The formula for this is:
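
The standard form of this formula in the optimal combination (MinT) literature is given below; it is consistent with the description that follows, in that G depends only on S and on a matrix W built from the forecast errors.

```latex
G = \left( S^{\top} W^{-1} S \right)^{-1} S^{\top} W^{-1}
```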

That means that G is dependent on the map of the hierarchy in S and the forecast errors. Estimating W is not straightforward (for details and an example see this paper, section 2), but suffice to say that it accounts for the forecast errors, or in other words the quality of the forecast at each node. In a nutshell, poor forecasts will be given less weight than better forecasts. Consider the following: if all forecasts were perfect, then they would be coherent and there would be no need to reconcile them. Of course, in practice they are not, and we prefer to adjust the inaccurate forecasts more, as they are probably more responsible for the lack of coherence.
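
The reconciliation step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the estimation procedure of the paper: W is taken to be diagonal, holding a hypothetical forecast error variance per node, and both the base forecasts and the variances are made-up numbers.

```python
import numpy as np

# Summing matrix for Figure 1 (rows: Total, X, Y, XX, XY, YX, YY, YZ).
S = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
], dtype=float)

# Made-up independent base forecasts for every node. They are decoherent:
# Total is 85, yet X + Y is 80, and X is 28, yet XX + XY is 30.
y_hat = np.array([85.0, 28.0, 52.0, 10.0, 20.0, 5.0, 15.0, 30.0])

# Assumption: diagonal W with a hypothetical error variance per node.
# Nodes with larger variance (poorer forecasts) receive less weight.
err_var = np.array([16.0, 9.0, 9.0, 1.0, 2.0, 1.0, 2.0, 4.0])
W_inv = np.diag(1.0 / err_var)

# G = (S' W^-1 S)^-1 S' W^-1, computed with a linear solve instead of an explicit inverse.
G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)

# Reconciled forecasts for all nodes.
y_tilde = S @ G @ y_hat
print(np.round(y_tilde, 2))
```

Because G S is the identity, forecasts that are already coherent pass through unchanged; only the disagreement between levels is redistributed, with more of the adjustment landing on the nodes that have larger error variance.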

3.2. Cross-sectional hierarchical forecasting

This is the standard form of hierarchical forecasting that most people relate to through the bottom-up and top-down forecasting logic. In this case, G plays the role of aggregation or disaggregation weights. I am hesitant to use the forecast combination logic in this case, as we do not use forecasts from different levels of the hierarchy, but we combine forecasts from a single level only. This increases the forecasting risk, as we rely on fewer models, forecasts and less information.

Using the machinery described above one could forecast each node in the hierarchy independently (the machinery does not preclude using models capable of producing forecasts for multiple nodes simultaneously). For example, at the very disaggregate level, one might use intermittent demand forecasting methods, or other automatic univariate forecasting models, such as exponential smoothing. The reason is that in practice the bottom levels are very large, typically containing (many) thousands of time series, which need to be forecasted very frequently. Here automation and reliability are essential. At higher levels, explanatory variables may be more relevant. For instance, one could use at the higher levels of the hierarchy leading indicators from the macro-economic environment to augment the forecasts. Also at that level, there are fewer forecasts to be made, so human experts can intervene more easily. Such information may be difficult to connect to the time series at the lower levels of the hierarchy.

Using cross-sectional hierarchical forecasting the different forecasts from different nodes/levels are blended, providing typically more accurate predictions and aligned decisions. In principle, if for every node we would produce the best practically possible forecasts, the blended coherent forecasts would contain all this information, as well as providing a common view of the future. The catch is that the whole hierarchy is tied to the same time bucket. If say the lowest level is at a daily or weekly sampling frequency, so is the top level. At the very aggregate level decision making is typically slower than at the very disaggregate operational level. This mismatch makes cross-sectional hierarchical forecasting useful for some aspects, but at the same time reduces it to a statistical exercise that hopes merely for forecast accuracy improvements.

3.3. Temporal hierarchies

Temporal hierarchies use the same machinery to solve the problem across time. I have covered this topic in more detail in previous posts, so I will be brief here.  Suppose we deal with a quarterly time series. This implies a hierarchical structure as in Figure 2.

Figure 2. A time series sampled in quarterly frequency forms an implicit temporal hierarchy, where an annum is split into two semi-annual periods, which are split into two quarters each.

Of course, we can define that hierarchy for monthly, daily, etc. time series. It should be quite evident in comparing Figures 1 and 2 that we can construct a summing matrix S for Figure 2 and produce coherent forecasts as needed. In this case, we achieve temporal coherency. That is, short-term lower level forecasts are aligned with long-term top-level forecasts. In practice, high-frequency short-term decision making is informed by long-term decision making and vice-versa. Temporal hierarchies offer substantial accuracy gains, due to seeing the time series from various aggregation viewpoints, hence capture both short- and long-term dynamics, but also mitigate the problem of model selection, as they naturally force the modeller to rely on multiple forecasts.
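
As a small sketch of that construction in Python, the summing matrix for the quarterly hierarchy of Figure 2 can be assembled from three blocks. The quarterly values are made up, and the row ordering (annual, then semi-annual, then quarters) is an assumption for illustration.

```python
import numpy as np

# Temporal summing matrix for one year of quarterly data.
# Rows: Annual, Semi-annual 1, Semi-annual 2, Q1, Q2, Q3, Q4. Columns: Q1..Q4.
S_temporal = np.vstack([
    np.ones((1, 4)),                      # annual = Q1 + Q2 + Q3 + Q4
    np.kron(np.eye(2), np.ones((1, 2))),  # each semi-annual period sums two quarters
    np.eye(4),                            # the quarters themselves
])

# Made-up quarterly values for a single series.
quarterly = np.array([100.0, 120.0, 90.0, 140.0])

# All temporal aggregation levels for that year.
all_temporal_levels = S_temporal @ quarterly
print(all_temporal_levels)
```

The same reconciliation formula used for the cross-sectional case applies unchanged once this matrix takes the place of S.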

On the downside, although temporal hierarchies are a very handy statistical device for getting better quality forecasts, they do not always translate one-to-one with the relevant organisational decision making. For example, suppose that we model the daily demand of a particular SKU. Temporal hierarchies will be helpful in getting better forecasts. At the daily level and the levels close to it, decisions about stocking at stores, warehouses, etc. will be informed. As we get to the top levels, the forecasts may not relate directly to some decision. Do we need annual forecasts for a single SKU?

3.4. Cross-temporal hierarchies

The natural extension is to construct hierarchies that span both cross-sectional and temporal dimensions. Figure 3 illustrates how one could construct such a hierarchy. Each cross-sectional node (blue) contains a temporal hierarchy (yellow). Here is where things start to become complicated. Expressing this hierarchy with a summing matrix S is not straightforward!

Figure 3. A cross-temporal hierarchy. Each cross-sectional node (blue) contains a temporal hierarchy (yellow).

With colleagues we have done some work in doing exactly that, only to realise that this needs a lot more thinking than just blindly adding columns and rows to S. For small hierarchies this may be feasible, but for large realistic ones, this becomes unmanageable very fast. This is work in progress; hopefully soon enough I will have something better to say on this!

Another approach is to do this sequentially. One could first do temporal and then cross-sectional, or the other way around. It appears that by first doing temporal, the forecasting exercise becomes simpler. However, the sequential application of the hierarchical machinery does not guarantee that forecasts will remain coherent across both dimensions. In fact, unless you have perfect forecasts it is easy to demonstrate that the second reconciliation will cause decoherence of the first. Figure 4 is helpful to understand this, but also to see the way forward.

Figure 4. Cross-sectional hierarchies on different temporal levels.

The cross-sectional hierarchy, from Figure 1, will remain applicable irrespective of the temporal level it is modelled at. Cross-sectional hierarchical forecasting merely chooses one of these levels and models everything there. Suppose we would model each of these cross-sectional hierarchies. The structure captured by S will stay the same, however G will most probably not, as it depends on the characteristics of the forecasts and the time series that change (i.e., the forecasting models/methods used and the resulting forecast errors), so at each temporal level it is reasonable to expect a different cross-sectional G. This is exactly what causes the problem. Not all G’s can be true at the same time and ensure cross-temporal coherence.

The practical way forward is very simple. It both improves forecast accuracy and imposes cross-temporal coherence:

  1. Produce temporal hierarchies forecasts for each time series of the cross-sectional hierarchy.
  2. Model the cross-sectional hierarchy at all temporal levels (these are reconciled temporally already).
  3. Collect all the different G’s and calculate their element-wise average G*.
  4. Use the common G* to reconcile cross-sectionally, which by construction respects all temporal reconciliations.

The calculation is fairly trivial, although a large number of forecasts is required to be produced. Nowadays the latter is typically not an issue.
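
The four steps can be sketched as follows in Python, under loudly stated assumptions. Step 1 is replaced by a stand-in: random quarterly forecasts are summed up to semi-annual and annual, which makes each series temporally coherent without running an actual temporal hierarchy. W is assumed diagonal with made-up error variances per temporal level, and the helper name `combination_weights` is hypothetical.

```python
import numpy as np

# Cross-sectional summing matrix for Figure 1 (rows: Total, X, Y, XX, XY, YX, YY, YZ).
S = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
], dtype=float)


def combination_weights(S, err_var):
    """G = (S' W^-1 S)^-1 S' W^-1, assuming a diagonal W of error variances."""
    W_inv = np.diag(1.0 / err_var)
    return np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)


rng = np.random.default_rng(0)

# Step 1 (stand-in): temporally coherent forecasts for each of the 8 nodes.
# They agree across time within a series, but not across the nodes.
quarterly = rng.uniform(50, 150, size=(8, 4))
levels = {
    "quarterly": quarterly,
    "semi_annual": quarterly.reshape(8, 2, 2).sum(axis=2),
    "annual": quarterly.sum(axis=1, keepdims=True),
}

# Steps 2 and 3: one cross-sectional G per temporal level, then their element-wise average.
err_vars = {name: rng.uniform(1, 10, size=8) for name in levels}  # made-up variances
G_star = np.mean([combination_weights(S, err_vars[name]) for name in levels], axis=0)

# Step 4: reconcile every temporal level with the same G*.
reconciled = {name: S @ G_star @ f for name, f in levels.items()}

# Quarters still add up to the annual figure after the cross-sectional step.
print(np.allclose(reconciled["quarterly"].sum(axis=1, keepdims=True), reconciled["annual"]))
```

The reason a common G* respects the temporal reconciliation is linearity: the same matrix S G* is applied to every period at every temporal level, so sums over periods taken before and after the cross-sectional step coincide.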

4. Does it work?

A recently published paper demonstrates the accuracy gains. Without going into too much detail, as one can find all that in the linked paper, the key takeaways are:

  • The major forecast accuracy benefits come from temporal hierarchies.
  • Cross-sectional hierarchies still improve accuracy, but to a lesser extent.
  • The second hierarchical reconciliation is bound to offer fewer improvements, as the forecasts are already more accurate and therefore closer to the ideal ones that in principle are already coherent.
  • In our experiments, total accuracy gains were up to 10%. Obviously, this is dataset and setup dependent.

One may argue that this may be too much work for 10% accuracy. The strength of this argument depends on the quality of the initial base forecasts and also on the fact that the accuracy gains are a “by the way” benefit. The true purpose is to produce cross-temporally coherent forecasts. These forecasts provide the same aligned view of the future across all dimensions, so they truly represent the one-number forecast!

The forecasts are now aligned across:

  • Planning horizons: short/long, for high and low-frequency data (e.g., from hourly to yearly observations).
  • Planning units: SKUs to the most detailed level used, up to the total of the whole organisation.
  • The cross-temporal hierarchy respects the decision making needs: it provides detailed per SKU short-term high-frequency forecasts and aggregate long-term low-frequency forecasts. And these are coherent. No matter how you split or join forecasts together, they still agree.

The real benefit is that people can supplement different types of forecasts to the cross-temporal machinery. Back to the initial examples, the inventory side, the marketing side and the finance side keep on doing their work and provide their expert, model and information specific, views about the future. Crucially, this can be done in a modular fashion. An organisation does not have to go online with the whole construct simultaneously, but different functions can join step-by-step simply by revising the hierarchy to include that view.

In practice, one would use different types of models and inputs at the different parts of the cross-temporal hierarchy. For higher levels, leading indicators, other regressors and expert judgment will be helpful. At lower levels, due to their size, reliable univariate forecasts, for example based on exponential smoothing and potentially augmented by judgement, would be better suited.

5. The organisational benefits

An aligned view of the future across all levels/functions/horizons of an organisation comes with the apparent benefits for decision making. There are four more benefits that may not be apparent immediately:

  1. Break information silos the analytics way: it is not easy to change corporate structures, culture or human nature to improve communication between teams and functions. It is not easy to have colleagues who do not do forecasting for a living sit through long meetings about improving forecasts. The beauty of cross-temporal hierarchies is that forecasts can be produced independently and are subsequently weighted according to their quality. None of the views is discarded, but all are considered, with their different information base and distilled expert knowledge, and blended into the single coherent forecast. As a result, information silos are softened as all functions and teams plan on a common blended view of the future.
  2. From strategising operations to informed strategies: the traditional managerial mantra is about how to operationalise strategies, i.e. how to take top-level decisions and vision about the future to the rest of the organisation. In principle that would all be fine, if top management had transparency of operations. A single page report, a line graph or a pie chart just doesn’t cut it! Cross-temporal hierarchies allow taking into consideration both top-down and bottom-up views, both short-term objectives/needs and long-term strategies/visions. This creates data transparency. Top management can generate a view of the future and then inform it with the rest of the organisational knowledge. Operations are closer to the customer. Marketing is shaping the customer. But neither operations nor marketing have the bird’s-eye view of the board. And these are just examples.
  3. Ultra-fast decision making: welcome to a world where Artificial Intelligence decides for you. AI is not yet able to replace human decision making fully, but it is surely able to take care of many tedious decisions and do these at very large scales and very fast. It is only logical (obviously I had to use this phrase when talking about AI!) that we will see increasing use of AI to interact with customers at an increasingly high frequency. The scale can easily reach a level where it is impossible for human decision makers/planners/operators to supervise effectively. More importantly, if experts disengage from this, there is a good chance that the company will not be able to use all the expertise and experience in the (human) workforce. Cross-temporal hierarchies can help with that. AI will be able to take decisions and use data at ultra-fast frequencies. Humans do not need to follow that, as they can supplement it with their views, knowledge and information through lower-frequency decision making. Cross-temporal hierarchies will blend the two together, with AI adding additional levels to the hierarchical structure.
  4. Collaboration: thinking out of the box in a literal way. The cross-temporal machinery does not have to be restricted to a single organisation, but can encompass multiple. This way multiple units and stakeholders can share information and have a common view of the future. In the aforementioned paper, in the conclusions, we provide an example about the tourism sector, where hotel units, satellite companies and the state tourism board can all collaborate through a cross-temporal hierarchy.

Admittedly, each of the four points raised requires increasing analytics and corporate maturity. These are my views about how business will change, and my expectation is that this will happen rather quickly. Point 1 is apparent. Point 2 is necessary as employees become better skilled, better informed and better educated. If you want these people to remain part of your organisation, you can surely assume that top-down and traditional strategising operations will not be satisfactory. Point 3 is bound to happen, led by the large companies who already invest heavily in AI. But the interesting thing about AI is that its cost is reducing substantially and very fast, making it accessible to more and more organisations. Point 4 may be somewhat more contentious. What about competition between units and companies? My view is that collaborative existence is the only way forward for many small to medium size organisations, if they are to survive. How this is done, and what the involvement of larger players and the public would be, remains to be seen and is surely a topic for a different discussion! This post is already too long!

Happy forecasting!