Is backtesting a hoax? | by Thomas Reinecke | Sep, 2022 | DataDrivenInvestor

A critical reflection on my journey to find substantial profitability in algorithmic trading

What an exciting time we live in: broker APIs that place orders within microseconds, transaction costs low enough for non-professional day trading, affordable access to years of market data, and ridiculously cheap memory and processing power. It’s a great time for every ordinary fintech nerd to go crazy on backtesting... and eventually lose a lot of money after being tempted into live trading because of the wrong conclusions drawn from it.

I’m one of them, and this is the story of my years-long hunt for an answer to the question: “What does backtesting actually mean for future trading performance?”

Photo by Isaac Smith on Unsplash

Let’s start with a little experiment: we take tick data for a well-known index future, let’s say NQ (E-mini NASDAQ 100), covering at least the past 10 years, implement a quick and dirty trading strategy, throw all of this into a backtest and pull some nice metrics out of it. I’ll shortcut this a bit for you…

NQ Backtest 2009-09-30 to 2020-09-01, PLD = 87.90 (from a self-made system)

The strategy behind this is remarkably simple: we use a 6-candle fast and a 7-candle slow Simple Moving Average (SMA) on a 5-minute intraday candle chart of the NQ future. We enter the market LONG when the fast SMA crosses the slow SMA upwards. We use a trailing stop with a static distance of, let’s say, $250 below the low of the previous (completed) candle and exit the market when the price falls through the stop. That’s it!
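The rules above can be sketched roughly as follows. This is a minimal illustration, not the author's system: the names `Candle`, `sma` and `run_crossover` are hypothetical, the stop distance is given in index points (12.5 points would correspond to $250 if one assumes a $20 point value for NQ), entries are assumed to fill at the close of the signal candle, and exits are assumed to fill at the stop or at the open if the candle gaps below it.

```python
from dataclasses import dataclass

@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float

def sma(values, n):
    """Simple moving average of the last n values, or None if there are too few."""
    if len(values) < n:
        return None
    return sum(values[-n:]) / n

def run_crossover(candles, fast=6, slow=7, stop_distance=12.5):
    """Long-only SMA crossover with a trailing stop; returns (entry, exit) price pairs."""
    closes, trades = [], []
    entry = stop = None
    prev_fast = prev_slow = None
    for c in candles:
        if entry is not None:
            if c.low <= stop:
                # Price fell through the stop that was set from earlier candles.
                # Assumption: fill at the stop, or at the open if the candle gapped below it.
                trades.append((entry, min(stop, c.open)))
                entry = stop = None
            else:
                # Trail the stop below the low of the candle that just completed.
                stop = max(stop, c.low - stop_distance)
        closes.append(c.close)
        f, s = sma(closes, fast), sma(closes, slow)
        if entry is None and None not in (f, s, prev_fast, prev_slow):
            if prev_fast <= prev_slow and f > s:
                # Fast SMA crossed above slow SMA: enter long (assumed fill at the close).
                entry = c.close
                stop = c.low - stop_distance
        prev_fast, prev_slow = f, s
    return trades
```

Note that this sketch ignores slippage and commissions entirely, which is exactly the kind of optimism the rest of this article warns about.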

This strategy generated an average profit/loss per day (PLD) of $87.90 over the past 11 years, $248,571 in total or $22,597 annually, and it used only one NQ contract at a time. There has not been a single losing year. Looks incredible, huh?

When you look at the NQ chart (red line), it becomes obvious why a simple LONG strategy was almost guaranteed to succeed: the market went up most of the time anyway.

When we look into the details, the following questions come up right away:

  1. What did we actually do with this backtest?
  2. How can we be sure we technically did the right thing?
  3. What does this past performance possibly tell us about the future? (spoiler: not much 😉)

In a backtest, your algorithm travels back to a specific point in the past of the dataset, and from there it moves forward into the relative future. Assuming you have a proper dataset (see the “Data Quality” section), you pick a start and end timestamp, iterate over your dataset (sorted by timestamp, ascending) and feed one record after the other into your algorithm, thereby walking into the relative future. Your algorithm needs logic to consume and process the data, decide on entries and exits, and record executed trades and their metrics. This topic is described in full detail in a separate article.
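As a rough sketch of that loop, assuming records are simple `(timestamp, price)` tuples and the strategy is a plain callable (both are illustrative choices, as are the names `Trade` and `backtest`):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Trade:
    entry_time: datetime
    entry_price: float
    exit_time: datetime
    exit_price: float

    @property
    def pnl(self):
        # Long-only profit or loss in price points.
        return self.exit_price - self.entry_price

def backtest(records, strategy, start, end):
    """Feed records in ascending time order into the strategy and record trades.

    strategy(timestamp, price, in_position) returns "enter", "exit" or None.
    """
    trades = []
    open_entry = None
    for ts, price in sorted(records, key=lambda r: r[0]):
        if not (start <= ts <= end):
            continue
        signal = strategy(ts, price, open_entry is not None)
        if signal == "enter" and open_entry is None:
            open_entry = (ts, price)
        elif signal == "exit" and open_entry is not None:
            trades.append(Trade(open_entry[0], open_entry[1], ts, price))
            open_entry = None
    return trades
```

The strategy only ever sees one record at a time, in order, which is what makes the walk into the relative future explicit.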

You ideally also have a way to visualize the outcomes on both a micro and a macro level, so that you can inspect a single trade including its entry and exit as well as the cumulative view of all trades to measure overall performance. However, there are a number of challenges involved with backtesting:

Data Quality

Make sure you know exactly what your data represents. Trust your source or record the data yourself, and know its characteristics. On tick level, is it really a single record for each price fixing, or is it already grouped (e.g. by second)? On daily level, does historic data come split-adjusted? Does the data have gaps? What timezone do the timestamps refer to? Are trading hours relevant to your algorithm and, if so, do you have the trading hours consistently for the past, including holidays? If the trading volume of the instrument is a challenge, do you know what the liquid trading hours are?
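Some of these questions can be checked mechanically. The following sketch covers just two of them, gaps and missing timezone information; the helper names `find_gaps` and `naive_timestamps` and the chosen threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap):
    """Return (previous, current) pairs that are further apart than max_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]

def naive_timestamps(timestamps):
    """Return timestamps that carry no timezone information at all."""
    return [ts for ts in timestamps if ts.tzinfo is None]
```

A plain gap check like this will also flag nights, weekends and holidays, so in practice it has to be combined with the trading hours of the instrument.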

Data Aggregation

If you have data on a lower, more detailed level (like ticks), the way you aggregate it to a higher level is critical. With huge volumes of data, my best experience has been to create the candles with PostgreSQL.
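To illustrate what the aggregation itself does, here is a small in-memory sketch in Python rather than PostgreSQL. It is not the author's implementation: the function name `ticks_to_candles` is hypothetical, ticks are assumed to be `(timestamp, price)` tuples, and the candle length is assumed to divide 60 minutes evenly.

```python
from datetime import datetime

def ticks_to_candles(ticks, minutes=5):
    """Aggregate (timestamp, price) ticks into OHLC candles keyed by bucket start."""
    candles = {}
    # Stable sort by timestamp keeps the original order of ticks with equal timestamps.
    for ts, price in sorted(ticks, key=lambda t: t[0]):
        bucket = ts.replace(minute=ts.minute - ts.minute % minutes,
                            second=0, microsecond=0)
        candle = candles.get(bucket)
        if candle is None:
            candles[bucket] = {"open": price, "high": price,
                               "low": price, "close": price}
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price
    return candles
```

Each candle is keyed by the start of its time bucket, so a tick at exactly 09:05:00 belongs to the 09:05 candle, not the 09:00 one. Being explicit about such boundary decisions matters, because they change the resulting highs, lows and closes.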

Using Relative Future Information Unintentionally

Whatever you do, make sure you really think through every single data point that flows into your algo and whether the information it contains was actually available at the relative moment in time when your algo uses it. This is especially a challenge with aggregated data like daily candles. Example: you backtest with daily candles, compare the current day with the previous one, distill a relationship (e.g. HIGHER) and possibly trigger an entry or exit. The challenge: at the relative timestamp when your algo inspects the current day’s candle, that candle was not actually complete, since the day was not over; it may have just started. If you don’t realize or honour this, your algo will use information from the relative future and will obviously perform better in a backtest than in reality. A pretty bad trap.
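The daily-candle trap can be made concrete with two tiny functions. Both names are hypothetical, and the "HIGHER" relationship is reduced to a comparison of daily closes purely for illustration.

```python
def higher_close_signal(daily_closes, today_index):
    """Signal for trading during day `today_index`, using only completed days."""
    if today_index < 2:
        return False
    return daily_closes[today_index - 1] > daily_closes[today_index - 2]

def higher_close_signal_lookahead(daily_closes, today_index):
    """Biased variant: reads today's close, which is unknown while the day is running."""
    if today_index < 1:
        return False
    return daily_closes[today_index] > daily_closes[today_index - 1]
```

With closes of `[100, 101, 99, 105]` and `today_index = 3`, the first function sees only 101 and 99 and stays out, while the second already "knows" that the day will close at 105 and signals an entry.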

Slippage

In theory, you expect to get a specific price on entry or exit, but in reality you get something else. The difference between what you wanted and what you actually got is called slippage, and it will rarely be to your benefit. On a future we’re mostly talking about one or two minimum ticks (on a regular trading day); if you’re trading longer term it doesn’t really hurt and you can possibly ignore it. In intraday trading, however, this is what kills most of the ideas that look so great in theory. Be conservative with slippage and factor it into your parameters. To give a concrete example:
I had configured a stop as STP-LMT (so the trade was expected to exit at a limited price), but the movement was so fast that the price rushed through and my stop was not filled. An emergency exit was performed 2 minutes later with a slippage of -565.

Trading details of an automated trade against a paper trading account (from a self-made system)
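One simple way to factor slippage into a backtest is to charge a fixed number of ticks on every entry and every exit. The sketch below assumes NQ contract values of 0.25 index points per tick and $5 per tick; the function name `net_pnl` and the default of one tick per side are illustrative assumptions, not the author's settings.

```python
NQ_TICK_SIZE = 0.25   # index points per tick (assumed contract spec)
NQ_TICK_VALUE = 5.0   # dollars per tick and contract (assumed contract spec)

def net_pnl(entry_price, exit_price, slippage_ticks=1, contracts=1):
    """Dollar P&L of a long trade after charging slippage on entry and on exit."""
    gross_ticks = (exit_price - entry_price) / NQ_TICK_SIZE
    net_ticks = gross_ticks - 2 * slippage_ticks
    return net_ticks * NQ_TICK_VALUE * contracts
```

A 10-point winner is worth $200 gross under these assumptions; with one tick of slippage per side it drops to $190, and with many small intraday trades per day that difference adds up quickly. A fixed-tick model also says nothing about fast markets like the one in the example above.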

Unstable conditions

Whatever goes into your backtest, make sure the conditions are stable and produce exactly the same result when you run the backtest on the same dataset and timeframe multiple times. If there’s any random parameter or relative timing aspect in your backtest, this should raise a red flag: you won’t get repeatable results. Another example is prediction models used for entry/exit calculation that learn and evolve over time as they consume data: your earlier backtests could become void, since the new predictor would see the world differently and thus suggest different entries and exits. Again, no repeatable results.
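If a random element cannot be avoided, pinning its seed at least makes runs repeatable. The toy function below is only meant to show that idea; `run_backtest` and its random entry filter are made up for the example.

```python
import random

def run_backtest(prices, seed):
    """Toy backtest with a random entry filter; a fixed seed makes it repeatable."""
    rng = random.Random(seed)  # local generator, independent of global random state
    pnl = 0.0
    for prev, cur in zip(prices, prices[1:]):
        if rng.random() < 0.5:  # hypothetical random filter deciding whether to trade
            pnl += cur - prev
    return pnl
```

Running it twice with the same prices and the same seed gives the same result, which is a cheap sanity check worth automating for any real backtest as well.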

Number of Parameters

When you get deep into backtesting, you may inspect not just a summary of the results but try to understand what happened and why on a specific day: why did the system behave like this, why did it produce this huge loss? If you do this, you’ll inevitably start adding parameters to your algo in the hope of solving or improving certain situations. Two things happen, though. First, your backtests become much more expensive, because every new parameter multiplies the total number of possible parameter combinations. Second, you set your strategy up for over-fitting. The more parameters, the closer you can tune to the ideal outcome and the more fragile your strategy becomes in real-world trading. As always, keep it simple, stupid: the fewer parameters, the better.
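The growth in backtest cost is easy to quantify. The parameter names and value ranges in this sketch are invented for illustration:

```python
from math import prod

# Hypothetical parameter grid for a crossover strategy.
grid = {
    "fast_sma": range(2, 21),                     # 19 values
    "slow_sma": range(5, 51, 5),                  # 10 values
    "stop_distance": [100, 150, 200, 250, 300],   # 5 values
}

def count_combinations(grid):
    """Number of backtests needed to cover every parameter combination."""
    return prod(len(values) for values in grid.values())

print(count_combinations(grid))  # 950

# One additional parameter with 10 candidate values multiplies the cost by 10.
grid["entry_hour"] = range(8, 18)
print(count_combinations(grid))  # 9500
```

Every combination is also one more chance to find a setting that fits the past by accident, which is the over-fitting side of the same coin.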

Photo by Soroush Karimi on Unsplash

Expectation vs. Risk

The risk-reward ratio measures how much reward you get for every dollar you risk. For example: with a risk-reward ratio of 1:3, you’re risking $1 to potentially make $3.

It was eye-opening when I first came across this concept while reading Van Tharp’s “Trade Your Way to Financial Freedom”. It’s a super easy concept and very powerful at the same time:

When you know (from your backtest) that the backward-looking average risk-reward ratio (R) is < 1, then no matter how much your concept claims to make for you on whatever timescale, it’s the perfect moment to become maximally skeptical. It means that on a single transaction you risk more than you expect to gain on average, which still does not necessarily mean your idea will fail. However, in my experience, R < 1 also creates pressure on congruence (see Why your great trading idea is secondary at first).

In relation to backtesting this topic is huge, and I’m planning to write a dedicated article just for it. Stay tuned. For this article, though, we consider (R) to be a potentially reasonable indicator of how meaningful a backtest might be.
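The article does not spell out a formula for (R). One common reading, used here purely as an assumption, is the average winning trade divided by the average losing trade; the function name `average_risk_reward` is hypothetical.

```python
def average_risk_reward(trade_pnls):
    """Average win divided by average loss (taken as a positive number).

    Returns None when there are no winning or no losing trades.
    """
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        return None
    return (sum(wins) / len(wins)) / (sum(losses) / len(losses))
```

For trade results of `[300, -100, 150, -200]` the average win is 225 and the average loss is 150, so this reading gives R = 1.5. Other definitions, for example reward relative to the initial stop distance of each trade, will produce different numbers.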

Profit-Loss per day

PLD measures the monetary amount a trading strategy made on average on a single day in a backtest. I focused on mini futures like YM, ES and NQ most of the time, and I used only one contract to run the backtest. I guess PLD could be called the “greed indicator”; to be fair, though, the idea was rather to keep the investment conditions stable (just one contract) and thereby get a pretty good indication of how one trading strategy differed from another. For those with a low PLD (less than $50 per day) and optimistic slippage assumptions, I knew right away that they were throw-aways, regardless of what other indicators said. However, I also observed that PLD often comes with considerable fluctuations, so another indicator was needed to tackle this.
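A minimal way to compute such a figure from a list of closed trades might look like this. The function name and the input shape are assumptions, and so is the choice to average only over days that have at least one trade; averaging over all calendar or trading days in the period would give a lower number.

```python
from collections import defaultdict
from datetime import date

def profit_loss_per_day(trades):
    """Average daily P&L from (exit_date, pnl) pairs, over days with at least one trade."""
    daily = defaultdict(float)
    for day, pnl in trades:
        daily[day] += pnl
    return sum(daily.values()) / len(daily) if daily else 0.0
```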

Linearity

To measure how steadily the portfolio grew over time, I had an algorithm calculate the absolute cumulated offset of every day compared to ideal linear growth: the smaller the result, the closer to the ideal. Instead of hunting for the highest PLD, this indicator provides a nice filter for smooth, steadily growing strategies, which may not be the best performance-wise but are psychologically the most tolerable, with the smallest possible level of surprise and drawdown. An example of relatively smooth linearity is shown here:

Backtest based on NQ for a total of 7 years with very little trading activity
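The linearity measure is described only in words, so the following is one possible interpretation rather than the author's algorithm: the ideal is assumed to be a straight line from zero to the final equity, and the indicator is the sum of the absolute daily distances between the equity curve and that line.

```python
def linearity(daily_pnls):
    """Sum of absolute daily offsets between the equity curve and a straight line
    from zero to the final equity. Smaller means closer to ideal linear growth."""
    equity = []
    total = 0.0
    for pnl in daily_pnls:
        total += pnl
        equity.append(total)
    if not equity:
        return 0.0
    slope = equity[-1] / len(equity)
    return sum(abs(e - slope * (i + 1)) for i, e in enumerate(equity))
```

Four days of +10 each give a value of 0, while making the same 40 in a single day followed by three flat days gives 60. Because the value scales with the size of the account swings, it is only comparable between strategies tested under the same conditions, such as the single contract used here.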

So with all this effort developing a trading engine that runs reliably, the investment in backtesting, the time spent defining and evaluating KPIs to produce reasonable measurements, and finally the energy to run millions of backtesting scenarios to find the strategies that fit me best, what finally happened after these were put to work?

The short answer — it was sobering!

The strategies with too many parameters proved to be hopelessly over-fitted in practical testing; OK, that was somewhat expected. However, even those with a moderate PLD and robust linearity quickly experienced drawdowns that had never occurred in years of backtesting. The most promising indication I think I found was again related to the risk-reward ratio (R): none of the strategies I have been able to develop so far had an (R) larger than 2, and most of them had (R) < 1. My assumption, which is yet to be proven, is that strategies with (R) > 2 have a much better chance of performing.

The funny part is, I tested all my strategies against a paper trading account for weeks, sometimes months, and it all looked great. When I was finally ready to enable them on my real IB account, though, they burned almost 8K in less than 4 weeks and I had to hit the emergency brake.

It has been said in many articles I have read about backtesting, and I hereby confirm it: if you think backtesting is the “holy grail” that will tell you what future performance will look like based on the past, you will most likely be disappointed. I personally would still accept the effort and cost associated with it, as I see it as an investment in my skills and in-depth understanding of the topic. My personal conclusion: backtests are the right tool to quickly test whether a trading idea ever worked in the past. You don’t need excessive combination tests for that; a few manually selected combinations of your parameters are enough to know whether it is worth pursuing or not. Another insight is the relevance of the risk-reward relationship: whatever you do with (R) < 2 is very likely to fail.

Related Articles:
Why your great trading idea is secondary at first
I can build my own Trading System!
6 Things you need to know about Future Trading
The 9 most relevant indicators for Value investors

