
Probability Theory: The Logic of Science by E. T. Jaynes

2021 Contest · February 6, 2026 · 32 min read · 7,020 words

P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.

  • Astral Codex Ten tagline.

If I told you that over the course of last year I have read Hartshorne's graduate math textbook "Algebraic Geometry" and did all the exercises in it, you would probably assume I have learned some mathematics.

If I told you that over the course of last year I have read Thích Nhất Hạnh's "The Miracle of Mindfulness", and practiced all the techniques in it, you could deduce that I have developed some applicable skills stemming from a metaphysical doctrine.

In reality I did neither of these things.

I did, however, read E. T. Jaynes's "Probability Theory: The Logic of Science" (PT:TLoS from here on) and solved (most of) the exercises.

If you have heard anything about this book, you may have expected that I have learned some mathematics, and developed some applicable skills stemming from a metaphysical doctrine. In reality, I have learned that in the 20th-century disputes concerning probability and statistics, physicists whose last names start with J were (almost) always right, and everyone else was almost always wrong. Ok, fine, I did learn some math, though mostly in pursuit of understanding an offhand remark or a solution to an exercise, often by following the crumb trail of hints and looking up references left by Jaynes in the text, and only occasionally from the text itself. As for the metaphysical indoctrination, well, there was a fair bit of that -- but one does not simply join the Bayesian Conspiracy by reading a 700+ page book. One must read a 1600+ page book at least!

On the origin of PT:TLoS.

Edwin Thompson (i.e. E.T.) Jaynes was a Ph.D. student of Eugene Wigner, the unreasonably effective Nobel laureate physicist. Wigner is reported to have later characterized Jaynes as "one of the two most under-appreciated people in physics." Jaynes's PhD thesis was on ferroelectricity, and apart from contributions to probability and statistical mechanics, he is perhaps most known for his work in quantum optics.

Jaynes defended his Ph.D. at Princeton in 1950, and then moved to Stanford. He did what one is supposed to do there: invested in a Palo Alto tech startup. Since his first field of research could be called applied classical electrodynamics, he also consulted for the startup, calculating the behavior of electrons in cavity resonators and working on magnetic resonance. Apparently this led to him buying a fairly large house -- though this was the 1950s, when normal people could have houses in Palo Alto. He bought an even larger house when he moved to Washington University in St. Louis in 1960.

In 1957 Jaynes published two papers on "Information Theory and Statistical Mechanics" concerned with formulating (Gibbs's picture of) statistical mechanics in terms of information theory, the first for classical and the second for quantum systems. At about the same time he delivered a series of lectures on "Probability Theory in Science and Engineering" at the Field Research Laboratory of the Mobil oil company. The published version of 5 of these lectures is the first draft of PT:TLoS. It includes a now-extinct section on the Gibbs model and one titled "why does statistical mechanics work?", as well as (much) briefer versions of chapters 1, 2, 4, 5, 6, 11, and 18 of PT:TLoS, for a total of about 200 typed pages overall. It also contains a "historical introduction" explaining "how it could happen that a person who is a rather strange mixture of two thirds theoretical physicist and one-third electrical engineer could grow up to ~~be a hero and a scholar~~ get really worried about the foundations of probability theory". The answer, of course, is by "trying to understand what statistical mechanics is all about and how it is related to communication theory". I'd say that it's a struggle that still goes on for many of us!

Jaynes says that "in the years 1957–1970 the lectures were repeated, with steadily increasing content, at many other universities and research laboratories." In 1974 some of this steadily increasing content was assembled into a 446-page "fragmentary edition" entitled "Probability Theory With Applications in Science and Engineering", with a stated goal of eventually having "approximately 30 Lectures" in the project. It now also included some of what would become chapters 10, 13, 19, and 22 of PT:TLoS, as well as a chapter on irreversible statistical mechanics.

Jaynes continued working on this material up until his retirement, and even more so after it. The magnum opus was woefully unfinished at the time of Jaynes's death in 1998. By 2004, its manuscript was shaped into a book by Jaynes's former graduate student Larry Bretthorst, resulting in the 727-page commentary on Bayes's theorem that we are now reviewing.

What Jaynes taught.

While no Australian fashion models seem to be available to distill the core idea of PT:TLoS into a single passage, we can get something reasonably close from Jaynes himself. Right from the start, he declares: "Our topic is the optimal processing of incomplete information", and the focus is on producing "quantitative rules for conducting inference". Note that while other frameworks might "process incomplete information" by learning hypotheses consistent with data, Jaynes is after not just good-enough processing, but the "optimal" kind. Of course, the "quantitative rules" mentioned turn out to be those of "probability theory and all of its conventional mathematics, but now viewed in a wider context than that of the standard textbooks." This is the essential content of the theorems of Richard Cox. Jaynes spends the first chapter fleshing out more precisely what the "quantitative rules for conducting inference" are and what they should look like. The second chapter is spent reproving Cox's results (i.e. that only probabilities allow us to do inference the way we would like).

With this first (but by no means last) tussle with foundations out of the way, Jaynes proceeds to develop some of the math needed for basic applications in 'direct' and 'inverse' probability. Here, by 'basic applications' I mean counting balls in urns (lest you find this boring, let me remind you that counting things in urns is not only a centuries-old pastime of probability theorists, but is essential for the functioning of any democratic society). By 'direct probability' I mean things like: if there are a hundred red and a hundred blue ~~ballots~~ balls in an urn and you draw 10 'at random', what is the probability that they are all red? That 9 of them are red and 1 is blue? Et cetera. This is 'sampling theory' and is covered in chapter 3, with the question of what 'at random' means getting some love in section 3.8.1. 'Inverse probability', on the other hand, is the old-school name for the more interesting kind of question: suppose you draw 10 balls at random from an urn containing 200 balls, and all 10 are red (this is your 'data'). How likely is it that there were 0 red balls in the urn? How about 1 red ball? How about 100? Here of course the answer depends on what we thought about the number of red balls in the urn before doing the drawing -- if we have looked in the urn just before and counted the balls directly, the drawing itself is unlikely to change our opinion about this 'prior' count. This prior is the remaining necessary ingredient. Once we have it, Bayes's theorem - the P(A|B) = [P(A)*P(B|A)]/P(B) from the tagline - finishes the job. This is the famed 'Bayesian update', producing the new, i.e. posterior, beliefs from the prior ones and the probability of the data. The innocuous-looking observation that inference requires priors is the fact that launched a thousand ships. Jaynes lists 4 "principles" for obtaining this missing ingredient (you know it's bad when there is more than one "principle", and more than two is real trouble).
He postpones further discussion to later chapters and proceeds to develop 'inverse probability' - aka hypothesis testing - assuming the prior is known somehow. Along the way, in section 4.2, we get introduced to measuring information (or 'evidence') provided by the data in decibels (which I believe Jaynes invented independently of the equivalent "decibans" of Turing and Good), and learn how to do multiple hypothesis testing in section 4.4.
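For the concretely minded, here is the urn calculation above sketched in code. The uniform prior over the red-ball count is my choice, made purely for illustration -- picking it is exactly the problem Jaynes postpones -- and the last line expresses the comparison of two particular hypotheses in Jaynes's decibel units:

```python
from math import comb, log10

# The urn from the text: N = 200 balls, n = 10 drawn without replacement,
# and all k = 10 of them turn out red. R is the unknown number of red balls.
N, n, k = 200, 10, 10

def likelihood(R):
    # hypergeometric probability of seeing k red among n draws
    if k > R or n - k > N - R:
        return 0.0
    return comb(R, k) * comb(N - R, n - k) / comb(N, n)

# Uniform prior over R = 0..200 -- an illustrative assumption only.
unnorm = [likelihood(R) / (N + 1) for R in range(N + 1)]
Z = sum(unnorm)                  # P(data): the denominator in Bayes's theorem
posterior = [u / Z for u in unnorm]

print(posterior[0])              # exactly 0.0: ten red draws rule out an all-blue urn
print(posterior[200] > posterior[100])  # True: 'all red' explains the data better

# Jaynes's decibel scale (ch. 4): evidence for 'all red' over 'half red'
evidence_db = 10 * log10(likelihood(200) / likelihood(100))
print(evidence_db > 0)           # True
```

Note how the posterior at R = 0 comes out exactly zero: the data flatly contradict that hypothesis, no matter the prior.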

With all this hard work out of the way, we get to "queer uses of probability theory", also known as the seeds of the CFAR curriculum. While non-technical, this chapter explains how to reason "in a Bayesian way" about telepathy, why the same evidence presented to different people may make their opinions diverge further, how the Bayesian nature of visual perception may explain optical illusions, how not to weigh evidence in court, and other useful things like that. "It's the priors, stupid" -- for the most part; yet the details are entertaining and sometimes illuminating.

By chapter 6 the break is over, and we return to our urns. Amid some rather mundane calculations, some inspiring things happen. Under the rubric of "effects of qualitative prior information" (here 'qualitative' means something like knowing 'who does what to whom'), Jaynes introduces what we now can recognize as rudimentary probabilistic graphical models. The question of the choice of a prior returns briefly, only to be postponed again. For the most part it is a continuation of what has gone on before.

Chapter 7, dedicated to the Gaussian distribution, is a change of pace. While mathematically interesting, at first blush it may seem purely technical. Yet there is a key question behind it: why is the Gaussian distribution so ubiquitous? Of course, mathematical reality being what it is, all good explanations are connected to each other; but the side from which one approaches the network of explanations matters both philosophically and in terms of what further ideas it generates. Here, as in many other situations, Jaynes has a favorite side.

A standard answer is commonly taught: if a number we are considering is a sum of many (sufficiently) independent random pieces, the result will be approximately Gaussian. Since many things have multiple 'small causes', this is a common situation. Mathematically, this is expressed as the central limit theorem. The mechanism that makes this work also explains why the Gaussian distribution is connected to the least squares fitting of linear models, and, more generally, illuminates why mean and variance are the only things that matter in a Gaussian distribution. Thus Jaynes's favorite explanation is reached: the Gaussian distribution is the one we would obtain if we agree that we know some random number's mean and variance, and nothing else. For example, we might be analyzing a noisy current in an electric circuit, and we know that the noise is zero on average, and also know its average power (which gives us its variance), but we don't really know much else about this noise. Then, as Jaynes promises to show later, the Gaussian is the distribution of maximum entropy subject to our knowledge, the one expressing total ignorance beyond those two values. Thus, out of a technical-sounding question in a technical-looking chapter a major theme is born: if you know something, and want to get a prior reflecting that knowledge and nothing else, look for a maximum entropy distribution compatible with this knowledge. This maximum entropy principle is one of the four principles for finding priors that Jaynes mentioned back in chapter 4, and Jaynes is widely known for advocating it. Jaynes was "a perennial participant" of the annual workshop on Maximum Entropy and Bayesian Methods, which ran from 1981 until his retirement. Several volumes of the workshop's proceedings are dedicated to Jaynes. Properties of maximum entropy distributions (at least for 'finite' situations) are explored in chapter 11.
This is also where the editing seams start to show: producing Gaussian distribution as a maximum entropy one is easy after the material in chapter 11 is absorbed, but as far as I can tell Jaynes never actually gets back to fulfilling his promise to do so, and does not even assign this as an exercise to the reader. Chapter 11 is in part II, where completeness of the text begins to decline.
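The maximum-entropy claim for the Gaussian is at least easy to check numerically in special cases. Below are the differential entropies (in nats) of three unit-variance distributions; the formulas are the standard closed forms, not anything taken from PT:TLoS:

```python
from math import e, pi, log, sqrt

# Closed-form differential entropies (in nats) of three distributions,
# all scaled to have variance 1.
h_gauss = 0.5 * log(2 * pi * e)        # N(0, 1)
h_laplace = 1 + log(2 / sqrt(2))       # Laplace(b) with 2b^2 = 1; entropy 1 + log(2b)
h_uniform = log(sqrt(12))              # Uniform of width w with w^2/12 = 1; entropy log(w)

# With mean and variance pinned down, the Gaussian has the most entropy:
print(h_gauss > h_laplace > h_uniform)  # True
```

The general statement (Gaussian maximizes entropy among all distributions with given mean and variance) is what Jaynes's promised argument would deliver; the comparison above is just a sanity check on three members of that family.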

Another one of those four principles for finding priors is "group invariance" (more properly "equivariance"). It is explored in Chapter 12 of PT:TLoS. The name hides a simple idea and a surprising complication. Here is the idea: if your setting is unchanged by some modification - and this includes your state of knowledge - then your prior should be unchanged by this modification. For example, if I don't know anything about the length of something then I don't know anything about twice its length. The modification here is stretching things by a factor of 2. Now, if I think my ignorance about these two situations should be expressed the same way mathematically - then my prior should be unchanged by the stretching. It turns out that in many situations this suffices to mostly determine the prior. In the above example of length (technically known as 'a scale parameter') I conclude that my prior needs to be such that, for any L, the probability that the length lies between L and 2L is the same as that it lies between 2L and 4L, is the same as that it lies between 4L and 8L etc. I then must conclude that the resulting prior probability density at length x is proportional to 1/x (indeed, observe: the integral of 1/x between L and 2L is the same as between 2L and 4L and so on).
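A quick numerical check of the 1/x computation above (the integration grid size is arbitrary):

```python
from math import log

def mass(a, b, steps=100_000):
    # midpoint-rule integral of the prior density 1/x over [a, b]
    h = (b - a) / steps
    return sum(h / (a + (i + 0.5) * h) for i in range(steps))

# The (unnormalized) mass between L and 2L is log(2), no matter what L is:
for L in (0.1, 1.0, 7.0, 1000.0):
    print(L, round(mass(L, 2 * L), 6))   # always about 0.693147
```

The scale-invariance is exactly what the analytic argument promised: the integral of 1/x from L to 2L is log(2L) - log(L) = log(2), independent of L.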

The surprising complication is that often this is not enough. For the simplest examples - like the 'scale' one above - this complication does not arise, but for the case of determining 'scale' and 'location' simultaneously it already does, and Jaynes gets it wrong. A deeper analysis of the situation hinges on the difference between something called 'right-invariant (Haar) measure' and 'left-invariant (Haar) measure'. A book by James Berger explains that the "correct" one to use is the right one. Jaynes certainly knows about this book, since he refers to it several times elsewhere in PT:TLoS. In his generally very positive and friendly review, Stanford statistician Persi Diaconis mentions that Jaynes has been accused of "not knowing his left from his right Haar measure". In fact, in PT:TLoS Jaynes seems wholly oblivious to the issue in the first place. His language is sufficiently imprecise to be confusing rather than enlightening -- which is doubly strange since the explanations in Berger's book are considerably clearer. I should note that in the 1974 "fragmentary edition" one can find a rather genteel "word of explanation and apology to mathematicians who may happen on this book not written for them", excusing the absence of measure-theoretic notions. Jaynes says: "I am not opposed to these things, and will gladly use and teach them as soon as I find one specific real application where they are needed." In PT:TLoS the rejection of the modern mathematical toolkit continues unabated, but any tone of apology is gone. Perhaps, at least in Chapter 12, some measure theory could have been useful after all.

But enough of that. All of this "inference" business is about what to think, and who cares about that. We want to know what to do! Thus, we need decision theory. The shift in focus from inference to decision gives an occasion for some discoursing on British vs. American priorities in life. This is particularly amusing given that the main credit for decision theory goes to the Hungarian mathematician Abraham Wald, of the "it's the missing bullet hole locations that you need to worry about" fame. (Wald's dramatic life story is second perhaps only to that of Alexander Grothendieck in its Hollywood potential.) Wald's decision theory proceeds by assigning to each possible action (say: buy, sell) some utility, dependent on the true state of the world (say, the price tomorrow). The recommended action is then the one that maximizes the expected utility, 'expected' meaning averaged over your beliefs about the true state of the world (i.e. today's beliefs about tomorrow's price). That is, ignoring transaction costs: buy if the expected utility of tomorrow's price is higher than the utility of today's price, and sell otherwise. (Of course the economists, being naturally dismal, talk about minimizing loss - or cost - rather than maximizing utility.)
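A minimal sketch of the buy/sell rule, with made-up numbers for the beliefs and with utility taken to be the price itself (both are my illustrative assumptions, not Wald's or Jaynes's):

```python
# Beliefs about tomorrow's price: a toy discrete distribution.
today = 100.0
belief = {90.0: 0.3, 100.0: 0.4, 120.0: 0.3}   # P(tomorrow's price)

def expected_utility(action):
    if action == "buy":   # you end up holding tomorrow's price
        return sum(p * price for price, p in belief.items())
    else:                 # "sell": you lock in today's price
        return today

# Wald's rule: pick the action with the highest expected utility.
best = max(["buy", "sell"], key=expected_utility)
print(best, expected_utility("buy"))  # buy, since E[price] = 103 > 100
```

With a concave utility of money instead of the price itself, the same beliefs could flip the decision -- which is the point of making the utility explicit.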

This may sound trivial, but that's because, under Jaynes's influence, we are already talking in the language of beliefs about the true state of the world -- what a statistician may call a distribution of the model parameter, something which is not really allowed in the 'orthodox' or 'frequentist' approach to statistics. Instead, a frequentist might be concerned with a 'decision procedure' or 'strategy' based on some data, i.e. some process that takes in data and spits out the action to take. This procedure should not be too wild, and what "not too wild" means is formalized by Wald and is given the name "admissible". (Jaynes seems to interpret 'admissible' as 'good' and proceeds to rail against this term by providing some not-so-good admissible strategies; I think simply interpreting 'admissible' as 'not obviously stupid' would've ameliorated that particular pet peeve.) This sets up a triumph of Bayesianism: many years after starting the study of admissible strategies, Wald proved that they are all basically Bayesian. That is, they are equivalent to starting with some prior 'beliefs about the true state of the world', updating them based on the data - via Bayes's theorem, of course - and then choosing the action that maximizes expected utility. Moreover, in the case where the "decision" is actually estimating a parameter, by varying your utility/loss function and applying the above strategy, you may recover some standard estimators, such as taking the posterior mean, or taking the maximum of the posterior, of which classical maximum likelihood is a special case. Jaynes rightly points out that the shape of the loss function can change the decision quite drastically: in deciding between cutting your hair too short or too long, one type of error is much less costly than the other; the cost of various errors in a 'William Tell-type scenario' is even further from the usual models.
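This loss-function dependence is easy to demonstrate on a toy posterior (the distribution below is invented for illustration): squared-error loss picks out the posterior mean, absolute-error loss the posterior median, and zero-one loss the posterior mode.

```python
# A toy discrete posterior over a parameter taking values 1..5.
posterior = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.25, 5: 0.15}

# Squared-error loss -> posterior mean
mean = sum(x * p for x, p in posterior.items())

# Absolute-error loss -> posterior median
cum, median = 0.0, None
for x in sorted(posterior):
    cum += posterior[x]
    if cum >= 0.5:
        median = x
        break

# Zero-one loss -> posterior mode (the MAP estimate)
mode = max(posterior, key=posterior.get)

print(mean, median, mode)  # 3.15, 3, 3
```

Here the three estimators nearly coincide because the posterior is mild; with a skewed or asymmetric posterior (or an asymmetric loss, as in the haircut example) they come apart.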

With this - essentially final - layer of theory, we are ready for some applications. One application Jaynes considers is distinguishing the signal from the noise. And he does mean signal - an electrical one, in volts (it is probably that "one-third engineer" in Jaynes speaking). Another application is deciding what widgets to produce in an imaginary widget factory. While the first, simpler, task is arguably more important, it is the latter that is more revealing of both Jaynes's process and its flaws. The analysis is fine - great even - when taken on its own, but there are no sanity checks and no robustness analysis. If I actually had a widget factory, I would not act based on this whole thing, at least not before hiring someone to vary the model and see how it flexes.

An important issue that remains is how to summarize beliefs about the world. Imagine I have a coin. I may say that the probability that it will land heads on the next toss is half, but this does not capture all my beliefs about the coin. Perhaps I have personally forged the coin to be fair, or perhaps I have never seen it before in my life. Now, imagine I see it be tossed and come up heads 10 times in a row. What would be my prediction of the next toss now? In the first case it is still pretty close to 50-50 ('pretty close' rather than 'exactly' because maybe my manufacturing process was flawed, or maybe the throws were rigged, or maybe I'm insane - all good reasons for me to hedge against being overconfident). In the second case I might start to suspect that the coin is not fair, and adjust my forecast accordingly. The question before us is how to account for this difference. Jaynes takes this up in chapter 18, and essentially invents a two-level hierarchical Bayesian model. It goes roughly like this: First, for each p, I consider the world in which the coin is biased to land heads with probability p, and judge how likely it is that I am in that world. I collect all these judgments into what Jaynes calls "the A_p distribution", A_p being his notation for the statement 'the coin is biased to land heads with probability p'. Then, once the results of the coin flips come in, I update this A_p distribution using Bayes's theorem. The difference between the two scenarios above ('personally forged' vs 'never seen before') is in the initial shape of the distribution for A_p. The 'I forged this coin' initial distribution has a high peak near p=0.5, while the 'this is just some coin' one is more spread out. They both average to 0.5, which is why they both, before any data comes in, lead me to predict 50-50 for the next toss. However, the first one is a more confident prior and is less susceptible to change based on new evidence. 
(Incidentally, if our initial distribution for A_p is in the Beta family, then updating it is particularly easy to do, which is what makes section 18.5 work out.) One thing to note here is that we are now talking about something like 'probabilities of probabilities', and this is not what Jaynes discussed when setting up the whole "extension of logic" business. In fact, I agree with the contention that 'logic' in 'the logic of science' is to be, at least initially, understood as propositional calculus, since this is actually what Cox's theorems deal with. Finding probabilistic extensions of predicate (and higher-order) logic seems to be the subject of some current research. Whether this has some bearing on Bayesianism as "a complete theory of formal rationality" is a question slightly too philosophical for my usual tastes.
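To make the Beta-family remark concrete, here is the two-scenario coin calculation sketched in code. Using Beta(50, 50) as a stand-in for the 'I forged this coin' prior is my invention, chosen only because it is sharply peaked near p = 0.5; Beta(1, 1) is the flat 'some coin I've never seen' prior.

```python
# Conjugate updating: a Beta(a, b) distribution over p, after seeing
# h heads and t tails, becomes Beta(a + h, b + t). The predictive
# probability of heads on the next toss is the posterior mean.
def predictive(a, b, heads, tails):
    return (a + heads) / (a + b + heads + tails)

# Both priors average to 0.5, so both predict 50-50 before any data.
# After 10 heads in a row they diverge:
forged = predictive(50, 50, 10, 0)    # confident prior barely moves
unknown = predictive(1, 1, 10, 0)     # flat prior swings hard
print(round(forged, 3), round(unknown, 3))  # 0.545 0.917
```

The confident prior moved from 0.5 to about 0.545, while the ignorant one jumped to about 0.917 -- exactly the difference between the two scenarios in the text.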

All of this is no doubt very thrilling: I mean, we are "only" solving the question of how one should reason - and act! - in the world. We call it 'inference' just to keep the excitement down and keep philosopher-logicians off our backs. But it is not nearly as much fun as the numerous polemical tirades against "the orthodoxy", be it of the Fisher, Pearson, or Feller patriarchate.

À la recherche du temps perdu.

Chapters 8, 16, and 17 give some account of - and Jaynesian commentary on - classical statistics. These did not appear in the earlier drafts, which were more focused on expounding Jaynes's own theories. In PT:TLoS, Jaynes spends quite a bit of time in remembrance of things past, recounting his disagreements with classical statisticians. Their "orthodoxy" is described in terms of its "pathology" and "folly". Jaynes's main charge is that their methods are "ad hoc" - a phrase that appears 47 times in PT:TLoS. For a book whose chief aim is to develop systematic rules of inference, this is probably not surprising.

If one were to pick out a single antagonist in the PT:TLoS it would have to be Sir Ronald Aylmer Fisher. One could say that Fisher was a geneticist and a statistician. Or, one could say that he was "the greatest of Darwin’s successors" and "the single most important figure in 20th century statistics". Bradley Efron (another Stanford statistician) writes that "one difficulty in assessing the importance of Fisherian statistics is that it’s hard to say just what it is. Fisher had an amazing number of important ideas and some of them, like randomization inference and conditionality, are contradictory. It’s a little as if in economics Marx, Adam Smith and Keynes turned out to be the same person."

Among the many charges Jaynes lays at Fisher is that of establishing statistics as a collection of (ad hoc!) recipes for analyzing data. In Jaynes's view, Fisher's cookbooks (primarily "Statistical Methods for Research Workers", but also "The Design of Experiments") established the situation in which a scientist was to follow the recipes, but was not to question the reasoning behind these recipes. Then, as per Jaynes:

Whenever a real scientific problem arose that was not covered by the published recipes, the scientist was expected to consult a professional statistician for advice on how to analyze his data, and often on how to gather them as well. There developed a statistician–client relationship rather like the doctor–patient one, and for the same reason. If there are simple unifying principles (as there are today in the theory we are expounding), then it is easy to learn them and apply them to whatever problem one has; each scientist can become his own statistician. But in the absence of unifying principles, the collection of all the empirical, logically unrelated procedures that a data analyst might need, like the collection of all the logically unrelated medicines and treatments that a sick patient might need, was too large for anyone but a dedicated professional to learn.

Jaynes's statements that "deep change in the sociology of science – the relationship between scientist and statistician – is now underway" and that "each scientist involved in data analysis can be his own statistician" seem premature. My impression is that basic courses in applied statistics are routinely taught without even attempting to impart much conceptual understanding, and for many scientists doing your own statistics is still dangerously close to rolling your own crypto. Be that as it may, hardly anyone can be against getting scientists to understand the statistics they are practicing. According to Jaynes, one of the earliest attempts to do this is the 1939 "Theory of Probability" by (future Sir) Harold Jeffreys.

This book is perhaps the most direct prior influence on Jaynes and on PT:TLoS - which is, after all, "dedicated to the memory of Sir Harold Jeffreys, who saw the truth and preserved it". In Jaynes's telling, Jeffreys "was buried under an avalanche of criticism which simply ignored his mathematical demonstrations and substantive results and attacked his ideology". Jaynes writes:

We need to recognize that a large part of their differences arose from the fact that Fisher and Jeffreys were occupied with very different problems. Fisher studied biological problems, where one had no prior information and no guiding theory (this was long before the days of the DNA helix), and the data taking was very much like drawing from Bernoulli’s urn. Jeffreys studied problems of geophysics, where one had a great deal of cogent prior information and a highly developed guiding theory (all of Newtonian mechanics giving the theory of elasticity and seismic wave propagation, plus the principles of physical chemistry and thermodynamics), and the data taking procedure had no resemblance to drawing from an urn. Fisher, in his cookbook defines statistics as the study of populations; Jeffreys devotes virtually all of his analysis to problems of inference where there is no population.

But just in case you had any doubt whose side he is on, Jaynes then adds:

What Fisher was never able to see is that, from Jeffreys’ viewpoint, Fisher’s biological problems were trivial, both mathematically and conceptually.

Them's fightin' words!

Incidentally, Jaynes does not deny Fisher's mathematical cleverness, and credits him with having a "deep intuitive multidimensional space intuition", which allowed him to calculate many sampling distributions for the first time. But Jaynes points out that "just before starting to produce those results, Fisher spent a year (1912–1913) as assistant to the theoretical physicist Sir James Jeans, who was then preparing the second edition of his book on kinetic theory and worked daily on calculations with high-dimensional multivariate Gaussian distributions". Yes, even these stem from a physicist whose last name starts with J!

A secondary antagonist is William Feller, the author of "the most successful treatise on probability ever written". He is also accused by Jaynes of being too clever - and thus being able to get away with not doing things systematically. According to Jaynes, Feller's readers "get the impression that: (1) probability theory has no systematic methods; it is a collection of isolated, unrelated clever tricks, each of which works on one problem but not on the next one; (2) Feller was possessed of superhuman cleverness; (3) only a person with such cleverness can hope to find new useful results in probability theory". The unstated implication here is that we should doubt all three. As an illustration of "clever tricks" Jaynes chooses the following problem:

Peter and Paul toss a coin alternately starting with Peter, and the one who first tosses 'heads' wins. What are the probabilities p, p' for Peter or Paul to win? The direct, systematic computation would sum (1/2)^n over the odd and even integers:

p = Σ_{n=0}^{∞} (1/2)^(2n+1) = 2/3,          p' = Σ_{n=1}^{∞} (1/2)^(2n) = 1/3.

The clever trick notes instead that Paul will find himself in Peter’s shoes if Peter fails to win on the first toss: ergo, p' = p/2, so p = 2/3, p' = 1/3.

The "ergo, p' = p/2" is saying that Paul will win precisely when [Peter does not win immediately] and [Paul wins, given that Peter does not win immediately]. The probability of the first clause is 1/2, and that of the second is p (since, after Peter tosses a tail, Paul's situation is the same as that of Peter at the start of the game); ergo, p' =(1/2)*p= p/2. One can also solve this problem by saying that for Peter to win, he needs to either do so immediately, or later - after the first two tosses come up tails and the game effectively begins anew. In math, this says that p=(1/2)+(1/2*1/2)*p. Here, 1/2 is the probability of Peter's immediate win, (1/2*1/2) is the probability of two first tosses being tails, and p is the probability of Peter winning once the game starts anew, after the first two tails are tossed.
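Both routes are easy to check: the self-similarity equation can be solved directly, and a simulation agrees (a quick sketch, with heads modeled as random.random() < 0.5):

```python
import random

# 1) The self-similarity equation from the text: p = 1/2 + (1/4) p.
p = 0.5 / (1 - 0.25)   # rearranging gives p = (1/2)/(3/4) = 2/3
print(p)

# 2) A Monte Carlo check (seed fixed for reproducibility):
random.seed(0)
trials = 100_000
peter_wins = 0
for _ in range(trials):
    peters_turn = True
    while random.random() >= 0.5:    # tails: pass the coin to the other player
        peters_turn = not peters_turn
    peter_wins += peters_turn        # heads ended the game on someone's turn
print(peter_wins / trials)           # close to 2/3
```

The while loop is exactly the 'system moving between states' view: the game sits in one of two states (whose turn it is) until an absorbing toss of heads.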

Of course, Jaynes himself can do things that are clever. His dexterity with generating functions, transform methods, and asymptotic expansions, among other things, can appear magical to those not trained as applied mathematicians or physicists. But the irony here is that this "Peter and Paul problem" is exactly the wrong example to use for complaining about “isolated clever tricks and gamesmanship”. In fact, thinking about a game as a system moving between states, and analyzing how likely it is to reach certain "goal states" is basically using Markov chain theory! It is one of the most common methods for solving probability problems, well connected to other key things in probability and statistics.

This "Peter and Paul" mishap serves as an illustration of a deeper point: once well understood, many clever tricks become powerful methods, much more powerful indeed than straightforward but uninspiring computations. I agree with Jaynes in calling for “general mathematical techniques which will work not only on our present problem, but on hundreds of others”. Trouble is, a 'general technique' may solve a given problem, but not explain what is going on (mathematician Paul Zeitz calls this "How vs. Why"). At the same time, a clever trick may lead to a better general theory, closer to answering the 'why' question. I am arguing not for "gamesmanship", but for bringing the game to the next level.

There are many other things Jaynes has to say about "orthodox" statistics and statisticians. One such volley is aimed at Jerzy Neyman, who had an argument with Jeffreys. In this argument, Jaynes says, "Jeffreys is clearly right". This affirmation is the only reason I see for bringing up this episode in the book, since the actual nature of the dispute is not given explicitly. What is my reason for bringing this up in the review? Well, having read the relevant parts of the original sources, I can report that Jeffreys was clearly wrong. I encourage you to discuss whether I am wrong that Jaynes is wrong that Neyman is wrong in the comments.

In Persi Diaconis's review I mentioned earlier, he calls PT:TLoS "wonderfully out of date", saying that "the wonderful part is that Jaynes discusses and points to dozens of papers from the 1950s through the 1980s that have slipped off the map." A noticeable fraction of this pointing is, in fact, pointing fingers at people doing things wrong. The abundance of these sidetracks forces the reader to either mostly ignore them or to follow up on them. Both strategies are admissible, and I have found the second one quite rewarding, but it makes reading PT:TLoS seem like walking through a garden of forking paths.

Paradox lost.

Jaynes's opinions are of course not limited to statistics. He has things to say about set theory, measure theory, the infinite, Kolmogorov's axiomatization of probability, generalized functions, Gödel's incompleteness, and so on. Jaynes says that "we shall find ourselves defending Kolmogorov against his critics on many technical points". I was glad to see this, but not because I think Kolmogorov needs defending. Rather, it signaled to me that Jaynes's math will be mostly right. Despite this, after I read Appendix B, it became clear to me that on the subject of modern mathematics Jaynes and I don't really see eye to eye. This appendix contains most of Jaynes's attack against modern mathematical formalism, but remarks of similar nature can be found in multiple places in PT:TLoS. I view finding the right language and level of generality, and using mathematical rigor to remold uninformed intuition, as some of the essential goals that underpin modern mathematical developments. Unfortunately, they seem to be either ignored altogether, or viewed as unimportant hindrances by Jaynes. He also insists that using modern techniques can produce nonsense. In fact, one can produce nonsense even without any modern math, simply by not being careful. Perhaps you have seen how one can use the sum S = 1 - 1 + 1 - 1 + ... to show that 0 = 1: namely, S = (1-1) + (1-1) + ... = 0, but S = 1 + (-1+1) + (-1+1) + ... = 1. You should not interpret this as saying that using arithmetic can produce nonsense. Rather, you should learn that unjustified manipulations like that are dangerous - as indeed they are. That's precisely why mathematicians have thought long and hard about how one can work with such infinite series without running into problems. They developed multiple sophisticated and precise theories about this, some of which are now taught in the 'sequences and series' part of courses on rigorous mathematical analysis. Ignoring what these theories say is what leads to apparent paradoxes.
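One such rigorous theory is Cesàro summation, and a few lines of numerical experiment (my own illustration, not from the book) show why the series above has no ordinary sum, yet can still be assigned a sensible value:

```python
# Partial sums of Grandi's series 1 - 1 + 1 - 1 + ... never settle down,
# but their running average (the Cesaro mean) converges to 1/2 -- the value
# that the rigorous theory of Cesaro summation assigns to the series.
partial_sums = []
s = 0
for n in range(1000):
    s += (-1) ** n           # terms are +1, -1, +1, -1, ...
    partial_sums.append(s)

print(partial_sums[:6])      # [1, 0, 1, 0, 1, 0] -- oscillates forever

# Cesaro mean: the average of the partial sums.
cesaro = sum(partial_sums) / len(partial_sums)
print(cesaro)                # 0.5
```

The partial sums bounce between 1 and 0, which is exactly why the "S = 0" and "S = 1" manipulations are both unjustified; the theory tells you which operations are legitimate and what value, if any, the series can be assigned.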

The situation with paradoxes of probability theory is quite similar. This might explain why chapter 15, "Paradoxes of probability theory", was not all I hoped for. A true mathematical paradox would be a pair of contradictory conclusions derived from standard axioms. So far, no such thing has been found. Thus, any presently known "paradox" must be of some other nature. Roughly speaking, there are three common types. There are true statements that subvert naive intuition (a la Banach-Tarski paradox), there are faulty demonstrations (like Achilles and the tortoise), and there are arguments that reveal a deficiency of terminology or definitions (such as Russell's paradox). Alas, many of the "paradoxes" in PT:TLoS are not even paradoxes in this weaker sense. For example, consider the following: we take one of Kolmogorov's axioms, called 'countable additivity', and weaken it to something called 'finite additivity'. We study the resulting alternative "probability theory" and find that it sometimes produces pathological-looking results. Is this a paradox? I'd say hardly so. Yet, this is exactly what the "non-conglomerability paradox" from chapter 15 boils down to after all is said and done.

Another example, the "Borel-Kolmogorov paradox", is mostly of terminological type - it poses the question of how to make sense of conditioning on an event of probability zero. Jaynes considers the case of conditioning a joint density. Then, he shows that a plausible-looking formula for this conditioning is easily obtained "by an intuitive ad hoc device". Jaynes shows how careless use of this formula gives rise to paradoxical results. Surprisingly, Jaynes does not point out that this paradox was resolved by Kolmogorov in the same 1933 book where he axiomatized probability theory. Using some basic measure theory Kolmogorov defines conditioning with respect to a random variable, of which conditioning a joint density is a special case.

Finally, the "marginalization paradox" touches on an important issue of using improper priors in Bayesian inference. Roughly speaking, 'improper' means 'not summing to 1'. It's as if you added up the probabilities of every possible outcome and found that the sum, i.e. the probability that something will happen, is not 1, but instead is infinite. Clearly, this is not supposed to happen. Using such priors voids all the warranties on your calculational procedures, and can lead to contradictory answers. So why not just agree to never use them? Because they are often easier to compute with, and, more importantly, they arise naturally from both maximum entropy and the group invariance we have talked about. So Bayesians would like to know how to work with them. The marginalization paradox is an example of a problem in which two different approaches to using an improper prior do in fact produce different answers. Jaynes's solution is to approximate improper priors by proper ones, perform the necessary calculations with those, and then take the limit. The trouble is that the calculations are sufficiently involved, and the limits are sufficiently tricky, that following everything to a satisfactory conclusion is challenging. Even a very persistent reader may end up doubting Jaynes's conclusion as to which of the two procedures should ultimately be used.
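On a well-behaved toy problem, the "approximate by proper priors and take the limit" prescription works exactly as advertised. Here is a sketch (my own example, chosen for its tractability, not one of the book's problem cases): estimating a Gaussian mean, where the improper flat prior is approximated by ever-wider proper normal priors.

```python
import numpy as np

# Data x_1..x_n ~ N(theta, 1). Under a proper N(0, tau^2) prior on theta,
# standard conjugate algebra gives the posterior mean
#     n * xbar / (n + 1/tau^2).
# As tau -> infinity this converges to the sample mean xbar, which is
# exactly the answer the improper flat prior on theta gives directly.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=50)
n, xbar = len(x), x.mean()

for tau in [0.1, 1.0, 10.0, 1e6]:
    post_mean = n * xbar / (n + 1.0 / tau**2)
    print(f"tau = {tau:>9}: posterior mean = {post_mean:.4f}")

print(f"flat-prior (improper) answer: {xbar:.4f}")
```

In this conjugate setting the limit is trivial to follow; the difficulty Jaynes faces in the marginalization paradox is precisely that his problems are not this tame, and the limits interact with the calculation in subtle ways.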

Exegi monumentum.

What are we to make of all this, as the saying goes?

PT:TLoS is, to put it mildly, a very special book. It is neither a textbook, nor a reference text, nor a philosophical treatise, nor a history book - and it is a bit of all of those. It is singularly shaped by the person of E. T. Jaynes: by his "two-thirds theoretical physicist and one-third electrical engineer" background, by his interest in radars and statistical mechanics, by his unconventional thinking, by his polemic style in disputes with statisticians of his age, and by his untimely death.

The book's chapters written earlier and polished for longer are some of the strongest, while those added late are often more susceptible to criticism or are incomplete. Yet, despite its flaws, the book had a tremendous influence in reframing statistical practice through the lens of inference and promoting Bayesian methods as a coherent framework for it. Bayesian methods have become ubiquitous in machine learning, and maximum entropy distributions (often under the name of exponential families) play a prominent role. Perhaps because of this influence PT:TLoS is often recommended as a resource on probability and information theory for those with "absolutely no prior experience with these subjects" or even "to the general reader". I find this akin to recommending "Ulysses" as a practice book for beginner English learners. PT:TLoS would fare much better as "A Companion to Probability: A Second First and A First Second Course in Probability", suitable for a dedicated reader possessing a solid grasp of the basics and wishing to gain a deeper conceptual understanding. An advanced student or a practicing professional in any field related to probability, statistics, or machine learning would benefit from reading it. As it stands, however, PT:TLoS is one of those complex classics that many wish to have read, but not many have actually managed to read.

Perhaps this review may at times seem critical and not sufficiently expounding on all of Jaynes's contributions. This may be because Jaynes has, by now, won many of his battles, and his mode of thinking has become part of the intellectual background of our age. It is, after all, difficult to appreciate an insight once it becomes the usual mode of thinking, the proverbial water. It may also be because the book itself is incomplete, and sometimes frustrating. In the very first paragraph of the editor's preface, G. Larry Bretthorst explains:

I could have written [the] latter chapters and filled in the missing pieces, but if I did so, the work would no longer be Jaynes’; rather, it would be a Jaynes–Bretthorst hybrid with no way to tell which material came from which author. In the end, I decided the missing chapters would have to stay missing – the work would remain Jaynes’.

This is a decision which one Amazon review calls "a bad mistake". This is certainly how I felt when I was reading the book; now I am less sure. Once you have struggled through it, the motivation to make the struggle less onerous diminishes, and you begin to think that "keeping the work Jaynes'" may actually be a valid consideration, and not just a lazy cop-out you thought it to be whilst in the thick of it all.

And yet I, too, find myself mourning for what this book could have been. Sometimes when faced with a choice (ketchup or mayo? vanilla or chocolate?) I simply choose both. We already have the Jaynes's version of PT:TLoS. Can we not get the "completed version" as well? Could we not write the missing chapters, explain the cryptic references, solve the unsolved exercises and release the result to the world? Someone who is better than I am at organizing things, and someone who knows more than I do about copyright and publishing would need to think about it. On one hand, we are in the 21st century, with the power of the internet, crowdsourcing and social campaigns. On the other hand, it is my understanding that it will almost be the 22nd century before the copyright for PT:TLoS expires.

Until then, we read the version we have. The version that embodies Jaynes's message: "progress in science goes forward on the shoulders of doubters, not believers". The version that urges us to think for ourselves rather than to defer to the "orthodoxy" (whatever it may be called in our time), to see the truth and preserve it.