This is really, really messy to calculate. It depends on:
- Corpus size. As in, is it a brand new sentence in relation to which other sentences? Every single sentence ever uttered? If the latter, how do we estimate the number of sentences already uttered?
- Word frequency. This is probably the easiest as it's dictated by Zipf's Law, so the Nth most common word appears with a frequency of k/N, where "k" is a language-dependent constant.
- Sentence size. Two sentences can only be identical if they have the same length; I think sentence length should follow something like a normal distribution, with the peak also being language-dependent.
- Grammatical restrictions. For example, in English you won't see "the" followed by a verb, but subject pronouns are followed by one most of the time. We could abstract this factor away, though.
With all of that said, I've calculated this for a specific situation: a corpus containing a single one-word sentence, you're uttering a new one-word sentence, and you want to know if they're identical.
The chance both sentences are identical because they use the same word will be:
- for the most common word, k²
- for the second most common word, (k/2)²
- for the Nth most common word, (k/N)²
The odds both sentences are identical will be the sum of all the odds above, so:
- p = (k/1)² + (k/2)² + (k/3)² ...
- p = k² * (1/1² + 1/2² + 1/3²...)
Technically this is not an infinite series because vocab isn't infinite, but it's easier if we pretend that it is - because then the second factor becomes a convergent series, known to converge to π²/6 ≈ 1.64. So we can simplify the formula to p ≈ 1.64 k², where "k" is the frequency of the most common word.
For example, the first link contains Zipf's Law data for English, with "the" at 7% and "of" at 3.5%; so for English k=0.07. So the odds both one-word sentences are identical, in English, are (0.07)²*1.64 = 8.0*10⁻³ = 0.8%.
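If anyone wants to check the arithmetic, here's a rough Python sketch of the one-word case. The function name and the 100,000-word vocabulary are just placeholders I picked for illustration, not real data; the only inputs taken from above are k=0.07 and the k/N rule.

```python
import math

def same_word_probability(k, vocab_size=None):
    """Chance that two independent one-word sentences use the same word,
    assuming the Nth most common word has frequency k/N (Zipf's Law)."""
    if vocab_size is None:
        # Infinite-vocabulary shortcut: the sum of 1/N^2 converges to pi^2/6
        return k**2 * math.pi**2 / 6
    # Finite vocabulary: add up (k/N)^2 for every word rank N
    return k**2 * sum(1 / n**2 for n in range(1, vocab_size + 1))

k_english = 0.07  # frequency of "the", as quoted above

p_infinite = same_word_probability(k_english)
# vocab_size=100_000 is an arbitrary assumption, only there to show
# how little the finite cutoff changes the result
p_finite = same_word_probability(k_english, vocab_size=100_000)

print(f"infinite vocab: {p_infinite:.4%}")
print(f"finite vocab:   {p_finite:.4%}")
```

Both versions land at roughly 0.8%, so pretending the vocabulary is infinite barely matters here.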
Once you enlarge the corpus from one previous sentence to two or more, or try to handle different sentence sizes, my brain becomes mush.