If we close our eyes and draw four circles, we would guess that they came from Box2.
We can check this guess by calculating and comparing two probabilities: P(4 circles|Box1) vs P(4 circles|Box2).
Since the number of circles drawn is the number of successes in a fixed number of trials, we can use the binomial distribution to calculate the probability (assuming the draws are independent and each has the same chance of giving a circle).
For Box1, the number of circles in four draws follows X ~ B(n=4, p=0.5).
For Box2, the number of circles in four draws follows X ~ B(n=4, p=0.9).
Thus, P(X=4|Box1) = 0.5^4 = 0.0625 and P(X=4|Box2) = 0.9^4 = 0.6561, so the probability for Box2 is higher than that for Box1.
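The comparison can be checked with a few lines of Python. This is only a minimal sketch: the helper name binomial_pmf is made up for illustration, and the values p=0.5 and p=0.9 are the ones given for Box1 and Box2 above.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p). Hypothetical helper, not from any library."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of drawing 4 circles in 4 draws from each box
p_box1 = binomial_pmf(4, 4, 0.5)  # Box1: p = 0.5
p_box2 = binomial_pmf(4, 4, 0.9)  # Box2: p = 0.9

print(f"P(4 circles | Box1) = {p_box1:.4f}")
print(f"P(4 circles | Box2) = {p_box2:.4f}")
print("More likely box:", "Box2" if p_box2 > p_box1 else "Box1")
```

The box with the larger value is the one we would guess, which matches the intuition from the start.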
Here, P(data|Box), viewed as a function of the box, is called the likelihood function.
Then, what is the maximum likelihood estimator?
It is the estimator that maximizes the likelihood function; the value it takes for the observed data is the maximum likelihood estimate.
In the above example, we can say that Box1 and Box2 are populations.
The cards X1, X2, ..., Xn drawn from Box1 ~ iid f(x|θ). Here, θ is the parameter of Box1.
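If we don't know θ but want to estimate it from the joint probability of the sample,
we can write down the likelihood function.
It is the product of the marginal probabilities (that is, the joint probability of the sample): L(θ) = f(x1|θ) × f(x2|θ) × ... × f(xn|θ).
To maximize the function, we differentiate it with respect to θ and set the derivative equal to zero.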
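To make the maximization step concrete, here is a small sketch for the box setting, where the unknown parameter is p, the chance of drawing a circle. The sample below (7 circles in 10 draws) is hypothetical and not taken from the text, and the function name log_likelihood is made up for illustration. The code works with the log of the likelihood, which turns the product into a sum and has its maximum at the same p. It compares a simple grid search with the closed form p = (number of circles)/n, which is what we get by setting the derivative to zero.

```python
from math import log

def log_likelihood(p, draws):
    """Log of the joint probability of iid draws (1 = circle, 0 = other) given p."""
    return sum(log(p) if x == 1 else log(1 - p) for x in draws)

# Hypothetical sample: 7 circles in 10 draws (an assumption for illustration)
draws = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Candidate values of p on a grid; 0 and 1 are excluded because log(0) is undefined
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, draws))

# Closed form from setting the derivative of the log-likelihood to zero
p_closed = sum(draws) / len(draws)

print("grid search estimate:", p_hat)
print("closed-form estimate:", p_closed)
```

Both estimates should agree up to the grid resolution, which suggests the calculus and the brute-force search are finding the same maximum.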
As an example,
let's think about the exponential distribution.
X1, X2, ..., Xn ~ iid exp(λ)