# Gaussian mixture model sometimes seems to fit very badly

In the following code, I fit a gaussian mixture model (GMM) to some randomly sampled data. I do this twice. Each time, the data represent two well separated gaussians, the only difference being the seed I use for the random number generator.

`N = 100000;EFFECT_SIZE = 5;seedList = [1 6];for s = seedList      rng(s)      X = [randn(N,1); randn(N,1)+EFFECT_SIZE];      figure      hist(X,101)      GMModel = fitgmdist(X,2)end`

If you run that code — you will need the Statistics Toolbox — you will see that the first distribution is fit very well, and the second one terribly. I am trying to understand why. I would expect such well separated peaks to be fit well essentially every time.

This is not a fluke. I ran 1,000 different seeds, and got the bad fit about 18% of the time. Also, those bad fits tend to cluster relatively close the same parameter values.

Any thoughts? I am a novice at using GMM, so maybe I am just naive about how well this should do.

I am running R2014b on Mac OS X Yosemite.

NOTE:-

Matlabsolutions.com provide latest MatLab Homework Help,MatLab Assignment Help , Finance Assignment Help for students, engineers and researchers in Multiple Branches like ECE, EEE, CSE, Mechanical, Civil with 100% output.Matlab Code for B.E, B.Tech,M.E,M.Tech, Ph.D. Scholars with 100% privacy guaranteed. Get MATLAB projects with source code for your learning and research.

Hi cyclist,

the default starting values are selected from the data at random, and you have discovered that this sometimes does not work well. There’s an argument ‘replicates’ that you can use to have it try this multiple times and deliver the bone fit. However, in more recent releases including R2014b there is a new argument ‘Start’,’plus’ that uses a better starting method, one based on the kmeans++ algorithm for clustering. Here’s a variant of your function that shows the relative performance of the default random start and the ‘plus’ start:

`N = 100000;EFFECT_SIZE = 5;seedList = 1:20;means1 = zeros(length(seedList),2);means2 = zeros(length(seedList),2);for s = seedList    s    rng(s)    X = [randn(N,1); randn(N,1)+EFFECT_SIZE];    rng(s)  % randomness is used in the fit also    GMModel = fitgmdist(X,2);    means1(s,:) = GMModel.mu';    rng(s)    GMModel = fitgmdist(X,2,'start','`

--

--

## More from Technical Source

Simple! That is me, a simple person. I am passionate about knowledge and reading. That’s why I have decided to write and share a bit of my life and thoughts to.

Love podcasts or audiobooks? Learn on the go with our new app.

## Technical Source

Simple! That is me, a simple person. I am passionate about knowledge and reading. That’s why I have decided to write and share a bit of my life and thoughts to.