Stackexchange users be welcome to this page :), if you find the program useful or have suggestions, please feel free to leave a comment.
The poweRlaw R library provides the bootstrap_p function which allows to test the goodness of fit of a power law to the data using bootstrapping. Unfortunately the library does not provides such methods for other distributions. We here propose such functions for log-normal and exponential models.
The code is based on the principle of the paper by Clauset et al. 2009 to test goodness of fit by bootstrapping for the log-normal and exponential distributions and reuse some of the powerLaw functions.
I wrote it after realizing that the code provided on the page http://users.dimi.uniud.it/~massimo.franceschet/R/fit.html meant for the same purpose was incorrect: the way the goodness of fit is computed is not correct as it does not strictly follow the procedure described in the paper, and implemented in the poweRlaw package.
Specifically: the synthetic data generation procedure is half implemented, it does not search for the best xmin as provided by the extimate_xmin function of the poweRlaw package, and finally the ks.test discards all the ties, while powerLaw doesn't with its built-in KS test.
Usage: a poweRlaw discrete distribution object has to be given as the first parameter. Prints the p-value on the output. Note that following the paper, about 2500 simulations should be generated to get a 2 points precision.
Interpretation of the p-value: a p-value < 0.1 rejects the hypothesis that the tested law generated the data, while a value > 0.1 does not reject it, but does not disprove the model either.
Note that model comparison based on the likelihood ratio (like the Vuong test, which is implemented in the compare_distribution function of poweRlaw) is better able to tell which of two models better fits the data and if there are enough data to confidently claim so.
Code for log-normal distributions:
bootstrap_ln = function(m,nbsims=5) { xmin = m$getXmin() ntot = length(m$dat) n1 = sum(m$dat<xmin) n2 = ntot - n1 print(n1) print(n2) print(ntot) start_time = Sys.time() m_est = estimate_xmin(m) end_time = Sys.time() total_time = difftime(end_time, start_time, units="secs") mean = m_est$pars[1] sd = m_est$pars[2] print(m_est$KS) print(unlist(m_est)) print(total_time) count=0 for(i in 1:nbsims) { # generate synthetic data set if(xmin>1) { syn1 = sample(m$dat[m$dat<xmin],n1,replace=TRUE) } else { syn1 = c() } # sample from distribution syn2 = c() while(length(syn2)<n2) { n2bis = n2-length(syn2) syn2bis = round(rlnorm(n2bis,meanlog=mean,sdlog=sd)) syn2bis = syn2bis[syn2bis>0] syn2 = c(syn2,syn2bis) } syn = c(syn1,syn2) print(length(syn)) # estimate gof of syn data msyn = dislnorm$new(syn) syn_est = estimate_xmin(msyn) print(syn_est$KS) print(unlist(syn_est)) if(syn_est$KS >= m_est$KS) { count = count + 1 } } print(count/nbsims) }
example:
dsyn = round(rlnorm(1000,2.4,0.98)) dsyn = dsyn[dsyn>0] length(dsyn) e_lnsyn = dislnorm(dsyn) e_lnsyn$setXmin(estimate_xmin(e_lnsyn)) bootstrap_ln(e_lnsyn)
Code for exponential distributions:
bootstrap_exp = function(m,nbsims=5) { xmin = m$getXmin() ntot = length(m$dat) xmin n1 = sum(m$dat<xmin) n2 = ntot - n1 print(n1) print(n2) print(ntot) start_time = Sys.time() m_est = estimate_xmin(m) end_time = Sys.time() total_time = difftime(end_time, start_time, units="secs") rate = m_est$pars[1] print(m_est$KS) print(unlist(m_est)) print(total_time) count=0 for(i in 1:nbsims) { # generate synthetic data set if(xmin>1) { syn1 = sample(m$dat[m$dat < xmin],n1,replace=TRUE) } else { syn1 = c() } # sample from distribution syn2 = c() while(length(syn2)<n2) { n2bis = n2-length(syn2) syn2bis = round(rexp(n2bis,rate=rate)) syn2bis = syn2bis[syn2bis>0] syn2 = c(syn2,syn2bis) } syn = c(syn1,syn2) # estimate gof of syn data msyn = disexp$new(syn) syn_est = estimate_xmin(msyn) print(syn_est$KS) print(unlist(syn_est)) if(syn_est$KS >= m_est$KS) { count = count + 1 } } print(count/nbsims) }
Hello Nicolas.. How the code from the link is not correct?
ReplyDeleteWhat is wrong?
http://users.dimi.uniud.it/~massimo.franceschet/R/fit.html
I want to implement for Weibull