Advanced search options

You searched for `subject:(Good Turing estimators)`

. One record found.

▼ Search Limiters

University of California – San Diego

1. Suresh, Ananda Theertha. Statistical Inference over Large Domains.

Degree: Electrical Engineering (Communication Theory and Systems), 2016, University of California – San Diego

URL: http://www.escholarship.org/uc/item/6vr8w9bq

Motivated by diverse applications in ecology, genetics, and language modeling, researchers in learning, computer science, and information theory have recently studied several fundamental statistical questions in the large domain regime, where the domain size is large relative to the number of samples. We study three such basic problems with rich history and wide applications. In the course of analyzing these problems, we also provide provable guarantees for several existing practical estimators and propose estimators with better guarantees.Competitive distribution estimation and classification:Existing theory does not explain why absolute-discounting, Good-Turing, and related estimators outperform the asymptotically min-max optimal estimators in practice. We explain their performance by showing that a variant of Good-Turing estimators performs near optimally for all distributions. Specifically, for distributions over k symbols and n samples, we show that a simple variant of Good-Turing estimator is always within KL divergence of (3+o_{n}(1))/n^{1/3} from a genie-aided estimator that knows the underlying distribution up to a permutation, and that a more involved estimator is within ~{𝓞_{n}}(\min(k/n,1/√ n)). We extend these results to classification, where the goal is to classify a test sample based on two training samples of length n each.Estimating the number of unseen species:We study species estimation, where given n independent samples from an unknown species distribution, we would like to estimate the number of new species that will be observed among the next t ∙ n samples. Existing algorithms provide guarantees only for the prediction range t ≤ 1. We significantly extend the range of predictability and prove that a class of estimators including Efron-Thisted accurately predicts the number of unseen species for t ∝ log n. Conversely, we show that no estimator is accurate for t = Ω(log n).Learning Gaussian mixtures:We derive the first sample-efficient \newline polynomial-time estimator for high-dimensional spherical Gaussian mixtures. It estimates mixtures of k spherical Gaussians in d-dimensions to within ℓ_{1} distance ε using 𝓞({dk^{9}(log^{2d})}/{ε^{4}}) samples and 𝓞_{k,ε}(d^{3log}^{5} d) computation time. Conversely, we show that any estimator requires Ω\bigl({dk}/{ε^{2}}\bigr) samples, hence the algorithm's sample complexity is nearly optimal in the number of dimensions. We also construct a simple estimator for one-dimensional Gaussian mixtures that uses ~{𝓞}({k }/{ε^{2}}) samples and ~{𝓞}(({k}/{ε})^{3k+1}) computation time.

Subjects/Keywords: Electrical engineering; Information science; Good-Turing estimators; Large Domain; Statistical Inference

Record Details Similar Records

❌

APA · Chicago · MLA · Vancouver · CSE | Export to Zotero / EndNote / Reference Manager

APA (6^{th} Edition):

Suresh, A. T. (2016). Statistical Inference over Large Domains. (Thesis). University of California – San Diego. Retrieved from http://www.escholarship.org/uc/item/6vr8w9bq

Note: this citation may be lacking information needed for this citation format:

Not specified: Masters Thesis or Doctoral Dissertation

Chicago Manual of Style (16^{th} Edition):

Suresh, Ananda Theertha. “Statistical Inference over Large Domains.” 2016. Thesis, University of California – San Diego. Accessed April 08, 2020. http://www.escholarship.org/uc/item/6vr8w9bq.

Note: this citation may be lacking information needed for this citation format:

Not specified: Masters Thesis or Doctoral Dissertation

MLA Handbook (7^{th} Edition):

Suresh, Ananda Theertha. “Statistical Inference over Large Domains.” 2016. Web. 08 Apr 2020.

Vancouver:

Suresh AT. Statistical Inference over Large Domains. [Internet] [Thesis]. University of California – San Diego; 2016. [cited 2020 Apr 08]. Available from: http://www.escholarship.org/uc/item/6vr8w9bq.

Note: this citation may be lacking information needed for this citation format:

Not specified: Masters Thesis or Doctoral Dissertation

Council of Science Editors:

Suresh AT. Statistical Inference over Large Domains. [Thesis]. University of California – San Diego; 2016. Available from: http://www.escholarship.org/uc/item/6vr8w9bq

Not specified: Masters Thesis or Doctoral Dissertation