Re: minimum sample

From: Daniel Riaño (danielrr@mad.servicom.es)
Date: Fri Mar 05 1999 - 02:16:25 EST


        Jonathan,

        You gave an example of probability calculation, but I think (and I
think we both agree) that this kind of statistics is of a completely
different nature from statistics based on data drawn from syntactically
analysed text corpora (which the Gramcord texts are not).
        I'll try to make myself clearer (though I am translating the
Spanish terms for chart drawing): we agree (do we?) that there is a set of
rules that governs any natural language's syntax. There is a much larger set
of rules, with fuzzier boundaries, that governs the use of this system by
any given community (its linguistic norm), and an even larger and fuzzier set
of rules that governs the linguistic choices of any individual in different
situations. Suppose you, your team, or several scientific projects are
involved in a complex linguistic investigation on several levels of
description of ancient Greek. You are limited to the testimony of written
texts: you don't have a 2,000-year-old Greek speaker whose linguistic
competence you could test. Still, you could draw a graph in which the
horizontal axis represents the volume of your corpus (the number of words)
and, with a greater degree of convention, the vertical axis represents
the completeness of your description of the system of Greek syntax in
the first century AD, of the norm in Palestine at the time, or of the use of the
language of a given author. Now, the curve that represents the (foreseeable)
growth of your descriptions will rise as you parse text, but at some
point the line that represents your description of the system will start
to rise much more slowly, and eventually it will be flat (congratulations:
you are now in the history books), and some time later the slope
of the line that represents the accuracy of your description of some
author's linguistic usage will tend to zero. And my question is: is there a
statistical method to calculate where the inflection point of that line is?
Is there a "minimum sample" for linguistic studies over text corpora?
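
        To make the question concrete, here is a minimal sketch in Python of
the curve described above. It is only an illustration under stated
assumptions, not an established method: the "items" stand in for whatever
unit the description counts (for instance, labels of syntactic rules produced
by a parser), and the names growth_curve, flattening_point, window and
max_new are hypothetical. The corpus is taken to have "flattened" at the
first size after which the next `window` tokens contribute at most `max_new`
previously unseen items; both thresholds are conventions chosen by the
investigator.

```python
def growth_curve(items):
    """Return, for each corpus size 1..len(items), the number of distinct
    items seen so far (the vertical axis of the graph)."""
    seen = set()
    curve = []
    for item in items:
        seen.add(item)
        curve.append(len(seen))
    return curve


def flattening_point(curve, window, max_new=0):
    """Return the smallest corpus size n such that the `window` tokens
    following the first n add at most `max_new` new items.
    Return None if the curve never flattens under this criterion."""
    for i in range(len(curve) - window):
        # curve[i] is the count after i + 1 tokens; curve[i + window] is the
        # count after a further `window` tokens.
        if curve[i + window] - curve[i] <= max_new:
            return i + 1
    return None


# Toy data (hypothetical): each letter stands for one observed construction.
observations = ["a", "b", "a", "c", "a", "a", "a", "a"]
curve = growth_curve(observations)
point = flattening_point(curve, window=3, max_new=0)
```

        On this toy sequence the curve is [1, 2, 2, 3, 3, 3, 3, 3] and the
criterion is first met at a corpus size of 4. With real text the curve may
level off only very gradually, so the answer depends heavily on the chosen
window and threshold, and a return value of None simply means the corpus is
too small to say.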

Jonathan Robie wrote:
>I think it is important to distinguish sampling statistics from descriptive
>statistics. If you are looking at individual observations from a universe
>that is unbounded or too large to study, you use sampling statistics to
>extrapolate from a sample. If you are trying to describe a well-bounded
>universe, it's better to use descriptive statistics.
>
>To use a homely example, suppose you had 10 marbles in a jar, and you
>wanted to know how many of them are black. Since you can just count them,
>you'd be best looking at all 10 marbles. Although it is possible to sample
>5 and extrapolate from there, your estimate will not be particularly accurate.

>In general, I'm suspicious when sampling statistics are used to assert
>probabilities of interpretation for linguistic phenomena, especially when
>the corpus is well defined, reasonably small, and eminently explorable with
>tools like Gramcord. Suppose I were to assert that the English word "run"
>refers to a way of moving 75% of the time. That doesn't mean that there is
>a 75% probability that it has this meaning in a sentence like "she has a
>run in her stocking". There are many factors that affect meaning in
>context, and if you try to make statistical projections about meaning, it's
>pretty likely that there may be other factors that you did not consider
>when drawing your sample.
>
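
        The marble example in the quoted message can be checked numerically.
The following is a small sketch, assuming (hypothetically, since the message
gives no figure) that 4 of the 10 marbles are black. Drawing 5 marbles
without replacement, the number of black marbles in the sample follows a
hypergeometric distribution, and doubling that number is the natural estimate
for the whole jar. The function name hypergeom_pmf is my own.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """Probability of exactly k successes when drawing `draws` items without
    replacement from `population` items, `successes` of which are successes."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


N_MARBLES = 10   # marbles in the jar (from the quoted example)
N_BLACK = 4      # assumed number of black marbles (hypothetical)
N_SAMPLE = 5     # marbles drawn (from the quoted example)

# The estimate 2 * k equals the true count only when the sample holds
# exactly half of the black marbles.
p_exact = hypergeom_pmf(N_BLACK // 2, N_MARBLES, N_BLACK, N_SAMPLE)
p_wrong = 1 - p_exact
```

        Under that assumption the sample gives the right answer with
probability 120/252, a little under one half, which is consistent with the
point that counting all 10 marbles is the better course.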

___________________________________________________________
Daniel Riaño Rufilanchas
c. Santa Engracia 52, 7 dcha.
28010-Madrid, España
___________________________________________________________
