Frequency analysis as a statistical method of linguistic research

I. Karamysheva, M. Holubinska
Applied Linguistics department, Lviv Polytechnic National University, Ukraine, Lviv, S.Bandery, 12, E-mail: strohushmaryana@rambler.ru, iryn_ka@ukrnet.com  
Abstract – The article illustrates the correlation between absolute word frequency and the margin error of frequency analysis. The theoretical background of frequency analysis in linguistics is considered, and the practical results of applying this statistical method are presented.
Key words – lexical unit, statistical linguistics, frequency analysis, relative frequency, frequency of occurrence, relative error, margin error. 
I. Introduction
An important and promising trend in modern linguistics which has been making progress for the last few decades is the quantitative study of language phenomena and application of statistical methods in linguistic analysis.
Statistics is a branch of mathematics dealing with collection, classification, analysis, and interpretation of numerical facts or data.
Statistical linguistics follows certain rules of mathematical statistics. It is generally recognized as one of the major branches of linguistics. There was a considerable growth of interest and activity in statistical linguistics in the 1960s-80s.
The statistical approach has become very popular because it is the most helpful in dealing with large masses of data to analyze.
The frequency method reveals statistical features of language units. This kind of method is called a mathematical method of language research [4]. It is used in this article to identify the most frequently used words.
The object of our investigation is the word, by which we understand ‘a sequence of graphemes which can occur between spaces or the representation of such a sequence on morphemic level’ [6]. The material under study is the lexical content of the textbook Gold (First Certificate level).
II. Initiation of frequency analysis in the linguistic research
One of the first attempts to introduce statistical methods in linguistics was made by the American scientist G.Zipf in his book Human Behaviour and the Principle of Least Effort (1949) [3, 263]. Having discovered that there is a direct relationship between the number of different meanings of a word and its relative frequency of occurrence, G.Zipf put forward a mathematical formula for this correlation claiming that the number of meanings in any polysemantic word will tend to be equal to the square root of its relative frequency (with the possible exception of the few most frequent words of a language), so that
$m = \sqrt{p}$ (1)
where m stands for the number of meanings, p for relative frequency of occurrence.
 This formula for the correlation between polysemy and word frequency (termed “the principle of diversity of meaning” by G. Zipf) is known as Zipf’s law: the more senses a word has accumulated, the more diverse aspects of intellectual and social activity it represents. It is probably the best-known result achieved in the field of statistical linguistics.
 Though numerous corrections to this law have been suggested by subsequent authors, still there is no reason to doubt the principle itself, namely, that the more frequent a word is, the more meanings it is likely to have.
III. Prescribed relative error in frequency analysis
All linguistic research is based on collecting sample material, in other words, examples. Having determined the object of research and the sets of units to be described on the basis of a preliminary study, the linguist proceeds to collect and classify data. At this stage, it is essential to know how many examples are needed for the conclusions to be valid. Mathematical statistics supplies the researcher with a formula that gives the necessary size of the sample depending on the amount of error he or she is prepared to tolerate.
 The statistical table below is used to determine the necessary number of samples to be investigated depending on the prescribed relative error [2, 265].

TABLE 1 
INTERDEPENDENCE BETWEEN A RELATIVE ERROR (δ)
AND A NUMBER OF SAMPLES
Relative error δ Number of samples
  
  In statistics, relative error is a measure of the difference between the observed or approximately determined value and the true value of a quantity, taken relative to that true value and often expressed as a decimal fraction or percentage, e.g. an error of 0.05 or 5%. In linguistics, an error of 30% is considered the maximum at which the obtained results can still be regarded as reliable.
IV. Results of conducted frequency analysis
To conduct the frequency analysis we used TEXT ANALYZER, compiled by B. Rudyy. The program counts words (both functional and notional parts of speech) separated by either whitespace or punctuation. The results of the calculation are presented in Microsoft Excel sheets (see Fig. 1).
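The counting step described above can be sketched in a few lines of Python. This is not the TEXT ANALYZER program itself, whose internals are not described here; it is a minimal illustration that assumes lowercasing of word forms and a simple regular expression in which any whitespace or punctuation acts as a separator. The function name `count_words` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count word forms, treating whitespace and punctuation as separators."""
    # Assumption: lowercase everything so that "The" and "the" are one unit;
    # an apostrophe inside a word is kept, so "don't" stays a single token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Hypothetical sample text, not taken from the textbook under study.
sample = "The cat sat on the mat. The mat was red, and the cat was happy."
frequencies = count_words(sample)

# Print each word with its absolute frequency m, most frequent first.
for word, m in frequencies.most_common():
    print(word, m)
```

The absolute frequencies produced this way correspond to the values m used in the formulas of the next section; the sum of all counts gives the text length N.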
 TABLE 2
 THE MOST FREQUENTLY USED WORDS IN THE TEXTBOOK GOLD
V. Margin error in frequency analysis
Following certain rules of mathematical statistics, linguists must be able to state their margin error [4, 27].
  (2)  
where p stands for the relative frequency of occurrence of the unit under study, and N for the length of the text (the actual number of words in the text). The larger the margin error is, the less precise the obtained results are.
 To determine the relative error, one must determine the relative frequency of the unit under study [4, 44].
$p = \frac{m}{N}$ (3)
where p stands for the relative frequency of occurrence, m for the absolute frequency of occurrence (the actual number of occurrences of the unit in texts) [1], N for the length of the text.
The total number of words (N) in the whole textbook GOLD is 75 537. To determine the relative frequency of the unit under study we use the formula (3) [4, 44]:

TABLE 3
 ABSOLUTE AND RELATIVE FREQUENCY OF THE MORE FREQUENTLY USED WORDS
Thus, the most frequently used language units of the English textbook GOLD are functional words (prepositions, conjunctions, particles, articles): the definite article the, the indefinite article a, the particle to, the prepositions of, in, at, for, the conjunctions and, if, as, etc. The personal pronouns I, you, he, they, the impersonal pronoun it, and the forms of ‘be’ is (third person singular, present tense), was (first and third person singular, past tense) and are (present-tense plural) belong to the group of notional parts of speech (noun, adjective, pronoun, numeral, verb, adverb, interjection) and are an integral part of the textbook as well. To calculate the relative frequency of other notional words, one should know their absolute frequency. Consider Figure 1.
  Fig. 1 Results of frequency analysis of the 1st Unit, presented in Microsoft Excel
Column D shows the absolute frequency (m) of the words in column C. Since the value of N for the 1st Unit is known (4319), the relative frequency (p) and the margin error can be calculated for each word.
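As an illustration of this calculation, the sketch below computes the relative frequency as the absolute frequency divided by the text length, with N = 4319 as stated for the 1st Unit. The word labels and their absolute frequencies are hypothetical placeholders, not values from Fig. 1, and the function name `relative_frequency` is chosen here for illustration only.

```python
def relative_frequency(m: int, n: int) -> float:
    """Relative frequency p = m / N of a unit occurring m times in N words."""
    if n <= 0:
        raise ValueError("text length N must be positive")
    if not 0 <= m <= n:
        raise ValueError("absolute frequency m must lie between 0 and N")
    return m / n


N_UNIT_1 = 4319  # length of the 1st Unit, as stated in the text

# Hypothetical absolute frequencies, for illustration only.
hypothetical_counts = {"word_a": 40, "word_b": 12, "word_c": 3}

for word, m in hypothetical_counts.items():
    p = relative_frequency(m, N_UNIT_1)
    print(f"{word}: m = {m}, p = {p:.5f}")
```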
TABLE 4
  RELATIVE FREQUENCY AND MARGIN ERROR OF THE NOTIONAL WORDS

The larger the relative frequency is, the smaller the margin error appears to be. In other words, the absolute frequency and the margin error are inversely related.
Thus, statistical regularities can be observed only if the phenomena under analysis are sufficiently numerous and occur very frequently. The main requirement for a successful statistical investigation is the representativeness of the phenomenon in question, its relevance from the linguistic point of view.
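The inverse relationship can be illustrated numerically. Because the margin error formula itself is not reproduced in this text, the sketch below substitutes a commonly used expression as an explicit assumption: the relative error of a proportion under the normal approximation, z * sqrt((1 - p) / (N * p)), with z = 1.96 for a 95% confidence level. It may differ from the formula used by the authors, and the absolute frequencies are hypothetical.

```python
import math

Z_95 = 1.96  # assumed normal quantile for a 95% confidence level


def relative_error(m: int, n: int, z: float = Z_95) -> float:
    """Assumed relative error of p = m / n under the normal approximation."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    p = m / n
    return z * math.sqrt((1 - p) / (n * p))


N_UNIT_1 = 4319  # length of the 1st Unit, as stated in the text

# Hypothetical absolute frequencies in increasing order.
counts = (3, 12, 40, 160)
errors = [relative_error(m, N_UNIT_1) for m in counts]

for m, err in zip(counts, errors):
    print(f"m = {m:>3}: relative error = {err:.3f}")
```

Under this assumed formula the error shrinks steadily as m grows, which agrees with the observation that rare words yield less reliable estimates.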
According to Table 1 (“Interdependence between a relative error and a number of samples”), which is used to determine the number of samples needed for a prescribed relative error, the relative error in our research is 0.01, because the total number of words under study exceeds 38 416.
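The threshold of 38 416 can be related to a simple sample-size rule. The sketch below is not taken from Table 1, whose rows are not reproduced here; it assumes the rule N = (z / δ)² with z = 1.96, which happens to reproduce the value 38 416 for δ = 0.01. The function name `required_samples` is hypothetical, and the inputs are passed as decimal strings so that the arithmetic is exact.

```python
import math
from fractions import Fraction


def required_samples(delta: str, z: str = "1.96") -> int:
    """Smallest integer N satisfying N >= (z / delta)**2 (assumed rule)."""
    d = Fraction(delta)
    if d <= 0:
        raise ValueError("relative error must be positive")
    # Exact rational arithmetic avoids floating-point rounding at the boundary.
    return math.ceil((Fraction(z) / d) ** 2)


for delta in ("0.01", "0.05", "0.1", "0.3"):
    print(delta, required_samples(delta))
```

Under this assumption, a text of 75 537 words comfortably exceeds the sample size needed for a relative error of 0.01.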
Conclusions
An important and promising trend in modern linguistics, which has been making progress during the last few decades, is the quantitative study of language phenomena and the application of statistical methods in linguistic analysis. The statistical approach has become very popular in linguistics, not only because of the precision and objectivity which it is held to guarantee, but also because language is a mass phenomenon which seems to invite this kind of treatment.
It is not enough for us to know that a given structure is allowed to appear; we are interested in its frequency, that is, in how often it appears. Frequency is the main criterion for the classification of facts. It is the criterion for comparing the investigated units and for selecting the basic lexical store for a foreign language learner. The more frequent words you know, the better you are familiar with the senses they denote. In accordance with the conducted research, knowledge of functional word usage is obligatory and precedes the further process of foreign language learning.
Mathematical statistics supplied the research with formulas that give the necessary size of the sample depending on the amount of error the researcher is prepared to tolerate. The presented research aimed at the smallest relative error, expressed as a decimal fraction. The relative error in the research is 0.01, since the total number of words under study is 75 537, and the obtained results are considered to be reliable.
References
[1] Волошин В.Г., Комп’ютерна лінгвістика: Навчальний посібник. Суми: ВТД „Університетська книга”, 382 с., 2004. 
[2] Раєвська Н.М. English lexicology. – Київ: Вища школа, 109 с., 1971.
[3] Сухорольська С., Федоренко О. Методи лінгвістичних досліджень. – Львів: Видавничий центр Львівського НУ ім. І.Франка, 344 с., 2006. 
[4] Перебийніс В.І. Статистичні методи для лінгвістів: Навчальний посібник. – Вінниця: „Нова Книга”, 168c., 2001.
[5] Arnold I.V. The English Word. – Moscow: Vysšaja Škola, 296 p., 1986. 
 [6] Lamb S.M. Segmentation. Proceedings of the National Symposium on Machine Translation. – New York, PP. 35-38, 1961.

 


CSIT'2009: International Conference on Computer Science and IT
