Entropy and type-token ratio in gigaword corpora

There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. Here we investigate both diversity metrics in six massive linguistic data sets in English, Spanish, and Turkish, c...

Bibliographic Details
Main Authors: Pablo Rosillo-Rodes, Maxi San Miguel, David Sánchez
Format: Article
Language: English
Published: American Physical Society 2025-07-01
Series: Physical Review Research
Online Access: http://doi.org/10.1103/rxxz-lk3n
author Pablo Rosillo-Rodes
Maxi San Miguel
David Sánchez
collection DOAJ
description There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. Here we investigate both diversity metrics in six massive linguistic data sets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between entropy and type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths we find an analytical expression for this relation relying on both Zipf's and Heaps' laws that agrees with our empirical findings.
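The two diversity metrics named in the description can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the whitespace tokenizer, the lowercasing step, the base-2 logarithm, and the function name `lexical_diversity` are all assumptions made for illustration, since this record does not describe how the corpora were preprocessed.

```python
import math
from collections import Counter


def lexical_diversity(text):
    """Return (type-token ratio, word entropy in bits) for a text.

    Assumptions (not taken from the record): tokens are obtained by
    lowercasing and splitting on whitespace, and entropy is the Shannon
    entropy of the empirical word-frequency distribution in base 2.
    """
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text contains no tokens")

    counts = Counter(tokens)   # frequency of each word type
    n_tokens = len(tokens)     # total number of word occurrences
    n_types = len(counts)      # number of distinct words

    ttr = n_types / n_tokens
    entropy = -sum(
        (c / n_tokens) * math.log2(c / n_tokens) for c in counts.values()
    )
    return ttr, entropy


sample = "the cat sat on the mat and the dog sat on the rug"
ttr, entropy = lexical_diversity(sample)
print(f"TTR = {ttr:.3f}, entropy = {entropy:.3f} bits")
```

Applying such a function to texts of increasing length from one corpus would give the (type-token ratio, entropy) pairs whose relation the description refers to; real corpora would need a more careful tokenizer than the one assumed here.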
format Article
id doaj-art-c0c73f18e6694e309d95d2828d7c50f4
institution Matheson Library
issn 2643-1564
language English
publishDate 2025-07-01
publisher American Physical Society
record_format Article
series Physical Review Research
spelling doaj-art-c0c73f18e6694e309d95d2828d7c50f4 | 2025-07-14T14:06:25Z | eng | American Physical Society | Physical Review Research | 2643-1564 | 2025-07-01 | vol. 7, no. 3, art. 033054 | 10.1103/rxxz-lk3n | Entropy and type-token ratio in gigaword corpora | Pablo Rosillo-Rodes; Maxi San Miguel; David Sánchez
title Entropy and type-token ratio in gigaword corpora
url http://doi.org/10.1103/rxxz-lk3n