Entropy and type-token ratio in gigaword corpora
There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six massive linguistic data sets in English, Spanish, and Turkish, c...
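As a rough illustration of the two diversity metrics named in the abstract, the sketch below computes a type-token ratio and a word entropy for a list of tokens. It is a minimal sketch under stated assumptions, not the authors' method: the function names `type_token_ratio` and `word_entropy` are hypothetical, tokens come from a plain whitespace split, and the entropy is the simple plug-in (frequency-based) Shannon estimate in bits. The record does not describe the tokenization, estimator, or logarithm base used in the article.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct words (types) divided by total words (tokens)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    """Plug-in Shannon entropy (in bits) of the empirical word frequencies.

    Assumption: base-2 logarithm and raw relative frequencies; the article
    may use a different base or a bias-corrected estimator.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Toy text; whitespace tokenization is an assumption made for illustration.
    sample = "the cat sat on the mat and the dog sat too".split()
    print("TTR:", type_token_ratio(sample))
    print("Entropy (bits):", word_entropy(sample))
```

Both quantities depend on text length, so values are only comparable between texts of similar size.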
Main Authors: | Pablo Rosillo-Rodes, Maxi San Miguel, David Sánchez |
---|---|
Format: | Article |
Language: | English |
Published: | American Physical Society, 2025-07-01 |
Series: | Physical Review Research |
Online Access: | http://doi.org/10.1103/rxxz-lk3n |
_version_ | 1839629888381779968 |
---|---|
author | Pablo Rosillo-Rodes Maxi San Miguel David Sánchez |
collection | DOAJ |
description | There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six massive linguistic data sets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between entropy and type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths we find an analytical expression for this relation relying on both Zipf and Heaps laws that agrees with our empirical findings. |
format | Article |
id | doaj-art-c0c73f18e6694e309d95d2828d7c50f4 |
institution | Matheson Library |
issn | 2643-1564 |
language | English |
publishDate | 2025-07-01 |
publisher | American Physical Society |
record_format | Article |
series | Physical Review Research |
spelling | doaj-art-c0c73f18e6694e309d95d2828d7c50f4; 2025-07-14T14:06:25Z; eng; American Physical Society; Physical Review Research; 2643-1564; 2025-07-01; 7; 3; 033054; 10.1103/rxxz-lk3n; Entropy and type-token ratio in gigaword corpora; Pablo Rosillo-Rodes; Maxi San Miguel; David Sánchez; http://doi.org/10.1103/rxxz-lk3n |
title | Entropy and type-token ratio in gigaword corpora |
url | http://doi.org/10.1103/rxxz-lk3n |