Automatic language identification in texts:
Contributor: | |
---|---|
Format: | Book |
Language: | English |
Published: | Cham, Switzerland : Springer Nature Switzerland, [2024] |
Series: | Synthesis lectures on human language technologies |
Subjects: | |
Links: | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=035240833&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
Summary: | "This book provides readers with a brief account of the history of Language Identification (LI) research and a survey of the features and methods most used in LI literature. LI is the problem of determining the language in which a document is written and is a crucial part of many text processing pipelines. The authors use a unified notation to clarify the relationships between common LI methods. The book introduces LI performance evaluation methods and takes a detailed look at LI-related shared tasks. The authors identify open issues and discuss the applications of LI and related tasks and proposes future directions for research in LI." -- |
Physical description: | xiv, 148 pages, illustrations, diagrams, 25 cm |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV049901892 | ||
003 | DE-604 | ||
005 | 20241113 | ||
007 | t| | ||
008 | 241010s2024 xx a||| |||| 00||| eng d | ||
015 | |a GBC464277 |2 dnb | ||
020 | |z 3031458214 |9 3-031-45821-4 | ||
020 | |z 9783031458217 |9 978-3-031-45821-7 | ||
035 | |a (OCoLC)1435426773 | ||
035 | |a (DE-599)BVBBV049901892 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
049 | |a DE-739 | ||
084 | |a ST 306 |0 (DE-625)143654: |2 rvk | ||
100 | 1 | |a Jauhiainen, Tommi |d ca. 20./21. Jh. |e Verfasser |0 (DE-588)1348178884 |4 aut | |
245 | 1 | 0 | |a Automatic language identification in texts |c Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, Krister Lindén |
264 | 1 | |a Cham, Switzerland |b Springer Nature Switzerland |c [2024] | |
300 | |a xiv, 148 Seiten |b Illustrationen, Diagramme |c 25 cm | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Synthesis lectures on human language technologies | |
520 | |a "This book provides readers with a brief account of the history of Language Identification (LI) research and a survey of the features and methods most used in LI literature. LI is the problem of determining the language in which a document is written and is a crucial part of many text processing pipelines. The authors use a unified notation to clarify the relationships between common LI methods. The book introduces LI performance evaluation methods and takes a detailed look at LI-related shared tasks. The authors identify open issues and discuss the applications of LI and related tasks and proposes future directions for research in LI." -- | ||
650 | 4 | |a Computational linguistics | |
650 | 4 | |a Text processing (Computer science) | |
650 | 4 | |a Linguistique informatique | |
650 | 4 | |a Traitement de texte | |
650 | 7 | |a computational linguistics |2 aat | |
650 | 0 | 7 | |a Computerlinguistik |0 (DE-588)4035843-4 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Automatische Spracherkennung |0 (DE-588)4003961-4 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Automatische Spracherkennung |0 (DE-588)4003961-4 |D s |
689 | 0 | 1 | |a Computerlinguistik |0 (DE-588)4035843-4 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Zampieri, Marcos |d ca. 20./21. Jh. |e Sonstige |0 (DE-588)111851582X |4 oth | |
700 | 1 | |a Baldwin, Timothy |e Sonstige |0 (DE-588)134817997X |4 oth | |
700 | 1 | |a Lindén, Krister |d ca. 20./21. Jh. |e Sonstige |0 (DE-588)1348180951 |4 oth | |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=035240833&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
943 | 1 | |a oai:aleph.bib-bvb.de:BVB01-035240833 |
Record in the search index
_version_ | 1822520859113291776 |
---|---|
adam_text |
Contents
1 Introduction to Language Identification
1.1 A Brief History of Language Identification (LI)
1.2 What is LI Used For?
1.3 What are the Main Challenges that Make LI Difficult?
References
2 Features and Methods
2.1 On Notation
2.2 What Textual Features Are Used for LI and How Are They Collected and Calculated?
2.2.1 Feature Smoothing
2.3 What Classification Methods Are Used for LI and How Do They Work?
2.3.1 Decision Rules, Trees and Random Forests
2.3.2 Simple Scoring
2.3.3 Sum or Average of Values
2.3.4 Product of Values
2.3.5 Similarity Measures
2.3.6 Logistic Regression
2.3.7 Support Vector Machines
2.3.8 Neural Networks
2.3.9 Ensemble Methods
2.4 Machine Learning Toolkits and Libraries
References
3 Evaluation and Measurement
3.1 How is LI Performance Evaluated? What Are the Measures and How Are They Calculated?
3.2 What Material Can Be Used in Training and Evaluating Language Identifiers?
3.3 LI Shared Tasks
References
4 Specific Challenges of Variation and Text Types
4.1 Language Similarity
4.1.1 LI for Similar Languages, Varieties, and Dialects
4.2 Low-Resource Languages
4.3 Orthography and Its Variations
4.4 Short Texts
References
5 Large Scale, Multi-domain Language Identification
5.1 Number of Languages
5.2 Unseen Languages
5.3 Multilingual Texts
5.4 Domain Compatibility
References
6 Applications and Related Tasks
6.1 Applications
6.1.1 Monolingual NLP Components
6.1.2 Machine Translation
6.1.3 Multilingual Document Storage and Retrieval
6.2 Related Tasks
6.2.1 Native Language Identification
6.2.2 Author Profiling and Identification
References
7 Conclusion and Future Directions |
any_adam_object | 1 |
author | Jauhiainen, Tommi ca. 20./21. Jh |
author_GND | (DE-588)1348178884 (DE-588)111851582X (DE-588)134817997X (DE-588)1348180951 |
author_facet | Jauhiainen, Tommi ca. 20./21. Jh |
author_role | aut |
author_sort | Jauhiainen, Tommi ca. 20./21. Jh |
author_variant | t j tj |
building | Verbundindex |
bvnumber | BV049901892 |
classification_rvk | ST 306 |
ctrlnum | (OCoLC)1435426773 (DE-599)BVBBV049901892 |
discipline | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000 c 4500</leader><controlfield tag="001">BV049901892</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20241113</controlfield><controlfield tag="007">t|</controlfield><controlfield tag="008">241010s2024 xx a||| |||| 00||| eng d</controlfield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">GBC464277</subfield><subfield code="2">dnb</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">3031458214</subfield><subfield code="9">3-031-45821-4</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9783031458217</subfield><subfield code="9">978-3-031-45821-7</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1435426773</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV049901892</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-739</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 306</subfield><subfield code="0">(DE-625)143654:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Jauhiainen, Tommi</subfield><subfield code="d">ca. 20./21. 
Jh.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1348178884</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Automatic language identification in texts</subfield><subfield code="c">Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, Krister Lindén</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Cham, Switzerland</subfield><subfield code="b">Springer Nature Switzerland</subfield><subfield code="c">[2024]</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xiv, 148 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield><subfield code="c">25 cm</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Synthesis lectures on human language technologies</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="a">"This book provides readers with a brief account of the history of Language Identification (LI) research and a survey of the features and methods most used in LI literature. LI is the problem of determining the language in which a document is written and is a crucial part of many text processing pipelines. The authors use a unified notation to clarify the relationships between common LI methods. The book introduces LI performance evaluation methods and takes a detailed look at LI-related shared tasks. The authors identify open issues and discuss the applications of LI and related tasks and proposes future directions for research in LI." 
--</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Computational linguistics</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Text processing (Computer science)</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Linguistique informatique</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Traitement de texte</subfield></datafield><datafield tag="650" ind1=" " ind2="7"><subfield code="a">computational linguistics</subfield><subfield code="2">aat</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Computerlinguistik</subfield><subfield code="0">(DE-588)4035843-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Automatische Spracherkennung</subfield><subfield code="0">(DE-588)4003961-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Automatische Spracherkennung</subfield><subfield code="0">(DE-588)4003961-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Computerlinguistik</subfield><subfield code="0">(DE-588)4035843-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Zampieri, Marcos</subfield><subfield code="d">ca. 20./21. 
Jh.</subfield><subfield code="e">Sonstige</subfield><subfield code="0">(DE-588)111851582X</subfield><subfield code="4">oth</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Baldwin, Timothy</subfield><subfield code="e">Sonstige</subfield><subfield code="0">(DE-588)134817997X</subfield><subfield code="4">oth</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Lindén, Krister</subfield><subfield code="d">ca. 20./21. Jh.</subfield><subfield code="e">Sonstige</subfield><subfield code="0">(DE-588)1348180951</subfield><subfield code="4">oth</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&amp;doc_library=BVB01&amp;local_base=BVB01&amp;doc_number=035240833&amp;sequence=000001&amp;line_number=0001&amp;func_code=DB_RECORDS&amp;service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-035240833</subfield></datafield></record></collection> |
id | DE-604.BV049901892 |
illustrated | Illustrated |
indexdate | 2025-01-28T19:08:54Z |
institution | BVB |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-035240833 |
oclc_num | 1435426773 |
open_access_boolean | |
owner | DE-739 |
owner_facet | DE-739 |
physical | xiv, 148 Seiten Illustrationen, Diagramme 25 cm |
publishDate | 2024 |
publishDateSearch | 2024 |
publishDateSort | 2024 |
publisher | Springer Nature Switzerland |
record_format | marc |
series2 | Synthesis lectures on human language technologies |
spelling | Jauhiainen, Tommi ca. 20./21. Jh. Verfasser (DE-588)1348178884 aut Automatic language identification in texts Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, Krister Lindén Cham, Switzerland Springer Nature Switzerland [2024] xiv, 148 Seiten Illustrationen, Diagramme 25 cm txt rdacontent n rdamedia nc rdacarrier Synthesis lectures on human language technologies "This book provides readers with a brief account of the history of Language Identification (LI) research and a survey of the features and methods most used in LI literature. LI is the problem of determining the language in which a document is written and is a crucial part of many text processing pipelines. The authors use a unified notation to clarify the relationships between common LI methods. The book introduces LI performance evaluation methods and takes a detailed look at LI-related shared tasks. The authors identify open issues and discuss the applications of LI and related tasks and proposes future directions for research in LI." -- Computational linguistics Text processing (Computer science) Linguistique informatique Traitement de texte computational linguistics aat Computerlinguistik (DE-588)4035843-4 gnd rswk-swf Automatische Spracherkennung (DE-588)4003961-4 gnd rswk-swf Automatische Spracherkennung (DE-588)4003961-4 s Computerlinguistik (DE-588)4035843-4 s DE-604 Zampieri, Marcos ca. 20./21. Jh. Sonstige (DE-588)111851582X oth Baldwin, Timothy Sonstige (DE-588)134817997X oth Lindén, Krister ca. 20./21. Jh. Sonstige (DE-588)1348180951 oth Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=035240833&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Jauhiainen, Tommi ca. 20./21. Jh Automatic language identification in texts Computational linguistics Text processing (Computer science) Linguistique informatique Traitement de texte computational linguistics aat Computerlinguistik (DE-588)4035843-4 gnd Automatische Spracherkennung (DE-588)4003961-4 gnd |
subject_GND | (DE-588)4035843-4 (DE-588)4003961-4 |
title | Automatic language identification in texts |
title_auth | Automatic language identification in texts |
title_exact_search | Automatic language identification in texts |
title_full | Automatic language identification in texts Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, Krister Lindén |
title_fullStr | Automatic language identification in texts Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, Krister Lindén |
title_full_unstemmed | Automatic language identification in texts Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, Krister Lindén |
title_short | Automatic language identification in texts |
title_sort | automatic language identification in texts |
topic | Computational linguistics Text processing (Computer science) Linguistique informatique Traitement de texte computational linguistics aat Computerlinguistik (DE-588)4035843-4 gnd Automatische Spracherkennung (DE-588)4003961-4 gnd |
topic_facet | Computational linguistics Text processing (Computer science) Linguistique informatique Traitement de texte computational linguistics Computerlinguistik Automatische Spracherkennung |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=035240833&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT jauhiainentommi automaticlanguageidentificationintexts AT zampierimarcos automaticlanguageidentificationintexts AT baldwintimothy automaticlanguageidentificationintexts AT lindenkrister automaticlanguageidentificationintexts |