Package 'franc'

Title: Detect the Language of Text
Description: Detects the language of text, with no external dependencies and support for 335 languages, including all languages spoken by more than one million speakers. 'Franc' is a port of the 'JavaScript' project of the same name, see <https://github.com/wooorm/franc>.
Authors: Gabor Csardi, Titus Wormer, Maciej Ceglowski, Jacob R. Rideout, and Kent S. Johnson
Maintainer: Gábor Csárdi <[email protected]>
License: MIT + file LICENSE
Version: 1.1.4.9000
Built: 2024-06-18 02:14:59 UTC
Source: https://github.com/gaborcsardi/franc

Help Index


Detect the language of a string

Description

Detects the most likely language of a string and returns its three letter ISO 639-3 code.

Usage

franc(text, min_speakers = 1e+06, whitelist = NULL, blacklist = NULL,
  min_length = 10, max_length = 2048)

Arguments

text

A string constant. It should be at least min_length characters long (10 by default). Only the first max_length characters are used (2048 by default), to keep detection reasonably fast.

min_speakers

Languages with at least this many speakers are checked. By default this is one million. Set it to zero to include all languages known by franc. See also speakers.

whitelist

List of three letter language codes to check against.

blacklist

List of three letter language codes not to check against.

min_length

Minimum number of characters required in the text.

max_length

Maximum number of characters used from the text. By default only the first 2048 characters are used.

Value

A three letter ISO 639-3 language code, the detected language of the text. "und" is returned if the input is too short.

See Also

franc_all for scores against many languages, speakers.

Examples

## afr
franc("Alle menslike wesens word vry")

## nno
franc("Alle mennesker er født frie og")

## Too short, und
franc("the")

## You can change what’s too short (default: 10), sco
franc("the", min_length = 3)
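Because "und" is an ordinary return value rather than an error, calling code may want to turn it into a missing value. The following is a minimal sketch of one way to do that. The helper name detect_or_na and its detector argument are hypothetical and not part of the package; the detector argument exists only so the helper can be tried with a stand-in function.

```r
# Hypothetical helper (not part of franc): map the "und" result to NA.
# `detector` defaults to franc::franc; the default is evaluated lazily,
# so the package is only needed when no other detector is supplied.
detect_or_na <- function(text, ..., detector = franc::franc) {
  code <- detector(text, ...)
  if (isTRUE(code == "und")) NA_character_ else code
}

if (requireNamespace("franc", quietly = TRUE)) {
  # Too short for the default min_length, so this should give NA
  print(detect_or_na("the"))
  # Restrict the candidates with a whitelist
  print(detect_or_na("Alle menslike wesens word vry",
                     whitelist = c("afr", "nld", "deu")))
}
```

Extra arguments such as min_length, whitelist or min_speakers are passed through unchanged to the detector.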

List of probable languages for a text

Description

Returns the scores for all languages that use the same script as the input text, in decreasing order of probability. The score is calculated from the distance between the trigram distribution of the input text and that of each language model. The smaller the distance, the higher the score. Scores are scaled so that the closest language has a score of 1.
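To make the idea of a trigram distance concrete, here is a simplified base R sketch. It is not the package's implementation: the function names, the penalty of 300 for a trigram missing from a model, the scaling formula, and the tiny sample sentences used as stand-in "models" are all assumptions made for illustration. The package itself ships precompiled language models.

```r
# Rank the character trigrams of a text by frequency (1 = most frequent).
trigram_ranks <- function(text) {
  text <- tolower(text)
  n <- nchar(text)
  if (n < 3) return(integer(0))
  starts <- seq_len(n - 2)
  grams <- substring(text, starts, starts + 2)
  counts <- sort(table(grams), decreasing = TRUE)
  stats::setNames(seq_along(counts), names(counts))
}

# Sum of rank differences; trigrams absent from the model get `max_diff`.
trigram_distance <- function(input_ranks, model_ranks, max_diff = 300L) {
  diffs <- abs(input_ranks - model_ranks[names(input_ranks)])
  diffs[is.na(diffs)] <- max_diff
  sum(diffs)
}

# Score each toy model, scaled so that the closest one gets a score of 1.
score_languages <- function(text, models, max_diff = 300L) {
  input <- trigram_ranks(text)
  dist <- vapply(models, function(m) {
    as.numeric(trigram_distance(input, trigram_ranks(m), max_diff))
  }, numeric(1))
  span <- length(input) * max_diff - min(dist)
  score <- if (span > 0) 1 - (dist - min(dist)) / span else rep(1, length(dist))
  out <- data.frame(language = names(models), score = unname(score),
                    stringsAsFactors = FALSE)
  out[order(out$score, decreasing = TRUE), ]
}

# Toy "models": single sentences standing in for real trigram profiles.
toy_models <- c(
  eng = "the quick brown fox jumps over the lazy dog and the cat",
  deu = "der hund und die katze laufen durch den garten"
)
print(score_languages("the dog and the fox", toy_models))
```

The result has the same shape as the value described below: one row per language, sorted by decreasing score.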

Usage

franc_all(text, min_speakers = 1e+06, whitelist = NULL,
  blacklist = NULL, min_length = 10, max_length = 2048)

Arguments

text

A string constant. It should be at least min_length characters long (10 by default). Only the first max_length characters are used (2048 by default), to keep detection reasonably fast.

min_speakers

Languages with at least this many speakers are checked. By default this is one million. Set it to zero to include all languages known by franc. See also speakers.

whitelist

List of three letter language codes to check against.

blacklist

List of three letter language codes not to check against.

min_length

Minimum number of characters required in the text.

max_length

Maximum number of characters used from the text. By default only the first 2048 characters are used.

Value

A data frame with columns language and score. The language column contains the three letter ISO-639-3 language codes. The score column contains the scores.

See Also

franc if you only want the top result, speakers.

Examples

head(franc_all("O Brasil caiu 26 posições"))

## Provide a whitelist:
franc_all("O Brasil caiu 26 posições",
  whitelist = c("por", "src", "glg", "spa"))

## Provide a blacklist:
head(franc_all("O Brasil caiu 26 posições",
  blacklist = c("src", "glg", "lav")))
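Since scores are relative to the best match, several languages can score close to 1 for short or ambiguous input. A small sketch of how one might keep only the near-top candidates follows. The helper top_candidates and the threshold of 0.9 are hypothetical choices, not part of the package.

```r
# Hypothetical helper: keep rows whose score is at or above a threshold.
# Expects a data frame with `language` and `score` columns, as returned
# by franc_all().
top_candidates <- function(scores, threshold = 0.9) {
  stopifnot(all(c("language", "score") %in% names(scores)))
  scores[scores$score >= threshold, , drop = FALSE]
}

if (requireNamespace("franc", quietly = TRUE)) {
  scores <- franc::franc_all("O Brasil caiu 26 posições")
  print(top_candidates(scores))
}
```

If more than one row survives the filter, the detection is probably too uncertain to rely on the top result alone.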

Number of speakers for 370 languages

Description

This is a superset of all languages detected by franc. Numbers were collected by Titus Wormer. To quote him: "Painstakingly crawled by hand from OHCHR, the numbers are (in some cases, very) rough estimates or out-of-date."

Usage

speakers

Format

A data frame with columns:

language

Three letter language code.

speakers

Number of speakers.

name

Full name of language.

iso6391

ISO 639-1 codes. See more at https://en.wikipedia.org/wiki/ISO_639.

iso6392

ISO 639-2T codes. See more at https://en.wikipedia.org/wiki/ISO_639.
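The language column can serve as a key for attaching full names and speaker counts to detection results. The sketch below shows one way to do this with merge(). The helper add_language_names is hypothetical, and the code assumes the data set is reachable as franc::speakers.

```r
# Hypothetical helper: add `name` and `speakers` to a franc_all() result.
# `speakers_table` defaults to franc::speakers (evaluated lazily), but any
# data frame with `language`, `name` and `speakers` columns will do.
add_language_names <- function(scores, speakers_table = franc::speakers) {
  info <- speakers_table[, c("language", "name", "speakers")]
  out <- merge(scores, info, by = "language", all.x = TRUE)
  # merge() sorts by the key, so restore the decreasing score order
  out[order(out$score, decreasing = TRUE), ]
}

if (requireNamespace("franc", quietly = TRUE)) {
  scores <- franc::franc_all("O Brasil caiu 26 posições")
  print(head(add_language_names(scores)))
}
```

Using all.x = TRUE keeps every scored language, even if it has no matching row in the speakers table.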