Title: | A Tidy Data Model for Natural Language Processing |
---|---|
Description: | Provides a set of fast tools for converting a textual corpus into a set of normalized tables. Users may make use of the 'udpipe' back end with no external dependencies, or a Python back ends with 'spaCy' <https://spacy.io>. Exposed annotation tasks include tokenization, part of speech tagging, named entity recognition, and dependency parsing. |
Authors: | Taylor B. Arnold [aut, cre] |
Maintainer: | Taylor B. Arnold <[email protected]> |
License: | LGPL-2 |
Version: | 3.1.0 |
Built: | 2024-12-23 05:21:37 UTC |
Source: | https://github.com/taylor-arnold/cleannlp |
Provides a set of fast tools for converting a textual corpus into a set of normalized tables. Multiple NLP backends can be used, with the output standardized into a normalized format. Options include stringi (very fast, but only provides tokenization), udpipe (fast, many languages, includes part of speech tags and dependencies), and spacy (python backend; includes named entity recognition).
Once the package is set up, run one of cnlp_init_stringi
,
cnlp_init_spacy
, or cnlp_init_udpipe
to load
the desired NLP backend. After this function is done running, use
cnlp_annotate
to run the annotation engine over a corpus of
text. The package vignettes provide more detailed set-up information.
Useful links:
## Not run: library(cleanNLP) # load the annotation engine cnlp_init_stringi() # annotate your text input <- data.frame( text=c( "This is a sentence.", "Here is something else to parse!" ), stringsAsFactors=FALSE ) ## End(Not run)
## Not run: library(cleanNLP) # load the annotation engine cnlp_init_stringi() # annotate your text input <- data.frame( text=c( "This is a sentence.", "Here is something else to parse!" ), stringsAsFactors=FALSE ) ## End(Not run)
Runs the clean_nlp annotators over a given corpus of text
using the desired backend. The details for which annotators to run and
how to run them are specified by using one of:
cnlp_init_stringi
, cnlp_init_spacy
, or
cnlp_init_udpipe
.
cnlp_annotate( input, backend = NULL, verbose = 10, text_name = "text", doc_name = "doc_id" )
cnlp_annotate( input, backend = NULL, verbose = 10, text_name = "text", doc_name = "doc_id" )
input |
an object containing the data to parse. Either a
character vector with the texts (optional names can
be given to provide document ids) or a data frame. The
data frame should have a column named 'text' containing
the raw text to parse; if there is a column named
'doc_id', it is treated as a a document identifier.
The name of the text and document id columns can be
changed by setting |
backend |
name of the backend to use. Will default to the last model to be initalized. |
verbose |
set to a positive integer n to display a progress message to display every n'th record. The default is 10. Set to a non-positive integer to turn off messages. Logical input is converted to an integer, so it also possible to set to TRUE (1) to display a message for every document and FALSE (0) to turn off messages. |
text_name |
column name containing the text input. The default
is 'text'. This parameter is ignored when |
doc_name |
column name containing the document ids. The default
is 'doc_id'. This parameter is ignored when
|
The returned object is a named list where each element containing a data frame. The document table contains one row for each document, along with with all of the metadata that was passed as an input. The tokens table has one row for each token detected in the input. The first three columns are always "doc_id" (to index the input document), "sid" (an integer index for the sentence number), and "tid" (an integer index to the specific token). Together, these are a primary key for each row.
Other columns provide extracted data about each token, which differ slightly based on which backend, language, and options are supplied.
token: detected token, as given in the original input
token_with_ws: detected token along with white space; in, theory, collapsing this field through an entire document will yield the original text
lemma: lemmatised version of the token; the exact form depends on the choosen language and backend
upos: the universal part of speech code; see https://universaldependencies.org/u/pos/all.html for more information
xpos: language dependent part of speech code; the specific categories and their meaning depend on the choosen backend, model and language
feats: other extracted linguistic features, typically given as Universal Dependencies (https://universaldependencies.org/u/feat/index.html), but can be model dependent; currently only provided by the udpipe backend
tid_source: the token id (tid) of the head word for the dependency relationship starting from this token; for the token attached to the root, this will be given as zero
relation: the dependency relation, usually provided using Universal Dependencies (more information available here https://universaldependencies.org/ ), but could be different for a specific model
a named list with components "token", "document" and (when running spacy with NER) "entity".
Taylor B. Arnold, [email protected]
cnlp_init_stringi() cnlp_annotate(un)
cnlp_init_stringi() cnlp_annotate(un)
The cleanNLP package does not supply the model files required for using the spacy backend. These files can be downloaded with this function. If you need more control, download directly from Python.
cnlp_download_spacy(model_name = "en_core_web_sm")
cnlp_download_spacy(model_name = "en_core_web_sm")
model_name |
string giving the model namel. Defaults to "en_core_web_sm". |
## Not run: cnlp_download_spacy(model_name="en_core_web_sm") ## End(Not run)
## Not run: cnlp_download_spacy(model_name="en_core_web_sm") ## End(Not run)
This function must be run before annotating text with the spacy backend. It sets the properties for the spacy engine and loads the file using the R to Python interface provided by reticulate.
cnlp_init_spacy(model_name = NULL, disable = NULL, max_length = NULL)
cnlp_init_spacy(model_name = NULL, disable = NULL, max_length = NULL)
model_name |
string giving the model name for the spacy backend. Defaults to "en_core_web_sm" (English) if set to NULL. |
disable |
an optional vector of pipes to disable. |
max_length |
amount of temporary memory provided to Spacy, in characters. The default of 1000000 should work for most applications, but can be increased when working with long documents. |
Taylor B. Arnold, [email protected]
## Not run: cnlp_init_spacy(model_name = "en_core_web_sm") ## End(Not run)
## Not run: cnlp_init_spacy(model_name = "en_core_web_sm") ## End(Not run)
This function must be run before annotating text with the tokenizers backend.
cnlp_init_stringi(locale = NULL, include_spaces = FALSE)
cnlp_init_stringi(locale = NULL, include_spaces = FALSE)
locale |
string giving the locale name to
pass to the stringi functions. If
|
include_spaces |
logical. Should spaces be included as tokens in the output. Defaults to FALSE |
Taylor B. Arnold, [email protected]
## Not run: cnlp_init_stringi() ## End(Not run)
## Not run: cnlp_init_stringi() ## End(Not run)
This function must be run before annotating text with the udpipe backend. It will parse in English by default, but you can load other models as well.
cnlp_init_udpipe( model_name = NULL, model_path = NULL, tokenizer = "tokenizer", tagger = "default", parser = "default" )
cnlp_init_udpipe( model_name = NULL, model_path = NULL, tokenizer = "tokenizer", tagger = "default", parser = "default" )
model_name |
string giving the model namel.
Defaults to "english" if NULL.
Ignored if |
model_path |
provide a full path to a model file. |
tokenizer |
a character string of length 1, which is either 'tokenizer' (default udpipe tokenisation) or a character string with more complex tokenisation options as specified in <URL: http://ufal.mff.cuni.cz/udpipe/users-manual> in which case tokenizer should be a character string where the options are put after each other using the semicolon as separation. |
tagger |
a character string of length 1, which is either 'default' (default udpipe POS tagging and lemmatisation) or 'none' (no POS tagging and lemmatisation needed) or a character string with more complex tagging options as specified in <URL: http://ufal.mff.cuni.cz/udpipe/users-manual> in which case tagger should be a character string where the options are put after each other using the semicolon as separation. |
parser |
a character string of length 1, which is either 'default' (default udpipe dependency parsing) or 'none' (no dependency parsing needed) or a character string with more complex parsing options as specified in <URL: http://ufal.mff.cuni.cz/udpipe/users-manual> in which case parser should be a character string where the options are put after each other using the semicolon as separation. |
Taylor B. Arnold, [email protected]
## Not run: cnlp_init_udpipe(model_name = "english") ## End(Not run)
## Not run: cnlp_init_udpipe(model_name = "english") ## End(Not run)
Takes a matrix and returns a data frame with the top principal components extracted. This is a simple but powerful technique for visualizing a corpus of documents.
cnlp_utils_pca(x, k = 2, center = TRUE, scale = TRUE)
cnlp_utils_pca(x, k = 2, center = TRUE, scale = TRUE)
x |
a matrix object to pass to |
k |
integer. The number of components to include in the output. |
center |
logical. Should the data be centered? |
scale |
logical. Should the data be scaled? Note that this will
need to be set to false if any columns in |
a data frame object containing the top k
principal
components of the data in x.
Given annotations, this function returns the term-frequency inverse document frequency (tf-idf) matrix from the extracted lemmas.
cnlp_utils_tfidf( object, tf_weight = c("lognorm", "binary", "raw", "dnorm"), idf_weight = c("idf", "smooth", "prob", "uniform"), min_df = 0.1, max_df = 0.9, max_features = 10000, doc_var = "doc_id", token_var = "lemma", vocabulary = NULL, doc_set = NULL ) cnlp_utils_tf( object, tf_weight = "raw", idf_weight = "uniform", min_df = 0, max_df = 1, max_features = 10000, doc_var = "doc_id", token_var = "lemma", vocabulary = NULL, doc_set = NULL )
cnlp_utils_tfidf( object, tf_weight = c("lognorm", "binary", "raw", "dnorm"), idf_weight = c("idf", "smooth", "prob", "uniform"), min_df = 0.1, max_df = 0.9, max_features = 10000, doc_var = "doc_id", token_var = "lemma", vocabulary = NULL, doc_set = NULL ) cnlp_utils_tf( object, tf_weight = "raw", idf_weight = "uniform", min_df = 0, max_df = 1, max_features = 10000, doc_var = "doc_id", token_var = "lemma", vocabulary = NULL, doc_set = NULL )
object |
a data frame containing an identifier for the document
(set with |
tf_weight |
the weighting scheme for the term frequency matrix.
The selection |
idf_weight |
the weighting scheme for the inverse document
matrix. The selection |
min_df |
the minimum proportion of documents a token should be in to be included in the vocabulary |
max_df |
the maximum proportion of documents a token should be in to be included in the vocabulary |
max_features |
the maximum number of tokens in the vocabulary |
doc_var |
character vector. The name of the column in
|
token_var |
character vector. The name of the column in
|
vocabulary |
character vector. The vocabulary set to use in
constructing the matrices. Will be computed
within the function if set to |
doc_set |
optional character vector of document ids. Useful to
create empty rows in the output matrix for documents
without data in the input. Most users will want to keep
this equal to |
a sparse matrix with dimnames giving the documents and vocabular.
Data frame containing the 30 Articles in the United Nations' Universal Declaration of Human Rights, ratified on 10 December 1948 in Paris, France.
https://www.un.org/en/universal-declaration-human-rights/
A dataset of the 150k most frequently used English words, extracted by Peter Norvig from the Google Web Trillion Word Corpus. Frequencies are multiplied by 100.