Coptic NLP Service



Enter Coptic text in UTF-8 (XML markup is also allowed).
Bound groups should be separated by spaces or underscores.

Dialect:

Auto-detect
Sahidic
Bohairic

Input:

My data contains meaningful linebreaks

This inserts <line>..</line> tags around each line of text.
If you already have <lb/> tags or your data is already tokenized, you probably want to ignore line breaks.
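As a rough illustration of the two linebreak settings, the sketch below wraps each line in <line>..</line> tags or joins the lines with spaces. The function names are hypothetical, and skipping empty lines is an assumption; the service's own handling of blank lines is not described here.

```python
def wrap_lines(text: str) -> str:
    """Wrap every non-empty line in <line>...</line> tags (meaningful linebreaks)."""
    wrapped = []
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped:  # assumption: blank lines are dropped rather than tagged
            wrapped.append(f"<line>{stripped}</line>")
    return "\n".join(wrapped)


def ignore_linebreaks(text: str) -> str:
    """Treat linebreaks as ordinary spaces between bound groups."""
    return " ".join(line.strip() for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    sample = "ⲁ ϥⲥⲱⲧⲙ\nϩⲙ ⲡⲏⲓ"
    print(wrap_lines(sample))
    print(ignore_linebreaks(sample))
```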

Ignore linebreaks in my data

Output:

Use old finite state tokenizer
Less accurate, provided for reproducing older results. Not compatible with detokenization.

Re-merge bound groups
Regularizes bound group spaces if input does not follow Layton's guidelines (a.k.a. 'Laytonization'; increases accuracy on Till-segmented text and OCR).

  Conservative merging
  Only re-bind items known to appear unbound in other segmentations (e.g. well-edited text following Till).
  ϩⲙ ⲡⲏⲓ --> ϩⲙ|ⲡ|ⲏⲓ

  Aggressive merging
  Re-bind all items that are unlikely to appear unbound (better for messy data/OCR).
  ⲁ ϥⲥⲱⲧⲙ --> ⲁ|ϥ|ⲥⲱⲧⲙ

  Smart merging
  Re-bind items using a context-sensitive machine learning binder (trained on editions by E.A.W. Budge).
  ⲉ ⲃⲟⲗ ⲙ ⲡⲏⲓ --> ⲉⲃⲟⲗ ⲙ|ⲡ|ⲏⲓ

  Segment at merge point
  If bound groups are merged, assume a morpheme boundary (recommended if base segmentation is reliable).
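The idea behind list-based merging with "Segment at merge point" can be pictured with a minimal sketch. This is not the service's implementation: the function name, the `rebind` set, and its contents are hypothetical, and the real lists of re-bindable items are not given on this page. The sketch only merges groups and marks the merge point with a pipe; any further segmentation inside a group (such as ⲡ|ⲏⲓ in the examples above) is left to the tokenizer.

```python
def merge_groups(text: str, rebind: set[str], segment_at_merge: bool = True) -> str:
    """Re-bind listed items to the following bound group.

    `rebind` is a hypothetical set of items that should not stand alone.
    With `segment_at_merge`, a pipe is placed where two groups were joined.
    """
    merged = []
    pending = ""  # prefix built up from items waiting for a host group
    for group in text.split():
        if group in rebind:
            pending += group + ("|" if segment_at_merge else "")
        else:
            merged.append(pending + group)
            pending = ""
    if pending:
        # A trailing item has nothing to bind to; keep it as its own group.
        merged.append(pending.rstrip("|"))
    return " ".join(merged)


if __name__ == "__main__":
    # "ϩⲙ" is taken from the conservative merging example; the set is illustrative only.
    print(merge_groups("ϩⲙ ⲡⲏⲓ", {"ϩⲙ"}))
    print(merge_groups("ϩⲙ ⲡⲏⲓ", {"ϩⲙ"}, segment_at_merge=False))
```

A larger `rebind` set would correspond loosely to more aggressive merging, while the smart option described above relies on a trained model rather than a fixed list.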
SGML pipeline

  Stretch milestones
  This setting replaces unary XML elements with binary ones. For example, for milestone page break elements: <pb/> → <pb> ... </pb>

  Tokenize `[stk-6.0.0]`
    Automatic
    From pipes in input

  Normalize
  Disable to remove norm_group attribute from output. Diacritic stripping will still be done for processing norm units.

  Tag `[flairbert-6.0.0]`

  Lemmatize

  Language of origin

  MWE recognition
  Enable to automatically recognize multiword expressions (MWEs), e.g. ϭⲱⲗⲡ ⲉⲃⲟⲗ. Known MWEs are retrieved from the Coptic Dictionary Online.

  Parse `[diabert-UD2.15]`

  Entity recognition
  Identify sequences of words referring to people, places and more.

Just piped and dashed morphemes
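For input that is already segmented, the "From pipes in input" tokenization setting can be pictured as reading bound groups (separated by spaces or underscores) and splitting each one at its pipes. The sketch below is an assumption about that reading, with a hypothetical function name; it does not handle XML markup or dashes inside tokens.

```python
import re


def tokens_from_pipes(text: str) -> list[list[str]]:
    """Return one list of tokens per bound group, split at pipe characters."""
    # Bound groups are separated by whitespace or underscores.
    groups = [g for g in re.split(r"[\s_]+", text.strip()) if g]
    # Within a group, each pipe marks a token boundary.
    return [[tok for tok in group.split("|") if tok] for group in groups]


if __name__ == "__main__":
    for group in tokens_from_pipes("ⲉⲃⲟⲗ ⲙ|ⲡ|ⲏⲓ"):
        print(group)
```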

Result:

Coptic SCRIPTORIUM is supported by the National Endowment for the Humanities Office of Digital Humanities and Division of Preservation and Access, Georgetown University, The University of Oklahoma, the University of the Pacific, and Canisius College.


Unless otherwise indicated, Coptic SCRIPTORIUM website and content
is licensed under a Creative Commons Attribution 4.0 International License.