Coptic NLP Service

Enter Coptic text in UTF-8 (XML markup is also allowed; maximum 10,000 characters).
Bound groups should be separated by spaces or underscores.
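
As a rough illustration of the input convention above, the following sketch splits plain input (no XML markup) into bound groups on spaces or underscores and checks the stated length limit. The function name and the way the limit is counted are assumptions for illustration, not the service's own code.

```python
import re

MAX_CHARS = 10_000  # limit stated on the form; counting raw characters is an assumption


def split_bound_groups(text: str) -> list[str]:
    """Split plain input into bound groups on whitespace or underscores."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"input exceeds {MAX_CHARS} characters")
    # Runs of spaces and/or underscores count as a single separator.
    return [group for group in re.split(r"[\s_]+", text) if group]


groups = split_bound_groups("ⲁϥⲥⲱⲧⲙ_ϩⲙ ⲡⲏⲓ")
```

Input containing XML markup would need the tags set aside first, since attribute values may themselves contain spaces.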

If you need to analyze longer texts or multiple texts automatically, you can log in to the secure area or use the API. To request a login, please contact Amir Zeldes.

Use old finite state tokenizer: Less accurate, provided for reproducing older results. Not compatible with detokenization.
Re-merge bound groups: Regularizes bound group spaces if input does not follow Layton's guidelines (a.k.a. 'Laytonization'; increases accuracy on Till-segmented text and OCR).

    Conservative merging: Only re-bind items known to appear unbound in other segmentations (e.g. well edited text following Till).
        ϩⲙ ⲡⲏⲓ --> ϩⲙ|ⲡ|ⲏⲓ

    Aggressive merging: Re-bind all items that are unlikely to appear unbound (better for messy data/OCR).
        ⲁ ϥⲥⲱⲧⲙ --> ⲁ|ϥ|ⲥⲱⲧⲙ

    Smart merging: Re-bind items using a context sensitive machine learning binder (trained on editions by E.A.W. Budge).
        ⲉ ⲃⲟⲗ ⲙ ⲡⲏⲓ --> ⲉⲃⲟⲗ ⲙ|ⲡ|ⲏⲓ

    Segment at merge point: If bound groups are merged, assume a morpheme boundary (recommended if base segmentation is reliable).
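
The idea behind conservative merging with "Segment at merge point" can be sketched as follows. This is only an illustration: the item list is hypothetical (seeded with ϩⲙ from the example above), the function name is invented, and the sketch only re-binds groups; splitting the merged group further into morphemes (ⲡ|ⲏⲓ) is left to the tokenizer.

```python
# Hypothetical list of items that some segmentations write unbound.
UNBOUND_ITEMS = {"ϩⲙ"}


def conservative_merge(groups, segment_at_merge=True):
    """Re-bind listed items to the bound group that follows them."""
    # With segment_at_merge, a pipe marks an assumed morpheme boundary.
    joiner = "|" if segment_at_merge else ""
    merged = []
    pending = ""  # items waiting to be bound to the next group
    for group in groups:
        if group in UNBOUND_ITEMS:
            pending += group + joiner
        else:
            merged.append(pending + group)
            pending = ""
    if pending:
        # A trailing item has nothing to bind to, so keep it unchanged.
        merged.append(pending.rstrip("|"))
    return merged


print(conservative_merge(["ⲁϥⲥⲱⲧⲙ", "ϩⲙ", "ⲡⲏⲓ"]))
```

An aggressive variant would differ mainly in using a larger item list, while the smart option replaces the list lookup with a trained, context sensitive model.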
SGML pipeline

    Stretch milestones: This setting replaces unary XML elements with binary ones. For example, for milestone page break elements: <pb/> → <pb> ... </pb>

    Tokenize `[stk-6.0.0]`: Automatic, or From pipes in input

    Normalize: Disable to remove norm_group attribute from output. Diacritic stripping will still be done for processing norm units.

    Tag `[flairbert-6.0.0]`

    Lemmatize

    Language of origin

    MWE recognition: Enable to automatically recognize multiword expressions (MWEs), e.g. ϭⲱⲗⲡ ⲉⲃⲟⲗ. Known MWEs are retrieved from the Coptic Dictionary Online.

    Parse `[diabert-UD2.15]`

    Entity recognition: Identify sequences of words referring to people, places and more.

Just piped and dashed morphemes
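
To make the "Stretch milestones" setting more concrete, here is a minimal sketch of turning unary milestone elements into binary ones. It assumes each stretched element extends up to the next milestone of the same name or to the end of the input; how far the service actually stretches elements is not stated here, and the function name is hypothetical. Nesting inside other elements is ignored.

```python
import re


def stretch_milestones(sgml: str, tag: str = "pb") -> str:
    """Replace unary <tag/> elements with <tag> ... </tag> spans."""
    # Matches <pb/> as well as <pb n="1"/>, capturing any attributes.
    pattern = re.compile(rf"<{re.escape(tag)}(\s[^<>]*?)?\s*/>")
    out = []
    pos = 0
    is_open = False
    for match in pattern.finditer(sgml):
        out.append(sgml[pos:match.start()])
        if is_open:
            out.append(f"</{tag}>")  # close the previous stretched element
        attrs = (match.group(1) or "").rstrip()
        out.append(f"<{tag}{attrs}>")
        is_open = True
        pos = match.end()
    out.append(sgml[pos:])
    if is_open:
        out.append(f"</{tag}>")  # the last element runs to the end of input
    return "".join(out)


print(stretch_milestones('<pb n="1"/>ⲁϥⲥⲱⲧⲙ <pb n="2"/>ϩⲙ ⲡⲏⲓ'))
```

Text before the first milestone is left outside any element, and input without milestones is returned unchanged.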

Dialect:

Input:

Output: