<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:xi="http://www.w3.org/2001/XInclude">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Encoding Guidelines for the ELTeC</title>
<author>Cost Action CA16204 – WG1</author>
</titleStmt>
<publicationStmt>
<p>Unpublished draft for discussion</p>
</publicationStmt>
<sourceDesc>
<p>A born digital document drafted in TEI format by LB</p>
</sourceDesc>
</fileDesc>
<revisionDesc>
<change when="2018-03-16">Revised post-Prague</change>
<change when="2018-01-17">Expanded metadata section a bit; added comments from CO and
BN</change>
<change when="2017-12-17">First (partial) discussion draft</change>
</revisionDesc>
</teiHeader>
<text>
<body>
<p rend="red">THIS DOCUMENT HAS BEEN SUPERSEDED AND IS NOW OF HISTORICAL INTEREST ONLY!</p><p rend="red">For current Encoding Guidelines, please
refer to <ptr target="https://distantreading.github.io/eltec-0.html"/> (level zero)
or <ptr target="https://distantreading.github.io/eltec-1.html"/> (level one)
</p><p>This reference document defines the encoding scheme to be used
for the European Literary Text Collection (ELTeC) which will be a major deliverable of
COST Action 16204, <title>Distant Reading</title>. This draft reflects decisions taken
at the WG1 meeting in Prague in February 2018, but has not yet been formally reviewed by
the Work Group. Particular topics on which policy remains to be
defined are signalled below with the label <q>Open Question</q>. </p>
<div>
<head>Principles</head>
<p>The MoU for the project points out that <q>Distant Reading methods cover a wide range
of computational methods for literary text analysis, such as authorship
attribution, topic modelling, character network analysis, or stylistic
analysis.</q> The focus of the ELTeC encoding scheme is thus not to represent
texts in all their original complexity of structure or appearance, but rather to
facilitate a richer and better-informed distant reading than a transcription of its
lexical content alone would permit. For example, it seems useful to distinguish
headings and annotations from the rest of the text, and to be able to locate
stretches of text within gross structural features such as pages, chapters, or paragraphs.
Although it may be useful to distinguish passages belonging to different narrative
levels (for example, direct speech versus narrative or quotation versus narrative),
it is difficult to do so automatically with any degree of consistency.
It is certainly less useful to record
exact nuances of rendition or spelling in a particular version of a text. Our goal is
thus not to duplicate the work of scholarly editors or to produce (yet another) digital
edition of a specific source document. Rather it is to ensure that the ELTeC texts can be
processed by very simple-minded (but XML-aware) systems primarily concerned with
lexis and to make life easier for the developers of such systems. </p>
<p>In selecting features for inclusion in the markup scheme, we have been guided, but
not limited, by existing practice as far as possible. Our main goal has been to
identify a small core set of textual features which can be readily
(preferably automatically) identified in existing digital transcriptions, or easily
and consistently provided by new transcriptions. </p>
<p>We distinguish three <soCalled>levels</soCalled> of encoding, referred to below as
level zero, level one and level two. All ELTeC texts are made available at level zero,
the basic encoding format. Some texts may additionally be made available at levels
one or two, which provide a richer set of encoded features. For example: a level one
text will include information about rendition features missing from a level zero
text; a level two text will include tokenization information missing from a level one
text. As far as possible, conversion between levels will be scripted,
although fully automatic conversion is not possible in the general case.
<!-- from level one to level zero is a simple lossy transformation;
conversion from level one or level zero to level two loses some information but adds
Conversion scripts from where may be prepared in either level zero or level
one-->
<!--0. The eltec schema has a minimal level (level zero) in which (more or
less) only the features enumerated in Prague are permitted/guaranteed.
Other features (e.g. linebreaks, pagebreaks, highlighting) are not
annotated/encoded/represented, even if they were available in a
pre-existing digital version from which the eltec version derives. The
details are to be determined.
1. At level 1, a slightly richer set of features -\- basically the
intersection of what is available and useful in most pre-existing
digital versions -\- are provided and required. This feature set is
also to be defined: I think it should include e.g. pagebreaks and
highlighting, but not linebreaks or editorial interventions such as
correction or normalization.
2. At level 2, a different set of largely linguistically-motivated
annotations are provided, typically involving tokenisation and
morpho-syntax. Note that level 2 annotation can be added to a level 0
text, so the term "level" is perhaps a misnomer. This set is still to
be defined; the current encoding document explicitly defers it to
another workgroup cycle.-->
</p>
<p>This document lists all the textual features which are to be distinguished in an
ELTeC conformant transcription at one of these three levels.
Whenever a given feature exists in a text, it will be
marked up as indicated here. No other features will be captured by the markup: if
some textual feature not provided for here is identified by a marked up source text,
that markup will be removed (though it may be retained in a version of the text
encoded at a different level). </p>
<p>All ELTeC documents are TEI conformant, and therefore include a TEI Header, as discussed
in section <ptr target="#hdr"/> below.</p>
<!--
<p>We make no attempt to propose markup for linguistic annotations here. The assumption
is that this will be produced by different annotation systems in different ways,
though with an association between such annotations and the basic lexical structures
represented by the core ELTeC markup. </p>-->
</div>
<div>
<head>Basic Transcription Guidelines (all levels) </head>
<p>The basic unit of the ELTeC corpus is the text of a single novel, represented by a
TEI <gi>text</gi> element. We propose no mechanism (other than metadata) to encode
units larger than a single novel, such as multipart novel series like Proust's
<title>À la recherche du temps perdu</title> or Zola's <title>Les
Rougon-Macquart</title>. </p>
<!-- <p><label>Open Question</label> Should we include liminal matter (titlepages, prefaces,
appendixes...) in our transcriptions? The following policies seem possible: <list>
<item>No : these typically belong to a particular edition or version of the text,
and should therefore systematically be excluded</item>
<item>Yes : these often form a significant part of the reader's experience (cf.
the foreword to most editions of <title>David Copperfield</title>). Mark them
up using <gi>front</gi> and <gi>back</gi> as appropriate.</item>
<item>Sort of : do not transcribe them, but indicate that they have been
suppressed by using the <gi>gap</gi> element. </item>
</list></p>-->
<p>To facilitate checking of a transcription against its source during
production, the <gi>pb</gi> element must be provided to mark the point in a
transcript where a new page begins. If a page begins with the second part of a
hyphenated word, the <gi>pb</gi> tag may appear after that, but otherwise its
position should be the same in transcription and source. The <gi>pb</gi> element has
an attribute <att>n</att> which should be used to number the pages. A level 1 text may
also provide a <att>facs</att> attribute to point to a page image of the corresponding
source page. </p>
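<p>The following sketch illustrates this usage. The page numbers, the sample sentences,
and the image path given in <att>facs</att> are invented for illustration only. In the
second paragraph the word <q>considered</q> is assumed to be hyphenated across the page
break in the source, so the <gi>pb</gi> follows the reassembled word.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<p>She read the letter once, folded it, and then read it a
<pb n="43" facs="images/page043.png"/>second time before replying.</p>
<p>He had never seriously considered<pb n="44"/> the alternative.</p>
</egXML>
</p>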
<p>As well as a titlepage or a table of contents, a published novel often includes material such as forewords or appendixes in addition
to the text of the novel itself. This <term>liminal</term> matter is included in an
ELTeC text only if it is believed to be authorial. Material before the body of the
text begins is collected within a <gi>front</gi> element, and material following the
body in a
<gi>back</gi> element. In either case, distinct sections of the material, if
encoded, are represented by a
<gi>div</gi> with its <att>type</att> attribute set to <code>liminal</code>.</p><p>At
level zero, titlepages and tables of contents are omitted. At level one, they are
replaced by a <gi>gap</gi> element. Non-authorial liminal material is silently
omitted at all levels. </p>
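<p>A possible skeleton for a level one text with an authorial preface and appendix is
sketched below. The headings are invented, and both the position of the <gi>gap</gi>
within <gi>front</gi> and the absence of attributes on it are assumptions rather than
settled policy.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<text xml:lang="en">
<front>
<!-- titlepage and table of contents, not transcribed -->
<gap/>
<div type="liminal">
<head>Preface</head>
<p>...</p>
</div>
</front>
<body>
<div>
<head>Chapter 1</head>
<p>...</p>
</div>
</body>
<back>
<div type="liminal">
<head>Appendix</head>
<p>...</p>
</div>
</back>
</text>
</egXML>
</p>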
<p rend="red">
The Prague decision list says that we decided to exclude titlepages, tables of contents, errata lists etc., but
to include prefaces, introductions, afterwords, and appendixes, provided these are
contemporary with the text. It also says to include footnotes and commentary, but does
not specify whether these should also be contemporary, nor how the encoder can easily
determine whether or not something is "contemporary".</p>
<p>Within the body of a text, major structural divisions (parts, sections, chapters
etc.) will be captured using the generic <gi>div</gi> element, with attributes
<att>type</att>, <att>xml:lang</att>, <att>xml:id</att> and <att>n</att> used as
further detailed below.</p>
<p>The names used for hierarchic structural divisions of a novel above the chapter are
arbitrary, culture-specific, and often inconsistent : in some novels things called
<q>part</q> contain things called <q>book</q> and in others the reverse. We
propose to follow TEI in using a single element (<gi>div</gi>) for every hierarchical
structural division, down to the level of <q>chapter</q>.</p>
<p><label>Open Question</label> Is it useful to retain the name used for each level in
the original source (the type of div) ? <list>
<!-- CO: Maybe I just missed this information. We can keep the structuring texts/words of the text as headings.
We can mark them with the appropriate element. In this way, we can keep the information without having to define
a fixed list of subdivisions in a novel (you already pointed out that this is not possible). Having marked the existing
headings (and subheadings) they can be included or ignored automatically when processing the document. -->
<!-- LB : I think it would be problematic to just use <head> without using <div>
if that's what you are suggesting. I am only asking whether or not we
include e.g. @type="chapter" on the top level div -->
<!-- BN: I'm not sure. Maybe it could be a good solution the third one:-->
<item> Yes: it is easy to keep and may help referencing : use the <att>type</att>
attribute to hold the name used for each level of div in the work in
question</item>
<item>No : this name adds no useful information beyond the level indicated by the
XML structure </item>
<item>No : it would be more useful to provide an explicit and normalised
indication of the hierarchic level for the benefit of non-XML-aware processors
(e.g. <code>level1</code>, <code>level2</code> etc.)</item>
</list></p>
<p rend="red">This issue was not discussed in Prague. Proposal is to use (and enforce) a predefined list of
specific values. </p>
<p>The (human) language in which a text is expressed is indicated explicitly by the
<att>xml:lang</att> attribute, which supplies the two-letter ISO 639-1 code for the
language concerned. This attribute will always be supplied on the <gi>text</gi>
element to specify a default, and may also appear on other elements to indicate passages where the language changes.
The various
different languages used in a given text will be itemized in its metadata (see
<gi>langUsage</gi> element in the header). </p>
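<p>For example, a novel in English containing one paragraph in French might be marked up
as in the following sketch. The sample text is invented, and placing the attribute on a
whole paragraph is only one possibility.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<text xml:lang="en">
<body>
<div>
<p>The note was brief and unsigned.</p>
<p xml:lang="fr">Je vous attends ce soir au jardin.</p>
<p>She burned it at once.</p>
</div>
</body>
</text>
</egXML>
</p>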
<p><label>Open question</label> Should passages exhibiting regional or dialectal
variation be specially signalled? <list>
<item>No : this is too fine grained and controversial a distinction to be made
with reliable consistency </item>
<item>Yes : treat this in the same way as any other kind of code switching and
define a set of appropriate language codes for the project</item>
<!-- CO: From a linguistic point of view, I would like to say yes but detecting dialectal variation
is something which cannot be done automatically. I think, without a more explicit guidelines concerning
the detection of foreign material we might end up with confusing analysis. The
definition of the element foreign is broad. For example,
neoclassic words may or may not count as foreign. Especially in a corpus containing roman and germanic languages. -->
<!-- BN: This annotation could be complex. I think it will be better to annotate
dialectal variations during linguistic annotation.-->
<!-- LB : maybe we should just use foreign for passages which are signalled
specially in some way in the text. e.g. in italics -->
<item>Maybe : just use the <gi>distinct</gi> element to indicate the kind of
variation concerned</item>
<!-- CO: This element may then be applied to other distinct phenomena as well. I
think this is not the best way. -->
<!-- LB: Many TEI elements have multiple uses, so this is not an argument against using
<distinct> in my view-->
</list></p>
<p rend="red">In Prague there was some support for using either <gi>foreign</gi> or <gi>distinct</gi>, but no decision to do
so. Proposal is not to do so.</p>
<p>A single reference scheme will be defined for the whole corpus, with the following
components: <list>
<item>text identifier: every text will have an identifier consisting of its
two-letter language code and a three-digit serial number, for example
<code>FR042</code></item>
<item>chapter identifier: each chapter or equivalent will have an identifier
concatenating the text identifier and a three-digit serial number; for example,
<code>FR042012</code> is the twelfth chapter of the 42nd French novel. </item>
<item>segment identifier: if sub-chapter segmentation (see below) is implemented, each
segment identifier will append a further four-digit serial number to the chapter identifier.</item>
</list>The identifier will be supplied as the value of an <att>xml:id</att> attribute
on each <gi>text</gi>, <gi>div</gi> or <gi>s</gi> element as appropriate. Adding this
identifier is an easily automated task which can be built into the workflow for
accession to the ELTeC.</p>
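<p>In practice the scheme might look like the following sketch. The text is invented; the
<gi>s</gi> elements assume that sub-chapter segmentation is implemented, and the segment
serial numbers are assumed to restart within each chapter, which the scheme as described
does not specify.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<text xml:id="FR042" xml:lang="fr">
<body>
<!-- intervening chapters omitted from this sketch -->
<div xml:id="FR042012">
<p><s xml:id="FR0420120001">Il pleuvait.</s>
<s xml:id="FR0420120002">Personne ne vint.</s></p>
</div>
</body>
</text>
</egXML>
</p>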
<p>Note that these identifiers will not necessarily correspond with the numbering used
in a particular source text. In a work where the first twelve chapters are considered
to form part one, and the next twelve constitute part two, the first chapter of the
second part will have an identifier ending <code>013</code>, even though it may be
numbered <code>1</code> in a source text.</p><p rend="red">No dissent from this proposal in
Prague</p>
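<p>The sketch below illustrates this numbering for a hypothetical novel <code>EN007</code>
in two parts of twelve chapters each; omitted chapters are indicated by comments. Whether
the enclosing part-level <gi>div</gi> elements carry identifiers of their own is not
specified, so none is shown.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<text xml:id="EN007" xml:lang="en">
<body>
<div>
<head>Part One</head>
<div xml:id="EN007001"><head>Chapter 1</head><p>...</p></div>
<!-- chapters 2 to 11 of part one -->
<div xml:id="EN007012"><head>Chapter 12</head><p>...</p></div>
</div>
<div>
<head>Part Two</head>
<div xml:id="EN007013"><head>Chapter 1</head><p>...</p></div>
<!-- remaining chapters of part two -->
</div>
</body>
</text>
</egXML>
</p>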
<p><label>Open question</label> is it important to preserve the original numbering,
particularly for deeply structured texts? <list>
<item>Yes : the original numbering is widely used to reference the text: it should
be supplied using the <att>n</att> attribute on the <gi>div</gi>.</item>
<!-- CO: we may use here the "head" element as the numbering of a chapter may be analysed as a head, maybe next to other heads. See comment above-->
<!-- BN: I'm not sure. Sometimes scholars refer to specific passages by original numbering ("The chapter two of Don Quixote bla bla bla"). In this case
this information is necessary. -->
<!-- LB: head is not the same as @n : I agree with Borja -->
<item>No : the original numbering and referencing scheme are of no use in our
intended applications, introduce unnecessary complexity, and may be a source of
confusion. </item>
</list></p>
<p rend="red">Not explicitly addressed in Prague. Proposal is not to retain original
numbering.</p>
<p>The chapters of a novel mostly consist of prose, arranged in paragraphs, for which we
will use the TEI <gi>p</gi> element. It is not unusual to find other structures
however, specifically verse, or passages of dialogue presented as if in a play, with
speaker labels and even stage directions. Less frequently, novels may contain
material presented in list or tabular formats. Graphics with their own associated
heading or other text are also frequent. </p>
<p><label>Open Question</label> how should material other than running prose and
dialogue be encoded? <list type="ordered">
<item>Use the appropriate TEI elements for verse or drama (<gi>lg</gi>,
<gi>l</gi>, <gi>sp</gi>, <gi>stage</gi>)</item>
<item>Use the appropriate TEI elements for lists and tables (<gi>list</gi>,
<gi>label</gi>, <gi>item</gi>, <gi>table</gi>, <gi>cell</gi>,
<gi>row</gi>)</item>
<item>Use the appropriate TEI elements for embedded graphics (<gi>figure</gi>,
<gi>graphic</gi>, <gi>head</gi>)</item>
<item>Suppress all non-prose material, replacing it by <gi>gap</gi></item>
</list></p>
<!-- Prague decisions
* annotation of footnotes (we will test whether finding footnotes will be a problem; if so they go to 2nd level ),
* afterwords, appendix, preface, introduction
* include <p>
* no <lb>
* no annotation of a lists, the textual material will be in the corpus with the <p> annotation
* suppress the tables, annotate with gap
* suppress figures/pictures with a gap
* suppress the heading of a picture / figure
* typographic information is ignored
* no <pb>
* hyphenation is merged
* include <head> (for chapters etc.)
* include <div>
* no annotation of quotes (cf. mottos), instead using <p>
* retain information from level0 if possible; but mark with <gap> and put
into comments-->
<p rend="red">In Prague, we decided to suppress annotation of linebreaks, lists,
tables, figures, captions of figures, typographic information, pagebreaks,
and quotation (i.e. direct/indirect speech). We explicitly agreed to annotate
only paragraphs, divisions, and headings. Other features would be represented either
by a <gi>gap</gi>, if they have been entirely suppressed, or by a <gi>p</gi> if they
have textual content. We also agreed that hyphenation, like other typographic
features, would not be preserved. Verse and drama were not explicitly addressed. </p>
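<p>Under these decisions, a chapter containing a short list and a table might be reduced
to something like the following sketch. The content is invented, and no attributes are
shown on <gi>gap</gi> because none were agreed.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<div>
<head>Chapter 3</head>
<p>He packed what little he owned:</p>
<p>two shirts</p>
<p>a razor</p>
<p>his father's watch</p>
<!-- a table of train times in the source, suppressed -->
<gap/>
<p>The train left at noon.</p>
</div>
</egXML>
</p>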
<p>Novels are also full of direct speech, represented using various different
conventions, but almost always distinguished from the narrative voice. The first
person narrative is also common, but may be regarded as a special case.
<!-- CO: Narrative voice might be good to have for the metadata! -->How exactly
different narrative strands are articulated in a novel, and the extent to which they
may be characterised by their lexis has been a preoccupation of many <q>distant
reading</q> style analyses. It might therefore be helpful to distinguish material
purporting to be direct speech from material purporting to be narrative in our basic
encoding, though to do so consistently and accurately may occasionally be
problematic.</p>
<p><label>Open Question</label> Should passages presented as direct speech in a novel be
distinguished from passages presented as narrative? <list>
<item>Yes : use <gi>q</gi> and avoid nesting problems by always nesting it within
<gi>p</gi></item>
<item>Yes : use a <gi>milestone</gi> to mark the beginning and end of each passage
of direct speech</item>
<item>Sort of : provide an attribute on <gi>p</gi> to indicate whether or not the
paragraph contains direct speech</item>
<!-- CO: I don't know whether the kind of suggestion annotation help to process the corpus.
Either the whole paragraph needs to be excluded from analysis or you include the paragraph
and you know that the paragraph contain some speech text. I think, this is not a big benefit. -->
<item>No : rely on (or normalise) typographic conventions such as quote marks or
dashes to distinguish direct speech only. </item>
<!-- CO: At the moment, I would prefer this solution. -->
<!-- LB: it's certainly the easiest! tho normalising punctuation marks may be
problematic-->
</list>
</p>
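<p>To make the first two options concrete, the same invented sentence is encoded both
ways in the sketch below. The values given for <att>unit</att> and <att>type</att> on
<gi>milestone</gi> are hypothetical, chosen only for this illustration.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- first option: q nested within p -->
<p><q>Is it far?</q> she asked, without looking up.</p>
<!-- second option: milestones marking the start and end of the speech -->
<p><milestone unit="speech" type="start"/>Is it far?<milestone unit="speech" type="end"/>
she asked, without looking up.</p>
</egXML>
</p>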
<p rend="red">In Prague the majority view was not to attempt to do more than preserve
existing punctuation. </p>
<p>Printed texts typically deploy a number of conventions which can cause problems for
linguistic analyses of even the most basic kind. Changes of font or style
(italicization or use of superscript, for example) can have particular lexical
significance which should be taken into account. End-of-line hyphenation can make it
harder to identify the exact form of a token. Non-standard (i.e. non-modern)
spellings can mislead parsers. Our proposed encoding aims above all for consistency
and transparency in what is reliably achievable, leaving more difficult and
problematic issues to be addressed by linguistic annotations. </p>
<p>We do not preserve the lineation of running prose in our source texts, since this is
always purely an artefact of the source edition. For the same reason we will
reassemble words broken across a line break, silently removing any hyphen present.
(This will make it impossible to use our texts for hyphenation studies. So be it.) </p>
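<p>For instance, a source sentence broken across two lines (shown here in a comment, with
invented wording) would be transcribed as a single unbroken paragraph:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- source lineation:
     They resumed their conver-
     sation after dinner. -->
<p>They resumed their conversation after dinner.</p>
</egXML>
</p>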
<!-- CO: This contradicts the idea of encoding the first edition in a philological way. -->
<p><label>Open Question</label> : Should page breaks in the source text be preserved ? <list>
<item>Yes : this is useful information (e.g. to determine words-per-page, or to
anchor links to an image of the source text) which is usually available at
no-cost in existing digital texts</item>
<!-- CO: This might help. -->
<!-- BN: From my point of view this information is not necessary. As I said
before, it is related to the book as physical object, not to the novel itself.-->
<item>No : the proposed uses don't justify the cost of providing the information
if it is missing. And pagination is inherently copy-specific.</item>
</list></p>
<p rend="red">Prague decision (as noted above) was to suppress page numbering; however
the proposal is to retain it, since it will always be available for OCR texts, where
it is essential information during text validation, and is usually available in other
digital versions. The discussion in Prague concerned only its lack of utility during the
analysis stage, but it is very useful during the transcription and validation stage. </p>
<p>Font and style variations in the source text usually signal something. Italics may
signal emphasis, quotation, foreign language terms etc. Superscripts almost always
signal abbreviation. The visual salience of these variations is of considerably less
interest to distant readers than the intended function they signal. However, it is
not always easy to determine that function reliably and consistently by algorithm.
Some simple cases could however be addressed. A possible strategy is outlined below.
It assumes the existence of a digital version of the text in which visual features
are explicit, whether by means of TEI-style markup or styling information such as
that provided by Word. <list>
<item>if possible, replace indications of highlighting by an appropriate TEI
element, chosen from the following list : <gi>foreign</gi>, <gi>title</gi>,
<gi>emph</gi></item>
<!-- CO: Why would title and foreign be good elements for the task? Emph refers to linguistic or rhetorical effect -->
<!-- LB: because titles and foreign passages are often represented by
highlighting -->
<item>otherwise, replace all indications of highlighting by the TEI <gi>hi</gi>
element</item>
<!-- CO: this might be a good way of encoding. hi can get a @rend for determining bold, underlined etc.?! -->
<!-- LB: indeed it can, but why would we want to use that? -->
<item>indications of superscript characters (such as French
<soCalled>14ᵉ</soCalled>) should be removed. Instead, the TEI element
<gi>abbr</gi> should be used to indicate the presence of an abbreviated
word: <code>&lt;abbr&gt;14e&lt;/abbr&gt;</code></item>
</list></p>
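<p>Were this strategy adopted, passages highlighted in the source might be recoded as in
the following sketch. The sentences are invented: the first contains an italicised French
phrase, an emphasised word, and a title; the second a highlighted word whose function is
unclear; the third a superscript abbreviation.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<p>He admitted, with a certain <foreign xml:lang="fr">sang-froid</foreign>, that he had
<emph>never</emph> finished <title>Middlemarch</title>.</p>
<p>The sign over the door read <hi>Closed</hi>.</p>
<p xml:lang="fr">Ils habitaient au <abbr>14e</abbr> étage.</p>
</egXML>
</p>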
<p rend="red">Prague decision (as noted above) was to suppress all encoding of renditional features. </p>
<p><label>Open Question</label>: Is it feasible or useful to recode highlighted spans of
text in this way?
<list>
<item>Yes : in many cases this can be an automatic process and the results justify
investing the effort </item>
<item>No : there are likely to be too many borderline or debatable cases to do
this automatically so this would have to be done as part of a major proof
reading exercise</item>
</list></p>
<p>Whichever solution is adopted, it should be applied uniformly across the ELTeC. A
collection in which some texts make distinctions ignored by others is
unsatisfactory.</p>
</div>
<div>
<head>TEI Elements used</head>
<p>This section will provide a checklist of TEI elements used in the body of each ELTeC
text, with descriptions and examples of their intended applications. </p>
</div>
<div xml:id="hdr">
<head>Metadata in the TEI Header</head>
<!-- BN: About the metadata:
Licences (creative common) will be included in metadata, isn't it?
I think it could be useful to link author names with WikiData, VIAF, ISNI
or similar linked open data resources. The ID of each author in these
resources could be included in the metadata.
WikiData:
https://www.wikidata.org/wiki/Wikidata:Main_Page
VIAF (Virtual International Authority File):
https://www.oclc.org/en/viaf.html
http://viaf.org/
ISNI (International Standard Name Identifier (ISO 27729))
http://www.isni.org/
http://isni.oclc.org/-->
<!-- LB : agreed. Since wikidata includes the others, should we use that for
preference? -->
<p>This section describes the metadata associated with each text (title, authorship,
date etc.) and with the collection as a whole. The intention is to provide this in a
standardised way to facilitate subsetting of the collection, using (for example)
coded values for the descriptive selection criteria associated with the text. As far
as possible, our text should represent the first complete printed edition of each
novel selected. </p>
<p>The TEI Header provides a very large number of possibilities for encoding such
metadata. We will provide a checklist of the TEI Header elements which are always to
be provided for each text, possibly in the form of a template. As in the body of the
text, the intention is to provide a guaranteed minimal level of information,
consistent across all parts of the ELTeC. </p>
<p>Note that metadata may be supplied at (at least) two levels: the level of the ELTeC
as a whole, and that of individual texts within it. Information which applies
uniformly to all parts of the collection should be supplied in the ELTeC header;
information specific to a particular document in the text header. </p>
</div>
<div>
<head>Text-level metadata</head>
<p>Here is an example template for an individual text header
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<teiHeader type="novelHeader">
<fileDesc>
<titleStmt>
<title><!-- standard title of work -->
</title>
<author>
<!-- information about the author -->
</author>
</titleStmt>
<extent>
<!-- size of the text, in pages and words -->
</extent>
<publicationStmt>
<!-- boilerplate statement about status as part of ELTeC -->
</publicationStmt>
<sourceDesc>
<bibl>
<!-- bibliographic description of the printed source -->
</bibl>
</sourceDesc>
</fileDesc>
<profileDesc>
<!-- additional descriptive information -->
</profileDesc>
<revisionDesc>
<!-- revision information -->
</revisionDesc>
</teiHeader>
</egXML>
</p>
<p>Within the <gi>teiHeader</gi>, a <gi>fileDesc</gi>, a <gi>profileDesc</gi>, and a
<gi>revisionDesc</gi> are all required. The <gi>encodingDesc</gi> may be supplied
in the (hopefully unlikely) event that some aspect of this document's encoding is
anomalous. </p>
<div>
<head>Components of the file description</head>
<p>The <gi>fileDesc</gi> contains the following mandatory elements: <specList>
<specDesc key="titleStmt"/>
<specDesc key="extent"/>
<specDesc key="publicationStmt"/>
<specDesc key="sourceDesc"/>
</specList>
</p>
<p> Taking these in turn, the <gi>titleStmt</gi> contains the title, author, and
encoder of the document. For novels with multiple authors, titles, or encoders the
element concerned is simply repeated. The <gi>title</gi> should be taken from an
authoritative bibliographic source, and should include a phrase such as
<soCalled>ELTeC edition</soCalled>. The <gi>author</gi> may contain one or more
of the following descriptive elements: <specList>
<specDesc key="persName"/>
<specDesc key="forename"/>
<specDesc key="surname"/>
<specDesc key="birth"/>
<specDesc key="death"/>
<specDesc key="affiliation" atts="type"/>
<specDesc key="sex" atts="value"/>
<specDesc key="idno" atts="type"/>
</specList>
</p>
<p>In addition to one or more <gi>author</gi> elements, a <gi>titleStmt</gi> should
contain at least one <gi>respStmt</gi> element indicating the person responsible
for the ELTeC encoded version, using the following elements: <specList>
<specDesc key="resp"/>
<specDesc key="respStmt"/>
<specDesc key="name"/>
</specList></p>
<p>Here is an example:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<titleStmt>
<title>Howards End : ELTeC edition</title>
<author>
<persName>
<forename>Edward</forename>
<forename>Morgan</forename>
<surname>Forster</surname>
</persName>
<persName>E.M. Forster</persName>
<birth when="1879"/>
<death when="1970"/>
<sex value="M"/>
<idno type="viaf">https://viaf.org/viaf/31996364</idno>
<idno type="wiki">https://www.wikidata.org/wiki/Q189119</idno>
</author>
<respStmt>
<resp>ELTeC encoding</resp>
<name>Lou Burnard</name>
</respStmt>
</titleStmt>
</egXML>
</p>
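<p>Where a novel has more than one author or encoder, the same pattern applies with the
element concerned repeated. The following is a sketch only: the work chosen is
illustrative, the encoder names are placeholders, and identifiers are omitted rather
than guessed.
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<titleStmt>
<title>The Gilded Age: A Tale of Today : ELTeC edition</title>
<!-- one author element per author -->
<author>
<persName>Mark Twain</persName>
<birth when="1835"/>
<death when="1910"/>
<sex value="M"/>
</author>
<author>
<persName>Charles Dudley Warner</persName>
<birth when="1829"/>
<death when="1900"/>
<sex value="M"/>
</author>
<!-- one respStmt per encoder; both names below are placeholders -->
<respStmt>
<resp>ELTeC encoding</resp>
<name>First Encoder</name>
</respStmt>
<respStmt>
<resp>ELTeC encoding</resp>
<name>Second Encoder</name>
</respStmt>
</titleStmt>
</egXML>
</p>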
<p> The <gi>extent</gi> provides information about the size of the document, given by
means of the following elements: <specList>
<specDesc key="extent"/>
<specDesc key="measure" atts="unit quantity"/>
</specList> Exactly which measurements will be most useful and easily incorporated
is yet to be determined: probably a count of words and pages will suffice. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples"><extent>
<measure unit="words" quantity="20010"/>
<measure unit="pages" quantity="245"/>
</extent>
</egXML>
<p>The <gi>publicationStmt</gi> is required for TEI conformance: in individual text
headers it will contain some standard boilerplate text referring to the fuller
statement which will be furnished by the collection-level header. <!--<specList>
<specDesc key="idno"/>
<specDesc key="pubPlace"/>
<specDesc key="publisher"/>
<specDesc key="date" atts="when"/>
<specDesc key="biblScope"/>
</specList>-->
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<publicationStmt>
<p>Incorporated into the ELTeC <date>2018-02-12</date></p>
</publicationStmt>
</egXML>
</p>
<p>The <gi>sourceDesc</gi> element is also required for TEI conformance. It will
contain a bibliographic description of the source text against which the digital
text has been validated, typically the first published edition of the work
concerned. Where the ELTeC version derives from a pre-existing digital version of
this work, a reference to that source will also be provided. The following
elements are used to record this information: <specList>
<specDesc key="bibl"/>
<specDesc key="title"/>
<specDesc key="author"/>
<specDesc key="publisher"/>
<specDesc key="pubPlace"/>
<specDesc key="ref"/>
</specList>
</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<sourceDesc>
<bibl>
<author>E.M. Forster</author>
<title>Howards End</title>
<pubPlace>London</pubPlace>
<publisher>Edward Arnold</publisher>
<date>1910</date>
<idno type="wiki">https://www.wikidata.org/wiki/Q1146642</idno>
</bibl>
<bibl>
<title>The Project Gutenberg Etext of Howards End, by E. M. Forster</title>
<ref target="http://www.gutenberg.org/files/2891/2891-h/2891-h.htm">HTML
version downloaded on <date>2017-12-26</date></ref>
</bibl>
<note type="editions" source="worldcat"> Worldcat lists 484 print editions in
English</note>
</sourceDesc>
</egXML>
</div>
<div>
<head>Components of the profile description</head>
<p>The <gi>profileDesc</gi> of an ELTeC text has the following mandatory components: <specList>
<specDesc key="langUsage"/>
<specDesc key="textClass"/>
</specList></p>
<p>The <gi>langUsage</gi> element contains one or more <gi>language</gi> elements,
one for each language, dialect, sublanguage etc. explicitly identified in the body
of the text. The <att>usage</att> attribute of each indicates, as a percentage,
roughly how much of the text uses this language. For example, a text which is almost
entirely in British English, but also contains some parts in US English, would have
an entry like this: </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<langUsage>
<language ident="en-GB" usage="90">British English</language>
<language ident="en-US" usage="10">North American English</language>
</langUsage></egXML>
<p>The TEI <gi>textClass</gi> element can contain one or more of the following
elements: <specList>
<specDesc key="catRef"/>
<specDesc key="classCode"/>
<specDesc key="keywords" atts="source"/>
<specDesc key="term"/>
</specList> These three methods for classifying texts can be used in parallel. It
is an <label>open question</label> which we should use for the ELTeC collection:
the schema proposed here permits any combination. </p>
<p>The <gi>keywords</gi> option allows us to supply one or more <gi>term</gi>
elements to categorise a text in some way. If the values are taken from a known
closed list or authority file, that file should be specified using the
<att>source</att> attribute. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<textClass>
<keywords source="http://wikidata.org">
<term>social class</term>
<term>social convention</term>
<term>modernity</term>
<term>family drama</term>
</keywords>
</textClass>
</egXML>
<p><label>Open Question</label>: should we invent our own taxonomy, use a
pre-existing one, or make no attempt to constrain or predefine the terms used here?</p>
<p>The <gi>classCode</gi> option allows us to use classification codes used or
defined by existing authorities, such as library catalogue schemes, while the
<gi>catRef</gi> option allows us to specify such codes using our own
classification scheme. </p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<catRef target="#author_m #reprint_3"/>
<classCode source="UDC">8231.111</classCode>
</egXML>
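<p>Because these methods can be used in parallel, a single <gi>textClass</gi> could
carry all three at once. The following sketch simply reuses the values and attributes
from the examples above; it illustrates the combination and is not a recommendation as
to which methods the ELTeC should adopt.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<textClass>
<!-- illustrative combination only: the choice of methods is still an open question -->
<catRef target="#author_m #reprint_3"/>
<classCode source="UDC">8231.111</classCode>
<keywords source="http://wikidata.org">
<term>social class</term>
<term>modernity</term>
</keywords>
</textClass>
</egXML>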
<p>Since our selection and descriptive criteria are likely to be specific to the
project, we will probably have to define them in the corpus header using the
following elements: <specList>
<specDesc key="taxonomy"/>
<specDesc key="category"/>
<specDesc key="catDesc"/>
</specList></p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<taxonomy>
<category xml:id="author_m"><catDesc>male authorship</catDesc></category>
<category xml:id="author_f"><catDesc>female authorship</catDesc></category>
<category xml:id="author_u"><catDesc>author gender unknown</catDesc></category>
<category xml:id="reprint_0"><catDesc>no reprints found</catDesc></category>
<category xml:id="reprint_1"><catDesc>1 to 50 editions</catDesc></category>
<category xml:id="reprint_2"><catDesc>51 to 100 editions</catDesc></category>
<category xml:id="reprint_3"><catDesc>Over 100 reprints</catDesc></category>
</taxonomy>
</egXML>
<!-- LB: some examples are needed here. Check recommendations of TEI in Libraries -->
</div>
<div>
<head>Components of the Revision Description</head>
<p>The <gi>revisionDesc</gi> element is used to document significant points in the
version history of the document. At least one entry should be provided for an
ELTeC document, specifying when it was first added to the collection. The
following elements can be used: <specList>
<specDesc key="revisionDesc"/>
<specDesc key="change" atts="when who"/>
</specList>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<revisionDesc>
<change when="2018-02-21" who="ELTeC:LB">Added new linguistic
classifications</change>
<change when="2018-01-29" who="ELTeC:LB">Added to the ELTeC</change>
</revisionDesc></egXML>
</p>
</div>
<div>
<head>Encoding description</head>
<p>The TEI allows for the specification of encoding practice, by which is meant
documentation of the specific editorial policies followed during transcription
(treatment of printed hyphens, lexical normalisation, sampling procedures,
features included, ignored, or normalised, etc.). Such specification may be
supplied at the individual document level, or once for all across the whole of a
corpus. It is even possible to specify that different parts of a document follow
different policies, provided that all the available policies are defined
somewhere. </p>
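<p>As a rough sketch of what such a specification might look like, the following
<gi>encodingDesc</gi> uses the standard TEI elements for sampling, hyphenation, and
normalisation policy. The policies stated are assumptions made for the sake of the
example, not agreed ELTeC practice, and the ELTeC schema may not ultimately include
all of these elements.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<encodingDesc>
<!-- illustrative policies only: none of these has been agreed for the ELTeC -->
<samplingDecl>
<p>The complete text of the novel is transcribed; no sampling is applied.</p>
</samplingDecl>
<editorialDecl>
<hyphenation eol="none">
<p>End-of-line hyphenation in the printed source has been removed.</p>
</hyphenation>
<normalization method="silent">
<p>Obvious typesetting errors have been corrected without markup; the original
spelling is otherwise retained.</p>
</normalization>
</editorialDecl>
</encodingDesc>
</egXML>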
<p><label>Open Question</label>: We propose as far as possible not to allow for any
variation in encoding policies applied within the ELTeC. We will still need to
determine our encoding policies, of course, and to document them appropriately in
the ELTeC corpus header, but there should be no need for separate specifications
at the document level. </p>
</div>
</div>
<div>
<head>Linguistic and semantic annotation (level 2)</head>
<p>Additional markup facilities will be needed to
represent more sophisticated annotations, which may be motivated linguistically (for
example, to provide a normalised form, part of speech, etc.) or semantically (for
example to distinguish proper names, names of people, places, events, etc.). </p>
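<p>To give a rough idea of the kind of markup this might involve, the following sketch
tags an invented sentence using the standard TEI elements <gi>s</gi>, <gi>w</gi>,
<gi>pc</gi>, <gi>persName</gi>, and <gi>placeName</gi>. The choice of elements,
attributes, and part-of-speech labels here is an assumption made purely for
illustration; this proposal does not yet define the level 2 scheme.</p>
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<!-- invented sentence; tagset and attribute usage are illustrative assumptions -->
<s>
<persName><w pos="PROPN" lemma="Margaret">Margaret</w></persName>
<!-- norm supplies a normalised form for an archaic spelling -->
<w pos="VERB" lemma="show" norm="showed">shewed</w>
<w pos="PRON" lemma="he">him</w>
<placeName><w pos="PROPN" lemma="London">London</w></placeName>
<pc>.</pc>
</s>
</egXML>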
</div>
<div xml:id="sources">
<head>Sources consulted</head>
<listBibl>
<bibl>An introduction to TEI Simple Print <idno type="URI"
>http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_simplePrint.doc.html</idno></bibl>
<bibl>Burnard, Lou <date>2005</date>
<title level="a">Metadata for corpus work</title> in <title>Developing Linguistic
Corpora: A guide to good practice</title> ed. Martin Wynne. Oxford: Oxbow Books,
pp. 30-46. <!--<ref target="2005-metadata.xml">XML source</ref>--></bibl>
<bibl> Odebrecht, Carolin. (2017). Metadata for Historical Corpora. Realization of the
Metamodel for Corpus Metadata with the help of TEI Customization [Data set]. Zenodo.
http://doi.org/10.5281/zenodo.267999</bibl>
<bibl>
<idno type="URI">github.com/cligs/textbox</idno>
</bibl>
</listBibl>
</div>
</body>
<back>
<head>Formal specifications</head>
<p>The ELTeC encoding scheme defined by this document is a TEI-conformant customization,
from which user documentation and formal RELAX NG or DTD specifications can be
generated automatically.
</p>
<!-- only one of the following includes can be active at any one time -->
<xi:include href="eltec.xml"/>
<xi:include href="eltec-0.xml"/>
<xi:include href="eltec-1.xml"/>
</back>
</text>
</TEI>