<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_odds.rng" type="application/xml"
	schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
   xmlns:xi="http://www.w3.org/2001/XInclude">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Encoding Guidelines for the ELTeC</title>
            <author>Cost Action CA16204 – WG1</author>
         </titleStmt>
         <publicationStmt>
            <p>Unpublished draft for discussion</p>
         </publicationStmt>
         <sourceDesc>
            <p>A born digital document drafted in TEI format by LB</p>
         </sourceDesc>
      </fileDesc>
      <revisionDesc>
         <change when="2018-03-16">Revised post-Prague</change>
         <change when="2018-01-17">Expanded metadata section a bit; added comments from CO and
            BN</change>
         <change when="2017-12-17">First (partial) discussion draft</change>
      </revisionDesc>
   </teiHeader>
   <text>
      <body>
         <p rend="red">THIS DOCUMENT HAS BEEN SUPERSEDED AND IS NOW OF HISTORICAL INTEREST ONLY!</p><p rend="red">For current Encoding Guidelines, please
            refer to <ptr target="https://distantreading.github.io/eltec-0.html"/> (level zero)
            or <ptr target="https://distantreading.github.io/eltec-1.html"/> (level one)
         </p><p>This reference document defines the encoding scheme to be used
            for the European Literary Text Collection (ELTeC) which will be a major deliverable of
            COST Action 16204, <title>Distant Reading</title>. This draft reflects decisions taken
            at the WG1 meeting in Prague in February 2018, but has not yet been formally reviewed by
            the Work Group. Particular topics on which policy remains to be
            defined are signalled below with the label <q>Open Question</q>. </p>
         <div>
            <head>Principles</head>
            <p>The MoU for the project points out that <q>Distant Reading methods cover a wide range
                  of computational methods for literary text analysis, such as authorship
                  attribution, topic modelling, character network analysis, or stylistic
                  analysis.</q> The focus of the ELTeC encoding scheme is thus not to represent
               texts in all their original complexity of structure or appearance, but rather to
               facilitate a richer and better-informed distant reading than a transcription of its
               lexical content alone would permit. For example, it seems useful to distinguish
               headings and annotations from the rest of the text, and to be able to locate
               stretches of text within gross structural features such as pages, chapters, or paragraphs.
              Although it may be useful to distinguish passages belonging to different narrative
               levels (for example, direct speech versus narrative or quotation versus narrative),
               it is difficult to do so automatically with any degree of consistency.
               It is certainly less useful to record
               exact nuances of rendition or spelling in a particular version of a text. Our goal is
               thus not to duplicate the work of scholarly editors or to produce (yet another) digital
               edition of a specific source document. Rather it is to ensure that the ELTeC texts can be
               processed by very simple-minded (but XML-aware) systems primarily concerned with
               lexis and to make life easier for the developers of such systems. </p>

            <p>In selecting features for inclusion in the markup scheme, we have been guided, but
               not limited, by existing practice as far as possible. Our main goal has been to
               identify a small core set of textual features which can be readily
               (preferably automatically) identified in existing digital transcriptions, or easily
               and consistently provided by new transcriptions. </p>

            <p>We distinguish three <soCalled>levels</soCalled> of encoding, referred to below as
               level zero, level one and level two. All ELTeC  texts are made available at level zero,
               the basic encoding format. Some texts may additionally be made available at levels
               one or two, which provide a richer set of encoded features. For example: a level one
               text will include information about rendition features missing from a level zero
               text; a level two text will include tokenization information missing from a level one
               text. As far as possible conversion between levels will be automatically scripted,
               but this is not possible in the general case.
              <!-- from level one to level zero is a simple lossy transformation;
               conversion from level one or level zero to level two loses some information but adds
               Conversion scripts from  where  may be prepared in either level zero or level
               one-->
         <!--0. The eltec schema has a minimal level (level zero) in which (more or
         less) only the features enumerated in Prague are permitted/guaranteed.
         Other features (e.g. linebreaks, pagebreaks, highlighting)  are not
         annotated/encoded/represented, even if they were available in a
         pre-existing digital version from which the eltec version derives. The
         details are to be determined.

         1. At level 1, a slightly richer set of features -\- basically the
         intersection of what is available and useful in most pre-existing
         digital versions -\- are provided and required.  This feature set is
         also to be defined: I think it should include e.g. pagebreaks and
         highlighting, but not linebreaks or editorial interventions such as
         correction or normalization.

         2. At level 2,  a different set of largely linguistically-motivated
         annotations are provided, typically involving tokenisation and
         morpho-syntax. Note that level 2 annotation can be added to a level 0
         text, so the term  "level" is perhaps a misnomer. This set is still to
         be defined; the current encoding document explicitly defers it to
         another workgroup cycle.-->
         </p>
            <p>This document lists all the textual features which are to be distinguished in an
               ELTeC conformant transcription at one of these three levels.
               Whenever a given feature exists in a text, it will be
               marked up as indicated here. No other features will be captured by the markup: if
               some textual feature not provided for here is identified by a marked up source text,
               that markup will be removed (though it may be retained in a version of the text
               encoded at a different level). </p>

         <p>All ELTeC documents are TEI conformant, and therefore include a TEI Header, as discussed
         in section <ptr target="#hdr"/> below.</p>
          <!--
            <p>We make no attempt to propose markup for linguistic annotations here. The assumption
               is that this will be produced by different annotation systems in different ways,
               though with an association between such annotations and the basic lexical structures
               represented by the core ELTeC markup. </p>-->

         </div>
         <div>
            <head>Basic Transcription Guidelines (all levels) </head>

            <p>The basic unit of the ELTeC corpus is the text of a single novel, represented by a
               TEI <gi>text</gi> element. We propose no mechanism (other than metadata) to encode
               units larger than a single novel, such as multipart novel series like Proust's
                  <title>À la recherche du temps perdu</title> or Zola's <title>Les
                  Rougon-Macquart</title>. </p>
           <!-- <p><label>Open Question</label> Should we include liminal matter (titlepages, prefaces,
               appendixes...) in our transcriptions? The following policies seem possible: <list>
                  <item>No : these typically belong to a particular edition or version of the text,
                     and should therefore systematically be excluded</item>
                  <item>Yes : these often form a significant part of the reader's experience (cf.
                     the foreword to most editions of <title>David Copperfield</title>). Mark them
                     up using <gi>front</gi> and <gi>back</gi> as appropriate.</item>

                  <item>Sort of : do not transcribe them, but indicate that they have been
                     suppressed by using the <gi>gap</gi> element. </item>
               </list></p>-->
            <p>To facilitate checking of a transcription against its  source during
               production, the <gi>pb</gi> element must be provided to mark the point in a
               transcript where a new page begins. If a page begins with the second part of a
               hyphenated word, the <gi>pb</gi> tag may appear after that, but otherwise its
               position should be the same in transcription and source. The <gi>pb</gi> element has
              an attribute <att>n</att> which should be used to number the pages. A level 1 text may
            also provide a <att>facs</att> attribute to point to a page image of the corresponding
               source page. </p>
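            <p>A minimal sketch of how this might look is given below. The wording, the page
               numbers, and the <att>facs</att> file name are all invented for illustration; we
               assume a source in which the word <q>neighbourhood</q> is broken across pages 12
               and 13, so that the second <gi>pb</gi> is placed after the completed word. The
               <att>facs</att> attribute would appear only in a level 1 text.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <p><pb n="12"/>It was late in the evening when the coach at last drew up before
                  the inn, and every window in the neighbourhood <pb n="13" facs="page013.jpg"/>was
                  already dark.</p>
            </egXML>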
            <p>As well as a titlepage or a table of contents, a published novel often includes material such as forewords or appendixes in addition
            to the text of the novel itself. This <term>liminal</term> matter is included in an
               ELTeC text only if it is believed to be authorial. Material before the body of the
               text begins is collected within a <gi>front</gi> element, and material following the
               body in a
               <gi>back</gi> element. In either case, distinct sections of the material, if
               encoded,  are represented by a
               <gi>div</gi> with its <att>type</att> attribute set to <code>liminal</code>.</p><p>At
               level zero, titlepages and tables of contents are omitted. At level one, they are
               replaced by a <gi>gap</gi> element. Non-authorial liminal material is silently
                  omitted at all levels. </p>
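            <p>The overall shape of a level one text with an authorial preface and afterword
               might then be sketched as follows. The headings and the elided content are
               placeholders only, and the <gi>gap</gi> standing for the omitted title page would
               not appear in a level zero text.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <text>
                  <front>
                     <!-- level one only: marks the omitted title page -->
                     <gap/>
                     <div type="liminal">
                        <head>Preface</head>
                        <p>...</p>
                     </div>
                  </front>
                  <body>
                     <div>
                        <head>Chapter 1</head>
                        <p>...</p>
                     </div>
                  </body>
                  <back>
                     <div type="liminal">
                        <head>Afterword</head>
                        <p>...</p>
                     </div>
                  </back>
               </text>
            </egXML>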
            <p rend="red">
            The Prague decision list says that we decided to exclude titlepages, tables of contents, errata lists, etc., but
            to include prefaces, introductions, afterwords, and appendixes, provided these are
            contemporary with the text. It also says to include footnotes and commentary, but does
            not specify whether these should also be contemporary, nor how the encoder can easily
            determine whether or not something is "contemporary".</p>
            <p>Within the body of a text, major structural divisions (parts, sections, chapters
               etc.) will be captured using the generic <gi>div</gi> element, with attributes
                  <att>type</att>, <att>xml:lang</att>, <att>xml:id</att> and <att>n</att> used as
               further detailed below.</p>
            <p>The names used for hierarchic structural divisions of a novel above the chapter are
               arbitrary, culture-specific, and often inconsistent : in some novels things called
                  <q>part</q> contain things called <q>book</q> and in others the reverse. We
               propose to follow TEI in using a single element (<gi>div</gi>) for every hierarchical
               structural division, down to the level of <q>chapter</q>.</p>
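            <p>For example, a novel divided into parts and chapters might be represented along
               the following lines. The headings are invented, and no <att>type</att> attribute
               is shown here, since its use is still an open question.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <body>
                  <div>
                     <head>Part One</head>
                     <div>
                        <head>Chapter 1</head>
                        <p>...</p>
                     </div>
                     <div>
                        <head>Chapter 2</head>
                        <p>...</p>
                     </div>
                  </div>
                  <div>
                     <head>Part Two</head>
                     <div>
                        <head>Chapter 1</head>
                        <p>...</p>
                     </div>
                  </div>
               </body>
            </egXML>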
            <p><label>Open Question</label> Is it useful to retain the name used for each level in
               the original source (the type of div) ? <list>
                  <!-- CO: Maybe I just missed this information. We can keep the structuring texts/words of the text as headings.
                     We can mark them with the appropriate element. In this way, we can keep the information without having to define
                     a fixed list of subdivisions in a novel (you already pointed out that this is not possible). Having marked the existing
                     headings (and subheadings) they can be included or ignored automatically when processing the document.  -->
                  <!-- LB : I think it would be problematic to just use <head> without using <div>
                     if that's what you are suggesting. I am only asking whether or not we
                     include e.g. @type="chapter" on the top level div -->
                  <!-- BN: I'm not sure. Maybe it could be a good solution the third one:-->
                  <item> Yes: it is easy to keep and may help referencing : use the <att>type</att>
                     attribute to hold the name used for each level of div in the work in
                     question</item>
                  <item>No : this name adds no useful information beyond the level indicated by the
                     XML structure </item>
                  <item>No : it would be more useful to provide an explicit and normalised
                     indication of the hierarchic level for the benefit of non-XML-aware processors
                     (e.g. <code>level1</code>, <code>level2</code> etc.)</item>
               </list></p>
            <p rend="red">This issue was not discussed in Prague. Proposal is to use (and enforce) a predefined list of
               specific values. </p>
            <p>The (human) language in which a text is expressed is indicated explicitly by the
                  <att>xml:lang</att> attribute, which supplies the two-letter ISO 639-1 code for the
               language concerned. This attribute will always be supplied on the <gi>text</gi>
               element to specify a default, and may also appear on other elements to indicate passages where the language changes.
               The various
               different languages used in a given text will be itemized in its metadata (see
                  <gi>langUsage</gi> element in the header). </p>
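            <p>As an illustration, a French novel containing a letter written in English might
               be marked up as sketched below. The sentences are invented; only the placement of
               <att>xml:lang</att> on <gi>text</gi> (the default) and on the paragraph where the
               language changes is the point of the example.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <text xml:lang="fr">
                  <body>
                     <div>
                        <p>Il ouvrit la lettre et la lut à haute voix.</p>
                        <p xml:lang="en">My dear friend, I shall arrive on Tuesday.</p>
                        <p>Puis il la replia sans un mot.</p>
                     </div>
                  </body>
               </text>
            </egXML>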
            <p><label>Open question</label> Should passages exhibiting regional or dialectal
               variation be specially signalled? <list>
                  <item>No : this is too fine grained and controversial a distinction to be made
                     with reliable consistency </item>
                  <item>Yes : treat this in the same way as any other kind of code switching and
                     define a set of appropriate language codes for the project</item>
                  <!-- CO: From a linguistic point of view, I would like to say yes but detecting dialectal variation
                     is something which cannot be done automatically. I think, without a more explicit guidelines concerning
                     the detection of foreign material we might end up with confusing analysis. The
                     definition of the element foreign is broad. For example,
                     neoclassic words may or may not count as foreign. Especially in a corpus containing roman and germanic languages.  -->
                  <!-- BN: This annotation could be complex. I think it will be better to annotate
dialectal variations during linguistic annotation.-->
                  <!-- LB : maybe we should just use foreign for passages which are signalled
                     specially in some way in the text. e.g. in italics -->
                  <item>Maybe : just use the <gi>distinct</gi> element to indicate the kind of
                     variation concerned</item>
                  <!-- CO: This element may then be applied to other distinct phenomena as well. I
               think this is not the best way. -->
                  <!-- LB: Many TEI elements have multiple uses, so this is not an argument against using
                  <distinct> in my view-->
               </list></p>
            <p rend="red">In Prague there was some support for using either <gi>foreign</gi> or <gi>distinct</gi>, but no decision to do
               so. Proposal is not to do so.</p>
            <p>A single reference scheme will be defined for the whole corpus, with the following
               components: <list>
                  <item>text identifier: every text will have an identifier consisting of its
                     two-letter language code and a three-digit serial number, for example
                        <code>FR042</code></item>
                  <item>chapter identifier: each chapter or equivalent will have an identifier
                     concatenating the text identifier and a three-digit serial number, for example
                        <code>FR042012</code> for the twelfth chapter of the 42nd French novel. </item>
                  <item>If sub-chapter segmentation (see below) is implemented, then the segments
                     will append a further four-digit serial number.</item>

               </list>The identifier will be supplied as the value of an <att>xml:id</att> attribute
               on each <gi>text</gi>, <gi>div</gi> or <gi>s</gi> element as appropriate. Adding this
               identifier is an easily automated task which can be built into the workflow for
               accession to the ELTeC.</p>
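            <p>A sketch of the resulting identifiers, using the example values given above, is
               shown below. The <gi>s</gi> elements and their identifiers assume that sub-chapter
               segmentation has been implemented, which is not yet decided; the content is
               elided.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <text xml:id="FR042" xml:lang="fr">
                  <body>
                     <div xml:id="FR042001">
                        <p>...</p>
                     </div>
                     <!-- chapters 2 to 11 omitted from this sketch -->
                     <div xml:id="FR042012">
                        <p><s xml:id="FR0420120001">...</s>
                           <s xml:id="FR0420120002">...</s></p>
                     </div>
                  </body>
               </text>
            </egXML>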
            <p>Note that these identifiers will not necessarily correspond with the numbering used
               in a particular source text. In a work where the first twelve chapters are considered
               to form part one, and the next twelve constitute part two, the first chapter of the
               second part will have an identifier ending <code>013</code>, even though it may be
               numbered <code>1</code> in a source text.</p><p rend="red">No dissent from this proposal in
                  Prague</p>
             <p><label>Open question</label> is it important to preserve the original numbering,
               particularly for deeply structured texts? <list>
                  <item>Yes : the original numbering is widely used to reference the text: it should
                     be supplied using the <att>n</att> attribute on the <gi>div</gi>.</item>
                  <!-- CO: we may use here the "head" element as the numbering of a chapter may be analysed as a head, maybe next to other heads. See comment above-->
                  <!-- BN: I'm not sure. Sometimes scholars refer to specific passages by original numbering ("The chapter two of Don Quixote bla bla bla"). In this case
this information is necessary. -->
                  <!-- LB: head is not the same as @n : I agree with Borja -->
                  <item>No : the original numbering and referencing scheme are of no use in our
                     intended applications, introduce unnecessary complexity, and may be a source of
                     confusion. </item>
               </list></p>
            <p rend="red">Not explicitly addressed in Prague. Proposal is not to retain original
               numbering.</p>
            <p>The chapters of a novel mostly consist of prose, arranged in paragraphs, for which we
               will use the TEI <gi>p</gi> element. It is not unusual to find other structures
               however, specifically verse, or passages of dialogue presented as if in a play, with
               speaker labels and even stage directions. Less frequently, novels may contain
               material presented in list or tabular formats. Graphics with their own associated
               heading or other text are also frequent. </p>
            <p><label>Open Question</label> how should material other than running prose and
               dialogue be encoded? <list type="ordered">
                  <item>Use the appropriate TEI elements for verse or drama (<gi>lg</gi>,
                     <gi>l</gi>, <gi>sp</gi>, <gi>stage</gi>)</item>
                         <item>Use the appropriate TEI elements for lists and tables (<gi>list</gi>,
                        <gi>label</gi>, <gi>item</gi>, <gi>table</gi>, <gi>cell</gi>,
                     <gi>row</gi>)</item>
                    <item>Use the appropriate TEI elements for embedded graphics (<gi>figure</gi>,
                        <gi>graphic</gi>, <gi>head</gi>)</item>
                     <item>Suppress all non-prose material, replacing it by <gi>gap</gi></item>
               </list></p>

            <!-- Prague decisions
                       * annotation of footnotes (we will test whether finding footnotes will be a problem; if so they go to 2nd level ),
                       * afterwords, appendix, preface, introduction
                       * include <p>
                       * no <lb>
                       * no annotation of a lists, the textual material will be in the corpus with the <p> annotation
                       * suppress the tables, annotate with gap
                       * suppress figures/pictures with a gap
                       * suppress the heading of a picture / figure
                       * typographic information is ignored
                       * no <pb>
                       * hyphenation is merged
                       * include <head> (for chapters etc.)
                       * include <div>
                       * no annotation of quotes (cf. mottos), instead using <p>
                       * retain information from level0 if possible; but mark with <gap> and put
                       into comments-->
            <p rend="red">In Prague, we decided to suppress annotation of linebreaks, lists,
               tables, figures, captions of figures,  typographic information, pagebreaks,
               and quotation (i.e. direct/indirect speech). We explicitly agreed to annotate
               only paragraphs, divisions, and headings. Other features would be represented either
               by a <gi>gap</gi>, if they have been entirely suppressed, or by a <gi>p</gi> if they
               have textual content. We also agreed that hyphenation, like other typographic
               features, would not be preserved.  Verse and drama were not explicitly addressed. </p>
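            <p>Under these decisions, a chapter whose source contains a short list and a
               captioned illustration might be reduced to something like the following sketch.
               The text is invented; the list items survive only as paragraphs, and the
               illustration and its caption are replaced by a single <gi>gap</gi>.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <div>
                  <head>Chapter 3</head>
                  <p>He emptied his pockets onto the table and counted what remained.</p>
                  <!-- three list items in the source, kept as plain paragraphs -->
                  <p>One silver watch.</p>
                  <p>Two letters, unopened.</p>
                  <p>Seven francs.</p>
                  <!-- an illustration and its caption in the source, suppressed -->
                  <gap/>
                  <p>It was not much with which to begin a new life.</p>
               </div>
            </egXML>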
            <p>Novels are also full of direct speech, represented using various different
               conventions, but almost always distinguished from the narrative voice. The first
               person narrative is also common, but may be regarded as a special case.
               <!-- CO: Narrative voice might be good to have for the metadata! -->How exactly
               different narrative strands are articulated in a novel, and the extent to which they
               may be characterised by their lexis has been a preoccupation of many <q>distant
                  reading</q> style analyses. It might therefore be helpful to distinguish material
               purporting to be direct speech from material purporting to be narrative in our basic
               encoding, though to do so consistently and accurately may occasionally be
               problematic.</p>
                 <p><label>Open Question</label> Should passages presented as direct speech in a novel be
               distinguished from passages presented as narrative? <list>
                  <item>Yes : use <gi>q</gi> and avoid nesting problems by always nesting it within
                        <gi>p</gi></item>
                  <item>Yes : use a <gi>milestone</gi> to mark the beginning and end of each passage
                     of direct speech</item>
                  <item>Sort of : provide an attribute on <gi>p</gi> to indicate whether or not the
                     paragraph contains direct speech</item>
                  <!-- CO:  I don't know whether the kind of suggestion annotation help to process the corpus.
                     Either the whole paragraph needs to be excluded from analysis or you include the paragraph
                     and you know that the paragraph contain  some speech text. I think, this is not a big benefit.    -->
                  <item>No : rely on (or normalise) typographic conventions such as quote marks or
                     dashes to distinguish direct speech only. </item>
                  <!-- CO: At the moment, I would prefer this solution. -->
                  <!-- LB: it's certainly the easiest! tho normalising punctuation marks may be
                     problematic-->
               </list>
            </p>
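            <p>The first and last of these options can be compared in the following sketch, in
               which the same invented sentence is encoded first with <gi>q</gi> nested within
               <gi>p</gi>, and then with only its punctuation preserved. Retaining the quotation
               marks inside <gi>q</gi> is an assumption made for this example, not a settled
               policy.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <!-- option one: direct speech marked with q inside p -->
               <p><q>"I shall not go,"</q> said Anna, <q>"whatever he may say."</q></p>
               <!-- last option: existing punctuation only -->
               <p>"I shall not go," said Anna, "whatever he may say."</p>
            </egXML>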
            <p rend="red">In Prague the majority view was not to attempt to do more than preserve
               existing punctuation. </p>
            <p>Printed texts typically deploy a number of conventions which can cause problems for
               linguistic analyses of even the most basic kind. Changes of font or style
               (italicization or use of superscript, for example) can have particular lexical
               significance which should be taken into account. End-of-line hyphenation can make it
               harder to identify the exact form of a token. Non-standard (i.e. non-modern)
               spellings can mislead parsers. Our proposed encoding aims above all for consistency
               and transparency in what is reliably achievable, leaving more difficult and
               problematic issues to be addressed by linguistic annotations. </p>
            <p>We do not preserve the lineation of running prose in our source texts, since this is
               always purely an artefact of the source edition. For the same reason we will
               reassemble words broken across a line break, silently removing any hyphen present.
               (This will make it impossible to use our texts for hyphenation studies. So be it.) </p>
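            <p>By way of illustration, suppose (hypothetically) that a source edition prints
               the words <q>the whole neigh-</q> at the end of one line and <q>bourhood was
               asleep</q> at the start of the next. The transcription would then run the text
               together as sketched below, with neither the line break nor the hyphen
               recorded.</p>
            <egXML xmlns="http://www.tei-c.org/ns/Examples">
               <p>By midnight the whole neighbourhood was asleep.</p>
            </egXML>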
            <!-- CO: This contradicts the idea of encoding the first edition in a philological way. -->
            <p><label>Open Question</label> : Should page breaks in the source text be preserved ? <list>
                  <item>Yes : this is useful information (e.g. to determine words-per-page, or to
                     anchor links to an image of the source text) which is usually available at
                     no-cost in existing digital texts</item>
                  <!-- CO: This might help. -->
                  <!-- BN: From my point of view this information is not necessary. As I said
before, it is related to the book as physical object, not to the novel itself.-->
                  <item>No : the proposed uses don't justify the cost of providing the information
                     if it is missing. And pagination is inherently copy-specific.</item>
               </list></p>
            <p rend="red">Prague decision (as noted above) was to suppress page numbering; however
               the proposal is to retain it, since it will always be available for OCR texts, where
            it is essential information during text validation, and is usually available in  other
            digital versions. The discussion in Prague concerned only its lack of utility during the
            analysis stage, but it is very useful during the transcription and validation stage. </p>
            <p>Font and style variations in the source text usually signal something. Italics may
               signal emphasis, quotation, foreign language terms etc. Superscripts almost always
               signal abbreviation. The visual salience of these variations is of considerably less
               interest to distant readers than the intended function they signal. However, it is
               not always easy to determine that function reliably and consistently by algorithm.
               Some simple cases could however be addressed. A possible strategy is outlined below.
               It assumes the existence of a digital version of the text in which visual features
               are explicit, whether by means of TEI-style markup or styling information such as
               that provided by Word. <list>
                  <item>if possible, replace indications of highlighting by an appropriate TEI
                     element, chosen from the following list : <gi>foreign</gi>, <gi>title</gi>,
                        <gi>emph</gi></item>
                  <!-- CO: Why would title and foreign be good elements for the task?  Emph refers to linguistic or rhetorical effect -->
                  <!-- LB: because titles and foreign passages are often represented by
                     highlighting -->
                  <item>otherwise, replace all indications of highlighting by the TEI <gi>hi</gi>
                     element</item>
                  <!-- CO: this might be a good way of encoding. hi can get a @rend for determining bold, underlined etc.?! -->
                  <!-- LB: indeed it can, but why would we want to use that? -->
                  <item>indications of superscript characters (such as French
                        <soCalled>14&#x1d49;</soCalled>) should be removed. Instead, the TEI element
                        <gi>abbr</gi> should be used to indicate the presence of an abbreviated
                     word: <code>&lt;abbr>14e&lt;/abbr></code></item>
               </list></p>
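            <p>As a minimal sketch of the strategy just outlined, a highlighted passage might be
               recoded as follows. The sentence is invented for illustration and is not taken from
               any ELTeC text; the use of <att>xml:lang</att> on <gi>foreign</gi> is an assumption
               rather than a settled part of this proposal. <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- source: italic title, italic French phrase, italic stress, superscript "d" -->
                  <p>The <abbr>Revd</abbr> Mr Hale put down <title>Waverley</title>, murmured
                        <foreign xml:lang="fr">tant pis</foreign>, and said that he
                        <emph>would</emph> go.</p>
                  <!-- fallback: highlighting whose function cannot be determined -->
                  <p>The notice on the door read <hi>Closed</hi>.</p>
               </egXML>
            </p>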
            <p rend="red">The Prague decision (as noted above) was to suppress all encoding of renditional features. </p>

            <p><label>Open Question</label>: Is it feasible or useful to recode highlighted spans of
               text in this way?
               <list>
                  <item>Yes : in many cases this can be an automatic process and the results justify
                     investing the effort </item>
                  <item>No : there are likely to be too many borderline or debatable cases to do
                     this automatically so this would have to be done as part of a major proof
                     reading exercise</item>
               </list></p>
            <p>Whichever solution is adopted, it should be applied uniformly across the ELTeC. A
               collection in which some texts make distinctions ignored by others is
               unsatisfactory.</p>

         </div>
         <div>
            <head>TEI Elements used</head>
            <p>This section will provide a checklist of TEI elements used in the body of each ELTeC
               text, with descriptions and examples of their intended applications. </p>
         </div>
         <div xml:id="hdr">
            <head>Metadata in the TEI Header</head>
            <!-- BN: About the metadata:
Licences (creative common) will be included in metadata, isn't it?
I think it could be useful to link author names with WikiData, VIAF, ISNI
or similar linked open data resources. The ID of each author in these
resources could be included in the metadata.
WikiData:
https://www.wikidata.org/wiki/Wikidata:Main_Page
VIAF (Virtual International Authority File):
https://www.oclc.org/en/viaf.html
http://viaf.org/
ISNI (International Standard Name Identifier (ISO 27729))
http://www.isni.org/
http://isni.oclc.org/-->
            <!-- LB : agreed. Since wikidata includes the others, should we use that for
               preference? -->
            <p>This section describes the metadata associated with each text (title, authorship,
               date etc.) and with the collection as a whole. The intention is to provide this in a
               standardised way to facilitate subsetting of the collection, using (for example)
               coded values for the descriptive selection criteria associated with the text. As far
               as possible, our text should represent the first complete printed edition of each
               novel selected. </p>
            <p>The TEI Header provides a very large number of possibilities for encoding such
               metadata. We will provide a checklist of the TEI Header elements which are always to
               be provided for each text, possibly in the form of a template. As in the body of the
               text, the intention is to provide a guaranteed minimal level of information,
               consistent across all parts of the ELTeC. </p>
              <p>Note that metadata may be supplied at (at least) two levels: the level of the ELTeC
               as a whole, and that of individual texts within it. Information which applies
               uniformly to all parts of the collection should be supplied in the ELTeC header;
               information specific to a particular document in the text header. </p>
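            <p>The two levels might be arranged as in the following outline. This is only a sketch:
               it assumes that the collection is packaged as a single <gi>teiCorpus</gi>, which
               this proposal does not prescribe, and the comments stand in for content described
               later in this document. <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <teiCorpus>
                     <teiHeader>
                        <!-- ELTeC header: information applying uniformly to the whole collection -->
                     </teiHeader>
                     <TEI>
                        <teiHeader type="novelHeader">
                           <!-- text header: information specific to this novel -->
                        </teiHeader>
                        <text>
                           <!-- the novel itself -->
                        </text>
                     </TEI>
                     <!-- further TEI elements, one for each novel -->
                  </teiCorpus>
               </egXML>
            </p>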
         </div>
         <div>
            <head>Text-level metadata</head>
            <p>Here is an example template for an individual text header:
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <teiHeader type="novelHeader">
                     <fileDesc>
                        <titleStmt>
                           <title><!-- standard title of work -->
                           </title>
                           <author>
                              <!-- information about the author -->
                           </author>
                        </titleStmt>
                        <extent>
                           <!-- size of the text, in pages and words -->
                        </extent>
                        <publicationStmt>
                           <!-- boilerplate statement about status as part of ELTeC -->
                        </publicationStmt>
                        <sourceDesc>
                           <bibl>
                              <!-- bibliographic description of the printed source -->
                           </bibl>
                        </sourceDesc>
                     </fileDesc>
                     <profileDesc>
                        <!-- additional descriptive information -->
                     </profileDesc>
                     <revisionDesc>
                        <!-- revision information -->
                     </revisionDesc>
                  </teiHeader>
               </egXML>
            </p>
            <p>Within the <gi>teiHeader</gi>, a <gi>fileDesc</gi>, a <gi>profileDesc</gi>, and a
                  <gi>revisionDesc</gi> are all required. The <gi>encodingDesc</gi> may be supplied
               in the (hopefully unlikely) event that some aspect of this document's encoding is
               anomalous. </p>
            <div>
               <head>Components of the file description</head>
               <p>The <gi>fileDesc</gi> contains the following mandatory elements: <specList>
                     <specDesc key="titleStmt"/>
                     <specDesc key="extent"/>
                     <specDesc key="publicationStmt"/>
                     <specDesc key="sourceDesc"/>
                  </specList>
               </p>
               <p> Taking these in turn, the <gi>titleStmt</gi> contains the title, author, and
                  encoder of the document. For novels with multiple authors, titles, or encoders the
                  element concerned is simply repeated. The <gi>title</gi> should be taken from an
                  authoritative bibliographic source, and should include a phrase such as
                     <soCalled>ELTeC edition</soCalled>. The <gi>author</gi> may contain one or more
                  of the following descriptive elements: <specList>
                     <specDesc key="persName"/>
                     <specDesc key="forename"/>
                     <specDesc key="surname"/>
                     <specDesc key="birth"/>
                     <specDesc key="death"/>
                     <specDesc key="affiliation" atts="type"/>
                     <specDesc key="sex" atts="value"/>
                     <specDesc key="idno" atts="type"/>
                  </specList>
               </p>
               <p>In addition to one or more <gi>author</gi> elements, a <gi>titleStmt</gi> should
                  contain at least one <gi>respStmt</gi> element indicating the person responsible
                  for the ELTeC encoded version, using the following elements: <specList>
                     <specDesc key="resp"/>
                     <specDesc key="respStmt"/>
                     <specDesc key="name"/>
                  </specList></p>
               <p>Here is an example :
                  <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <titleStmt>
                        <title>Howards End : ELTeC edition</title>
                        <author>
                           <persName>
                              <forename>Edward</forename>
                              <forename>Morgan</forename>
                              <surname>Forster</surname>
                           </persName>
                           <persName>E.M. Forster</persName>
                           <birth when="1879"/>
                           <death when="1970"/>
                           <sex value="M"/>
                           <idno type="viaf">https://viaf.org/viaf/31996364</idno>
                           <idno type="wiki">https://www.wikidata.org/wiki/Q189119</idno>
                        </author>
                        <respStmt>
                           <resp>ELTeC encoding</resp>
                           <name>Lou Burnard</name>
                        </respStmt>
                     </titleStmt>
                  </egXML>
               </p>
               <p> The <gi>extent</gi> provides information about the size of the document, given by
                  means of the following elements: <specList>
                     <specDesc key="extent"/>
                     <specDesc key="measure" atts="unit quantity"/>
                  </specList> Exactly which measurements will be most useful and easily incorporated
                  is yet to be determined: probably a count of words and pages will suffice. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples"><extent>
                     <measure unit="words" quantity="20010"/>
                     <measure unit="pages" quantity="245"/>
                  </extent>
               </egXML>
               <p>The <gi>publicationStmt</gi> is required for TEI conformance: in individual text
                  headers it will contain some standard boilerplate text referring to the fuller
                  statement which will be furnished by the collection-level header. <!--<specList>
                  <specDesc key="idno"/>
                  <specDesc key="pubPlace"/>
                  <specDesc key="publisher"/>
                  <specDesc key="date" atts="when"/>
                  <specDesc key="biblScope"/>
               </specList>-->
                  <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <publicationStmt>
                        <p>Incorporated into the ELTeC <date>2018-02-12</date></p>
                     </publicationStmt>
                  </egXML>
               </p>
               <p>The <gi>sourceDesc</gi> element is also required for TEI conformance. It will
                  contain a bibliographic description of the source text against which the digital
                  text has been validated, typically the first published edition of the work
                  concerned. Where the ELTeC version derives from a pre-existing digital version of
                  this work, a reference to that source will also be provided. The following
                  elements are used to record this information: <specList>
                     <specDesc key="bibl"/>
                     <specDesc key="title"/>
                     <specDesc key="author"/>
                     <specDesc key="publisher"/>
                     <specDesc key="pubPlace"/>
                     <specDesc key="ref"/>
                  </specList>
               </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <sourceDesc>
                     <bibl>
                        <author>E.M. Forster</author>
                        <title>Howards End</title>
                        <pubPlace>London</pubPlace>
                        <publisher>Edward Arnold</publisher>
                        <date>1910</date>
                        <idno type="wiki">https://www.wikidata.org/wiki/Q1146642</idno>
                     </bibl>
                     <bibl>
                        <title>The Project Gutenberg Etext of Howards End, by E. M. Forster</title>
                        <ref target="http://www.gutenberg.org/files/2891/2891-h/2891-h.htm">HTML
                           version downloaded on <date>2017-12-26</date></ref>
                     </bibl>
                     <note type="editions" source="worldcat"> Worldcat lists 484 print editions in
                        English</note>
                  </sourceDesc>
               </egXML>
            </div>
            <div>
               <head>Components of the profile description</head>
               <p>The <gi>profileDesc</gi> of an ELTeC text has the following mandatory components: <specList>
                     <specDesc key="langUsage"/>
                     <specDesc key="textClass"/>
                  </specList></p>
               <p>The <gi>langUsage</gi> element contains one or more <gi>language</gi> elements,
                  one for each language, dialect, sublanguage etc. explicitly identified in the body
                  of the text, indicating roughly how much of the text uses this language. For
                  example, a text which is almost entirely in British English, but also contains
                  some parts in US English would have an entry like this: </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <langUsage>
                     <language ident="en-GB" usage="90">British English</language>
                     <language ident="en-US" usage="10">North American English</language>
                  </langUsage></egXML>
               <p>The TEI <gi>textClass</gi> element can contain one or more of the following
                  elements: <specList>
                     <specDesc key="catRef"/>
                     <specDesc key="classCode"/>
                     <specDesc key="keywords" atts="source"/>
                     <specDesc key="term"/>
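                  </specList> These elements provide three methods for classifying texts
                     (<gi>catRef</gi>, <gi>classCode</gi>, and <gi>keywords</gi> with its
                     <gi>term</gi> children), which can be used in parallel. It is an
                     <label>open question</label> which of them we should use for the ELTeC: the
                  schema proposed here permits any combination. </p>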
               <p>The <gi>keywords</gi> option allows us to supply one or more <gi>term</gi>
                  elements to categorise a text in some way. If the values are taken from a known
                  closed list or authority file, that file should be specified using the
                     <att>source</att> attribute. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <textClass>
                     <keywords source="http://wikidata.org">
                        <term>social class</term>
                        <term>social convention</term>
                        <term>modernity</term>
                        <term>family drama</term>
                     </keywords>
                  </textClass>
               </egXML>
               <p><label>Open Question</label> : should we invent our own taxonomy, use a
                  pre-existing one, or make no attempt to constrain or predefine the terms used here?</p>
               <p>The <gi>classCode</gi> option allows us to use classification codes used or
                  defined by existing authorities, such as library catalogue schemes, while the
                     <gi>catRef</gi> option allows us to specify such codes using our own
                  classification scheme. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <catRef target="#author_m #reprint_3"/>
                  <classCode source="UDC">8231.111</classCode>
               </egXML>
               <p>Since our selection and descriptive criteria are likely to be specific to the
                  project, we will probably have to define them in the corpus header using the
                  following elements: <specList>
                     <specDesc key="taxonomy"/>
                     <specDesc key="category"/>
                     <specDesc key="catDesc"/>
                  </specList></p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <taxonomy>
                     <category xml:id="author_m"><catDesc>male authorship</catDesc></category>
                     <category xml:id="author_f"><catDesc>female authorship</catDesc></category>
                     <category xml:id="author_u"><catDesc>author gender unknown</catDesc></category>
                     <category xml:id="reprint_0"><catDesc>no reprints found</catDesc></category>
                     <category xml:id="reprint_1"><catDesc>1 to 50 editions</catDesc></category>
                     <category xml:id="reprint_2"><catDesc>51 to 100 editions</catDesc></category>
                     <category xml:id="reprint_3"><catDesc>over 100 editions</catDesc></category>
                  </taxonomy>
               </egXML>
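               <p>The following sketch shows where these declarations and the references to them
                  might sit. It assumes that the <gi>taxonomy</gi> is placed in the
                     <gi>classDecl</gi> of the corpus header and that each text header points to it
                  from its <gi>textClass</gi>; the identifier <code>eltec_criteria</code> is
                  hypothetical, and the <gi>category</gi> elements are those already shown above. </p>
               <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- in the corpus header -->
                  <encodingDesc>
                     <classDecl>
                        <taxonomy xml:id="eltec_criteria">
                           <!-- category elements as listed in the previous example -->
                        </taxonomy>
                     </classDecl>
                  </encodingDesc>
                  <!-- in the header of an individual text -->
                  <profileDesc>
                     <textClass>
                        <catRef scheme="#eltec_criteria" target="#author_f #reprint_1"/>
                     </textClass>
                  </profileDesc>
               </egXML>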
               <!-- LB: some examples are needed here. Check recommendations of TEI in Libraries -->
            </div>
            <div>
               <head>Components of the Revision Description</head>
               <p>The <gi>revisionDesc</gi> element is used to document significant points in the
                  version history of the document. At least one entry should be provided for an
                  ELTeC document, specifying when it was first added to the collection. The
                  following elements can be used: <specList>
                     <specDesc key="revisionDesc"/>
                     <specDesc key="change" atts="when who"/>
                  </specList>
                  <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <revisionDesc>
                        <change when="2018-02-21" who="ELTeC:LB">Added new linguistic
                           classifications</change>
                        <change when="2018-01-29" who="ELTeC:LB">Added to the ELTeC</change>
                     </revisionDesc></egXML>
               </p>
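               <p>A value such as <code>ELTeC:LB</code> for <att>who</att> uses a private prefix,
                  which TEI expects to be declared somewhere so that it can be resolved. One
                  possible declaration, which would most naturally go in the corpus header, is
                  sketched here; the file name <code>contributors.xml</code> and the matching
                  pattern are assumptions made for the example only. <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <encodingDesc>
                        <listPrefixDef>
                           <!-- ELTeC:LB would resolve to contributors.xml#LB -->
                           <prefixDef ident="ELTeC" matchPattern="([A-Za-z]+)"
                              replacementPattern="contributors.xml#$1">
                              <p>Pointers with the prefix ELTeC identify a person listed in the
                                 (hypothetical) file of ELTeC contributors.</p>
                           </prefixDef>
                        </listPrefixDef>
                     </encodingDesc>
                  </egXML>
               </p>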
            </div>
            <div>
               <head>Encoding description</head>
               <p>The TEI allows for the specification of encoding practice, by which is meant
                  documentation of the specific editorial policies followed during transcription
                  (treatment of printed hyphens, lexical normalisation, sampling procedures,
                  features included, ignored, or normalised, etc.). Such specification may be
                  supplied at the individual document level, or once for the whole of a
                  corpus. It is even possible to specify that different parts of a document follow
                  different policies, provided that all the available policies are defined
                  somewhere. </p>
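               <p>By way of illustration only, an encoding description recording a few such
                  policies might look like this. The policies stated are invented to show the
                  mechanism and are not decisions of this proposal. <egXML xmlns="http://www.tei-c.org/ns/Examples">
                     <encodingDesc>
                        <editorialDecl>
                           <hyphenation eol="none">
                              <p>End-of-line hyphenation has been removed throughout.</p>
                           </hyphenation>
                           <normalization method="silent">
                              <p>Long s has been replaced by s without comment.</p>
                           </normalization>
                        </editorialDecl>
                        <samplingDecl>
                           <p>Each novel is transcribed in full.</p>
                        </samplingDecl>
                     </encodingDesc>
                  </egXML>
               </p>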
               <p><label>Open Question</label> : We propose as far as possible not to allow for any
                  variation in encoding policies applied within the ELTeC. We will still need to
                  determine our encoding policies, of course, and to document them appropriately in
                  the ELTeC corpus header, but there should be no need for separate specifications
                  at the document level. </p>
            </div>
         </div>
         <div>
            <head>Linguistic and semantic annotation (level 2)</head>
            <p>Additional markup facilities will be needed to
               represent more sophisticated annotations, which may be motivated linguistically (for
               example, to provide a normalised form, part of speech, etc.) or semantically (for
               example to distinguish proper names, names of people, places, events, etc.). </p>
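            <p>As a rough indication of what such annotation could look like, the following sketch
               uses the TEI <gi>w</gi> element with its <att>lemma</att> and <att>pos</att>
               attributes, and the naming elements <gi>persName</gi> and <gi>placeName</gi>. The
               sentences and the part of speech codes are invented for illustration; no tagset or
               set of elements has been decided for level 2. <egXML xmlns="http://www.tei-c.org/ns/Examples">
                  <!-- linguistically motivated: normalised form and part of speech -->
                  <s><w lemma="she" pos="PRON">She</w>
                     <w lemma="go" pos="VERB">went</w>
                     <w lemma="home" pos="ADV">home</w><pc>.</pc></s>
                  <!-- semantically motivated: names of people and places -->
                  <p><persName>Margaret</persName> returned to <placeName>London</placeName>.</p>
               </egXML>
            </p>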
         </div>

         <div xml:id="sources">
            <head>Sources consulted</head>
            <listBibl>
               <bibl>An introduction to TEI Simple Print <idno type="URI"
                  >http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_simplePrint.doc.html</idno></bibl>
               <bibl>Burnard, Lou <date>2005</date>
                  <title level="a">Metadata for corpus work</title> in <title>Developing Linguistic
                     Corpora: A guide to good practice</title> ed. Martin Wynne. Oxford: Oxbow Books,
                  pp 30-46. <!--<ref target="2005-metadata.xml">XML source</ref>--></bibl>
               <bibl>Odebrecht, Carolin <date>2017</date>
                  <title>Metadata for Historical Corpora. Realization of the Metamodel for Corpus
                     Metadata with the help of TEI Customization</title> [Data set]. Zenodo. <idno
                     type="URI">http://doi.org/10.5281/zenodo.267999</idno></bibl>
               <bibl>
                  <idno type="URI">github.com/cligs/textbox</idno>
               </bibl>
            </listBibl>
         </div>

      </body>
      <back>

            <head>Formal specifications</head>
            <p>The ELTeC encoding scheme defined by this document is a TEI-conformant customization,
               from which user documentation and formal RELAX NG or DTD specifications can be
               generated automatically.
            </p>
            <!-- only one of the following includes can be active at any one time -->
         <xi:include href="eltec.xml"/>
         <xi:include href="eltec-0.xml"/>
         <xi:include href="eltec-1.xml"/>

      </back>

   </text>
</TEI>