Smart Doctype "basic" HTML conversion:
The basic HTML conversion:
- Contains code and algorithms © Copyright 1992,1993,1994,1995,1996
Basis Systeme netzwerk, Munich. All rights reserved.
- Produces validated SGML, using the
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
DTD.
- Has been designed to produce markup with acceptable presentation results on
a wide range of browsers, including those that do not support HTML 3.0; it
is not "browsercentric".
- Automatically detects and marks up URLs,
email addresses and Knoxville framework
URNs 1) (Uniform Resource Names).
- Converts VT100 overstrike sequences to HTML descriptive markup, e.g.
- B^HBS^HSn^Hn is converted to <B>BSn</B>
- _^HB_^HS_^Hn is converted to <U>BSn</U> 2)
- -^HM-^Hi-^Hc-^Hr-^Ho-^Hs-^Ho-^Hf-^Ht is converted to <STRIKE>Microsoft</STRIKE> 3)
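The overstrike conversion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the product's algorithm: the function name convert_overstrike is hypothetical, ^H is taken to be the backspace character, and any pair other than X^HX, _^HX or -^HX is simply reduced to its second character.

```python
from html import escape

BS = "\b"  # backspace, written as ^H in the examples above


def convert_overstrike(text):
    """Turn X^HX, _^HX and -^HX runs into <B>, <U> and <STRIKE> markup."""
    out = []
    open_tag = None  # element currently open, if any
    i = 0
    while i < len(text):
        tag = None
        char = text[i]
        if i + 2 < len(text) and text[i + 1] == BS:
            first, char = text[i], text[i + 2]
            if first == char:
                tag = "B"        # a character struck over itself
            elif first == "_":
                tag = "U"        # underscore struck over by a character
            elif first == "-":
                tag = "STRIKE"   # hyphen struck over by a character
            # any other pair: keep only the second character (assumption)
            i += 3
        else:
            i += 1
        if tag != open_tag:
            if open_tag:
                out.append("</%s>" % open_tag)
            if tag:
                out.append("<%s>" % tag)
            open_tag = tag
        out.append(escape(char, quote=False))
    if open_tag:
        out.append("</%s>" % open_tag)
    return "".join(out)
```

Adjacent characters with the same style share one element, so a whole overstruck word yields a single pair of tags.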
There is also some (minimal) support for VT100 Esc-sequences.
ib100|BSn Smart Doctypes Experimental VT100 conversion:\
:am:bs:km:mi:ms:pt:\
:ce=\E[K:cd=\E[J:\
:so=\E[7m:se=\E[m:us=\E[4m:ue=\E[m:rs=\E[s:\
:md=\E[1m:mr=\E[7m:me=\E[m:\
:al=\E[L:dl=\E[M:im=:ei=:ic=\E[@:dc=\E[P:\
:AL=\E[%dL:DL=\E[%dM:IC=\E[%d@:DC=\E[%dP:\
:up=\E[A:nd=\E[C:ku=\E[A:kd=\E[B:kr=\E[C:kl=\E[D:
Table 1: Terminal Capabilities/Markup
| Capname | Termcap | Description | Value | HTML Element |
| bold | md | Turn on bold mode | \E[1m | <B> |
| smul | us | Start underscore mode | \E[4m | <U> |
| N/A | N/A | N/A | N/A | <STRIKE> 3) |
| smso | so | Begin standout mode | \E[7m | <STRONG> |
| blink | mb | Turn on blinking | \E[5m | <BLINK> 4) |
| sgr0 | me | Turn off all attributes | \E[m | </ |
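As an illustration of how the mappings in Table 1 might be applied, here is a minimal sketch. The names SGR_TAGS and convert_sgr are hypothetical, and two behaviours are assumptions rather than documented features: \E[0m is treated like \E[m, and unrecognized attribute codes are silently dropped.

```python
import re

# Attribute code -> HTML element, following Table 1.
SGR_TAGS = {"1": "B", "4": "U", "7": "STRONG", "5": "BLINK"}
SGR_RE = re.compile(r"\x1b\[(\d*)m")


def convert_sgr(text):
    """Replace VT100 attribute sequences with the elements of Table 1."""
    out = []
    stack = []  # elements opened so far, innermost last
    pos = 0
    for match in SGR_RE.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        code = match.group(1)
        if code in ("", "0"):
            # "Turn off all attributes": close everything, innermost first
            while stack:
                out.append("</%s>" % stack.pop())
        elif code in SGR_TAGS and SGR_TAGS[code] not in stack:
            stack.append(SGR_TAGS[code])
            out.append("<%s>" % SGR_TAGS[code])
        # any other code is dropped (assumption)
    out.append(text[pos:])
    while stack:  # close anything left open at the end of the text
        out.append("</%s>" % stack.pop())
    return "".join(out)
```

Keeping the open elements on a stack means the output stays properly nested even when several attributes are switched on before a single reset.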
Table 2: Color Capabilities/Markup (HTML 3.2+)
| Color | Foreground (Fg) | Background (Bg) | Color #RRGGBB |
| black | \E[30m | \E[40m | #000000 |
| blue | \E[34m | \E[44m | #0000FF |
| green | \E[32m | \E[42m | #00FF00 |
| cyan | \E[36m | \E[46m | #00FFFF |
| red | \E[31m | \E[41m | #FF0000 |
| magenta | \E[35m | \E[45m | #FF00FF |
| brown | \E[33m | \E[43m | #A52A2A |
| white | \E[37m | \E[47m | #FFFFFF |
| dark-grey | \E[30;1m | \E[40;1m | #696969 |
| light-blue | \E[34;1m | \E[44;1m | #ADD8E6 |
| light-green | \E[32;1m | \E[42;1m | #98FB98 |
| light-cyan | \E[36;1m | \E[46;1m | #E0FFFF |
| light-red | \E[31;1m | \E[41;1m | #8B0000 |
| light-magenta | \E[35;1m | \E[45;1m | #8B008B |
| light-yellow | \E[33;1m | \E[43;1m | #FFFFE0 |
| light-white | \E[37;1m | \E[47;1m | #F8F8FF |
Support for Color Esc-sequences requires an extended
DTD. The current official W3C HTML DTD,
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">,
offers support for foreground color, <FONT COLOR="#RRGGBB">...</FONT>,
but not background color (except globally or in tables). To support background color as a
backwards compatible attribute:
<!-- BGCOLOR Added to support background Colour -->
<!ATTLIST FONT
size CDATA #IMPLIED -- [+]nn e.g. size="+1", size=4 --
color CDATA #IMPLIED -- #RRGGBB in hex, e.g. red: color="#FF0000" --
face CDATA #IMPLIED -- comma separated list of font names --
bgcolor CDATA #IMPLIED -- #RRGGBB in hex, e.g. white: bgcolor="#FFFFFF" -- >
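To make the color mapping concrete, the following sketch turns a single color Esc-sequence from Table 2 into an opening FONT tag that uses either COLOR or the extended BGCOLOR attribute. The names PALETTE and font_tag are hypothetical, and the sketch handles only the exact sequence forms listed in the table.

```python
import re

# Color index -> (normal value, ";1" value), copied from Table 2.
PALETTE = {
    0: ("#000000", "#696969"),  # black / dark-grey
    1: ("#FF0000", "#8B0000"),  # red / light-red
    2: ("#00FF00", "#98FB98"),  # green / light-green
    3: ("#A52A2A", "#FFFFE0"),  # brown / light-yellow
    4: ("#0000FF", "#ADD8E6"),  # blue / light-blue
    5: ("#FF00FF", "#8B008B"),  # magenta / light-magenta
    6: ("#00FFFF", "#E0FFFF"),  # cyan / light-cyan
    7: ("#FFFFFF", "#F8F8FF"),  # white / light-white
}
COLOR_RE = re.compile(r"\x1b\[([34])([0-7])(;1)?m")


def font_tag(sequence):
    """Return the opening FONT tag for one color sequence, or None."""
    match = COLOR_RE.fullmatch(sequence)
    if match is None:
        return None
    kind, index, bright = match.groups()
    normal, light = PALETTE[int(index)]
    value = light if bright else normal
    # 3x selects the foreground, 4x the background
    attribute = "COLOR" if kind == "3" else "BGCOLOR"
    return '<FONT %s="%s">' % (attribute, value)
```

For example, font_tag("\x1b[31m") gives <FONT COLOR="#FF0000">, while a background sequence such as \E[44;1m gives a tag that validates only against the extended DTD.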
- Includes several LINKs for browsers/agents (when run via an HTTP daemon),
among them a link to the Raw Record (source). This allows one,
with a suitable WWW browser, to go directly from
the HTML presentation to the native presentation of the record.
- Includes META ("HTTP-EQUIV") data for server/agent:
- <META HTTP-EQUIV="Content-Type" CONTENT='text/html; CHARSET="<locale>"'>
where <locale> defines the basic locale (LC_CTYPE) of the
server O/S, e.g. iso_8859_2, when the locale is neither 'C' nor 'iso_8859_1'.
- <META HTTP-EQUIV="Document-type" CONTENT="<doctype>">
where <doctype> is the DOCTYPE of
the record.
- <META HTTP-EQUIV="Database-Name" CONTENT="<db_name>">
where <db_name> is the (base)name of the database.
- <META HTTP-EQUIV="Record-Key" CONTENT="<key>">
where <key> is the key for the record.
- <META HTTP-EQUIV="Element-set" CONTENT="<eset>">
where <eset> is the name of the presented element set.
- <META HTTP-EQUIV="Date-last-modified" CONTENT="<date>">
where <date> is the ANSI date (YYYYMMDD) for the
record.
- <META HTTP-EQUIV="Time-Last-Modified" CONTENT="<time>">
where <time> is the time the record was last modified as
Hour:Min:Sec time_zone
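The META data listed above could be produced along these lines. This is a sketch only: the function meta_tags and its parameter names are hypothetical, and all values are supplied by the caller rather than read from a real record.

```python
from html import escape


def meta_tags(doctype, db_name, key, eset, date, time, locale="C"):
    """Build the META HTTP-EQUIV lines described above as a list of strings."""
    lines = []
    # Content-Type is only emitted for locales other than 'C' and 'iso_8859_1'.
    if locale not in ("C", "iso_8859_1"):
        lines.append(
            "<META HTTP-EQUIV=\"Content-Type\" "
            "CONTENT='text/html; CHARSET=\"%s\"'>" % locale
        )
    pairs = [
        ("Document-type", doctype),
        ("Database-Name", db_name),
        ("Record-Key", key),
        ("Element-set", eset),
        ("Date-last-modified", date),   # YYYYMMDD
        ("Time-Last-Modified", time),   # Hour:Min:Sec time_zone
    ]
    for name, value in pairs:
        lines.append(
            '<META HTTP-EQUIV="%s" CONTENT="%s">'
            % (name, escape(value, quote=True))
        )
    return lines
```

The values are escaped so that a quote character inside a record key or database name cannot terminate the CONTENT attribute early.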
- Supports Dublin Core metadata, marked up using the HTML META framework, for those doctypes with appropriate data.
- Supports 16-bit UCS encoded HTML entities to convert "native documents" (LC_CTYPE)
from ISO-8859-x locales.
Table 3: ISO 8859-x Character Sets
| Locale | ISO | Character Set (Languages) |
| iso_8859_1 | 8859-1 | Latin-1 (Western and Northern Europe including Italian) |
| iso_8859_2 | 8859-2 | Latin-2 (Eastern Europe except Turkey and the Baltic countries) |
| iso_8859_3 | 8859-3 | Latin-3 (Mediterranean, South Africa, Esperanto) |
| iso_8859_4 | 8859-4 | Latin-4 (Scandinavia and the Baltic countries; obsolete) |
| iso_8859_5 | 8859-5 | Part 5 (Cyrillic) |
| iso_8859_6 | 8859-6 | Part 6 (Arabic) |
| iso_8859_7 | 8859-7 | Part 7 (Greek) |
| iso_8859_8 | 8859-8 | Part 8 (Hebrew) |
| iso_8859_9 | 8859-9 | Latin-5 (Turkey, Western Europe except Icelandic and Faroese) |
| iso_8859_10 | 8859-10 | Latin-6 (Northern Europe) |
| As well as a number of other character sets specified during indexing or implied (e.g. ANSEL for USMARC) by the document type. |
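The entity conversion for the locales in Table 3 can be illustrated with a short sketch. It relies on Python's built-in ISO 8859 codecs instead of the product's own tables, the function name to_entities is hypothetical, and escaping of the markup characters &, < and > is deliberately left out.

```python
def to_entities(raw, locale="iso_8859_1"):
    """Decode bytes from an ISO 8859-x locale into ASCII plus &#N; entities."""
    codec = locale.replace("_", "-")  # "iso_8859_2" -> "iso-8859-2"
    text = raw.decode(codec)
    # Keep 7-bit characters; write the rest as numeric UCS references.
    return "".join(
        ch if ord(ch) < 128 else "&#%d;" % ord(ch)
        for ch in text
    )
```

The same byte can map to different entities depending on the locale, which is why the locale has to be known before conversion.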
Support for other locales is also possible, since the conversion is table driven. See:
Although it is also possible, we don't handle
man(1)-type man page
references, since this would require public use of BSn's htmlman 5)
server.
The logic for "recognizing" E-mail
addresses follows the conventions for Internet Mail Addresses:
- Email Logic:
- EMAIL := PATH "@" [HOST | ADDR]
- HOST := ALPHA *( ALPHA | DIGIT | "." | "-" )
- ADDR := DIGITS "." DIGITS "." DIGITS "." DIGITS
- DIGITS := DIGIT *( DIGIT )
- PATH := ( ALPHA | DIGIT ) *( ALPHA | DIGIT | '-' | '.' | '!' | '%' | '_' )
- Support for RFC822-MTSs: SMTP, Smart UUCP, Bitnet, Decnet etc. See:
- Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, University of Delaware, August 1982.
Note that the domains '.uucp', '.decnet' and '.bitnet'
have no registered Internet routing. Such addresses must always be routed to
a gateway. The routing of Bitnet or Decnet addresses is the responsibility of
the client's mailhost (e.g. on Unix platforms the program
sendmail and the correct configuration of gateway rules in
sendmail.cf).
- Versions > 1.11 support X.400 to Internet (MHS) Gateways
per RFC 1327.
The algorithms have been designed to identify lexically correct
email addresses with a high penalty for incorrect identification.
Context checks are used to reduce error. The marked-up email address is,
however, not verified.
- Valid address examples:
- user@host.dom
- foo.bar@host.dom
- Z3950IW%NERVM.BITNET@vm.gmd.de
- joe!sam@foo.bar
- /G=Edward/I=C/S=Zimmermann/O=BSn@mhs.bsn.com
- /C=zz/ADMD=ade/PRMD=fhbo/O=a_bank/S=plork/@gw.switch.ch
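The EMAIL grammar given earlier translates fairly directly into a regular expression. The sketch below covers only that grammar: the names EMAIL_RE and is_email are hypothetical, the X.400 forms in the last two examples fall outside it, and none of the context checks mentioned above are attempted.

```python
import re

# PATH "@" [HOST | ADDR], following the Email Logic grammar.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9\-.!%_]*"   # PATH
    r"@"
    r"(?:\d+\.\d+\.\d+\.\d+"           # ADDR: DIGITS.DIGITS.DIGITS.DIGITS
    r"|[A-Za-z][A-Za-z0-9.\-]*)"       # HOST
)


def is_email(candidate):
    """True if the whole string is lexically an EMAIL per the grammar."""
    return EMAIL_RE.fullmatch(candidate) is not None
```

A match means only that the string is lexically well formed; as noted above, the address itself is not verified.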
URLs (Uniform Resource Locators) are
defined in RFC1808 (June 95) and RFC1738 (Dec 94). They are recognized according
to standard IETF conventions.
- URL logic:
- <scheme>://<net_loc>/[<path>;<params>?<query>#<fragment>]
<scheme>://<net_loc>/
is mandatory.
These components are defined as follows:
- <scheme>: ::=
- scheme name, as per Section 2.1 of RFC 1738.
- //<net_loc> ::=
- network location and login information, as per
Section 3.1 of RFC 1738.
- /<path> ::=
- URL path, as per Section 3.1 of RFC 1738.
- ;<params> ::=
- object parameters (e.g., ";type=a" as in
Section 3.2.2 of RFC 1738).
- ?<query> ::=
- query information, as per Section 3.3 of RFC 1738.
- #<fragment> ::=
- fragment identifier.
- Schemes:= ALPHA *( ALPHA | DIGIT | "." | "-" )
- Examples of "acceptable" URLs:
- http://www.bsn.com/Z39.50
- http://www.bsn.com/Z39.50/Doctype.html#URL
- http://www.bsn.com/cgi-bin/iform?FILMS
- ftp://foo.bar.edu/test/me/now/Readme;type=a
- The following would, for example, NOT get marked-up:
- file:/tmp/readme
- http:///tmp/readme
- http:/../
Since they do not map to an Internet resource.
- Note: No check is made to see if the URL has an IANA registered
scheme. Due to the current proliferation of methods "in-use"
and proposed, this has been a pragmatic, not a technical, decision.
- Limitations: Does not mark up relative URLs that use HTML conventions,
such as HREF="/FOO". This is a feature of URLs: to
resolve relative URLs in a document the browser must have a base, and it is
often not clear within
closed content what the base of the URL is.
The algorithms have been designed to identify lexically correct
URLs with a high penalty for incorrect identification. Context
checks are used to reduce error. The marked-up URL is, however,
not verified.
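The URL logic above can be approximated with a regular expression that requires the mandatory <scheme>://<net_loc>/ prefix. This is a rough sketch, not the product's recognizer: URL_RE and markup_urls are hypothetical names, no context checks are made, and the input is assumed to contain no characters that need HTML escaping.

```python
import re

URL_RE = re.compile(
    r"[A-Za-z][A-Za-z0-9.\-]*"   # scheme: ALPHA *( ALPHA | DIGIT | "." | "-" )
    r"://"
    r"[^/\s?#;]+"                # net_loc: must not be empty
    r"/"                         # the slash after net_loc is mandatory
    r"[^\s<>\"]*"                # optional path, params, query, fragment
)


def markup_urls(text):
    """Wrap every recognized URL in text in an anchor element."""
    def link(match):
        url = match.group(0)
        return '<A HREF="%s">%s</A>' % (url, url)
    return URL_RE.sub(link, text)
```

Because an empty <net_loc> and a missing "//" both fail to match, strings such as file:/tmp/readme, http:///tmp/readme and http:/../ pass through unchanged.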
1) The URNs are not resolved to URLs. Although the
URNs are marked up they will only work via suitable HTTP proxy servers
or with clients modified or extended to support URNs.
2) Underline <U> has been
deprecated from HTML. Although it is no longer in the HTML 3.2 DTD we still continue
to mark up underlined text using <U>.
3) Overstrike <STRIKE> was <S> in earlier HTML
drafts (e.g. "draft-ietf-html-spec-04.txt"). We adopt the tags from the
latest DTDs...
4) <BLINK> is a Netscape extension and
NOT part of the HTML standard.
5) htmlman, developed by BSn in '92 along with the "original"
mail2html and other WWW-tools, is a high-speed autotagger to produce validated
(hypertext) HTML "on-the-fly" from formatted man pages.