Tibetan Unicode: Difference between revisions

From Rangjung Yeshe Wiki - Dharma Dictionary
(31 intermediate revisions by 3 users not shown)
<span class=TibUni16>[[ཡུ་ནི་ཀོཌ྄]]</span>
=='''Introduction'''==


The situation for Tibetan was particularly anarchical. There was no recognized standard for encoding Tibetan script characters. Word-processing applications and add-ins for Tibetan used non-standardized, proprietary font-based encodings - mapping the Tibetan glyphs in the fonts they used to character sets originally designed for encoding Roman or Chinese characters. Because each Tibetan system used its own encoding, files could not easily be shared between different Tibetan word-processing programs, or with other applications, without first converting them from one encoding scheme to another. This was one of the greatest obstacles to using electronic Tibetan data.


== Tibetan in the Unicode Standard ==
{|border="1" cellspacing="0" cellpadding="5" class="wikitable" style="border-collapse:collapse;background:#FFFFFF;font-size:x-large; font-family: Kailash, Jomolhari, 'Tibetan Machine Uni'; text-align:center"
|-
|colspan="18" style="background:#F8F8F8;font-size:small"| '''Chart of Tibetan script characters in the Unicode Standard'''<br />[http://www.unicode.org/charts/PDF/U0F00.pdf Unicode.org chart] (PDF)
|-style="background:#F8F8F8;font-size:small"
| style="width:10%" | &nbsp; || style="width:5%"  | 0 || style="width:5%"  | 1 || style="width:5%"  | 2 || style="width:5%"  | 3 || style="width:5%"  | 4 || style="width:5%"  | 5 || style="width:5%"  | 6 || style="width:5%"  | 7 || style="width:5%"  | 8 || style="width:5%"  | 9 || style="width:5%"  | A || style="width:5%"  | B || style="width:5%"  | C || style="width:5%"  | D || style="width:5%"  | E || style="width:5%"  | F ||style="width=*; background:#F8F8F8;" rowspan="17"| &nbsp; 
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F0x
| &#xf00; || &#xf01; || &#xf02; || &#xf03; || &#xf04; || &#xf05; || &#xf06; || &#xf07;
| &#xf08; || &#xf09; || &#xf0a; || &#xf0b; || &#xf0c; || &#xf0d; || &#xf0e; || &#xf0f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F1x
| &#xf10; || &#xf11; || &#xf12; || &#xf13; || &#xf14; || &#xf15; || &#xf16; || &#xf17;
| &#xf18; || &#xf19; || &#xf1a; || &#xf1b; || &#xf1c; || &#xf1d; || &#xf1e; || &#xf1f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F2x
| &#xf20; || &#xf21; || &#xf22; || &#xf23; || &#xf24; || &#xf25; || &#xf26; || &#xf27;
| &#xf28; || &#xf29; || &#xf2a; || &#xf2b; || &#xf2c; || &#xf2d; || &#xf2e; || &#xf2f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt"| U+0F3x
| &#xf30; || &#xf31; || &#xf32; || &#xf33; || &#xf34; || &#xf35; || &#xf36; || &#xf37;
| &#xf38; || &#xf39; || &#xf3a; || &#xf3b; || &#xf3c; || &#xf3d; || &#xf3e; || &#xf3f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F4x
| &#xf40; || &#xf41; || &#xf42; || &#xf43; || &#xf44; || &#xf45; || &#xf46; || &#xf47;
| bgcolor="#CCCCCC" | &nbsp; || &#xf49; || &#xf4a; || &#xf4b; || &#xf4c; || &#xf4d; || &#xf4e; || &#xf4f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F5x
| &#xf50; || &#xf51; || &#xf52; || &#xf53; || &#xf54; || &#xf55; || &#xf56; || &#xf57;
| &#xf58; || &#xf59; || &#xf5a; || &#xf5b; || &#xf5c; || &#xf5d; || &#xf5e; || &#xf5f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F6x
| &#xf60; || &#xf61; || &#xf62; || &#xf63; || &#xf64; || &#xf65; || &#xf66; || &#xf67;
| &#xf68; || &#xf69; || &#xf6a; || &#xf6b; || &#xf6c; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F7x
| bgcolor="#CCCCCC" | &nbsp; || &#xf71; || &#xf72; || &#xf73; || &#xf74; || &#xf75; || &#xf76; || &#xf77;
| &#xf78; || &#xf79; || &#xf7a; || &#xf7b; || &#xf7c; || &#xf7d; || &#xf7e; || &#xf7f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F8x
| &#xf80; || &#xf81; || &#xf82; || &#xf83; || &#xf84; || &#xf85; || &#xf86; || &#xf87;
| &#xf88; || &#xf89; || &#xf8a; || &#xf8b; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0F9x
| &#xf90; || &#xf91; || &#xf92; || &#xf93; || &#xf94; || &#xf95; ||  &#xf96;  || &#xf97;
| bgcolor="#CCCCCC" | &nbsp; || &#xf99; || &#xf9a; || &#xf9b; || &#xf9c; || &#xf9d; || &#xf9e; || &#xf9f;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0FAx
| &#xfa0; || &#xfa1; || &#xfa2; || &#xfa3; || &#xfa4; || &#xfa5; || &#xfa6; || &#xfa7;
| &#xfa8; || &#xfa9; || &#xfaa; || &#xfab; || &#xfac; || &#xfad; || &#xfae; || &#xfaf;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0FBx
| &#xfb0; || &#xfb1; || &#xfb2; || &#xfb3; || &#xfb4; || &#xfb5; || &#xfb6; || &#xfb7;
| &#xfb8; || &#xfb9; || &#xfba; || &#xfbb; || &#xfbc; || bgcolor="#CCCCCC" | &nbsp; || &#xfbe; || &#xfbf;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0FCx
| &#xfc0; || &#xfc1; || &#xfc2; || &#xfc3; || &#xfc4; || &#xfc5; || &#xfc6; || &#xfc7;
| &#xfc8; || &#xfc9; || &#xfca; || &#xfcb; || &#xfcc; || bgcolor="#CCCCCC" | &nbsp; || &#xfce; || &#xfcf;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0FDx
| &#xfd0; || &#xfd1; || &#xfd2; || &#xfd3; || &#xfd4; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
| bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0FEx
| bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
| bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
|-
| style="background:#F8F8F8;font-size:small; font-family: Verdana, sans; height: 36pt;"| U+0FFx
| bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
| bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp; || bgcolor="#CCCCCC" | &nbsp;
|}
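
The grey cells in the chart are code points with no character assigned. As a rough sketch, they can be listed with Python's standard `unicodedata` module. The function name below is only illustrative, and the result depends on the version of the Unicode data shipped with the Python interpreter, so a newer interpreter may report fewer unassigned code points than the chart above shows.

```python
import unicodedata

def unassigned_in_tibetan_block():
    """Return the code points in U+0F00..U+0FFF that have no assigned character.

    Unassigned code points have the general category "Cn" in the
    Unicode data bundled with this Python interpreter.
    """
    return [
        cp for cp in range(0x0F00, 0x1000)
        if unicodedata.category(chr(cp)) == "Cn"
    ]

if __name__ == "__main__":
    gaps = unassigned_in_tibetan_block()
    # Print the gaps in the same U+XXXX notation the chart uses.
    print(" ".join(f"U+{cp:04X}" for cp in gaps))
```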


=='''Background:''' Some 'Legacy' Tibetan Systems & Encodings ==


===8-bit Tibetan word-processing systems===


Tibetan texts may contain thousands of different character combinations (stacks displayed as a ligature or combination of glyphs). Many of these systems, however, were programmed to work with the 8-bit operating systems and applications in use in western countries, India, Bhutan and Nepal at the time they were designed, and so mapped the glyphs in their fonts to restricted 8-bit character sets supporting a maximum of 256 characters. This meant that these applications had to spread the glyph set required for comprehensive coverage of Tibetan across a whole series of separate font files. The different glyphs in each font were necessarily mapped to the same underlying character set. A system using a set of five fonts (e.g. the TCC system) to cover the Tibetan combinations it supports is in fact storing five different glyphs, each representing a different Tibetan character or combination of characters, using one common underlying character. A system using a set of thirty fonts (e.g. Nitartha’s Sambhota system with Dedris fonts) is using one and the same character for thirty different combinations of Tibetan letters.
==Characters and Glyphs==
The Unicode Standard encodes characters, not glyphs<ref>[http://en.wikipedia.org/wiki/Glyphs glyphs]</ref> or letter-forms. Complex Tibetan combinations or ligatures are encoded in text as the sequence of individual characters representing their parts.


With all of these systems, in order to specify which particular combination of Tibetan letters the underlying character represents, it is necessary to store information on which specific font within the set this underlying character should be displayed with. In other words, font formatting information has to be stored along with the text. Effectively these systems are switching between five or thirty different code pages for Tibetan, with the code page in use determined by the font in use. This means that Tibetan text entered using one of these systems is tied to the system which produced it, or at least to the particular font set(s) it supported. These systems require Tibetan to be stored as rich text (text plus formatting information) rather than plain text, and if the formatting information is lost or corrupted the Tibetan text data is corrupted too and becomes garbage.
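
The dependence on font information can be illustrated with a small sketch. The font names and byte assignments below are invented for illustration and do not reproduce the tables of the TCC, Sambhota or any other real system. The same byte decodes to a different Tibetan combination under each font, so once the font runs are stripped the bytes alone are ambiguous.

```python
# Hypothetical 8-bit font tables: the same byte maps to a different Tibetan
# character or stack depending on which font of the set is applied.
# All names and assignments here are made up for this example.
FONT_TABLES = {
    "LegacyTib1": {0x41: "\u0F40"},              # letter ka
    "LegacyTib2": {0x41: "\u0F66\u0F90"},        # sa + subjoined ka
    "LegacyTib3": {0x41: "\u0F66\u0F92\u0FB2"},  # sa + subjoined ga + subjoined ra
}

def decode_run(font, data):
    """Decode one run of bytes using the table of the font applied to it."""
    table = FONT_TABLES[font]
    return "".join(table[b] for b in data)

def decode_rich_text(runs):
    """Decode a list of (font name, bytes) runs, i.e. text plus formatting."""
    return "".join(decode_run(font, data) for font, data in runs)

# Rich text: two runs that store the identical byte under different fonts.
rich = [("LegacyTib1", b"A"), ("LegacyTib3", b"A")]

# Plain text: the same data after the font information has been lost.
plain = b"".join(data for _, data in rich)

if __name__ == "__main__":
    print(decode_rich_text(rich))
    # Without the font runs, every font table gives a different reading.
    for font in FONT_TABLES:
        print(font, decode_run(font, plain))
```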
==How to install Tibetan Unicode software support==
See [[Tibetan Unicode Installation]] for information on how to install the required software support for Tibetan Unicode on Windows, Linux and Mac OS X.


Another consequence of representing and storing Tibetan text in this way is that it becomes practically impossible to reliably search, sort, index or spell-check Tibetan data. Since most searching, sorting, indexing and spell-checking utilities are designed to work with plain text rather than rich text, they either ignore - or choke on - any formatting information applied to the underlying characters. Transmission of text across the internet to a large number of users is also a problem, as there is no guarantee that a system receiving the data has access to, or can even support, the fonts and application used to generate the original data. Long-term archival storage of Tibetan data in these formats also relies on the highly risky assumption that the non-standardized applications and fonts used to generate them will be supported in future versions of operating systems for years and decades to come.
==Current Limitations of Unicode Tibetan==


Of course the systems referred to above were first developed at a time when most users were using them on stand-alone computers not connected to the internet. They necessarily had to be designed to work within and leverage the technology and applications available at the time. Their primary purpose was to produce Tibetan documents which would be printed out rather than displayed on screen. Some of these Tibetan word-processing systems were developed and enhanced to effectively become sophisticated desktop publishing systems for Tibetan pecha, and their fonts evolved to a high standard of design. These systems indeed became very good for the purposes for which they were originally intended, but we should recognize their limitations.
* The main limitation is lack of support in older operating systems and applications.
 
===Systems developed in India===
 
 
===Multi-byte Tibetan systems developed in China===
The situation for Tibetan computing in China was very different. Computer systems in China were designed from the start to support the large character sets needed for the Chinese language, so they did not suffer the same limitations of a character set restricted to 256 characters as in the west. Although in the early stages few Tibetan individuals could afford to purchase a computer, state-owned publishing houses were set up to produce new editions of Tibetan texts and to publish newspapers, textbooks and magazines in the Tibetan language. Consequently two Chinese publishing systems ___ and ___ were adapted to work with Tibetan 
       
 
==Characters and Glyphs==
 
==Current Limitations of Unicode Tibetan==


*  Support for complex Indic scripts including Tibetan is currently lacking in pre-press / DTP software such as Adobe InDesign and Scribus used by printers and publishers.


==See also==
[[Tibetan Fonts]]
*[[Tibetan Fonts]]
 
*[[Legacy Tibetan Software & Character Encoding]]
===External links===
==External links==
* [http://www.unicode.org/standard/WhatIsUnicode.html What is Unicode?]
* [http://www.thdl.org/xml/showEssay.php?xml=/tools/encodingTib.xml&m=all Encoding model of the Tibetan script in the UCS] - Explains how Tibetan characters are encoded in the ISO 10646 / Unicode Standard. by [[Christopher Fynn]]  
* [http://www.thlib.org/tools/#wiki=/access/wiki/site/26a34146-33a6-48ce-001e-f16ce7908a6a/encoding%20model%20of%20the%20tibetan%20script%20in%20the%20ucs.html Encoding model of the Tibetan script in the UCS] - Explains how Tibetan characters are encoded in the ISO 10646 / Unicode Standard. by [[Christopher Fynn]]  
* [http://www.unicode.org/charts/PDF/U0F00.pdf Tibetan Block of The Unicode Standard] (code chart)
* [http://en.wikipedia.org/wiki/Wikipedia:Enabling_complex_text_support_for_Indic_scripts Enabling Complex Script Text Support for Indic Scripts] - on Wikipedia (includes Tibetan)

Latest revision as of 10:23, 10 June 2009

ཡུ་ནི་ཀོཌ྄

Introduction

Before the Unicode Standard came along, there were hundreds of different standardized and non-standardized encoding systems for encoding the characters of different writing systems. No single character encoding had enough characters to encode all the characters used in all the different writing systems of the world. Even for Western European languages, which use an uncomplicated writing system, the 7-bit and 8-bit computer character sets, such as ASCII and ISO 8859-1, used for encoding the Roman script were inadequate for all the letters, punctuation, and technical symbols in common use.

The situation for Tibetan was particularly anarchical. There was no recognized standard for encoding Tibetan script characters. Word-processing applications and add-ins for Tibetan used non-standardized, proprietary font-based encodings - mapping the Tibetan glyphs in the fonts they used to character sets originally designed for encoding Roman or Chinese characters. Because each Tibetan system used its own encoding, files could not easily be shared between different Tibetan word-processing programs, or with other applications, without first converting them from one encoding scheme to another. This was one of the greatest obstacles to using electronic Tibetan data.

Tibetan in the Unicode Standard

Chart of Tibetan script characters in the Unicode Standard
Unicode.org chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F  
(Rows U+0F0x through U+0FFx; the individual characters are listed in the Unicode.org chart.)


Characters and Glyphs

The Unicode Standard encodes characters, not glyphs[1] or letter-forms. Complex Tibetan combinations or ligatures are encoded in text as the sequence of individual characters representing their parts.
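
As a minimal sketch of what this means in practice, the stack སྒྲ is stored as three separate characters, and it is left to the font and rendering software to draw them as a single combined glyph. The helper name `code_points` is only illustrative; the character names come from Python's standard `unicodedata` module.

```python
import unicodedata

def code_points(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# One visual stack, written as its three component characters.
stack = "\u0F66\u0F92\u0FB2"  # སྒྲ

if __name__ == "__main__":
    for cp, name in code_points(stack):
        print(cp, name)
```

Because the text is stored as plain characters, it can be searched, sorted and exchanged without any font information attached.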


How to install Tibetan Unicode software support

See Tibetan Unicode Installation for information on how to install the required software support for Tibetan Unicode on Windows, Linux and Mac OS X.

Current Limitations of Unicode Tibetan

  • The main limitation is lack of support in older operating systems and applications.
  • Support for complex Indic scripts including Tibetan is currently lacking in pre-press / DTP software such as Adobe InDesign and Scribus used by printers and publishers.

See also

External links