Computer representation of Sanskrit

August 7, 2008

Dear list members,

Now that we're all using Harvard-Kyoto transliteration, perhaps it's

time to look at another aspect of Sanskrit encoding: not so much

how best to communicate Sanskrit between human beings, but how best

to encode Sanskrit for computer processing, analysis and study.

An encoding of a language to work ideally with computers needs to

have two characteristics.

1) One letter of the encoding needs to correspond to one letter of

the language.

2) The encoding needs to sort naturally in the language's alphabetical

order. E.g. in ASCII, a is encoded as 97, b as 98, c as 99,

etc., so English text naturally sorts as a, b, c, etc.

None of the encodings in popular use today, not even Unicode, has

both characteristics, and this complicates computer processing of

Sanskrit, though I do see that one encoding, SLP1, does have a one-to-one

correspondence between letters of the encoding and letters of

Sanskrit.
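
As a rough illustration of the two characteristics, the sketch below starts from SLP1, which already uses one character per letter, and remaps it so that plain code point order matches Sanskrit alphabetical order. The target range (the Unicode Private Use Area starting at U+E000) and the names SLP1_ORDER, to_sortable and sanskrit_sorted are assumptions made for this example only; this is not an existing standard.

```python
# SLP1 letters in traditional Sanskrit alphabetical order, one character each:
# vowels, anusvara and visarga, the five stop classes, semivowels, sibilants, h.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"

# Hypothetical sortable alphabet (an assumption of this sketch): the n-th
# Sanskrit letter becomes the n-th code point of the Private Use Area.
SORTABLE = "".join(chr(0xE000 + i) for i in range(len(SLP1_ORDER)))

TO_SORTABLE = str.maketrans(SLP1_ORDER, SORTABLE)
FROM_SORTABLE = str.maketrans(SORTABLE, SLP1_ORDER)


def to_sortable(word):
    """Re-encode an SLP1 word; characters outside SLP1_ORDER pass through."""
    return word.translate(TO_SORTABLE)


def sanskrit_sorted(words):
    """Sort SLP1 words in Sanskrit order using only plain string comparison."""
    encoded = sorted(to_sortable(w) for w in words)
    return [w.translate(FROM_SORTABLE) for w in encoded]


words = ["deva", "gaja", "kavi", "Atman", "indra", "agni"]
print(sorted(words))           # code point order of SLP1 itself
print(sanskrit_sorted(words))  # Sanskrit alphabetical order
```

In this sketch SLP1 satisfies the first characteristic but not the second (for instance, A sorts before a and g before k), while the remapped form satisfies both, at the cost of not being readable without converting back.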

To give a very simple example, suppose you had a database of catalog

records of Sanskrit texts and their authors, and you wanted to be

able to print out the texts sorted by either author or title in

correct Sanskrit letter order. You would then need to store the author

and the title in the database twice: once in an encoding

corresponding to a font for display, and once in an encoding

ordered in the correct Sanskrit alphabetical sequence to sort on. Or

else you'd have to program some special sorting routines in your

program.
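
The "special sorting routine" option might look roughly like the following sketch for Harvard-Kyoto text. The helper names (hk_letters, hk_sort_key) and the sample catalog records are made up for illustration. The sketch also assumes that every "ai", "au", "kh" and so on is a single Sanskrit letter, which is not always true in running text (for example across a word boundary written without a space), and it silently ignores any character that is not a Harvard-Kyoto letter.

```python
import re

# Harvard-Kyoto letters listed in traditional Sanskrit alphabetical order.
HK_ORDER = [
    "a", "A", "i", "I", "u", "U", "R", "RR", "lR", "lRR",
    "e", "ai", "o", "au", "M", "H",
    "k", "kh", "g", "gh", "G",
    "c", "ch", "j", "jh", "J",
    "T", "Th", "D", "Dh", "N",
    "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m",
    "y", "r", "l", "v", "z", "S", "s", "h",
]
RANK = {letter: i for i, letter in enumerate(HK_ORDER)}

# Longest alternatives first, so that "ai" wins over "a" and "lRR" over "lR".
LETTER_RE = re.compile("|".join(sorted(HK_ORDER, key=len, reverse=True)))


def hk_letters(word):
    """Split a Harvard-Kyoto string into Sanskrit letters."""
    return LETTER_RE.findall(word)


def hk_sort_key(word):
    """Sort key: the alphabetical rank of each letter, in order."""
    return [RANK[letter] for letter in hk_letters(word)]


# Illustrative catalog records, stored once, in Harvard-Kyoto only.
records = [
    {"author": "kAlidAsa", "title": "meghadUta"},
    {"author": "jayadeva", "title": "gItagovinda"},
    {"author": "kAlidAsa", "title": "RtusaMhAra"},
    {"author": "bhAravi", "title": "kirAtArjunIya"},
]

by_title = sorted(records, key=lambda r: hk_sort_key(r["title"]))
by_author = sorted(records, key=lambda r: hk_sort_key(r["author"]))
print([r["title"] for r in by_title])
print([r["author"] for r in by_author])
```

With a key function like this the database needs only one copy of each field, but every program that sorts the data has to carry the letter table with it.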

Or another simple example: suppose you have a word and you want to

find all words in a text that differ from it by one letter. You use

what are called "regular expressions". For example, in English, to

find all cases of g-ve (i.e. give, gave etc.) you would input to the

computer program the simple pattern |g.ve| and that would return all

cases of give, gave etc. To do the same thing in Sanskrit is

more complicated because ai and au each represent only one letter in

Sanskrit. For example |v.rya| would return virya but not vairya,

because to your computer program ai is two letters, while |v..rya|

would return vairya but not virya. Similarly v(i|ai)rya would return

virya and vairya but miss all misspellings of the word, so even in

Unicode you need to put in the more complicated v(.|ai|au|)rya to

get all correct or misspelled cases.
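
The difference can be sketched in a few lines. The example below first runs the pattern v.rya against Harvard-Kyoto words directly, then against the same words converted to SLP1, where ai and au are single characters (E and O). The conversion table lists only the letters that differ between the two schemes as they are usually published and should be checked before real use; the function names and the word list are invented for the example.

```python
import re

# Harvard-Kyoto letters whose SLP1 form differs; all other letters are identical.
HK_TO_SLP1 = {
    "RR": "F", "R": "f", "lRR": "X", "lR": "x", "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "G": "N", "ch": "C", "jh": "J", "J": "Y",
    "Th": "W", "T": "w", "Dh": "Q", "D": "q", "N": "R",
    "th": "T", "dh": "D", "ph": "P", "bh": "B", "z": "S", "S": "z",
}

# One pass, longest keys first, so converted letters are never converted again.
HK_RE = re.compile(
    "|".join(re.escape(k) for k in sorted(HK_TO_SLP1, key=len, reverse=True))
)


def hk_to_slp1(text):
    """Convert Harvard-Kyoto text to SLP1 (one character per letter)."""
    return HK_RE.sub(lambda m: HK_TO_SLP1[m.group(0)], text)


def matches_raw(pattern, words):
    """Match the pattern against the Harvard-Kyoto spelling as typed."""
    return [w for w in words if re.fullmatch(pattern, w)]


def matches_by_letter(pattern, words):
    """Match the pattern against the one-character-per-letter form."""
    return [w for w in words if re.fullmatch(pattern, hk_to_slp1(w))]


words = ["virya", "vIrya", "vairya", "vaurya", "kArya"]
print(matches_raw("v.rya", words))        # misses vairya and vaurya
print(matches_by_letter("v.rya", words))  # "." now means one Sanskrit letter
```

Note that in the second case the pattern itself has to be written in SLP1; v.rya happens to be spelled the same way in both schemes.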

These are of course very simple examples that can easily be gotten

round but they do illustrate in a simple way the complications that

the present encodings of Sanskrit present.

At present we probably have only a few thousand digitized Sanskrit

e-texts, but the way things are going it's quite possible that in a few

years' time there will be a huge corpus of digitized Sanskrit

texts available for computer analysis of all kinds: analysis

of subtle differences in word counts and usage to help identify

whether an attributed author is in fact the author, analysis to see

if a text was written by one author or many, computerized

finding of parallel passages, computerized help in creating

concordances, etc. All of this is much easier if you have an

encoding with the two characteristics mentioned earlier, i.e.:

1) A one-to-one correspondence between a letter of the encoding and a

letter of the language.

2) An encoding that collates (i.e. sorts naturally) in the sort

order of the language's letters.

Regards,

Harry Spier

August 20, 2008

Until the ideal encoding is invented and becomes prevalent,

it would be nice if every Sanskrit text had a standard marker

that identifies the encoding used. This way computer programs would have

at least a chance to adapt to the various encodings that are less than

perfect.

There are two steps to this goal:

1. Create a list of abbreviations for all encodings in electronic use,

and

2. update existing files with a marker that uses the corresponding abbreviation.

For example:

1. HK for Harvard-Kyoto

   CSX for CSX

   UTF8 for Unicode (UTF-8)

   REE for REE

...

etc.

2. Let the marker be something standard and simple like

%%##skt-encoding=HK##%%

somewhere at the beginning of a document.
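
A program could read and write such a marker along the following lines. This is only a sketch of the proposal: the marker text and the four abbreviations are taken from the example above, while the function names, the 1024-character search window and the rule that abbreviations consist of letters, digits, hyphens and underscores are assumptions.

```python
import re

# Marker format from the proposal; the allowed abbreviation characters are assumed.
MARKER_RE = re.compile(r"%%##skt-encoding=([A-Za-z0-9_-]+)##%%")

# Abbreviations mentioned so far; a real list would have to be agreed on.
KNOWN_ENCODINGS = {"HK", "CSX", "UTF8", "REE"}


def detect_encoding(text, search_chars=1024):
    """Return the abbreviation from a marker near the start of text, or None."""
    match = MARKER_RE.search(text[:search_chars])
    if match is None:
        return None
    return match.group(1)


def add_marker(text, abbreviation):
    """Prepend a marker line to a document that does not have one yet."""
    if detect_encoding(text) is not None:
        return text  # leave already-marked documents alone
    return f"%%##skt-encoding={abbreviation}##%%\n{text}"


doc = add_marker("dharmakSetre kurukSetre", "HK")
encoding = detect_encoding(doc)
print(encoding, encoding in KNOWN_ENCODINGS)
```

A program that finds no marker, or an abbreviation it does not know, would still have to fall back on guessing or on asking the user.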

I think that creating abbreviations for existing encodings

and the choice of a marker might be accomplished on this forum.

Regards,

Dmitri.

August 31, 2008

Re: Computer representation of Sanskrit

I find that the best computer representation of Sanskrit, Hindi, and a few other

Indian languages is "Sanskrit 2003", at least for IBM machines. I am teaching

myself basic Hindi. I already know the Devanagari script for reading Hindi and

Sanskrit, and I already have a basic vocabulary of 25 expressions in Hindi.

I intend one day to be fluent in Hindi and Sanskrit. I plan to travel to India

and spend a couple of months in an ashram studying the Vedic scriptures and

meditating.

Ashok Aklujkar and his wife and daughter (Rasika) were, I believe, the ones who

encouraged me to learn Hindi. I very much enjoyed the Butterfly and Bee dance story

in Surrey during the BCACL 2008 conference. It really opened my eyes

to Indian and Hindu culture.

All the best,

Lyle Lexier
