Question about how to specify coding system chains: encoding or decoding perspective?

Friday, 7 September 2001

        Now that I've created chains of coding systems and gotten them to work, the
issue comes up: when specifying the chain, does it make more sense to take a
decoding or encoding perspective?

e.g. imagine i have data in base64 wrapping of gzip of euc-jp of CRLF text
[coding system convert-eol-crlf, which assumes the EOL processing has been
extracted into a separate coding system, which i'm in the process of doing]. 
once you specify a coding system chain to do this transformation, you can use it
both ways: decoding to get the actual text, encoding to get some base64 stuff
[or vice versa if you reversed the order of the chain].  as you can see,
everything is symmetrical and reversible, and it's easy to get mixed up unless
there's a clear standard, which hopefully is the most intuitive.

so, what do people think: given that we for the moment are talking about
processes where the result of *decoding* is just the text, should we consider as
basic the steps required to decode:

base64 | gunzip | euc-jp | convert-eol-crlf,

or the steps required to encode:

convert-eol-crlf | euc-jp | gzip | base64

Note that I changed the name of the gzip process between gzip and gunzip so as
not to bias things.  In some ways decode makes sense because that's what the
end-user is more likely to be doing, and they may get confused having to specify
encode-centric steps. [encoding will typically just use what's been decoded.]
In some ways encode makes sense because it makes more sense to attach data to a
file describing how it was created, rather than how to decode it. [and indeed
this is how the world works -- utilities are generically called "tar" and
"gzip", implying encoding.]

this also affects the naming.  e.g. i have a coding system that converts between
unicode and multibyte; when decoding multibyte, the steps are [multibyte-data] |
multibyte-to-unicode | unicode | convert-eol-crlf, and you get raw text.  when
encoding, the steps are reversed.  do i call this coding system
`multibyte-to-unicode' [emphasizing what it does when decoding] or
`unicode-to-multibyte' [emphasizing what it does when encoding]?  as usual,
names tend to emphasize their encoding process, so i called it
unicode-to-multibyte; but that gets very confusing if you're specifying a
decoding chain, since the operation will be multibyte-to-unicode but you'll
still have to give the opposite name.
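
to make the symmetry concrete, here is a rough sketch in Python [not the actual
coding system code; the STEPS table and the encode/decode helpers are names made
up for illustration].  it arbitrarily takes the encoding perspective: the chain
is listed and named in encode order, and decoding walks the same list backward,
so the step named "gzip" is really doing a gunzip at that point.

```python
import base64
import gzip

# Hypothetical chain description: (name, encode function, decode function).
# Names and order follow the *encoding* perspective, purely as an assumption.
STEPS = [
    ("convert-eol-crlf",
     lambda s: s.replace("\n", "\r\n"),      # text with LF -> text with CRLF
     lambda s: s.replace("\r\n", "\n")),     # text with CRLF -> text with LF
    ("euc-jp",
     lambda s: s.encode("euc_jp"),           # text -> EUC-JP bytes
     lambda b: b.decode("euc_jp")),          # EUC-JP bytes -> text
    ("gzip", gzip.compress, gzip.decompress),
    ("base64", base64.b64encode, base64.b64decode),
]


def encode(text, steps=STEPS):
    """Run the chain forward: raw text in, base64 bytes out."""
    data = text
    for _name, enc, _dec in steps:
        data = enc(data)
    return data


def decode(data, steps=STEPS):
    """Run the same chain backward: base64 bytes in, raw text out."""
    for _name, _enc, dec in reversed(steps):
        data = dec(data)
    return data


if __name__ == "__main__":
    text = "日本語\nテキスト\n"
    wrapped = encode(text)
    # The round trip should give back the original text.
    print(decode(wrapped) == text)
```

had the table been written from the decoding perspective instead, the list
would be reversed and the third entry would naturally be called "gunzip"; the
data flow is identical, only the naming and the listed order change.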

sorry if this seems very confusing.  the basic point is:

for naming, we need to settle on one of {encoding, decoding} as the "basic"
operation, and the other as the "undoing" operation.  all naming, all chains,
etc. have to take the point of view of the "basic" operation, and when doing the
other operation, you need to consciously say to yourself, "i'm doing a reverse
operation" and choose or specify the "forward" operation, knowing it will go
backward.

which of {encoding,decoding} is the "basic", "forward" operation?

ben
