greg wm wrote:
> i used wget to copy the entire http://nonviolentpeaceforce.org site to
> http://nvpf.org/np.  the former is in m$ asp, the latter captured as html.
> 
> for example, http://nonviolentpeaceforce.org/spanish/welcome.asp was
> captured to http://nvpf.org/np/spanish/welcome.asp.html
> 
> as you can see, the capture is mostly fine, including spanish characters
> in the text (eg año), however the spanish characters in the menus didn't
> do quite so well (eg Misi?n)

fixed!  see below.

> in the file año appears as a&ntilde;o which is apparently "good", but
> Misi?n appears as Misión, which is apparently "bad".
> 
> first question:  why is that bad?
> 
> if i tell galeon, instead of automatic encoding, use western iso-8859-1,
> then, presto, the page appears nicely.  but i don't have to do that to 
> see the original, nor do i have to do that for anybody else's pages, and 
> of course i can't expect our audience to go and fiddle with that in 
> their browsers.
> 
> second question:  why doesn't the meta http-equiv header do anything?
> 
> right after the title the file says <meta http-equiv="Content-Type" 
> content="text/html; charset=iso-8859-1">.  why isn't that good enough? 
> why does it make no difference at all what i change it to?  i tried 
> utf-8, Utf-8, UTF-8, Windows-1252, none have any effect tho i can see 
> them if i tell my browser to view source.

overridden by apache's http headers, apparently.  see below.
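
on the first question, here is a minimal python sketch of why the raw byte counts as "bad" once something declares utf-8. it assumes the body text stores ñ as the entity &ntilde; and the menus store ó as a single iso-8859-1 byte. both are inferences from the symptoms, not checked against the saved files.

```python
# the two spellings, written out by hand for illustration
entity_form = b"a&ntilde;o"                          # pure ascii; the browser expands the entity
raw_latin1 = "Misi\u00f3n".encode("iso-8859-1")      # one byte, 0xF3, for the accented o

# ascii-only text decodes the same way under either charset
same_either_way = entity_form.decode("utf-8") == entity_form.decode("iso-8859-1")

# read as iso-8859-1, the raw byte is the letter that was intended
as_latin1 = raw_latin1.decode("iso-8859-1")

# read as utf-8, 0xF3 followed by "n" is not a valid sequence, so a lenient
# decoder substitutes U+FFFD, the replacement character
as_utf8 = raw_latin1.decode("utf-8", errors="replace")

print(same_either_way, repr(as_latin1), repr(as_utf8))
```

so the entity survives any declared charset, while the raw byte only works when the declared charset matches the one the file was written in.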

> fourth question:  can wget be tweaked to do better?
>
> i think those menus were rendered out of some .asp database or
> whatever, differently than the rest of the text of the page.  but so 
> what?  why didn't wget capture something identical to what my
> browser shows?
> 
> the command i ran was
> wget -ENKkrl19 -nH -w2 -owget.log http://nonviolentpeaceforce.org

my locale is en_IE.UTF-8, so why did wget save in latin-1 format?
the wget manual page mentions nothing at all about character sets,
which suggests it just writes out whatever bytes the server sent.

> well whatever, thunk i, no problem, i'll just find and replace.  well 
> ha.  i haven't yet managed to craft sed to capture the buggers!  it's 
> all making me feel dang defeated..
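
a guess at why sed kept missing: in a utf-8 locale an ó typed at the prompt is two bytes (c3 b3), while the file holds the single latin-1 byte f3, so the pattern never matches. that is an assumption, not something verified on the original files. the sketch below sidesteps the problem by working on bytes and turning every non-ascii latin-1 character into a numeric character reference, which leaves the file pure ascii. the function name is made up for illustration.

```python
def latin1_to_entities(data: bytes) -> bytes:
    """Re-express iso-8859-1 bytes as ascii, using &#NNN; for anything above 0x7F."""
    # decoding as iso-8859-1 cannot fail: every byte value maps to a character
    text = data.decode("iso-8859-1")
    # xmlcharrefreplace swaps each non-ascii character for its numeric reference
    return text.encode("ascii", errors="xmlcharrefreplace")


print(latin1_to_entities(b"Misi\xf3n"))
```

text that is already ascii, named entities included, passes through unchanged. if the pages were really windows-1252 rather than iso-8859-1, bytes in the 0x80-0x9f range would need different handling.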

Brian Foster wrote:
>  there are other alternatives.  e.g., convert the page
>  (file) to UTF-8 (e.g., using iconv(1)), being sure to
>  change the meta charset setting to utf-8.
>
>  finally, vim(1):  vim confuses things here (I am _not_
>  trying to start an editor war!).  vim guesses what the
>  file's charset is, and adjusts accordingly so that you
>  can view/edit it in a locale using a different charset.
>  hence, a lot of things that cat displays as rubbish
>  display Ok in vim.  if, in vim, you do a “:set” command,
>  you'll probably see an entry like “fileencoding=utf-8”.
>  that means vim thinks the file is UTF-8.  (probably,
>  for the ISO-8859-1 files, it says “fileencoding=latin1”;
>  Latin1 is a common informal(?) name for ISO-8859-1.)
>  “:help fileencoding” for more information.

thank you brian!  iconv might well have done the trick, but i used vim.
:se fileencoding showed that the files wget saved were latin-1, and
:se fileencoding=utf-8 followed by a write cleaned up each file.  it
wasn't even a big job once i had a :map so that each file was fixed
with a single keystroke.
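
for anyone who would rather script it, here is a rough python sketch of the same fix, roughly what iconv -f iso-8859-1 -t utf-8 plus an edit of the meta tag would do. it assumes the files are iso-8859-1 unless they already decode as utf-8, and the regex is only written for the one meta tag quoted above. the function names and the "np" directory are illustrative.

```python
import re
from pathlib import Path

# matches the first "charset=..." in the file; assumed to be the meta tag
META_CHARSET = re.compile(rb"(charset=)[-\w]+", re.IGNORECASE)


def to_utf8(data: bytes, assumed: str = "iso-8859-1") -> bytes:
    """Return data re-encoded as utf-8, with the declared charset updated to match."""
    try:
        data.decode("utf-8")            # already valid utf-8 (or plain ascii)
    except UnicodeDecodeError:
        data = data.decode(assumed).encode("utf-8")
    return META_CHARSET.sub(rb"\g<1>utf-8", data, count=1)


def convert_tree(root: str) -> None:
    """Rewrite every .html file under root in place."""
    for path in Path(root).rglob("*.html"):
        path.write_bytes(to_utf8(path.read_bytes()))


# usage, on a backup copy first:
#   convert_tree("np")
```

running it twice is harmless, since a file that already decodes as utf-8 is left alone apart from the charset declaration.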

William A. Rowe, Jr. wrote:
> What happens if you remove the defaultcharset entirely; have Apache
> provide no hinting at the encoding; does the browser respect the meta
> tag?
>
> The http headers are authoritative, and override any metadata.  If you'd
> rather control your encoding with meta tags, turn off charsets entirely.

that is probably the winning answer.  i already applied the above
solution so i dunno for sure, but look..

wget --save-headers from the original m$ .asp server:
   HTTP/1.1 200 OK
   Server: Microsoft-IIS/5.0
   Date: Sat, 20 Aug 2005 21:18:55 GMT
   Connection: keep-alive
   Connection: Keep-Alive
   Content-Length: 11003
   Content-Type: text/html
   Set-Cookie: ASPSESSIONIDAQQBRDRA=KNIPOCMDJPKMMANLNHFMKMGH; path=/
   Cache-control: private

wget --save-headers from my apache server:
   HTTP/1.1 200 OK
   Date: Sun, 21 Aug 2005 04:10:34 GMT
   Server: Apache/2.0.52 (CentOS)
   Last-Modified: Sun, 21 Aug 2005 01:34:43 GMT
   ETag: "260261-2b33-9134b2c0"
   Accept-Ranges: bytes
   Content-Length: 11059
   Connection: close
   Content-Type: text/html; charset=UTF-8
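
to make the difference between those two responses concrete, here is a simplified python model of the precedence william describes: a charset in the http Content-Type header wins, and the meta tag only counts when the header names none. real browsers do more than this (byte-order marks, sniffing), and the fallback value is an assumption for the sketch.

```python
import re
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Pull the charset parameter out of a Content-Type value, or None if absent."""
    match = re.search(r'charset="?([-\w]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(http_content_type: str, meta_content: str,
                      fallback: str = "iso-8859-1") -> str:
    """http header first, then the meta tag, then an assumed fallback."""
    return (charset_from_content_type(http_content_type)
            or charset_from_content_type(meta_content)
            or fallback)


meta = "text/html; charset=iso-8859-1"                       # from the saved page
print(effective_charset("text/html", meta))                  # header as sent by iis
print(effective_charset("text/html; charset=UTF-8", meta))   # header as sent by apache
```

with the iis header the meta tag decides and the page reads as iso-8859-1. with the apache header the meta tag is never consulted, which fits the observation that editing it changed nothing.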

now i wouldn't have thought that the following httpd.conf directive
would result in overriding the meta http-equiv headers, but, there does
seem to be a strong odor..

# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset UTF-8

greg

> Greg Whitley Mott
> IT Coordinator
> NonviolentPeaceforce.org