greg wm wrote:
> i used wget to copy the entire http://nonviolentpeaceforce.org site to
> http://nvpf.org/np. the former is in m$ asp, the latter captured as html.
>
> for example, http://nonviolentpeaceforce.org/spanish/welcome.asp was
> captured to http://nvpf.org/np/spanish/welcome.asp.html
>
> as you can see, the capture is mostly fine, including spanish characters
> in the text (eg año), however the spanish characters in the menus didn't
> do quite so well (eg Misi?n)

fixed! see below.

> in the file año appears as año which is apparently "good", but
> Misi?n appears as Misión, which is apparently "bad".
>
> first question: why is that bad?
>
> if i tell galeon, instead of automatic encoding, use western iso-8859-1,
> then, presto, the page appears nicely. but i don't have to do that to
> see the original, nor do i have to do that for anybody else's pages, and
> of course i can't expect our audience to go and fiddle with that in
> their browsers.
>
> second question: why doesn't the meta http-equiv header do anything?
>
> right after the title the file says <meta http-equiv="Content-Type"
> content="text/html; charset=iso-8859-1">. why isn't that good enough?
> why does it make no difference at all what i change it to? i tried
> utf-8, Utf-8, UTF-8, Windows-1252, none have any effect tho i can see
> them if i tell my browser to view source.

overridden by apache's http headers, apparently. see below.

> fourth question: can wget be tweaked to do better?
>
> i think those menus were rendered out of some .asp database or
> whatever, differently than the rest of the text of the page. but so
> what? why didn't wget capture something identical to what my
> browser shows?
>
> the command i ran was
> wget -ENKkrl19 -nH -w2 -owget.log http://nonviolentpeaceforce.org

my locale is en_IE.UTF-8, so why did wget save in latin-1 format? the
wget manual page mentions nothing at all about character sets.

> well whatever, thunk i, no problem, i'll just find and replace.
> well ha. i haven't yet managed to craft sed to capture the buggers! it's
> all making me feel dang defeated..

Brian Foster wrote:
> there are other alternatives. e.g., convert the page
> (file) to UTF-8 (e.g., using iconv(1)), being sure to
> change the meta charset setting to utf-8.
>
> finally, vim(1): vim confuses things here (I am _not_
> trying to start an editor war!). vim guesses what the
> file's charset is, and adjusts accordingly so that you
> can view/edit it in a locale using a different charset.
> hence, a lot of things that cat displays as rubbish
> display Ok in vim. if, in vim, you do a “:set” command,
> you'll probably see an entry like “fileencoding=utf-8”.
> that means vim thinks the file is UTF-8. (probably,
> for the ISO-8859-1 files, it says “fileencoding=latin1”;
> Latin1 is a common informal(?) name for ISO-8859-1.)
> “:help fileencoding” for more information.

thank you brian! perhaps iconv might have done the trick, anyway i used
vim. vim :se fileencoding revealed that wget saved the files in latin-1,
and :se fileencoding=utf-8 for each file cleaned up the mess. wasn't
even a big job after using :map such that each file was fixed with a
single keystroke.

William A. Rowe, Jr. wrote:
> What happens if you remove the defaultcharset entirely; have Apache
> provide no hinting at the encoding; does the browser respect the meta
> tag?
>
> The http headers are authoritative, and override any metadata. If you
> rather control your encoding with meta tags, turn off charsets entirely.

that is probably the winning answer. i already applied the above
solution so i dunno for sure, but look..
wget --save-headers from the original m$ .asp server:

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 20 Aug 2005 21:18:55 GMT
Connection: keep-alive
Connection: Keep-Alive
Content-Length: 11003
Content-Type: text/html
Set-Cookie: ASPSESSIONIDAQQBRDRA=KNIPOCMDJPKMMANLNHFMKMGH; path=/
Cache-control: private

wget --save-headers from my apache server:

HTTP/1.1 200 OK
Date: Sun, 21 Aug 2005 04:10:34 GMT
Server: Apache/2.0.52 (CentOS)
Last-Modified: Sun, 21 Aug 2005 01:34:43 GMT
ETag: "260261-2b33-9134b2c0"
Accept-Ranges: bytes
Content-Length: 11059
Connection: close
Content-Type: text/html; charset=UTF-8

now i wouldn't have thought that the following httpd.conf directive
would result in overriding the meta http-equiv headers, but, there does
seem to be a strong odor..

# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset UTF-8

greg

> Greg Whitley Mott
> IT Coordinator
> NonviolentPeaceforce.org
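
For anyone finding this thread later, here is a rough Python sketch of the precedence described above: the charset in the HTTP Content-Type header wins, and the meta http-equiv tag is consulted only when the header names no charset. The helper names (header_charset, meta_charset, effective_charset) and the sample page bytes are made up for illustration, and real browsers do more elaborate sniffing than this, so treat it as a simplified model and not as how galeon or Apache actually work.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, None if absent


def meta_charset(html_bytes):
    """Return the charset named inside a <meta ...> tag, or None."""
    match = re.search(rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_-]+)",
                      html_bytes, re.I)
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """HTTP header first; the meta tag only if the header is silent."""
    return header_charset(content_type) or meta_charset(html_bytes)


# Hypothetical sample: latin-1 bytes, where 0xF3 is "ó" in ISO-8859-1.
page = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1">Misi\xf3n')

# Content-Type values copied from the two header dumps above.
iis = effective_charset("text/html", page)
apache = effective_charset("text/html; charset=UTF-8", page)

print(iis, page.decode(iis, errors="replace")[-6:])
print(apache, page.decode(apache, errors="replace")[-6:])
```

With the IIS-style header the meta tag decides and the bytes decode cleanly as iso-8859-1. With the Apache-style header the same bytes get read as UTF-8, where a lone 0xF3 is not a valid sequence, so it turns into a replacement character. That is the "Misi?n" effect, and it is also why editing the meta tag changed nothing while AddDefaultCharset UTF-8 was in force.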