Convert a PostgreSQL Database from LATIN1 to UTF8

I had a problem with my family django-powered website. I have an aggregator for all our friends blogs, very similar to the Django aggregator, except that mine hasn’t been aggregating. After some serious investigation, I found that psycopg was barfing error messages because the feeds that were being stuffed into my postgresql database contained utf-8 characters that it couldn’t figure out how to convert into latin1 characters. I hadn’t paid any attention to this, but apparently my blog database was using latin1. Lame. I decided it was time to learn how to convert it to utf-8.

First, I did a pg_dumpall in order to get a look at things (this happens automatically every night, actually):

/usr/bin/pg_dumpall -U <special user> > /home/bryan/backups/all.dbs.out

Sure enough, my database was created like so:

CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'LATIN1';

So, I dumped just that database:

pg_dump -U <special user> bryan > bryan.dbs.out

Then I converted it to utf-8 with iconv:

iconv --from-code latin1 --to-code utf-8 bryan.dbs.out > bryan.dbs.out.utf8

Then I dropped the lame old latin1 database, after shutting down apache2:

sudo /etc/init.d/apache2 stop
dropdb bryan

Then I created the shiny new utf-8 database, using the command from the pg_dumpall output, with a slight change:

psql -U <special user> bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'UTF8';"

Then I restored the converted backup:

psql -U <special user> bryan < bryan.dbs.out.utf8

After starting up apache2 again:

sudo /etc/init.d/apache2 start

The website looked just like it did before. Phew!

Next I ran my update_feeds.py script and my aggregator aggregated, which was really awesome to see. I have some serious reading to catch up on now.

References:

  1. http://archives.postgresql.org/pgsql-general/2006-03/msg01259.php
  2. http://www.postgresql.org/docs/8.2/interactive/backup-dump.html
  3. http://www.postgresql.org/docs/8.2/interactive/manage-ag-dropdb.html
  4. http://www.postgresql.org/docs/8.2/interactive/manage-ag-createdb.html

Comments

Anonymous said…

psql -U ... bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'LATIN1';"


should be


psql -U ... bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'LATIN1';"


Also, you don't need to convert the sql dump to UTF-8. Postgresql knows what encoding it currently working on, so if you restore a latin1 encoded database into a utf-8 database it will automatically do the right thing.

In other words, with all clients disconnected:

1. pg_dump yourdb > yourdb.sql
2. dropdb yourdb
3. createdb -E utf-8 yourdb
4. psql yourdb < yourdb.sql
Anonymous said…

psql -U ... bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'UTF-8';"


forgot to fix it.. :/
Bryan said…
Good catch! Fortunately I only made that mistake in the blog, not on the command line.
Unknown said…
I had the same issue with my DB. Following your instructions I managed to convert the encoding quickly.

In addition I executed the following to convert my htmls to UTF-8:

ksh# for i in *
> do iconv -f CP1251 -t UTF-8 $i | tee $i
> done

and

ksh# for i in *
>do cat $i | sed s/windows-1251/utf-8/ | tee $i
>done


10x a lot
Anonymous said…
Thank you for the fix
Anonymous said…
Hi guys, I am having a similar problem with a postgresql database which was created with latin1 encoding.
I am about to convert the database to utf-8 but I don't want to loose the data already stored in the database.
Can I try the following and still have my website intact? Please advice
1. pg_dump yourdb > yourdb.psql
2. dropdb yourdb
3. createdb -E utf-8 yourdb
4. psql yourdb < yourdb.psql
Bryan said…
You won't lose any data in the database. pg_dump dumps *everything*. You can examine the file produced by the pg_dump command with a text editor and see what is in there if you aren't sure you trust it. You can (and probably should) also do all this on a test database first, just to make sure.

Popular posts from this blog

SystemVerilog Fork Disable "Gotchas"

'git revert' Is Not Equivalent To 'svn revert'

Put /etc Under Revision Control (with git)