Convert a PostgreSQL Database from LATIN1 to UTF8
I had a problem with my family django-powered website. I have an aggregator for all our friends blogs, very similar to the Django aggregator, except that mine hasn’t been aggregating. After some serious investigation, I found that psycopg was barfing error messages because the feeds that were being stuffed into my postgresql database contained utf-8 characters that it couldn’t figure out how to convert into latin1 characters. I hadn’t paid any attention to this, but apparently my blog database was using latin1. Lame. I decided it was time to learn how to convert it to utf-8.
First, I did a pg_dumpall in order to get a look at things (this happens automatically every night, actually):
/usr/bin/pg_dumpall -U <special user> > /home/bryan/backups/all.dbs.out
Sure enough, my database was created like so:
CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'LATIN1';
So, I dumped just that database:
pg_dump -U <special user> bryan > bryan.dbs.out
Then I converted it to utf-8 with iconv:
iconv --from-code latin1 --to-code utf-8 bryan.dbs.out > bryan.dbs.out.utf8
Then I dropped the lame old latin1 database, after shutting down apache2:
sudo /etc/init.d/apache2 stop
dropdb bryan
Then I created the shiny new utf-8 database, using the command from the pg_dumpall output, with a slight change:
psql -U <special user> bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'UTF8';"
Then I restored the converted backup:
psql -U <special user> bryan < bryan.dbs.out.utf8
After starting up apache2 again:
sudo /etc/init.d/apache2 start
The website looked just like it did before. Phew!
Next I ran my update_feeds.py script and my aggregator aggregated, which was really awesome to see. I have some serious reading to catch up on now.
References:
Comments
psql -U ... bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'LATIN1';"
should be
psql -U ... bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'LATIN1';"
Also, you don't need to convert the sql dump to UTF-8. Postgresql knows what encoding it currently working on, so if you restore a latin1 encoded database into a utf-8 database it will automatically do the right thing.
In other words, with all clients disconnected:
1. pg_dump yourdb > yourdb.sql
2. dropdb yourdb
3. createdb -E utf-8 yourdb
4. psql yourdb < yourdb.sql
psql -U ... bryan < "CREATE DATABASE bryan WITH TEMPLATE = template0 OWNER = bryan ENCODING = 'UTF-8';"
forgot to fix it.. :/
In addition I executed the following to convert my htmls to UTF-8:
ksh# for i in *
> do iconv -f CP1251 -t UTF-8 $i | tee $i
> done
and
ksh# for i in *
>do cat $i | sed s/windows-1251/utf-8/ | tee $i
>done
10x a lot
I am about to convert the database to utf-8 but I don't want to loose the data already stored in the database.
Can I try the following and still have my website intact? Please advice
1. pg_dump yourdb > yourdb.psql
2. dropdb yourdb
3. createdb -E utf-8 yourdb
4. psql yourdb < yourdb.psql