Saturday, October 19, 2013

Encoding and Python: The UnicodeDecodeError exception

UnicodeDecodeError: 'ascii' codec can't decode something in position somewhere: ordinal not in range(128)

It all started with "ASCII" (it's an encoding; things will get clearer later), which was first standardized in 1963. The idea was to represent English text by mapping each character to a small decimal number (read: bytes, and ultimately bits).

So, "1000001" (a binary number, 65 in decimal) corresponds in the ASCII encoding to "A". This "A" is just a "glyph" (the visible mark we draw for the letter A).
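You can verify this mapping for yourself in a Python shell (works the same in Python 2 and 3):

```python
# ASCII maps small integers to glyphs; ord()/chr() expose the mapping.
assert ord('A') == 65            # 'A' is code 65...
assert bin(65) == '0b1000001'    # ...which is 1000001 in binary
assert chr(65) == 'A'            # ...and 65 maps back to the glyph 'A'
```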

Sadly, this way of representing text was nowhere near sufficient for all the characters and symbols in the world. In the good old days, when people couldn't find the characters they wanted, they started creating their own encodings; that's how encodings like Latin-1 came about, and eventually the Unicode encodings UTF-8 and UTF-32. Things got messy when, say, a Chinese user wanted to write Chinese (read: any Chinese script) mixed with Latin text: with a zoo of incompatible encodings, there was no way to represent all possible characters in one string, because not all characters exist in any single legacy encoding.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13144: ordinal not in range(128)
Now, let's understand what this error actually means.
  1. It's an exception, UnicodeDecodeError, that was not caught.
  2. It says that while using the "ascii" codec (read: encoding), Python couldn't decode the byte "0xe2" found at position 13144.
Let's start with understanding what Unicode is. Unicode is a way to identify every glyph with a number, called a code point. It tries to include all characters possible. For example, the "halfwidth katakana middle dot", whose glyph is ・, can be written as the escape \uff65. This way, Unicode tries to represent all the characters and symbols possible in all languages.

So, 'ascii' is an encoding. Old-style str instances use a single 8-bit byte to represent each character of the string using its ASCII code. Python tried to decode a byte using the 'ascii' encoding and failed, because that byte has no meaning in ASCII. But why the hell ASCII? Isn't it old? That's because Python 2's default encoding is "ascii".
➜ 0 /home/shadyabhi [ 8:19PM] % locale
➜ 0 /home/shadyabhi [ 8:19PM] % python2 -c 'import sys; print sys.getdefaultencoding()'
ascii
➜ 0 /home/shadyabhi [ 8:19PM] %
If you want to change the default encoding to utf-8 in Python 2, there is a well-known hack:
import sys
# Set default encoding to 'UTF-8' instead of 'ascii'
# Bad things might happen though
reload(sys)
sys.setdefaultencoding('UTF-8')

This part is fixed in Python 3 by making "str" a Unicode type: a "str" object there is a sequence of Unicode code points.

Now that we understand the exception: to fix it, you need to "decode" the string using the encoding that actually understands it. Decoding with the right encoding ensures that the byte which caused the exception earlier maps to a known character. To encode/decode strings, Python 2 has two methods:
  1.  s.decode("ascii"): converts a str (bytes) object to a unicode object
  2.  u.encode("ascii"): converts a unicode object back to a str (bytes) object
>>> u'・'
u'\uff65'
>>> u'・'.encode('utf-8')
'\xef\xbd\xa5'
>>> '\xef\xbd\xa5'.decode('utf-8')
u'\uff65'
>>> print '\xef\xbd\xa5'.decode('utf-8')
・
>>> '\xef\xbd\xa5'.decode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
I faced the above error when I was trying to parse webpages and extract text from them using the "html2text" module. Since Python 2's default encoding is "ascii", it's stupid to assume that all websites can be represented in the "ascii" encoding.

How do we guess the encoding of text then? Strictly, we can't. Some encodings start with a BOM (byte order mark) that can be used to detect them, but for the rest there is simply no marker. There is a module named chardet that can guess the encoding for you, but I repeat: there is no fully reliable way to guess. While parsing web-pages, though, there is usually a header like:
Content-Type: text/html; charset=utf-8
or the webpage may start with:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

which can be used to get the encoding. Wait, but how do we read the encoded text without knowing the encoding? Luckily, the names of all encodings are themselves representable in that basic "ascii" encoding, so that's not a problem.
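Pulling the charset out of such a header is easy with the stdlib; here's a small sketch using the email module's header parsing (the header value is a made-up example):

```python
from email.message import Message

# Parse the charset parameter out of a Content-Type header value.
# Message.get_content_charset() does the parameter parsing for us.
msg = Message()
msg['Content-Type'] = 'text/html; charset=utf-8'
charset = msg.get_content_charset()
assert charset == 'utf-8'
```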

This information can then be used to decode the page to Unicode in Python.
Once that is done, you can do whatever you want with it. If you need to save it to disk, you need to encode it back, though.

If the webpage is bitchy and sends a false header, you get a UnicodeDecodeError if you're using the "strict" error handler, which is the default. If you still want to decode anyway, use "ignore" or "replace":
page_content.decode(encoding_in_header, 'ignore')
It's good practice to decode a string to Unicode as soon as you receive it from an external source, and to operate on Unicode internally. When you're done with it and want to hand it back or store it somewhere, encode it again. Then why doesn't Python 2 do this for you? Because not all core parts of Python 2 operate on Unicode. This too is fixed in Python 3.
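The whole decode-early, encode-late pattern can be sketched like this (the raw bytes and the header_charset value are made up for illustration; in real code the charset comes from the header or meta tag):

```python
# Raw bytes as received from the network; assume the header declared utf-8.
raw = b'half-width middle dot: \xef\xbd\xa5'
header_charset = 'utf-8'   # normally taken from Content-Type (assumption here)

# Decode at the boundary; 'replace' avoids UnicodeDecodeError if the header lied.
text = raw.decode(header_charset, 'replace')

# ...operate on the unicode text here...

# Encode again only when writing back out to disk or the network.
out = text.encode('utf-8')
assert out == raw
```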

I hope this gives a little idea of what Unicode is and how to handle different encodings in your code.

Further Reading:

Google Bookmarks shortcut in pentadactyl

Till now I used the Shareaholic extension to add bookmarks to Google Bookmarks for websites.

This worked great, but I had to install an addon just for this one feature. I like minimalistic design, so I have no menubar, bookmarks bar, location bar, etc.; it's just the webpage. That's the very reason I use pentadactyl in my Firefox. So the very thought of adding an addon bugs me.

Just today, I figured out that I can map javascript functions to shortcuts. So here is a little thing you can add to your `.pentadactylrc` to add bookmarks just by pressing a shortcut.
map -modes=n z -javascript (function(){var a=window,b=content.document,c=encodeURIComponent,d=a.open("http://www.google.com/bookmarks/mark?op=edit&output=popup&bkmk="+c(b.location)+"&title="+c(b.title),"bkmk_popup","left="+((a.screenX||a.screenLeft)+10)+",top="+((a.screenY||a.screenTop)+10)+",height=510px,width=550px,resizable=1,alwaysRaised=1");a.setTimeout(function(){d.focus()},300)})();
Notice the "content" in variable "b"; that's because if I just use "document.location", I get the value "chrome://browser/content/browser.xul".
Now, you can press "z" and add current location to Google Bookmarks.

Wednesday, July 31, 2013

Error while starting nfsiostat: ValueError: invalid literal for long() with base 10: 'device'

Experienced this on CentOS6 with version `nfs-utils-1.2.3-36.el6.x86_64`.
abhijeet.ras@box ~  [ 6:03:57] 
$ sudo nfsiostat 
Traceback (most recent call last):
  File "/usr/sbin/nfsiostat", line ..., in <module>
    iostat_command(prog)
  File "/usr/sbin/nfsiostat", line 587, in iostat_command
    devices = list_nfs_mounts(origdevices, mountstats)
  File "/usr/sbin/nfsiostat", line 490, in list_nfs_mounts
  File "/usr/sbin/nfsiostat", line 179, in parse_stats
  File "/usr/sbin/nfsiostat", line 163, in __parse_rpc_line
    self.__rpc_data[op] = [long(word) for word in words[1:]]
ValueError: invalid literal for long() with base 10: 'device'
abhijeet.ras@box ~ 1 [ 6:04:21] 
Going through the code, I noticed that it wasn't able to parse the file /proc/self/mountstats correctly. For now, I've patched the parsing loop to skip such lines:
if line.startswith("no device mounted"):
    continue
Have sent a mail to the ML for further discussion.
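For reference, the lines that trip up the parser come from autofs mounts; in /proc/self/mountstats they look something like this (the paths will differ per system):

```
no device mounted on /net with fstype autofs
device sunrpc mounted on /var/lib/nfs/rpc_pipefs with fstype rpc_pipefs
```

Ordinary entries start with "device", which is what nfsiostat's parser expects; the "no device mounted" lines break that assumption and produce the ValueError above.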

Sunday, June 23, 2013

"Error: Time mode requires the flot.time plugin." in Kibana3

This error is very cryptic and hides what the real error is.

In my case, it was an issue of Cross-Origin Resource Sharing (CORS). In plain English: AJAX requests can only go to the same domain the page was served from (for security reasons), unless the server is configured to allow otherwise.

You can confirm this by running chromium as:
chromium --disable-web-security
The solution is to just use the same domain in both config.js and the place where Kibana3 is hosted. Easy ehh? Not the error, though.

Friday, May 17, 2013

IRSSI: nick already in use error after a disconnect or network issue

This error pretty much annoyed the shit out of me: a temporary disconnect resulted in this error for a few minutes on freenode, and irssi would keep trying a different nick.

This can be solved by not authenticating with
/msg NickServ identify mypass
If SASL is used for authentication instead, the freenode servers know beforehand what my nick is and the error never comes up. The how-to guide for it can be seen here:
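With the cap_sasl.pl script (what freenode recommended for irssi at the time), the setup looks roughly like this; the network name, nick, and password are placeholders:

```
/script load cap_sasl.pl
/sasl set freenode mynick mypass PLAIN
/sasl save
/save
```

After a reconnect, irssi then authenticates during the connection handshake, before the nick is ever registered with the server.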

Another thing you might want to configure is:

This is one of those issues I was too lazy to search about until now.