Wednesday, February 22, 2012

Unicode and encoding: Python vs Java shootout, part 1

Before going on with this post, be sure you've read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - otherwise, I won't be able to solve your issues.

Sometimes Java fans just say that Python Unicode support sucks. While there's a bit of truth in that assertion - for Python 2.x at least, since 3.x solved the problem at its root - the real problem lies in the fact that many programmers don't know what's going on "under the hood", and that Python's default behaviour is a bit unforgiving.

First things first: Python 2.x has two distinct string types: the so-called "byte strings" (str type) and unicode objects (unicode type). Unicode objects are much like Java strings; they're an internal abstraction of Python, and need to be encoded to byte strings, or decoded from them, whenever printing, reading from or writing to a file, etc.

Sometimes such conversion "just works" (most probably if you're dealing with ASCII-only text), but when it fails, errors occur: Python's infamous and dreaded UnicodeDecodeError can happen in many places and may well puzzle the programmer:

# -*- coding: utf-8 -*-
import sys

print "current default encoding: " + sys.getdefaultencoding()

"àèìòù" + u"asd"

javapythonunicode$ python unicode_concat.py
current default encoding: ascii
Traceback (most recent call last):
  File "unicode_concat.py", line 6, in <module>
  "àèìòù" + u"asd"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

This happens when trying to concatenate a unicode object with a byte string (str) object. What really happens here, behind the scenes, is shown in unicode_concat_explained.py:

# -*- coding: utf-8 -*-
import sys

print "current default encoding: " + sys.getdefaultencoding()

"àèìòù".decode(sys.getdefaultencoding()) + u"asd"

When a byte string (not a unicode object) is concatenated with a unicode object, Python first tries to convert the byte string to unicode. Since no encoding was specified, Python just uses the default one, which is ASCII here.
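The implicit conversion can also be reproduced in isolation. What follows is a minimal sketch, not one of the original scripts: it spells the accented characters out as an explicit byte literal so that it behaves the same on Python 2.6+ and Python 3.3+, and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the names below are not from the original scripts.

# UTF-8 bytes for "àèìòù", as they sit in a byte string from a UTF-8 source file.
raw = b"\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9"

# What the implicit conversion amounts to: decoding with the ASCII default.
try:
    raw.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# The explicit alternative: decode with the encoding the bytes really use.
text = raw.decode("utf-8") + u"asd"

print(implicit_ok)  # False
print(len(text))    # 8
```

Decoding explicitly at the boundary, as in the last step, keeps the rest of the program working with unicode objects only.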

So, if you're planning to mix unicode and byte string objects (which is not a good idea), always remember to explicitly convert byte strings to unicode with their decode() method. Or, if you're sure the encoding is always the same, you could add a sitecustomize.py to your interpreter or to your project's PYTHONPATH, and set the default encoding there. Beware that setting it system-wide may lead to unexpected results when running your project on another machine.

Also, remember that the coding directive at the beginning of the file does not change unicode encoding/decoding behaviour: it's a directive to the parser, telling it which encoding the source file itself uses. That information is neither retained nor used at runtime.


Let's go for another much dreaded error now: UnicodeEncodeError!

# -*- coding: utf-8 -*-
import sys
print "current stdout encoding: " + str(sys.stdout.encoding)
print "current default encoding: " + sys.getdefaultencoding()

print u"àèìòù"

results in this output:

javapythonunicode$ python unicode_print.py
current stdout encoding: UTF-8
àèìòù

Everything seems to work (as long as your terminal supports a charset, such as UTF-8 or ISO-8859-1, which can display accented chars)! But now let's go for some "black magic":


That may be puzzling, but the sys.stdout printout should give you a hint.

Whenever output goes to a terminal, Python autodetects the terminal's encoding and sets sys.stdout.encoding accordingly; any unicode object that gets printed is then encoded with it. When output does NOT go to a terminal (for example when it is piped or redirected to a file), Python does not autodetect anything: sys.stdout has no encoding set, and any unicode object that is printed is converted according to sys.getdefaultencoding().

So, what really happens here is:

# -*- coding: utf-8 -*-
import sys
print "current stdout encoding: " + str(sys.stdout.encoding)
print "current default encoding: " + sys.getdefaultencoding()

sys.stdout.write(
    u"àèìòù".encode(sys.stdout.encoding or sys.getdefaultencoding())
                )

Since accented characters can be encoded to UTF-8 but not to ASCII, such UnicodeEncodeErrors arise.
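If a script has to survive being piped, one option is to do the encoding step yourself instead of leaving it to print. The sketch below is only an illustration under assumptions of mine: encode_for_stream and FakePipe are hypothetical names, and falling back to UTF-8 is my choice, not something Python does on its own. FakePipe stands in for a redirected Python 2 sys.stdout, whose encoding attribute is not set.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; encode_for_stream and FakePipe are hypothetical names.


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it has none."""
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)


class FakePipe(object):
    """Stands in for sys.stdout when output is redirected: no encoding set."""
    encoding = None


accented = u"\xe0\xe8\xec\xf2\xf9"  # àèìòù

# With an explicit UTF-8 fallback the text can always be encoded.
as_utf8 = encode_for_stream(accented, FakePipe())

# With an ASCII fallback (Python 2's default encoding) the encode step fails.
try:
    encode_for_stream(accented, FakePipe(), fallback="ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(as_utf8))
print(ascii_ok)  # False
```

The resulting byte string can then be handed to sys.stdout.write() on Python 2 without any further implicit conversion.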

Also beware of bug 4947 - it may hit you if you're using Python 2.6 or older.

In part 2, which will be published within a few days, we'll explore the Java approach to Unicode and encodings, why you probably won't hit such hard errors, how the Java approach has its own share of issues, and what you can do to make Python behave more like Java, if you really want to.


Wednesday, December 7, 2011

Jetbrains IDEs: dock-aware launcher for Linux

UPDATE:
Most of this is pointless on IDEA >= 11 and PyCharm >= 2, since a Tools -> Create desktop entry menu now exists and will let you do everything my launcher did. You'll need to recreate the entry at every update - or just use the symlink trick I recommended with the standalone launcher and edit the IDEA desktop entry.

I had already elaborated on this, and I still thank Dutchworks for their work, but I was not yet satisfied with the state of dock awareness for Jetbrains IDEs. I tend to use several of them at once, and I was highly dissatisfied with the way the dock grouped all my Java apps. That's caused by the fact that many Java apps based on Swing/AWT share the same WM_CLASS attribute, which the dock uses to tell one window from another; you can read more on that on the Dutchworks blog page above.

The basic problem is:


Here I have launched IDEA and PyCharm at the same time, and docky groups them under the very same icon - you need to right-click the icon and pick the proper subwindow, which is quite a hassle.

My previous solution required updating the launcher and copying agent.jar at every update.

So what?

I just created a standalone launcher which creates a proper .desktop file in your .local/share/applications directory, does not modify original files and still allows IDEs to have their own distinct icon in the dock:



There I have both IDEA and PyCharm active at the same time, and they both work properly. Whenever you update an IDE, you can just update the symlink (as suggested in the README from the GitHub repo) and you're done!

Multiple projects in the same IDE still share the same icon, but that doesn't seem to be changeable from the outside; it's something that Jetbrains should do, and I'll open a ticket for that.
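To give an idea of what such a per-user desktop entry looks like, here is a rough sketch. It is not the launcher from the GitHub repo: the function names, paths and the WM_CLASS value are all hypothetical, and I'm assuming the dock honours the StartupWMClass key of the freedesktop.org desktop entry format when matching windows to entries.

```python
# Illustrative sketch; every name, path and value below is hypothetical.
import os


def render_desktop_entry(name, exec_path, icon_path, wm_class):
    """Return the text of a freedesktop.org desktop entry for one IDE."""
    lines = [
        "[Desktop Entry]",
        "Type=Application",
        "Name=%s" % name,
        "Exec=%s" % exec_path,
        "Icon=%s" % icon_path,
        # Docks can use this key to match windows to this entry.
        "StartupWMClass=%s" % wm_class,
    ]
    return "\n".join(lines) + "\n"


def desktop_entry_path(app_id, home=None):
    """Per-user location: <home>/.local/share/applications/<app_id>.desktop"""
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".local", "share", "applications",
                        app_id + ".desktop")


entry = render_desktop_entry(
    name="PyCharm",
    exec_path="/opt/pycharm/bin/pycharm.sh",
    icon_path="/opt/pycharm/bin/pycharm.png",
    wm_class="jetbrains-pycharm",
)
print(desktop_entry_path("pycharm", home="/home/user"))
print(entry)
```

The sketch only renders the text; writing it to the computed path is left out on purpose, so that running it does not touch your home directory.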

Enjoy!


Thursday, October 20, 2011

Deploy any Java webapp via war to Heroku

Heroku Java support is something I longed for, but they apparently don't support deploying a plain, old war file to Heroku.

I don't like that; I've got my own continuous integration setup, and I want to compile my files on my own box - this allows me to keep my own Maven proxy with my own dependencies - which sometimes are patched versions of public ones - so that I don't need to push everything to a public repo. I don't want to be forced to build everything on Heroku, and I don't want to deploy every build I make - I'd usually push only stable versions to Heroku, or maybe integration versions that pass unit, integration and acceptance tests.

So what? Don't worry, deployment is possible.



Set up a Heroku-aware git repository and a Cedar stack app, and add the files you see above; your dir structure should look like:

pom.xml
Procfile
src/
     main/
             java/
                    root/
                           StartWebApp.java
             resources/
                             mywebapp.war


Now a simple

git push heroku master


should do the trick, and let you see your fully deployed war app.


There are still things you need to be aware of: file persistence may not work as expected, since the filesystem is ephemeral, and I still need to figure out things related to database configuration, but those should not take much effort.

I'll post updates in the future as soon as I get them right.



Tuesday, August 30, 2011

Mock Javamail primer

A small primer for mock javamail, a pretty useful project that comes in handy whenever you need to test code that uses POP3/IMAP/SMTP and you'd rather not build wrappers around everything you need to test. Javamail does not really offer interfaces that can be implemented by mock objects, but different implementations can be used at runtime depending on which jars are on the classpath.

Take a look at the project homepage:

http://java.net/projects/mock-javamail

Be sure you've enabled the java.net Maven repo, and add this dependency to your pom.xml:

<dependency>
    <groupId>org.jvnet.mock-javamail</groupId>
    <artifactId>mock-javamail</artifactId>
    <version>1.9</version>
    <scope>test</scope>
</dependency>

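You're almost ready; now set up your test code this way:

import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.MimeMessage;

import org.junit.After;
import org.junit.Before;
import org.jvnet.mock_javamail.Mailbox;

public class IMAPMailRepositoryTest {
    @Before
    public void setUp() throws Exception {
        final Session session = Session.getInstance(
                System.getProperties());

        // Sending through the mock Transport delivers the message
        // straight into the in-memory mailbox of the recipient.
        MimeMessage msg = new MimeMessage(
                session);
        msg.setRecipients(Message.RecipientType.TO,
                "testuser@mockserver.com");
        msg.setSubject("Some Subject");
        msg.setText("sometext");
        Transport.send(msg);
    }

    @After
    public void tearDown() throws Exception {
        // Empty all mock mailboxes so tests don't see each other's mail.
        Mailbox.clearAll();
    }
}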



Now you're ready to use your mock javamail:

import java.io.IOException;
import java.util.Properties;

import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.Store;

// Assumption: the logger is log4j 1.x, judging from the calls used below.
import org.apache.log4j.Logger;

public class SomeImapClient {

    private static Logger log = Logger.getLogger(SomeImapClient.class);

    public void processMail() {
        try {
            Session session = getMailSession();
            Store store = connect(session);
            Folder folder = openMailFolder(store);
            findContent(folder);
        } catch (MessagingException e) {
            throw new RuntimeException(e);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public Session getMailSession() {
        Properties props = System.getProperties();
        props.setProperty("mail.store.protocol", "imaps");
        props.setProperty("mail.imap.partialfetch", "0");

        log.debug("Getting session");
        return Session.getDefaultInstance(props, null);
    }

    public Store connect(Session session) throws MessagingException {
        log.debug("getting the session for accessing email.");
        // Ask for the plain "imap" store explicitly: the mock does not cover "imaps".
        Store store = session.getStore("imap");

        // Host and user must match the recipient address used in the test setup.
        store.connect("mockserver.com", "testuser", "somepassword");
        log.debug("Connection established with IMAP server.");
        return store;
    }

    public Folder openMailFolder(Store store) throws MessagingException {
        Folder folder = store.getDefaultFolder();
        folder = folder.getFolder("inbox");
        folder.open(Folder.READ_ONLY);
        return folder;
    }

    public void findContent(Folder folder) throws MessagingException, IOException {
        // Log the subject of every message found in the folder.
        for (Message m : folder.getMessages()) {
            log.debug(m.getSubject());
        }
    }
}


Output is:

DEBUG 30 Aug 2011 19:25:25 SomeImapClient:32 - Getting session
DEBUG 30 Aug 2011 19:25:25 SomeImapClient:38 - getting the session for accessing email.
DEBUG 30 Aug 2011 19:25:25 SomeImapClient:42 - Connection established with IMAP server.
DEBUG 30 Aug 2011 19:25:25 SomeImapClient:55 - Some Subject


Some small gotchas:

  • Your store.connect("example.com", "username", "anything") call must match the "username@example.com" addressee in msg.setRecipients; different addressees will yield multiple mailboxes.
  • It only works for plain imap & pop3 mailboxes. Even though it doesn't make a lot of sense to use SSL on a mock mailbox, if you try getting the "imaps" store you'll get an error, as a real implementation of javamail will be employed.
  • I'd always advise using such a tearDown() when doing unit tests, since the mailbox state seems to be global and different tests could interfere with each other's contents - unless you use a different fake user/test server url in each of your tests.



Tuesday, April 26, 2011

Unit testing with Twisted: testing protocols

I had a hard time testing Twisted protocols. Although they should be decoupled from factories, most examples I could find, including those from the official docs, were just too noisy and suggested using a factory.

I don't like that approach - I should not be forced to use a factory, since that creates an unnecessary coupling. Let's just pretend a factory existed and created a connection; that's all our protocol should need.

Of course sometimes a protocol is bound to a very specific factory since it needs to invoke methods on it for non protocol-specific data (remember, there's just one factory, while there's one protocol per connection). I don't like that behaviour, since it creates an unnecessary coupling and makes factories huge and not cohesive, and I think protocols should just be injected whatever peers they need during the buildProtocol() invocation - but I'll talk about that in another post. In the meantime, I recommend just creating a mock factory and setting it manually on the protocol.

So, here is the solution I came up with:
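As a rough illustration of that last suggestion, here is a sketch that runs with the standard library alone. It is not the code from this post: GreeterProtocol and FakeTransport are hypothetical stand-ins that only mimic the method names a Twisted protocol and transport expose (makeConnection, connectionMade, dataReceived, write), and the greeting() method on the factory is made up. With Twisted installed you would subclass twisted.internet.protocol.Protocol instead, and could use twisted.test.proto_helpers.StringTransport as the transport.

```python
# Illustrative sketch; the classes below are stand-ins, not Twisted's own.
from unittest import mock


class FakeTransport(object):
    """Records written bytes, like an in-memory transport would."""

    def __init__(self):
        self.written = []

    def write(self, data):
        self.written.append(data)

    def value(self):
        return b"".join(self.written)


class GreeterProtocol(object):
    """Hypothetical protocol: asks its factory for a greeting, then echoes."""

    factory = None
    transport = None

    def makeConnection(self, transport):
        self.transport = transport
        self.connectionMade()

    def connectionMade(self):
        self.transport.write(self.factory.greeting())

    def dataReceived(self, data):
        self.transport.write(data)


# No real factory and no reactor: set a mock factory on the protocol by hand.
protocol = GreeterProtocol()
protocol.factory = mock.Mock()
protocol.factory.greeting.return_value = b"hello\r\n"

transport = FakeTransport()
protocol.makeConnection(transport)
protocol.dataReceived(b"ping")

print(transport.value())
```

The test then only asserts on what ended up in the transport and on which factory methods were called, so nothing in it depends on a concrete factory class.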



What do you think about this solution?

Friday, April 22, 2011

Building a successful community-driven software project / part 4

This is a multipart post. See part 3

Just a brief conclusion to my post sequence; the three principles for succeeding at community-driven projects are:


  • Make it easy for the people to give you their feedback - it's highly valuable.
  • Listen and reply to people's feedback. Don't do every-possible-requested-thing, but reply on time.
  • Don't take any policy as a dogma. Everything can be changed.

Monday, March 14, 2011

Using trial as test runner with zc.buildout

I've found a good way to launch tests using Twisted Trial in a zc.buildout enabled project. I was afraid I'd need a full-fledged recipe - which I had neither the time nor the will to code.

Luckily, zc.recipe.egg seems good enough to satisfy my needs; I wanted a recipe that could launch all unit tests if no parameter was passed, while retaining the ability to choose which specific unit test was launched on command line.


Here it is:



That's filled in for pydenji