Tuesday, July 22, 2008

Converting HTML E-mail To Plain Text

Posted: Thursday, July 10, 2008 10:34 PM by Simon Hutson

OK, I admit it. I've caught the CRM development bug. What started as a harmless bit of fun working on document library integration between CRM & SharePoint has now developed into an obsession. In this post I will describe how to build a plug-in that examines the body of any e-mail promoted from Outlook or the e-mail router and converts the HTML into plain text.


After a bit of searching, I found a good article, "Convert HTML to Plain Text", which showed how you could use regular expressions to remove unwanted HTML tags, leaving just the plain text. Converting this from C# to VB (my preferred choice of language) and stripping out some of the bits I didn't need, I came up with the following code, which forms the basis of this plug-in.



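' Assumed file-level imports (not shown in the original listing), so that the
' unqualified Replace calls below resolve to Regex.Replace:
Imports System.Text.RegularExpressions
Imports System.Text.RegularExpressions.Regex

Private Function ConvertHTMLToText(ByVal Source As String) As String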
 
    Dim result As String = Source
 
    ' Replace line breaks and form feeds with a space so that they cannot interfere with
    ' the patterns below; replacing them with an empty string would join words that are
    ' separated only by a line break. Repeated spaces are collapsed further down.
    ' \r - Matches a carriage return \u000D.
    ' \n - Matches a line feed \u000A.
    ' \f - Matches a form feed \u000C.
    ' For more details see http://msdn.microsoft.com/en-us/library/4edbef7e.aspx
    result = Replace(result, "[\r\n\f]", " ", RegexOptions.IgnoreCase)
 
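    ' Replace the most commonly used special characters.
    ' Note: &lt; and &gt; are decoded before the tags are stripped further down, so escaped
    ' markup such as &lt;b&gt; in the message text is later removed as if it were a real tag.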
    result = Replace(result, "&lt;", "<", RegexOptions.IgnoreCase)
    result = Replace(result, "&gt;", ">", RegexOptions.IgnoreCase)
    result = Replace(result, "&nbsp;", " ", RegexOptions.IgnoreCase)
    result = Replace(result, "&quot;", """", RegexOptions.IgnoreCase)
    result = Replace(result, "&amp;", "&", RegexOptions.IgnoreCase)
 
    ' Remove ASCII character code sequences such as &#nn; and &#nnn;
    result = Replace(result, "&#[0-9]{2,3};", String.Empty, RegexOptions.IgnoreCase)
 
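    ' Remove all remaining entities, named (&copy;) or numeric (&#8217;, &#x2019;).
    ' The character class keeps the match from swallowing ordinary text that merely
    ' contains an ampersand and a semicolon. More details on special characters:
    ' http://www.degraeve.com/reference/specialcharacters.php
    ' http://www.web-source.net/symbols.htm
    result = Replace(result, "&#?[a-z0-9]{2,8};", String.Empty, RegexOptions.IgnoreCase)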
 
    ' Remove all attributes and whitespace from the <head> tag
    result = Replace(result, "< *head[^>]*>", "<head>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from the </head> tag
    result = Replace(result, "< */ *head *>", "</head>", RegexOptions.IgnoreCase)
    ' Delete everything between the <head> and </head> tags
    result = Replace(result, "<head>.*</head>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Remove all attributes and whitespace from all <script> tags
    result = Replace(result, "< *script[^>]*>", "<script>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from all </script> tags
    result = Replace(result, "< */ *script *>", "</script>", RegexOptions.IgnoreCase)
    ' Delete everything between each <script> and </script> pair
    ' (lazy .*? so that text between two separate script blocks is kept)
    result = Replace(result, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Remove all attributes and whitespace from all <style> tags
    result = Replace(result, "< *style[^>]*>", "<style>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from all </style> tags
    result = Replace(result, "< */ *style *>", "</style>", RegexOptions.IgnoreCase)
    ' Delete everything between each <style> and </style> pair
    ' (lazy .*? so that text between two separate style blocks is kept)
    result = Replace(result, "<style>.*?</style>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Insert tabs in place of <td> tags
    result = Replace(result, "< *td[^>]*>", vbTab, RegexOptions.IgnoreCase)
 
    ' Insert single line breaks in place of <br> and <li> tags
    result = Replace(result, "< *br[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
    result = Replace(result, "< *li[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
 
    ' Insert double line breaks in place of <p>, <div> and <tr> tags
    result = Replace(result, "< *div[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
    result = Replace(result, "< *tr[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
    ' \b stops this pattern from also matching tags that merely start with "p", such as <pre>
    result = Replace(result, "< *p\b[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
 
    ' Remove all remaining HTML tags
    result = Replace(result, "<[^>]*>", String.Empty, RegexOptions.IgnoreCase)
 
    ' Replace repeating spaces with a single space
    result = Replace(result, " +", " ")
 
    ' Remove any trailing spaces and tabs from the end of each line
    result = Replace(result, "[ \t]+\r\n", vbCrLf)
 
    ' Remove any leading whitespace characters
    result = Replace(result, "^[\s]+", String.Empty)
 
    ' Remove any trailing whitespace characters
    result = Replace(result, "[\s]+$", String.Empty)
 
    ' Remove extra line breaks if there are more than two in a row
    result = Replace(result, "\r\n\r\n(\r\n)+", vbCrLf + vbCrLf)
 
    ' That's it.
    Return result
 
End Function
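
Two details of regex-based stripping are easy to get wrong, so here is a small stand-alone sketch of them. It is illustrative only: the module and function names are hypothetical and not part of the plug-in. The first pair of functions compares a lazy `.*?` with a greedy `.*` when deleting paired blocks. The third function shows what happens if tags are stripped before `&lt;` and `&gt;` are decoded, which is a different order from the listing above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexStrippingDemo

    ' Lazy .*? removes each <script>...</script> pair on its own.
    Function RemoveScriptsLazy(ByVal html As String) As String
        Return Regex.Replace(html, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    ' Greedy .* runs from the first <script> to the last </script>.
    Function RemoveScriptsGreedy(ByVal html As String) As String
        Return Regex.Replace(html, "<script>.*</script>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    ' Strips tags first and decodes &lt; / &gt; / &amp; afterwards, so escaped
    ' angle brackets in the text are never mistaken for markup.
    Function StripTagsThenDecode(ByVal html As String) As String
        Dim result As String = Regex.Replace(html, "<[^>]*>", String.Empty)
        result = result.Replace("&lt;", "<").Replace("&gt;", ">")
        Return result.Replace("&amp;", "&") ' decode &amp; last to avoid double decoding
    End Function

    Sub Main()
        Dim html As String = "<script>a()</script>Hello<script>b()</script> world"
        Console.WriteLine(RemoveScriptsLazy(html))   ' prints: Hello world
        Console.WriteLine(RemoveScriptsGreedy(html)) ' prints:  world (the text between the blocks is lost)
        Console.WriteLine(StripTagsThenDecode("<p>a &lt; b and c &gt; d</p>")) ' prints: a < b and c > d
    End Sub

End Module
```

The inputs here contain no line breaks, which matches the state of the text after the first replacement step in the function above; with line breaks present, `.` would stop at each one unless `RegexOptions.Singleline` were added.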

All that remains is to implement the IPlugin.Execute method. In order to be able to modify the e-mail message before the e-mail activity gets created in the database, I had to figure out which event(s) to intercept. Through a bit of trial and error, I observed that any e-mail promoted from Outlook triggers the "DeliverPromote" event, whereas any incoming e-mail handled by the e-mail router triggers the "DeliverIncoming" event. Interestingly enough, the "Create" event was also called as a child pipeline for these events, but modifying the message here didn't have any effect, even in the pre-processing stage.


Because plug-ins have the potential to introduce significant performance and scalability issues into your environment, it is important to ensure that the code is as efficient as possible. To that end, I added checks to ensure that, even if the plug-in is registered on multiple events, the main code only runs if the plug-in:



  1. is running on the 'DeliverPromote' or 'DeliverIncoming' messages

  2. is running synchronously

  3. is running against the 'Email' entity

  4. is running in the 'pre-processing' stage of the pipeline

  5. is running in a 'Parent' pipeline


Public Class ConvertHtmlToText
    Implements IPlugin
 
    Public Sub Execute(ByVal context As IPluginExecutionContext) Implements IPlugin.Execute
 
        ' Exit unless all of the following are true:
        '  1. plug-in is running synchronously (Mode = 0)
        '  2. plug-in is running against the 'Email' entity
        '  3. plug-in is running in the 'pre-processing' stage of the pipeline (Stage = 10)
        '  4. plug-in is running in a 'Parent' pipeline (InvocationSource = 0)
        ' OrElse short-circuits, so later checks are skipped once one fails
        If context.Mode <> 0 OrElse context.PrimaryEntityName <> "email" OrElse context.Stage <> 10 OrElse context.InvocationSource <> 0 Then
            Exit Sub
        End If
 
        If (context.MessageName = "DeliverPromote") OrElse (context.MessageName = "DeliverIncoming") Then
 
            ' Look the body up by name rather than assigning to the collection while enumerating it
            If context.InputParameters.Properties.Contains("Body") Then
                context.InputParameters.Properties.Item("Body") = ConvertHTMLToText(CStr(context.InputParameters.Properties.Item("Body")))
            End If
 
        End If
 
    End Sub
 
End Class
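
The guard conditions in the plug-in above can also be pulled out into a plain function, which makes them easy to check without a CRM server. The sketch below is one way to do that, not part of the original project: it takes ordinary values instead of the execution context, and the module, constant and function names are all hypothetical. The numeric values (0, 10, 0) are the ones used in the listing above.

```vbnet
Imports System

Module PluginGuardDemo

    ' Hypothetical constants mirroring the literal values used in the plug-in.
    Const ModeSynchronous As Integer = 0
    Const StagePreProcessing As Integer = 10
    Const SourceParent As Integer = 0

    ' Returns True only when all five conditions from the list hold.
    Function ShouldConvert(ByVal messageName As String, ByVal mode As Integer, ByVal entityName As String, ByVal stage As Integer, ByVal invocationSource As Integer) As Boolean
        If mode <> ModeSynchronous Then Return False
        If entityName <> "email" Then Return False
        If stage <> StagePreProcessing Then Return False
        If invocationSource <> SourceParent Then Return False
        Return messageName = "DeliverPromote" OrElse messageName = "DeliverIncoming"
    End Function

    Sub Main()
        Console.WriteLine(ShouldConvert("DeliverPromote", 0, "email", 10, 0)) ' prints: True
        Console.WriteLine(ShouldConvert("Create", 0, "email", 10, 1))         ' prints: False
    End Sub

End Module
```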

As always, I have included the source code to my project here. Please do bear in mind that I haven't included any error handling or logging, so it's not production-ready. However, it should provide you with a good head-start.
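
As a starting point for the missing error handling, one option is to fall back to the original body whenever the conversion fails, so that a bad message never blocks delivery. The sketch below is an assumption about how that might look, not code from the project: `ConvertOrKeepOriginal` and `AlwaysFails` are hypothetical names, the converter is passed in as a delegate so the example runs on its own, and the logging step is left as a comment.

```vbnet
Imports System

Module SafeConvertDemo

    ' Hypothetical wrapper: runs the converter and falls back to the
    ' original text if the conversion throws.
    Function ConvertOrKeepOriginal(ByVal body As String, ByVal converter As Func(Of String, String)) As String
        If String.IsNullOrEmpty(body) Then Return body
        Try
            Return converter(body)
        Catch ex As Exception
            ' A real plug-in would log ex here (logging mechanism not shown).
            Return body
        End Try
    End Function

    ' Stand-in converter that always throws, to exercise the fallback path.
    Function AlwaysFails(ByVal s As String) As String
        Throw New InvalidOperationException("simulated failure")
    End Function

    Sub Main()
        Console.WriteLine(ConvertOrKeepOriginal("<b>hi</b>", AddressOf AlwaysFails)) ' prints: <b>hi</b>
        Console.WriteLine(ConvertOrKeepOriginal("abc", Function(s) s.ToUpper()))     ' prints: ABC
    End Sub

End Module
```

In the plug-in itself, the delegate would simply be `AddressOf ConvertHTMLToText`.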


This posting is provided "AS IS" with no warranties, and confers no rights.
