Strip HTML from cell

doeleman · September 12, 2018, 2:44pm

Is there a function to strip all HTML from a cell?
I can write the function as a regexp, but it would be easier if there was a native function.

diskborste · September 12, 2018, 5:15pm

That is a good suggestion. Instead of something like =regexpreplace(str,"<.*?>","") the formula would be =striphtml(str)?

WolfeDen · September 13, 2018, 7:48pm

Hi, @doeleman. Here's a function I wrote a while back that should do what you're asking.

There are three optional parameters:

decodeHtml : [Boolean, default=True] optional flag to decode HTML entities such as & to &
removeInlineCode : [Boolean, default=True] optional flag to discard inline scripts code
delim : [String, default=" "] optional delimiter to replace HTML tags (to avoid run-on text)

Here's the UDF (sorry about the weird formatting & color-coding - it's automatic):

Public Function StripHTML(ByRef source As Variant, Optional ByVal decodeHtml As Boolean = True, _
    Optional ByVal removeInlineCode As Boolean = True, Optional ByVal delim As String = " ") As Variant
'
'   source           : [Required:Variant] Text String, Range or Array containing html tags
'   decodeHtml       : [Optional:Boolean] optional flag to Decode HTML Entities [default=True]
'   removeInlineCode : [Optional:Boolean] optional flag to discard inline scripts code [default=True]
'   delim            : [Optional:String]  optional Delimiter to replace HTML tags [default=" "]
'
' StripHTML() removes all HTML tags from source string as scraped from a URL.
'   This is useful for SEO analysis such as spamminess, sentiment, readability, etc.
'   The optional parameter will also remove inline code & text within <scrip>, <style> and comment tags.
'
' Author: Tim Wolfe, SEO By The Book

On Error Resume Next

  ' Setup variables / environment:
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim newString As Variant
    Dim i As Integer
    Dim tmpStr As String

    Set objRegex = New VBScript_RegExp_55.RegExp
    With objRegex
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
    End With

    Select Case TypeName(source)
        Case "String"
            ReDim newString(0)
            newString = Application.Transpose(newString)
            newString(1) = source
        Case "Range"
            ReDim newString(0)
            newString = Application.Transpose(newString)
            newString(1) = source.Value
        Case "Variant()"
            ReDim newString(UBound(source))
            newString = Application.Transpose(source)
    End Select

    For i = 1 To UBound(newString)
      ' Remove code & text from <script>, <style> & <!--comment--> tags:
        If removeInlineCode Then
            objRegex.Pattern = "(?:<(?:script[\S\t\n\v\f\r ]*?</script)>|<(?:style[\S\t\n\v\f\r ]*?</style)>|<(?:!--[\S\t\n\v\f\r ]*?--)>)+[\t\n\v\f\r ]*"
            newString(i) = Trim(objRegex.Replace(newString(i), ""))
        End If

      ' Replace <br>'s & <p>'s with a LF:
        objRegex.Pattern = "(<(?:br|p)[^>]*>\s*)+"
        newString(i) = objRegex.Replace(newString(i), Chr(10))

      ' Replace consecutive HTML tags with var delim:
        objRegex.Pattern = "(?:(?:<[^>]+>)+[\t\n\v\f\r ]*)+"
        newString(i) = objRegex.Replace(newString(i), delim)

      ' Remove all leading & trailing whitespace & delimiters:
        objRegex.Pattern = "^[\t" & delim & "]+|[\t" & delim & "]+$"
        newString(i) = objRegex.Replace(newString(i), "")
        tmpStr = newString(i)
'        If decodeHtml Then newString(i) = HTML_Decode(newString(i))
        If decodeHtml Then newString(i) = HTML_Decode(tmpStr)
    Next

    If TypeName(source) = "Variant()" Then
        StripHTML = newString
    Else
        StripHTML = newString(1)
    End If

    Set objRegex = Nothing
End Function

Hope this helps!

-Tim W.