Is there a function to strip all HTML from a cell?
I can write the function as a regexp, but it would be easier if there was a native function.
That is a good suggestion. Instead of something like =regexpreplace(str,"<.*?>","") the formula would be =striphtml(str)?
Hi, @doeleman. Here's a function I wrote a while back that should do what you're asking.
There are three optional parameters:
decodeHtml
: [Boolean, default=True] optional flag to decode HTML entities such as&
to&
removeInlineCode
: [Boolean, default=True] optional flag to discard inline scripts code
delim
: [String, default=" "] optional delimiter to replace HTML tags (to avoid run-on text)
Here's the UDF (sorry about the weird formatting & color-coding - it's automatic):
Public Function StripHTML(ByRef source As Variant, Optional ByVal decodeHtml As Boolean = True, _
Optional ByVal removeInlineCode As Boolean = True, Optional ByVal delim As String = " ") As Variant
'
' source : [Required:Variant] Text String, Range or Array containing html tags
' decodeHtml : [Optional:Boolean] optional flag to Decode HTML Entities [default=True]
' removeInlineCode : [Optional:Boolean] optional flag to discard inline scripts code [default=True]
' delim : [Optional:String] optional Delimiter to replace HTML tags [default=" "]
'
' StripHTML() removes all HTML tags from source string as scraped from a URL.
' This is useful for SEO analysis such as spamminess, sentiment, readability, etc.
' The optional parameter will also remove inline code & text within <scrip>, <style> and comment tags.
'
' Author: Tim Wolfe, SEO By The Book
On Error Resume Next
' Setup variables / environment:
Dim objRegex As VBScript_RegExp_55.RegExp
Dim newString As Variant
Dim i As Integer
Dim tmpStr As String
Set objRegex = New VBScript_RegExp_55.RegExp
With objRegex
.Global = True
.IgnoreCase = True
.MultiLine = True
End With
Select Case TypeName(source)
Case "String"
ReDim newString(0)
newString = Application.Transpose(newString)
newString(1) = source
Case "Range"
ReDim newString(0)
newString = Application.Transpose(newString)
newString(1) = source.Value
Case "Variant()"
ReDim newString(UBound(source))
newString = Application.Transpose(source)
End Select
For i = 1 To UBound(newString)
' Remove code & text from <script>, <style> & <!--comment--> tags:
If removeInlineCode Then
objRegex.Pattern = "(?:<(?:script[\S\t\n\v\f\r ]*?</script)>|<(?:style[\S\t\n\v\f\r ]*?</style)>|<(?:!--[\S\t\n\v\f\r ]*?--)>)+[\t\n\v\f\r ]*"
newString(i) = Trim(objRegex.Replace(newString(i), ""))
End If
' Replace <br>'s & <p>'s with a LF:
objRegex.Pattern = "(<(?:br|p)[^>]*>\s*)+"
newString(i) = objRegex.Replace(newString(i), Chr(10))
' Replace consecutive HTML tags with var delim:
objRegex.Pattern = "(?:(?:<[^>]+>)+[\t\n\v\f\r ]*)+"
newString(i) = objRegex.Replace(newString(i), delim)
' Remove all leading & trailing whitespace & delimiters:
objRegex.Pattern = "^[\t" & delim & "]+|[\t" & delim & "]+$"
newString(i) = objRegex.Replace(newString(i), "")
tmpStr = newString(i)
' If decodeHtml Then newString(i) = HTML_Decode(newString(i))
If decodeHtml Then newString(i) = HTML_Decode(tmpStr)
Next
If TypeName(source) = "Variant()" Then
StripHTML = newString
Else
StripHTML = newString(1)
End If
Set objRegex = Nothing
End Function
Hope this helps!
-Tim W.