Regex to match open HTML tags except for self-contained tags

Matthew C.

The Problem

You want a regular expression (regex) to match opening HTML tags such as <div>, <form id="myForm">, and <h1>. The regex should not match self-contained (self-closing) tags such as <img />, <br />, and <input />.

The Solution

Strictly speaking, self-closing tags do not exist in HTML. HTML elements that can’t have any child nodes, such as <img>, <br>, and <input>, are void elements: they have a start tag and no closing tag. Self-closing syntax, where a trailing slash character (“/”) precedes the closing angle bracket, comes from XML. In XML-based languages such as XHTML and SVG every element must be closed, so an empty element is typically written this way. Some code formatters add a trailing slash to the start tag of an HTML void element to make the markup XHTML-compatible and to improve readability. This is allowed when writing HTML because HTML parsers ignore the trailing slash. These days HTML is used far more than XHTML: it’s the most used markup language for websites.

Various regexes can be used to match opening HTML tags while skipping self-closing tags. For example:

<([a-z]+)(?![^>]*\/>)[^>]*>

This regex does the following:

  • <: Match the opening angle bracket of an HTML tag.
  • ([a-z]+): Match and capture one or more lowercase alphabetical characters (the tag name). Closing tags such as </div> fail here because “/” is not a letter.
  • (?![^>]*\/>): Negative lookahead that prevents matching self-closing tags. If the tag name is followed by zero or more characters other than “>” and then “/>”, the regex won’t match.
  • [^>]*>: Match the rest of the tag: zero or more characters other than “>” (for example, attributes) followed by the closing “>” character.
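As a quick check of the breakdown above, here is a minimal JavaScript sketch that runs the regex against the tags from the problem statement. The helper name findOpenTags and the sample string are illustrative assumptions, not part of the original answer; the global flag is added so that every match is returned.

```javascript
// The regex from above, with the global flag so matchAll() can find every match.
const openTagRegex = /<([a-z]+)(?![^>]*\/>)[^>]*>/g;

// Hypothetical helper: returns the full text of every tag the regex matches.
function findOpenTags(html) {
  return [...html.matchAll(openTagRegex)].map((match) => match[0]);
}

// Sample markup (an assumption for illustration) mixing opening,
// closing, and self-closing tags.
const sample =
  '<div><form id="myForm"><h1>Title</h1><img /><br /><input /></form></div>';

console.log(findOpenTags(sample));
// [ '<div>', '<form id="myForm">', '<h1>' ]
```

Note that the capture group only accepts letters, so for <h1> it captures "h"; the digit is consumed by the final [^>]*> part. Uppercase tag names such as <DIV> are not matched unless the i flag is added.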

Using a regex to find HTML tags is not ideal as it can lead to incorrect matches. For example, if you use the above regex for the following HTML string:

<script> const myString = "<script></script>"; </script> <div class="container"> <!-- <img src="cat.jpg" alt="big cat" > --> </div>

The regex will match the <script> and <div> opening tags. However, it will also match two strings that are not actual elements in the DOM: the <script> text inside the myString variable and the <img> tag inside the HTML comment.
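The false matches described above can be reproduced with a short JavaScript sketch. The variable names tagRegex, htmlString, and matchedTags are assumptions chosen for illustration; the HTML string is the one from the example.

```javascript
// Same regex as above, with the global flag to collect every match.
const tagRegex = /<([a-z]+)(?![^>]*\/>)[^>]*>/g;

// The example HTML string, including a tag inside a JavaScript string
// and a tag inside an HTML comment.
const htmlString =
  '<script> const myString = "<script></script>"; </script> ' +
  '<div class="container"> <!-- <img src="cat.jpg" alt="big cat" > --> </div>';

const matchedTags = [...htmlString.matchAll(tagRegex)].map((match) => match[0]);

console.log(matchedTags);
// [
//   '<script>',
//   '<script>',
//   '<div class="container">',
//   '<img src="cat.jpg" alt="big cat" >'
// ]
```

The second and fourth entries are the incorrect matches: the regex sees only characters, so it cannot tell that one tag sits inside a string literal and the other inside a comment.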

A better approach is to use an HTML parser library such as Cheerio. A parser builds a document tree, so text inside scripts and comments is not mistaken for elements.

