Regular Expressions

Lesson Details:
July 10, 2020


I: Introduction

A programming language is a human-readable instruction set. According to the MIT Computer Science and Artificial Intelligence Laboratory, a programming language is a formal language. Every programming language has a specific grammar that allows programmers to communicate instructions to a computer. Programming languages are usually classified as low-level languages and high-level languages. Low-level languages are also known as machine languages because a computer can directly understand these languages. High-level languages are understood by humans, not computers. Some high-level languages are easier to use than low-level languages, but they also have more features.

II: Body

Regular expressions are strings of characters used in programming to describe patterns instead of exact matches. The term "regular expression" was coined by Stephen Kleene in the 1940's, who used the term to refer to one of the most prominent mathematical theories of the 20th century. He had no idea that this theory would become so important in the field of programming later on.

There are several ways to explain regular expressions. Many people try to explain them with the help of algebra or graphs. These explanations are not simple at all. Another, often used explanation is based on the idea that regular expressions are sets of possible strings. This explanation is far simpler, but it has some problems. For example, the set of all strings over an alphabet ∑ is ∑∪∑∪∑∪∑.... Essentially, this is an infinite number of strings. All these strings are certainly not possible in practice.

However, another way to explain regular expressions is by using finite automatons. Finite automata are machines that only accept finite strings in their input. They are used in many areas in computer science. They are used for data compression in cryptography, for example. Finite automata can be easily explained in combination with regular expressions. The simplest example of a finite automaton is one that accepts only even numbers written in binary format (numbers made up of 0's and 1's). A finite automaton for this task consists of two states: its initial state and its final state. It starts in its initial state and moves along the string until it reaches either an even number or an odd number. If it reaches an even number, it will stay in its current state; if it reaches an odd number, it will move into its final state after it has read the number. This machine accepts only even numbers in binary, but it is easy to see how this machine can be extended to accept all numbers in binary format (0, 1, 2, 3,...).

III: Conclusion

The main advantage of regular expressions is that they allow for pattern matching in text instead of exact matches. However, there are several disadvantages as well. First of all, they can be very complicated when you want to match certain patterns only at the beginning or end of a string (the RE engine has to find a match within a string).

Another disadvantage is that they contain a lot of metacharacters that have a special meaning in a regular expression. If you confuse these metacharacters with other characters in your pattern, you might lose a lot of time just trying to figure out what went wrong with your program because it does not give any errors or warnings when it runs your code. Regular expressions should therefore be treated with great care and should only be used when there is no simpler solution for your problem.

loader
Course content