How to Compare Strings in Python: Equality and Identity
Once again, we’re back with another Python topic. Today, we’ll talk about how to compare strings in Python. Typically, I try to stay away from strings because they have a lot of complexity (e.g. different languages, implementations, etc.). That said, I decided to take a risk with this one. Hope you like it!
As a bit of a teaser, here’s what you can expect in this article. We’ll be looking at a few different comparison operators in Python, including `==`, `<`, `<=`, `>=`, and `>`, as well as `is`. In addition, we’ll talk about how these operators can be used to compare strings and when to use them. If you want to know more, you’ll have to keep reading.
Problem Description
Let’s imagine we’re building up a simple search engine. For example, we have a bunch of files with text in them, and we want to be able to search through those documents for certain keywords. How would we do that?
At the core of this search engine, we’ll have to compare strings. For instance, if we search our system for something about the Pittsburgh Penguins (say, Sidney Crosby), we’ll have to look for documents that contain our keyword. Of course, how do we know whether or not we have a match?
Specifically, we want to know how we can compare two strings for equality. For example, is “Sidney Crosby” the same as “Sidney Crosby”? How about “sidney crosby”? Or even “SiDnEy CrOsBy”? In other words, what constitutes equality in Python?
Of course, equality isn’t the only way to compare strings. For example, how can we compare strings alphabetically/lexicographically? Does “Malkin” come before or after “Letang” in a list?
If any of these topics sound interesting, you’re in luck. We’ll cover all of them and more in this article.
Solutions
In this section, we’ll take a look at a few different ways to compare strings. First, we’ll look at a brute force solution which involves looping over each character to check for matches. Then, we’ll introduce the comparison operators which abstract away the brute force solution. Finally, we’ll talk about identity.
Compare Strings by Brute Force
Since strings are iterables, there’s nothing really stopping us from writing a loop to compare each character:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    is_same_player = True
    for a, b in zip(penguins_87, penguins_71):
        if a != b:
            is_same_player = False
            break
In this example, we zip both strings and loop over each pair of characters until we find a pair that doesn’t match. If we break before we’re finished, we know we don’t have a match. Otherwise, our strings are “identical.”
While this gets the job done for some strings, it might fail in certain scenarios. For example, what happens if one of the strings is longer than the other?
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"
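As it turns out, `zip()` stops as soon as the shorter string runs out, so any extra characters in the longer string are never compared. That means a string like “Cros” would incorrectly count as a match for “Crosby”. To deal with that, we might consider doing a length check first: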
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"
    is_same_player = len(penguins_87) == len(penguins_59)
    if is_same_player:
        for a, b in zip(penguins_87, penguins_59):
            if a != b:
                is_same_player = False
                break
Of course, even with the extra check, this solution is a bit overkill and likely error prone. In addition, this solution only works for equality. How do we check if a string is “less” than another lexicographically? Luckily, there are other solutions below.
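For the curious, here’s a rough sketch of how the brute force approach could be stretched to handle ordering. The helper name `brute_force_less_than` is made up for illustration, and it assumes the same case-sensitive, character-by-character rules as the loop above:

```python
def brute_force_less_than(left: str, right: str) -> bool:
    """Return True if left sorts strictly before right (hypothetical helper)."""
    for a, b in zip(left, right):
        if a != b:
            # The first differing pair of characters decides the order.
            return ord(a) < ord(b)
    # One string is a prefix of the other (or they are equal),
    # so the shorter string sorts first.
    return len(left) < len(right)


print(brute_force_less_than("Crosby", "Malkin"))  # True
print(brute_force_less_than("Cros", "Crosby"))    # True
print(brute_force_less_than("Malkin", "Malkin"))  # False
```

Like the equality loop, this is more code than anyone should want to maintain, which is a good reason to reach for the built-in operators instead.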
Compare Strings by Comparison Operators
Fun fact: we don’t have to write our own string equality code to compare strings. As it turns out, there are several core operators that work with strings right out of the box: `==`, `<`, `<=`, `>=`, and `>`.
Using our Penguins players from above, we can try comparing them directly:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

    penguins_87 == penguins_87  # True
    penguins_87 == penguins_71  # False
    penguins_87 >= penguins_71  # False
    penguins_59 <= penguins_71  # True
Now, it’s important to note that these comparison operators work with the underlying numeric value of each character (its Unicode code point, which matches ASCII for plain English characters). As a result, seemingly equivalent strings might not appear to be the same:
    penguins_87 = "Crosby"
    penguins_87_small = "crosby"

    penguins_87 == penguins_87_small  # False
When we compare “Crosby” and “crosby”, we get `False` because “c” and “C” aren’t equivalent:
    ord('c')  # 99
    ord('C')  # 67
Naturally, this can lead to some strange behavior. For example, we might say “crosby” is less than “Malkin” because “crosby” comes before “Malkin” alphabetically. Unfortunately, that’s not how Python interprets that expression:
    penguins_87_small = "crosby"
    penguins_71 = "Malkin"

    penguins_87_small < penguins_71  # False
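To see where that result comes from, we can line up the characters ourselves. The following is only a small sketch, and the helper name `first_difference` is made up for illustration rather than anything built into Python:

```python
def first_difference(left: str, right: str):
    """Return the first pair of differing characters with their code points.

    Hypothetical helper for illustration; returns None if no pair differs
    within the length of the shorter string.
    """
    for a, b in zip(left, right):
        if a != b:
            return (a, ord(a)), (b, ord(b))
    return None


print(first_difference("crosby", "Malkin"))  # (('c', 99), ('M', 77))
```

Since 99 is greater than 77, “crosby” lands after “Malkin” no matter what the rest of the letters are.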
In other words, while these comparison operators are convenient, they don’t actually perform a case-insensitive comparison. Luckily, there are all sorts of tricks we can employ like converting both strings to uppercase or lowercase:
    penguins_87_small = "crosby"
    penguins_71 = "Malkin"

    penguins_87_small.lower() < penguins_71.lower()  # True
    penguins_87_small.upper() < penguins_71.upper()  # True
Since strings in Python are immutable, as they are in many languages, these methods don’t actually manipulate the underlying strings. Instead, they return new ones.
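If the strings might contain more than plain English letters, `str.casefold()` is another option worth knowing. It’s a built-in string method meant for caseless matching, and it’s a bit more aggressive than `lower()`. Here’s a quick sketch using a made-up German example rather than our Penguins:

```python
street_upper = "STRASSE"
street_sharp = "straße"

# lower() leaves the German sharp s (ß) alone, so the strings differ.
print(street_upper.lower() == street_sharp.lower())        # False

# casefold() expands ß to "ss", so the strings match.
print(street_upper.casefold() == street_sharp.casefold())  # True
```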
All that said, strings are inherently complex. I say that as a bit of a warning because there are bound to be edge cases where the solutions in this article don’t work as expected. After all, we’ve only scratched the surface with ASCII characters. Try playing around with some strings that don’t include English characters (e.g. 🤐, 汉, etc.). You may be surprised by the results.
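As one example of such a surprise, the same visible character can sometimes be built in more than one way. The sketch below leans on the standard library’s `unicodedata.normalize()` to smooth over that difference; the variable names are just for illustration:

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing both strings to the same form (NFC here) makes them comparable.
same = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
print(same)                            # True
```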
Compare Strings by Identity
Before we move on, I felt like it was important to mention another way of comparing strings: identity. In Python, `==` isn’t the only way to compare things; we can also use `is`. Take a look:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

    penguins_87 is penguins_87  # True
    penguins_87 is penguins_71  # False
Here, it’s tough to see any sort of difference between this solution and the previous one. After all, the output is the same. That said, there is a fundamental difference here. With equality (`==`), we compare the strings by their contents (i.e., letter by letter). With identity (`is`), we compare the strings by their location in memory (i.e., address/reference).
To see this in action, let’s create a few equivalent strings:
    penguins_87 = "Crosby"
    penguins_87_copy = "Crosby"
    penguins_87_clone = "Cros" + "by"
    penguins_8 = "Cros"
    penguins_7 = "by"
    penguins_87_dupe = penguins_8 + penguins_7

    id(penguins_87)        # 65564544
    id(penguins_87_copy)   # 65564544
    id(penguins_87_clone)  # 65564544
    id(penguins_87_dupe)   # 65639392 Uh Oh!
In the first three examples, the Python interpreter was able to tell that the constructed strings were the same, so the interpreter didn’t bother making space for the two clones. Instead, it gave the latter two, `penguins_87_copy` and `penguins_87_clone`, the same ID. As a result, if we compare any of the first three strings with either `==` or `is`, we’ll get the same result:
    penguins_87 == penguins_87_copy == penguins_87_clone  # True
    penguins_87 is penguins_87_copy is penguins_87_clone  # True
When we get to the last string, `penguins_87_dupe`, we run into a bit of an issue. As far as I can tell, the interpreter isn’t able to know what the value of the expression is until runtime. As a result, it creates a new location for the resulting string, despite the fact that “Crosby” already exists. If we modify our comparison chains from above, we’ll see a different result:
    penguins_87 == penguins_87_copy == penguins_87_clone == penguins_87_dupe  # True
    penguins_87 is penguins_87_copy is penguins_87_clone is penguins_87_dupe  # False
The main takeaway here is to always use `==` when comparing strings for equality (and any object, for that matter). After all, there’s no guarantee that the Python interpreter is going to properly identify equivalent strings and give them the same ID. That said, if you need to compare two strings for identity, `is` is the way to go.
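If identity really is what we’re after, the standard library offers `sys.intern()`, which returns a shared copy of a string from the interpreter’s table of interned strings. Here’s a hedged sketch that reuses the variable names from above and builds one of the strings at runtime:

```python
import sys

penguins_8 = "Cros"
penguins_7 = "by"

penguins_87 = sys.intern("Crosby")
penguins_87_dupe = penguins_8 + penguins_7  # built at runtime

# The contents match, as expected.
print(penguins_87 == penguins_87_dupe)              # True

# Interning the runtime string hands back the shared "Crosby" object.
print(penguins_87 is sys.intern(penguins_87_dupe))  # True
```

Even so, this is an optimization trick more than a comparison strategy, so `==` remains the safer default.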
Challenge
Normally, I would check each solution for performance, but they’re not all that similar. Instead, I figured we could jump right to the challenge.
Now that we know how to compare strings in Python, I figured we could try using that knowledge to write a simple string sorting algorithm. For this challenge, you can assume ASCII strings and case sensitivity. However, you’re free to optimize your solutions as needed. All I care about is the use of the operators discussed in this article.
If you need a sample list to get started, here’s the current forward roster for the Pittsburgh Penguins (reverse sorted alphabetically):
    penguins_2019_2020 = [
        'Tanev',
        'Simon',
        'Rust',
        'McCann',
        'Malkin',
        'Lafferty',
        'Kahun',
        'Hornqvist',
        'Guentzel',
        'Galchenyuk',
        'Di Pauli',
        'Crosby',
        'Blueger',
        'Blandisi',
        'Bjugstad',
        'Aston-Reese'
    ]
When you’re finished, drop your solution in the comments below. Then, head on over to my article titled How to Sort a List of Strings in Python to see a few clever solutions.
A Little Recap
And with that, we’re all done. Check out all the solutions here:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

    # Brute force comparison (equality only)
    is_same_player = len(penguins_87) == len(penguins_59)
    if is_same_player:
        for a, b in zip(penguins_87, penguins_59):
            if a != b:
                is_same_player = False
                break

    # Direct comparison
    penguins_87 == penguins_59  # False
    penguins_87 > penguins_59   # False
    penguins_71 <= penguins_71  # True

    # Identity checking
    penguins_87 is penguins_87  # True
    penguins_71 is penguins_87  # False
Published on Web Code Geeks with permission by Jeremy Grifski, partner at our WCG program. See the original article here: How to Compare Strings in Python: Equality and Identity. Opinions expressed by Web Code Geeks contributors are their own.