Unicode
It may be understandable for really old languages to have poor or no Unicode support, and I think most modern ones have solved it. Still, handling text that isn't encoded as ASCII (or some other older standard) keeps causing plenty of trouble all over the world. Rust has a solid answer to this, so I wanted to include it. So, here's how Rust solves that problem and why it has an additional primitive type for strings, even though they technically are collections.
If you skimmed through the book, you might have missed the general properties of the char and str types, so have some links to the char explanation and the str explanation.
There's a lot of functionality for working with the provided types correctly. You can't index a str with a single integer to get a character; instead you slice it with a byte range, and you have to be careful: a range that starts or ends in the middle of a UTF-8 encoded character will result in a panic. The type provides a few safer alternatives, for example an iterator over the string's characters, yielding chars.
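To make the difference between bytes and characters concrete, here is a minimal sketch of the iterator approach. The helper names byte_and_char_count and nth_char are made up for this example; chars, char_indices and len are the standard str methods.

```rust
/// Returns (length in UTF-8 bytes, number of chars) for a string.
/// Hypothetical helper, only for this example.
fn byte_and_char_count(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

/// Returns the n-th char (zero-based), or None if the string is too short.
/// Hypothetical helper, only for this example.
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    // "héllo", with é written as an escape so the encoding is unambiguous.
    let text = "h\u{e9}llo";

    // é takes two bytes in UTF-8, so there are more bytes than chars.
    let (bytes, chars) = byte_and_char_count(text);
    println!("{} bytes, {} chars", bytes, chars);

    // char_indices pairs every char with the byte offset where it starts.
    for (offset, c) in text.char_indices() {
        println!("byte offset {}: {}", offset, c);
    }

    // Getting "the second character" means walking the iterator.
    println!("{:?}", nth_char(text, 1));
}
```

Note that nth has to walk the string from the start, so it costs more than an array lookup; that is the price of a variable-width encoding.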
If you still want to directly index it, you can check the safety of doing so by calling the is_char_boundary method with the index.
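A small sketch of that check, assuming you want the first few bytes of a string without risking a panic. safe_prefix is a hypothetical helper; is_char_boundary and get are the standard str methods.

```rust
/// Returns the part of `text` before byte offset `end`, or None when `end`
/// is past the end of the string or falls inside a multi-byte character.
/// Hypothetical helper, only for this example.
fn safe_prefix(text: &str, end: usize) -> Option<&str> {
    // is_char_boundary also returns false for offsets beyond text.len(),
    // so the slice below can never panic.
    if text.is_char_boundary(end) {
        Some(&text[..end])
    } else {
        None
    }
}

fn main() {
    // "héllo": é occupies byte offsets 1 and 2.
    let text = "h\u{e9}llo";

    for end in 0..=text.len() + 1 {
        println!("{} -> {:?}", end, safe_prefix(text, end));
    }

    // str::get performs the same check and returns an Option directly.
    println!("{:?}", text.get(..2));
}
```

If you don't need the boolean itself, get is probably the more convenient of the two, since it combines the check and the slice.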
You can convert both the char and str primitives to a few different formats. If you don't want to print a string, but rather just transport it, you can use byte arrays. Since char isn't used for this purpose as it is in C(++), those use the u8 type.
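As a rough illustration of the byte view, the sketch below turns a str and a single char into their UTF-8 bytes. to_wire and char_to_utf8 are names invented for this example; as_bytes and encode_utf8 are the standard methods.

```rust
/// Copies the UTF-8 bytes of a string into an owned buffer, e.g. to send it
/// somewhere. Hypothetical helper, only for this example.
fn to_wire(text: &str) -> Vec<u8> {
    // as_bytes borrows the existing bytes; no re-encoding takes place.
    text.as_bytes().to_vec()
}

/// Encodes a single char as UTF-8 (one to four bytes).
/// Hypothetical helper, only for this example.
fn char_to_utf8(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // four bytes are always enough for one char
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // "aü": one ASCII byte followed by a two-byte sequence.
    println!("{:02X?}", to_wire("a\u{fc}"));

    // The euro sign needs three bytes.
    println!("{:02X?}", char_to_utf8('\u{20ac}'));
}
```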
You can also encode to and decode from UTF-16. In this case, a "wide char" buffer is used, which translates to u16 in Rust.
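A minimal round trip through UTF-16 might look like this. The wrapper functions to_utf16 and from_utf16 are only there for illustration; encode_utf16, String::from_utf16 and String::from_utf16_lossy are the standard library calls.

```rust
use std::string::FromUtf16Error;

/// Encodes a string as UTF-16 code units.
/// Hypothetical wrapper, only for this example.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// Decodes UTF-16 code units; fails on unpaired surrogates.
/// Hypothetical wrapper, only for this example.
fn from_utf16(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}

fn main() {
    // U+1F600 lies outside the Basic Multilingual Plane, so it becomes a
    // surrogate pair (two u16 values) in UTF-16.
    let units = to_utf16("a\u{1F600}");
    println!("{:04X?}", units);

    // Decoding the same units gives back the original string.
    println!("{:?}", from_utf16(&units));

    // A lone surrogate is not valid UTF-16.
    println!("{:?}", from_utf16(&[0xD83D]).is_err());
    println!("{:?}", String::from_utf16_lossy(&[0xD83D]));
}
```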
When decoding raw data, you can choose either a conversion method that returns a Result depending on whether the data is validly encoded (see the error handling chapter), or a lossy function which replaces every invalid sequence with a special character.
You might still occasionally see that character online when a conversion between encodings failed somewhere: it's "�", the Unicode replacement character (U+FFFD).
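The two options could be sketched like this for UTF-8 input. decode_strict and decode_lossy are hypothetical wrappers; std::str::from_utf8 and String::from_utf8_lossy are the standard functions behind them.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Strict decoding: returns an error if the bytes are not valid UTF-8.
/// Hypothetical wrapper, only for this example.
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: invalid sequences are replaced with U+FFFD.
/// The Cow only allocates when a replacement was actually needed.
/// Hypothetical wrapper, only for this example.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    let good = b"hello";
    let bad = [0x68, 0xFF, 0x69]; // 0xFF can never appear in UTF-8

    println!("{:?}", decode_strict(good));
    println!("{:?}", decode_strict(&bad).is_err());
    println!("{}", decode_lossy(&bad));
}
```

Which one to pick depends on the situation: the strict variant suits data you need to trust, while the lossy variant is handy for logs or display where a best-effort result is good enough.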
While we're pretty much completely in Unicode land in Rust, there might still be some things you want to do the old ASCII way. For example, converting to upper or lower case is only simple and cheap for ASCII characters; everything beyond that needs tons and tons of conversion rules. The char and str types do offer Unicode-aware to_uppercase and to_lowercase, but those are locale-independent and can change the length of a string. Also, if you know that your input is ASCII-formatted, there are a few more things you can do. The Rust types provide a wealth of methods for those purposes:
- Check if a string is ASCII-formatted with is_ascii.
- Convert a string to lower or upper case in place with make_ascii_lowercase and its opposite, make_ascii_uppercase.
- Case-insensitive comparison between strings using eq_ignore_ascii_case.
- Special ASCII variants for various checks, for example if a character is alphanumeric.
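The methods from the list can be tried out with a short sketch like the following. normalize_header and is_ascii_identifier are hypothetical helpers written for this example; the is_ascii, make_ascii_lowercase, eq_ignore_ascii_case and is_ascii_alphanumeric calls are the standard methods.

```rust
/// Lowercases the ASCII letters of a string; other characters are untouched.
/// Hypothetical helper, only for this example.
fn normalize_header(name: &str) -> String {
    let mut lowered = name.to_owned();
    lowered.make_ascii_lowercase(); // works in place, never changes the length
    lowered
}

/// Checks that a string is non-empty and made only of ASCII letters,
/// ASCII digits and underscores. Hypothetical helper, only for this example.
fn is_ascii_identifier(text: &str) -> bool {
    !text.is_empty() && text.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn main() {
    println!("{}", "Content-Type".is_ascii());
    println!("{}", "Gr\u{fc}\u{df}e".is_ascii()); // "Grüße" is not ASCII

    println!("{}", normalize_header("Content-Type"));
    println!("{}", "HELLO".eq_ignore_ascii_case("hello"));

    // é is alphanumeric in the Unicode sense, but not in the ASCII sense.
    println!("{}", '\u{e9}'.is_alphanumeric());
    println!("{}", '\u{e9}'.is_ascii_alphanumeric());

    println!("{}", is_ascii_identifier("my_var1"));
}
```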