New C++ Unicode library
By Ross Smith
#include "rs-unicode.hpp"
namespace RS::Unicode;
This is my new C++ Unicode library.
The library is designed on the assumption that text processing will normally be done entirely with known-valid UTF-8 text, with unvalidated text only being encountered during input sanitization.
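As a rough illustration of that design assumption, the sketch below checks raw bytes once at the input boundary, so later code can treat the result as known-valid UTF-8. It uses only the standard library. The names `is_valid_utf8` and `sanitize_input` are hypothetical and are not part of this library's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not an rs-unicode function): returns true if the
// bytes form well-formed UTF-8, rejecting overlong forms, surrogates,
// truncated sequences, and code points above U+10FFFF.
bool is_valid_utf8(std::string_view text) noexcept {
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        const auto lead = static_cast<unsigned char>(text[i]);
        if (lead < 0x80) { ++i; continue; }  // ASCII byte
        std::size_t len = 0;       // total bytes in this sequence
        std::uint32_t cp = 0;      // decoded code point
        std::uint32_t min = 0;     // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(text[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code point
        i += len;
    }
    return true;
}

// Hypothetical boundary function: validate once on input, then hand back
// a string that the rest of the program can treat as valid UTF-8.
std::optional<std::string> sanitize_input(std::string_view raw) {
    if (!is_valid_utf8(raw)) return std::nullopt;
    return std::string(raw);
}
```

With a check like this at the point where text enters the program, later code does not need to revalidate on every call.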
My original library (Unicorn) hasn’t been updated for many years; there have been enough changes to the Unicode standard since then that it will not work with a significant fraction of existing Unicode text (in particular, anything that uses emoji). Given that, along with the library’s size (it contains many features I no longer consider worthwhile) and some other design decisions I wanted to revisit, I decided that writing a new library (with some code imported from the old one) was the best approach at this point.
rs-unicode/character.hpp – Character properties
rs-unicode/encoding.hpp – Character encodings
rs-unicode/legacy.hpp – Legacy encodings
rs-unicode/regex.hpp – Regular expressions
rs-unicode/string.hpp – String manipulation
rs-unicode/version.hpp – Version information
You will need my header-only core utility library.
There is a CMakeLists.txt file that can build and install the Unicode library using the usual CMake conventions. Command line usage will typically look like this:
cd wherever/you/installed/rs-unicode
mkdir build
cd build
cmake -G "Unix Makefiles" ../src
# or cmake -G "Visual Studio 17 2022" ../src on Windows
cmake --build . --config Release -- -j<N>
# where <N> is your CPU core count
cmake --build . --config Release --target install
The library’s public headers are listed above (other headers are for internal use only and should not be included by your code). To use the library, #include either the individual headers you want, or rs-unicode.hpp to include all of them.
Link your build with -lrs-unicode. You will also need -lpcre2-8. On some systems, you may also need -liconv.
In a library that does string manipulation, any function that constructs or modifies a string runs the risk of a memory allocation error. This possibility is not usually explicitly documented because it is so ubiquitous. Unless the documentation explicitly says otherwise, any function that is not marked noexcept should be assumed to be capable of throwing std::bad_alloc, in addition to any exceptions explicitly documented for it.
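The sketch below shows how that convention looks from the caller's side. The functions `repeat_text`, `is_ascii`, and `try_repeat` are hypothetical stand-ins written for this example and are not part of the library. The point is only that the `noexcept` operator can distinguish the two kinds of function at compile time, and that a caller who needs to survive allocation failure catches `std::bad_alloc` around the call that is not `noexcept`.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <string_view>

// Hypothetical stand-in for a string-building function. It is not marked
// noexcept, so by the convention above it may throw std::bad_alloc.
std::string repeat_text(std::string_view text, std::size_t count) {
    std::string result;
    for (std::size_t i = 0; i < count; ++i) {
        result += text;  // may allocate
    }
    return result;
}

// Hypothetical stand-in for a query function. It only inspects its
// argument and never allocates, so it can be marked noexcept.
bool is_ascii(std::string_view text) noexcept {
    for (unsigned char c : text) {
        if (c > 0x7F) return false;
    }
    return true;
}

// The noexcept operator reports each function's guarantee at compile time.
static_assert(!noexcept(repeat_text(std::string_view{}, std::size_t{0})));
static_assert(noexcept(is_ascii(std::string_view{})));

// A caller that wants to recover from allocation failure wraps only the
// call that can throw. Returns false if the allocation failed.
bool try_repeat(std::string_view text, std::size_t count, std::string& out) {
    try {
        out = repeat_text(text, count);
        return true;
    } catch (const std::bad_alloc&) {
        return false;
    }
}
```

In most programs it is reasonable to let `std::bad_alloc` propagate; the wrapper is only worth writing where recovery is actually possible.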