Skip to content

gh-108590: Improve sqlite3 docs on encoding issues and how to handle those #108699

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Oct 25, 2023
Merged
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
85 changes: 52 additions & 33 deletions Doc/library/sqlite3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1157,6 +1157,10 @@ Connection objects
f.write('%s\n' % line)
con.close()

.. seealso::

:ref:`sqlite3-howto-encoding`


.. method:: backup(target, *, pages=-1, progress=None, name="main", sleep=0.250)

Expand Down Expand Up @@ -1223,6 +1227,10 @@ Connection objects

.. versionadded:: 3.7

.. seealso:

:ref:`sqlite3-howto-encoding`

.. method:: getlimit(category, /)

Get a connection runtime limit.
Expand Down Expand Up @@ -1444,39 +1452,8 @@ Connection objects
and returns a text representation of it.
The callable is invoked for SQLite values with the ``TEXT`` data type.
By default, this attribute is set to :class:`str`.
If you want to return ``bytes`` instead, set *text_factory* to ``bytes``.

Example:

.. testcode::

con = sqlite3.connect(":memory:")
cur = con.cursor()

AUSTRIA = "Österreich"

# by default, rows are returned as str
cur.execute("SELECT ?", (AUSTRIA,))
row = cur.fetchone()
assert row[0] == AUSTRIA

# but we can make sqlite3 always return bytestrings ...
con.text_factory = bytes
cur.execute("SELECT ?", (AUSTRIA,))
row = cur.fetchone()
assert type(row[0]) is bytes
# the bytestrings will be encoded in UTF-8, unless you stored garbage in the
# database ...
assert row[0] == AUSTRIA.encode("utf-8")

# we can also implement a custom text_factory ...
# here we implement one that appends "foo" to all strings
con.text_factory = lambda x: x.decode("utf-8") + "foo"
cur.execute("SELECT ?", ("bar",))
row = cur.fetchone()
assert row[0] == "barfoo"

con.close()
See :ref:`sqlite3-howto-encoding` for more details.

.. attribute:: total_changes

Expand Down Expand Up @@ -1635,7 +1612,6 @@ Cursor objects
COMMIT;
""")


.. method:: fetchone()

If :attr:`~Cursor.row_factory` is ``None``,
Expand Down Expand Up @@ -2614,6 +2590,49 @@ With some adjustments, the above recipe can be adapted to use a
instead of a :class:`~collections.namedtuple`.


.. _sqlite3-howto-encoding:

How to handle non-UTF-8 text encodings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, :mod:`!sqlite3` adapts SQLite values with the ``TEXT`` data type
using :class:`str`.
This works well for UTF-8 encoded text, but it will fail for other encodings
and invalid UTF-8.
You can use a custom :attr:`~Connection.text_factory` to handle such cases.

Because of SQLite's `flexible typing`_, it is not uncommon to encounter table
columns with the ``TEXT`` data type, containing non-UTF-8 encodings,
or even arbitrary data.
To demonstrate, let's assume we've got a database with ISO-8859-2 (Latin 2)
encoded text, for example a table of Czech-English dictionary entries.
Assuming we now have a :class:`Connection` instance :py:data:`!con`
connected to this database,
we can decode the Latin 2 encoded text using this :attr:`~Connection.text_factory`:

.. testcode::

con.text_factory = lambda data: str(data, encoding="latin2")

For invalid UTF-8 or arbitrary data in stored in ``TEXT`` table columns,
you can use the following technique, borrowed from the :ref:`unicode-howto`:

.. testcode::

con.text_factory = lambda data: str(data, errors="surrogateescape")

.. note::

Strings containing surrogate escapes and must be treated with care.
You cannot simply pass them back to SQLite,
for example using :meth:`~Cursor.execute`,
since the :mod:`!sqlite3` module APIs only accept UTF-8 encoded strings.

.. seealso::

:ref:`unicode-howto`


.. _sqlite3-explanation:

Explanation
Expand Down