gh-67041: Allow to distinguish between empty and not defined URI components #123305

213 changes: 135 additions & 78 deletions Doc/library/urllib.parse.rst
@@ -50,12 +50,15 @@
The URL parsing functions focus on splitting a URL string into its components,
or on combining URL components into a URL string.

.. function:: urlparse(urlstring, scheme='', allow_fragments=True)
.. function:: urlparse(urlstring, scheme=None, allow_fragments=True, *, allow_none=False)

Parse a URL into six components, returning a 6-item :term:`named tuple`. This
corresponds to the general structure of a URL:
``scheme://netloc/path;parameters?query#fragment``.
Each tuple item is a string, possibly empty. The components are not broken up
Each tuple item is a string, possibly empty, or ``None`` if *allow_none* is true.
Undefined components are represented by an empty string (by default) or
by ``None`` if *allow_none* is true.
The components are not broken up
into smaller parts (for example, the network location is a single string), and %
escapes are not expanded. The delimiters as shown above are not part of the
result, except for a leading slash in the *path* component, which is retained if
@@ -84,6 +87,12 @@
80
>>> o._replace(fragment="").geturl()
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'
>>> urlparse("http://docs.python.org?")
ParseResult(scheme='http', netloc='docs.python.org',
path='', params='', query='', fragment='')
>>> urlparse("http://docs.python.org?", allow_none=True)
ParseResult(scheme='http', netloc='docs.python.org',
path='', params=None, query='', fragment=None)
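
The ambiguity that *allow_none* is meant to resolve can be seen with the existing default behaviour. The following is a minimal sketch that deliberately avoids the proposed *allow_none* argument, since it may not exist in the interpreter at hand; the URLs are only illustrative.

```python
from urllib.parse import urlparse

# With the default behaviour, a URL ending in a bare "?" (empty query) and a
# URL with no "?" at all (undefined query) parse to the same result.
with_empty_query = urlparse("http://docs.python.org?")
without_query = urlparse("http://docs.python.org")

print(repr(with_empty_query.query))        # ''
print(repr(without_query.query))           # ''
print(with_empty_query == without_query)   # True
```

Because both results compare equal, the caller cannot tell from the parse result whether the original URL contained the ``?`` delimiter.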

Following the syntax specifications in :rfc:`1808`, urlparse recognizes
a netloc only if it is properly introduced by '//'. Otherwise the
@@ -101,47 +110,53 @@
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
params='', query='', fragment='')
>>> urlparse('help/Python.html')
ParseResult(scheme='', netloc='', path='help/Python.html', params='',
query='', fragment='')
ParseResult(scheme='', netloc='', path='help/Python.html',
params='', query='', fragment='')
>>> urlparse('help/Python.html', allow_none=True)
ParseResult(scheme=None, netloc=None, path='help/Python.html',
params=None, query=None, fragment=None)

The *scheme* argument gives the default addressing scheme, to be
used only if the URL does not specify one. It should be the same type
(text or bytes) as *urlstring*, except that the default value ``''`` is
(text or bytes) as *urlstring*, or ``None``, except that ``''`` is
always allowed, and is automatically converted to ``b''`` if appropriate.

If the *allow_fragments* argument is false, fragment identifiers are not
recognized. Instead, they are parsed as part of the path, parameters
or query component, and :attr:`fragment` is set to the empty string in
the return value.
or query component, and :attr:`fragment` is set to ``None`` or the empty
string (depending on the value of *allow_none*) in the return value.
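
As a small illustration of *allow_fragments*, the sketch below parses the same URL twice using only the long-standing default API (the example URL is hypothetical):

```python
from urllib.parse import urlparse

url = "http://example.com/doc?topic=intro#section-2"

default = urlparse(url)
no_frag = urlparse(url, allow_fragments=False)

# By default the fragment is split off into its own component.
print(default.query, default.fragment)         # topic=intro section-2

# With allow_fragments=False the "#..." text stays in the preceding
# component (here the query) and the fragment is the empty string.
print(no_frag.query, repr(no_frag.fragment))   # topic=intro#section-2 ''
```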

The return value is a :term:`named tuple`, which means that its items can
be accessed by index or as named attributes, which are:

+------------------+-------+-------------------------+------------------------+
| Attribute        | Index | Value                   | Value if not present   |
+==================+=======+=========================+========================+
| :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter     |
+------------------+-------+-------------------------+------------------------+
| :attr:`netloc`   | 1     | Network location part   | empty string           |
+------------------+-------+-------------------------+------------------------+
| :attr:`path`     | 2     | Hierarchical path       | empty string           |
+------------------+-------+-------------------------+------------------------+
| :attr:`params`   | 3     | Parameters for last     | empty string           |
|                  |       | path element            |                        |
+------------------+-------+-------------------------+------------------------+
| :attr:`query`    | 4     | Query component         | empty string           |
+------------------+-------+-------------------------+------------------------+
| :attr:`fragment` | 5     | Fragment identifier     | empty string           |
+------------------+-------+-------------------------+------------------------+
| :attr:`username` |       | User name               | :const:`None`          |
+------------------+-------+-------------------------+------------------------+
| :attr:`password` |       | Password                | :const:`None`          |
+------------------+-------+-------------------------+------------------------+
| :attr:`hostname` |       | Host name (lower case)  | :const:`None`          |
+------------------+-------+-------------------------+------------------------+
| :attr:`port`     |       | Port number as integer, | :const:`None`          |
|                  |       | if present              |                        |
+------------------+-------+-------------------------+------------------------+
+------------------+-------+-------------------------+-------------------------------+
| Attribute        | Index | Value                   | Value if not present          |
+==================+=======+=========================+===============================+
| :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter or         |
|                  |       |                         | empty string [1]_             |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`netloc`   | 1     | Network location part   | ``None`` or empty string [1]_ |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`path`     | 2     | Hierarchical path       | empty string                  |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`params`   | 3     | Parameters for last     | ``None`` or empty string [1]_ |
|                  |       | path element            |                               |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`query`    | 4     | Query component         | ``None`` or empty string [1]_ |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`fragment` | 5     | Fragment identifier     | ``None`` or empty string [1]_ |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`username` |       | User name               | ``None``                      |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`password` |       | Password                | ``None``                      |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`hostname` |       | Host name (lower case)  | ``None``                      |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`port`     |       | Port number as integer, | ``None``                      |
|                  |       | if present              |                               |
+------------------+-------+-------------------------+-------------------------------+

.. [1] Depending on the value of the *allow_none* argument.
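
The non-indexed attributes in the table can be illustrated with a short sketch. It relies only on the existing default behaviour; the host name and credentials are made up for the example.

```python
from urllib.parse import urlparse

full = urlparse("http://user:secret@Example.COM:8080/index.html")
# hostname is lower-cased; port is converted to an integer.
print(full.username, full.password, full.hostname, full.port)
# user secret example.com 8080

bare = urlparse("/index.html")
# Without a network location these attributes are all None.
print(bare.username, bare.password, bare.hostname, bare.port)
# None None None None
```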

Reading the :attr:`port` attribute will raise a :exc:`ValueError` if
an invalid port is specified in the URL. See section
@@ -187,12 +202,15 @@

.. versionchanged:: 3.6
Out-of-range port numbers now raise :exc:`ValueError`, instead of
returning :const:`None`.
returning ``None``.

.. versionchanged:: 3.8
Characters that affect netloc parsing under NFKC normalization will
now raise :exc:`ValueError`.

.. versionchanged:: 3.14
Added the *allow_none* parameter.


.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')

@@ -287,16 +305,27 @@
separator key, with ``&`` as the default separator.


.. function:: urlunparse(parts)
.. function:: urlunparse(parts, *, keep_empty=False)

Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
argument can be any six-item iterable. This may result in a slightly
different, but equivalent URL, if the URL that was parsed originally had
unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
states that these are equivalent).
argument can be any six-item iterable.

This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a ``?`` with an empty query; the RFC states that these are equivalent).

If *keep_empty* is true, empty strings are kept in the result (for example,
a ``?`` for an empty query); only ``None`` components are omitted.
This makes it possible to restore a URL that was parsed with
``allow_none=True``.
By default, *keep_empty* is true if *parts* is the result of a
:func:`urlparse` call with ``allow_none=True``.
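
For comparison, here is a minimal sketch of the existing round-trip behaviour that *keep_empty* is intended to change. It does not use the proposed *keep_empty* or *allow_none* arguments, which may not be available; the URL is illustrative.

```python
from urllib.parse import urlparse, urlunparse

original = "http://example.com/path?"

# The default round trip drops the trailing "?" because the empty query
# is indistinguishable from a missing one.
rebuilt = urlunparse(urlparse(original))
print(rebuilt)               # http://example.com/path
print(rebuilt == original)   # False
```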

.. function:: urlsplit(urlstring, scheme='', allow_fragments=True)
.. versionchanged:: 3.14
Added the *keep_empty* parameter.


.. function:: urlsplit(urlstring, scheme=None, allow_fragments=True, *, allow_none=False)

This is similar to :func:`urlparse`, but does not split the params from the URL.
This should generally be used instead of :func:`urlparse` if the more recent URL
@@ -310,28 +339,31 @@
The return value is a :term:`named tuple`, its items can be accessed by index
or as named attributes:

+------------------+-------+-------------------------+----------------------+
| Attribute        | Index | Value                   | Value if not present |
+==================+=======+=========================+======================+
| :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter   |
+------------------+-------+-------------------------+----------------------+
| :attr:`netloc`   | 1     | Network location part   | empty string         |
+------------------+-------+-------------------------+----------------------+
| :attr:`path`     | 2     | Hierarchical path       | empty string         |
+------------------+-------+-------------------------+----------------------+
| :attr:`query`    | 3     | Query component         | empty string         |
+------------------+-------+-------------------------+----------------------+
| :attr:`fragment` | 4     | Fragment identifier     | empty string         |
+------------------+-------+-------------------------+----------------------+
| :attr:`username` |       | User name               | :const:`None`        |
+------------------+-------+-------------------------+----------------------+
| :attr:`password` |       | Password                | :const:`None`        |
+------------------+-------+-------------------------+----------------------+
| :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
+------------------+-------+-------------------------+----------------------+
| :attr:`port`     |       | Port number as integer, | :const:`None`        |
|                  |       | if present              |                      |
+------------------+-------+-------------------------+----------------------+
+------------------+-------+-------------------------+-------------------------------+
| Attribute        | Index | Value                   | Value if not present          |
+==================+=======+=========================+===============================+
| :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter or         |
|                  |       |                         | empty string [1]_             |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`netloc`   | 1     | Network location part   | ``None`` or empty string [2]_ |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`path`     | 2     | Hierarchical path       | empty string                  |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`query`    | 3     | Query component         | ``None`` or empty string [2]_ |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`fragment` | 4     | Fragment identifier     | ``None`` or empty string [2]_ |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`username` |       | User name               | ``None``                      |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`password` |       | Password                | ``None``                      |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`hostname` |       | Host name (lower case)  | ``None``                      |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`port`     |       | Port number as integer, | ``None``                      |
|                  |       | if present              |                               |
+------------------+-------+-------------------------+-------------------------------+

.. [2] Depending on the value of the *allow_none* argument.

Reading the :attr:`port` attribute will raise a :exc:`ValueError` if
an invalid port is specified in the URL. See section
@@ -356,7 +388,7 @@

.. versionchanged:: 3.6
Out-of-range port numbers now raise :exc:`ValueError`, instead of
returning :const:`None`.
returning ``None``.

.. versionchanged:: 3.8
Characters that affect netloc parsing under NFKC normalization will
@@ -368,15 +400,30 @@
.. versionchanged:: 3.12
Leading WHATWG C0 control and space characters are stripped from the URL.

.. versionchanged:: 3.14
Added the *allow_none* parameter.

.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser

.. function:: urlunsplit(parts)
.. function:: urlunsplit(parts, *, keep_empty=False)

Combine the elements of a tuple as returned by :func:`urlsplit` into a
complete URL as a string. The *parts* argument can be any five-item
iterable. This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example, a ?
with an empty query; the RFC states that these are equivalent).
iterable.

This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a ``?`` with an empty query; the RFC states that these are equivalent).

If *keep_empty* is true, empty strings are kept in the result (for example,
a ``?`` for an empty query); only ``None`` components are omitted.
This makes it possible to restore a URL that was parsed with
``allow_none=True``.
By default, *keep_empty* is true if *parts* is the result of a
:func:`urlsplit` call with ``allow_none=True``.
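
Since *parts* can be any five-item iterable, a plain list or tuple works as well as a :func:`urlsplit` result. The sketch below uses only the existing default behaviour, and the component values are invented for illustration.

```python
from urllib.parse import urlunsplit

# scheme, netloc, path, query, fragment
parts = ["https", "example.com", "/a/b", "x=1&y=2", "top"]
full_url = urlunsplit(parts)
print(full_url)        # https://example.com/a/b?x=1&y=2#top

# Empty components are simply left out by default.
relative_url = urlunsplit(("", "", "docs/index.html", "", ""))
print(relative_url)    # docs/index.html
```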

.. versionchanged:: 3.14
Added the *keep_empty* parameter.


.. function:: urljoin(base, url, allow_fragments=True)
@@ -422,30 +469,35 @@
Behavior updated to match the semantics defined in :rfc:`3986`.


.. function:: urldefrag(url)
.. function:: urldefrag(url, *, allow_none=False)

If *url* contains a fragment identifier, return a modified version of *url*
with no fragment identifier, and the fragment identifier as a separate
string. If there is no fragment identifier in *url*, return *url* unmodified
and an empty string.
and an empty string (by default) or ``None`` if *allow_none* is true.

The return value is a :term:`named tuple`, its items can be accessed by index
or as named attributes:

+------------------+-------+-------------------------+----------------------+
| Attribute        | Index | Value                   | Value if not present |
+==================+=======+=========================+======================+
| :attr:`url`      | 0     | URL with no fragment    | empty string         |
+------------------+-------+-------------------------+----------------------+
| :attr:`fragment` | 1     | Fragment identifier     | empty string         |
+------------------+-------+-------------------------+----------------------+
+------------------+-------+-------------------------+-------------------------------+
| Attribute        | Index | Value                   | Value if not present          |
+==================+=======+=========================+===============================+
| :attr:`url`      | 0     | URL with no fragment    | empty string                  |
+------------------+-------+-------------------------+-------------------------------+
| :attr:`fragment` | 1     | Fragment identifier     | ``None`` or empty string [3]_ |
+------------------+-------+-------------------------+-------------------------------+

.. [3] Depending on the value of the *allow_none* argument.
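
A brief sketch of the default :func:`urldefrag` behaviour follows. It does not pass the proposed *allow_none* argument, and the URLs are hypothetical. It shows that a missing fragment and an empty fragment are reported identically by default.

```python
from urllib.parse import urldefrag

with_frag = urldefrag("http://example.com/a#intro")
no_frag = urldefrag("http://example.com/a")
empty_frag = urldefrag("http://example.com/a#")

print(with_frag.url, with_frag.fragment)   # http://example.com/a intro

# Both a missing fragment and an empty one come back as ''.
print(repr(no_frag.fragment), repr(empty_frag.fragment))   # '' ''
print(empty_frag.url)                      # http://example.com/a
```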

See section :ref:`urlparse-result-object` for more information on the result
object.

.. versionchanged:: 3.2
Result is a structured object rather than a simple 2-tuple.

.. versionchanged:: 3.14
Added the *allow_none* parameter.

.. function:: unwrap(url)

Extract the url from a wrapped URL (that is, a string formatted as
@@ -465,8 +517,9 @@
purity.

Instead of raising an exception on unusual input, they may instead return some
component parts as empty strings. Or components may contain more than perhaps
they should.
component parts as empty strings or ``None`` (depending on the value of the
*allow_none* argument).
Or components may contain more than perhaps they should.

We recommend that users of these APIs where the values may be used anywhere
with security implications code defensively. Do some verification within your
@@ -542,7 +595,8 @@
Return the re-combined version of the original URL as a string. This may
differ from the original URL in that the scheme may be normalized to lower
case and empty components may be dropped. Specifically, empty parameters,
queries, and fragment identifiers will be removed.
queries, and fragment identifiers will be removed unless the URL was parsed
with ``allow_none=True``.

For :func:`urldefrag` results, only empty fragment identifiers will be removed.
For :func:`urlsplit` and :func:`urlparse` results, all noted changes will be
@@ -559,6 +613,9 @@
>>> r2 = urlsplit(r1.geturl())
>>> r2.geturl()
'http://www.Python.org/doc/'
>>> r3 = urlsplit(url, allow_none=True)
>>> r3.geturl()
'http://www.Python.org/doc/#'


The following classes provide the implementations of the structured parse