
URL Validation in Python — Beyond Regex

A URL looks simple until you deal with encoding, unicode domains, protocol schemes, relative paths, and query string edge cases. Here's how to validate and parse URLs properly in Python — with a single API call.

1. Why URL validation is harder than it looks

URLs appear deceptively simple — a protocol, a domain, maybe a path. In reality, the full URL specification (RFC 3986) covers a surprising number of edge cases that make naive validation unreliable:

1) Percent-encoding

Spaces become %20, special characters get encoded. A valid URL can contain sequences like %E2%9C%93 that look like garbage but represent valid UTF-8 characters.

2) Unicode and IDN domains

Internationalised domain names like xn--nxasmq6b.com (Punycode) or direct unicode domains like münchen.de are perfectly valid but break most simple validators.

3) Protocol schemes

URLs are not just http:// and https://. There are ftp://, mailto:, tel:, data:, custom-app:// schemes, and protocol-relative URLs starting with //.

4) Relative URLs

Paths like /about, ../images/logo.png, or ?q=search are valid relative URLs but have no protocol or domain — context determines their meaning.

A proper URL validator needs to handle all of these while also decomposing the URL into its component parts — protocol, domain, port, path, query parameters, and fragment identifier.
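The percent-encoding point is easy to see with the standard library, which can round-trip these sequences:

```python
from urllib.parse import quote, unquote

# %E2%9C%93 is the percent-encoded UTF-8 byte sequence for the check mark
print(unquote("%E2%9C%93"))  # ✓
print(quote("✓"))            # %E2%9C%93
```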


2. The anatomy of a URL

Every URL breaks down into six main parts (RFC 3986 also defines an optional userinfo section before the host). Understanding these components is essential for proper validation and parsing:

https://example.com:8080/search?q=hello+world&lang=en#results
└─┬─┘   └────┬────┘ └┬─┘└──┬──┘ └─────────┬─────────┘ └──┬──┘
protocol  domain   port  path           query        fragment
Component  Example                Notes
Protocol   https                  The scheme — http, https, ftp, mailto, etc.
Domain     example.com            The hostname — can be an IP, IDN, or standard domain
Port       8080                   Optional — defaults to 80 (HTTP) or 443 (HTTPS)
Path       /search                The resource path — can contain encoded characters
Query      q=hello+world&lang=en  Key-value pairs after the ? delimiter
Fragment   results                Client-side anchor after the # — never sent to the server
ℹ️ The IsValid URL API returns all of these components as structured fields, so you do not need to parse the URL yourself. Query parameters are returned as a key-value object, making them immediately usable without manual splitting.
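You can also see these components with nothing but the standard library: urllib.parse.urlsplit pulls the example URL above apart (though, as section 3 shows, it splits without validating):

```python
from urllib.parse import urlsplit

parts = urlsplit("https://example.com:8080/search?q=hello+world&lang=en#results")
print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /search
print(parts.query)     # q=hello+world&lang=en
print(parts.fragment)  # results
```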

3. Why regex fails for URLs

RFC 3986 defines the URL syntax, and the full specification is far too complex for a practical regex. Most regex-based validators fall into the same traps:

import re

# ❌ Too strict — rejects valid URLs
SIMPLE_REGEX = re.compile(r'^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/.*)?$')

print(SIMPLE_REGEX.match('https://example.com:8080/app'))            # None — port number
print(SIMPLE_REGEX.match('https://例え.jp'))                          # None — IDN domain
print(SIMPLE_REGEX.match('ftp://files.example.com/doc.pdf'))         # None — non-http scheme
print(SIMPLE_REGEX.match('https://localhost:3000'))                   # None — no TLD

# ❌ Too loose — accepts invalid URLs
LOOSE_REGEX = re.compile(r'^https?://.*')
print(LOOSE_REGEX.match('https://'))            # Match ✗ — no domain
print(LOOSE_REGEX.match('https:// not a url'))  # Match ✗ — spaces in domain
print(LOOSE_REGEX.match('https://...'))         # Match ✗ — empty labels

The Punycode problem

Internationalised domain names are encoded as Punycode in DNS. The domain münchen.de becomes xn--mnchen-3ya.de. A regex that only allows ASCII letters will reject either the unicode form or the Punycode form (which contains the xn-- prefix).

Query string complexity

Query strings can contain encoded special characters, nested brackets (e.g. filter[name]=value), empty values, duplicate keys, and plus signs as spaces. A regex cannot meaningfully parse these — it would need a full URL parser.
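The standard library's parse_qs illustrates the point: it returns a list per key to cope with duplicates, and needs an explicit flag just to keep empty values — behaviour a regex cannot reproduce:

```python
from urllib.parse import parse_qs

qs = "filter[name]=ada&tag=a&tag=b&empty=&q=caf%C3%A9"
print(parse_qs(qs, keep_blank_values=True))
# {'filter[name]': ['ada'], 'tag': ['a', 'b'], 'empty': [''], 'q': ['café']}
```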

Python's urllib.parse is not enough

Python's standard library provides urllib.parse.urlparse() which splits a URL into components, but it does not validate. It will happily parse completely invalid strings without raising an error.

from urllib.parse import urlparse

# urlparse accepts almost anything without complaint
r = urlparse('https://')
print(r.scheme, r.netloc)  # 'https' '' — empty netloc, no error

r = urlparse('not-a-url')
print(r.scheme, r.path)    # '' 'not-a-url' — treated as a path

r = urlparse('https:// spaces in domain')
print(r.netloc)            # ' spaces in domain' — no validation

# It also silently accepts unusual inputs
r = urlparse('blob:null/uuid')
print(r.scheme)            # 'blob' — valid parse, but not a network URL
⚠️ Relying on urlparse() as your only validation means invalid URLs are never rejected: for virtually any malformed input it returns a result without raising an exception.
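If you only want a cheap structural pre-check before calling a real validator, a minimal sketch on top of urlparse might look like this (looks_like_http_url is our own illustrative helper, not a substitute for full validation):

```python
from urllib.parse import urlparse

def looks_like_http_url(value: str) -> bool:
    """Cheap structural pre-check — not a substitute for real validation."""
    try:
        parts = urlparse(value)
    except ValueError:  # raised for a few pathological netlocs
        return False
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and " " not in parts.netloc
    )

print(looks_like_http_url("https://example.com"))        # True
print(looks_like_http_url("https://"))                   # False
print(looks_like_http_url("https:// spaces in domain"))  # False
```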

4. The right solution

The IsValid URL API validates and parses URLs in a single request. It returns a boolean validity flag plus all decomposed components — protocol, domain, path, query parameters as a structured object, port, and fragment.

Validation: <10ms (parse + validate)
Parsed fields: 7 (protocol to fragment)
Free tier: 100 calls/day (no credit card)

Full parameter reference and response schema: URL Validation API docs →


5. Python code example

Using the isvalid-sdk Python SDK or the popular requests library. Install with pip install isvalid-sdk or pip install requests.

# url_validator.py
import os
from isvalid_sdk import IsValidConfig, create_client

iv = create_client(IsValidConfig(api_key=os.environ["ISVALID_API_KEY"]))

# ── Example usage ─────────────────────────────────────────────────────────────

result = iv.url("https://example.com/search?q=hello+world&lang=en#results")
print(result["valid"])     # True
print(result["protocol"])  # 'https'
print(result["isHttps"])   # True
print(result["domain"])    # 'example.com'
print(result["path"])      # '/search'
print(result["query"])     # {'q': 'hello world', 'lang': 'en'}
print(result["hash"])      # 'results'

In a Django or Flask webhook handler — validate user-submitted URLs before storing:

# views.py (Django)
import os

from django.http import JsonResponse
import requests as http_client

API_URL = "https://api.isvalid.dev/v0/url"


def validate_url(value: str) -> dict:
    """Call the IsValid URL endpoint and return its parsed JSON response."""
    resp = http_client.get(
        API_URL,
        params={"value": value},
        headers={"Authorization": f"Bearer {os.environ['ISVALID_API_KEY']}"},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()


def shorten_url(request):
    url = request.POST.get("url", "").strip()

    if not url:
        return JsonResponse({"error": "URL is required"}, status=400)

    try:
        check = validate_url(url)
    except http_client.Timeout:
        return JsonResponse({"error": "URL validation service timeout"}, status=502)
    except http_client.HTTPError as exc:
        return JsonResponse({"error": str(exc)}, status=502)

    if not check["valid"]:
        return JsonResponse({"error": "Invalid URL"}, status=400)

    if not check["isHttps"]:
        return JsonResponse({
            "error": "Only HTTPS URLs are accepted for security reasons.",
        }, status=400)

    # Proceed with URL shortening
    short_link = create_short_link(
        original_url=url,
        domain=check["domain"],
        path=check["path"],
    )
    return JsonResponse({"short_url": short_link})

The Flask equivalent, using the same validate_url() helper:

# app.py (Flask)
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route("/shorten", methods=["POST"])
def shorten():
    url = (request.get_json(silent=True) or {}).get("url", "").strip()

    if not url:
        return jsonify(error="URL is required"), 400

    try:
        check = validate_url(url)
    except Exception:
        return jsonify(error="URL validation service unavailable"), 502

    if not check["valid"]:
        return jsonify(error="Invalid URL"), 400

    if not check.get("isHttps"):
        return jsonify(error="Only HTTPS URLs are accepted"), 400

    short_link = create_short_link(url, domain=check["domain"])
    return jsonify(short_url=short_link)
Use the parsed domain field to build allowlists or blocklists. For example, you can reject URLs pointing to known phishing domains without needing to parse the URL yourself.
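A blocklist check over the parsed domain field might be sketched like this (BLOCKED_DOMAINS and the parent-domain walk are our own illustration, not part of the API):

```python
BLOCKED_DOMAINS = {"phishing.example", "malware.example"}  # hypothetical blocklist

def is_blocked(check: dict) -> bool:
    """Reject URLs whose parsed domain, or any parent domain, is blocklisted."""
    domain = (check.get("domain") or "").lower()
    parts = domain.split(".")
    # Build every suffix: login.phishing.example -> {login.phishing.example, phishing.example, example}
    candidates = {".".join(parts[i:]) for i in range(len(parts))}
    return bool(candidates & BLOCKED_DOMAINS)

print(is_blocked({"domain": "phishing.example"}))        # True
print(is_blocked({"domain": "login.phishing.example"}))  # True — subdomain of a blocked domain
print(is_blocked({"domain": "example.com"}))             # False
```

Walking parent domains catches subdomains of a blocked domain, which a plain set lookup would miss.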

6. cURL example

Validate a URL with query parameters and fragment:

curl -G -H "Authorization: Bearer YOUR_API_KEY" \
  --data-urlencode "value=https://example.com/search?q=hello+world&lang=en#results" \
  "https://api.isvalid.dev/v0/url"

Test with a URL that has a port:

curl -G -H "Authorization: Bearer YOUR_API_KEY" \
  --data-urlencode "value=https://api.example.com:8080/v1/users" \
  "https://api.isvalid.dev/v0/url"

Test with an invalid URL:

curl -G -H "Authorization: Bearer YOUR_API_KEY" \
  --data-urlencode "value=not-a-url" \
  "https://api.isvalid.dev/v0/url"

7. Understanding the response

Valid HTTPS URL with query parameters and fragment:

{
  "valid": true,
  "protocol": "https",
  "isHttps": true,
  "domain": "example.com",
  "path": "/search",
  "query": { "q": "hello world", "lang": "en" },
  "port": null,
  "hash": "results"
}

Valid URL with explicit port and no query or fragment:

{
  "valid": true,
  "protocol": "https",
  "isHttps": true,
  "domain": "api.example.com",
  "path": "/v1/users",
  "query": {},
  "port": "8080",
  "hash": null
}

Invalid URL:

{
  "valid": false
}
Field     Type           Description
valid     boolean        Whether the URL is structurally valid
protocol  string         The URL scheme — e.g. "https", "http", "ftp"
isHttps   boolean        true if the protocol is HTTPS
domain    string         The hostname portion of the URL
path      string         The path component after the domain
query     object         Parsed query string as key-value pairs
port      string | null  The port number if explicitly specified, null otherwise
hash      string | null  The fragment identifier (without the # prefix), null if absent

8. Edge cases

Internationalised domain names (IDN)

URLs with unicode domains like https://münchen.de/info are valid. They get encoded as Punycode (xn--mnchen-3ya.de) in DNS. The IsValid API handles both forms — you can submit either the unicode or Punycode version and get a valid parse.

# Both forms are accepted
unicode_result = iv.url("https://münchen.de/info")
punycode_result = iv.url("https://xn--mnchen-3ya.de/info")
# Both return valid: True with domain parsed correctly

# You can also convert between forms using the idna library
import idna  # pip install idna

domain = "münchen.de"
ascii_domain = idna.encode(domain).decode("ascii")
print(ascii_domain)  # xn--mnchen-3ya.de

Data URIs

Data URIs (data:text/html;base64,...) are technically valid URIs but are not network URLs. Depending on your use case, you may want to reject them after validation by checking that the protocol field is http or https.
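A small helper over the response fields can enforce that policy (is_network_url is our own sketch, assuming the response shape shown in section 7):

```python
def is_network_url(check: dict) -> bool:
    """Accept only URLs that validated AND use an http(s) scheme."""
    return bool(check.get("valid")) and check.get("protocol") in ("http", "https")

print(is_network_url({"valid": True, "protocol": "https"}))  # True
print(is_network_url({"valid": True, "protocol": "data"}))   # False — valid URI, not a network URL
print(is_network_url({"valid": False}))                      # False
```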

Missing protocol

Users often type example.com without a protocol. This is not a valid URL per RFC 3986. If you want to be user-friendly, prepend https:// before validating:

import re

def normalize_url(user_input: str) -> str:
    """Prepend https:// if no scheme is present.

    Caveat: a scheme may contain letters, digits, +, - and . per RFC 3986,
    so host:port inputs like example.com:8080 match the scheme pattern,
    are left untouched, and will fail validation.
    """
    trimmed = user_input.strip()
    if not re.match(r'^[a-zA-Z][a-zA-Z0-9+.-]*:', trimmed):
        return f"https://{trimmed}"
    return trimmed

result = iv.url(normalize_url("example.com/path"))
# Validates https://example.com/path

Query string encoding

The query object in the response contains decoded key-value pairs. Plus signs in query values are decoded as spaces (e.g. q=hello+world becomes {"q": "hello world"}). Percent-encoded characters are also decoded, so q=caf%C3%A9 becomes {"q": "café"}.


Summary

- Do not use a regex to validate URLs — RFC 3986 is too complex for a single pattern
- Do not rely on urlparse() alone — it never rejects invalid input
- Use the IsValid API to validate and parse URLs in a single call
- Check the protocol field to enforce HTTPS-only policies
- Prepend https:// for user-entered URLs missing a scheme
- Use the parsed query dict directly — no manual splitting needed


Validate and parse URLs instantly

Free tier includes 100 API calls per day. No credit card required. Full URL decomposition with protocol, domain, path, query, port, and fragment — under 10ms.