Wednesday, January 7, 2026

macOS: Capture Video's Frames using Bash

This blog post covers how to capture frames from a video on macOS using Bash. Videos typically run at 24 or 30 frames per second, so taking a screenshot every second will likely capture a representative frame of the video. macOS provides a command-line utility, screencapture, that takes screenshots.

The man page for screencapture is viewed as follows:

man screencapture

The key command-line argument used by this script is -x, which takes the screenshot without playing the camera sound; the man page documents the remaining options.
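For example, the following captures the current screen silently to a file (the output path is arbitrary):

screencapture -x ~/Desktop/frame.png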

When running screencapture from a macOS terminal for the first time, the system may display a permission prompt. Permission must be granted for Terminal to record the computer's screen and audio. Similarly, when running screencapture in a Bash script from Visual Studio Code, the same permissions must be granted to Visual Studio Code.

To screenshot a video, run the Bash script, then display the video full screen. The script will capture an image each second. By default, the script runs for 26 minutes, as a great many shows are 24 minutes long. A command-line option (-d) is provided to change the capture duration.

If you are reading this, you are a developer, so the command-line options are documented in the script's comments. The script is less than two hundred lines of code, and most of that is command-line processing.
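A typical invocation looks like the following (the script name capture_frames.sh is my choice; use whatever name you saved it under):

./capture_frames.sh -n "Frieren" -s 1 -e 2 -d 24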

Here is the script in its entirety:

#!/usr/bin/env bash
# Version 0.0 January 1, 2025
# original release
# Version 0.1 January 2, 2025
# Made screen capture loop deduct processing time to hit closer to 1 second
# Screenshot Capture Tool (macOS)
#
# Purpose
# -------
# Captures full-screen screenshots at a fixed interval for a fixed duration and
# writes them into a deterministic, hierarchical folder structure suitable for
# show / episodic video processing workflows.
#
# The tool is designed to capture a complete episode into a "raw" frame set that
# can later be post-processed (pruning, cropping, encoding, etc.).
#
#
# Folder Structure
# ----------------
# The output directory structure is built from the provided parameters as follows:
#
#   <CONTENT_ROOT>/
#     <SHOW_NAME>/
#       Season<SS>/
#         EP<EE>/
#           raw/
#             000000.png
#             000001.png
#             000002.png
#             ...
#
# Where:
#   <SS> = zero-padded season number (2 digits)
#   <EE> = zero-padded episode number (2 digits)
#
#
# Example
# -------
# Given:
#   CONTENT_ROOT = ~/ShowContent
#   SHOW_NAME    = Frieren
#   SEASON       = 1
#   EPISODE      = 2
#
# The resulting output directory will be:
#
#   ~/ShowContent/Frieren/Season01/EP02/raw/
#
# With files:
#
#   000000.png
#   000001.png
#   000002.png
#   ...
#
#
# File Naming
# -----------
# - Frames are named sequentially using a zero-padded counter:
#     %06d.png
# - Example: 000000.png, 000001.png, 000002.png
# - Existing files are never overwritten; the script exits on collision.
#
#
# Required Parameters
# -------------------
#   -n <show name>  Show name (folder name, verbatim)
#   -s <season>     Season number (integer)
#   -e <episode>    Episode number (integer)
#
#
# Optional Parameters
# -------------------
#   -r <root>      Content root directory (default: ~/ShowContent)
#   -i <seconds>   Capture interval in seconds (default: 1)
#   -d <minutes>   Total capture duration in minutes (default: 26)
#
#
# Notes
# -----
# - Uses macOS `screencapture` (same pipeline as Cmd+Shift+3)
# - Requires screen recording permission for the shell/terminal
# - Intended for deterministic, repeatable frame extraction
#

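# Fail fast: exit on any error, treat unset variables as errors, and
# propagate failures through pipelines.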
set -euo pipefail

CONTENT_ROOT="$HOME/ShowContent"
CAPTURE_INTERVAL_SECONDS=1
RUN_DURATION_MINUTES=26
RUN_DURATION_SECONDS=$((RUN_DURATION_MINUTES * 60))

SHOW_NAME=""
SEASON_NUMBER=""
EPISODE_NUMBER=""
RAW_FOLDER_NAME="raw"

usage() {
  echo "Usage: $0 -n <show name> -s <season> -e <episode> [-r <root>] [-i <interval>] [-d <duration>]"
  exit 1
}

while getopts "n:s:e:r:i:d:" opt; do
  case "$opt" in
    n) SHOW_NAME="$OPTARG" ;;
    s) SEASON_NUMBER="$OPTARG" ;;
    e) EPISODE_NUMBER="$OPTARG" ;;
    r) CONTENT_ROOT="$OPTARG" ;;
    i) CAPTURE_INTERVAL_SECONDS="$OPTARG" ;;
    d) RUN_DURATION_MINUTES="$OPTARG"
      RUN_DURATION_SECONDS=$((RUN_DURATION_MINUTES * 60))
      ;;
    *) usage ;;
  esac
done

if [[ -z "$SHOW_NAME" || -z "$SEASON_NUMBER" || -z "$EPISODE_NUMBER" ]]; then
  usage
fi

# Resolve CONTENT_ROOT to an absolute path (its parent directory must already exist).
CONTENT_ROOT=$(cd "$(dirname "$CONTENT_ROOT")" && pwd)/$(basename "$CONTENT_ROOT")

# If the root lives on an external volume, verify the volume is mounted.
# Take the first path component after /Volumes/ so nested roots such as
# /Volumes/MyDrive/Sub/ShowContent resolve to /Volumes/MyDrive.
if [[ "$CONTENT_ROOT" == /Volumes/* ]]; then
  VOLUME_NAME="${CONTENT_ROOT#/Volumes/}"
  VOLUME_ROOT="/Volumes/${VOLUME_NAME%%/*}"

  if [ ! -d "$VOLUME_ROOT" ]; then
    echo "External drive not mounted: $VOLUME_ROOT" >&2
    exit 1
  fi
fi

if ! mkdir -p "$CONTENT_ROOT"; then
  echo "Failed to create content root: $CONTENT_ROOT" >&2
  exit 1
fi

# Force base-10 so zero-padded input such as "08" is not parsed as octal.
SEASON_TAG=$(printf '%02d' "$((10#$SEASON_NUMBER))")
EPISODE_TAG=$(printf '%02d' "$((10#$EPISODE_NUMBER))")

RAW_DIR="$CONTENT_ROOT/$SHOW_NAME/Season$SEASON_TAG/EP$EPISODE_TAG/$RAW_FOLDER_NAME"

mkdir -p "$RAW_DIR"

printf 'Starting capture\n'
printf '  Show         : %s\n' "$SHOW_NAME"
printf '  Season       : %s\n' "$SEASON_TAG"
printf '  Episode      : %s\n' "$EPISODE_TAG"
printf '  Content root : %s\n' "$CONTENT_ROOT"
printf '  Output dir   : %s\n' "$RAW_DIR"
printf '  Interval (s) : %d\n' "$CAPTURE_INTERVAL_SECONDS"
printf '  Duration (m) : %d\n' "$RUN_DURATION_MINUTES"
printf '\n'

START_TIME_SECONDS=$SECONDS
END_TIME_SECONDS=$((START_TIME_SECONDS + RUN_DURATION_SECONDS))
FRAME_COUNTER=0
LAST_STATUS_SECONDS=0

while (( SECONDS < END_TIME_SECONDS )); do
  ITERATION_START=$SECONDS
  FRAME_NAME=$(printf '%06d.png' "$FRAME_COUNTER")
  FILE_PATH="$RAW_DIR/$FRAME_NAME"

  if [ -e "$FILE_PATH" ]; then
    echo "Frame file already exists: $FILE_PATH" >&2
    exit 1
  fi

  screencapture -x "$FILE_PATH"
  if [ ! -s "$FILE_PATH" ]; then
    echo "Screenshot failed or file not written: $FILE_PATH" >&2
    exit 1
  fi

  FRAME_COUNTER=$((FRAME_COUNTER + 1))

  if (( SECONDS - LAST_STATUS_SECONDS >= 30 )); then
    ELAPSED=$((SECONDS - START_TIME_SECONDS))
    printf '[%02d:%02d] frames=%d last=%s\n' \
      $((ELAPSED/60)) $((ELAPSED%60)) "$FRAME_COUNTER" "$FRAME_NAME"
    LAST_STATUS_SECONDS=$SECONDS
  fi

  ITERATION_END=$SECONDS
  PROCESSING_TIME=$((ITERATION_END - ITERATION_START))
  SLEEP_TIME=$((CAPTURE_INTERVAL_SECONDS - PROCESSING_TIME))

  if (( SLEEP_TIME > 0 )); then
    sleep "$SLEEP_TIME"
  fi
done

Tuesday, December 23, 2025

macOS: Configure Visual Studio Code to Run the Bash Debug Extension

The Bash Debug Extension, as of version 0.3.9, requires at least Bash 4.x. macOS natively ships with Bash 3.2.

A convenient way to install a newer Bash is with Homebrew. If Homebrew is not already installed, invoke the following from a terminal session:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

The Homebrew installer provides the following steps to finalize the installation for the current user:

- Run these commands in your terminal to add Homebrew to your PATH:
    echo >> /Users/<username>/.zprofile
    echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/<username>/.zprofile
    eval "$(/opt/homebrew/bin/brew shellenv)"

Install Bash (currently 5.3.9) using Homebrew:

brew install bash
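To confirm the Homebrew Bash installed correctly, run it directly and check the version:

/opt/homebrew/bin/bash --version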

To have the Bash Debug Extension run the Homebrew Bash rather than the system Bash, specify the pathBash attribute in launch.json as follows:

{
    "version": "0.2.0",
    "configurations": [
        {
            "type": "bashdb",
            "request": "launch",
            "name": "Bash-Debug (simplest configuration)",
            "program": "${file}",
            "pathBash": "/opt/homebrew/bin/bash"
        }
    ]
}

Monday, September 29, 2025

Git Command-Line: Listing All Untracked Files

When a new folder is added to a local Git repo, "git status" only shows the folder name:

C:\repo>git status
On branch bug/jdnark/1234408-Fix-Automation-Unzip

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        CICDAutomation/

To see all the files that are untracked, and not just the folder, use the following:

git status --untracked-files=all
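Equivalently, the short form of the option can be used:

git status -uall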

An example of the files that are untracked is as follows:

C:\repo>git status --untracked-files=all
On branch bug/jdnark/1234408-Fix-Automation-Unzip
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        Automation/.gitignore
        Automation/Readme.md
        Automation/get_logs.py

nothing added to commit but untracked files present (use "git add" to track)

Monday, September 22, 2025

Visual Studio Code: A launch.json for Python Development

Having spent years as a PowerShell developer, I find myself in need of a launch.json file for Python work. To add it, create a new file named launch.json under the .vscode folder in Visual Studio Code. An example launch.json is as follows:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Current File",
      "type": "debugpy",
      "request": "launch",
      "program": "${file}",
      "console": "integratedTerminal"
    },
    {
      "name": "Python: Upload CVEs",
      "type": "debugpy",
      "request": "launch",
      "program": "${workspaceFolder}/upload_cve_docs.py",
      "console": "integratedTerminal",
      "args": [
        "--file", "cve.jsonl",
        "--format", "jsonl",
        "--merge"
      ]
    }
  ]
}

The configuration, Python: Current File, allows me to debug the current file I have open. The configuration, Python: Upload CVEs, is specific to a script I am debugging (upload_cve_docs.py) and the parameters associated with that script.
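For reference, the Python: Upload CVEs configuration is equivalent to running the script from the workspace folder in a terminal:

python3 upload_cve_docs.py --file cve.jsonl --format jsonl --merge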

The type attribute should be set as follows:

      "type": "debugpy",

Older versions of launch.json specified:

      "type": "python",

Using the type value python will display a deprecation warning, as python has been deprecated in favor of debugpy.

WSL/Ubuntu: Make Visual Studio Code Always Use Your Python Virtual Environment

Recall from WSL: Install Azure OpenAI SDK that in order to install the Azure OpenAI SDK on Ubuntu, a virtual environment had to be used. When running Visual Studio Code and developing with Python, the terminal associated with this virtual environment must be used; otherwise, the Azure OpenAI SDK-related packages will not be found.

This is not an issue in an interactive terminal: my .bashrc contains the following line, which activates the virtual environment each time a terminal is opened:

source ~/.venvs/azure/bin/activate

During development, Visual Studio Code should always use this virtual environment: every time it is launched from Ubuntu, the Python interpreter associated with the virtual environment must be selected.

To set this behavior globally, as opposed to per-project, open Visual Studio Code from your WSL Ubuntu instance:

code .

From Visual Studio Code, press Ctrl+Shift+P (Command Palette) and search for:

Preferences: Open Remote Settings (JSON)

The previous command opens the global settings file. Insert the following lines, replacing the path with your venv’s Python interpreter:

{
  "python.defaultInterpreterPath": "/home/jann/.venvs/azure/bin/python",
  "python.terminal.activateEnvironment": true
}

The example above is from my personal environment, so use your own specific virtual environment setup. 

The python.defaultInterpreterPath setting points Visual Studio Code to your virtual environment. The setting python.terminal.activateEnvironment ensures it is auto-activated in the integrated terminal.

Once these changes have been made, save the file and close it.

Moving forward, every folder you open in Visual Studio Code will default to the virtual environment interpreter automatically.
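To verify that the integrated terminal picked up the virtual environment, check which interpreter resolves (on my machine this prints /home/jann/.venvs/azure/bin/python):

which python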

This can also be configured on a per-folder basis by editing .vscode/settings.json, adding the following lines, and saving the file after modification:

{
    "python.defaultInterpreterPath": "/home/jann/.venvs/azure/bin/python",
    "python.terminal.activateEnvironment": true,
    "python.envFile": "${workspaceFolder}/.env"    
}

Again, the settings shown above (python.defaultInterpreterPath) are specific to my environment, and you should use your own virtual environment settings.

Sunday, September 21, 2025

AI: Sample Dataset – National Vulnerability Database (NVD) CVE

Any AI application needs seed data, such as critical business data. With the help of ChatGPT, I created a Python app that downloads records from the National Vulnerability Database (NVD) CVE feed. The data is a structured dump of security vulnerability records retrieved from the REST endpoint:

https://services.nvd.nist.gov/rest/json/cves/2.0

The file size for all records in 2024 is approximately 100 MB. Be aware that Microsoft's documentation on Azure AI Search pricing specifies a size limit of 50 MB for the Free tier, so a full year of data will not fit on that tier.
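To check a generated file against that limit (the file name below is the default output of the script presented later for 2024):

ls -lh cves-2024.jsonl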

An example record retrieved from the National Vulnerability Database's CVE feed is as follows:

{
  "id": "CVE-2024-12345",
  "title": "Buffer overflow in XYZ software before 1.2.3 allows remote code execution.",
  "description": "Buffer overflow in XYZ software before 1.2.3 allows remote attackers to execute code.",
  "severity": "CRITICAL",
  "score": 9.8,
  "cwes": ["CWE-120"],
  "references": ["https://vendor.com/security/advisory/123"],
  "published": "2024-03-05T17:00:00Z",
  "last_modified": "2024-05-01T12:34:00Z",
  "source": "nvd"
}

The endpoint rejects clients that issue too many requests and cannot handle overly large requests. To work around this:

  • A paging strategy was implemented where records were retrieved 120 days at a time.
  • A current status checkpoint file, checkpoint.json, was maintained so the query could restart from the failure point.
  • Iterations made use of timeouts (time.sleep) between page retrievals and between retries after errors.

On any failure, simply wait a few minutes and rerun the script; it will resume from the last checkpoint.

The Python code is as follows:

#!/usr/bin/env python3
import argparse
import datetime as dt
import io
import json
import os
import random
import sys
import time
import urllib.request
import urllib.parse
import urllib.error
from typing import Optional, List, Dict, Tuple

API_BASE = "https://services.nvd.nist.gov/rest/json/cves/2.0"
UA = "Mozilla/5.0 (cve-extractor/1.0)"
MAX_WINDOW_DAYS = 120
DEFAULT_PAGE_SIZE = 1000
MAX_RETRIES = 6
BACKOFF_BASE = 1.6
JITTER_MAX = 0.5
DEFAULT_CHECKPOINT = "checkpoint.json"

def _get_default_year() -> int:
    """Gets most recent full years (all days in year) versus say the current year which is partial."""
    LAST_MONTH_OF_YEAR = 12
    LAST_DAY_OF_DECEMBER = 31
    today = dt.date.today()
    year = today.year
    if today.month < LAST_MONTH_OF_YEAR or today.day < LAST_DAY_OF_DECEMBER:
        year -= 1
    return year

def _windows_for_year(year: int) -> List[Tuple[str, str]]:
    """Yield (start_iso, end_iso) windows of ≤120 days across the year."""
    start = dt.datetime(year, 1, 1, 0, 0, 0, 0)
    year_end = dt.datetime(year, 12, 31, 23, 59, 59, 999000)
    step = dt.timedelta(days=MAX_WINDOW_DAYS)
    cur = start
    out = []
    while cur <= year_end:
        end = min(cur + step - dt.timedelta(seconds=1), year_end)
        s = cur.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]
        e = end.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]
        out.append((s, e))
        cur = end + dt.timedelta(seconds=1)
    return out

def _request_json(url: str, api_key: Optional[str], retries: int = MAX_RETRIES) -> dict:
    """HTTP GET with retries/backoff; honors Retry-After for 429/403/503."""
    headers = {"User-Agent": UA, "Accept": "application/json"}
    if api_key:
        headers["apiKey"] = api_key
    last_err = None
    for attempt in range(retries):
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req, timeout=60) as r:
                return json.loads(r.read().decode("utf-8"))
        except urllib.error.HTTPError as e:
            last_err = e
            if e.code in (429, 403, 503):
                retry_after = 0.0
                try:
                    ra = e.headers.get("Retry-After")
                    if ra:
                        retry_after = float(ra)
                except Exception:
                    retry_after = 0.0
                backoff = max(retry_after, (BACKOFF_BASE ** attempt) + random.uniform(0, JITTER_MAX))
                sys.stderr.write(f"HTTP {e.code} -> backoff {backoff:.2f}s (attempt {attempt+1}/{retries})\n")
                time.sleep(backoff)
                continue
            raise
        except urllib.error.URLError as e:
            last_err = e
            backoff = (BACKOFF_BASE ** attempt) + random.uniform(0, JITTER_MAX)
            sys.stderr.write(f"Network error '{e.reason}' -> retry in {backoff:.2f}s (attempt {attempt+1}/{retries})\n")
            time.sleep(backoff)
            continue
    raise last_err

def _flatten_vuln(v: dict) -> Dict[str, object]:
    """Flatten one NVD v2 vulnerability object to a compact record for RAG."""
    cve = v.get("cve", {})
    cve_id = cve.get("id")
    # description (english)
    desc = ""
    for d in cve.get("descriptions", []):
        if d.get("lang") == "en":
            desc = d.get("value", "")
            break
    # metrics: prefer v3.1 → v3.0 → v2
    severity = None
    score = None
    metrics = cve.get("metrics", {}) if isinstance(cve.get("metrics", {}), dict) else {}
    for key in ("cvssMetricV31", "cvssMetricV30", "cvssMetricV2"):
        arr = metrics.get(key) or []
        if arr:
            m = arr[0]
            if key.startswith("cvssMetricV3"):
                cd = m.get("cvssData", {})
                severity, score = cd.get("baseSeverity"), cd.get("baseScore")
            else:
                severity = m.get("baseSeverity")
                score = m.get("cvssData", {}).get("baseScore", m.get("baseScore"))
            break
    # CWEs
    cwes = []
    for w in cve.get("weaknesses", []):
        for d in w.get("description", []):
            if d.get("lang") == "en" and d.get("value"):
                cwes.append(d["value"])
    # references
    refs = [r.get("url") for r in cve.get("references", []) if r.get("url")]
    return {
        "id": cve_id,
        "title": (desc.split("\n", 1)[0].strip() if desc else cve_id),
        "description": desc,
        "severity": severity,
        "score": score,
        "cwes": cwes,
        "references": refs,
        "published": cve.get("published"),
        "last_modified": cve.get("lastModified"),
        "source": "nvd",
    }

def _write_jsonl_records(path: str, records: List[dict], flatten: bool):
    """Append JSONL records to file (create if not exists)."""
    mode = "a" if os.path.exists(path) else "w"
    with open(path, mode, encoding="utf-8") as f:
        for v in records:
            rec = _flatten_vuln(v) if flatten else v
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# -----------------------
# checkpointing
# -----------------------
def _load_checkpoint(path: str) -> Optional[dict]:
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def _save_checkpoint(path: str, year: int, window: Tuple[str, str], next_index: int, out_path: str):
    tmp = path + ".tmp"
    data = {
        "year": year,
        "window": {"start": window[0], "end": window[1]},
        "next_index": next_index,
        "out": out_path,
        "updated": dt.datetime.utcnow().isoformat(timespec="seconds") + "Z",
    }
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    os.replace(tmp, path)

def _clear_checkpoint(path: str):
    try:
        os.remove(path)
    except FileNotFoundError:
        pass

# -----------------------
# main fetch (API + pagination + checkpoint)
# -----------------------
def fetch_year_to_jsonl(year: int,
                        out_path: str,
                        api_key: Optional[str],
                        page_size: int,
                        page_delay: float,
                        window_delay: float,
                        flatten: bool,
                        checkpoint_path: Optional[str],
                        resume: bool):
    # Prepare windows
    windows = _windows_for_year(year)

    # Checkpoint load
    start_window_idx = 0
    resume_index = 0
    if resume and checkpoint_path:
        cp = _load_checkpoint(checkpoint_path)
        if cp and cp.get("year") == year and cp.get("out") == out_path:
            w = cp.get("window") or {}
            if "start" in w and "end" in w and "next_index" in cp:
                try:
                    start_window_idx = windows.index((w["start"], w["end"]))
                    resume_index = int(cp["next_index"])
                    sys.stderr.write(f"Resuming from window {start_window_idx+1}/{len(windows)} "
                                     f"@ startIndex={resume_index}\n")
                except ValueError:
                    sys.stderr.write("Checkpoint window not found in computed windows; starting fresh.\n")

    # For a clean restart (new out file) consider removing existing file;
    # here we append to allow true resume.
    for w_idx, (start_iso, end_iso) in enumerate(windows[start_window_idx:], start=start_window_idx):
        sys.stderr.write(f"Window {w_idx+1}/{len(windows)}: {start_iso} → {end_iso}\n")
        start_index = resume_index if w_idx == start_window_idx else 0

        while True:
            qs = urllib.parse.urlencode({
                "pubStartDate": start_iso,
                "pubEndDate":   end_iso,
                "startIndex":   start_index,
                "resultsPerPage": page_size,
            })
            url = f"{API_BASE}?{qs}"
            data = _request_json(url, api_key)
            batch = data.get("vulnerabilities", []) or []
            total = int(data.get("totalResults", 0))

            if batch:
                _write_jsonl_records(out_path, batch, flatten=flatten)
                start_index += len(batch)
                # Save checkpoint after each page
                if checkpoint_path:
                    _save_checkpoint(checkpoint_path, year, (start_iso, end_iso), start_index, out_path)
                sys.stderr.write(
                    f"  window {w_idx+1}/{len(windows)} "
                    f"page {start_index//page_size+1}: +{len(batch):,} / {total:,} in this window\n"
                )
                # Page pacing
                if page_delay > 0:
                    time.sleep(page_delay)
            else:
                break

            if start_index >= total:
                break

        # reset resume index for next window
        resume_index = 0
        # Small delay between windows
        if window_delay > 0:
            time.sleep(window_delay)

    if checkpoint_path:
        _clear_checkpoint(checkpoint_path)

def main():
    default_year = _get_default_year()
    ap = argparse.ArgumentParser(description='Download and export NVD CVEs for a year to JSONL (with resume).')
    ap.add_argument('--year', type=int, default=default_year, help='CVE year (e.g., 2024)')
    ap.add_argument('--out', type=str, default=None, help='Output JSONL path (default: cves-<year>.jsonl)')
    ap.add_argument('--api-key', type=str, default=None, help='Optional NVD API key (header: apiKey)')
    ap.add_argument('--raw', action='store_true', help='Write raw NVD objects instead of flattened records')
    ap.add_argument('--page-size', type=int, default=DEFAULT_PAGE_SIZE, help='API resultsPerPage (default: 1000)')
    ap.add_argument('--page-delay', type=float, default=0.8, help='Seconds to sleep between pages')
    ap.add_argument('--window-delay', type=float, default=1.5, help='Seconds to sleep between 120-day windows')
    ap.add_argument('--checkpoint', type=str, default=DEFAULT_CHECKPOINT, help='Checkpoint file path')
    ap.add_argument('--no-resume', 
                    action='store_false', 
                    dest='resume',
                    help='Do not resume from checkpoint (default: resume)')
    ap.add_argument('--rate', type=int, default=0, help='Target records/hour (overrides --page-delay if set)')
    ap.add_argument('--limit', 
                    type=int, 
                    default=0, 
                    help='(Deprecated here) Not applied because we stream-write pages')

    args = ap.parse_args()
    out_path = args.out or f"cves-{args.year}.jsonl"

    # If a rate is specified, compute per-page sleep to approximate that rate.
    page_delay = args.page_delay
    if args.rate and args.rate > 0:
        # seconds per page = 3600 * page_size / rate
        page_delay = (3600.0 * args.page_size) / float(args.rate)
        sys.stderr.write(f"Rate target {args.rate} recs/hour → page_delay ≈ {page_delay:.2f}s\n")

    try:
        fetch_year_to_jsonl(
            year=args.year,
            out_path=out_path,
            api_key=args.api_key,
            page_size=args.page_size,
            page_delay=page_delay,
            window_delay=args.window_delay,
            flatten=(not args.raw),
            checkpoint_path=args.checkpoint,
            resume=args.resume,
        )
    except Exception as e:
        print(f"ERROR: {e}", file=sys.stderr)
        sys.exit(1)

    print(f"Done. Output: {out_path}")

if __name__ == '__main__':
    main()
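A typical invocation for 2024 with an optional NVD API key looks like the following (the script name fetch_nvd_cves.py and the NVD_API_KEY environment variable are my choices):

python3 fetch_nvd_cves.py --year 2024 --api-key "$NVD_API_KEY"

Omit --api-key to run without a key; the backoff logic in _request_json handles rate limiting either way.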


Saturday, September 20, 2025

WSL: Install Azure OpenAI SDK

I have a pristine Ubuntu (Ubuntu-24.04) distro running on WSL (see: WSL: Keeping a Pristine (Greenfield) Version of Your Distro). This blog post presents the steps required to install the Azure OpenAI SDK on this distro.

First, ensure packages are up to date:

sudo apt update

Ensure Python is installed:

python3 --version

The standard Ubuntu distro contains Python by default, as shown by the output of python3 --version:

Python 3.12.3

Ubuntu 24.04 blocks installing Python packages system-wide with pip to protect system tools. This means using pip system-wide to install packages will fail with an “externally-managed-environment” error. The correct way is to use a virtual environment (venv), which is a private sandbox for Python and its packages.

To enable venv on Ubuntu, install the package:

sudo apt install -y python3-venv

To make use of venv, create and activate a virtual environment where the source command activates the virtual environment:

python3 -m venv ~/.venvs/azure
source ~/.venvs/azure/bin/activate

Within venv, pip can be upgraded:

pip install --upgrade pip

The azure-ai-openai package isn’t publicly available on PyPI, so install the OpenAI Python SDK:

pip install -U openai

The -U option above ensures the package is upgraded to the latest version.

To test the install, run Python from inside the distro:

python3

Inside Python, paste the following to ensure access to the Azure OpenAI package:

from openai import AzureOpenAI
print("Import worked:", AzureOpenAI)

Once it runs successfully inside Python, exit Python using exit() or Ctrl-D.
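The same check can be run non-interactively from the shell:

python3 -c 'from openai import AzureOpenAI; print("Import worked:", AzureOpenAI)'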

Every time you access Azure OpenAI on this distro instance, the venv must be activated using the following:

source ~/.venvs/azure/bin/activate

To simplify, add the above command to ~/.bashrc.
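For example:

echo 'source ~/.venvs/azure/bin/activate' >> ~/.bashrc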