Back to skills
extension
Category: OtherNo API key required

Screen Vision

macOS screen OCR & click automation via Apple Vision + ScreenCaptureKit. Capture any window or screen region, extract text with coordinates, find text, and c...

personAuthor: jackyun1024hubclawhub

screen-vision

macOS screen OCR & click automation CLI powered by Apple Vision + ScreenCaptureKit.

IMPORTANT: Auto-Setup (run first)

Before running any screen-vision command, check if the binary exists. If not, run the setup script:

command -v screen-vision &>/dev/null || bash "${CLAUDE_SKILL_DIR}/setup.sh"

This installs screen-vision (via Homebrew or source build) and cliclick automatically.

Requirements

  • macOS 14.0+ (Sonoma)
  • Screen Recording permission (System Settings > Privacy & Security > Screen Recording)

Commands

| Command | Description | Output | |---------|-------------|--------| | screen-vision ocr [--app NAME] | Full OCR | JSON array [{text, x, y, w, h, confidence}] | | screen-vision list [--app NAME] | OCR list | Human-readable text with coordinates | | screen-vision find "text" [--app NAME] | Find text | JSON {text, x, y, found} | | screen-vision has "text" [--app NAME] | Check text exists | Exit code 0 (found) / 1 (not found) | | screen-vision tap "text" [--app NAME] [--retry N] | Find + click | JSON {text, x, y, tapped} | | screen-vision wait "text" [--timeout SEC] | Poll until text appears | JSON {text, x, y, found} |

Capture Priority

--region x,y,w,h  >  --app "AppName"  >  full screen (default)

Usage Patterns

OCR a specific app window

screen-vision list --app "Safari"

Check if text is visible (for conditionals)

screen-vision has "Submit" --app "MyApp" && echo "Found" || echo "Not found"

Click on text with retry

screen-vision tap "OK" --app "MyApp" --retry 3

Wait for text to appear (e.g. loading complete)

screen-vision wait "Complete" --timeout 30

Full screen OCR as JSON (pipe to jq)

screen-vision ocr | jq '.[].text'

$ARGUMENTS Handling

Parse the user's request to determine which command to run:

  • "화면에 뭐 있어?" / "what's on screen?" → screen-vision list
  • "~찾아" / "find ~" → screen-vision find "text"
  • "~클릭해" / "click ~" → screen-vision tap "text"
  • "~보여?" / "is ~ visible?" → screen-vision has "text"
  • "~뜰 때까지 기다려" / "wait for ~" → screen-vision wait "text"