Open-source SDK for macOS

Your AI doesn't need
to see the screen.

It just needs to know
what's on it.

Glance reads the OS accessibility tree and gives your AI structured data about every UI element — role, label, value, exact pixel coordinates. No screenshots. No vision models. One function call.

• 0ms: Your eye takes 300ms to blink. Glance reads the entire screen 10 times in that window.
• Faster: LLMs process text tokens 10× faster than image tokens. Less data in, faster answer out.
• Cheaper: ~500 text tokens vs ~3,000 image tokens per call. That adds up fast at scale.

Let's be real for a second.

Glance doesn't replace screenshots entirely. About 55% of the time — when your user is in a browser, text editor, terminal, or any standard productivity app — structured text is all your AI needs. It responds faster, costs less, and gets exact element positions instead of guessing from pixels.

The other 45%? Canvas apps, games, custom-rendered UIs — the accessibility tree is sparse there. Fall back to screenshots for those. Glance tells you when to switch. Use the right tool for the moment.

Or just tell your agent to set it up:
import { screen } from 'glance-sdk'

// one line — returns structured text, not pixels
const ctx = await screen()

// feed it to any LLM as plain text
const res = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  messages: [{ role: 'user', content: ctx }]
})

The screenshot tax

Every Clicky, Snippy, and screen companion being built right now pays it on every single interaction. Capture, encode, upload, wait for the vision model to squint at pixels. Your users feel that lag.

Screenshot pipeline
1. Capture a 2MB screenshot
2. Base64-encode, upload to API
3. Vision model processes ~3,000 tokens
4. AI guesses element coordinates
2–5 sec · ~$0.03 / call · ~50px off

With Glance
1. Read OS accessibility tree — 30ms
2. Send ~500 tokens of structured text
3. Exact coordinates included free
~0.8 sec · ~$0.001 / call · 0px off
Why is it so much faster? LLMs process text and image tokens on different pipelines. Image tokens take 1–3 extra seconds to encode and interpret — that's time your user spends waiting. With text-only input, the model starts generating immediately. The response feels instant because, to the model, it basically is.

What your LLM actually receives

Instead of a 2MB image for the AI to squint at, it gets this:

Glance.screen() → ~480 tokens
[App: DaVinci Resolve 19.1 | Window: "Project 1 - Edit"]

## Focused
- [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]

## Controls
- [Button] "Cut" at (120,42)
- [Button] "Color" at (680,42)
- [PopUpButton] "Node" value="Corrector 1" at (820,42)
- [Button] "Add Node" at (900,42)

## Input Fields
- [Slider] "Lift" value=0.15 at (400,380)
- [Slider] "Gamma" value=-0.08 at (560,380)
- [Slider] "Gain" value=0.22 at (720,380)

## Content
- [StaticText] "Timeline: 47 clips" at (960,680)

The coordinates are exact — from the OS, not estimated from pixels. When your AI says "adjust the Midtones slider" you know it's at (510, 390). No vision model needed.
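Because the format is line-oriented, client code can recover elements from it without any vision step. A minimal sketch of such a parser — `parseElements` and its regex are illustrative helpers, not part of the Glance API, and assume the `- [Role] "Label" value=… at (x,y)` shape shown above:

```typescript
// Illustrative parser for the line format above. Not part of glance-sdk.
interface UIElement {
  role: string;
  label: string;
  value?: string;
  x: number;
  y: number;
  focused: boolean;
}

// Matches lines like: - [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]
const LINE = /^- \[(\w+)\] "([^"]+)"(?: value=("[^"]*"|[^ ]+))? at \((\d+),(\d+)\)( \[FOCUSED\])?$/;

function parseElements(text: string): UIElement[] {
  const out: UIElement[] = [];
  for (const raw of text.split('\n')) {
    const m = LINE.exec(raw.trim());
    if (!m) continue; // skip section headers and blank lines
    out.push({
      role: m[1],
      label: m[2],
      value: m[3],
      x: Number(m[4]),
      y: Number(m[5]),
      focused: Boolean(m[6]),
    });
  }
  return out;
}
```

Handy when you want to act on the coordinates directly instead of round-tripping through the LLM.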

Where Glance shines

Glance reads the native macOS accessibility tree — the same structured data VoiceOver uses. How much it can see depends on how well the app exposes its UI.

Best for · ~55% of use cases

Full element tree — every button, field, link, label, and its exact position. Text-only mode works perfectly here. This is where Glance saves you the most.

  • Browsers — Chrome, Safari, Firefox, Arc. Reads web page content too: headings, links, form fields, ARIA labels.
  • Code editors — VS Code, Cursor, Zed, Xcode. Sees file trees, tabs, editor content, terminal output.
  • Chat & productivity — Slack, Discord, Notion, Notes, Mail. Full message and UI structure.
  • System apps — Finder, System Settings, Terminal, iTerm2, Warp. Everything exposed.
  • Electron apps — Any app built on Electron inherits Chromium's accessibility tree.
Works, with gaps · ~20% of use cases

Menus, toolbars, and panels are readable — but the main canvas or viewport is a single opaque element. Combine with a cropped screenshot of just the canvas area for best results.

  • DaVinci Resolve — color panels, timeline controls, menus yes. Video viewer no.
  • Adobe apps — toolbars, layers panel, menus yes. Artboard/canvas no.
  • Figma desktop — app chrome yes. Design canvas no.
  • Blender — UI panels yes. 3D viewport no.
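For these apps the practical pattern is hybrid: Glance text for the app chrome, a cropped screenshot for just the canvas. A sketch of the message assembly — `buildHybridContent` is an illustrative helper, not part of glance-sdk, and `canvasPng` is assumed to be a base64 PNG you captured yourself:

```typescript
// Illustrative hybrid message builder: structured text for the chrome,
// plus an optional cropped canvas screenshot. Not part of glance-sdk.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'base64'; media_type: 'image/png'; data: string } };

function buildHybridContent(structuredText: string, canvasPng?: string): ContentBlock[] {
  const blocks: ContentBlock[] = [
    { type: 'text', text: `Current screen (structured):\n${structuredText}` },
  ];
  if (canvasPng) {
    // Only attach pixels for the part the accessibility tree can't describe.
    blocks.push({
      type: 'image',
      source: { type: 'base64', media_type: 'image/png', data: canvasPng },
    });
  }
  return blocks;
}
```

When there's no canvas crop, the message stays text-only and keeps the full speed advantage.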
Use screenshot · ~10% of use cases

These render everything custom — no accessibility tree to read. Glance detects this automatically (elementCount will be very low) so you can switch to screenshots.

  • Games
  • Canvas / WebGL apps
  • Remote desktop streams

Smart fallback, built in

Glance tells you when structured text isn't enough. One check, and you switch strategies automatically.

import { capture } from 'glance-sdk'

const state = await capture()

if (state.elementCount > 5) {
  // rich structure — use text (10× faster, 30× cheaper)
  sendToLLM({ role: 'user', content: state.prompt })
} else {
  // canvas app — fall back to screenshot
  const img = await captureScreenshot()
  sendToLLM({ role: 'user', content: [{ type: 'image', data: img }] })
}
                   Glance (text)             Screenshot (image)
Capture            ~30ms                     ~50ms
Network            nothing to upload         200–500ms (2MB image)
LLM thinking       ~0.3s (500 text tokens)   ~1.8s (3,000 image tokens)
Total latency      ~0.8s                     ~3.5s
Element positions  exact (from OS)           estimated (from pixels)
Cost               ~$0.001                   ~$0.03

Get started

The pre-built macOS binary is bundled inside the package. Nothing else to install.

npm
npm install glance-sdk
pip
pip install glance-sdk
swift
.package(url: "https://github.com/rishabhsai/glance", from: "0.1.0")

Four functions. That's the API.

screen() → string

Returns an LLM-ready string. Every UI element with role, label, value, and exact coordinates. Drop it straight into your prompt as text.

capture() → object

Full structured data — app name, window title, element array, the prompt string, and timing metrics. Use when you need programmatic access or want to check elementCount for fallback logic.

find(name) → element | null

Look up a UI element by label. Returns its exact pixel position — built for Clicky-style cursor pointing. No more coordinate guessing.

checkAccess() → boolean

Check if macOS Accessibility permission is granted. Same permission Clicky and similar tools already require for push-to-talk.
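To show what find-by-label buys you, here is a stand-in for `find(name)` as a pure lookup over an element array like the one `capture()` returns — the element shape is an assumption based on the sample output above, not the SDK's actual type:

```typescript
// Illustrative stand-in for find(name): case-insensitive label lookup.
// The element shape is assumed from the sample output, not the SDK's type.
interface Element { role: string; label: string; x: number; y: number }

function findByLabel(elements: Element[], name: string): Element | null {
  const needle = name.toLowerCase();
  return elements.find(e => e.label.toLowerCase() === needle) ?? null;
}

const elements: Element[] = [
  { role: 'Button', label: 'Cut', x: 120, y: 42 },
  { role: 'Slider', label: 'Midtones', x: 510, y: 390 },
];

const target = findByLabel(elements, 'midtones');
// target holds the exact position to move the cursor to: (510, 390)
```

Pointing becomes a dictionary lookup instead of a vision-model inference.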

Drop it into your stack

Glance works as a drop-in enhancement for any screen-aware AI tool.

Clicky / Snippy

Replace the screenshot capture in CompanionManager.swift with Glance.screen(). Two lines changed. Responses feel instant, pointing becomes pixel-perfect.

Electron apps

npm install glance-sdk — the native binary ships inside node_modules. Call screen() from your main process. No native compilation needed.

Python agents

pip install glance-sdk — binary bundled in the wheel. Works with LangChain, Claude SDK, OpenAI SDK, CrewAI, or any framework.

Any language

Use the CLI: ./glance screen --json outputs structured JSON to stdout. Parse it from Go, Rust, Ruby — anything that can exec a process.
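From Node the same pattern is a process exec plus JSON.parse. A sketch — the output shape (`app`, `elementCount`) is an assumption for illustration; check the CLI's actual schema:

```typescript
// Illustrative CLI consumption: exec the binary, parse stdout as JSON.
// The output fields shown here are assumed, not a documented schema.
import { execFile } from 'node:child_process';

interface GlanceOutput { app: string; elementCount: number }

function parseGlanceJson(stdout: string): GlanceOutput {
  return JSON.parse(stdout) as GlanceOutput;
}

// Usage (macOS, with the binary on disk):
// execFile('./glance', ['screen', '--json'], (err, stdout) => {
//   if (err) throw err;
//   const state = parseGlanceJson(stdout);
//   console.log(state.app, state.elementCount);
// });
```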

This is just the start

Windows & Linux

macOS uses AXUIElement. Windows has UI Automation, Linux has AT-SPI. Same idea, different OS APIs. We're working on it — or grab the issue and ship it first.

Smarter formatting

The prompt formatter is good but not perfect. Better grouping, context-aware truncation, app-specific templates — lots of room to make the LLM output even tighter.

Diff mode

Instead of sending the full screen every time, only send what changed since the last capture. Fewer tokens, faster responses, lower cost.
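Diff mode doesn't exist yet, but the shape is straightforward to sketch client-side. Everything below is illustrative, not shipped API; the element shape is assumed from the sample output earlier:

```typescript
// Illustrative client-side diff between two captures, keyed by role+label.
// Element shape is assumed from the sample output; not a shipped API.
interface El { role: string; label: string; value?: string; x: number; y: number }

function diffElements(prev: El[], next: El[]): { added: El[]; removed: El[]; changed: El[] } {
  const key = (e: El) => `${e.role}:${e.label}`;
  const prevMap = new Map(prev.map(e => [key(e), e]));
  const nextMap = new Map(next.map(e => [key(e), e]));
  return {
    added: next.filter(e => !prevMap.has(key(e))),
    removed: prev.filter(e => !nextMap.has(key(e))),
    changed: next.filter(e => {
      const old = prevMap.get(key(e));
      return !!old && (old.value !== e.value || old.x !== e.x || old.y !== e.y);
    }),
  };
}
```

Sending only the `added` and `changed` lines would keep repeat captures down to a handful of tokens.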

Your idea

Open an issue, open a PR. This is MIT-licensed and built to be extended. If you're making an AI screen companion, you probably know what's missing better than we do.

Make your AI companion
actually fast.

Open source · MIT license · One function call