Open-source SDK for macOS
Glance reads the OS accessibility tree and gives your AI structured data about every UI element — role, label, value, exact pixel coordinates. No screenshots. No vision models. One function call.
Glance doesn't replace screenshots entirely. About 55% of the time — when your user is in a browser, text editor, terminal, or any standard productivity app — structured text is all your AI needs. It responds faster, costs less, and gets exact element positions instead of guessing from pixels.
The other 45%? Canvas apps, games, custom-rendered UIs — the accessibility tree is sparse there. Fall back to screenshots for those. Glance tells you when to switch. Use the right tool for the moment.
import { screen } from 'glance-sdk'
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()

// one line — returns structured text, not pixels
const ctx = await screen()

// feed it to any LLM as plain text
const res = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: ctx }]
})
Every Clicky, Snippy, and screen-companion app being built right now pays the cost of screenshots on every single interaction: capture, encode, upload, then wait for the vision model to squint at pixels. Your users feel that lag.
Instead of a 2 MB image to squint at, your AI gets this:
[App: DaVinci Resolve 19.1 | Window: "Project 1 - Edit"]
## Focused
- [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]
## Controls
- [Button] "Cut" at (120,42)
- [Button] "Color" at (680,42)
- [PopUpButton] "Node" value="Corrector 1" at (820,42)
- [Button] "Add Node" at (900,42)
## Input Fields
- [Slider] "Lift" value=0.15 at (400,380)
- [Slider] "Gamma" value=-0.08 at (560,380)
- [Slider] "Gain" value=0.22 at (720,380)
## Content
- [StaticText] "Timeline: 47 clips" at (960,680)
The coordinates are exact — from the OS, not estimated from pixels. When your AI says "adjust the Midtones slider," you know it's at (510, 390). No vision model needed.
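Because each element renders with a fixed at (x,y) pattern, the position is trivial to recover without any model at all. A minimal sketch (the line format is taken from the example output above; this is not an official SDK parser):

```typescript
// Sketch: pull the "(x,y)" pair out of a Glance element line.
// Line format assumed from the example output above, not a stable contract.
function parseCoords(line: string): { x: number; y: number } | null {
  const m = line.match(/at \((\d+),(\d+)\)/)
  return m ? { x: Number(m[1]), y: Number(m[2]) } : null
}

parseCoords('- [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]')
// → { x: 510, y: 390 }
```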
Glance reads the native macOS accessibility tree — the same structured data VoiceOver uses. How much it can see depends on how well the app exposes its UI.
Standard productivity apps (browsers, editors, terminals) expose the full element tree — every button, field, link, label, and its exact position. Text-only mode works perfectly here. This is where Glance saves you the most.
In pro and creative apps, menus, toolbars, and panels are readable — but the main canvas or viewport is a single opaque element. Combine with a cropped screenshot of just the canvas area for best results.
Canvas apps and games render everything custom — no accessibility tree to read. Glance detects this automatically (elementCount will be very low) so you can switch to screenshots.
Glance tells you when structured text isn't enough. One check, and you switch strategies automatically.
import { capture } from 'glance-sdk'

const state = await capture()

if (state.elementCount > 5) {
  // rich structure — use text (10× faster, 30× cheaper)
  sendToLLM({ role: 'user', content: state.prompt })
} else {
  // canvas app — fall back to screenshot
  const img = await captureScreenshot()
  sendToLLM({ role: 'user', content: [{ type: 'image', data: img }] })
}
The pre-built macOS binary is bundled inside the package. Nothing else to install.
npm install glance-sdk
pip install glance-sdk
.package(url: "https://github.com/rishabhsai/glance", from: "0.1.0")
screen() → string
Returns an LLM-ready string. Every UI element with role, label, value, and exact coordinates. Drop it straight into your prompt as text.

capture() → object
Full structured data — app name, window title, element array, the prompt string, and timing metrics. Use when you need programmatic access or want to check elementCount for fallback logic.

find(name) → element | null
Look up a UI element by label. Returns its exact pixel position — built for Clicky-style cursor pointing. No more coordinate guessing.

checkAccess() → boolean
Check if macOS Accessibility permission is granted. Same permission Clicky and similar tools already require for push-to-talk.

Glance works as a drop-in enhancement for any screen-aware AI tool.
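To make the find(name) lookup concrete, here is a self-contained sketch of the behavior it describes. The element shape used here (role, label, x, y) is an assumption for illustration, not the SDK's actual return type:

```typescript
// Hypothetical element shape, assumed for illustration only.
interface UIElementLite { role: string; label: string; x: number; y: number }

// Sketch of find(name): case-insensitive label lookup over captured elements.
function findByLabel(elements: UIElementLite[], name: string): UIElementLite | null {
  return elements.find(e => e.label.toLowerCase() === name.toLowerCase()) ?? null
}

const els: UIElementLite[] = [
  { role: 'Button', label: 'Cut', x: 120, y: 42 },
  { role: 'Slider', label: 'Midtones', x: 510, y: 390 },
]
findByLabel(els, 'midtones')  // → the Midtones slider at (510, 390)
```

A Clicky-style companion would then move the cursor straight to the returned x and y, rather than asking a vision model to estimate the position.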
Replace the screenshot capture in CompanionManager.swift with Glance.screen(). Two lines changed. Responses feel instant, pointing becomes pixel-perfect.
npm install glance-sdk — the native binary ships inside node_modules. Call screen() from your main process. No native compilation needed.
pip install glance-sdk — binary bundled in the wheel. Works with LangChain, Claude SDK, OpenAI SDK, CrewAI, or any framework.
Use the CLI: ./glance screen --json outputs structured JSON to stdout. Parse it from Go, Rust, Ruby — anything that can exec a process.
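Whatever language execs the process, consuming the output is plain JSON parsing. A sketch with a hypothetical sample payload — the field names follow the capture() description above, but treat the exact schema as an assumption until you inspect real output:

```typescript
// Hypothetical sample of `./glance screen --json` output. Field names are
// assumed from the capture() description, not a documented schema.
const raw = `{
  "app": "Sketchpad",
  "window": "Untitled",
  "elementCount": 2,
  "elements": [
    { "role": "Button", "label": "Undo", "x": 120, "y": 42 },
    { "role": "Slider", "label": "Brush Size", "value": 0.32, "x": 510, "y": 390 }
  ]
}`

const state = JSON.parse(raw)
// Same fallback rule as the SDK example: a sparse tree means a canvas-style app.
const useText = state.elementCount > 5  // false for this sparse sample
```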
macOS uses AXUIElement. Windows has UI Automation, Linux has AT-SPI. Same idea, different OS APIs. We're working on it — or grab the issue and ship it first.
The prompt formatter is good but not perfect. Better grouping, context-aware truncation, app-specific templates — lots of room to make the LLM output even tighter.
Instead of sending the full screen every time, only send what changed since the last capture. Fewer tokens, faster responses, lower cost.
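One way that diff could work, assuming each element renders to one stable line of text as in the output format above (a sketch, not the planned implementation):

```typescript
// Sketch of diff mode: keep only the lines that changed since the last capture.
// Assumes one element per line with stable formatting, as in the output above.
function changedLines(prev: string[], next: string[]): string[] {
  const seen = new Set(prev)
  return next.filter(line => !seen.has(line))
}

changedLines(
  ['- [Slider] "Lift" value=0.15 at (400,380)', '- [Button] "Cut" at (120,42)'],
  ['- [Slider] "Lift" value=0.21 at (400,380)', '- [Button] "Cut" at (120,42)'],
)
// → only the updated Lift line
```

Note this only surfaces new and changed lines; a real implementation would also want to report elements that disappeared.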
Open an issue, open a PR. This is MIT-licensed and built to be extended. If you're making an AI screen companion, you probably know what's missing better than we do.
Open source · MIT license · One function call