AIRAS Agent

Screenshot: a flight booking interface on Skyscanner showing departure city, destination, dates, regional settings, and search options (2025-02-22)

Note: This project is for research and non-commercial use only.

License: GPL-3.0

A web automation system that uses AI with vision capabilities and Playwright to achieve user-defined goals through autonomous web navigation. Supports both OpenAI and Ollama as AI providers.

Quick Start

  1. Clone and Install

    git clone https://github.com/airas-network/airas-agent.git
    cd airas-agent
    npm install
    
  2. Configure: Create a .env file in the root directory:

    # Choose your AI provider
    AI_PROVIDER=openai  # or 'ollama'
    
    # If using OpenAI
    OPENAI_API_KEY=your-api-key-here
    OPENAI_MODEL=gpt-4o-mini
    OPENAI_VISION_MODEL=gpt-4-vision-preview
    
    # If using Ollama
    OLLAMA_BASE_URL=http://localhost:11434
    OLLAMA_MODEL=llama3.1
    OLLAMA_VISION_MODEL=llava
    
  3. Run

    npm run dev
    
  4. Use

    • Open http://localhost:3000 in your browser
    • Enter a goal in the mission parameters (e.g., "Find me a good NFT from Rarible")
    • Click "INITIALIZE MISSION"

Features

  • Multiple AI Providers:
    • OpenAI (GPT-4V)
    • Ollama (Local models with vision capabilities)
  • Vision-Enhanced Navigation: Uses AI vision to analyze and interact with web pages
  • Autonomous Decision Making: Makes intelligent decisions based on visual and textual context
  • Real-Time Feedback: Shows current status, steps, and visual feedback
  • Multi-Tab Support: Handles new tab navigation and dynamic content
  • Smart Element Detection: Improved DOM selection and interaction

Architecture

Core Components

  1. Frontend (app/components/Agent.tsx)

    • Main interface for user interaction
    • Displays current status, steps taken, and execution timeline
    • Shows visual feedback of current webpage state
    • Displays active AI provider and model information
    • Uses the useAgent hook for state management and API interactions
  2. Browser Service (app/lib/browser/playwright-service.ts)

    • Manages browser automation using Playwright
    • Handles multi-tab navigation and synchronization
    • Provides high-level browser control methods
    • Manages page lifecycle and cleanup
    • Features:
      • Automatic new tab detection and switching
      • Improved element interaction stability
      • Better handling of dynamic content
      • Robust error recovery
  3. DOM Service (app/lib/browser/dom-service.ts)

    • Smart element detection and interaction
    • Maintains element state across page changes
    • Features:
      • Dynamic element highlighting
      • Automatic tab synchronization
      • Improved element visibility detection
      • Better handling of overlays and modals
  4. Action Executor (app/lib/actions/executor.ts)

    • Executes navigation actions
    • Coordinates between browser and DOM services
    • Handles action validation and error recovery
    • Features:
      • Improved action validation
      • Better error handling
      • State preservation across actions
  5. AI Providers

    • Modular provider system supporting multiple AI backends
    • Each provider implements a common interface:
      interface AIProvider {
        chat(messages: ChatMessage[]): Promise<string>;
        chatWithVision(messages: ChatMessage[], imageBase64: string): Promise<string>;
      }
      
    • Supported providers:
      • OpenAI Provider: Uses GPT-4V for vision tasks
      • Ollama Provider: Uses local models with vision capabilities
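
To make the provider interface concrete, the sketch below shows roughly what an Ollama-backed provider could look like. It is a minimal illustration, not the project's actual code: the class name and constructor defaults are assumptions, and it calls Ollama's /api/chat endpoint directly via fetch.

    // Hypothetical sketch of an AIProvider backed by Ollama's /api/chat endpoint.
    interface ChatMessage {
      role: 'system' | 'user' | 'assistant';
      content: string;
    }

    interface AIProvider {
      chat(messages: ChatMessage[]): Promise<string>;
      chatWithVision(messages: ChatMessage[], imageBase64: string): Promise<string>;
    }

    class OllamaProvider implements AIProvider {
      constructor(
        private baseUrl = process.env.OLLAMA_BASE_URL ?? 'http://localhost:11434',
        private model = process.env.OLLAMA_MODEL ?? 'llama3.1',
        private visionModel = process.env.OLLAMA_VISION_MODEL ?? 'llava'
      ) {}

      chat(messages: ChatMessage[]): Promise<string> {
        return this.request(this.model, messages);
      }

      chatWithVision(messages: ChatMessage[], imageBase64: string): Promise<string> {
        // Ollama accepts base64-encoded images on a message via its `images` field.
        const withImage = messages.map((m, i) =>
          i === messages.length - 1 ? { ...m, images: [imageBase64] } : m
        );
        return this.request(this.visionModel, withImage);
      }

      private async request(model: string, messages: object[]): Promise<string> {
        const res = await fetch(`${this.baseUrl}/api/chat`, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ model, messages, stream: false }),
        });
        if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
        const data = (await res.json()) as { message?: { content?: string } };
        return data.message?.content ?? '';
      }
    }

An OpenAI-backed provider implements the same two methods, which is what keeps the rest of the system provider-agnostic.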

Navigation Flow

  1. Initialization

    // Initialize browser
    await playwrightService.initialize();
    
    // Create browser context for tab management
    const context = await browser.newContext();
    
    // Set up new tab handling
    context.on('page', async (page) => {
      // Handle new tab
      await page.waitForLoadState();
      // Update current page
    });
    
  2. Element Interaction

    // Smart element detection
    const element = await page.waitForSelector(selector, {
      state: 'visible',
      timeout: 10000
    });
    
    // Stable clicking with retry
    try {
      await element.click({ timeout: 10000, force: false });
    } catch {
      // Fall back to a JavaScript click inside the page
      await page.evaluate((sel) => {
        (document.querySelector(sel) as HTMLElement | null)?.click();
      }, selector);
    }
    
  3. State Management

    // Get current page state
    const state = await domService.getPageState();
    
    // Format elements for AI
    const elements = await domService.getFormattedElements();
    
    // Take screenshot
    const screenshot = await page.screenshot();
    

Error Handling

  1. Browser Level

    • Automatic recovery from crashes
    • Tab synchronization maintenance
    • Resource cleanup
    • Session state preservation
  2. DOM Level

    • Element validation before interaction
    • Visibility and interactivity checks
    • Dynamic content handling
    • Modal and overlay detection
  3. Action Level

    • Pre-action validation
    • Post-action verification
    • Error recovery strategies
    • State rollback capabilities
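
To make the action-level behavior concrete, a bounded retry wrapper along the following lines could sit behind pre-action validation and post-action verification. The helper name and retry policy are illustrative assumptions, not the project's actual implementation.

    // Illustrative retry helper: run an action, verify its effect, and retry a
    // bounded number of times before surfacing the last error.
    async function withRetry<T>(
      action: () => Promise<T>,
      verify: (result: T) => Promise<boolean>,
      maxAttempts = 3
    ): Promise<T> {
      let lastError: unknown;
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          const result = await action();           // pre-validated action
          if (await verify(result)) return result; // post-action verification
          lastError = new Error('Post-action verification failed');
        } catch (err) {
          lastError = err;                          // keep the most recent failure
        }
      }
      throw lastError;
    }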

Configuration

Browser Configuration

# Browser viewport settings
BROWSER_WIDTH=1280
BROWSER_HEIGHT=800

# Navigation timeouts (in milliseconds)
NAVIGATION_TIMEOUT=30000
NETWORK_IDLE_TIMEOUT=10000

# Element interaction settings
CLICK_TIMEOUT=10000
ELEMENT_WAIT_TIMEOUT=10000
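
These are plain environment variables; a configuration module might read them with the defaults shown above, roughly as in this sketch (the helper and export names are illustrative):

    // Illustrative: parse browser and navigation settings from the environment,
    // falling back to the defaults documented above.
    const num = (key: string, fallback: number): number => {
      const parsed = Number.parseInt(process.env[key] ?? '', 10);
      return Number.isFinite(parsed) ? parsed : fallback;
    };

    export const browserConfig = {
      viewport: { width: num('BROWSER_WIDTH', 1280), height: num('BROWSER_HEIGHT', 800) },
      navigationTimeout: num('NAVIGATION_TIMEOUT', 30000),
      networkIdleTimeout: num('NETWORK_IDLE_TIMEOUT', 10000),
      clickTimeout: num('CLICK_TIMEOUT', 10000),
      elementWaitTimeout: num('ELEMENT_WAIT_TIMEOUT', 10000),
    };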

Development Features

# Enable detailed logging
ENABLE_LOGGING=true

# Save screenshots to disk
ENABLE_SCREENSHOTS=true

# Directory paths
SCREENSHOT_DIR=screenshots
LOG_DIR=logs

Contributing

This project welcomes contributions! Some areas for improvement:

  1. Enhanced Navigation

    • Better dynamic content handling
    • Improved modal interaction
    • Smarter tab management
    • Form handling improvements
  2. AI Integration

    • Better context preservation
    • Improved decision making
    • Enhanced visual understanding
    • More efficient prompting
  3. Error Handling

    • Better recovery strategies
    • Improved state preservation
    • More robust cleanup
    • Better error reporting
  4. Performance

    • Faster element detection
    • Better resource management
    • Reduced memory usage
    • Improved screenshot handling

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.txt file for details.

This means you are free to:

  • Use the software for any purpose
  • Change the software to suit your needs
  • Share the software with your friends and neighbors
  • Share the changes you make

Under the following terms:

  • If you distribute the software, you must:
    • Provide the complete source code or make it freely available
    • Pass the same license terms on to recipients
    • License any modifications under the same terms
  • Include copyright and license notices
  • State significant changes made to the software
  • Disclose source code of your version

The GPL-3.0 license ensures that all versions of the software remain free and open source.

Development Mode

When developing or debugging, you can enable additional features:

  1. Set ENABLE_LOGGING=true to write detailed logs to the logs directory:

    • Each session gets its own timestamped log file
    • Logs include step execution details, navigation status, and errors
  2. Set ENABLE_SCREENSHOTS=true to save screenshots to the screenshots directory:

    • Each screenshot is saved with a timestamp
    • Useful for debugging navigation issues
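
As an example of what the screenshot flag might gate, a helper along these lines would save a timestamped capture only when the flag is set (file layout and names are assumptions for illustration):

    import { mkdir, writeFile } from 'fs/promises';
    import path from 'path';
    import type { Page } from 'playwright';

    // Illustrative: write a timestamped screenshot only when ENABLE_SCREENSHOTS=true.
    async function maybeSaveScreenshot(page: Page, label: string): Promise<void> {
      if (process.env.ENABLE_SCREENSHOTS !== 'true') return;
      const dir = process.env.SCREENSHOT_DIR ?? 'screenshots';
      await mkdir(dir, { recursive: true });
      const stamp = new Date().toISOString().replace(/[:.]/g, '-');
      await writeFile(path.join(dir, `${stamp}-${label}.png`), await page.screenshot());
    }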

Usage

  1. Start the development server:

    npm run dev
    
  2. Enter a goal in the interface (e.g., "Find me a good NFT from Rarible")

  3. The system will:

    • Break down the goal into steps
    • Take screenshots of each state
    • Use AI vision to analyze the page
    • Execute each step autonomously
    • Display progress in real-time
    • Handle errors gracefully
    • Complete when the goal is achieved
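
At a high level, the loop behind this looks something like the sketch below. The interfaces and names are simplified illustrations of the services described above, not the actual implementation.

    // Illustrative outline of the autonomous loop: observe the page, ask the
    // vision model for the next action, execute it, and repeat until done.
    interface BrowserLike {
      getFormattedElements(): Promise<string>;  // textual context for the model
      screenshot(): Promise<Buffer>;            // current page as an image
      execute(action: { type: string; selector?: string; url?: string; text?: string }): Promise<void>;
    }

    interface VisionChat {
      chatWithVision(
        messages: { role: 'system' | 'user'; content: string }[],
        imageBase64: string
      ): Promise<string>;
    }

    async function runMission(goal: string, browser: BrowserLike, ai: VisionChat, maxSteps = 25) {
      for (let step = 0; step < maxSteps; step++) {
        const elements = await browser.getFormattedElements();
        const image = (await browser.screenshot()).toString('base64');

        // Ask the model for the next action, given the goal, page text, and screenshot.
        const reply = await ai.chatWithVision(
          [
            { role: 'system', content: 'You control a browser. Reply with one JSON action.' },
            { role: 'user', content: `Goal: ${goal}\n\nInteractive elements:\n${elements}` },
          ],
          image
        );

        const action = JSON.parse(reply); // e.g. { "type": "click", "selector": "#search" }
        if (action.type === 'done') return;
        await browser.execute(action);    // validated and retried by the executor
      }
      throw new Error(`Goal not reached within ${maxSteps} steps`);
    }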

Error Handling

  1. API Level

    • Graceful cleanup of browser resources
    • Detailed error messages
    • Status code handling
    • Screenshot verification
  2. Frontend Level

    • Visual error feedback
    • Retry capabilities
    • State recovery
    • Screenshot display

Logging

Comprehensive logging at each step:

  • Step execution details
  • Element finding attempts
  • Navigation status
  • Network request completion
  • Screenshot captures
  • Error details

This system is designed to be extensible, with the ability to add new step types and element finding strategies as needed. The vision capabilities ensure more accurate navigation by making decisions based on what is actually visible on the page.
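
As one illustration of that extensibility, element-finding strategies could be registered and tried in order, as sketched below; this is a design illustration rather than the project's current API.

    import type { Page } from 'playwright';

    // Illustrative: pluggable strategies that map a target description to a selector.
    // Each strategy is tried in order until one resolves to a visible element.
    interface ElementFindingStrategy {
      name: string;
      toSelector(description: string): string;
    }

    const strategies: ElementFindingStrategy[] = [
      { name: 'css', toSelector: (d) => d },                    // already a CSS selector
      { name: 'visible-text', toSelector: (d) => `text=${d}` }, // Playwright text engine
      { name: 'aria-label', toSelector: (d) => `[aria-label="${d}"]` },
    ];

    async function resolveSelector(page: Page, description: string): Promise<string | null> {
      for (const { toSelector } of strategies) {
        const selector = toSelector(description);
        if (await page.locator(selector).first().isVisible().catch(() => false)) {
          return selector;
        }
      }
      return null;
    }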

Contributing

This project is currently in its early stages and we welcome contributions from the community! While the current implementation is barebones, with enough interest and contributions, we aim to expand its capabilities significantly.

Areas for Improvement

  1. Enhanced Navigation Strategies

    • Better handling of dynamic content
    • Support for more complex interaction patterns
    • Improved SPA navigation detection
  2. Vision Capabilities

    • Better element recognition
    • Support for more complex visual patterns
    • Improved decision making based on visual context
  3. Error Handling

    • More robust recovery strategies
    • Better handling of timeouts and failures
    • Improved logging and debugging capabilities
  4. Performance Optimization

    • Reduce memory usage
    • Improve screenshot handling
    • Better resource cleanup

How to Contribute

  1. Fork the Repository

    git clone https://github.com/GPT-Protocol/airas-agent.git
    cd airas-agent
    
  2. Install Dependencies

    npm install
    

    If you encounter peer dependency issues during installation, you can use the legacy peer deps flag:

    npm install --legacy-peer-deps
    

    This may be necessary due to some packages having strict peer dependency requirements.

  3. Create a Branch

    git checkout -b feature/your-feature-name
    
  4. Make Your Changes

    • Write clean, commented code
    • Follow existing code style
    • Add tests if possible
    • Update documentation as needed
  5. Test Your Changes

    npm run dev
    
  6. Submit a Pull Request

    • Provide a clear description of your changes
    • Link any related issues
    • Include screenshots if relevant

Development Guidelines

  1. Code Style

    • Use TypeScript
    • Follow existing patterns
    • Add appropriate comments
    • Use meaningful variable names
  2. Testing

    • Test with different websites
    • Verify error handling
    • Check edge cases
    • Document any limitations
  3. Documentation

    • Update README if needed
    • Document new features
    • Add inline comments
    • Update configuration examples

Project Status

This is an experimental project in its early stages. The current implementation provides basic autonomous web navigation capabilities, but there's significant room for improvement. We're releasing it in this state to:

  1. Gather community feedback
  2. Identify key areas for improvement
  3. Allow early adopters to experiment and contribute
  4. Build a foundation for more advanced features

If you're interested in contributing or have ideas for improvements, please:

  • Open an issue to discuss your ideas
  • Submit pull requests with improvements
  • Share your use cases and feedback
  • Join the discussion in GitHub issues

Future Plans

With sufficient community interest, we plan to add:

  • More sophisticated navigation strategies
  • Better visual understanding capabilities
  • Additional browser automation features
  • Improved error recovery
  • Better performance and reliability
  • Extended documentation and examples

Your contributions and feedback are welcome and appreciated!