Integrating AI Features in Your App Without Using OpenAI

Team · 4 min read

#ai

#opensource

#integration

Overview

Adding AI features to an app doesn’t have to mean tying yourself to a single vendor. This guide explores approaches to integrate AI capabilities—such as text generation, summarization, and classification—without using OpenAI. You’ll learn about on-device inference, self-hosted models, and alternative cloud providers, plus practical steps to choose the right path for your use case.

Why avoid OpenAI

There are several reasons teams consider non-OpenAI options:

  • Data privacy and control: keep sensitive user data within your own environment or trusted providers.
  • Cost predictability: replace variable, usage-based API costs with more predictable infrastructure spend.
  • Vendor independence: reduce reliance on a single ecosystem and policy changes.
  • Compliance and latency: meet regulatory requirements and optimize latency by deploying closer to users.

Architecture options

  • On-device inference: run lightweight models directly in the browser or mobile apps using frameworks like TensorFlow Lite, Core ML, or ONNX Runtime Mobile.
  • Self-hosted inference: host larger models on your own servers or private cloud using containerized deployments (e.g., Docker) or optimized runtimes.
  • Cloud-based non-OpenAI providers: use alternative services (e.g., Cohere, AI21 Studio, or others) with hosted endpoints.
  • Hybrid: combine edge inference for common tasks with cloud backends for heavier workloads or specialized models.

Getting started

  • Define the feature: text generation, summarization, translation, classification, or embeddings.
  • Choose your path: on-device, self-hosted, or third-party provider.
  • Select a model or service: lightweight on-device models, moderately sized self-hosted models, or reputable cloud providers.
  • Set up infrastructure: choose your runtime (mobile, web, server) and set up endpoints or bundles.
  • Build and integrate: implement API calls, handle prompts and inputs, and incorporate safety checks.
  • Monitor and iterate: track latency, accuracy, and costs; refine prompts and models as needed.

On-device inference

Benefits:

  • Privacy: data doesn’t leave the device.
  • Latency: reduced round-trips for responsive UX.
  • Compliance: easier to meet data residency requirements.

Approaches:

  • Mobile: TensorFlow Lite, Core ML, or ONNX Runtime Mobile.
  • Web: ONNX Runtime Web (the successor to ONNX.js) or TensorFlow.js for browser-based inference (with smaller models or WASM acceleration); a sketch follows this list.
  • Considerations: model size, hardware constraints, power usage, and occasional model updates.
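
To make the web option concrete, the sketch below loads an ONNX model with ONNX Runtime Web and runs a single inference in the browser. The model path, tensor names, and shapes are placeholders; substitute whatever your exported model actually uses:

// Minimal browser-side inference with ONNX Runtime Web (a sketch; names and shapes are placeholders).
import * as ort from 'onnxruntime-web';

let sessionPromise;

// Lazily create the inference session so the model file downloads only once.
function getSession() {
  sessionPromise ??= ort.InferenceSession.create('/models/classifier.onnx', {
    executionProviders: ['wasm'] // CPU/WASM backend; 'webgl' is another option
  });
  return sessionPromise;
}

// Classify a pre-tokenized input; `inputIds` is a Float32Array prepared by your tokenizer.
export async function classify(inputIds) {
  const session = await getSession();
  const input = new ort.Tensor('float32', inputIds, [1, inputIds.length]);
  // 'input' and 'logits' must match the names baked into the exported model.
  const results = await session.run({ input });
  return results.logits.data;
}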

Tips:

  • Start with distilled or quantized models designed for mobile.
  • Use streaming generation where possible to improve perceived responsiveness.
  • Cache frequent responses locally to reduce repeated inferences.
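
To make the caching tip concrete, one minimal approach is to memoize results keyed by the normalized prompt so repeated requests skip inference entirely. The runInference callback below is a placeholder for whichever on-device call you use:

// Naive in-memory cache for inference results; runInference stands in for
// your actual on-device generation or classification call.
const cache = new Map();
const MAX_ENTRIES = 200;

export async function cachedGenerate(prompt, runInference) {
  const key = prompt.trim().toLowerCase();
  if (cache.has(key)) return cache.get(key);

  const result = await runInference(prompt);

  // Evict the oldest entry once the cache grows too large.
  if (cache.size >= MAX_ENTRIES) {
    cache.delete(cache.keys().next().value);
  }
  cache.set(key, result);
  return result;
}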

Self-hosted inference

Benefits:

  • Greater control over data and models.
  • Ability to run larger or more capable models than on-device permits.
  • Customization and fine-tuning opportunities.

Stack options:

  • Lightweight web services: FastAPI/Flask (Python) or Node.js servers exposing a /generate or /embed endpoint (a Node.js sketch follows this list).
  • Model ecosystems: GPT-NeoX, LLaMA, Mistral, or other open-source models (respect licenses).
  • Runtimes: NVIDIA Triton Inference Server, Hugging Face Text Generation Inference (TGI), or CPU/GPU-optimized containers.
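
As one possible shape for the Node.js option above, here is a thin gateway that exposes /generate and forwards requests to a self-hosted model runtime. The runtime URL and payload are assumptions; adapt them to whatever your container (Triton, TGI, a llama.cpp server, etc.) actually expects, and note that global fetch requires Node 18+:

// Sketch of a thin Node.js (Express) gateway in front of a self-hosted model runtime.
// RUNTIME_URL and the request/response shapes are assumptions; adjust for your runtime.
import express from 'express';

const app = express();
app.use(express.json());

const RUNTIME_URL = process.env.RUNTIME_URL ?? 'http://localhost:9000/v1/generate';

app.post('/generate', async (req, res) => {
  const { prompt, max_tokens = 200 } = req.body ?? {};
  if (typeof prompt !== 'string' || prompt.length === 0) {
    return res.status(400).json({ error: 'prompt is required' });
  }
  try {
    // Forward the request to the model runtime and relay its response.
    const upstream = await fetch(RUNTIME_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt, max_tokens })
    });
    res.status(upstream.status).json(await upstream.json());
  } catch (err) {
    res.status(502).json({ error: 'model runtime unavailable' });
  }
});

app.listen(8000, () => console.log('inference gateway on :8000'));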

Important considerations:

  • Hardware requirements (CPU vs GPU, VRAM, memory).
  • Latency vs throughput trade-offs.
  • Model management, versioning, and monitoring.
  • Safety and abuse prevention controls.

Example outline:

  • Run a model container locally or in your private cloud.
  • Expose a REST or GraphQL endpoint.
  • Integrate your app to call the endpoint with prompts and stream responses if supported.
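
If your endpoint streams tokens, the client can render text incrementally instead of waiting for the full completion. A sketch using the Fetch API's ReadableStream; the query parameter and chunk format are assumptions, so adapt the parsing to whatever your server emits (plain text, SSE, or NDJSON):

// Read a streamed generation response chunk by chunk and hand each piece to the UI.
// Assumes the server writes plain text chunks; adjust parsing for SSE or NDJSON.
async function streamGeneration(prompt, onChunk) {
  const res = await fetch('/api/generate?stream=true', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });
  if (!res.ok || !res.body) throw new Error(`generation failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}

// Usage: streamGeneration('Summarize: ...', chunk => output.textContent += chunk);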

Cloud-based non-OpenAI providers

Providers to explore:

  • Cohere
  • AI21 Studio
  • Others that offer text generation, summarization, or classification APIs

Key considerations: pricing, rate limits, data handling, and SLAs.

Guiding tips:

  • Vet data handling and retention policies.
  • Use regional endpoints to reduce latency.
  • Combine with on-device or self-hosted for sensitive workflows.
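
Since each provider defines its own API, it helps to keep the call behind a single wrapper so switching vendors is a configuration change rather than a rewrite. A hedged sketch; the base URL, path, payload shape, and auth header are placeholders, so check your provider's documentation:

// Provider-agnostic client wrapper. The endpoint path, payload, and auth scheme
// below are placeholders; every provider (Cohere, AI21, etc.) defines its own API.
const AI_BASE_URL = process.env.AI_BASE_URL; // e.g. a regional endpoint
const AI_API_KEY = process.env.AI_API_KEY;

export async function generate(prompt, maxTokens = 200) {
  const res = await fetch(`${AI_BASE_URL}/generate`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${AI_API_KEY}`
    },
    body: JSON.stringify({ prompt, max_tokens: maxTokens })
  });
  if (!res.ok) throw new Error(`provider error: ${res.status}`);
  return res.json();
}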

Building a simple feature: text generation without OpenAI

  • Use a local or hosted endpoint to generate text from a prompt.
  • Implement prompt templates and a lightweight safety guardrail (a sketch follows the code examples below).

Code example (fetching from a local/self-hosted endpoint):

# Call a local inference endpoint
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Summarize the following article: ...","max_tokens":200}'

Code example (client-side fetch in a web app):

// Post a prompt to your backend and return the parsed JSON response.
async function generateText(prompt) {
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, max_tokens: 200 })
  });
  if (!res.ok) throw new Error(`generation failed: ${res.status}`);
  return res.json();
}
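
For the prompt template and guardrail mentioned above, one possible shape is a template function plus simple pre- and post-checks around generateText. The blocklist and length limits are illustrative placeholders, not a complete safety system:

// Prompt template plus a very lightweight guardrail around generateText().
const BLOCKED_TERMS = ['credit card number', 'social security number'];

function summarizePrompt(article) {
  return `Summarize the following article in three sentences:\n\n${article}`;
}

function passesGuardrail(text) {
  if (!text || text.length > 8000) return false;
  const lower = text.toLowerCase();
  return !BLOCKED_TERMS.some(term => lower.includes(term));
}

async function safeSummarize(article) {
  if (!passesGuardrail(article)) throw new Error('input rejected by guardrail');
  // The { text } response field is an assumption; match your endpoint's actual shape.
  const { text } = await generateText(summarizePrompt(article));
  if (!passesGuardrail(text)) throw new Error('output rejected by guardrail');
  return text;
}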

Performance, privacy, and cost considerations

  • Performance: measure latency, throughput, and model quality; balance with device capabilities.
  • Privacy: prefer on-device or private-cloud deployments for sensitive data.
  • Cost: monitor API usage if using third-party services; consider model size and hardware costs for self-hosted options.
  • Maintenance: plan for model updates, drift monitoring, and content safeguards.

Security and governance

  • Implement input validation and rate limiting to mitigate abuse (a minimal sketch follows this list).
  • Use access controls and secrets management for endpoints.
  • Log and audit AI interactions to meet compliance requirements.
  • Establish content policies and guardrails to prevent unsafe outputs.
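
As a starting point for the first two items, the sketch below pairs basic prompt validation with a hand-rolled per-IP rate limiter. In production you would likely back this with a shared store and a vetted middleware; the limits shown are arbitrary:

// Minimal per-IP rate limiting and input validation for a /generate endpoint.
// In-memory only; use a shared store (e.g. Redis) behind multiple instances.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 30;
const hits = new Map(); // ip -> timestamps of recent requests

export function allowRequest(ip) {
  const now = Date.now();
  const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);
  return recent.length <= MAX_REQUESTS;
}

export function validatePrompt(prompt) {
  return typeof prompt === 'string' && prompt.trim().length > 0 && prompt.length <= 4000;
}

// In the request handler:
//   if (!allowRequest(req.ip)) return res.status(429).json({ error: 'rate limit exceeded' });
//   if (!validatePrompt(req.body.prompt)) return res.status(400).json({ error: 'invalid prompt' });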

Next steps

  • Pick a single AI feature to start (e.g., text summarization) and choose an implementation path (on-device, self-hosted, or cloud provider).
  • Prototype quickly with a minimal model and a simple API, then iterate on latency and accuracy.
  • Plan for data privacy, cost management, and governance from day one.
  • As you grow, consider a hybrid architecture to balance privacy, performance, and scale.