Integrating AI Features in Your App Without Using OpenAI
#ai
#opensource
#integration
Overview
Adding AI features to an app doesn’t have to mean tying yourself to a single vendor. This guide explores approaches to integrate AI capabilities—such as text generation, summarization, and classification—without using OpenAI. You’ll learn about on-device inference, self-hosted models, and alternative cloud providers, plus practical steps to choose the right path for your use case.
Why avoid OpenAI
There are several reasons teams consider non-OpenAI options:
- Data privacy and control: keep sensitive user data within your own environment or trusted providers.
- Cost predictability: avoid variable API costs tied to usage.
- Vendor independence: reduce reliance on a single ecosystem and policy changes.
- Compliance and latency: meet regulatory requirements and optimize latency by deploying closer to users.
Architecture options
- On-device inference: run lightweight models directly in the browser or mobile apps using frameworks like TensorFlow Lite, Core ML, or ONNX Runtime Mobile.
- Self-hosted inference: host larger models on your own servers or private cloud using containerized deployments (e.g., Docker) or optimized runtimes.
- Cloud-based non-OpenAI providers: use alternative services (e.g., Cohere, AI21 Studio, or others) with hosted endpoints.
- Hybrid: combine edge inference for common tasks with cloud backends for heavier workloads or specialized models.
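One way to keep these options interchangeable is to hide them behind a single interface in your app code. The sketch below illustrates the idea with placeholder backends; the names and routing are assumptions, not a prescribed design.

// Placeholder backends; swap in real implementations for your chosen path.
const backends = {
  'on-device': async (prompt) => `on-device output for: ${prompt}`,
  'self-hosted': async (prompt) => `self-hosted output for: ${prompt}`,
  'cloud': async (prompt) => `cloud output for: ${prompt}`,
};

// One entry point the rest of the app calls, regardless of where inference runs.
async function generate(prompt, { backend = 'self-hosted' } = {}) {
  const impl = backends[backend];
  if (!impl) throw new Error(`Unknown backend: ${backend}`);
  return impl(prompt);
}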
Getting started
- Define the feature: text generation, summarization, translation, classification, or embeddings.
- Choose your path: on-device, self-hosted, or third-party provider.
- Select a model or service: lightweight on-device models, moderately sized self-hosted models, or reputable cloud providers.
- Set up infrastructure: choose your runtime (mobile, web, server) and set up endpoints or bundles.
- Build and integrate: implement API calls, handle prompts and inputs, and incorporate safety checks.
- Monitor and iterate: track latency, accuracy, and costs; refine prompts and models as needed.
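As a concrete starting point, a small configuration object can capture those decisions in one place. The field names and values below are purely illustrative, not a required schema.

// Example configuration capturing the choices above (illustrative only).
const aiConfig = {
  feature: 'summarization',                   // what the feature does
  path: 'self-hosted',                        // 'on-device' | 'self-hosted' | 'cloud'
  model: 'an-open-model-of-your-choice',      // placeholder model identifier
  endpoint: 'http://localhost:8000/generate', // where requests are sent
  maxTokens: 200,
  timeoutMs: 10000,
};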
On-device inference
Benefits:
- Privacy: data doesn’t leave the device.
- Latency: reduced round-trips for responsive UX.
- Compliance: easier to meet data residency requirements.
Approaches:
- Mobile: TensorFlow Lite, Core ML, or ONNX Runtime Mobile.
- Web: ONNX Runtime Web or TensorFlow.js for browser-based inference (with smaller models or WASM acceleration); see the sketch after this list.
- Considerations: model size, hardware constraints, power usage, and occasional model updates.
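For browser inference, ONNX Runtime Web follows a load-once, run-many pattern. The sketch below assumes a small text-classification model exported to ONNX with an input named input_ids and an output named logits; the model path, input/output names, and the toy tokenizer are placeholders you would replace with the artifacts that ship with your actual model.

import * as ort from 'onnxruntime-web';

// Load the model once and reuse the session across calls.
const sessionPromise = ort.InferenceSession.create('/models/classifier.onnx');

// Placeholder tokenizer: real models need the tokenizer that ships with them.
function tokenize(text) {
  // Maps characters to fake ids just to produce a tensor; NOT a real tokenizer.
  return BigInt64Array.from([...text].map((c) => BigInt(c.charCodeAt(0))));
}

async function classify(text) {
  const session = await sessionPromise;
  const ids = tokenize(text);
  const feeds = {
    input_ids: new ort.Tensor('int64', ids, [1, ids.length]),
  };
  const results = await session.run(feeds);
  // Output names are model-specific; 'logits' is an assumption here.
  return results.logits.data;
}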
Tips:
- Start with distilled or quantized models designed for mobile.
- Use streaming generation where possible to improve perceived responsiveness.
- Cache frequent responses locally to reduce repeated inferences.
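The caching tip can start as something as small as an in-memory map keyed by the prompt; this sketch assumes exact-match prompts are common enough to be worth reusing.

// Minimal in-memory cache for repeated prompts (exact matches only).
const responseCache = new Map();

async function cachedGenerate(prompt, generate) {
  if (responseCache.has(prompt)) {
    return responseCache.get(prompt); // reuse a previous inference
  }
  const result = await generate(prompt);
  responseCache.set(prompt, result);
  return result;
}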
Self-hosted inference
Benefits:
- Greater control over data and models.
- Ability to run larger or more capable models than on-device deployment allows.
- Customization and fine-tuning opportunities.
Stack options:
- Lightweight web services: FastAPI/Flask (Python) or Node.js servers exposing a /generate or /embed endpoint.
- Model ecosystems: GPT-NeoX, LLaMA, Mistral, or other open-source models (respect licenses).
- Runtimes: NVIDIA Triton, Hugging Face Text Generation Inference, or CPU/GPU-optimized containers.
Important considerations:
- Hardware requirements (CPU vs GPU, VRAM, memory).
- Latency vs throughput trade-offs.
- Model management, versioning, and monitoring.
- Safety and abuse prevention controls.
Example outline:
- Run a model container locally or in your private cloud.
- Expose a REST or GraphQL endpoint.
- Integrate your app to call the endpoint with prompts and stream responses if supported.
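As a concrete version of that outline, here is a minimal Node.js/Express server exposing a /generate endpoint. The runInference function is a placeholder for whatever runtime you actually host (Triton, a containerized open-source model, and so on); the port matches the curl example later in this post.

// server.js: minimal /generate endpoint (illustrative sketch).
const express = require('express');
const app = express();
app.use(express.json());

// Placeholder: replace with a call into your hosted model runtime.
async function runInference(prompt, maxTokens) {
  return `(generated text for: ${prompt.slice(0, 40)}..., up to ${maxTokens} tokens)`;
}

app.post('/generate', async (req, res) => {
  const { prompt, max_tokens = 200 } = req.body || {};
  if (typeof prompt !== 'string' || prompt.length === 0) {
    return res.status(400).json({ error: 'prompt is required' });
  }
  try {
    const text = await runInference(prompt, max_tokens);
    res.json({ text });
  } catch (err) {
    res.status(500).json({ error: 'inference failed' });
  }
});

app.listen(8000, () => console.log('inference API listening on :8000'));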
Cloud-based non-OpenAI providers
Providers to explore:
- Cohere
- AI21 Studio
- Others that offer text generation, summarization, or classification APIs
Key considerations: pricing, rate limits, data handling, and SLAs.
Guiding tips:
- Vet data handling and retention policies.
- Use regional endpoints to reduce latency.
- Combine with on-device or self-hosted for sensitive workflows.
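Whichever provider you choose, it helps to isolate the call behind a small client so you can swap vendors later. The sketch below assumes a generic JSON-over-HTTPS generation endpoint configured through environment variables; the URL, request fields, and auth scheme all vary by provider, so treat this as a shape rather than a working integration.

// Generic client for a hosted text-generation API (field names are assumptions).
const API_URL = process.env.AI_API_URL; // e.g., a regional endpoint near your users
const API_KEY = process.env.AI_API_KEY; // keep secrets out of source control

async function cloudGenerate(prompt, maxTokens = 200) {
  const res = await fetch(API_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ prompt, max_tokens: maxTokens }),
  });
  if (!res.ok) {
    throw new Error(`Provider returned ${res.status}`);
  }
  return res.json();
}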
Building a simple feature: text generation without OpenAI
- Use a local or hosted endpoint to generate text from a prompt.
- Implement prompt templates and a lightweight safety guardrail (a sketch follows the fetch example below).
Code example (fetching from a local/self-hosted endpoint):
# Call a local inference endpoint
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt":"Summarize the following article: ...","max_tokens":200}'
Code example (client-side fetch in a web app):
// Call the app's backend, which forwards the prompt to the inference service.
async function generateText(prompt) {
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, max_tokens: 200 })
  });
  if (!res.ok) {
    throw new Error(`Generation failed with status ${res.status}`);
  }
  return res.json();
}
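The prompt template and guardrail mentioned above can also stay very simple at first. The sketch below shows one possible shape, assuming a length cap plus a tiny blocklist; real guardrails usually layer several checks (moderation models, output filters, human review for sensitive flows). It reuses the generateText function from the example above.

// Simple prompt template for summarization.
function summarizePrompt(article) {
  return `Summarize the following article in three sentences:\n\n${article}`;
}

// Lightweight input guardrail: length cap plus a tiny blocklist (illustrative only).
const BLOCKED_TERMS = ['<script>', 'drop table'];

function checkInput(text) {
  if (text.length > 8000) {
    return { ok: false, reason: 'input too long' };
  }
  if (BLOCKED_TERMS.some((t) => text.toLowerCase().includes(t))) {
    return { ok: false, reason: 'blocked content' };
  }
  return { ok: true };
}

// Validate, apply the template, then call the endpoint as shown above.
async function summarize(article) {
  const check = checkInput(article);
  if (!check.ok) throw new Error(`Rejected: ${check.reason}`);
  return generateText(summarizePrompt(article));
}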
Performance, privacy, and cost considerations
- Performance: measure latency, throughput, and model quality; balance with device capabilities.
- Privacy: prefer on-device or private-cloud deployments for sensitive data.
- Cost: monitor API usage if using third-party services; consider model size and hardware costs for self-hosted options.
- Maintenance: plan for model updates, drift monitoring, and content safeguards.
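To make the latency point actionable, even a simple timing wrapper around the generation call gives you numbers to compare backends with. This sketch uses performance.now(), which is available in modern browsers and recent Node.js versions.

// Wrap any generate function to record how long each call takes.
async function timedGenerate(prompt, generate) {
  const start = performance.now();
  try {
    return await generate(prompt);
  } finally {
    const ms = performance.now() - start;
    // In production, send this to your metrics pipeline instead of the console.
    console.log(`generation latency: ${ms.toFixed(0)} ms`);
  }
}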
Security and governance
- Implement input validation and rate limiting to mitigate abuse.
- Use access controls and secrets management for endpoints.
- Log and audit AI interactions to meet compliance requirements.
- Establish content policies and guardrails to prevent unsafe outputs.
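As one concrete example of the rate-limiting point, here is a very small per-IP limiter written as Express middleware. It is a sketch only: the counters live in process memory, so they reset on restart and are not shared across instances; production setups usually rely on a shared store or a dedicated library.

// Naive per-IP rate limiter (illustrative; use a shared store in production).
const WINDOW_MS = 60 * 1000; // one-minute window
const MAX_REQUESTS = 30;     // per IP per window
const hits = new Map();      // ip -> { count, windowStart }

function rateLimit(req, res, next) {
  const now = Date.now();
  const entry = hits.get(req.ip) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;
    entry.windowStart = now;
  }
  entry.count += 1;
  hits.set(req.ip, entry);
  if (entry.count > MAX_REQUESTS) {
    return res.status(429).json({ error: 'rate limit exceeded' });
  }
  next();
}

// Attach it to the generation route, for example:
// app.post('/generate', rateLimit, handler);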
Next steps
- Pick a single AI feature to start (e.g., text summarization) and choose an implementation path (on-device, self-hosted, or cloud provider).
- Prototype quickly with a minimal model and a simple API, then iterate on latency and accuracy.
- Plan for data privacy, cost management, and governance from day one.
- As you grow, consider a hybrid architecture to balance privacy, performance, and scale.