Integrating AI Features in Your App Without Using OpenAI
#ai
#opensource
#integration
Overview
Adding AI features to an app doesn’t have to mean tying yourself to a single vendor. This guide explores approaches to integrate AI capabilities—such as text generation, summarization, and classification—without using OpenAI. You’ll learn about on-device inference, self-hosted models, and alternative cloud providers, plus practical steps to choose the right path for your use case.
Why avoid OpenAI
There are several reasons teams consider non-OpenAI options:
- Data privacy and control: keep sensitive user data within your own environment or trusted providers.
- Cost predictability: avoid variable API costs tied to usage.
- Vendor independence: reduce reliance on a single ecosystem and policy changes.
- Compliance and latency: meet regulatory requirements and optimize latency by deploying closer to users.
Architecture options
- On-device inference: run lightweight models directly in the browser or mobile apps using frameworks like TensorFlow Lite, Core ML, or ONNX Runtime Mobile.
- Self-hosted inference: host larger models on your own servers or private cloud using containerized deployments (e.g., Docker) or optimized runtimes.
- Cloud-based non-OpenAI providers: use alternative services (e.g., Cohere, AI21 Studio, or others) with hosted endpoints.
- Hybrid: combine edge inference for common tasks with cloud backends for heavier workloads or specialized models.
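One way to keep these options interchangeable is to hide them behind a single interface in your app code. The sketch below illustrates the idea with placeholder backends; the names and routing are assumptions, not a prescribed design.

// Placeholder backends; swap in real implementations for your chosen path.
const backends = {
  'on-device': async (prompt) => `on-device output for: ${prompt}`,
  'self-hosted': async (prompt) => `self-hosted output for: ${prompt}`,
  'cloud': async (prompt) => `cloud output for: ${prompt}`,
};

// One entry point the rest of the app calls, regardless of where inference runs.
async function generate(prompt, { backend = 'self-hosted' } = {}) {
  const impl = backends[backend];
  if (!impl) throw new Error(`Unknown backend: ${backend}`);
  return impl(prompt);
}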
Getting started
- Define the feature: text generation, summarization, translation, classification, or embeddings.
- Choose your path: on-device, self-hosted, or third-party provider.
- Select a model or service: lightweight on-device models, moderately sized self-hosted models, or reputable cloud providers.
- Set up infrastructure: choose your runtime (mobile, web, server) and set up endpoints or bundles.
- Build and integrate: implement API calls, handle prompts and inputs, and incorporate safety checks.
- Monitor and iterate: track latency, accuracy, and costs; refine prompts and models as needed.
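As a concrete starting point, a small configuration object can capture those decisions in one place. The field names and values below are purely illustrative, not a required schema.

// Example configuration capturing the choices above (illustrative only).
const aiConfig = {
  feature: 'summarization',                   // what the feature does
  path: 'self-hosted',                        // 'on-device' | 'self-hosted' | 'cloud'
  model: 'an-open-model-of-your-choice',      // placeholder model identifier
  endpoint: 'http://localhost:8000/generate', // where requests are sent
  maxTokens: 200,
  timeoutMs: 10000,
};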
On-device inference
Benefits:
- Privacy: data doesn’t leave the device.
- Latency: reduced round-trips for responsive UX.
- Compliance: easier to meet data residency requirements.
Approaches:
- Mobile: TensorFlow Lite, Core ML, or ONNX Runtime Mobile.
- Web: ONNX Runtime Web or TensorFlow.js for browser-based inference (with smaller models or WASM acceleration); see the sketch after this list.
- Considerations: model size, hardware constraints, power usage, and occasional model updates.
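For browser inference, ONNX Runtime Web follows a load-once, run-many pattern. The sketch below assumes a small text-classification model exported to ONNX with an input named input_ids and an output named logits; the model path, input/output names, and the toy tokenizer are placeholders you would replace with the artifacts that ship with your actual model.

import * as ort from 'onnxruntime-web';

// Load the model once and reuse the session across calls.
const sessionPromise = ort.InferenceSession.create('/models/classifier.onnx');

// Placeholder tokenizer: real models need the tokenizer that ships with them.
function tokenize(text) {
  // Maps characters to fake ids just to produce a tensor; NOT a real tokenizer.
  return BigInt64Array.from([...text].map((c) => BigInt(c.charCodeAt(0))));
}

async function classify(text) {
  const session = await sessionPromise;
  const ids = tokenize(text);
  const feeds = {
    input_ids: new ort.Tensor('int64', ids, [1, ids.length]),
  };
  const results = await session.run(feeds);
  // Output names are model-specific; 'logits' is an assumption here.
  return results.logits.data;
}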
Tips:
- Start with distilled or quantized models designed for mobile.
- Use streaming generation where possible to improve perceived responsiveness.
- Cache frequent responses locally to reduce repeated inferences.
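The caching tip can start as something as small as an in-memory map keyed by the prompt; this sketch assumes exact-match prompts are common enough to be worth reusing.

// Minimal in-memory cache for repeated prompts (exact matches only).
const responseCache = new Map();

async function cachedGenerate(prompt, generate) {
  if (responseCache.has(prompt)) {
    return responseCache.get(prompt); // reuse a previous inference
  }
  const result = await generate(prompt);
  responseCache.set(prompt, result);
  return result;
}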
Self-hosted inference
Benefits:
- Greater control over data and models.
- Ability to run larger or more capable models than on-device deployment allows.
- Customization and fine-tuning opportunities.
Stack options:
- Lightweight web services: FastAPI/Flask (Python) or Node.js servers exposing a /generate or /embed endpoint.
- Model ecosystems: GPT-NeoX, LLaMA, Mistral, or other open-source models (respect licenses).
- Runtimes: NVIDIA Triton, Hugging Face Text Generation Inference, or CPU/GPU-optimized containers.
Important considerations:
- Hardware requirements (CPU vs GPU, VRAM, memory).
- Latency vs throughput trade-offs.
- Model management, versioning, and monitoring.
- Safety and abuse prevention controls.
Example outline:
- Run a model container locally or in your private cloud.
- Expose a REST or GraphQL endpoint.
- Integrate your app to call the endpoint with prompts and stream responses if supported.
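As a concrete version of that outline, here is a minimal Node.js/Express server exposing a /generate endpoint. The runInference function is a placeholder for whatever runtime you actually host (Triton, a containerized open-source model, and so on); the port matches the curl example later in this post.

// server.js: minimal /generate endpoint (illustrative sketch).
const express = require('express');
const app = express();
app.use(express.json());

// Placeholder: replace with a call into your hosted model runtime.
async function runInference(prompt, maxTokens) {
  return `(generated text for: ${prompt.slice(0, 40)}..., up to ${maxTokens} tokens)`;
}

app.post('/generate', async (req, res) => {
  const { prompt, max_tokens = 200 } = req.body || {};
  if (typeof prompt !== 'string' || prompt.length === 0) {
    return res.status(400).json({ error: 'prompt is required' });
  }
  try {
    const text = await runInference(prompt, max_tokens);
    res.json({ text });
  } catch (err) {
    res.status(500).json({ error: 'inference failed' });
  }
});

app.listen(8000, () => console.log('inference API listening on :8000'));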
Cloud-based non-OpenAI providers
Providers to explore:
- Cohere
- AI21 Studio
- Others that offer text generation, summarization, or classification APIs
Key considerations: pricing, rate limits, data handling, and SLAs.
Guiding tips:
- Vet data handling and retention policies.
- Use regional endpoints to reduce latency.
- Combine with on-device or self-hosted for sensitive workflows.
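Whichever provider you choose, it helps to isolate the call behind a small client so you can swap vendors later. The sketch below assumes a generic JSON-over-HTTPS generation endpoint configured through environment variables; the URL, request fields, and auth scheme all vary by provider, so treat this as a shape rather than a working integration.

// Generic client for a hosted text-generation API (field names are assumptions).
const API_URL = process.env.AI_API_URL; // e.g., a regional endpoint near your users
const API_KEY = process.env.AI_API_KEY; // keep secrets out of source control

async function cloudGenerate(prompt, maxTokens = 200) {
  const res = await fetch(API_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ prompt, max_tokens: maxTokens }),
  });
  if (!res.ok) {
    throw new Error(`Provider returned ${res.status}`);
  }
  return res.json();
}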
Building a simple feature: text generation without OpenAI
- Use a local or hosted endpoint to generate text from a prompt.
- Implement prompt templates and a lightweight safety guardrail (a sketch follows the fetch example below).
Code example (fetching from a local/self-hosted endpoint):
# Call a local inference endpoint
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt":"Summarize the following article: ...","max_tokens":200}'
Code example (client-side fetch in a web app):
// Call the app's backend, which forwards the prompt to the inference service.
async function generateText(prompt) {
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, max_tokens: 200 })
  });
  if (!res.ok) {
    throw new Error(`Generation failed with status ${res.status}`);
  }
  return res.json();
}
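The prompt template and guardrail mentioned above can also stay very simple at first. The sketch below shows one possible shape, assuming a length cap plus a tiny blocklist; real guardrails usually layer several checks (moderation models, output filters, human review for sensitive flows). It reuses the generateText function from the example above.

// Simple prompt template for summarization.
function summarizePrompt(article) {
  return `Summarize the following article in three sentences:\n\n${article}`;
}

// Lightweight input guardrail: length cap plus a tiny blocklist (illustrative only).
const BLOCKED_TERMS = ['<script>', 'drop table'];

function checkInput(text) {
  if (text.length > 8000) {
    return { ok: false, reason: 'input too long' };
  }
  if (BLOCKED_TERMS.some((t) => text.toLowerCase().includes(t))) {
    return { ok: false, reason: 'blocked content' };
  }
  return { ok: true };
}

// Validate, apply the template, then call the endpoint as shown above.
async function summarize(article) {
  const check = checkInput(article);
  if (!check.ok) throw new Error(`Rejected: ${check.reason}`);
  return generateText(summarizePrompt(article));
}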
Performance, privacy, and cost considerations
- Performance: measure latency, throughput, and model quality; balance with device capabilities.
- Privacy: prefer on-device or private-cloud deployments for sensitive data.
- Cost: monitor API usage if using third-party services; consider model size and hardware costs for self-hosted options.
- Maintenance: plan for model updates, drift monitoring, and content safeguards.
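To make the latency point actionable, even a simple timing wrapper around the generation call gives you numbers to compare backends with. This sketch uses performance.now(), which is available in modern browsers and recent Node.js versions.

// Wrap any generate function to record how long each call takes.
async function timedGenerate(prompt, generate) {
  const start = performance.now();
  try {
    return await generate(prompt);
  } finally {
    const ms = performance.now() - start;
    // In production, send this to your metrics pipeline instead of the console.
    console.log(`generation latency: ${ms.toFixed(0)} ms`);
  }
}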
Security and governance
- Implement input validation and rate limiting to mitigate abuse.
- Use access controls and secrets management for endpoints.
- Log and audit AI interactions to meet compliance requirements.
- Establish content policies and guardrails to prevent unsafe outputs.
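As one concrete example of the rate-limiting point, here is a very small per-IP limiter written as Express middleware. It is a sketch only: the counters live in process memory, so they reset on restart and are not shared across instances; production setups usually rely on a shared store or a dedicated library.

// Naive per-IP rate limiter (illustrative; use a shared store in production).
const WINDOW_MS = 60 * 1000; // one-minute window
const MAX_REQUESTS = 30;     // per IP per window
const hits = new Map();      // ip -> { count, windowStart }

function rateLimit(req, res, next) {
  const now = Date.now();
  const entry = hits.get(req.ip) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;
    entry.windowStart = now;
  }
  entry.count += 1;
  hits.set(req.ip, entry);
  if (entry.count > MAX_REQUESTS) {
    return res.status(429).json({ error: 'rate limit exceeded' });
  }
  next();
}

// Attach it to the generation route, for example:
// app.post('/generate', rateLimit, handler);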
Next steps
- Pick a single AI feature to start (e.g., text summarization) and choose an implementation path (on-device, self-hosted, or cloud provider).
- Prototype quickly with a minimal model and a simple API, then iterate on latency and accuracy.
- Plan for data privacy, cost management, and governance from day one.
- As you grow, consider a hybrid architecture to balance privacy, performance, and scale.