AISENSE AI Data Feed Specification v1.0

Purpose

The AISENSE AI Data Feed format defines a machine-readable JSON structure designed to make text content easily ingestible by AI systems such as:

  • Large Language Models (LLMs)

  • Retrieval Augmented Generation (RAG) systems

  • Knowledge graph builders

  • AI search engines

  • Semantic indexing pipelines

The format is designed to reduce the need for HTML parsing and enable direct AI ingestion.


Versioning

Every feed item must include a version identifier in the spec_version field.

Example:

 
{
"spec_version": "1.0",
"content": { ... }
}
 

Version rules:

  • Minor version changes (for example, 1.0 to 1.1) do not break compatibility

  • Major version changes (for example, 1.x to 2.0) may introduce structural changes
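
A consumer can apply the version rules above by comparing only the major component. The following is a minimal sketch; the function names and the SUPPORTED_MAJOR constant are illustrative assumptions and not part of the format.

```python
SUPPORTED_MAJOR = 1  # assumption: this consumer implements the 1.x line


def parse_spec_version(value: str) -> tuple[int, int]:
    """Split a "major.minor" string such as "1.0" into two integers."""
    major, minor = value.split(".")
    return int(major), int(minor)


def is_compatible(item: dict) -> bool:
    """Accept any minor revision of the supported major version."""
    try:
        major, _minor = parse_spec_version(item["spec_version"])
    except (KeyError, ValueError, AttributeError):
        return False  # missing or malformed version identifier
    return major == SUPPORTED_MAJOR
```

With this check, a 1.7 item would still be accepted by a 1.0 consumer, while a 2.0 item would be rejected until the consumer is updated.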


Document Structure

Each feed item is a standalone JSON document.

Example:

 
{

"spec_version": "1.0",

"content": {
"text": "Example article content.",
"summary": "Short summary of the content.",
"keywords": ["ai","example"],
"entities": [],
"links": []
},

"structure": {
"@context": "https://schema.org",
"@type": "Article"
},

"ai_meta": {
"token_est": 120,
"chars": 780,
"crawler_hint": "normal",
"richness_score": 3,
"embedding_ready": true
}
}
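
For publishers, a feed item like the one above could be assembled programmatically. This sketch assumes a hypothetical helper named build_feed_item; it leaves out token_est and richness_score because the format does not state how they are computed.

```python
import json


def build_feed_item(text: str, summary: str = "", keywords=None) -> dict:
    """Assemble a feed item with the fields shown in the example above."""
    return {
        "spec_version": "1.0",
        "content": {
            "text": text,
            "summary": summary,
            "keywords": list(keywords or []),
            "entities": [],
            "links": [],
        },
        "structure": {"@context": "https://schema.org", "@type": "Article"},
        "ai_meta": {
            "chars": len(text),          # character length of the main text
            "crawler_hint": "normal",
            "embedding_ready": True,
        },
    }


item = build_feed_item("Example article content.", "Short summary.", ["ai", "example"])
print(json.dumps(item, indent=2))
```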
 

Core Fields

spec_version

Specifies the version of the AISENSE AI Data Feed format.

Type: string
Required: yes

Example:

 
"spec_version": "1.0"
 

content Object

Contains the main text content and semantic metadata.

text

Cleaned primary content text.

Type: string
Required: yes


summary

Short summary of the content.

Type: string
Optional


keywords

Keywords describing the content topic.

Type: array of strings

Example:

 
"keywords": ["AI","documentation","tutorial"]
 

entities

Named entities extracted from the text.

Supported types include:

  • Person

  • Organization

  • Location

  • Product

  • Event

Example:

 
"entities": [
{
"type": "Organization",
"name": "Example Corp"
}
]
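
On the consumer side, an entities array like the one above can be grouped by type before indexing. This is a small sketch; the function name entities_by_type is an assumption, and entries lacking a type or name are skipped here as a design choice rather than a rule of the format.

```python
def entities_by_type(entities: list[dict]) -> dict[str, list[str]]:
    """Group entity names by their "type" value, preserving input order."""
    grouped: dict[str, list[str]] = {}
    for entity in entities:
        etype = entity.get("type")
        name = entity.get("name")
        if not etype or not name:
            continue  # assumption: incomplete entries are ignored
        grouped.setdefault(etype, []).append(name)
    return grouped
```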
 

links

URLs related to the content.

Typical uses:

  • canonical source

  • raw text version

  • related documentation

Example:

 
"links": [
"https://example.com/article",
"https://data.example.com/article.txt"
]
 

structured_data

Schema.org compatible structured metadata.

Example:

 
"structured_data": {
"mainEntity": [
{
"@type": "Question",
"name": "What is AI?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Artificial intelligence refers to..."
}
}
]
}
 

structure Object

Defines the semantic structure of the document using schema.org.

Example:

 
"structure": {
"@context": "https://schema.org",
"@type": "FAQ"
}
 

Common types:

  • Article

  • FAQ

  • HowTo

  • Product

  • Dataset

  • Documentation

  • Guide


ai_meta Object

Metadata intended for AI ingestion pipelines.

token_est

Estimated token count of the text.

Used for cost estimation in LLM pipelines.

Type: integer


chars

Character length of the main text.

Type: integer
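
The format does not say how token_est should be derived. One possible approach, shown here as a rough sketch, divides the character count by an assumed average of four characters per token; that ratio is a heuristic assumption and varies by tokenizer and language.

```python
def ai_meta_counts(text: str, chars_per_token: float = 4.0) -> dict:
    """Return chars and a rough token_est for a piece of text.

    chars_per_token is an assumed heuristic, not part of the format;
    use a real tokenizer when exact counts matter.
    """
    chars = len(text)
    token_est = max(1, round(chars / chars_per_token)) if chars else 0
    return {"chars": chars, "token_est": token_est}
```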


crawler_hint

Hint describing the content density.

Allowed values:

 
low     low-density content
normal  normal content
rich    rich content
 

richness_score

Semantic richness score.

Example scale:

1 minimal content
3 standard content
5 highly structured content


embedding_ready

Indicates that the text is already cleaned and suitable for direct embedding.

Type: boolean


Optional Fields

reasoning

Explanation of how the content structure was generated.

Example:

 
"reasoning": "The content was structured as FAQ because it contains question-answer pairs."
 

This field is optional and mainly intended for debugging or transparency.


File Naming Convention

Recommended naming:

 
content_<timestamp>_<hash>.json
 

Example:

 
content_1773235424_fab76247faa5.json
 

Where:

timestamp = Unix timestamp in seconds
hash = unique identifier
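
The naming convention can be generated and parsed with a few lines of code. In this sketch the hash is taken to be the first 12 hex digits of a SHA-256 digest of the text; that choice is an assumption, since the format only requires a unique identifier. The function names are likewise illustrative.

```python
import hashlib
import re
import time
from typing import Optional

FILENAME_RE = re.compile(r"^content_(\d+)_([0-9A-Za-z]+)\.json$")


def make_filename(text: str, timestamp: Optional[int] = None) -> str:
    """Build content_<timestamp>_<hash>.json for a piece of content."""
    ts = int(time.time()) if timestamp is None else timestamp
    # Assumption: hash = first 12 hex digits of SHA-256 over the UTF-8 text.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"content_{ts}_{digest}.json"


def parse_filename(name: str) -> tuple[int, str]:
    """Return (timestamp, hash), or raise ValueError for a non-matching name."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a feed item filename: {name!r}")
    return int(match.group(1)), match.group(2)
```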


Feed Directory Layout

Example structure:

 
/content
  /2026
    /03
      /11
        content_1773235424_xxxxx.json
 

This structure allows efficient chronological crawling.
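
The year/month/day directories can be derived from the timestamp in the file name. The sketch below assumes the date is computed in UTC, which the format does not state explicitly; item_path is a hypothetical helper name.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def item_path(filename: str, timestamp: int, root: str = "/content") -> PurePosixPath:
    """Place a feed item under <root>/<YYYY>/<MM>/<DD>/ based on its timestamp."""
    # Assumption: directory dates are derived in UTC.
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return PurePosixPath(root, f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename)


print(item_path("content_1773235424_fab76247faa5.json", 1773235424))
# /content/2026/03/11/content_1773235424_fab76247faa5.json
```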


Feed Discovery

Publishers should expose a discovery endpoint.

Example:

 
https://example.com/ai-feed.json
 

Example discovery file:

 
{
"spec_version": "1.0",
"feed_url": "https://data.example.com/content/",
"updated": "2026-03-11T10:00:00Z"
}
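
A crawler might read the discovery file as follows. This sketch parses an already downloaded JSON string and checks only the three fields shown above; the function name parse_discovery and the strictness of the checks are assumptions.

```python
import json
from datetime import datetime


def parse_discovery(raw: str) -> dict:
    """Parse a discovery document and check the three fields shown above."""
    doc = json.loads(raw)
    if not isinstance(doc, dict):
        raise ValueError("discovery file must be a JSON object")
    for key in ("spec_version", "feed_url", "updated"):
        if not isinstance(doc.get(key), str):
            raise ValueError(f"discovery file is missing string field {key!r}")
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    updated = datetime.fromisoformat(doc["updated"].replace("Z", "+00:00"))
    return {
        "spec_version": doc["spec_version"],
        "feed_url": doc["feed_url"],
        "updated": updated,
    }
```

Comparing the parsed updated value with the time of the last crawl lets a consumer skip feeds that have not changed.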
 

JSON Schema

Example simplified schema:

 
{
"type": "object",
"required": ["spec_version","content"],
"properties": {

"spec_version": {
"type": "string"
},

"content": {
"type": "object",
"required": ["text"],
"properties": {
"text": { "type": "string" },
"summary": { "type": "string" },
"keywords": {
"type": "array",
"items": { "type": "string" }
}
}
},

"ai_meta": {
"type": "object",
"properties": {
"token_est": { "type": "integer" },
"chars": { "type": "integer" },
"embedding_ready": { "type": "boolean" }
}
}
}
}
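
The constraints in the simplified schema can also be checked without a schema library. The following hand-written validator is a sketch that mirrors only the rules listed above; validate_item and its error messages are illustrative, and a full JSON Schema validator would be the more general choice.

```python
def validate_item(item) -> list[str]:
    """Return a list of problems; an empty list means the item passed."""
    if not isinstance(item, dict):
        return ["feed item must be a JSON object"]
    errors = []
    if not isinstance(item.get("spec_version"), str):
        errors.append("spec_version: required string")
    content = item.get("content")
    if not isinstance(content, dict):
        errors.append("content: required object")
    else:
        if not isinstance(content.get("text"), str):
            errors.append("content.text: required string")
        if "summary" in content and not isinstance(content["summary"], str):
            errors.append("content.summary: must be a string")
        keywords = content.get("keywords", [])
        if not (isinstance(keywords, list)
                and all(isinstance(k, str) for k in keywords)):
            errors.append("content.keywords: must be an array of strings")
    ai_meta = item.get("ai_meta", {})
    if not isinstance(ai_meta, dict):
        errors.append("ai_meta: must be an object")
    else:
        for key in ("token_est", "chars"):
            value = ai_meta.get(key)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if key in ai_meta and (isinstance(value, bool)
                                   or not isinstance(value, int)):
                errors.append(f"ai_meta.{key}: must be an integer")
        ready = ai_meta.get("embedding_ready")
        if "embedding_ready" in ai_meta and not isinstance(ready, bool):
            errors.append("ai_meta.embedding_ready: must be a boolean")
    return errors
```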

 

Example AI Ingestion Pipeline

Example workflow for AI systems consuming the feed:

 
discover feed

crawl new JSON files

extract content.text

generate embeddings

store in vector database

index metadata
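
The workflow above could look roughly like this in code. The sketch assumes the feed has already been mirrored to a local directory, so the discovery step is omitted. The embed function is a hypothetical placeholder for a real embedding model, and the returned list stands in for a vector database and metadata index.

```python
import json
from pathlib import Path


def embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def ingest(feed_root: Path, seen: set[str]) -> list[dict]:
    """Crawl unseen JSON files and return records ready for a vector store."""
    records = []
    for path in sorted(feed_root.rglob("content_*.json")):
        if path.name in seen:
            continue  # already ingested on a previous crawl
        item = json.loads(path.read_text(encoding="utf-8"))
        text = item["content"]["text"]  # extract content.text
        records.append({
            "id": path.name,
            "vector": embed(text),      # generate embeddings
            "metadata": {               # metadata to index alongside the vector
                "keywords": item["content"].get("keywords", []),
                "type": item.get("structure", {}).get("@type"),
            },
        })
        seen.add(path.name)
    return records
```

Passing the same seen set on every run keeps repeat crawls limited to new files.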
 

Design Principles

The format is designed to:

  • minimize parsing complexity

  • preserve source attribution

  • support semantic structure

  • enable fast ingestion for AI systems

  • remain human-readable


Open Adoption

The AISENSE AI Data Feed format is intended as an open format that can be implemented by any publisher or platform.

No dependency on AISENSE infrastructure is required.
