AISENSE AI Data Feed Specification v1.0

Overview

The AISENSE AI Data Feed format is a machine-readable JSON structure designed to make text content easily ingestible by AI systems such as large language models, RAG pipelines, search engines, and knowledge graph builders.

The format focuses on:

  • clean text extraction

  • structured metadata

  • schema.org compatibility

  • AI ingestion hints

  • source attribution

The goal is to allow AI systems to consume content without complex HTML parsing.


Document Structure

Each feed item is published as a JSON document.

Example:

{
  "content": {
    "text": "Example text content.",
    "summary": "Short summary of the content.",
    "keywords": ["example", "ai", "content"],
    "entities": [],
    "links": [],
    "structured_data": {}
  },
  "structure": {
    "@context": "https://schema.org",
    "@type": "Article"
  },
  "ai_meta": {
    "token_est": 100,
    "chars": 650,
    "crawler_hint": "normal",
    "richness_score": 3,
    "embedding_ready": true
  }
}
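
A minimal Python sketch of reading such a document, using only the standard library. The function name `load_feed_item` is hypothetical, and the only rule it enforces is the one stated in this specification: `content.text` is required and must be a string.

```python
import json

# Feed item copied from the example above.
RAW_ITEM = """
{
  "content": {
    "text": "Example text content.",
    "summary": "Short summary of the content.",
    "keywords": ["example", "ai", "content"],
    "entities": [],
    "links": [],
    "structured_data": {}
  },
  "structure": {"@context": "https://schema.org", "@type": "Article"},
  "ai_meta": {
    "token_est": 100,
    "chars": 650,
    "crawler_hint": "normal",
    "richness_score": 3,
    "embedding_ready": true
  }
}
"""


def load_feed_item(raw: str) -> dict:
    """Parse a feed item and check that content.text is present (hypothetical helper)."""
    item = json.loads(raw)
    content = item.get("content")
    if not isinstance(content, dict):
        raise ValueError("missing 'content' object")
    if not isinstance(content.get("text"), str):
        raise ValueError("content.text is required and must be a string")
    return item


item = load_feed_item(RAW_ITEM)
print(item["content"]["text"])
```

No HTML parsing is involved: the consumer reads the text directly from the `content.text` field.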

Top-Level Fields

content

Contains the primary textual data and semantic metadata.

text

The cleaned main text content.

Type: string
Required: yes

Example:

"text": "Example article content."
 

summary

Short machine-generated or user-provided summary.

Type: string
Required: no


keywords

Keywords describing the content topic.

Type: array of strings

Example:

"keywords": ["ai", "knowledge", "tutorial"]
 

entities

Named entities extracted from the text.

Possible types:

  • Person

  • Organization

  • Location

  • Product

  • Event

Type: array of objects

Example:

"entities": [
  {
    "type": "Organization",
    "name": "Example Corp"
  }
]
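
One way a consumer might use this field is to group entity names by type. The sketch below assumes each entity is an object with `type` and `name` keys, as in the example; the function name `group_entities` and the "Other" bucket for unlisted types are hypothetical choices, not part of the format.

```python
from collections import defaultdict

# Entity types listed in this specification.
ENTITY_TYPES = ("Person", "Organization", "Location", "Product", "Event")


def group_entities(entities: list) -> dict:
    """Group entity names by type; unlisted types go to an assumed 'Other' bucket."""
    grouped = defaultdict(list)
    for entity in entities:
        entity_type = entity.get("type")
        key = entity_type if entity_type in ENTITY_TYPES else "Other"
        grouped[key].append(entity.get("name", ""))
    return dict(grouped)


entities = [
    {"type": "Organization", "name": "Example Corp"},
    {"type": "Person", "name": "Jane Doe"},
    {"type": "Organization", "name": "Sample Ltd"},
]
print(group_entities(entities))
```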

links

Links related to the content.

Typical uses:

  • original source

  • raw text version

  • related resources

Type: array of URLs

Example:

"links": [
  "https://example.com/article",
  "https://data.example.com/article.txt"
]
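
Consumers may want to check links before following them. The sketch below uses Python's standard `urllib.parse.urlsplit`. The requirement that links be absolute `http` or `https` URLs is an assumption made for this example, and `is_absolute_http_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(value) -> bool:
    """Return True for absolute http(s) URLs (an assumed rule, not mandated by the format)."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs, e.g. bad IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


links = [
    "https://example.com/article",
    "https://data.example.com/article.txt",
    "article.txt",
]
valid_links = [link for link in links if is_absolute_http_url(link)]
print(valid_links)
```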

structured_data

Schema.org compatible structured metadata.

Type: object

Example:

"structured_data": {
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is AI?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Artificial intelligence is..."
      }
    }
  ]
}
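
For question-and-answer content shaped like the example above, a consumer could flatten `mainEntity` into pairs. This is a sketch under the assumption that questions follow exactly that shape; `extract_qa_pairs` is a hypothetical helper name.

```python
def extract_qa_pairs(structured_data: dict) -> list:
    """Collect (question, answer) tuples from a mainEntity list of Question objects."""
    pairs = []
    for entity in structured_data.get("mainEntity", []):
        if entity.get("@type") != "Question":
            continue  # ignore anything that is not a Question
        answer = entity.get("acceptedAnswer") or {}
        pairs.append((entity.get("name", ""), answer.get("text", "")))
    return pairs


structured_data = {
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is AI?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Artificial intelligence is...",
            },
        }
    ]
}
print(extract_qa_pairs(structured_data))
```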

structure

Defines the semantic structure of the document using schema.org.

Type: object

Example:

"structure": {
  "@context": "https://schema.org",
  "@type": "FAQ"
}

Possible values:

  • Article

  • FAQ

  • HowTo

  • Product

  • Dataset

  • Documentation

  • Guide
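
A producer or consumer could check the structure object against the values listed above. The sketch below treats that list as closed, which is an assumption (the specification says "possible values" without stating that others are forbidden). `check_structure` is a hypothetical helper name.

```python
# Values listed in this specification; treating the list as exhaustive is an assumption.
STRUCTURE_TYPES = (
    "Article", "FAQ", "HowTo", "Product", "Dataset", "Documentation", "Guide",
)


def check_structure(structure: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if structure.get("@context") != "https://schema.org":
        problems.append("@context should be https://schema.org")
    if structure.get("@type") not in STRUCTURE_TYPES:
        problems.append("@type is not one of the listed values")
    return problems


print(check_structure({"@context": "https://schema.org", "@type": "FAQ"}))
print(check_structure({"@context": "https://schema.org", "@type": "Recipe"}))
```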


ai_meta

Metadata specifically intended for AI ingestion pipelines.

token_est

Estimated token size of the content.

Used by AI pipelines to estimate processing cost.

Type: integer


chars

Character length of the main content.

Type: integer
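
The specification does not say how these two numbers are computed. The sketch below shows one possible approach: `chars` is taken as the number of Unicode code points in the text, and `token_est` is derived with a rough characters-per-token ratio. Both choices, the default ratio of 4.0, and the function name `estimate_ai_meta` are assumptions for illustration only; real tokenizers vary by model and language.

```python
import math


def estimate_ai_meta(text: str, chars_per_token: float = 4.0) -> dict:
    """Estimate chars and token_est for a text (assumed heuristic, not defined by the format)."""
    chars = len(text)  # counts Unicode code points, not bytes
    token_est = math.ceil(chars / chars_per_token) if chars else 0
    return {"chars": chars, "token_est": token_est}


print(estimate_ai_meta("Example article content."))
```

A producer that knows the target tokenizer would likely replace the ratio with an exact count.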


crawler_hint

Hint for crawlers about content density.

Possible values:

low = low-density content
normal = normal content
rich = rich content


richness_score

Approximate semantic richness of the content.

Scale example:

1 = very simple text
3 = normal content
5 = complex structured content


embedding_ready

Indicates whether the text is clean enough to be directly embedded.

Type: boolean

Example:

 
"embedding_ready": true
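
The ai_meta fields described in this section could be checked with a small validator such as the sketch below. Several rules in it are assumptions rather than statements from the specification: that `crawler_hint` takes the string values "low", "normal" and "rich" (only "normal" appears in an example), that `richness_score` is an integer from 1 to 5, and that every field may be absent. `validate_ai_meta` is a hypothetical helper name.

```python
# Assumed value set; only "normal" appears in the specification's example.
CRAWLER_HINTS = ("low", "normal", "rich")


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_ai_meta(ai_meta: dict) -> list:
    """Return a list of problems found in an ai_meta object; fields that are absent are skipped."""
    problems = []
    for field in ("token_est", "chars"):
        if field in ai_meta:
            value = ai_meta[field]
            if not (_is_int(value) and value >= 0):
                problems.append(f"{field} must be a non-negative integer")
    if "crawler_hint" in ai_meta and ai_meta["crawler_hint"] not in CRAWLER_HINTS:
        problems.append("crawler_hint must be one of: low, normal, rich")
    if "richness_score" in ai_meta:
        score = ai_meta["richness_score"]
        if not (_is_int(score) and 1 <= score <= 5):
            problems.append("richness_score must be an integer from 1 to 5")
    if "embedding_ready" in ai_meta and not isinstance(ai_meta["embedding_ready"], bool):
        problems.append("embedding_ready must be a boolean")
    return problems


example = {
    "token_est": 100,
    "chars": 650,
    "crawler_hint": "normal",
    "richness_score": 3,
    "embedding_ready": True,
}
print(validate_ai_meta(example))
```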
 

Optional Field

reasoning

Explanation of how the system structured the content.

Used for transparency and debugging.

Example:

 
"reasoning": "The content was structured as FAQ because it contains question-answer pairs."
 

File Naming Convention

Pattern:

content_<timestamp>_<hash>.json
 

Example:

content_1773235424_fab76247faa5.json
 

Where:

timestamp = Unix timestamp
hash = unique identifier
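
File names following this convention can be built and parsed with a few lines of Python. The sketch assumes the identifier consists of lowercase letters and digits, as in the example above; the specification itself only calls it a unique identifier. `build_filename` and `parse_filename` are hypothetical helper names.

```python
import re

# Assumes the identifier uses lowercase letters and digits only.
FILENAME_RE = re.compile(r"^content_(\d+)_([0-9a-z]+)\.json$")


def build_filename(timestamp: int, item_hash: str) -> str:
    """Build a file name of the form content_<timestamp>_<hash>.json."""
    return f"content_{timestamp}_{item_hash}.json"


def parse_filename(name: str) -> tuple:
    """Split a feed file name into (timestamp, hash); raise ValueError if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a feed file name: {name!r}")
    return int(match.group(1)), match.group(2)


print(parse_filename("content_1773235424_fab76247faa5.json"))
```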


Feed Layout Example

 
/content
  /2026
    /03
      /11
        content_1773235424_xxxxx.json
 

This allows crawlers to ingest new data efficiently by date.
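
The date directories can be derived from the timestamp in the file name. The sketch below assumes the date is taken in UTC, which the specification does not state; the example timestamp 1773235424 does fall on 2026-03-11 in UTC, matching the layout above. `feed_path` and its `root` parameter are hypothetical names.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def feed_path(timestamp: int, item_hash: str, root: str = "/content") -> str:
    """Build <root>/<YYYY>/<MM>/<DD>/content_<timestamp>_<hash>.json (UTC date assumed)."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    filename = f"content_{timestamp}_{item_hash}.json"
    return str(PurePosixPath(root) / day.strftime("%Y/%m/%d") / filename)


print(feed_path(1773235424, "fab76247faa5"))
```

A crawler that already ingested everything up to a given day only needs to list the directories for later dates.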


Typical AI Pipeline

Example ingestion workflow:

 
crawl feed

download JSON

read content.text

generate embedding

store in vector database

link back to source
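
The workflow above can be sketched in Python as follows. Everything here is illustrative: `toy_embedding` is a deterministic placeholder rather than a real embedding model, the "vector database" is a plain list, the JSON is assumed to be already downloaded, and treating the first entry in `links` as the source is an assumption (the specification lists the original source only as a typical use of that field).

```python
import hashlib
import json


def toy_embedding(text: str, dims: int = 8) -> list:
    """Placeholder embedding: deterministic numbers from a hash, NOT a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def ingest(raw_json: str, store: list) -> dict:
    """Turn one downloaded feed item into a stored record."""
    item = json.loads(raw_json)                 # parse the downloaded JSON
    content = item["content"]
    text = content["text"]                      # read content.text
    links = content.get("links", [])
    record = {
        "vector": toy_embedding(text),          # generate embedding (placeholder)
        "text": text,
        "source": links[0] if links else None,  # link back to source (assumed: first link)
    }
    store.append(record)                        # store in vector database (here: a list)
    return record


store = []
raw = json.dumps({
    "content": {
        "text": "Example text content.",
        "links": ["https://example.com/article"],
    }
})
record = ingest(raw, store)
print(record["source"], len(record["vector"]))
```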
 
 

Design Goals

The format is designed to:

  • minimize parsing complexity

  • preserve attribution

  • support schema.org semantics

  • enable fast ingestion into AI systems

  • remain human-readable


Live system

Resources:

AISENSE – AI DATA FEED GENERATOR


 

License and Usage

The AISENSE AI Data Feed format is intended as an open format that can be implemented by any platform or publisher.

Extended specification

No dependency on AISENSE infrastructure is required to adopt the format.
