A powerful AI-driven package for automatically processing, analyzing, and tagging content from various sources including RSS feeds, websites, local files, and cloud storage.
The @memberjunction/content-autotagging package provides an extensible framework for ingesting content from diverse sources and leveraging AI models to extract meaningful tags, summaries, and metadata. Built on the MemberJunction platform, it helps organizations automatically organize and categorize their content for improved searchability and insights.
npm install @memberjunction/content-autotagging
@memberjunction/ai (2.43.0) - AI model integration
@memberjunction/aiengine (2.43.0) - AI processing pipeline
@memberjunction/core (2.43.0) - Core MemberJunction functionality
@memberjunction/core-entities (2.43.0) - Entity models
@memberjunction/global (2.43.0) - Global utilities
axios - HTTP requests
cheerio - HTML parsing and web scraping
pdf-parse - PDF document parsing
officeparser - Microsoft Office document parsing
rss-parser - RSS feed parsing
date-fns & date-fns-tz - Date manipulation and timezone handling
openai - OpenAI API integration
xml2js - XML parsing
crypto - Checksum generation

The package follows a modular architecture with these key components:
AutotagRSSFeed - RSS feed processing
AutotagWebsite - Website crawling and processing
AutotagLocalFileSystem - Local file processing
AutotagAzureBlob - Azure Blob Storage integration

import { AutotagRSSFeed } from '@memberjunction/content-autotagging';
import { UserInfo } from '@memberjunction/core';
const rssTagger = new AutotagRSSFeed();
const userContext: UserInfo = { /* your user context */ };
// Process all configured RSS feeds
await rssTagger.Autotag(userContext);
import { AutotagWebsite } from '@memberjunction/content-autotagging';
import { UserInfo } from '@memberjunction/core';
const websiteTagger = new AutotagWebsite();
const userContext: UserInfo = { /* your user context */ };
// Process all configured websites with crawling options
await websiteTagger.Autotag(userContext);
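The crawling options for a website source (MaxDepth, RootURL, URLPattern, and CrawlOtherSitesInTopLevelDomain) can be pictured as a filter applied to each discovered link. The sketch below is only an illustration of that idea, not the package's crawler: the `CrawlOptions` interface, the `shouldCrawl` and `registrableDomain` helpers, and the field types are all assumptions, and CrawlSitesInLowerLevelDomain is left out for brevity.

```typescript
// Hypothetical shape for the crawl settings of one website source.
// Field names mirror the parameters named in this README; the types are assumed.
interface CrawlOptions {
  RootURL: string;
  MaxDepth: number;
  URLPattern?: string; // regular expression source, e.g. "/blog/"
  CrawlOtherSitesInTopLevelDomain: boolean;
}

// Returns the last two labels of a hostname ("docs.example.com" -> "example.com").
// Naive approximation: it ignores multi-part suffixes such as "co.uk".
function registrableDomain(hostname: string): string {
  return hostname.split('.').slice(-2).join('.');
}

// Decides whether a discovered link should be followed (illustrative logic only).
function shouldCrawl(candidate: string, depth: number, options: CrawlOptions): boolean {
  if (depth > options.MaxDepth) return false;

  let url: URL;
  let root: URL;
  try {
    url = new URL(candidate);
    root = new URL(options.RootURL);
  } catch {
    return false; // candidate or root is not an absolute URL
  }

  const sameHost = url.hostname === root.hostname;
  const sameDomain =
    registrableDomain(url.hostname) === registrableDomain(root.hostname);
  if (!sameHost && !(options.CrawlOtherSitesInTopLevelDomain && sameDomain)) {
    return false;
  }

  // Apply the optional regex filter to the full URL string.
  if (options.URLPattern && !new RegExp(options.URLPattern).test(candidate)) {
    return false;
  }
  return true;
}
```

With RootURL set to `https://www.example.com/` and URLPattern set to `/blog/`, this sketch accepts `https://www.example.com/blog/post` and rejects `https://www.example.com/about`.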
import { AutotagLocalFileSystem } from '@memberjunction/content-autotagging';
import { UserInfo } from '@memberjunction/core';
const fileTagger = new AutotagLocalFileSystem();
const userContext: UserInfo = { /* your user context */ };
// Process files from configured local directories
await fileTagger.Autotag(userContext);
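The dependency list pairs pdf-parse with PDF documents and officeparser with Microsoft Office documents, so a local file source plausibly has to choose a text extractor per file. As a rough sketch of that dispatch step, assuming the choice is made by file extension (the `ParserKind` names, the extension sets, and both helper functions are hypothetical and not taken from the package):

```typescript
// Hypothetical parser categories; these names are illustrative, not package identifiers.
type ParserKind = 'pdf' | 'office' | 'plain-text' | 'unsupported';

// Assumed extension groups for this sketch.
const OFFICE_EXTENSIONS = new Set(['docx', 'pptx', 'xlsx']);
const TEXT_EXTENSIONS = new Set(['txt', 'md']);

// Extracts the lowercase extension without the dot; returns '' when there is none.
function extensionOf(filePath: string): string {
  const fileName = filePath.split(/[\\/]/).pop() ?? '';
  const dot = fileName.lastIndexOf('.');
  return dot > 0 ? fileName.slice(dot + 1).toLowerCase() : '';
}

// Maps a file path to the kind of extractor that would handle it.
function pickParser(filePath: string): ParserKind {
  const ext = extensionOf(filePath);
  if (ext === 'pdf') return 'pdf';
  if (OFFICE_EXTENSIONS.has(ext)) return 'office';
  if (TEXT_EXTENSIONS.has(ext)) return 'plain-text';
  return 'unsupported';
}
```

Returning an explicit `'unsupported'` value, rather than throwing, lets a caller skip unknown files without aborting the rest of a directory.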
import { AutotagAzureBlob } from '@memberjunction/content-autotagging';
import { UserInfo } from '@memberjunction/core';
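const userContext: UserInfo = { /* your user context */ };
// The connection string comes from the environment; the container name is a placeholder
const blobTagger = new AutotagAzureBlob(
  process.env.AZURE_STORAGE_CONNECTION_STRING,
  'your-container-name'
);
// Authenticate before processing the container's blobs
await blobTagger.Authenticate();
await blobTagger.Autotag(userContext);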
For more control over the processing pipeline:
import { AutotagBaseEngine } from '@memberjunction/content-autotagging';
import { ContentItemEntity } from '@memberjunction/core-entities';
import { UserInfo } from '@memberjunction/core';
const engine = AutotagBaseEngine.Instance;
const userContext: UserInfo = { /* your user context */ };
// Process specific content items
const contentItems: ContentItemEntity[] = [ /* your content items */ ];
await engine.ExtractTextAndProcessWithLLM(contentItems, userContext);
Content sources are configured in the MemberJunction database with these key fields:
Name: Display name
URL: Source location (RSS URL, website URL, file path, etc.)
ContentTypeID: Type of content (article, blog post, etc.)
ContentSourceTypeID: Source type (RSS Feed, Website, etc.)
ContentFileTypeID: Expected file format

The package uses AI models configured in MemberJunction. Key parameters:
modelID: Specific AI model to use
minTags: Minimum number of tags to generate
maxTags: Maximum number of tags to generate

For website sources, these parameters can be configured:
CrawlOtherSitesInTopLevelDomain: Whether to crawl other subdomains
CrawlSitesInLowerLevelDomain: Whether to crawl child paths
MaxDepth: Maximum crawl depth
RootURL: Base URL for crawling
URLPattern: Regex pattern for URL filtering

import { AutotagBase } from '@memberjunction/content-autotagging';
import { RegisterClass } from '@memberjunction/global';
import { ContentSourceEntity, ContentItemEntity } from '@memberjunction/core-entities';
import { UserInfo } from '@memberjunction/core';
@RegisterClass(AutotagBase, 'AutotagCustomSource')
export class AutotagCustomSource extends AutotagBase {
  public async SetContentItemsToProcess(
    contentSources: ContentSourceEntity[]
  ): Promise<ContentItemEntity[]> {
    // Implement logic to fetch and create content items
    const contentItems: ContentItemEntity[] = [];
    // Your custom source logic here
    return contentItems;
  }

  public async Autotag(contextUser: UserInfo): Promise<void> {
    // Set up content source type
    const contentSourceTypeID = await this.engine.setSubclassContentSourceType(
      'Custom Source',
      contextUser
    );
    // Get configured sources
    const contentSources = await this.engine.getAllContentSources(
      contextUser,
      contentSourceTypeID
    );
    // Process content
    const contentItems = await this.SetContentItemsToProcess(contentSources);
    await this.engine.ExtractTextAndProcessWithLLM(contentItems, contextUser);
  }
}
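Inside `SetContentItemsToProcess`, a custom source usually has to decide which fetched documents are new or changed before creating content items for them. The dependency list mentions crypto for checksum generation; the sketch below shows one way that idea could work, assuming SHA-256 hashes and an in-memory map of previously seen checksums. The `FetchedDocument` type and both functions are hypothetical, and how the package itself stores or compares checksums is not described here.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical stand-in for whatever a custom source fetches.
interface FetchedDocument {
  url: string;
  text: string;
}

// Hex-encoded SHA-256 digest of the document text (algorithm choice is an assumption).
function checksumOf(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Returns only the documents whose checksum differs from the one recorded for
// their URL, and records the new checksum so the next run skips unchanged content.
function selectChangedDocuments(
  fetched: FetchedDocument[],
  knownChecksums: Map<string, string>
): FetchedDocument[] {
  const changed: FetchedDocument[] = [];
  for (const doc of fetched) {
    const checksum = checksumOf(doc.text);
    if (knownChecksums.get(doc.url) !== checksum) {
      knownChecksums.set(doc.url, checksum);
      changed.push(doc);
    }
  }
  return changed;
}
```

Filtering before the AI step keeps unchanged content from being sent to the model again on every run.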
Add custom prompts for specific content types by creating Content Type Attributes in the database. These will be automatically included in the AI processing prompts.
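To make the idea concrete, here is a minimal sketch of how attribute prompts and the minTags/maxTags parameters might be folded into a single prompt string. It is an illustration only: the package's actual prompt text is not shown in this document, and the `ContentTypeAttributeSketch` fields, the `TaggingParams` type, and `buildTaggingPrompt` are hypothetical names.

```typescript
// Hypothetical, simplified view of a Content Type Attribute row.
interface ContentTypeAttributeSketch {
  Name: string;   // e.g. "Author"
  Prompt: string; // custom instruction for this attribute
}

// Tag count bounds, named after the minTags/maxTags parameters.
interface TaggingParams {
  minTags: number;
  maxTags: number;
}

// Assembles one prompt: the tagging instruction, one line per custom attribute,
// then the content text itself.
function buildTaggingPrompt(
  text: string,
  attributes: ContentTypeAttributeSketch[],
  params: TaggingParams
): string {
  const lines = [
    `Generate between ${params.minTags} and ${params.maxTags} tags for the content below.`,
    ...attributes.map((a) => `Also extract "${a.Name}": ${a.Prompt}`),
    '',
    'Content:',
    text,
  ];
  return lines.join('\n');
}
```

Calling `buildTaggingPrompt(articleText, [{ Name: 'Author', Prompt: 'Who wrote this piece?' }], { minTags: 3, maxTags: 7 })` would yield a prompt that asks for three to seven tags plus the author.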
abstract class AutotagBase {
  abstract SetContentItemsToProcess(
    contentSources: ContentSourceEntity[]
  ): Promise<ContentItemEntity[]>;

  abstract Autotag(contextUser: UserInfo): Promise<void>;
}

class AutotagBaseEngine extends AIEngine {
  // Process content items with AI
  async ExtractTextAndProcessWithLLM(
    contentItems: ContentItemEntity[],
    contextUser: UserInfo
  ): Promise<void>;

  // Process individual content item text
  async ProcessContentItemText(
    params: ContentItemProcessParams,
    contextUser: UserInfo
  ): Promise<void>;

  // Get all content sources for a type
  async getAllContentSources(
    contextUser: UserInfo,
    contentSourceTypeID: string
  ): Promise<ContentSourceEntity[]>;
}
# For Azure Blob Storage
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
# AI Model API Keys (handled by @memberjunction/ai)
OPENAI_API_KEY=your_openai_key
# Other AI provider keys as needed
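Because a missing variable otherwise tends to surface only as a failure deep inside a processing run, it can help to check these values at startup. A minimal sketch, assuming a small helper named `requireEnv` (hypothetical, not part of the package) that takes the environment as a parameter so it can be exercised without touching the real process environment:

```typescript
// Environment lookup table; Node's process.env has this shape.
type Env = Record<string, string | undefined>;

// Returns the value of a required variable, or throws a descriptive error
// when it is missing or blank.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example usage in an application entry point (commented out so the sketch
// has no side effects):
// const connectionString = requireEnv(process.env, 'AZURE_STORAGE_CONNECTION_STRING');
```

The returned value is typed as `string` rather than `string | undefined`, which avoids passing a possibly undefined connection string to a constructor.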
The package includes comprehensive error handling.
The package works with these MemberJunction entities:
Content Sources - Configuration for each source
Content Items - Individual pieces of content
Content Item Tags - Generated tags
Content Item Attributes - Additional extracted metadata
Content Process Runs - Processing history
Content Types - Content categorization
Content Source Types - Source type definitions

License: ISC