PRD: Project Matter Refactor and Clean Re-import #2

Open

opened 2026-06-14 10:12:16 +00:00 by liang · 0 comments

Owner

PRD: Project Matter Refactor and Clean Re-import

Labels: ready-for-agent

Problem Statement

The current Sales AI OS stores project work mostly as Communications, Project-level Tasks, Documents, and Daily Logs. That is not enough for manufacturing technical sales work, where the same Project contains many long-running business threads such as RFQs, quotation feedback, technical clarifications, sample delivery concerns, PPAP follow-ups, and quality issues.

The current implementation also writes too much too early: email ingestion can auto-create placeholder Projects, create Tasks directly from extracted action items, and feed old Daily Log and Agent Memory flows that do not match the Project Matter model. Because existing database contents are disposable test data, the system should be rebuilt around the confirmed Project Matter model, then repopulated from Obsidian work logs and archived email.

Solution

Rebuild the database schema around Project, Project Matter, Communication, Intake Item, Review Item, Proposed Change Set, Stored File, File Reference, and Document. Use Obsidian work-log wikilinks as the seed source for Projects, Project Titles, Project Aliases, and Project Responsibility before re-importing emails.
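The wikilink seeding step can be sketched as follows. This is a minimal illustration, not the project's extractor: it assumes work-log texts arrive in chronological order, splits each wikilink on the first whitespace as the Implementation Decisions require, and ignores Obsidian pipe and heading syntax. The function names and the dictionary field names (code, title, aliases, occurrences) are hypothetical.

```python
import re

# Matches [[...]] wikilinks; nested brackets are not handled in this sketch.
WIKILINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def parse_wikilink(inner):
    """Split on the first whitespace: first token is the code, the rest is the alias."""
    parts = inner.strip().split(None, 1)
    code = parts[0]
    alias = parts[1].strip() if len(parts) > 1 else None
    return code, alias


def extract_candidates(log_entries):
    """Build Project Candidates from work-log texts given in chronological order."""
    candidates = {}
    for text in log_entries:
        for match in WIKILINK_RE.finditer(text):
            inner = match.group(1)
            if not inner.strip():
                continue  # skip empty links such as [[ ]]
            code, alias = parse_wikilink(inner)
            entry = candidates.setdefault(
                code, {"code": code, "title": None, "aliases": [], "occurrences": 0}
            )
            entry["occurrences"] += 1
            if alias:
                if alias not in entry["aliases"]:
                    entry["aliases"].append(alias)
                # The most recent alias becomes the default Project Title.
                entry["title"] = alias
    # Most frequently referenced candidates first, so they can be reviewed first.
    return sorted(candidates.values(), key=lambda c: -c["occurrences"])
```

The output deliberately carries no evidence paths or excerpts, so it could be serialized directly into an editable seed file.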

Email re-import should archive Communications and Communication Evidence first, then classify them. Classification should record Classification Decisions and route uncertain work through Intake Flow or Review Queue. Agent write operations must produce Proposed Change Sets rather than directly creating or updating Project Matters, Matter Deadlines, Matter Milestones, Tasks, Communication Interpretations, or responsibility handling.

The first implementation should prioritize a correct data model, deterministic Project matching, clean re-import, API/CLI review execution, and tests. Full UI workflows, Matter-centered Daily Log, Chat, Weekly Review, Agent Memory, and email drafting are explicitly deferred until after this foundation is in place.

User Stories

As a technical sales user, I want each Project to contain Project Matters, so that fragmented communications can be tracked as business threads instead of isolated Tasks.
As a technical sales user, I want Project Matters to have Matter Codes, Matter Titles, Matter Types, Matter Statuses, Matter Priority, Matter Owner, Matter Deadline, Current Summary, and Matter Timeline, so that each business thread is actionable and auditable.
As a technical sales user, I want Tasks to usually belong to Project Matters, so that concrete execution work is tied to the business thread it serves.
As a technical sales user, I want rare Project-level Tasks to remain possible, so that administrative project work does not require a fake Project Matter.
As a technical sales user, I want old test data cleared and the database rebuilt cleanly, so that new behavior is not constrained by disposable prototype data.
As a technical sales user, I want Alembic to start from a new baseline migration, so that future schema history is understandable after the large refactor.
As a technical sales user, I want old migrations archived outside the active migration chain, so that historical context is retained without confusing new schema upgrades.
As a technical sales user, I want Projects seeded from my Obsidian work-log wikilinks, so that the system starts from the Projects I actually handle.
As a technical sales user, I want every wikilink inside the work-log scope treated as a Project Candidate, so that extraction does not miss non-PRJ project code formats.
As a technical sales user, I want the first token in a work-log wikilink to become the Project code, so that codes such as PRJ-23-905209 are extracted predictably.
As a technical sales user, I want the remaining wikilink text to become a Project Alias, so that names like L形硬排 are available for display and search.
As a technical sales user, I want Project Candidates written to an editable seed file, so that I can batch confirm, reject, merge, rename, and enrich them before database import.
As a technical sales user, I want the seed file to omit evidence paths and excerpts, so that it stays clean and easy to edit.
As a technical sales user, I want Project Candidate occurrence counts, so that frequently referenced Projects can be reviewed first.
As a technical sales user, I want the most recent alias to become the default Project Title, so that displayed Project names match current daily usage.
As a technical sales user, I want Projects to use title instead of codename, so that display names are not confused with internal nicknames.
As a technical sales user, I want Project Aliases stored separately from Project Identifier Variants, so that human names and machine identifiers are not mixed.
As a technical sales user, I want Project Responsibility confirmed during seed review, so that Owned, Observed, and Delegated Projects can drive email workflow correctly.
As a technical sales user, I want Owned Projects to create active workflow when matching Communications arrive, so that work I handle appears in Review Queue and Project Matter views.
As a technical sales user, I want Observed Projects to remain searchable without default reminders, so that passive visibility does not create noise.
As a technical sales user, I want Delegated Projects to behave like Owned Projects during their delegation period, so that temporary responsibility is tracked.
As a technical sales user, I want Project Identifier Variants such as zero-padded project codes to match canonical Projects, so that email formatting differences do not create duplicate Projects.
As a technical sales user, I want only the last numeric segment normalized for leading zeros, so that matching is useful without over-merging unrelated codes.
As a technical sales user, I want Project Aliases to be weak match signals by default, so that generic names do not automatically assign Communications to the wrong Project.
As a technical sales user, I want strong Project Aliases to be manually marked, so that only trusted aliases can write Communication Project Hints.
As a technical sales user, I want Communications to retain a Communication Project Hint, so that existing project filtering and initial classification remain practical.
As a technical sales user, I want Communication Project Hints not to imply Project Matter assignment, so that a Project-level hint does not hide unfinished classification.
As a technical sales user, I want one Communication to link to multiple Project Matters, so that emails covering multiple business threads remain accurate.
As a technical sales user, I want every Matter Link to include a Link Rationale, so that I can audit why evidence belongs to a specific Project Matter.
As a technical sales user, I want all Communications archived before classification, so that evidence is preserved even when it is irrelevant to active work.
As a technical sales user, I want Irrelevant Communications recorded as archived and not active, so that search remains complete without polluting Review Queue.
As a technical sales user, I want Raw Communication, Readable Communication, original HTML, metadata, attachments, and inline images preserved when available, so that Agent decisions can be checked against original evidence.
As a technical sales user, I want Visual Evidence preserved even when the Agent cannot interpret it, so that image-heavy emails are not incorrectly ignored.
As a technical sales user, I want Stored Files saved outside the database by content hash, so that large evidence files are deduplicated and easy to move later.
As a technical sales user, I want File References to connect Stored Files to Communications, Documents, Project Matters, and derived artifacts, so that file evidence remains relational and auditable.
As a technical sales user, I want File References to distinguish original evidence from derived files, so that OCR text or previews never replace original evidence.
As a technical sales user, I want Documents to be business views over one or more File References, so that attachments can become Documents without duplicating files.
As a technical sales user, I want Documents to link to Company, Project, or Project Matter as appropriate, so that company-level agreements and matter-specific evidence both fit.
As a technical sales user, I want email ingestion to create Contact placeholders from exact email addresses, so that communication history can be tied to contacts with low risk.
As a technical sales user, I do not want email ingestion to create Projects automatically from unknown codes, so that the Project list stays clean.
As a technical sales user, I do not want email ingestion to create Tasks directly, so that extracted action items do not pollute my execution list before review.
As a technical sales user, I do not want email ingestion to generate Agent Memory during re-import, so that old sentiment-oriented behavior does not pollute the new workflow.
As a technical sales user, I want Project matching to use explicit Project Match Signals, so that wrong semantic guesses do not silently assign work.
As a technical sales user, I want exact Project code matches and Project Identifier Variant matches to write Communication Project Hints, so that clear matches become useful immediately.
As a technical sales user, I want semantic similarity to be candidate-only, so that it helps discovery without becoming a silent write signal.
As a technical sales user, I want thread relationships saved and used carefully, so that replies can inherit context without ignoring Thread Drift.
As a technical sales user, I want Thread Links to downgrade when a new project code appears, a thread is reused after a long gap, or the thread was previously split, so that reused email subjects do not corrupt classification.
As a technical sales user, I want fallback subject-based threads to be weaker than header-based or Bichon-provided threads, so that uncertain threading does not become overconfident.
As a technical sales user, I want Classification Decisions saved for linked, intake, and irrelevant outcomes, so that I can understand why a Communication did or did not enter workflow.
As a technical sales user, I want Classification Decisions to save structured audit summaries rather than full prompts and raw model responses, so that auditability does not create unnecessary sensitive logs.
As a technical sales user, I want Communication Interpretations to be current editable interpretations with decision and feedback audit, so that the source evidence stays separate from Agent understanding.
As a technical sales user, I want uncertain Communications to become Intake Items, so that missing Project, responsibility, or Matter decisions are explicit.
As a technical sales user, I want Intake Statuses for needs project, needs responsibility, needs matter, deferred, archived, and resolved, so that unfinished classification is visible.
As a technical sales user, I want Review Items to be global user-decision entries, so that pending work appears in one Review Queue.
As a technical sales user, I want Review Items to be separate from Intake Items, so that business classification state is not confused with the queue entry asking me to act.
As a technical sales user, I want Proposed Change Sets to be separate from Review Items, so that not every review need has to be modeled as write operations.
As a technical sales user, I want Proposed Change Sets to contain Proposed Changes with domain operation types, so that changes are readable and enforceable.
As a technical sales user, I want to approve or reject individual Proposed Changes, so that I can accept useful Agent work without accepting everything.
As a technical sales user, I want Proposed Changes with unmet dependencies blocked from execution, so that dependent writes do not create broken records.
As a technical sales user, I want Proposed Change Sets to be one-time audit objects, so that the original suggestion is not mutated after review.
As a technical sales user, I want Agent status changes, deadline changes, summary changes, task creation, and Matter creation to go through Proposed Change Sets, so that the Agent cannot silently rewrite active work.
As a technical sales user, I want my direct manual API/CLI actions to write immediately with events, so that Proposed Change review does not slow explicit user decisions.
As a technical sales user, I want Matter Types maintained in the database, so that enabled types can evolve without code migrations.
As a technical sales user, I want the initial Matter Type vocabulary to include RFQ, quotation feedback, technical clarification, sample delivery, PPAP documentation, quality issue, commercial terms, internal follow-up, and other, so that common manufacturing sales work is covered.
As a technical sales user, I want Matter Status to be fixed, so that workflow semantics remain stable for reminders, logs, and views.
As a technical sales user, I want Matter Priority fixed to urgent, high, normal, and low, so that business importance is simple and consistent.
As a technical sales user, I want Review Queue Priority fixed to high, normal, and low, so that review urgency is not confused with Matter Priority.
As a technical sales user, I want Matter Deadline to be one primary date, so that the main due date is clear.
As a technical sales user, I want Matter Milestones to represent multiple dated lifecycle points, so that internal targets, customer commitments, deliveries, and follow-ups can coexist.
As a technical sales user, I want Matter Events separate from global system Events, so that user-facing timelines stay readable while audit logs remain complete.
As a technical sales user, I want Search to filter by Project and Matter at the SQL level, so that relevant Communication chunks are not lost after global top-k truncation.
As a technical sales user, I want Project filtering to consider both Communication Project Hints and Project Matter links, so that search works during and after Matter classification.
As a technical sales user, I want Matter filtering to return only evidence linked to that Matter, so that a Matter view stays focused.
As a technical sales user, I want first-phase Review actions available through API or CLI, so that the core workflow can be validated before full UI work.
As a technical sales user, I want Streamlit UI limited to minimal viewing in this phase, so that implementation energy stays on the data and ingestion foundation.
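The Project Identifier Variant stories above ask for leading-zero normalization of the last numeric segment only. A minimal sketch of that rule, assuming identifiers are hyphen-separated and that no other normalization (case, whitespace inside the code) is wanted; the function names are hypothetical:

```python
def normalize_identifier(identifier: str) -> str:
    """Strip leading zeros from the final hyphen-separated segment, if it is all digits."""
    segments = identifier.strip().split("-")
    last = segments[-1]
    if last.isascii() and last.isdigit():
        # Keep a single "0" when the segment consists only of zeros.
        segments[-1] = last.lstrip("0") or "0"
    return "-".join(segments)


def is_variant_of(written_form: str, canonical_code: str) -> bool:
    """True when a written form normalizes to the same identifier as the canonical code."""
    return normalize_identifier(written_form) == normalize_identifier(canonical_code)
```

Under this sketch PRJ-23-0905209 matches PRJ-23-905209, while PRJ-023-905209 does not, because earlier segments are left untouched to avoid over-merging unrelated codes.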

Implementation Decisions

Rebuild the schema destructively because existing database contents are disposable test data.
Preserve Alembic as the migration mechanism, but reset to a new baseline migration and archive the old active migration chain.
Use Project Matter as the canonical business thread layer between Project and Task.
Use matter-based table and field names for Project Matter concepts.
Keep Communication Project Hint for compatibility, filtering, and initial classification, but treat Project Matter links as the business-thread relationship.
Allow one Communication to link to multiple Project Matters.
Require a Link Rationale per Matter Link.
Replace codename with Project Title and do not keep codename in the rebuilt schema.
Add Project Alias records for human-readable names extracted from work-log wikilinks.
Add Project Identifier Variant records for external or alternate written project code forms.
Implement zero-padding normalization only for the final numeric segment of a hyphen-separated identifier.
Treat Project Alias matching as weak by default; only manually marked strong aliases may write Communication Project Hints.
Extract Project Candidates only from the Obsidian work-log scope.
Treat every work-log wikilink as a Project Candidate.
Parse each wikilink by splitting on the first whitespace: first token is Project code, remaining text is Project Alias.
Write Project Candidates to editable seed files before importing them into the database.
Include occurrence counts in Project seed files.
Do not include evidence paths or excerpts in Project seed files.
Confirm Project Responsibility during Project seed review.
Import confirmed Projects, Project Titles, Project Aliases, Project Identifier Variants, Project Responsibility, and Matter Types before email re-import.
Support Owned, Observed, and Delegated Project Responsibility.
Defer separate Responsible Sender List implementation in the first phase.
Archive all Communications before classification.
Preserve Raw Communication, Readable Communication, original HTML when available, metadata, attachments, inline images, chunks, and file references.
Do not require full rendered email screenshots in the first phase.
Mark Communications with important Visual Evidence for Visual Review when text is insufficient.
Store files outside the database by content hash.
Separate Stored File, File Reference, and Document.
Distinguish original file evidence from derived artifacts such as extracted text, OCR text, preview images, converted files, and rendered screenshots.
Allow Documents to associate with Company, Project, and Project Matter.
Allow automatic lightweight Contact placeholder creation from exact email addresses.
Forbid automatic Project creation from email ingestion; unknown Projects go through Intake Flow or Proposed Change Set.
Forbid direct Task creation from email ingestion; extracted action items become Communication Interpretation and Proposed Changes.
Do not generate Agent Memory during first-phase email re-import.
Match Projects using explicit Project Match Signals only.
Project matching priority is: Confirmed Link or prior correction, exact Project code, exact Project Identifier Variant, confirmed Thread Link, document metadata, then semantic similarity only for candidate ranking.
Company-only match and pure semantic similarity are not Project Match Signals.
Implement minimal thread support using Bichon conversation ID first, email headers second, and normalized subject plus participants plus time window as weak fallback.
Downgrade Thread Links when Thread Drift signals appear.
Save Classification Decisions for Project Matter work, Intake Flow, and Irrelevant Communication outcomes.
Classification Decisions save decision, confidence, rationale, evidence references, signals used, model name, prompt version, correction status, and linked Proposed Change Set when present.
Do not save full LLM prompts, raw responses, or reasoning traces by default.
Save Classification Feedback when users correct classifications.
Do not auto-promote Classification Feedback to active rules.
Store Rule Suggestions and confirmed rule records, but defer a complete classification rule engine.
Keep Intake Item, Review Item, and Proposed Change Set as separate concepts.
Use Intake Status values: needs_project, needs_responsibility, needs_matter, deferred, archived, resolved.
Use Review Item Types: proposed_change_set, visual_review, ambiguous_project, needs_matter, suggested_deadline, rule_suggestion, handoff_decision.
Implement the first four Review Item Types in the first phase; reserve data shape for the remaining types.
Use Review Queue Priority values: high, normal, low.
Use Proposed Change domain operation types rather than JSON Patch.
First-phase Proposed Change operation types are create_project_matter, link_communication_to_matter, link_document_to_matter, set_matter_deadline, add_matter_milestone, create_task, update_matter_summary, and set_matter_status.
Support partial approval and rejection of Proposed Changes.
Block execution of Proposed Changes whose dependencies were not approved or executed.
Treat Proposed Change Sets as one-time audit objects.
Matter Type is database-maintained with stable keys, labels, descriptions, enabled flags, and sort order.
Seed Matter Types with rfq, quotation_feedback, technical_clarification, sample_delivery, ppap_documentation, quality_issue, commercial_terms, internal_followup, and other.
Use Chinese display labels: 新询价 / RFQ, 报价反馈, 技术澄清, 样件与交付, PPAP 文件, 质量问题, 商务条款, 内部跟进, 其他.
Matter Status is fixed: new, open, waiting_customer, waiting_internal, blocked, done, cancelled.
The new status means a confirmed Project Matter exists but has not yet been actively handled.
Matter Priority is fixed: urgent, high, normal, low.
Matter Owner is lightweight text or a single-user marker, not a permission system.
Matter Deadline is the current primary due date, with at most one primary deadline per Project Matter.
Matter Milestones represent multiple dated lifecycle points and are not all deadlines.
Agent-suggested Current Summary updates use Proposed Change Sets.
Agent-suggested Matter Status changes use Proposed Change Sets; user-initiated changes write directly with events.
Keep global Events for audit and Matter Events for user-facing Matter Timelines.
Search filtering must be pushed down into SQL rather than applied after global top-k retrieval.
First-phase Search focuses on Communication chunks, Project filters, and Matter filters.
Document full-text/vector search is out of the first-phase acceptance path.
API/CLI review execution is sufficient for the first phase; full Review Queue UI is not required.
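The Project matching priority decided above could be resolved roughly as follows. This is a sketch under assumptions: the signal kind strings are hypothetical names for the listed signals, alias signals are left out because the priority list does not place them, and treating a conflict within the highest matching tier as "no hint, send to review" is an assumption rather than a stated decision.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Signals that may write a Communication Project Hint, in priority order.
WRITE_SIGNALS = [
    "confirmed_link",            # Confirmed Link or prior correction
    "exact_project_code",
    "exact_identifier_variant",
    "confirmed_thread_link",
    "document_metadata",
]
# Signals that may only rank candidates and never write a hint.
CANDIDATE_ONLY_SIGNALS = {"semantic_similarity"}


@dataclass(frozen=True)
class MatchSignal:
    kind: str
    project_id: str


def resolve_project_hint(signals: List[MatchSignal]) -> Tuple[Optional[str], List[str]]:
    """Return (hint_project_id or None, candidate project ids in first-seen order)."""
    known = set(WRITE_SIGNALS) | CANDIDATE_ONLY_SIGNALS
    candidates: List[str] = []
    for signal in signals:
        # Kinds such as a company-only match are not Project Match Signals and are ignored.
        if signal.kind in known and signal.project_id not in candidates:
            candidates.append(signal.project_id)
    for kind in WRITE_SIGNALS:
        projects = {s.project_id for s in signals if s.kind == kind}
        if len(projects) == 1:
            return projects.pop(), candidates
        if len(projects) > 1:
            # Assumption: conflicting matches in one tier are ambiguous, so no hint is written.
            return None, candidates
    return None, candidates
```

Semantic-only input therefore yields candidates but never a hint, which keeps semantic similarity from becoming a silent write signal.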

The first implementation sequence is:

1. Schema rebuild and new Alembic baseline.
2. Obsidian Project seed extractor.
3. Seed importer for Projects, Project Aliases, Project Identifier Variants, Project Responsibility, and Matter Types.
4. Email archive ingestion that preserves Communication Records and Communication Evidence, writes chunks and File References, and matches Communication Project Hints without creating Matter or Task records.
5. Classification pipeline producing Classification Decisions, Intake Items, Review Items, and Proposed Change Sets.
6. Review execution through API/CLI that writes Project Matters, Matter Links, Tasks, Matter Milestones, Matter Deadlines, Current Summary, Matter Events, and audit Events.
7. Search filtering updates for Project and Matter.
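The review execution step in the sequence above has to honor partial approval and dependency blocking. A minimal sketch, assuming changes are listed with dependencies before their dependents and that review decisions are stored separately from the Proposed Changes so the original suggestion is never mutated. The class, field, and outcome names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ProposedChange:
    change_id: str
    operation: str                      # e.g. "create_project_matter"
    depends_on: Tuple[str, ...] = ()    # ids of changes that must execute first


def execute_change_set(
    changes: List[ProposedChange],
    decisions: Dict[str, str],
    apply: Callable[[ProposedChange], None],
) -> Dict[str, str]:
    """Apply approved changes in order; block any whose dependencies did not execute."""
    outcomes: Dict[str, str] = {}
    for change in changes:
        decision = decisions.get(change.change_id, "pending")
        if decision != "approved":
            outcomes[change.change_id] = decision  # "rejected" or "pending"
            continue
        if any(outcomes.get(dep) != "executed" for dep in change.depends_on):
            outcomes[change.change_id] = "blocked"
            continue
        apply(change)
        outcomes[change.change_id] = "executed"
    return outcomes
```

With this shape, approving a link_communication_to_matter change while rejecting the create_project_matter change it depends on leaves the link blocked rather than pointing at a Project Matter that was never created.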

Testing Decisions

Tests should validate externally observable behavior at the highest practical seam. They should not lock in internal helper structure or LLM wording. Where LLM classification is involved, tests should use fake classifiers and deterministic fixtures rather than live model calls.
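A fake classifier seam might look like the sketch below. Everything here is illustrative: FakeClassifier, ClassificationDecision, route, and the returned workflow labels are hypothetical names, and the real pipeline would persist records rather than return strings. The point is only that the test asserts on the routing outcome, not on LLM wording.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ClassificationDecision:
    communication_id: str
    outcome: str        # "linked", "intake", or "irrelevant"
    confidence: float
    rationale: str


class FakeClassifier:
    """Deterministic stand-in for the LLM classifier: canned outcomes keyed by id."""

    def __init__(self, canned: Dict[str, str], default: str = "intake"):
        self._canned = dict(canned)
        self._default = default

    def classify(self, communication_id: str, text: str) -> ClassificationDecision:
        outcome = self._canned.get(communication_id, self._default)
        return ClassificationDecision(communication_id, outcome, 1.0, "canned fixture outcome")


def route(decision: ClassificationDecision) -> str:
    """Hypothetical routing from a decision outcome to the workflow object it produces."""
    return {
        "linked": "proposed_change_set",
        "intake": "intake_item",
        "irrelevant": "archived_only",
    }[decision.outcome]


def test_fake_classifier_routes_outcomes():
    classifier = FakeClassifier({"c-1": "linked", "c-2": "irrelevant"})
    assert route(classifier.classify("c-1", "RFQ for part A")) == "proposed_change_set"
    assert route(classifier.classify("c-2", "newsletter")) == "archived_only"
    # Anything without a canned outcome falls back to Intake Flow.
    assert route(classifier.classify("c-3", "unclear request")) == "intake_item"
```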

Test the Obsidian work-log Project seed extractor with representative wikilinks, including PRJ-style codes, non-PRJ codes, links with aliases, links without aliases, duplicate Project codes, and occurrence counts.
Test the Project seed importer by importing confirmed seed data and verifying Projects, Project Titles, Project Aliases, Project Identifier Variants, and Project Responsibility.
Test Project Identifier Variant normalization, especially final-segment zero stripping and non-normalized cases.
Test Project matching priority with fixtures for exact Project code, identifier variant, strong alias, weak alias, thread link, multiple matches, and semantic-only candidates.
Test Communication archive ingestion with Bichon-like fixtures and verify Communication Records, raw/readable/html content, chunks, attachments, inline images, Stored Files, and File References.
Test that email ingestion does not create Projects, Project Matters, Tasks, or Agent Memory.
Test Contact placeholder creation from exact email addresses.
Test Intake Flow transitions for needs project, needs responsibility, needs matter, deferred, archived, and resolved.
Test Classification Decision persistence for linked/proposed, intake, and irrelevant outcomes.
Test Proposed Change Set partial approval and dependency blocking.
Test execution of domain operation types for creating Project Matters, linking Communications, creating Tasks, adding Matter Milestones, setting Matter Deadline, updating Current Summary, and setting Matter Status.
Test Matter Link behavior when one Communication links to multiple Project Matters, including separate Link Rationales.
Test Stored File hash path generation and deduplication behavior.
Test File Reference original-versus-derived role behavior.
Test SQL search filters for Project and Matter before top-k truncation.
Test that Observed Project Communications are archived and searchable without creating active workflow unless strong user-relevance signals exist.
Test that Agent-suggested writes produce Proposed Change Sets and user direct writes can execute immediately with audit Events.
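For the Stored File hash path and deduplication tests listed above, the behavior under test could resemble this sketch. SHA-256 and the two-level directory sharding are assumptions, since the PRD only requires storage outside the database by content hash, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def stored_file_path(root: str, content: bytes) -> Path:
    """Derive a storage path from the content hash (assumed: SHA-256, sharded by prefix)."""
    digest = hashlib.sha256(content).hexdigest()
    return Path(root) / digest[:2] / digest[2:4] / digest


def store_file(root: str, content: bytes) -> Tuple[Path, bool]:
    """Write content once; return (path, created). Identical content is deduplicated."""
    path = stored_file_path(root, content)
    if path.exists():
        return path, False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path, True
```

Because the path depends only on the bytes, the same attachment arriving on two emails would map to one Stored File, with each email holding its own File Reference.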

Existing tests are mostly manual scripts, so this PRD should introduce pytest-based automated tests around deterministic seams. The fake ingestion and fake classifier seams are the preferred high-level seams for most behavior.
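One deterministic seam worth illustrating is the SQL-level search filter. The sketch below uses an in-memory SQLite table with a precomputed score column standing in for vector similarity; the table and column names are hypothetical and the real schema would also consider Project Matter links. It shows why filtering after a global top-k can lose every relevant chunk.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny chunk table where project P1 dominates the global ranking."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, project_id TEXT, score REAL)")
    rows = [(1, "P1", 0.99), (2, "P1", 0.98), (3, "P1", 0.97), (4, "P2", 0.60), (5, "P2", 0.55)]
    conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", rows)
    return conn


def search_filter_in_sql(conn: sqlite3.Connection, project_id: str, k: int) -> list:
    """Filter inside the query, then truncate to the top k."""
    cur = conn.execute(
        "SELECT id FROM chunks WHERE project_id = ? ORDER BY score DESC LIMIT ?",
        (project_id, k),
    )
    return [row[0] for row in cur]


def search_filter_after_topk(conn: sqlite3.Connection, project_id: str, k: int) -> list:
    """Anti-pattern: take the global top k first, then filter in application code."""
    cur = conn.execute("SELECT id, project_id FROM chunks ORDER BY score DESC LIMIT ?", (k,))
    return [chunk_id for chunk_id, pid in cur if pid == project_id]
```

With k set to 3, the pushed-down filter still returns both P2 chunks, while the post-filter variant returns nothing for P2 because the global top three all belong to P1.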

Out of Scope

Preserving or migrating existing database data.
Full Streamlit or Web UI for Review Queue and Proposed Change approval.
Matter-centered Daily Log implementation.
Weekly Review implementation.
Chat UI and Conversation Context.
Full classification rule engine.
Responsible Sender List management.
Agent Memory generation or review.
Email drafting and sending workflow refactor.
Full Document full-text/vector search.
WeChat, DingTalk, or Teams API integrations.
Multi-user permissions, OAuth, or notification routing.
Automatic Project Matter close suggestions.
Bulk extraction of Project Matters from historical Obsidian Daily Logs.
Saving full LLM prompts, raw responses, or reasoning traces by default.
Complete rendered email screenshot generation.

Further Notes

The current codebase still reflects the old model in multiple places: email ingestion can create placeholder Projects and Tasks, Daily Log generation uses raw Communications and Tasks with sentiment, API project views assume Project-level Communications and Tasks, and document storage still uses a simpler file path model. These mismatches are expected and should be resolved by the refactor rather than patched piecemeal.

The implementation should respect all Project Matter ADRs, especially the decisions that Agent writes use Proposed Change Sets, Project matching uses explicit signals, files live outside the database through File References, database contents can be rebuilt, Project Candidates come from editable seed files, Project Title replaces codename, Project Identifier Variants support code matching, and Alembic gets a new baseline.

As a technical sales user, I want thread relationships saved and used carefully, so that replies can inherit context without ignoring Thread Drift. 47. As a technical sales user, I want Thread Links to downgrade when a new project code appears, a thread is reused after a long gap, or the thread was previously split, so that reused email subjects do not corrupt classification. 48. As a technical sales user, I want fallback subject-based threads to be weaker than header-based or Bichon-provided threads, so that uncertain threading does not become overconfident. 49. As a technical sales user, I want Classification Decisions saved for linked, intake, and irrelevant outcomes, so that I can understand why a Communication did or did not enter workflow. 50. As a technical sales user, I want Classification Decisions to save structured audit summaries rather than full prompts and raw model responses, so that auditability does not create unnecessary sensitive logs. 51. As a technical sales user, I want Communication Interpretations to be current editable interpretations with decision and feedback audit, so that the source evidence stays separate from Agent understanding. 52. As a technical sales user, I want uncertain Communications to become Intake Items, so that missing Project, responsibility, or Matter decisions are explicit. 53. As a technical sales user, I want Intake Statuses for needs project, needs responsibility, needs matter, deferred, archived, and resolved, so that unfinished classification is visible. 54. As a technical sales user, I want Review Items to be global user-decision entries, so that pending work appears in one Review Queue. 55. As a technical sales user, I want Review Items to be separate from Intake Items, so that business classification state is not confused with the queue entry asking me to act. 56. 
As a technical sales user, I want Proposed Change Sets to be separate from Review Items, so that not every review need has to be modeled as write operations. 57. As a technical sales user, I want Proposed Change Sets to contain Proposed Changes with domain operation types, so that changes are readable and enforceable. 58. As a technical sales user, I want to approve or reject individual Proposed Changes, so that I can accept useful Agent work without accepting everything. 59. As a technical sales user, I want Proposed Changes with unmet dependencies blocked from execution, so that dependent writes do not create broken records. 60. As a technical sales user, I want Proposed Change Sets to be one-time audit objects, so that the original suggestion is not mutated after review. 61. As a technical sales user, I want Agent status changes, deadline changes, summary changes, task creation, and Matter creation to go through Proposed Change Sets, so that the Agent cannot silently rewrite active work. 62. As a technical sales user, I want my direct manual API/CLI actions to write immediately with events, so that Proposed Change review does not slow explicit user decisions. 63. As a technical sales user, I want Matter Types maintained in the database, so that enabled types can evolve without code migrations. 64. As a technical sales user, I want the initial Matter Type vocabulary to include RFQ, quotation feedback, technical clarification, sample delivery, PPAP documentation, quality issue, commercial terms, internal follow-up, and other, so that common manufacturing sales work is covered. 65. As a technical sales user, I want Matter Status to be fixed, so that workflow semantics remain stable for reminders, logs, and views. 66. As a technical sales user, I want Matter Priority fixed to urgent, high, normal, and low, so that business importance is simple and consistent. 67. 
As a technical sales user, I want Review Queue Priority fixed to high, normal, and low, so that review urgency is not confused with Matter Priority. 68. As a technical sales user, I want Matter Deadline to be one primary date, so that the main due date is clear. 69. As a technical sales user, I want Matter Milestones to represent multiple dated lifecycle points, so that internal targets, customer commitments, deliveries, and follow-ups can coexist. 70. As a technical sales user, I want Matter Events separate from global system Events, so that user-facing timelines stay readable while audit logs remain complete. 71. As a technical sales user, I want Search to filter by Project and Matter at the SQL level, so that relevant Communication chunks are not lost after global top-k truncation. 72. As a technical sales user, I want Project filtering to consider both Communication Project Hints and Project Matter links, so that search works during and after Matter classification. 73. As a technical sales user, I want Matter filtering to return only evidence linked to that Matter, so that a Matter view stays focused. 74. As a technical sales user, I want first-phase Review actions available through API or CLI, so that the core workflow can be validated before full UI work. 75. As a technical sales user, I want Streamlit UI limited to minimal viewing in this phase, so that implementation energy stays on the data and ingestion foundation. ## Implementation Decisions - Rebuild the schema destructively because existing database contents are disposable test data. - Preserve Alembic as the migration mechanism, but reset to a new baseline migration and archive the old active migration chain. - Use Project Matter as the canonical business thread layer between Project and Task. - Use matter-based table and field names for Project Matter concepts. 
- Keep Communication Project Hint for compatibility, filtering, and initial classification, but treat Project Matter links as the business-thread relationship.
- Allow one Communication to link to multiple Project Matters.
- Require a Link Rationale per Matter Link.
- Replace `codename` with Project Title and do not keep `codename` in the rebuilt schema.
- Add Project Alias records for human-readable names extracted from work-log wikilinks.
- Add Project Identifier Variant records for external or alternate written project code forms.
- Implement zero-padding normalization only for the final numeric segment of a hyphen-separated identifier.
- Treat Project Alias matching as weak by default; only manually marked strong aliases may write Communication Project Hints.
- Extract Project Candidates only from the Obsidian work-log scope.
- Treat every work-log wikilink as a Project Candidate.
- Parse each wikilink by splitting on the first whitespace: first token is Project code, remaining text is Project Alias.
- Write Project Candidates to editable seed files before importing them into the database.
- Include occurrence counts in Project seed files.
- Do not include evidence paths or excerpts in Project seed files.
- Confirm Project Responsibility during Project seed review.
- Import confirmed Projects, Project Titles, Project Aliases, Project Identifier Variants, Project Responsibility, and Matter Types before email re-import.
- Support Owned, Observed, and Delegated Project Responsibility.
- Defer separate Responsible Sender List implementation in the first phase.
- Archive all Communications before classification.
- Preserve Raw Communication, Readable Communication, original HTML when available, metadata, attachments, inline images, chunks, and file references.
- Do not require full rendered email screenshots in the first phase.
- Mark Communications with important Visual Evidence for Visual Review when text is insufficient.
- Store files outside the database by content hash.
- Separate Stored File, File Reference, and Document.
- Distinguish original file evidence from derived artifacts such as extracted text, OCR text, preview images, converted files, and rendered screenshots.
- Allow Documents to associate with Company, Project, and Project Matter.
- Allow automatic lightweight Contact placeholder creation from exact email addresses.
- Forbid automatic Project creation from email ingestion; unknown Projects go through Intake Flow or Proposed Change Set.
- Forbid direct Task creation from email ingestion; extracted action items become Communication Interpretation and Proposed Changes.
- Do not generate Agent Memory during first-phase email re-import.
- Match Projects using explicit Project Match Signals only.
- Project matching priority is: Confirmed Link or prior correction, exact Project code, exact Project Identifier Variant, confirmed Thread Link, document metadata, then semantic similarity only for candidate ranking.
- Company-only match and pure semantic similarity are not Project Match Signals.
- Implement minimal thread support using Bichon conversation ID first, email headers second, and normalized subject plus participants plus time window as weak fallback.
- Downgrade Thread Links when Thread Drift signals appear.
- Save Classification Decisions for Project Matter work, Intake Flow, and Irrelevant Communication outcomes.
- Classification Decisions save decision, confidence, rationale, evidence references, signals used, model name, prompt version, correction status, and linked Proposed Change Set when present.
- Do not save full LLM prompts, raw responses, or reasoning traces by default.
- Save Classification Feedback when users correct classifications.
- Do not auto-promote Classification Feedback to active rules.
- Store Rule Suggestions and confirmed rule records, but defer a complete classification rule engine.
- Keep Intake Item, Review Item, and Proposed Change Set as separate concepts.
- Use Intake Status values: `needs_project`, `needs_responsibility`, `needs_matter`, `deferred`, `archived`, `resolved`.
- Use Review Item Types: `proposed_change_set`, `visual_review`, `ambiguous_project`, `needs_matter`, `suggested_deadline`, `rule_suggestion`, `handoff_decision`.
- Implement the first four Review Item Types in the first phase; reserve data shape for the remaining types.
- Use Review Queue Priority values: `high`, `normal`, `low`.
- Use Proposed Change domain operation types rather than JSON Patch.
- First-phase Proposed Change operation types are `create_project_matter`, `link_communication_to_matter`, `link_document_to_matter`, `set_matter_deadline`, `add_matter_milestone`, `create_task`, `update_matter_summary`, and `set_matter_status`.
- Support partial approval and rejection of Proposed Changes.
- Block execution of Proposed Changes whose dependencies were not approved or executed.
- Treat Proposed Change Sets as one-time audit objects.
- Matter Type is database-maintained with stable keys, labels, descriptions, enabled flags, and sort order.
- Seed Matter Types with `rfq`, `quotation_feedback`, `technical_clarification`, `sample_delivery`, `ppap_documentation`, `quality_issue`, `commercial_terms`, `internal_followup`, and `other`.
- Use Chinese display labels: 新询价 / RFQ, 报价反馈, 技术澄清, 样件与交付, PPAP 文件, 质量问题, 商务条款, 内部跟进, 其他.
- Matter Status is fixed: `new`, `open`, `waiting_customer`, `waiting_internal`, `blocked`, `done`, `cancelled`.
- `new` means a confirmed Project Matter exists but has not yet been actively handled.
- Matter Priority is fixed: `urgent`, `high`, `normal`, `low`.
- Matter Owner is lightweight text or a single-user marker, not a permission system.
- Matter Deadline is the current primary due date, with at most one primary deadline per Project Matter.
- Matter Milestones represent multiple dated lifecycle points and are not all deadlines.
- Agent-suggested Current Summary updates use Proposed Change Sets.
- Agent-suggested Matter Status changes use Proposed Change Sets; user-initiated changes write directly with events.
- Keep global Events for audit and Matter Events for user-facing Matter Timelines.
- Search filtering must be pushed down into SQL rather than applied after global top-k retrieval.
- First-phase Search focuses on Communication chunks, Project filters, and Matter filters.
- Document full-text/vector search is out of the first-phase acceptance path.
- API/CLI review execution is sufficient for the first phase; full Review Queue UI is not required.

The first implementation sequence is:

1. Schema rebuild and new Alembic baseline.
2. Obsidian Project seed extractor.
3. Seed importer for Projects, Project Aliases, Project Identifier Variants, Project Responsibility, and Matter Types.
4. Email archive ingestion that preserves Communication Records and Communication Evidence, writes chunks and File References, and matches Communication Project Hints without creating Matter or Task records.
5. Classification pipeline producing Classification Decisions, Intake Items, Review Items, and Proposed Change Sets.
6. Review execution through API/CLI that writes Project Matters, Matter Links, Tasks, Matter Milestones, Matter Deadlines, Current Summary, Matter Events, and audit Events.
7. Search filtering updates for Project and Matter.

## Testing Decisions

Tests should validate externally observable behavior at the highest practical seam. They should not lock in internal helper structure or LLM wording. Where LLM classification is involved, tests should use fake classifiers and deterministic fixtures rather than live model calls.

- Test the Obsidian work-log Project seed extractor with representative wikilinks, including PRJ-style codes, non-PRJ codes, links with aliases, links without aliases, duplicate Project codes, and occurrence counts.
- Test the Project seed importer by importing confirmed seed data and verifying Projects, Project Titles, Project Aliases, Project Identifier Variants, and Project Responsibility.
- Test Project Identifier Variant normalization, especially final-segment zero stripping and non-normalized cases.
- Test Project matching priority with fixtures for exact Project code, identifier variant, strong alias, weak alias, thread link, multiple matches, and semantic-only candidates.
- Test Communication archive ingestion with Bichon-like fixtures and verify Communication Records, raw/readable/html content, chunks, attachments, inline images, Stored Files, and File References.
- Test that email ingestion does not create Projects, Project Matters, Tasks, or Agent Memory.
- Test Contact placeholder creation from exact email addresses.
- Test Intake Flow transitions for needs project, needs responsibility, needs matter, deferred, archived, and resolved.
- Test Classification Decision persistence for linked/proposed, intake, and irrelevant outcomes.
- Test Proposed Change Set partial approval and dependency blocking.
- Test execution of domain operation types for creating Project Matters, linking Communications, creating Tasks, adding Matter Milestones, setting Matter Deadline, updating Current Summary, and setting Matter Status.
- Test Matter Link behavior when one Communication links to multiple Project Matters, including separate Link Rationales.
- Test Stored File hash path generation and deduplication behavior.
- Test File Reference original-versus-derived role behavior.
- Test SQL search filters for Project and Matter before top-k truncation.
- Test that Observed Project Communications are archived and searchable without creating active workflow unless strong user-relevance signals exist.
- Test that Agent-suggested writes produce Proposed Change Sets and user direct writes can execute immediately with audit Events.
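
The wikilink parsing and final-segment zero stripping that these tests target can be illustrated with a minimal Python sketch. It is not the project's implementation: the names `parse_wikilink`, `normalize_identifier`, and `count_project_codes` are hypothetical, and the sketch assumes plain `[[code alias]]` wikilinks without Obsidian's `|` display text or `#` heading anchors.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional

# Assumption: a wikilink is "[[...]]" with no nested brackets.
WIKILINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


class ProjectCandidate(NamedTuple):
    code: str
    alias: Optional[str]


def parse_wikilink(inner: str) -> Optional[ProjectCandidate]:
    """Split on the first whitespace: first token is the code, the rest is the alias."""
    parts = inner.split(maxsplit=1)
    if not parts:
        return None  # whitespace-only link, nothing to extract
    alias = parts[1].strip() if len(parts) > 1 else None
    return ProjectCandidate(code=parts[0], alias=alias or None)


def normalize_identifier(code: str) -> str:
    """Strip leading zeros from the final hyphen-separated segment, only if it is numeric."""
    head, sep, last = code.rpartition("-")
    if last.isascii() and last.isdigit():
        last = last.lstrip("0") or "0"  # keep a single zero for all-zero segments
    return f"{head}{sep}{last}"


def count_project_codes(text: str) -> Counter:
    """Count how often each Project code appears in work-log text."""
    counts: Counter = Counter()
    for match in WIKILINK_RE.finditer(text):
        candidate = parse_wikilink(match.group(1))
        if candidate is not None:
            counts[candidate.code] += 1
    return counts


if __name__ == "__main__":
    log = "Call about [[PRJ-23-905209 L形硬排]], then [[PRJ-23-905209]] again."
    print(count_project_codes(log))
    print(normalize_identifier("PRJ-23-0905209"))
```

Under these assumptions, `normalize_identifier("PRJ-23-0905209")` yields `PRJ-23-905209`, while `PRJ-023-905209` is left unchanged because only the final segment is normalized.
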

Existing testing prior art is mostly manual scripts, so this PRD should introduce pytest-based automated tests around deterministic seams. The fake ingestion and fake classifier seams are the preferred high-level seams for most behavior.

## Out of Scope

- Preserving or migrating existing database data.
- Full Streamlit or Web UI for Review Queue and Proposed Change approval.
- Matter-centered Daily Log implementation.
- Weekly Review implementation.
- Chat UI and Conversation Context.
- Full classification rule engine.
- Responsible Sender List management.
- Agent Memory generation or review.
- Email drafting and sending workflow refactor.
- Full Document full-text/vector search.
- WeChat, DingTalk, or Teams API integrations.
- Multi-user permissions, OAuth, or notification routing.
- Automatic Project Matter close suggestions.
- Bulk extraction of Project Matters from historical Obsidian Daily Logs.
- Saving full LLM prompts, raw responses, or reasoning traces by default.
- Complete rendered email screenshot generation.

## Further Notes

The current codebase still reflects the old model in multiple places: email ingestion can create placeholder Projects and Tasks, Daily Log generation uses raw Communications and Tasks with sentiment, API project views assume Project-level Communications and Tasks, and document storage still uses a simpler file path model. These mismatches are expected and should be resolved by the refactor rather than patched piecemeal.

The implementation should respect all Project Matter ADRs, especially the decisions that Agent writes use Proposed Change Sets, Project matching uses explicit signals, files live outside the database through File References, database contents can be rebuilt, Project Candidates come from editable seed files, Project Title replaces codename, Project Identifier Variants support code matching, and Alembic gets a new baseline.
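
The partial-approval and dependency-blocking rule for Proposed Changes can be sketched in Python as follows. This is only an illustration under stated assumptions: `ProposedChange`, `Decision`, `depends_on`, and `executable_change_ids` are hypothetical names, the sketch treats a dependency as satisfied only when it is itself approved and unblocked, and it does not model already-executed changes or execution ordering.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedChange:
    change_id: str
    operation: str  # a domain operation type such as "create_project_matter"
    decision: Decision = Decision.PENDING
    depends_on: List[str] = field(default_factory=list)  # hypothetical field


def executable_change_ids(changes: List[ProposedChange]) -> List[str]:
    """Return ids of approved changes whose whole dependency chain is approved."""
    by_id = {change.change_id: change for change in changes}
    memo: Dict[str, bool] = {}

    def is_executable(change_id: str, seen: Tuple[str, ...] = ()) -> bool:
        if change_id in memo:
            return memo[change_id]
        change = by_id.get(change_id)
        if change is None or change_id in seen:
            result = False  # unknown dependency or dependency cycle
        elif change.decision is not Decision.APPROVED:
            result = False  # pending or rejected changes never execute
        else:
            result = all(
                is_executable(dep, seen + (change_id,)) for dep in change.depends_on
            )
        memo[change_id] = result
        return result

    # Input order is preserved; a real executor would also order by dependency.
    return [c.change_id for c in changes if is_executable(c.change_id)]


if __name__ == "__main__":
    change_set = [
        ProposedChange("m1", "create_project_matter", Decision.APPROVED),
        ProposedChange("l1", "link_communication_to_matter", Decision.APPROVED, ["m1"]),
        ProposedChange("m2", "create_project_matter", Decision.REJECTED),
        ProposedChange("t1", "create_task", Decision.APPROVED, ["m2"]),
    ]
    print(executable_change_ids(change_set))
```

In this example the approved Task `t1` stays blocked because the Matter it depends on, `m2`, was rejected, which is the kind of broken record the dependency rule is meant to prevent.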

liang added the ready-for-agent label 2026-06-14 10:12:16 +00:00

liang referenced this issue 2026-06-14 10:21:20 +00:00: Rebuild Project Matter schema baseline with smoke workflow #3

liang referenced this issue 2026-06-14 10:21:20 +00:00: Extract Obsidian wikilinks into editable Project Candidate seed files #4

liang referenced this issue

2026-06-14 10:21:20 +00:00