108 commits

Author SHA1 Message Date
OliverGiertz
f710141828 fix(scheduler): prevent duplicate slot assignment from concurrent pipeline runs
Two bugs caused multiple articles to land on the same publish slot:

1. main.py: asyncio.create_task() returned immediately, allowing a second
   pipeline trigger (N8N + Telegram /run or two N8N calls) to start a
   second concurrent run. Added asyncio.Lock (_pipeline_lock) so any
   second trigger while the pipeline is running is rejected immediately.

2. scheduler.py: reserve_publish_slot() read the list of occupied slots
   and wrote the new slot in two separate DB connections. Concurrent threads
   could both see the same "free" slot before either committed its write.
   Fixed by wrapping the entire read-find-write cycle in a threading.Lock
   (_slot_lock) and a single DB connection, so the slot check and the
   slot assignment are atomic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 05:03:13 +00:00
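The second fix in f710141828 (one lock around the whole read-find-write cycle, one DB connection) could look roughly like the sketch below. The table layout, the function signature, and the `candidates` parameter are assumptions made for illustration; only the names `reserve_publish_slot` and `_slot_lock` come from the commit message.

```python
import sqlite3
import threading
from datetime import datetime
from typing import Optional, Sequence

# Name taken from the commit message; everything else here is hypothetical.
_slot_lock = threading.Lock()


def reserve_publish_slot(conn: sqlite3.Connection, article_id: int,
                         candidates: Sequence[datetime]) -> Optional[datetime]:
    """Reserve the first free candidate slot for article_id, or return None."""
    with _slot_lock:  # serialise the whole read-find-write cycle
        rows = conn.execute(
            "SELECT scheduled_publish_at FROM articles "
            "WHERE scheduled_publish_at IS NOT NULL"
        ).fetchall()
        occupied = {row[0] for row in rows}
        for slot in candidates:
            key = slot.isoformat()
            if key not in occupied:
                conn.execute(
                    "UPDATE articles SET scheduled_publish_at = ? WHERE id = ?",
                    (key, article_id),
                )
                conn.commit()  # commit before the lock is released
                return slot
    return None
```

Because the commit happens while the lock is still held, a second thread that enters the critical section afterwards already sees the slot as occupied.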
OliverGiertz
2456e4aca7 chore: gitignore CLAUDE.md instead of CLAUDE_CONTEXT.md 2026-04-20 06:24:34 +00:00
OliverGiertz
1498fa7156 chore: gitignore CLAUDE_CONTEXT.md (contains credentials) 2026-04-20 06:20:20 +00:00
OliverGiertz
cdcf441daf feat(admin): bulk-editable article list with WP ID inline editing
- New /admin/article-list: paginated (50/page) table with thumbnail,
  title, excerpt (120 chars), status, scheduled date, and WP ID input
- Sticky save bar with live change counter (JS tracks modified inputs,
  highlights changed cells in amber, disables save when nothing changed)
- POST /admin/article-list/update: saves only changed WP IDs in one
  request; clears stale wp_post_url so WP-Sync repopulates it cleanly
- Filter by status + free-text search (title or article ID)
- Pagination with page/filter state preserved through save redirects
- repositories: add list_articles_page() (offset + search) and
  bulk_update_wp_post_ids()
- Dashboard nav: add Artikelliste ("article list") link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 09:00:25 +00:00
OliverGiertz
2d02b56b65 feat(admin): WordPress→DB sync for scheduled slots
Adds sync_db_from_wordpress() that treats WordPress as source of truth:
- future posts: update scheduled_publish_at to WP's actual date
- draft posts: clear scheduled_publish_at (not yet scheduled)
- published posts: mark article as 'published' in DB
- trashed/deleted posts: clear wp_post_id + wp_post_url + slot so article
  can be re-processed

Exposed via POST /admin/wp-sync with a sync button on the schedule page.
Run after any manual rescheduling in WordPress to bring DB back in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 08:53:44 +00:00
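The four sync rules in 2d02b56b65 amount to a mapping from WordPress post status to a DB update. The sketch below is a hypothetical helper, not the project's `sync_db_from_wordpress()`. It assumes the WordPress REST status strings `future`, `draft`, `publish`, and `trash`, and represents a deleted post as `None`.

```python
from typing import Any, Dict, Optional


def plan_sync_update(wp_status: Optional[str], wp_date: Optional[str]) -> Dict[str, Any]:
    """Return the column updates implied by a post's WordPress status.

    wp_status is None when the post no longer exists in WordPress.
    """
    if wp_status == "future":
        # WordPress is the source of truth for the scheduled date.
        return {"scheduled_publish_at": wp_date}
    if wp_status == "draft":
        # Not yet scheduled, so no slot is held.
        return {"scheduled_publish_at": None}
    if wp_status == "publish":
        return {"status": "published"}
    if wp_status in ("trash", None):
        # Clear all WP references so the article can be re-processed.
        return {"wp_post_id": None, "wp_post_url": None, "scheduled_publish_at": None}
    return {}  # other statuses (e.g. pending, private) are left untouched
```

Keeping the decision separate from the DB write makes each rule easy to check in isolation.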
OliverGiertz
8676ace102 feat(pipeline): article age filter, image URL validation, schedule UI, retry button
1. Article age filter (ingestion.py + config.py):
   - New setting pipeline_max_article_age_days=7 (0 = no limit)
   - Skip RSS entries older than N days before expensive extract_article()
   - Prevents old articles from Google Alerts re-entering pipeline

2. Image URL pre-validation (ingestion.py):
   - HEAD request probe for each primary image candidate during ingestion
   - Falls back to next-best candidate if primary returns 4xx
   - Network errors treated as OK to avoid false negatives on flaky servers

3. Stale WP draft cleanup (pipeline.py):
   - Quality gate rejections now delete any pre-existing WP draft (wp_post_id)
   - Prevents orphaned drafts when re-running articles that previously had drafts

4. Schedule overview UI (scheduler.py + admin_ui.py + admin_schedule.html):
   - New /admin/schedule page showing calendar grid of all booked slots
   - Distinguishes Pipeline-DB slots from WordPress-only slots
   - Link added to dashboard navigation

5. Retry for failed articles (admin_ui.py + admin_dashboard.html):
   - New POST /admin/articles/{id}/retry endpoint: resets to 'new', releases slot
   - '🔄 Wiederholen' ("Retry") button shown in dashboard for all 'close' (error) articles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 08:44:28 +00:00
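The age filter from item 1 of 8676ace102 comes down to one comparison made before the expensive extraction step. The sketch below is a minimal version of it. The function name and signature are hypothetical; only the setting name and the "0 = no limit" rule are taken from the commit message.

```python
from datetime import datetime, timedelta

PIPELINE_MAX_ARTICLE_AGE_DAYS = 7  # default from the commit; 0 disables the filter


def is_too_old(published: datetime, now: datetime,
               max_age_days: int = PIPELINE_MAX_ARTICLE_AGE_DAYS) -> bool:
    """Return True if an RSS entry should be skipped because of its age."""
    if max_age_days <= 0:
        return False  # 0 means no limit
    return now - published > timedelta(days=max_age_days)
```

An entry that is exactly `max_age_days` old still passes in this sketch; whether the real filter is inclusive at that boundary is not stated in the commit.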
OliverGiertz
cf2d826c8a fix(scheduler,pipeline): fix WP auth attribute name and release slot on hard errors
- scheduler: use wordpress_app_password (not wordpress_password) so
  _fetch_wp_occupied_slots() can actually authenticate against the WP
  REST API — previously always returned empty set silently
- pipeline: release reserved publish slot when draft creation fails with
  a non-ValueError exception (e.g. WP API error), preventing permanently
  blocked slots on failed articles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 08:22:26 +00:00
OliverGiertz
2d1dd14e45 fix(pipeline): send individual Telegram notifications for quality gate rejections
- Add individual Telegram message when an article is rejected by quality
  gate (too short raw content or rewritten text), so users see each
  rejection in real time instead of only in the bulk summary
- Add quality_gate_rejected counter to PipelineStats and result dict
- Show quality gate rejections separately in pipeline-done summary
  (✂️ Qualitätsprüfung ("quality check"): N) distinct from score-based rejections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 07:02:03 +00:00
OliverGiertz
09dcf6ce36 feat(pipeline): add two-stage article quality gate (min word count)
Stage 1 (before OpenAI rewrite): reject if raw content < pipeline_min_words_raw (default 120)
Stage 2 (after rewrite): reject if rewritten text < pipeline_min_words_rewritten (default 150)

Both stages set status='error' with a descriptive note and skip WP draft creation.
The reserved publish slot is released so it stays available for the next article.
Quality rejections don't abort the pipeline — processing continues with the next article.

New config settings (overridable via .env):
  PIPELINE_MIN_WORDS_RAW=120
  PIPELINE_MIN_WORDS_REWRITTEN=150

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 09:42:02 +00:00
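The two-stage gate from 09dcf6ce36 can be sketched as one check applied twice with different thresholds. Counting words by whitespace splitting is an assumption, and the helper names are hypothetical; the defaults are the ones given in the commit message.

```python
from typing import Optional

PIPELINE_MIN_WORDS_RAW = 120        # stage 1, before the rewrite
PIPELINE_MIN_WORDS_REWRITTEN = 150  # stage 2, after the rewrite


def word_count(text: str) -> int:
    """Count whitespace-separated words (assumed counting rule)."""
    return len(text.split())


def quality_gate(text: str, minimum: int, stage: str) -> Optional[str]:
    """Return a rejection note if text is too short, else None."""
    count = word_count(text)
    if count < minimum:
        return f"{stage}: {count} words, minimum is {minimum}"
    return None
```

A caller would run `quality_gate(raw, PIPELINE_MIN_WORDS_RAW, "raw")` before the rewrite and `quality_gate(rewritten, PIPELINE_MIN_WORDS_REWRITTEN, "rewritten")` after it. A non-`None` note means the article is marked as an error and the pipeline moves on to the next one.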
OliverGiertz
94bd93a18a fix(scheduler): fill schedule gaps instead of always appending to end
Previously the scheduler started searching from the last scheduled post date,
skipping all free slots in between (e.g. a free slot on Apr 20 would be ignored
if the last post was on May 18).

Now starts scanning from tomorrow, finding the first available slot regardless
of whether earlier dates have gaps — fills the calendar naturally.

Also extended lookahead from 30 to 60 days.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 09:34:27 +00:00
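The gap-filling scan described in 94bd93a18a can be sketched as a loop that starts tomorrow and returns the first slot not already taken. The two daily slot times are borrowed from the earlier scheduler commit (09:00 and 14:00). The function name and the use of naive datetimes are assumptions.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional, Set

SLOT_TIMES = (time(9, 0), time(14, 0))  # assumed daily slots
LOOKAHEAD_DAYS = 60                     # extended from 30 in this commit


def find_first_free_slot(today: date, occupied: Set[datetime]) -> Optional[datetime]:
    """Return the earliest free slot from tomorrow onwards, or None."""
    for offset in range(1, LOOKAHEAD_DAYS + 1):
        day = today + timedelta(days=offset)
        for slot_time in SLOT_TIMES:
            candidate = datetime.combine(day, slot_time)
            if candidate not in occupied:
                return candidate
    return None
```

Since the scan no longer depends on the latest scheduled date, a post booked far in the future (the May 18 example) does not hide earlier free slots.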
OliverGiertz
8fa46312e8 fix(scheduler): query WordPress future posts to avoid double-booking slots
The scheduler previously only checked the local SQLite DB for occupied slots.
Posts created outside the pipeline (e.g. recovery scripts) were invisible,
causing newly scheduled articles to land on already-taken WP dates.

_fetch_wp_occupied_slots() now queries WP /wp/v2/posts?status=future before
each slot assignment. All scheduling functions accept a wp_occupied set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 09:29:24 +00:00
OliverGiertz
764e7bff6a fix(ingestion): skip data: URIs and known placeholder images
- ingestion.py: filter out data:image/... inline URIs before ranking
- ingestion.py: penalise (-300) known placeholder paths (some-default.jpg etc.)
- wordpress.py: _is_usable_image_url rejects data: URIs and placeholder paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:09:44 +00:00
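As an illustration of 764e7bff6a, the sketch below drops `data:` URIs before ranking and applies the -300 penalty to placeholder paths. The candidate scores, the helper names, and the single-entry marker list are hypothetical; the commit only names `some-default.jpg` as one example.

```python
from typing import Dict, List

PLACEHOLDER_MARKERS = ("some-default.jpg",)  # the commit lists more ("etc.")
PLACEHOLDER_PENALTY = -300


def is_data_uri(url: str) -> bool:
    """Inline data: URIs cannot be fetched or uploaded as image URLs."""
    return url.strip().lower().startswith("data:")


def placeholder_penalty(url: str) -> int:
    """Return the score adjustment for known placeholder image paths."""
    lowered = url.lower()
    return PLACEHOLDER_PENALTY if any(m in lowered for m in PLACEHOLDER_MARKERS) else 0


def rank_candidates(scored: Dict[str, int]) -> List[str]:
    """Filter out data: URIs, apply penalties, and sort best-first."""
    adjusted = {
        url: score + placeholder_penalty(url)
        for url, score in scored.items()
        if not is_data_uri(url)
    }
    return sorted(adjusted, key=lambda url: adjusted[url], reverse=True)
```

Penalising rather than dropping placeholders keeps them available as a last resort, which matches the commit's wording ("penalise").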
OliverGiertz
426a799371 fix(wordpress): use status=future for posts with a future scheduled_publish_at
WordPress ignores the date field for draft posts and shows "Sofort veröffentlichen"
("Publish immediately") instead. Setting status=future causes WP to display and
honour the scheduled date, auto-publishing the post at the given time as intended.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:29:25 +00:00
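The status choice in 426a799371 can be sketched as a small payload builder. The function name is hypothetical. `status` and `date` are standard fields of the WordPress REST posts endpoint, and the date format follows the example in the earlier scheduling commit (e.g. 2026-03-24T09:00:00).

```python
from datetime import datetime
from typing import Dict, Optional


def build_schedule_fields(scheduled_publish_at: Optional[datetime],
                          now: datetime) -> Dict[str, str]:
    """Return the status/date part of a WP post payload."""
    if scheduled_publish_at is not None and scheduled_publish_at > now:
        # A future date only takes effect when the status is "future".
        return {"status": "future", "date": scheduled_publish_at.isoformat()}
    return {"status": "draft"}
```

Posts without a slot, or with a slot that already lies in the past, stay drafts in this sketch so they are never published by accident.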
OliverGiertz
8c6022fead fix(pipeline): always reserve publish slot before WP draft creation
If scheduled_publish_at is not set when _do_rewrite_and_draft runs
(e.g. rewrite_and_update_draft called on a review article), reserve
a slot now so the WP draft always receives a future date.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 14:14:03 +00:00
OliverGiertz
1a8d0775c7 fix(wordpress): correctly detect bare credit marker prefix before caption fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 08:47:09 +00:00
OliverGiertz
45c533c674 fix(wordpress): extract credit portion from caption for attribution block
When the credit field only captured a marker prefix (e.g. "Foto:") due to
CSS-class-based extraction picking up only the label element, fall back to
regex-extracting the credit line from the full figcaption caption text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 08:41:28 +00:00
OliverGiertz
d1cb809852 fix(wordpress): fix attribution block source name and image credit lookup
- Derive real source hostname from canonical URL when feed name is generic
  (e.g. "Google Alerts"), so the link shows "moin.de" instead of "Google Alerts"
- Use _get_image_meta_for_url() (fuzzy URL matching) for image credit lookup
- Use caption field for Bildnachweis (image credit) since it already contains embedded credits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 08:28:44 +00:00
OliverGiertz
82f2df610d fix(wordpress): fuzzy URL match for image metadata and simplify caption builder
Image metadata keys may have query params (e.g. ?w=1200) that differ from
the selected_url stored in image_review. Fall back to comparing URLs without
query string so the figcaption text is correctly found.

Also simplified _build_image_caption: figcaption text already contains the
credit info, so just use caption directly instead of appending the redundant
credit prefix marker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 08:24:40 +00:00
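The fallback lookup from 82f2df610d can be sketched as an exact match followed by a comparison with query strings removed. The shape of the metadata dictionary and the helper names below are assumptions; the project's own helper is `_get_image_meta_for_url()`.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return url without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def get_image_meta_for_url(image_metadata: Dict[str, Dict[str, Any]],
                           selected_url: str) -> Optional[Dict[str, Any]]:
    """Look up metadata by exact URL, then by URL without query string."""
    if selected_url in image_metadata:
        return image_metadata[selected_url]
    target = strip_query(selected_url)
    for key, meta in image_metadata.items():
        if strip_query(key) == target:
            return meta
    return None
```

Trying the exact match first keeps the behaviour unchanged for URLs that already agree, and the fuzzy comparison only runs as a fallback.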
OliverGiertz
8e65485f0c fix(ingestion): strip HTML tags from feed entry titles
Google Alerts wraps matched keywords in <b>...</b> tags.
Strip all HTML tags from the title before storing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 08:08:07 +00:00
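A minimal sketch of the title cleanup in 8e65485f0c, assuming a simple regular expression is enough for the `<b>...</b>` markup that Google Alerts inserts. The function name is hypothetical, and the whitespace normalisation is an addition of this sketch rather than something the commit states.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")


def strip_title_tags(title: str) -> str:
    """Remove HTML tags from a feed title and normalise whitespace."""
    without_tags = _TAG_RE.sub("", title)
    return " ".join(without_tags.split())
```

A regex is adequate for short, flat markup like highlighted keywords. Titles containing arbitrary HTML would call for a real parser.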
OliverGiertz
0d07a9804d fix(ingestion): resolve Google Alerts redirect URLs before article fetch
Google Alerts feed entries use google.com/url?...&url=<encoded_real_url>&...
tracking links. The extractor was fetching the Google redirect page instead
of the actual article, resulting in empty content and no images.

_resolve_google_redirect() extracts the real URL from the 'url' query
parameter before passing it to extract_article(). Non-Google URLs are
returned unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:10:30 +00:00
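The redirect handling described in 0d07a9804d can be sketched with the standard library's URL parser. This is not the project's `_resolve_google_redirect()`: the host check and the function name are assumptions, and only the `url` query parameter comes from the commit message.

```python
from urllib.parse import parse_qs, urlparse


def resolve_google_redirect(url: str) -> str:
    """Return the real article URL behind a Google Alerts tracking link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("google.com", "www.google.com") and parsed.path == "/url":
        # parse_qs percent-decodes the value of the 'url' parameter.
        targets = parse_qs(parsed.query).get("url")
        if targets:
            return targets[0]
    return url  # non-Google URLs are returned unchanged
```

For example, a link of the form `https://www.google.com/url?rct=j&url=https%3A%2F%2Fexample.com%2Fa&ct=ga` resolves to `https://example.com/a`.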
OliverGiertz
aaac5def27 feat(pipeline): image caption/credit extraction, no-image exclusion, WP attribution
source_extraction.py:
- New _extract_image_metadata(): extracts figcaption text + copyright/credit
  per image URL using 3 strategies (figure+figcaption, data-* attributes,
  adjacent credit spans)
- ExtractedArticle gets new image_metadata field
- extracted_article_to_meta() includes image_metadata in stored JSON

pipeline.py:
- After auto image selection, check if selected_url is set
- Articles without usable image → status "no_image" (excluded with Telegram notice)
- PipelineStats and summary report include no_image counter

db.py:
- Add "no_image" to articles status CHECK constraint
- Migration: recreates articles table with updated constraint on existing DBs

workflow.py / main.py:
- Map no_image as own UI status with rewrite/close transitions

wordpress.py:
- _upload_featured_media() accepts image_caption param, sends to WP media
- _get_image_meta_for_url() / _build_image_caption() helpers
- _build_attribution_block(): separator + attribution paragraph at article end
  (original link, author, Bildnachweis/credit)
- _build_post_content() appends attribution block

telegram_bot.py:
- notify_pipeline_done() shows 🖼️ no-image count

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:08:48 +00:00
OliverGiertz
1963e32ab4 fix(rewrite): make image upload non-fatal and add rewrite tracing logs
- wordpress.py: catch image download/upload failures and skip image
  instead of aborting the entire WP draft update
- pipeline.py: add INFO logs at each step of _do_rewrite_and_draft
  to trace OpenAI call, tag generation, and WP API call
- telegram_bot.py: add INFO logs around rewrite execution + exc_info
  on error for full traceback in logs
- repositories.py: include scheduled_publish_at in get_article_by_id

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 07:45:55 +00:00
OliverGiertz
12932bca90 fix(rewrite): attribute claims to source instead of using first-person 'wir'
Rewrites must not use 'wir haben erforscht/berechnet' ("we researched/calculated")
since the content comes from a third-party source. The prompt now passes the source
name and instructs GPT to attribute all claims to the original publisher
(e.g. 'laut PiNCAMP' ("according to PiNCAMP"), 'die Auswertung zeigt' ("the analysis shows")).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 07:36:09 +00:00
OliverGiertz
013af2ab62 fix(pipeline): set warning-zone articles to review status to prevent re-warnings
Articles scoring between warn and auto threshold stayed in "new" status,
causing repeated warning notifications on every /run call. Now they are
set to "review" status after the first warning is sent.

The override callback already resets status to "new" before processing,
so the existing flow works correctly. Also include "review" articles in
/rejected command output so they can be acted on.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 07:22:47 +00:00
OliverGiertz
a64bf31ff6 fix(telegram): restore webhook to RSS-News backend and forward app-release commands
The N8N App Release Telegram Trigger had overwritten the webhook
registration, pointing it to N8N instead of the RSS-News backend.
This caused all callback_query events (inline buttons) to be lost,
breaking the override/rewrite/discard buttons.

Changes:
- Re-register webhook to https://news.vanityontour.de/telegram/webhook
  with both message and callback_query in allowed_updates
- Add _forward_to_n8n_app_release() to proxy unknown bot commands
  (e.g. /release) to the N8N App Release webhook, keeping that
  workflow functional without needing its own Telegram Trigger

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 06:34:49 +00:00
OliverGiertz
970f509ad4 feat(wordpress): store suggested publish date directly in WP draft
Reserve the publish slot before creating the WP draft so the
scheduled_publish_at timestamp is available when building the post
payload. WordPress receives the `date` field (e.g. 2026-03-24T09:00:00)
which sets the scheduled publish time on the draft.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 11:15:39 +00:00
OliverGiertz
e9c472b722 fix(telegram): async webhook handler + deduplicate callback responses
- Webhook returns 200 immediately, processing runs in background task
  → Telegram no longer retries, eliminates duplicate callbacks and 400 errors
- Consolidate answer_callback_query call to top of handler (before heavy work)
- Add logger.info/error for callback actions to aid debugging

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 11:08:32 +00:00
OliverGiertz
1020526e76 fix(pipeline): run N8N pipeline endpoint async to avoid HTTP timeout
Pipeline runs in background via asyncio. Endpoint returns immediately,
results arrive via Telegram notifications.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 10:03:13 +00:00
OliverGiertz
d9ab599466 fix(deploy): correct service name and app path for Hetzner
Service is rss-news-api (not rss-app), app lives at /opt/rss-news.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 09:43:55 +00:00
OliverGiertz
0a9c0b10d6 test(ingestion): update test for removed Ampel risk-level check
Ampel system removed – all enabled feeds are now processed regardless
of risk_level. Updated test to verify feeds with any risk_level are
processed instead of blocked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 09:41:34 +00:00
OliverGiertz
6192f8e527 feat(automation): autonomous pipeline with Telegram bot and N8N integration
- Add full auto pipeline: RSS ingest → GPT relevance score → AI rewrite → WP draft
- Add Telegram bot with inline buttons (rewrite/discard/override) and commands (/run, /rejected, /status)
- Add smart publish scheduler: max 2 drafts/day, spread over week (09:00 & 14:00 CET)
- Add N8N API endpoints (/api/n8n/pipeline, /api/n8n/ingest) with X-API-Key auth
- Add GPT-based relevance scoring (0-100) for VanLife/Camping/Outdoor topics
- Remove Ampel risk-level policy check from ingestion (all enabled feeds are used)
- Add Telegram webhook endpoint and setup endpoint
- Add delete_wp_post() for Telegram discard action
- Add DB migrations for relevance_score and scheduled_publish_at columns
- Update .env.example with all new configuration variables
- Add docs/AUTOMATION.md with full setup and usage documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 09:40:15 +00:00
6332a9a399 feat(wordpress): publish true Gutenberg blocks and remove auto summary/details sections 2026-02-21 14:55:20 +01:00
93f52f72b9 fix(ingestion): preserve article workflow data and skip closed items on re-import 2026-02-21 14:51:36 +01:00
b0f995d5c9 feat(rewrite): add batch rewrite run, AI tags for WP, and agentur contact detection 2026-02-21 14:39:47 +01:00
da269d08f1 chore(admin): remove legal approval step from UI workflow 2026-02-21 14:11:03 +01:00
88b2ee1d01 feat(admin): add feed/source management, rewrite editor, reopen flow, and WP block output 2026-02-21 14:03:49 +01:00
50f737f434 feat(admin): add connectivity diagnostics page for domains and endpoints 2026-02-21 13:58:40 +01:00
35ccceb260 feat(workflow): simplify article flow and add automated rewrite step 2026-02-21 13:43:22 +01:00
8d7375c99f feat(ui): classify publisher errors with actionable hints 2026-02-21 13:11:43 +01:00
24d8e5ad0f feat(wordpress): improve post html structure and excerpt generation 2026-02-21 13:09:00 +01:00
e68b6a41fd feat(wordpress): upload selected image and set featured_media on draft publish 2026-02-21 13:07:08 +01:00
ba83b24510 chore: finalize current state and prepare next wordpress-focused roadmap 2026-02-18 11:11:49 +01:00
fee5e76842 feat(ui): add publish readiness indicators and WP env key aliases 2026-02-18 11:03:53 +01:00
592d699166 chore(config): load shared rss-news .env for wordpress and keys 2026-02-18 11:00:57 +01:00
1cee56205e feat(publisher): add wordpress draft queue with retry and admin controls 2026-02-18 10:49:43 +01:00
dcdf4d954a feat(ui): show auto image ranking reasons in article detail 2026-02-18 10:43:17 +01:00
26e3d26b93 feat(images): auto-select relevant article images and tidy detail header 2026-02-18 10:40:39 +01:00
fb3465fb10 fix(images): add proxy fallback to direct source url rendering 2026-02-18 10:20:47 +01:00
910ca72c81 fix(ui): render article images via authenticated proxy thumbnails 2026-02-18 10:16:30 +01:00
efaf132936 feat(images): add thumbnail gallery with select/exclude workflow 2026-02-18 10:11:22 +01:00