Two bugs caused multiple articles to land on the same publish slot:
1. main.py: asyncio.create_task() returned immediately, allowing a second
pipeline trigger (N8N + Telegram /run or two N8N calls) to start a
second concurrent run. Added asyncio.Lock (_pipeline_lock) so any
second trigger while the pipeline is running is rejected immediately.
2. scheduler.py: reserve_publish_slot() read the list of occupied slots
and wrote the new slot in two separate DB connections. Concurrent threads
could both see the same "free" slot before either committed its write.
Fixed by wrapping the entire read-find-write cycle in a threading.Lock
(_slot_lock) and a single DB connection, so the slot check and the
slot assignment are atomic.
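
The read-find-write cycle from item 2 could look roughly like the sketch below. It is an illustration only: the `articles` table layout, the `candidates` parameter, and the function signature are assumptions, not the actual scheduler.py code.

```python
import sqlite3
import threading
from typing import Optional, Sequence

_slot_lock = threading.Lock()


def reserve_publish_slot(db_path: str, article_id: int,
                         candidates: Sequence[str]) -> Optional[str]:
    """Reserve the first candidate slot that no article occupies yet."""
    with _slot_lock:                      # serialise threads in this process
        conn = sqlite3.connect(db_path)   # one connection for read AND write
        try:
            rows = conn.execute(
                "SELECT scheduled_publish_at FROM articles "
                "WHERE scheduled_publish_at IS NOT NULL"
            ).fetchall()
            occupied = {row[0] for row in rows}
            for slot in candidates:
                if slot not in occupied:
                    conn.execute(
                        "UPDATE articles SET scheduled_publish_at = ? "
                        "WHERE id = ?",
                        (slot, article_id),
                    )
                    conn.commit()         # committed before the lock is released
                    return slot
            return None                   # every candidate is already taken
        finally:
            conn.close()
```

Note that a threading.Lock only protects threads inside one process; several worker processes would need a database-level guard in addition.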
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New /admin/article-list: paginated (50/page) table with thumbnail,
title, excerpt (120 chars), status, scheduled date, and WP ID input
- Sticky save bar with live change counter (JS tracks modified inputs,
highlights changed cells in amber, disables save when nothing changed)
- POST /admin/article-list/update: saves only changed WP IDs in one
request; clears stale wp_post_url so WP-Sync repopulates it cleanly
- Filter by status + free-text search (title or article ID)
- Pagination with page/filter state preserved through save redirects
- repositories: add list_articles_page() (offset + search) and
bulk_update_wp_post_ids()
- Dashboard nav: add Artikelliste (article list) link
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sync_db_from_wordpress() that treats WordPress as source of truth:
- future posts: update scheduled_publish_at to WP's actual date
- draft posts: clear scheduled_publish_at (not yet scheduled)
- published posts: mark article as 'published' in DB
- trashed/deleted posts: clear wp_post_id + wp_post_url + slot so article
can be re-processed
Exposed via POST /admin/wp-sync with a sync button on the schedule page.
Run after any manual rescheduling in WordPress to bring DB back in sync.
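
The four rules above can be sketched as a pure mapping from the WordPress post state to the DB columns to change. The helper name `plan_sync_update` and the dict-based return value are hypothetical; `None` as the status stands for a post that no longer exists in WordPress.

```python
from typing import Any, Dict, Optional


def plan_sync_update(wp_status: Optional[str],
                     wp_date: Optional[str]) -> Dict[str, Any]:
    """Return the column changes for one article, given its WP post state."""
    if wp_status is None or wp_status == "trash":
        # Post is gone: forget the WP link and free the slot for re-processing.
        return {"wp_post_id": None, "wp_post_url": None,
                "scheduled_publish_at": None}
    if wp_status == "future":
        return {"scheduled_publish_at": wp_date}   # WP's date wins
    if wp_status == "draft":
        return {"scheduled_publish_at": None}      # not yet scheduled
    if wp_status == "publish":
        return {"status": "published"}
    return {}                                      # other states: leave untouched
```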
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. Article age filter (ingestion.py + config.py):
- New setting pipeline_max_article_age_days=7 (0 = no limit)
- Skip RSS entries older than N days before expensive extract_article()
- Prevents old articles from Google Alerts re-entering pipeline
2. Image URL pre-validation (ingestion.py):
- HEAD request probe for each primary image candidate during ingestion
- Falls back to next-best candidate if primary returns 4xx
- Network errors treated as OK to avoid false negatives on flaky servers
3. Stale WP draft cleanup (pipeline.py):
- Quality gate rejections now delete any pre-existing WP draft (wp_post_id)
- Prevents orphaned drafts when re-running articles that previously had drafts
4. Schedule overview UI (scheduler.py + admin_ui.py + admin_schedule.html):
- New /admin/schedule page showing calendar grid of all booked slots
- Distinguishes Pipeline-DB slots from WordPress-only slots
- Link added to dashboard navigation
5. Retry for failed articles (admin_ui.py + admin_dashboard.html):
- New POST /admin/articles/{id}/retry endpoint: resets to 'new', releases slot
- '🔄 Wiederholen' (Retry) button shown in dashboard for all 'close' (error) articles
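
As an illustration of items 1 and 2, the age check and the image probe decision might be sketched as follows. All function names are hypothetical, and `probe` stands in for the HEAD request: it returns the HTTP status code, or `None` on a network error.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Optional


def is_too_old(published: datetime, max_age_days: int,
               now: Optional[datetime] = None) -> bool:
    """True if the entry exceeds the age limit; 0 disables the limit."""
    if max_age_days <= 0:
        return False
    now = now or datetime.now(timezone.utc)
    return now - published > timedelta(days=max_age_days)


def image_probe_ok(status_code: Optional[int]) -> bool:
    """Only a 4xx answer disqualifies an image; network errors count as OK."""
    if status_code is None:
        return True
    return not (400 <= status_code < 500)


def pick_image(candidates: Iterable[str],
               probe: Callable[[str], Optional[int]]) -> Optional[str]:
    """Return the first candidate (best first) whose probe does not fail."""
    for url in candidates:
        if image_probe_ok(probe(url)):
            return url
    return None
```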
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scheduler: use wordpress_app_password (not wordpress_password) so
_fetch_wp_occupied_slots() can actually authenticate against the WP
REST API; previously it silently returned an empty set every time
- pipeline: release reserved publish slot when draft creation fails with
a non-ValueError exception (e.g. WP API error), preventing permanently
blocked slots on failed articles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add individual Telegram message when an article is rejected by quality
gate (too short raw content or rewritten text), so users see each
rejection in real time instead of only in the bulk summary
- Add quality_gate_rejected counter to PipelineStats and result dict
- Show quality gate rejections separately in pipeline-done summary
(✂️ Qualitätsprüfung: N, i.e. "quality check"), distinct from score-based rejections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stage 1 (before OpenAI rewrite): reject if raw content < pipeline_min_words_raw (default 120)
Stage 2 (after rewrite): reject if rewritten text < pipeline_min_words_rewritten (default 150)
Both stages set status='error' with a descriptive note and skip WP draft creation.
The reserved publish slot is released so it stays available for the next article.
Quality rejections don't abort the pipeline — processing continues with the next article.
New config settings (overridable via .env):
PIPELINE_MIN_WORDS_RAW=120
PIPELINE_MIN_WORDS_REWRITTEN=150
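
A minimal sketch of the two-stage gate, assuming words are counted by whitespace splitting (the real counting method is not stated here). The function name and the note texts are hypothetical.

```python
from typing import Optional


def quality_gate(raw_text: str, rewritten_text: Optional[str] = None,
                 min_words_raw: int = 120,
                 min_words_rewritten: int = 150) -> Optional[str]:
    """Return a rejection note, or None if the article passes.

    Call with only raw_text before the rewrite (stage 1) and again with
    rewritten_text after the rewrite (stage 2).
    """
    raw_words = len(raw_text.split())
    if raw_words < min_words_raw:
        return f"raw content too short: {raw_words} < {min_words_raw} words"
    if rewritten_text is not None:
        rewritten_words = len(rewritten_text.split())
        if rewritten_words < min_words_rewritten:
            return (f"rewritten text too short: "
                    f"{rewritten_words} < {min_words_rewritten} words")
    return None
```

A caller would set status='error' with the returned note, release the slot, and move on to the next article.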
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the scheduler started searching from the last scheduled post date,
skipping all free slots in between (e.g. a free slot on Apr 20 would be ignored
if the last post was on May 18).
Now starts scanning from tomorrow, finding the first available slot regardless
of whether earlier dates have gaps — fills the calendar naturally.
Also extended lookahead from 30 to 60 days.
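
The new scan order can be illustrated as below. This sketch assumes one slot per day at a fixed time (09:00 is borrowed from an example timestamp elsewhere in this log); the function name and parameters are hypothetical.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional, Set


def find_first_free_slot(today: date, occupied: Set[datetime],
                         slot_time: time = time(9, 0),
                         lookahead_days: int = 60) -> Optional[datetime]:
    """Scan from tomorrow onwards and return the first unoccupied slot."""
    for offset in range(1, lookahead_days + 1):
        candidate = datetime.combine(today + timedelta(days=offset), slot_time)
        if candidate not in occupied:
            return candidate
    return None  # nothing free inside the lookahead window
```

With posts on Apr 19 and May 18 already booked and today being Apr 18, the scan returns Apr 20 rather than May 19.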
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The scheduler previously only checked the local SQLite DB for occupied slots.
Posts created outside the pipeline (e.g. recovery scripts) were invisible,
causing newly scheduled articles to land on already-taken WP dates.
_fetch_wp_occupied_slots() now queries WP /wp/v2/posts?status=future before
each slot assignment. All scheduling functions accept a wp_occupied set.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WordPress ignores the date field for draft posts and shows "Sofort veröffentlichen"
("Publish immediately") instead. Setting status=future causes WP to display and
honour the scheduled date, auto-publishing the post at the given time as intended.
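
A small sketch of how the post payload might switch between the two states. `title`, `content`, `status`, and `date` are WP REST API post fields; the helper name and its signature are assumptions.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def build_post_payload(title: str, content: str,
                       scheduled_publish_at: Optional[datetime] = None
                       ) -> Dict[str, Any]:
    """Build the body for a WP REST API post create/update request."""
    payload: Dict[str, Any] = {"title": title, "content": content,
                               "status": "draft"}
    if scheduled_publish_at is not None:
        # status=future makes WordPress honour the date and auto-publish.
        payload["status"] = "future"
        payload["date"] = scheduled_publish_at.isoformat(timespec="seconds")
    return payload
```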
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If scheduled_publish_at is not set when _do_rewrite_and_draft runs
(e.g. rewrite_and_update_draft called on a review article), reserve
a slot now so the WP draft always receives a future date.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the credit field only captured a marker prefix (e.g. "Foto:") because
CSS-class-based extraction picked up only the label element, fall back to
extracting the credit line from the full figcaption text with a regex.
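
The fallback might look like the following sketch. The marker list in the pattern and the function name are assumptions; the real extractor may recognise other prefixes.

```python
import re

# Hypothetical set of credit markers; matches from the marker to end of text.
_CREDIT_RE = re.compile(r"(?:Foto|Bild|Quelle|Credit)\s*:\s*(.+)$", re.IGNORECASE)


def resolve_credit(credit: str, caption: str) -> str:
    """Use the credit as-is unless it is empty or only a marker like 'Foto:'."""
    stripped = credit.strip()
    if stripped and not stripped.endswith(":"):
        return stripped
    match = _CREDIT_RE.search(caption)
    return match.group(0).strip() if match else stripped
```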
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Derive real source hostname from canonical URL when feed name is generic
(e.g. "Google Alerts"), so the link shows "moin.de" instead of "Google Alerts"
- Use _get_image_meta_for_url() (fuzzy URL matching) for image credit lookup
- Use caption field for Bildnachweis (image credit) since it already contains embedded credits
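
For the hostname derivation in the first item, a minimal sketch using urllib.parse. The set of generic feed names and the stripping of a leading "www." are assumptions for illustration.

```python
from urllib.parse import urlparse

GENERIC_FEED_NAMES = {"google alerts"}  # hypothetical list of generic names


def display_source_name(feed_name: str, canonical_url: str) -> str:
    """Prefer the article's hostname when the feed name says nothing useful."""
    if feed_name.strip().lower() not in GENERIC_FEED_NAMES:
        return feed_name
    host = urlparse(canonical_url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host or feed_name  # keep the feed name if the URL has no host
```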
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Image metadata keys may have query params (e.g. ?w=1200) that differ from
the selected_url stored in image_review. Fall back to comparing URLs without
query string so the figcaption text is correctly found.
Also simplified _build_image_caption: figcaption text already contains the
credit info, so just use caption directly instead of appending the redundant
credit prefix marker.
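
The query-string fallback described above could be sketched like this. The function names mirror the helper mentioned in this log but the bodies are an assumption, with `image_metadata` taken to be a dict keyed by image URL.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlsplit, urlunsplit


def _strip_query(url: str) -> str:
    """Drop query string and fragment, keep scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def get_image_meta_for_url(image_metadata: Dict[str, Any],
                           selected_url: str) -> Optional[Any]:
    """Exact match first, then a match that ignores query parameters."""
    if selected_url in image_metadata:
        return image_metadata[selected_url]
    target = _strip_query(selected_url)
    for key, meta in image_metadata.items():
        if _strip_query(key) == target:
            return meta
    return None
```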
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Google Alerts wraps matched keywords in <b>...</b> tags.
Strip all HTML tags from the title before storing.
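
One simple way to do this is a regex substitution, sketched below. The function name is hypothetical, and the whitespace normalisation is an extra step not mentioned above.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")


def clean_title(raw_title: str) -> str:
    """Remove HTML tags such as <b>...</b> and tidy up whitespace."""
    without_tags = _TAG_RE.sub("", raw_title)
    return " ".join(without_tags.split())
```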
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Google Alerts feed entries use google.com/url?...&url=<encoded_real_url>&...
tracking links. The extractor was fetching the Google redirect page instead
of the actual article, resulting in empty content and no images.
_resolve_google_redirect() extracts the real URL from the 'url' query
parameter before passing it to extract_article(). Non-Google URLs are
returned unchanged.
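
A sketch of the redirect resolution with urllib.parse. The exact host and path checks are assumptions; the actual _resolve_google_redirect() may be more or less strict.

```python
from urllib.parse import parse_qs, urlparse


def resolve_google_redirect(url: str) -> str:
    """Return the real article URL hidden in a Google redirect link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in ("google.com", "www.google.com") or parsed.path != "/url":
        return url                      # non-Google URLs pass through unchanged
    targets = parse_qs(parsed.query).get("url")  # parse_qs percent-decodes values
    return targets[0] if targets else url
```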
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
source_extraction.py:
- New _extract_image_metadata(): extracts figcaption text + copyright/credit
per image URL using 3 strategies (figure+figcaption, data-* attributes,
adjacent credit spans)
- ExtractedArticle gets new image_metadata field
- extracted_article_to_meta() includes image_metadata in stored JSON
pipeline.py:
- After auto image selection, check if selected_url is set
- Articles without usable image → status "no_image" (excluded with Telegram notice)
- PipelineStats and summary report include no_image counter
db.py:
- Add "no_image" to articles status CHECK constraint
- Migration: recreates articles table with updated constraint on existing DBs
workflow.py / main.py:
- Map no_image as own UI status with rewrite/close transitions
wordpress.py:
- _upload_featured_media() accepts image_caption param, sends to WP media
- _get_image_meta_for_url() / _build_image_caption() helpers
- _build_attribution_block(): separator + attribution paragraph at article end
(original link, author, Bildnachweis/credit)
- _build_post_content() appends attribution block
telegram_bot.py:
- notify_pipeline_done() shows 🖼️ no-image count
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- wordpress.py: catch image download/upload failures and skip image
instead of aborting the entire WP draft update
- pipeline.py: add INFO logs at each step of _do_rewrite_and_draft
to trace OpenAI call, tag generation, and WP API call
- telegram_bot.py: add INFO logs around rewrite execution + exc_info
on error for full traceback in logs
- repositories.py: include scheduled_publish_at in get_article_by_id
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites must not use first-person phrasing such as 'wir haben
erforscht/berechnet' ("we researched/calculated"), since the content
comes from a third-party source. The prompt now passes the source name
and instructs GPT to attribute all claims to the original publisher
(e.g. 'laut PiNCAMP' ("according to PiNCAMP"), 'die Auswertung zeigt'
("the analysis shows")).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Articles scoring between warn and auto threshold stayed in "new" status,
causing repeated warning notifications on every /run call. Now they are
set to "review" status after the first warning is sent.
The override callback already resets status to "new" before processing,
so the existing flow works correctly. Also include "review" articles in
/rejected command output so they can be acted on.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The N8N App Release Telegram Trigger had overwritten the webhook
registration, pointing it to N8N instead of the RSS-News backend.
This caused all callback_query events (inline buttons) to be lost,
breaking the override/rewrite/discard buttons.
Changes:
- Re-register webhook to https://news.vanityontour.de/telegram/webhook
with both message and callback_query in allowed_updates
- Add _forward_to_n8n_app_release() to proxy unknown bot commands
(e.g. /release) to the N8N App Release webhook, keeping that
workflow functional without needing its own Telegram Trigger
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reserve the publish slot before creating the WP draft so the
scheduled_publish_at timestamp is available when building the post
payload. WordPress receives the `date` field (e.g. 2026-03-24T09:00:00)
which sets the scheduled publish time on the draft.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Webhook returns 200 immediately, processing runs in background task
→ Telegram no longer retries, eliminates duplicate callbacks and 400 errors
- Consolidate answer_callback_query call to top of handler (before heavy work)
- Add logger.info/error for callback actions to aid debugging
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pipeline runs in background via asyncio. Endpoint returns immediately,
results arrive via Telegram notifications.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ampel (traffic-light) system removed: all enabled feeds are now processed
regardless of risk_level. Updated test to verify feeds with any risk_level are
processed instead of blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>