type: decision
status: active
timestamp: 2026-06-22
tags: [decision, ncert, app, scraping, pdf-merge, github-releases, education]

ncert.oriz.in app — combined PDF directory (scrape + merge + release)

ncert.nic.in publishes only per-chapter PDFs. ncert.oriz.in exists to provide combined whole-book PDFs that are not available anywhere else. A GitHub Action scrapes https://ncert.nic.in/textbook.php via Playwright (using the playwright-cli skill or playwright-mcp), enumerates every Class × Subject × Language combination, downloads each chapter PDF, merges the chapters in the correct order with qpdf (preferred over pdftk), names the output {class}-{subject}-{lang}.pdf, and publishes the merged files as GitHub release assets. The website is the catalog UI that links to the GitHub release URLs, sorted so the right download is easy to find. Languages: English + Hindi (other regional NCERT editions deferred to v1).


Why it exists

Explicit user request (2026-06-22): “I wanted to read the books but there was no combined book, no site that provided the combined books in a combined format. There were individual chapters available on the NCERT website but not the complete books.”

ncert.nic.in publishes free official textbooks, but ONLY as per-chapter PDFs. To get a “whole book” you’d download 10-15 PDFs per class-subject-language and merge them yourself. This app removes that friction.

Pipeline

  1. Scrape https://ncert.nic.in/textbook.php via Playwright (use the playwright-cli skill: its binaries are signed, so it survives Defender ASR rules). Enumerate every Class × Subject × Language combination.
  2. Download each chapter PDF to a temp dir. Names follow ncert.nic.in’s own convention.
  3. Sort chapters in correct order (chapter index from the catalog page, not filename).
  4. Merge using qpdf --empty --pages <chap1.pdf> <chap2.pdf> ... -- out.pdf (qpdf preserves PDF integrity; pdftk has Java dependency issues on CI).
  5. Name output {class}-{subject}-{language}.pdf (e.g. class-9-mathematics-en.pdf, class-10-vigyan-hi.pdf).
  6. Release to GitHub Releases at chirag127/oriz-ncert-app with tag books-YYYY-MM-DD. Each release has all merged PDFs as assets.
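Steps 3 to 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the app's actual code: the names Chapter, slugify, output_name, qpdf_merge_command, and release_tag are hypothetical, and only the qpdf invocation, the file-name pattern, and the tag format come from the list above. It sorts by the catalog index rather than the file name because a plain string sort would place a chapter 10 file before a chapter 2 file.

```python
import re
import subprocess
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class Chapter:
    """Hypothetical record for one downloaded chapter PDF."""
    index: int  # chapter position taken from the catalog page, not the filename
    path: Path  # location of the downloaded PDF in the temp dir


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def output_name(class_no: int, subject: str, lang: str) -> str:
    """Build the merged file name, e.g. class-9-mathematics-en.pdf."""
    return f"class-{class_no}-{slugify(subject)}-{slugify(lang)}.pdf"


def qpdf_merge_command(chapters: list[Chapter], out: Path) -> list[str]:
    """Return the qpdf argument list that merges chapters in catalog order."""
    ordered = sorted(chapters, key=lambda chapter: chapter.index)
    inputs = [str(chapter.path) for chapter in ordered]
    return ["qpdf", "--empty", "--pages", *inputs, "--", str(out)]


def merge(chapters: list[Chapter], out: Path) -> None:
    """Run qpdf; raises CalledProcessError if qpdf exits with an error status."""
    subprocess.run(qpdf_merge_command(chapters, out), check=True)


def release_tag(day: date) -> str:
    """Build the release tag in the books-YYYY-MM-DD format."""
    return f"books-{day.isoformat()}"
```

Building the command as a list and passing it to subprocess.run without a shell avoids quoting problems if a chapter path ever contains spaces.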

Cron cadence

Re-scrape once a year (cron on June 1 IST) to pick up any new chapters or editions. NCERT rarely updates its books, so monthly polling is unnecessary.
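One detail worth checking when writing the schedule: GitHub Actions cron expressions are, as far as the documentation states, evaluated in UTC, so a June 1 IST trigger has to be shifted by 5 hours 30 minutes. The sketch below does that conversion. The helper name utc_cron_for_ist is hypothetical, and the midnight start time is an assumption, since this note does not specify a time of day.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is a fixed UTC+05:30 offset with no daylight saving.
IST = timezone(timedelta(hours=5, minutes=30))


def utc_cron_for_ist(month: int, day: int, hour: int = 0, minute: int = 0) -> str:
    """Convert a yearly IST date and time into a five-field cron string in UTC.

    The year is only a placeholder needed to build a datetime; midnight is an
    assumed default, not something this note specifies.
    """
    local = datetime(2026, month, day, hour, minute, tzinfo=IST)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} {utc.day} {utc.month} *"


print(utc_cron_for_ist(6, 1))  # 30 18 31 5 *
```

Midnight on June 1 IST falls on May 31 in UTC, so the month and day fields change along with the hour.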

Website surface

The catalog UI at ncert.oriz.in:

What we DON’T do

Languages in v0

In v0: English and Hindi. Deferred to v1: Urdu (some books), Sanskrit (some books), regional translations.

GitHub Action

.github/workflows/scrape-and-release.yml:

Cross-refs

