
ncert.oriz.in app — combined PDF directory (scrape + merge + release)

decision | tags: decision, ncert, app, scraping, pdf-merge, github-releases, education

ncert.oriz.in — combined PDF directory

Why it exists

Explicit user request (2026-06-22): "I wanted to read the books but there was no combined book, no site that provided the combined books in a combined format. There were individual chapters available on the NCERT website but not the complete books."

ncert.nic.in publishes free official textbooks, but only as per-chapter PDFs. To get a "whole book" you'd download 10 to 15 PDFs for each class, subject and language combination and merge them yourself. This app removes that friction.

Pipeline

  1. Scrape https://ncert.nic.in/textbook.php via Playwright (use the playwright-cli skill: its binaries are signed, so it survives Defender ASR). Enumerate every Class × Subject × Language combination.
  2. Download each chapter PDF to a temp dir. Names follow ncert.nic.in's own convention.
  3. Sort chapters in correct order (chapter index from the catalog page, not filename).
  4. Merge using qpdf --empty --pages <chap1.pdf> <chap2.pdf> ... -- out.pdf (qpdf preserves PDF integrity; pdftk's Java dependency causes problems on CI).
  5. Name output {class}-{subject}-{language}.pdf (e.g. class-9-mathematics-en.pdf, class-10-vigyan-hi.pdf).
  6. Release to GitHub Releases at chirag127/oriz-ncert-app with tag books-YYYY-MM-DD. Each release has all merged PDFs as assets.
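
Steps 3 to 6 can be sketched in a few small helpers. This is a minimal illustration, not the app's actual code: the Chapter record and the function names are hypothetical, and the language is assumed to already be a short code such as en or hi. The qpdf argument order follows the command quoted in step 4.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass(frozen=True)
class Chapter:
    """Hypothetical record for one downloaded chapter."""
    index: int  # chapter position as listed on the catalog page
    path: Path  # per-chapter PDF in the temp dir


def qpdf_merge_args(chapters: list[Chapter], out: Path) -> list[str]:
    """Build the qpdf argv, ordering chapters by catalog index (not filename)."""
    ordered = sorted(chapters, key=lambda c: c.index)
    return ["qpdf", "--empty", "--pages",
            *(str(c.path) for c in ordered),
            "--", str(out)]


def output_name(class_no: int, subject: str, language: str) -> str:
    """Name the merged book {class}-{subject}-{language}.pdf."""
    slug = "-".join(subject.lower().split())
    return f"class-{class_no}-{slug}-{language}.pdf"


def release_tag(day: date) -> str:
    """Release tag in the books-YYYY-MM-DD form."""
    return f"books-{day.isoformat()}"


if __name__ == "__main__":
    chapters = [
        Chapter(2, Path("b.pdf")),
        Chapter(10, Path("a.pdf")),
        Chapter(1, Path("c.pdf")),
    ]
    out = Path(output_name(9, "Mathematics", "en"))
    print(qpdf_merge_args(chapters, out))
    print(release_tag(date(2026, 6, 1)))
```

The argument list could then be handed to subprocess.run. qpdf documents exit status 3 for "succeeded with warnings", so a caller may want to treat 0 and 3 as success rather than failing on any non-zero status.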

Cron cadence

Re-scrape once a year (cron on June 1, IST) to pick up new chapters or new editions. NCERT rarely updates its books, so monthly polling is unnecessary.
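
GitHub Actions evaluates schedule cron expressions in UTC, so a June 1 IST run has to be converted before it goes into the workflow. The sketch below shows that conversion. The 06:00 IST run time is an assumption for illustration (this note gives only the date), and ist_to_utc_cron is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed UTC+05:30 offset; India observes no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30))


def ist_to_utc_cron(month: int, day: int, hour: int, minute: int,
                    year: int = 2026) -> str:
    """Return a 5-field UTC cron expression for a yearly run at an IST wall time.

    The year only anchors the date arithmetic; it does not appear in the output.
    """
    local = datetime(year, month, day, hour, minute, tzinfo=IST)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} {utc.day} {utc.month} *"


if __name__ == "__main__":
    # Assumed run time: 06:00 IST on June 1.
    print(ist_to_utc_cron(6, 1, 6, 0))   # 30 0 1 6 *
    # Midnight IST falls on the previous UTC day.
    print(ist_to_utc_cron(6, 1, 0, 0))   # 30 18 31 5 *
```

The second call shows why the conversion matters: any IST time before 05:30 on June 1 lands on May 31 in UTC, so the day and month fields of the cron expression change too.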

Website surface

The catalog UI at ncert.oriz.in:

What we DON'T do

Languages in v0

Deferred to v1: Urdu (some books), Sanskrit (some books), regional translations.

GitHub Action

.github/workflows/scrape-and-release.yml:

Cross-refs