I needed to convert some webpages to a human-readable format for studying and review at a later time. After some thought, I came up with the following:
- download pages via curl/wget or good old “save page as” (see the fetch sketch just after this list)
- slugify filenames for easier shell manipulation
- convert to markdown using Aaron Swartz’s html2text
- rename file extensions to reflect new format
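
For the download step, here is a minimal sketch; the example URL, urls.txt, and the exact flags are my own placeholders rather than part of the original workflow, and a browser’s “save page as” works just as well:

# fetch a single page, following redirects
$ curl -L -o article.html https://example.com/article
# or fetch every URL listed in urls.txt (one per line), appending .html where needed
$ wget --adjust-extension --input-file=urls.txt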
# delete leftover directories (e.g. asset folders from “save page as”); -print0/xargs -0 handle names with spaces
$ find . -mindepth 1 -type d -print0 | xargs -0 rm -rf
# slugify filenames, convert each page to markdown, then delete the html originals
# (assumes a slugify CLI, e.g. python-slugify, that prints the slug of its argument)
$ for file in *.html; do mv "$file" "$(slugify "${file%.html}").html"; done && \
for file in *.html; do html2text "$file" > "${file%.html}.md"; done && \
rm *.html
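
The same steps can also be folded into a single pass, so each .html file is removed only after its markdown replacement has been written. This is just a sketch using the same tools as above (a slugify CLI and html2text on the PATH):

# one-pass variant: slugify the name, convert, then remove the original page
$ for file in *.html; do
    slug="$(slugify "${file%.html}")";
    html2text "$file" > "$slug.md";
    rm "$file";
  done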