1. Error message produced by HTML proofer
2. Finding all illegal (non-ASCII) characters in HTML files inside a folder

Sometimes you need to grep an entire directory for files that contain invalid characters. I personally came across this issue when running htmlproofer to validate this blog’s generated files. Let’s take a look at the error message and then the solution.

Error message produced by HTML proofer

	(....)
 'reencode'
/Users/joaorocha/.rvm/gems/ruby-2.5.1/gems/nokogumbo-2.0.2/lib/nokogumbo/html5.rb:164:in 'encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)

After that very enlightening error message (sarcasm), I decided to look for the problematic file in my _site/ directory, which contains the HTML generated by Jekyll.

Finding all illegal (non-ASCII) characters in HTML files inside a folder

First, install pcregrep on macOS:

brew install pcre

Then we scan the ./_site directory for all files with an html extension and pass their paths to a pcregrep command.

find "./_site" -name "*.html" -print0 | xargs -0 pcregrep --color='auto' -n '[^\x00-\x7F]'
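
If pcregrep isn't an option, plain grep can do a similar job. What follows is only a minimal sketch and not the approach used for this blog: find_non_ascii is a hypothetical helper name, and it assumes a grep that supports POSIX character classes and compares raw bytes when run under LC_ALL=C.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): print every line that
# contains a non-ASCII byte in the .html files under a directory.
find_non_ascii() {
  local dir="${1:-./_site}"
  # LC_ALL=C makes grep work on raw bytes. In that locale, [:print:] and
  # [:cntrl:] together cover 0x00-0x7F, so the negated class matches
  # only bytes 0x80-0xFF.
  # -H always prints the file name, -n prints the line number.
  # -print0 / -0 keep paths with spaces intact.
  find "$dir" -name '*.html' -print0 |
    LC_ALL=C xargs -0 grep -H -n '[^[:print:][:cntrl:]]'
}
```

Calling find_non_ascii ./_site prints each offending line as path:line-number:content, so the file and line to fix are visible at a glance.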

Credits for the pcregrep section here.