Views:
68,995β
Votes: 16β
Tags:
html
bash
html-escape-characters
Link:
π See Original Answer on Stack Overflow β§ π
URL:
https://stackoverflow.com/q/43058947
Title:
Bash script to convert from HTML entities to characters
ID:
/2017/03/28/Bash-script-to-convert-from-HTML-entities-to-characters
Created:
March 28, 2017
Edited: July 21, 2023
Upload:
September 15, 2024
Layout: post
TOC:
false
Navigation: false
Copy to clipboard: false
This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget
) on Stack Exchange and converting HTML to regular ASCII characters:
sed 's/ / /g; s/&/\&/g; s/</\</g; s/>/\>/g; s/"/\"/g; s/#'/\'"'"'/g; s/“/\"/g; s/”/\"/g;'
Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers
Edit June 26, 2017
Using sed
was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.
Hereβs the function:
bash bash
LineOut="" # Make global
HTMLtoText () {
LineOut=$1 # Parm 1= Input line
# Replace external command: Line=$(sed 's/&/\&/g; s/</\</g;
# s/>/\>/g; s/"/\"/g; s/'/\'"'"'/g; s/“/\"/g;
# s/”/\"/g;' <<< "$Line") -- With faster builtin commands.
LineOut="${LineOut// / }"
LineOut="${LineOut//&/&}"
LineOut="${LineOut//</<}"
LineOut="${LineOut//>/>}"
LineOut="${LineOut//"/'"'}"
LineOut="${LineOut//'/"'"}"
LineOut="${LineOut//“/'"'}" # TODO: ASCII/ISO for opening quote
LineOut="${LineOut//”/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()