
2024 0922 Transformed HTML on the macOS clipboard

Have you ever noticed that you can copy text from a web page, and when you paste it into a rich text editor like Word or Google Docs, all the links work? Simple formatting is copied pretty well too, including bold, italics, and even tables.

This is great, but can you get it to output a text string of HTML tags for you?

For example, right now I’m looking at the Old fashioned (cocktail) article on Wikipedia. If you view source, you’ll see relative links to other articles like <a href="/wiki/Hudson,_New_York" title="Hudson, New York">Hudson, New York</a> and anchor links to footnotes like <sup id="cite_ref-5" class="reference"><a href="#cite_note-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>. Copying from the page source in the browser will just leave you with those useless non-absolute links.

But if you copy from that page and paste into a rich text editor, these relative and anchor links are transformed into absolute links like https://en.wikipedia.org/wiki/Hudson,_New_York and https://en.wikipedia.org/wiki/Old_fashioned_(cocktail)#cite_note-5. That’s what I want. How can I get that?

pbpaste(1) can’t do it

You’d really like the built-in pasteboard tool to be able to handle this for you, but it can’t. If you copy text from a browser and run pbpaste(1), you’ll get all the formatting stripped out.

The first documented definition of the word "cocktail" was in response to a reader's letter asking to define the word in the 6 May 1806, issue of The Balance and Columbian Repository in Hudson, New York.

pbpaste(1) can get RTF or PostScript data if you pass -Prefer rtf or -Prefer ps. However, the browser doesn’t put either of those formats on the clipboard when you copy, so you just end up with the plain-text result.

Wait, what kind of data is on the clipboard?

Oh yeah, maybe it’s worth being explicit about this: the clipboard may contain several different representations of the same data. We can look at them with a bit of AppleScript.

osascript -e 'tell app "Finder" to clipboard info'

When the clipboard contains text from a web browser, that shows:

«class HTML», 4604, «class utf8», 1760, «class ut16», 3520, string, 1106, Unicode text, 3518

Interesting that it has both a UTF-8 and a UTF-16 version on there, but what matters to us is that there is an HTML type. pbpaste(1) doesn’t support that type :(.
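If you want to check for that HTML type from a script before trying to read it, a minimal sketch might look like this. The function name has_html_flavor is made up for illustration, and it assumes the clipboard info output is formatted like the line above.

```shell
#!/bin/sh
# has_html_flavor: succeed if the `clipboard info` text on stdin lists an HTML type.
# Hypothetical helper: it only inspects text, it never touches the clipboard itself.
has_html_flavor() {
    # -F matches the literal string; -q prints nothing and only sets the exit status.
    grep -qF '«class HTML»'
}

# Usage on macOS (osascript is not available elsewhere):
#   osascript -e 'tell app "Finder" to clipboard info' | has_html_flavor && echo "HTML available"
```

Reading from stdin keeps the check separate from the osascript call, so it can be tried against a saved copy of the output.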

Use AppleScript

So instead we have to use AppleScript to get it. In fact, I copiloted a little shell script.

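#!/bin/sh
# Print the HTML representation of the macOS clipboard, if there is one.
set -eu

# Quote the heredoc delimiter so the shell passes the AppleScript through untouched.
osascript <<'EOF'
use framework "Foundation"
use framework "AppKit"

-- Ask the general pasteboard for its HTML representation.
set thePasteboard to current application's NSPasteboard's generalPasteboard()
set theHTML to thePasteboard's stringForType:(current application's NSPasteboardTypeHTML)

if theHTML is missing value then
    return "No HTML content found in clipboard."
else
    return theHTML as text
end if
EOF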

And that results in something like:

<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body>The first documented definition of the word "cocktail" was in response
to a reader's letter asking to define the word in the 6 May 1806, issue
of <i>The Balance and Columbian Repository</i> in <a href="https://en.wikipedia.org/wiki/Hudson,_New_York" title="Hudson, New York">Hudson, New York</a>.</body></html>

AHA!
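If you only want the fragment and not the html/head/body wrapper the browser adds, a small sed filter can trim it. This is a sketch that assumes the wrapper looks exactly like the output above (opening tags at the start of the first line, closing tags at the end of the last line). It is not a general HTML parser, and the name strip_html_wrapper is made up.

```shell
#!/bin/sh
# strip_html_wrapper: read clipboard HTML on stdin and print only the body fragment.
# Assumes the wrapper shape shown in the sample output; other browsers may differ.
strip_html_wrapper() {
    # First expression drops everything up to and including <body> on the opening line;
    # second expression drops the closing tags at the end of the last line.
    sed -e 's|^<html><head>.*</head><body>||' \
        -e 's|</body></html>$||'
}
```

Assuming the script above is saved under some name like clipboard-html (hypothetical), you would pipe its output through the function: clipboard-html | strip_html_wrapper.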
