Have you ever noticed that you can copy text from a web page, and when you paste it into a rich text editor like Word or Google Docs, all the links work? Simple formatting is copied pretty well too, including bold, italics, and even tables.
This is great, but can you get that same content as a text string of HTML tags?
View source isn’t good enough for links
For example, right now I’m looking at the Old fashioned (cocktail) article on Wikipedia.
If you view source, you’ll see relative links to other articles like:
<a href="/wiki/Hudson,_New_York" title="Hudson, New York">Hudson, New York</a>
and anchor links to footnotes like:
<sup id="cite_ref-5" class="reference"><a href="#cite_note-5"><span class="cite-bracket">[</span>5<span class="cite-bracket">]</span></a></sup>
Copying from the page source in the browser will just leave you with those useless non-absolute links.
But if you copy from that page and paste into a rich text editor, these relative and anchor links are transformed into absolute links like https://en.wikipedia.org/wiki/Hudson,_New_York and https://en.wikipedia.org/wiki/Old_fashioned_(cocktail)#cite_note-5.
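To make that transformation concrete, here is a rough sketch in plain sh. The resolve_href function is a hypothetical helper of mine, not something the browser exposes, and it only handles the two link shapes shown above (root-relative paths and fragment-only anchors). Real URL resolution has more cases than this.

```shell
#!/bin/sh
# resolve_href BASE HREF
# Hypothetical helper: resolves only root-relative ("/...") and
# fragment-only ("#...") links against the page URL in BASE.
resolve_href() {
    base=$1
    href=$2
    case "$href" in
        '#'*)
            # Fragment-only link: same page, with any old fragment replaced.
            printf '%s%s\n' "${base%%\#*}" "$href"
            ;;
        /*)
            # Root-relative link: keep scheme and host, replace the path.
            scheme=${base%%://*}
            rest=${base#*://}
            host=${rest%%/*}
            printf '%s://%s%s\n' "$scheme" "$host" "$href"
            ;;
        *)
            # Anything else is passed through unchanged.
            printf '%s\n' "$href"
            ;;
    esac
}

page='https://en.wikipedia.org/wiki/Old_fashioned_(cocktail)'
resolve_href "$page" '/wiki/Hudson,_New_York'
resolve_href "$page" '#cite_note-5'
```

Run against the article URL, this prints the same two absolute links as above.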
That’s what I want.
How can I get that?
pbpaste(1) can’t do it
You’d really like the built-in pasteboard tool to be able to handle this for you, but it can’t.
If you copy text from a browser and run pbpaste(1), you’ll get all the formatting stripped out:
The first documented definition of the word "cocktail" was in response to a reader's letter asking to define the word in the 6 May 1806, issue of The Balance and Columbian Repository in Hudson, New York.
pbpaste(1) can get RTF or PostScript data if you pass -Prefer rtf or -Prefer ps.
However, the browser doesn’t convert rich text to either of those formats on copy, so you just end up with the text result.
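If you want to see that fallback for yourself, here is a small sketch. It assumes that RTF data always begins with the literal characters {\rtf, and looks_like_rtf is a hypothetical helper name. The pbpaste call is guarded so the snippet does nothing on systems without it.

```shell
#!/bin/sh
# looks_like_rtf TEXT
# Hypothetical helper: succeeds if TEXT starts with the RTF header {\rtf.
looks_like_rtf() {
    case "$1" in
        '{\rtf'*) return 0 ;;
        *) return 1 ;;
    esac
}

# pbpaste only exists on macOS, so skip quietly elsewhere.
if command -v pbpaste >/dev/null 2>&1; then
    if looks_like_rtf "$(pbpaste -Prefer rtf)"; then
        echo "clipboard has RTF"
    else
        echo "no RTF on the clipboard; pbpaste fell back to something else"
    fi
fi
```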
Wait, what kind of data is on the clipboard?
Oh yeah, maybe it’s worth being explicit about this: the clipboard may contain different representations of the same data. We can look at it with a bit of AppleScript.
osascript -e 'tell app "Finder" to clipboard info'
When the clipboard contains text from a web browser, that shows:
«class HTML», 4604, «class utf8», 1760, «class ut16», 3520, string, 1106, Unicode text, 3518
Interesting that it has a UTF-8 and UTF-16 version on there, but what matters to us is that there is an HTML type.
pbpaste(1) doesn’t support that type :(.
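A script can make the same check before it tries to read the HTML. This is only a sketch: clipboard_has_html is a hypothetical helper, and it assumes the clipboard info output keeps the «class HTML» spelling shown above.

```shell
#!/bin/sh
# clipboard_has_html INFO
# Hypothetical helper: INFO is the text printed by "clipboard info".
clipboard_has_html() {
    case "$1" in
        *'«class HTML»'*) return 0 ;;
        *) return 1 ;;
    esac
}

# osascript only exists on macOS, so skip quietly elsewhere.
if command -v osascript >/dev/null 2>&1; then
    info=$(osascript -e 'tell app "Finder" to clipboard info')
    if clipboard_has_html "$info"; then
        echo "HTML is available"
    else
        echo "no HTML on the clipboard"
    fi
fi
```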
Use AppleScript
So instead we have to use AppleScript to get it. In fact, I copiloted a little shell script.
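#!/bin/sh
# Print the HTML representation of the macOS clipboard, if there is one.
set -eu

# Quote the heredoc delimiter so the shell passes the AppleScript through untouched.
osascript <<'EOF'
use framework "Foundation"
use framework "AppKit"

-- Ask the general pasteboard for its HTML representation.
set thePasteboard to current application's NSPasteboard's generalPasteboard()
set theHTML to thePasteboard's stringForType:(current application's NSPasteboardTypeHTML)

if theHTML is missing value then
    return "No HTML content found in clipboard."
else
    return theHTML as text
end if
EOF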
And that results in something like:
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body>The first documented definition of the word "cocktail" was in response
to a reader's letter asking to define the word in the 6 May 1806, issue
of <i>The Balance and Columbian Repository</i> in <a href="https://en.wikipedia.org/wiki/Hudson,_New_York" title="Hudson, New York">Hudson, New York</a>.</body></html>
AHA!
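Since the links were the whole point, here is a rough way to pull just the URLs out of that output. list_links is a hypothetical helper. It is a plain pattern match and not a real HTML parser, so it only sees double-quoted href values and does not decode entities like &amp;.

```shell
#!/bin/sh
# list_links: read HTML on stdin, print one href value per line.
# Hypothetical helper; only matches double-quoted href="..." attributes.
list_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Sample input taken from the output above.
printf '%s\n' 'of <i>The Balance and Columbian Repository</i> in <a href="https://en.wikipedia.org/wiki/Hudson,_New_York" title="Hudson, New York">Hudson, New York</a>.' | list_links
```

If the script above were saved under a name like pbpaste-html (a made-up name), piping its output into list_links would list every link in whatever you last copied.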