Frank Bennett
Snapshot images saved with Zotero or its cousin Multilingual Zotero do a pretty good job of capturing the elements of target Web pages. Unfortunately, the page types for which snapshots are most commonly used (newspapers, magazines, blogs) consist almost entirely of content that is of no immediate interest (sidebars, advertising, navigation headers and footers, and miscellaneous graphical clutter).
Tools such as AdBlock attempt to remove some of this extraneous content, but if you are only interested in the authored text embedded in a page, they may not go far enough for your taste. Particularly if you want to keep your library's attachment storage to a minimum size, something more aggressive might be useful.
MLZ Attacher approaches the problem from the other end, by completely recasting the content as a spartan page that contains only what it guesses to be meaningful text content. It relies on a "composeDoc()" translator function that is available only in Multilingual Zotero (MLZ). At present, it will not work with the official Zotero client.
As this is a fairly drastic approach, it is applied through a separate "Attach" button set next to the Zotero translation icon in the address line of the browser. The icon appears only when the URL field of the selected item (or its parent, in the case of attachments) matches the URL of the current browser tab.
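The visibility rule for the button can be sketched in a few lines. This is only an illustration in Python, not the plugin's own code: the `Item` class and `show_attach_button` function are hypothetical names, and treating "matches" as exact string equality is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for a library item or attachment."""
    url: str = ""
    parent: Optional["Item"] = None  # set only for attachments


def show_attach_button(selected: Optional[Item], tab_url: str) -> bool:
    """Return True when the "Attach" button should be visible."""
    if selected is None:
        return False
    # For an attachment, the URL of the parent item is the one that counts.
    source = selected.parent if selected.parent is not None else selected
    # Assumption: "matches" means exact equality of a non-empty URL.
    return bool(source.url) and source.url == tab_url
```

With this reading, selecting either an item or one of its attachments gives the same result for a given tab.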
The plugin was originally built to fix up a library with many missing attachments. In the current version, any snapshots or stored-file attachments with no content will be deleted from the item automatically when the "Attach" button is used.
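As a rough illustration of that cleanup step, the sketch below filters a list of attachments. The `Attachment` class, its `kind` labels, and the reading of "no content" as a size of zero bytes are all assumptions made for the example, not details taken from the plugin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attachment:
    """Hypothetical attachment record."""
    kind: str  # e.g. "snapshot", "stored-file", "linked-url" (assumed labels)
    size: int  # content size in bytes


def prune_empty(attachments: List[Attachment]) -> List[Attachment]:
    """Drop snapshots and stored files that have no content; keep the rest."""
    removable = ("snapshot", "stored-file")
    return [a for a in attachments
            if not (a.kind in removable and a.size == 0)]
```

Other attachment types are left alone here even when empty, since the description above mentions only snapshots and stored files.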
In the initial release, the following logic is used to identify "meaningful" text content:
- If the page contains a node with id "pdf" (JSTOR has these), the link in the node is attached to the current item as a PDF (this has been tested only with JSTOR).
- If the page contains an element with id "abstract" or "gs_opinion_wrapper", this element and its children are used to construct the snapshot.
- If the page contains one or more heading tags (in this context, "h1", "h2" or "h3", or a "div" with one or more of these as children) followed by at least one "div" tag that has at least two paragraph children ("p"), both the heading and paragraph tag sets are used to construct the snapshot.
- If the page contains one or more heading tags (in this context, "h1", "h2" or "h3") and paragraph tags ("p"), whatever their nesting relationship, the heading/paragraph set with the largest count of paragraph tags is used to construct the snapshot.
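
The rules above can be sketched roughly as follows. This is a Python illustration using only the standard library, not the plugin's own code (which relies on the MLZ "composeDoc()" translator function). Several points are assumptions: that the rules are tried in the order listed, that "followed by" means a later sibling, and that the "largest set" is the parent element with the most direct "p" children. All names (`Node`, `TreeBuilder`, `parse`, `choose_content`) are hypothetical, and the tiny tree builder expects reasonably well-formed markup.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
HEADINGS = {"h1", "h2", "h3"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []

    def walk(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; text nodes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def is_heading(node):
    """h1-h3, or a div that has one of them as a direct child."""
    if node.tag in HEADINGS:
        return True
    return node.tag == "div" and any(c.tag in HEADINGS for c in node.children)


def p_children(node):
    return [c for c in node.children if c.tag == "p"]


def choose_content(root):
    """Return (rule, result) for the first rule that applies, else None."""
    nodes = list(root.walk())
    # Rule 1: a node with id "pdf" supplies a link to attach as a PDF.
    for node in nodes:
        if node.attrs.get("id") == "pdf":
            for sub in node.walk():
                if sub.tag == "a" and sub.attrs.get("href"):
                    return ("pdf", sub.attrs["href"])
    # Rule 2: an element with a known id is used together with its children.
    for node in nodes:
        if node.attrs.get("id") in ("abstract", "gs_opinion_wrapper"):
            return ("element", node)
    # Rule 3: a heading followed (as a later sibling) by a div with >= 2 p children.
    for node in nodes:
        if is_heading(node) and node.parent is not None:
            siblings = node.parent.children
            later = siblings[siblings.index(node) + 1:]
            bodies = [s for s in later
                      if s.tag == "div" and len(p_children(s)) >= 2]
            if bodies:
                return ("heading+div", [node] + bodies)
    # Rule 4: headings plus the largest group of sibling paragraphs.
    headings = [n for n in nodes if n.tag in HEADINGS]
    containers = [n for n in nodes if p_children(n)]
    if headings and containers:
        best = max(containers, key=lambda n: len(p_children(n)))
        return ("largest-set", headings + p_children(best))
    return None
```

Calling `choose_content(parse(html))` on a page with a title, one "div" holding two paragraphs, and a second "div" holding a single advertising paragraph would select the heading and the first "div" under the third rule, leaving the one-paragraph block out.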
This logic is not perfect, but it seems to serve reasonably well. Feel free to fork and/or submit pull requests with improvements.
Enjoy!
Frank