Mailing List Archive



Re: [tlug] Read my Firefox's cache contents on Linux



Dave M G writes:

 > I'm a little confused right now about how caching works.

No, you're not, because there is no prescribed "how" to understand.
Only whatever the application happens to do.  In computer parlance, a
*cache* is not only a collection of valuables (here, valuable data),
it is also *by definition* optional and ephemeral.  Caching is
transparent to the process; you notice it only through "side effects"
such as disk usage and (presumably) snappier performance.

For example, when you tab between page A and page B, the displayed
portion of page A is cached in graphics memory, and when you switch
back from page B to page A, the redisplay is instant.  If you now
scroll page A, the content is already cached in memory and/or on
disk, and display updating is instantaneous as far as a human can
tell.  In
theory, the browser *could* go back out to the web and refetch the
current page every time you tab, but you wouldn't put up with a
browser like that, would you?

 > Is it possible for a web site, by JavaScript or some other method
 > to stop itself from being written to a cache?

No.  All networking requires buffers, and a cache is nothing more nor
less than a buffer that somebody intentionally failed to clean up.

What a web site can do (as Birkir pointed out) is to advise the
application that caching will be ineffective because the data is
volatile.  Since resources are limited, a smart application will
decide to cache those things that are (a) expensive to reacquire or
(b) very likely to be reused, and preferably both, while releasing the
space occupied by data that is neither.

However, IIRC the no-cache pragma is aimed entirely at proxies, and
the Expires header is advisory; it certainly is not going to result
in a new fetch every time you hit PageDown!
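You can see what that advice looks like on the wire by pulling the
caching headers out of a response.  A minimal sketch, using made-up
sample headers typed in here (in practice you would capture the real
ones with something like curl -sI against the site in question):

```shell
# Hypothetical response headers from a site that declares its content
# volatile.  These are sample data, not an actual fetch.
headers='HTTP/1.1 200 OK
Date: Mon, 01 Jan 2007 00:00:00 GMT
Cache-Control: no-cache, must-revalidate
Pragma: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/html; charset=UTF-8'

# Pull out just the caching advice:
printf '%s\n' "$headers" | grep -iE '^(cache-control|pragma|expires):'
```

Note that all three of those lines are still only advice; the
application decides what to do with them.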

 > I tried an experiment where (after backing up), I deleted the contents 
 > of my cache entirely. Then I went to a web page which has dynamic 
 > content, and is on the same web site from which I originally got the 
 > data I'm after.
 > 
 > I closed FireFox and did nothing else. So, in theory, the cache should 
 > contain only my one visit to that site.

That theory is wrong.  The cache might contain nothing.  The cache
might also contain content from every page referenced by the page you
visited, plus content from everything in your bookmarks and history.
That latter would be a hellaciously aggressive cache.  When taken to
the petabyte extreme, it's also known as "Google".

 > And then I searched for text that I know for a fact to have been on that 
 > web site because I just looked at it.

No, you don't *know* that, unless you are very sophisticated indeed.
What is in the cache could be the raw object, in various stages of
partial decoding.  For example, it might be UTF-16 text, so if you
search for "Dave" you will find nothing, whereas "D\000a\000v\000e"
would get a hit.
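You can see the effect for yourself with a throwaway sample file
standing in for a cache entry (iconv and od are standard on Linux):

```shell
# Write "Dave" as UTF-16LE into a scratch file -- a stand-in for a
# cache entry that happens to be stored in that encoding.
tmp=$(mktemp)
printf 'Dave' | iconv -f UTF-8 -t UTF-16LE > "$tmp"

# od reveals the NUL byte after every letter:
od -c "$tmp"

# A plain search for "Dave" therefore finds nothing...
grep -q 'Dave' "$tmp" || echo 'plain search: no match'

# ...but converting back to UTF-8 first makes the text visible again:
iconv -f UTF-16LE -t UTF-8 < "$tmp" | grep -c 'Dave'

rm -f "$tmp"
```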

 > And yet, nothing. I can find all sorts of images referenced, from banner 
 > ads and whatnot, and pretty much anything *except* the main body text of 
 > the web page in question.

That's easy to explain.  Images are large and static, expensive to
fetch, and likely to be reused.  They're going to end up in the disk
cache.  Text is small and dynamic.  It will be cached in memory until
you close the containing tab or window, because you might tab back to
it and scroll.  But at that point it will be redundant, and the
browser will release it sometime after the cache fills.

 > So either I don't know how caches work, or the main bulk of the text on 
 > the web page is somehow avoiding being cached, or I'm still not 
 > searching using the right methodology.
 > 
 > Any hints or advice on this?

It would be nice if you'd tell us why you want to do this, for
starters.  You keep telling us what operations you've tried and that
they didn't work.  "Well, OK, Dave, that didn't work" is about all we
can really say for sure.

My suggestion at this point is "learn to use wget".  Browsers are not
designed to save everything they download.  They're designed to show
you pretty pictures quickly; they cache whatever they need to achieve
the highest performance possible given bounds on network
responsiveness, and you decide what you want to save permanently.
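For the record, a sketch of the sort of wget invocation I mean (the
URL is a placeholder; substitute the page you actually want):

```shell
# Fetch a page plus the images/CSS it references, rewriting links so
# the saved copy is browsable offline.  All flags are standard GNU
# wget options; the URL is purely illustrative.
wget --page-requisites \
     --convert-links \
     --adjust-extension \
     http://www.example.com/page.html
```

Unlike a browser cache, the result is a permanent copy in a layout
you control, which you can then grep to your heart's content.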

If what you want is sufficiently dynamic, you may have to write your
own browser.  Or maybe you can just drill down using the DOM explorer.

If you really want to, you could try something like

$ ( ulimit -d 5000; firefox ) &

The ulimit command restricts the amount of memory firefox is allowed
to use (-d caps the data segment, in kilobytes), which in theory
should force it to put practically everything into the disk cache (or
crash it; my money's on the latter).
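As an aside, on Linux -v (total address space) tends to bite harder
than -d, because modern allocators grab memory with mmap rather than
brk.  A small sketch of the effect, using python3 merely as a
convenient memory-hungry stand-in for any process:

```shell
# Cap the subshell's address space at ~500 MiB (ulimit -v takes KiB),
# then try to allocate a 1 GiB buffer; the allocation should fail.
# The subshell keeps the limit from affecting the rest of the session.
( ulimit -v 512000; python3 -c 'x = bytearray(1024**3)' ) 2>/dev/null \
    && echo 'allocation succeeded' \
    || echo 'allocation failed under the cap'
```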

HTH


