Pre-generate your web site cache files with HTTrack

If your website is using a cache system this article may interest you.

If your website does not have a cache system, you should install one right this instant. Remember that a cache system will allow your server to handle more users, your users to load pages faster, and you to sleep better.

There are many cache systems, and they are useful for almost any kind of web site, CMS and e-commerce alike.
If you are using WordPress, WP Super Cache is a good solution.
If you are using Magento, Varnish Cache can be relevant depending on your needs.
Whichever solution you choose, the method explained here should work.

Why pre-generate cache files?

One major drawback of a cache system is that you sometimes have to empty the cache, let’s say 10 minutes before going live on a sales day, to be sure all prices are recalculated.
When this happens and 1,000,000 users try to connect to a non-cached website, rebuilding the cache under that load will make the server suffer.
An easy solution is to pre-create the cache before going live, for (almost) every page.
This article shows you how to do that.

What do I need?

Many programs can do that. We chose HTTrack: it’s free, easy to use and has many features, of which we will only use the crawling ones.
Download HTTrack from the HTTrack website.

On Mac OS, you will need MacPorts, which provides the port command used below.

Then install HTTrack by typing the following in a Terminal window:
$ sudo port install httrack
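
Once the installation has finished, it can be worth confirming that the shell actually finds the program. The following is a minimal sketch; have_cmd is a hypothetical helper name chosen for this example, not something HTTrack provides.

```shell
#!/bin/sh
# Report whether a program is available on the PATH.
# have_cmd is a hypothetical helper name, not part of HTTrack.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd httrack; then
    echo "httrack found at: $(command -v httrack)"
else
    echo "httrack not found; check your installation and your PATH"
fi
```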

Let’s get to it!

Pre-generating your cache is so simple that it fits on a single line. Just type this in Terminal (or a shell or cmd.exe):
httrack http://www.your-website.tld/ -c1R2Kb1s0ap0r3F "HTTrack cache builder" "-*.jpg" "-*.css" "-*.js" "-*.png" "-*.gif" "-*.jpeg" "-*.ico"

If it makes you feel more comfortable, you can expand the c1R2Kb1s0ap0r3F gibberish:
httrack http://www.your-website.tld/ -c1 -R2 -K -b1 -s0 -a -p0 -r3 -F "HTTrack cache builder" "-*.jpg" "-*.css" "-*.js" "-*.png" "-*.gif" "-*.jpeg" "-*.ico"

The filters are quoted so that the shell passes them to HTTrack untouched instead of trying to expand the * itself.

Let’s have a closer look at it:

httrack: The program, of course.
http://www.your-website.tld/: The home URL of your web site.
c1: Number of simultaneous connections (here 1); you can increase it.
R2: Number of retries, in case of timeout or non-fatal errors.
K: Keep original links (we don’t need HTTrack to update the links since we won’t keep the downloaded files).
b1: Accept cookies in cookies.txt.
s0: Do not follow robots.txt and meta robots tags.
a: Stay on the same address.
p0: Just scan, don’t save anything.
r3: Set the mirror depth to 3. This should be OK. Keep in mind that pre-generating too many pages can be counterproductive for your cache and brings no benefit: pages that are “hidden” deep inside the web site do not need to be cached, as only a few people will load them.
F "HTTrack cache builder": User-agent field. This can be useful to check what happened in the log files.
-*.jpg -*.css -*.js -*.png -*.gif -*.jpeg -*.ico: Do not download static files, as they do not need to be generated. (Moreover, they are usually not handled by the same cache system.)
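
Because the crawler announces itself with its own user agent, its requests are easy to find in the server logs afterwards. The sketch below counts them; it assumes an access log that records the user agent on each line (as the common "combined" format does), and both the log path and the count_agent_hits helper are hypothetical and need adapting to your server.

```shell
#!/bin/sh
# Count the requests made with a given user agent in an access log.
# count_agent_hits is a hypothetical helper; the log path below is an assumption.
count_agent_hits() {
    agent=$1
    logfile=$2
    # -F: treat the agent as a fixed string, -c: print the number of matching lines
    grep -F -c -- "$agent" "$logfile"
}

LOG=${LOG:-/var/log/apache2/access.log}   # assumed location, adjust to your server
if [ -r "$LOG" ]; then
    count_agent_hits "HTTrack cache builder" "$LOG"
else
    echo "cannot read $LOG" >&2
fi
```

A count that roughly matches the number of pages you expected to warm is a good sign that the crawl reached them.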

Remember that you can get help by typing "httrack --help" in Terminal, or by reading the HTTrack Users Guide on the HTTrack website.
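
If you plan to warm the cache before every sale, the options described above can be kept in a small script. This is only a sketch under stated assumptions: warm_cache and the DRY_RUN variable are hypothetical names introduced here, the URL is the article's placeholder, and the depth defaults to the value of 3 used above.

```shell
#!/bin/sh
# Warm the cache of one site with the HTTrack options described above.
# warm_cache and DRY_RUN are hypothetical names used for this sketch.
warm_cache() {
    url=$1
    depth=${2:-3}
    # Build the argument list in the positional parameters so quoting is preserved.
    set -- "$url" -c1 -R2 -K -b1 -s0 -a -p0 "-r$depth" -F "HTTrack cache builder"
    # Exclude static files: one quoted filter per extension.
    for ext in jpg jpeg png gif ico css js; do
        set -- "$@" "-*.$ext"
    done
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it (quotes are not shown).
        printf 'httrack'
        printf ' %s' "$@"
        printf '\n'
    else
        httrack "$@"
    fi
}

# Example: preview the command for the placeholder site without crawling it.
( DRY_RUN=1; warm_cache http://www.your-website.tld/ 3 )
```

Leaving DRY_RUN unset runs the real crawl, so the preview is a cheap way to check the options before launch day.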

Was that really helpful?

You will have to test live to know.
But a good way to check is to launch HTTrack again once the cache has been generated.
It should go at least 5 times faster.
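
One way to put numbers on this comparison is to time the two runs. The sketch below is an illustration rather than a benchmark: elapsed_seconds and crawl are hypothetical names, crawl contains a placeholder that you would replace with your own httrack command line, and date +%s is assumed to be available (it is on Mac OS and Linux).

```shell
#!/bin/sh
# Run a command and print how many whole seconds it took.
# elapsed_seconds is a hypothetical helper name.
elapsed_seconds() {
    start=$(date +%s)
    # Ignore the output and the exit status; only the duration matters here.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Placeholder: replace the body with the httrack command line used to build the cache.
crawl() {
    sleep 1
}

cold=$(elapsed_seconds crawl)   # first pass: pages are generated and cached
warm=$(elapsed_seconds crawl)   # second pass: pages should come from the cache
echo "first run: ${cold}s, second run: ${warm}s"
```

With a real crawl in place of the placeholder, a second run that is not clearly shorter suggests the pages are not being served from the cache.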


About the Author

Benjamin Bellamy

Paris, Beirut, NYC & Agen // e-commerce, social media, open-source & geek // follow me on twitter: @benjaminbellamy.
