Server Log Analysis Case Study: A New Site

Over the past two days I finished splitting the server logs, and the site's content has gradually been filled out, so it was time to open the site up to spiders. Analyzing the last two days of logs revealed an interesting phenomenon: Baiduspider does not rush to crawl a new site's core pages. My assumption was that a spider would first fetch the homepage, then parse the navigation URLs, and then go on to parse the content URLs; that should be the most efficient, most elegant crawl strategy. The logs, however, tell a different story.
[Image: server log screenshot]

First, let's look at a few log entries and let the data do the talking:
123.125.71.90 - - [16/Feb/2015:23:56:15 +0800] "GET /topic/SEO HTTP/1.1" 200 6507 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
***
***
123.125.71.45 - - [16/Feb/2015:23:59:08 +0800] "GET /static/css/default/link.css?v=20141205 HTTP/1.1" 200 2418 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.47 - - [16/Feb/2015:23:59:08 +0800] "GET /static/css/icon.css HTTP/1.1" 200 4582 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.49 - - [16/Feb/2015:23:59:08 +0800] "GET /static/css/bootstrap.css HTTP/1.1" 200 109498 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.57 - - [16/Feb/2015:23:59:09 +0800] "GET /static/css/default/common.css?v=20141205 HTTP/1.1" 200 73536 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.53 - - [16/Feb/2015:23:59:10 +0800] "GET /static/js/jquery.2.js?v=20141205 HTTP/1.1" 200 84244 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.34 - - [16/Feb/2015:23:59:11 +0800] "GET /static/js/jquery.form.js?v=20141205 HTTP/1.1" 200 15248 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.44 - - [16/Feb/2015:23:59:12 +0800] "GET /static/js/compatibility.js HTTP/1.1" 200 1125 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.34 - - [16/Feb/2015:23:59:16 +0800] "GET /static/js/app/topic.js HTTP/1.1" 200 2902 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.50 - - [16/Feb/2015:23:59:17 +0800] "GET /topic/ajax/question_list/type-favorite__topic_title-SEO__page-0 HTTP/1.1" 200 5 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.57 - - [16/Feb/2015:23:59:17 +0800] "GET /topic/ajax/get_focus_users/topic_id-3 HTTP/1.1" 200 1456 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.33 - - [16/Feb/2015:23:59:17 +0800] "GET /crond/run/1424102175 HTTP/1.1" 200 5 "http://zhixin99.com/topic/SEO" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
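Entries like these can be filtered and tallied programmatically rather than read by eye. The following is a minimal sketch for the combined log format shown above, using only the Python standard library; the function name and field names are my own, not from any particular tool:

```python
import re
from collections import Counter

# Regex for the Apache/nginx "combined" log format seen in the excerpts above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def spider_paths(lines, spider="Baiduspider"):
    """Count which paths a given spider requested."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and spider in m.group("agent"):
            hits[m.group("path")] += 1
    return hits

# One of the real lines from the log above, as a usage example.
sample = [
    '123.125.71.90 - - [16/Feb/2015:23:56:15 +0800] "GET /topic/SEO HTTP/1.1" '
    '200 6507 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"',
]
print(spider_paths(sample))
```

Running this over a full day's log file (one line per entry) surfaces which sections Baiduspider is actually spending its crawl budget on.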

A careful read of the logs shows that Baiduspider does not immediately focus on parsing the navigation URLs. Instead, it picks one important page and starts resolving that page's CSS, JS, and AJAX requests. My robots.txt had long since blocked the static directory, yet Baiduspider fetched those files anyway. Some of the crawled pages are junk pages that had not yet been blocked, so they need to be disallowed by rule as well. I therefore added a few blocking lines:

User-agent: *
Disallow: /app/
Disallow: /cache/
Disallow: /install/
Disallow: /models/
Disallow: /system/
Disallow: /tmp/
Disallow: /views/
Disallow: /static/
Disallow: /?/admin/
Disallow: /today/
Disallow: /topic/*?rf
Disallow: /reader/
Disallow: /crond/
Disallow: /question/
Disallow: /topic/ajax/

The bolded lines are the newly added rules. With sensible blocking in place, I hope Baiduspider will spend its crawl budget on the useful pages. I will keep monitoring Baiduspider's crawl behavior in follow-up posts.
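Before deploying a robots.txt change, it is worth sanity-checking that the rules actually cover the offending URLs from the log. A quick sketch with Python's standard-library parser, using an abbreviated subset of the rules above (note: this only checks what the rules *say*; as the logs show, Baiduspider may still fetch blocked URLs in practice):

```python
import urllib.robotparser

# Parse a subset of the robots.txt rules listed above.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /static/
Disallow: /crond/
Disallow: /topic/ajax/
""".splitlines())

agent = "Baiduspider"
print(rp.can_fetch(agent, "/topic/SEO"))             # content page, not blocked
print(rp.can_fetch(agent, "/static/css/icon.css"))   # blocked by /static/
print(rp.can_fetch(agent, "/topic/ajax/get_focus_users/topic_id-3"))  # blocked
```

Each path here is taken straight from the log excerpts, so the check maps directly onto the crawl behavior observed.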