Jun 25, 2007
Find googlebot with IIS Log Parser
Posted by admin under Tools
For web stats analysis I have tested a lot of different solutions. While both SmarterStats and AWStats are pretty good in my personal opinion, I jumped on the Google Analytics train a while ago and must say it's a really nice interface for people working with multiple sites.
One of the things with a JavaScript-based solution such as Google Analytics is that you can only track what you explicitly tag. Meaning - if you don't add the tracking JavaScript code to a certain page, that page will never be counted.
Also a lot of other stuff, such as 404s, image bandwidth, hotlinking etc., obviously can't be found that way.
So let's look at how you can use Microsoft Log Parser to get the information you are missing directly from your log files. In this example we are going to retrieve every entry with googlebot in it to see how often the Google robot visits our site.
While there is an exe file capable of doing most of the things we want, since I'm a coder I want to do it with C# code.
Let's say we have a bunch of sites. Each of those sites stores its IIS log files in a separate directory.

Now we want to retrieve the log entries with googlebot and put them into a new output directory.

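First we download Log Parser. If you look in the Log Parser directory after installing, you'll see that we have a LogParser.dll file. Adding a reference to that DLL from the C# project is what gives us the MSUtil types used in the code further down.
Great! Now on to the code. I start by defining some stuff in my app.config: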
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="site1_name" value="ASPCode Forum"/>
    <add key="site1_logfilepath" value="D:\ex070604\ex070604\"/>
    <add key="site1_outputpath" value="D:\ex070604\ex070604\test\"/>
    <add key="site2_name" value="Site2"/>
    <add key="site2_logfilepath" value="D:\exdd\"/>
    <add key="site2_outputpath" value="D:\websites\dd\output\"/>
  </appSettings>
</configuration>
Really a quick and dirty solution.
What we do is specify a number of sites, and for each site we set a name, a log file input path and an output path.
The input path is of course the location of the raw IIS log files, and the output path could be a directory on your public website, making the results easy to download.
Now the code. As I said, it's a quick and dirty solution, checking for site0_name through site99_name etc., but that's not the point.
We first loop through the output directory to see which files have already been created. Since we name the output files the same way IIS names the raw log files (ending with yyMMdd.log), it's easy to see which log files we have already processed.
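The "which files are already done" check can be isolated into a small helper. This is only a sketch of the idea, not the code used in this post: the class name ProcessedLogs and its two methods are made up for illustration, and it assumes output files are named <site>-googlebot-yyMMdd.log as described above. It matches on prefix and suffix rather than splitting on '-', so a site name that itself contains a hyphen would still work.

```csharp
using System;
using System.Globalization;

static class ProcessedLogs
{
    // Hypothetical helper: extracts the date from a name such as
    // "MySite-googlebot-070604.log". Returns null if the name does not fit.
    public static DateTime? ParseOutputDate(string fileName, string site)
    {
        string prefix = site + "-googlebot-";
        if (!fileName.StartsWith(prefix, StringComparison.Ordinal) ||
            !fileName.EndsWith(".log", StringComparison.OrdinalIgnoreCase))
            return null;

        // Whatever sits between the prefix and ".log" should be yyMMdd.
        string stamp = fileName.Substring(prefix.Length,
                                          fileName.Length - prefix.Length - 4);
        DateTime parsed;
        bool ok = DateTime.TryParseExact(stamp, "yyMMdd",
                                         CultureInfo.InvariantCulture,
                                         DateTimeStyles.None, out parsed);
        return ok ? parsed : (DateTime?)null;
    }

    // Returns the newest date found among the given file names, or null.
    public static DateTime? Latest(string[] fileNames, string site)
    {
        DateTime? latest = null;
        foreach (string name in fileNames)
        {
            DateTime? d = ParseOutputDate(name, site);
            if (d.HasValue && (!latest.HasValue || d.Value > latest.Value))
                latest = d;
        }
        return latest;
    }
}

static class Program
{
    static void Main()
    {
        // Example names only; in the real tool these would come from
        // System.IO.Directory.GetFiles on the output path.
        string[] names =
        {
            "MySite-googlebot-070601.log",
            "MySite-googlebot-070604.log",
            "notes.txt"
        };
        DateTime? latest = ProcessedLogs.Latest(names, "MySite");
        Console.WriteLine(latest.HasValue
            ? latest.Value.ToString("yyMMdd", CultureInfo.InvariantCulture)
            : "nothing processed yet");
    }
}
```

Files that don't follow the naming pattern are simply ignored, so other stuff lying around in the output directory doesn't break the run.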
static void Main(string[] args)
{
    // All sites...
    for (int i = 0; i < 100; i++)
    {
        // Skip any site number that is not fully configured in app.config
        string sSite = System.Configuration.ConfigurationManager.AppSettings["site" + i.ToString() + "_name"];
        if (sSite == null || sSite.Length == 0)
            continue;
        string sLogfilePath = System.Configuration.ConfigurationManager.AppSettings["site" + i.ToString() + "_logfilepath"];
        if (sLogfilePath == null || sLogfilePath.Length == 0)
            continue;
        string sOutputpath = System.Configuration.ConfigurationManager.AppSettings["site" + i.ToString() + "_outputpath"];
        if (sOutputpath == null || sOutputpath.Length == 0)
            continue;

        MSUtil.LogQueryClassClass oLog = new MSUtil.LogQueryClassClass();
        MSUtil.COMIISW3CInputContextClass oInputW3C = new MSUtil.COMIISW3CInputContextClass();
        MSUtil.COMW3COutputContextClass oOutputW3C = new MSUtil.COMW3COutputContextClass();

        // Where to start?
        // Find the latest date among the files already in the output directory
        string sLastRunDate = "";
        foreach (string s in System.IO.Directory.GetFiles(sOutputpath))
        {
            // Output files are named <site>-googlebot-yyMMdd.log
            string sName = System.IO.Path.GetFileName(s);
            string[] sParts = sName.Split('-');
            if (sParts.Length == 3)
            {
                if (sParts[0] == sSite && sParts[1] == "googlebot")
                {
                    // Check date...
                    if (sLastRunDate == "")
                        sLastRunDate = sParts[2];
                    else
                    {
                        DateTime dt11 = new DateTime(2000 + Convert.ToInt32(sLastRunDate.Substring(0, 2)),
                            Convert.ToInt32(sLastRunDate.Substring(2, 2)), Convert.ToInt32(sLastRunDate.Substring(4, 2)), 0, 0, 0);
                        DateTime dt2 = new DateTime(2000 + Convert.ToInt32(sParts[2].Substring(0, 2)),
                            Convert.ToInt32(sParts[2].Substring(2, 2)), Convert.ToInt32(sParts[2].Substring(4, 2)), 0, 0, 0);
                        if (dt2 > dt11)
                            sLastRunDate = sParts[2];
                    }
                }
            }
        }

        // Nothing processed yet: start 27 days back
        if (sLastRunDate == "")
            sLastRunDate = DateTime.Now.AddDays(-27).ToString("yyMMdd");
        DateTime dt1 = new DateTime(2000 + Convert.ToInt32(sLastRunDate.Substring(0, 2)),
            Convert.ToInt32(sLastRunDate.Substring(2, 2)), Convert.ToInt32(sLastRunDate.Substring(4, 2)), 0, 0, 0);

        // Walk forward one day at a time, one IIS log file per day
        while (dt1 < DateTime.Now.AddDays(-1).Date)
        {
            string sOneLogFile = sLogfilePath + "ex" + dt1.ToString("yyMMdd") + ".log";
            if (System.IO.File.Exists(sOneLogFile))
            {
                string sOneOutLogFile = sOutputpath + sSite + "-googlebot-" + dt1.ToString("yyMMdd") + ".log";
                string sQuery = "select * from " + sOneLogFile + " to \"" + sOneOutLogFile + "\" where cs(user-agent) like '%googlebot%'";
                oLog.ExecuteBatch(sQuery, oInputW3C, oOutputW3C);
            }
            dt1 = dt1.AddDays(1);
        }
    }
}
The line
oLog.ExecuteBatch(sQuery, oInputW3C, oOutputW3C);
does all the work. Read up on Log Parser to see how you can use it to filter out 404s etc. as well. A really useful tool, I must say.
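As a starting point for the 404 case, here is a rough sketch of how such a query string could be put together. It is an assumption on my part, not something taken from the tool above: the LogQueries class and NotFoundQuery method are made-up names, the file paths are placeholders, and the query uses the SELECT ... INTO ... FROM ordering from the Log Parser 2.2 documentation rather than the from ... to form used in the googlebot query. The sc-status and cs-uri-stem fields are standard IIS W3C log fields. I have not run this exact query, so treat it as something to adapt.

```csharp
using System;

static class LogQueries
{
    // Hypothetical helper: builds a query that counts 404 responses per URL,
    // most frequently missing URLs first.
    public static string NotFoundQuery(string inputLog, string outputLog)
    {
        return "select cs-uri-stem, count(*) as hits" +
               " into \"" + outputLog + "\"" +
               " from " + inputLog +
               " where sc-status = 404" +
               " group by cs-uri-stem" +
               " order by hits desc";
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder paths, for illustration only.
        string query = LogQueries.NotFoundQuery(@"D:\logs\ex070604.log",
                                                @"D:\output\404-070604.log");
        Console.WriteLine(query);
    }
}
```

The resulting string would be passed to ExecuteBatch together with an input and an output context, the same way as the googlebot query.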
Links: Microsoft Log Parser