Tuesday, February 19, 2008

Citation counts

When deciding which method to use, or which areas to focus development on, the popularity of related references matters. For example, if method A is used 20 times more often than method B to answer a given question, then in the absence of other information a naive user should probably use method A. Counting total citations alone could be misleading, though: a good new method that has been available for only a year will take time to accumulate as many citations as a poorer, older method, even if it is acquiring citations at a faster rate. Thus, both number and rate matter. I'm thinking of showing both the total number of citations and the rate of gain over a year or so. Another way of displaying related information would be to compare a paper's citation count with the median count for papers published in the same year (perhaps limiting the reference set to papers similar in scope to the paper of interest).

To do this I need citation data, which are not generally available (see earlier post). As a proxy, I'm using the number of hits in Yahoo, obtained through its search API (which gives slightly different numbers than its HTML search form), for a query containing the article's title as a phrase, the last name of the first author, and the publication year. I record counts both for all pages and for PDF-formatted pages only. The tricky parts of getting this to work were converting apostrophes to HTML characters and making sure the title is searched as a phrase rather than as a string of separate words.

This approach has a few disadvantages. Web hits probably correlate with how many times a paper is cited (early work suggests this is roughly true), but the correlation is not strong. Hit counts are also easier to manipulate: I could add my papers' titles, authors, and years as signatures to all my posts and then start posting on various forums, and TreeTapper would see my papers as very popular. On the other hand, web hits can pick up interesting new papers sooner than citation counts can, since citing papers take time to appear.
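The query construction described above (title as a quoted phrase, plus first author's last name and year) can be sketched roughly as follows. This is only an illustration, not TreeTapper's actual code: the function names `build_query` and `encode_query` and the example title are made up, no search API is called, and the sketch uses standard URL percent-encoding for the apostrophe rather than the HTML character conversion mentioned in the post.

```python
from urllib.parse import quote_plus


def build_query(title: str, last_name: str, year: int) -> str:
    """Build a search string with the title as an exact phrase.

    Hypothetical helper: double quotes inside the title are dropped so
    they cannot end the phrase early.
    """
    phrase = title.replace('"', "").strip()
    # Wrapping the title in double quotes asks for a phrase match
    # instead of a match on the individual words.
    return f'"{phrase}" {last_name} {year}'


def encode_query(query: str) -> str:
    """Percent-encode the query so apostrophes and quotes survive in a URL."""
    return quote_plus(query)


# Made-up paper, used only to show the apostrophe and phrase handling.
query = build_query("A hypothetical method's title", "Smith", 2007)
encoded = encode_query(query)
print(query)
print(encoded)
```

The encoded string would then be passed as the query value to whatever search service is in use; the parameter names and any PDF-only filter depend on that service and are left out here.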
Perhaps with the upcoming release of the new CiteSeerX, I can use that system to track citations better. I've set up TreeTapper to store the number of hits for each paper every two weeks; this will let me track how the popularity of articles changes through time, and so recover the rate of gain rather than just the number of hits. Because this recording is new, for now I'm showing only total hits, until data have been recorded over more time intervals.
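Once a few of these two-week snapshots exist, a rate could be recovered from them along the following lines. This is a minimal sketch under assumptions: the `(date, hits)` tuple layout, the function name `hits_per_year`, and the example numbers are all hypothetical, and the rate is taken simply from the first and last snapshots rather than from a fitted trend.

```python
from datetime import date


def hits_per_year(snapshots):
    """Estimate the yearly rate of gain in hits from dated snapshots.

    snapshots: list of (date, hits) tuples, in any order.
    Returns None when there is too little data to estimate a rate.
    """
    ordered = sorted(snapshots)  # sort by date
    if len(ordered) < 2:
        return None
    (first_day, first_hits), (last_day, last_hits) = ordered[0], ordered[-1]
    days = (last_day - first_day).days
    if days == 0:
        return None
    # Scale the gain per day up to a year.
    return (last_hits - first_hits) / days * 365.25


# Made-up counts for one paper, recorded two weeks apart.
example = [(date(2008, 2, 19), 100), (date(2008, 3, 4), 107)]
print(hits_per_year(example))  # 7 hits in 14 days -> 182.625 per year
```

With only two snapshots the estimate is very noisy, which is one reason to wait for more intervals before displaying a rate.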
