2005

On Benchmarking Frequent Itemset Mining Algorithms: from Measurement to Analysis

Co-authors

Appeared in

1st Int'l Workshop on Open Source Data Mining on Frequent Pattern Mining Implementations (OSDM2005) in conjunction with
The 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (ACM SIGKDD 2005)
August 21-24, 2005
Chicago, IL, USA

Abstract

We point out problems of current practices in comparing Frequent Itemset Mining Implementations, and suggest techniques that can help to avoid the conclusions of measurements being tainted by these problems.

Download

full text pdf
slides ppt

BibTeX

@inproceedings{ racz05benchmarking,
   author="B. R\'{a}cz and F. Bodon and L. Schmidt-Thieme",
   title="On Benchmarking Frequent Itemset Mining Algorithms",
   booktitle="Proceedings of the 1\textsuperscript{st} {I}nternational {W}orkshop on {O}pen {S}ource {D}ata {M}ining, in conjunction with {ACM} {SIGKDD}",
   year="2005"}

Citing papers 

Have you found a citing paper? Please drop me an e-mail!

Scaling Link-Based Similarity Search

Co-authors

Appeared in

The 14th Int'l World Wide Web Conference (WWW2005)
May 10-14, 2005
Chiba (Tokyo), Japan

Abstract

To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture.  The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank, a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase graph of 80M pages by comparing our scores to similarities extracted from the ODP directory. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.

Download

full text pdf
slides ppt

BibTeX

@inproceedings{ fogaras05scaling,
   author="D. Fogaras and B. R\'{a}cz",
   title="Scaling Link-Based Similarity Search",
   booktitle="Proceedings of the 14\textsuperscript{th} {I}nt'l {W}orld {W}ide {W}eb Conference",
   year="2005"}

Citing papers

Have you found a citing paper not listed here? Please drop me an e-mail!

Fifteen Minutes of Fame: The Dynamics of Information Access on the Web

Co-authors

Appeared in

 ArXiv physics/0505087, 2005.

Abstract

While current studies on complex networks focus on systems that change relatively slowly in time, the structure of the most visited regions of the Web is altered at the timescale from hours to days. Here we investigate the dynamics of visitation of a major news portal, representing the prototype for such a rapidly evolving network. The nodes of the network can be classified into stable nodes, that form the time independent skeleton of the portal, and news documents. The visitation of the two node classes are markedly different, the skeleton acquiring visits at a constant rate, while a news document’s visitation peaking after a few hours. We find that the visitation pattern of a news document decays as a power law, in contrast with the exponential prediction provided by simple models of site visitation. This is rooted in the inhomogeneous nature of the browsing pattern characterizing individual users: the time interval between consecutive visits by the same user to the site follows a power law distribution, in contrast with the exponential expected for Poisson processes. We show that the exponent characterizing the individual user’s browsing patterns determines the power-law decay in a document’s visitation. Finally, our results document the fleeting quality of news and events: while fifteen minutes of fame is still an exaggeration in the online media, we find that access to most news items significantly decays after 36 hours of posting.

Download

full text pdf

Citing papers

Have you found a citing paper not listed here? Please drop me an e-mail!

2004

Towards Scaling Fully Personalized PageRank

Co-authors

Appeared in

3rd Int'l Workshop on Algorithms and Models for the Web-Graph  (WAW2004), in conjunction with
The 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS2004)
October 17-19, 2004
Rome, Italy

Journal of Internet Mathematics (extended version to appear)

Abstract

Personalized PageRank expresses link-based page quality around user-selected pages in a similar way as PageRank expresses quality over the entire Web.  Existing personalized PageRank algorithms can however serve on-line queries only for a restricted choice of pages. In this paper we achieve full personalization by a novel algorithm that precomputes a compact database, and using this database it can serve on-line responses to arbitrary user selected personalization. The algorithm uses simulated random walks; we prove that for a fixed error probability, the size of our database is linear in the number of web pages. We justify our estimation approach by asymptotic worst-case lower bounds: we show that on some sets of graphs exact personalized PageRank values can only be obtained from a database of size quadratic in the number of vertices. Furthermore, we evaluate the precision of approximation experimentally on the Stanford WebBase graph.

Download

full text (conference version) pdf
slides ppt
full text (journal version) pdf

BibTeX

@inproceedings{ fogaras04ppr,
   author="D. Fogaras and B. R\'{a}cz",
   title="Towards Fully Personalizing {P}age{R}ank",
   year=2004,
   booktitle="Proceedings of the 3\textsuperscript{rd} Workshop on Algorithms and Models for the Web-Graph (WAW2004), in conjunction with FOCS 2004."}

Citing papers

Have you found a citing paper not listed here? Please drop me an e-mail!

nonordfp: An FP-growth variation without rebuilding the FP-tree

Co-authors

none

Appeared in

2nd Int'l Workshop on Frequent Itemset Mining Implementations (FIMI2004), in conjunction with
The 4th IEEE International Conference on Data Mining (ICDM2004)
November 01-04, 2004
Brighton, UK

CEUR Workshop Proceedings 126

Abstract

  We describe a frequent itemset mining algorithm and implementation based on the well-known algorithm FP-growth. The theoretical difference is the main data structure (tree), which is more compact and which we do not need to rebuild for each conditional step. We thoroughly deal with implementation issues, data structures, memory layout, I/O and library functions we use to achieve comparable performance as the best implementations of the 1st Frequent Itemset Mining Implementations (FIMI) Workshop.

Download

full text pdf
slides ppt
implementation tgz. Note that this source code differs from the original published version in the elimination of a conflict with the memusage program, which caused nonordfp to fail on many datasets in the evaluation performed by the FIMI04 organisers.

BibTeX

@inproceedings{racz04nonordfp,
  author    = {Bal{\'a}zs R{\'a}cz},
  title     = {nonordfp: An {FP}-{G}rowth variation without rebuilding the
               {FP}-tree},
  booktitle = {FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent
               Itemset Mining Implementations},
  year      = {2004},
  ee        = {http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-126/racz.pdf}
}

Citing papers

Have you found a citing paper not listed here? Please drop me an e-mail!

A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph

Note: you may want to read our WWW2005 paper instead

It supersedes this paper in most aspects.

Co-authors

Appeared in

Clustering Information over the Web (ClustWeb), Int'l Workshop in conjunction with the
9th Int'l Conference on Extending Database Technology (EDBT 2004)
March 14-18, 2004
Heraklion - Crete, Greece

Abstract

Several iterative hyperlink-based similarity measures were published to express the similarity of web pages. However, it usually seems hopeless to evaluate complex similarity functions over large repositories containing hundreds of millions of pages. We introduce scalable algorithms computing SimRank scores, which express the contextual similarities of pages based on the hyperlink structure. The proposed methods scale well to large repositories, fulfilling strict requirements about computational complexity. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute the SimRank scores even for the entire web with over 4 billion pages. The key idea is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.

Download

full text pdf
slides ppt

BibTeX

@inproceedings{ fogaras04scalable,
   author="D. Fogaras and B. Rácz",
   title="A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph",
   booktitle="Proceedings of the {C}lustering {I}nformation over the {W}eb workshop, {C}onference on {E}xtending {D}atabase {T}echnology, LNCS vol. 3268",
   year="2004"}

Citing papers 

Have you found a citing paper? Please drop me an e-mail!