2005
On Benchmarking Frequent Itemset
Mining Algorithms: from
Measurement to Analysis
Co-authors
- Ferenc Bodon (link)
- Lars Schmidt-Thieme (link)
Appeared in
1st Int'l Workshop on Open Source Data
Mining on Frequent Pattern Mining Implementations (OSDM2005) in conjunction with
The 11th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. (ACM SIGKDD 2005)
August 21-24, 2005
Chicago, IL, USA
Abstract
We point out problems of current practices in comparing Frequent
Itemset Mining Implementations, and suggest techniques that can help to
avoid the conclusions of measurements being tainted by these problems.
Download
full text pdf
slides ppt
BibTeX
@inproceedings{
racz05benchmarking,
author="B. R\'{a}cz and F. Bodon and L. Schmidt-Thieme",
title="On Benchmarking Frequent Itemset Mining Algorithms",
booktitle="Proceedings of the 1\textsuperscript{st}
{I}nternational {W}orkshop on {O}pen {S}ource {D}ata {M}ining, in
conjunction with {ACM} {SIGKDD}",
year="2005"}
Citing papers
Have you found a citing paper? Please drop me an e-mail!
Scaling Link-Based Similarity Search
Co-authors
Appeared in
The 14th
Int'l
World Wide Web Conference (WWW2005)
May 10-14, 2005
Chiba (Tokyo), Japan
Abstract
To exploit the similarity information hidden in the hyperlink
structure of the web, this paper introduces algorithms scalable to
graphs with billions of vertices on a distributed architecture.
The similarity of multi-step neighborhoods of vertices are numerically
evaluated by similarity functions including SimRank, a recursive
refinement of cocitation; PSimRank, a novel variant with better
theoretical characteristics; and the Jaccard coefficient, extended to
multi-step neighborhoods. Our methods are presented in a general
framework of Monte Carlo similarity search algorithms that precompute
an index database of random fingerprints, and at query time,
similarities are estimated from the fingerprints. The performance and
quality of the methods were tested on the Stanford Webbase graph of 80M
pages by comparing our scores to similarities extracted from the ODP
directory. Our experimental results suggest that the hyperlink
structure of vertices within four to five steps provide more adequate
information for similarity search than single-step neighborhoods.
Download
full text pdf
slides ppt
BibTeX
@inproceedings{ fogaras05scaling,
author="D. Fogaras and B. R\'{a}cz",
title="Scaling Link-Based Similarity Search",
booktitle="Proceedings of the 14\textsuperscript{th}
{I}nt'l {W}orld {W}ide {W}eb Conference",
year="2005"}
Citing papers
- Monika R. Henzinger. Hyperlink
analysis on the World Wide Web. In:
Proceedings of the 16th ACM conference on Hypertext and
Hypermedia, 2005.
- Srivatsa Iyengar. Entity reconciliation in SPIN.
M.Tech project report, Indian Institute of Technology Bombay, 2005.
Have you found a citing paper not listed here? Please drop me an e-mail!
Fifteen Minutes of Fame: The Dynamics
of Information Access on the Web
Co-authors
- Zoltán Dezső (link)
- Eivind Almaas (link)
- András Lukács (link)
- István Szakadát (link)
- Albert-László Barabási (link)
Appeared in
ArXiv physics/0505087, 2005.
Abstract
While current studies on complex networks focus on systems that
change relatively slowly in time, the structure of the most visited
regions of the Web is altered at the timescale from hours to days. Here
we investigate the dynamics of visitation of a major news portal,
representing the prototype for such a rapidly evolving network. The
nodes of the network can be classified into stable nodes, that form the
time independent skeleton of the portal, and news documents. The
visitation of the two node classes are markedly different, the skeleton
acquiring visits at a constant rate, while a news document’s visitation
peaking after a few hours. We find that the visitation pattern of a
news document decays as a power law, in contrast with the exponential
prediction provided by simple models of site visitation. This is rooted
in the inhomogeneous nature of the browsing pattern characterizing
individual users: the time interval between consecutive visits by the
same user to the site follows a power law distribution, in contrast
with the exponential expected for Poisson processes. We show that the
exponent characterizing the individual user’s browsing patterns
determines the power-law decay in a document’s visitation. Finally, our
results document the fleeting quality of news and events: while fifteen
minutes of fame is still an exaggeration in the online media, we find
that access to most news items significantly decays after 36 hours of
posting.
Download
full text pdf
Citing papers
- A. Vazquez. Exact results for the Barabasi model of human
dynamics. arXiv:physics/0506126, 2005.
Have you found a citing paper not listed here? Please drop me an e-mail!
2004
Towards Scaling Fully Personalized
PageRank
Co-authors
- Dániel Fogaras (link)
- Károly Csalogány (journal version only)
- Tamás Sarlós (journal version only)
Appeared in
3rd
Int'l Workshop on Algorithms and Models for the Web-Graph (WAW2004),
in conjunction with
The 45th Annual IEEE Symposium on Foundations of Computer
Science (FOCS2004)
October 17-19, 2004
Rome, Italy
Journal of Internet Mathematics (extended version to appear)
Abstract
Personalized PageRank expresses link-based page quality around
user-selected pages in a similar way as PageRank expresses quality over
the entire Web. Existing personalized PageRank algorithms can
however serve on-line queries only for a restricted choice of pages. In
this paper we achieve full personalization by a novel algorithm that
precomputes a compact database, and using this database it can serve
on-line responses to arbitrary user selected personalization. The
algorithm uses simulated random walks; we prove that for a fixed error
probability, the size of our database is linear in the number of web
pages. We justify our estimation approach by asymptotic worst-case
lower bounds: we show that on some sets of graphs exact personalized
PageRank values can only be obtained from a database of size quadratic
in the number of vertices. Furthermore, we evaluate the precision of
approximation experimentally on the Stanford WebBase graph.
Download
full text (conference version) pdf
slides ppt
full text (journal version) pdf
BibTeX
@inproceedings{ fogaras04ppr,
author="D. Fogaras and B. R\'{a}cz",
title="Towards Fully Personalizing {P}age{R}ank",
year=2004,
booktitle="Proceedings of the 3\textsuperscript{rd}
Workshop on Algorithms and Models for the Web-Graph (WAW2004), in
conjunction with FOCS 2004."}
Citing papers
- Andras A. Benczúr, Károly Csalogány,
Tamás Sarlós and Máté Uher. SpamRank -
Fully Automatic Link Spam Detection. In: First International
Workshop
on Adversarial Information Retrieval on the Web, WWW2005, 2005.
- Andras A. Benczúr, Károly Csalogány and
Tamás Sarlós. On the feasibility of low-rank
approximation for personalized PageRank. In: The 14th
Int'l
World Wide Web Conference (WWW2005, poster), 2005.
- K.Avrachenkov, N. Litvak, D. Nemirovsky and N. Osipova. Monte
Carlo methods in PageRank computation: When one iteration is
sufficient. In: 5th IMACS Seminar on Monte Carlo
Methods,
2005.; Proceedings of the ACM Sigmetrics 2005 conference, 2005.
Have you found a citing paper not listed here? Please drop me an e-mail!
nonordfp: An FP-growth variation
without rebuilding the FP-tree
Co-authors
none
Appeared in
2nd
Int'l Workshop on Frequent Itemset Mining Implementations (FIMI2004), in
conjunction with
The 4th IEEE International Conference on Data Mining (ICDM2004)
November 01-04, 2004
Brighton, UK
CEUR Workshop Proceedings 126
Abstract
We describe a frequent itemset mining algorithm and
implementation based on the well-known algorithm FP-growth. The
theoretical difference is the main data structure (tree), which is more
compact and which we do not need to rebuild for each conditional step.
We thoroughly deal with implementation issues, data structures, memory
layout, I/O and library functions we use to achieve comparable
performance as the best implementations of the 1st Frequent
Itemset Mining Implementations (FIMI) Workshop.
Download
full text pdf
slides ppt
implementation tgz. Note that this
source code differs from the original published version in the
elimination of a conflict with the memusage program, which caused
nonordfp to fail on many datasets in the evaluation performed by the
FIMI04 organisers.
BibTeX
@inproceedings{racz04nonordfp,
author = {Bal{\'a}zs R{\'a}cz},
title = {nonordfp: An {FP}-{G}rowth
variation without rebuilding the
{FP}-tree},
booktitle = {FIMI '04, Proceedings of the IEEE ICDM Workshop on
Frequent
Itemset Mining Implementations},
year = {2004},
ee =
{http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-126/racz.pdf}
}
Citing papers
- F. Bodon and L. Schmidt-Thieme. The Relation of Closed
Itemset
Mining, Complete Pruning Strategies and Item Ordering in Apriori-based
FIM algorithms. In: 9th European Conference on
Principles
and Practice of Knowledge Discovery in Databases (PKDD), 2005.
- F. Bodon. A Trie-based APRIORI Implementation for Mining
Frequent
Item Sequences. In: 1st Int'l Workshop on Open Source
Data
Mining (in conjunction with SIGKDD2005), 2005.
- T. Uno, M. Kiyomi, H. Arimura. LCM ver 3.: Collaboration of
Array, Bitmap and Prefix Tree for Frequent Itemset Mining. In: 1st
Int'l Workshop on Open Source Data Mining (in conjunction with
SIGKDD2005), 2005.
Have you found a citing paper not listed here? Please drop me an e-mail!
A Scalable Randomized Method to
Compute Link-Based Similarity Rank on the Web Graph
Note: you may want to read our WWW2005
paper instead
It supersedes this paper in most aspects.
Co-authors
Appeared in
Clustering Information over the Web (ClustWeb),
Int'l Workshop in conjunction with the
9th Int'l Conference on Extending Database Technology (EDBT 2004)
March 14-18, 2004
Heraklion - Crete, Greece
Abstract
Several iterative hyperlink-based similarity measures were
published to express the similarity of web pages. However, it usually
seems hopeless to evaluate complex similarity functions over large
repositories containing hundreds of millions of pages. We introduce
scalable algorithms computing SimRank scores, which express the
contextual similarities of pages based on the hyperlink structure. The
proposed methods scale well to large repositories, fulfilling strict
requirements about computational complexity. The algorithms were tested
on a set of ten million pages, but parallelization techniques make it
possible to compute the SimRank scores even for the entire web with
over 4 billion pages. The key idea is that randomized Monte Carlo
methods combined with indexing techniques yield a scalable
approximation of SimRank.
Download
full text pdf
slides ppt
BibTeX
@inproceedings{ fogaras04scalable,
author="D. Fogaras and B. Rácz",
title="A Scalable Randomized Method to Compute Link-Based
Similarity Rank on the Web Graph",
booktitle="Proceedings of the {C}lustering {I}nformation
over the {W}eb workshop, {C}onference on {E}xtending {D}atabase
{T}echnology, LNCS vol. 3268",
year="2004"}
Citing papers
Have you found a citing paper? Please drop me an e-mail!