- M. Kurucz, I. Bíró, D. Siklósi, P. Csizsek, Z. Fekete, R. Iwatt, T.
Kiss, A. Szabó. KDD Cup 2009@Budapest: feature partitioning and boosting.
In Journal of Machine Learning Research special issue on KDD Cup
2009, 2009.
We describe the method used in our final submission to KDD Cup 2009 as well as a selection of promising directions that are generally believed to work well but did not meet our expectations. Our final method consists of a combination of a LogitBoost and an ADTree classifier with a feature selection method that, as shaped by the experiments we have conducted, has turned out to be very different from those described in some well-cited surveys. Some methods that failed include distance, information and dependence measures for feature selection as well as combination of classifiers over a partitioned feature set. As another main lesson learned, alternating decision trees and LogitBoost outperformed most classifiers for most feature subsets of the KDD Cup 2009 data.
- István Bíró and Jácint Szabó. Latent Dirichlet allocation for
automatic document categorization. In Proceedings of the 19th European
Conference on Machine Learning and 12th Principles of Knowledge Discovery
in Databases, 2009.
In this paper we introduce and evaluate a technique for applying latent Dirichlet allocation to supervised semantic categorization of documents. In our setup, every category is assigned its own collection of topics, and for a labeled training document only topics from its category are sampled. Thus, compared to the classical LDA that processes the entire corpus at once, we essentially build separate LDA models for each category with the category-specific topics, and then these topic collections are put together to form a unified LDA model. For an unseen document the inferred topic distribution gives an estimate of how well the document fits into each category. We use this method for Web document classification. Our key results are a 46% decrease in the 1-AUC value of classification over tf.idf with SVM and 43% over the plain LDA baseline with SVM. Using a careful vocabulary selection method and a heuristic which handles the effect that similar topics may arise in distinct categories, the improvement is 83% over tf.idf with SVM and 82% over LDA with SVM in 1-AUC.
- István Bíró, Jácint Szabó, András A. Benczúr, and Dávid Siklósi.
Linked Latent Dirichlet Allocation in Web Spam Filtering. In
Proceedings of the 5th international workshop on Adversarial
Information Retrieval on the Web, 2009.
Latent Dirichlet allocation (LDA) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply an extension of LDA, the recently introduced linked LDA technique, which also takes linkage into account, to web spam classification. In this setup, topics propagate along links in such a way that the linked document directly influences the words in the linking document. The inferred LDA model can be applied for classification as dimensionality reduction, similarly to latent semantic indexing. We test linked LDA on the UK2007-WEBSPAM corpus. Using a BayesNet classifier, in terms of the AUC of classification, we achieve a 3% improvement over plain LDA with BayesNet, and 8% over the public link features with C4.5. The addition of this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC.
- István Bíró, Jácint Szabó, and András A. Benczúr.
Latent Dirichlet allocation in web spam filtering.
In Proceedings of the 4th international workshop on Adversarial
information retrieval on the web, 2008.
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique for web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus, and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.
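The test-phase decision described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the topic distribution of an unseen site has already been inferred over the union of spam and non-spam topics, and the function names, topic indices, and the 0.5 threshold are hypothetical.

```python
from typing import Iterable, Sequence


def spam_topic_mass(topic_probs: Sequence[float], spam_topics: Iterable[int]) -> float:
    """Sum the inferred probability of the topics learned from the spam corpus."""
    return sum(topic_probs[k] for k in spam_topics)


def is_spam(topic_probs: Sequence[float], spam_topics: Iterable[int],
            threshold: float = 0.5) -> bool:
    """Deem a site spam if its total spam topic probability exceeds the threshold.

    The threshold value is an assumption; the abstract does not state one.
    """
    return spam_topic_mass(topic_probs, spam_topics) > threshold


# Hypothetical example: topics 0-1 come from the non-spam model, 2-3 from the spam model.
site_topics = [0.05, 0.15, 0.30, 0.50]
print(is_spam(site_topics, spam_topics=[2, 3]))
```

In practice the threshold would be tuned on held-out labeled sites rather than fixed in advance.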
- István Bíró, Jácint Szabó, András A. Benczúr, and Ana Gabriela Maguitman.
A comparative analysis of latent variable models for web page
classification.
In Ricardo A. Baeza-Yates, Wagner Meira Jr., and Luis Antonio Olsina
Santos, editors, LA-WEB, pages 23-28. IEEE Computer Society, 2008.
A main challenge for Web content classification is how to model the input data. This paper discusses the application of two text modeling approaches, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), in the Web page classification task. We report results on a comparison of these two approaches using different vocabularies consisting of links and text. Both models are evaluated using different numbers of latent topics. Finally, we evaluate a hybrid latent variable model that combines the latent topics resulting from both LSA and LDA. This new approach turns out to be superior to the basic LSA and LDA models. In our experiments with categories and pages obtained from the ODP web directory the hybrid model achieves an averaged F-measure value of 0.852 and an averaged ROC value of 0.96.
- András A. Benczúr, István Bíró, Mátyás Brendel, Károly Csalogány, Bálint
Daróczy, and Dávid Siklósi.
Multimodal retrieval by text-segment biclustering.
Lecture Notes in Computer Science, 5152, 2008.
We describe our approach to the ImageCLEFphoto 2007 task. The novelty of our method consists of biclustering image segments and annotation words. Given the query words, it is possible to select the image segment clusters that have strongest cooccurrence with the corresponding word clusters. These image segment clusters act as the selected segments relevant to a query. We rank text hits by our own tf.idf-based information retrieval system and image similarities by using a 20-dimensional vector describing the visual content of an image segment. Relevant image segments were selected by the biclustering procedure. Images were segmented by graph-based segmentation. We used neither query expansion nor relevance feedback; queries were generated automatically from the title and the description words. The latter were weighted by 0.1.
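The query generation step mentioned at the end of the abstract above can be illustrated with a small sketch. It assumes whitespace tokenization and additive term weights, neither of which is stated in the abstract; only the 0.1 weight for description words comes from the text, and `build_query` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict


def build_query(title: str, description: str,
                description_weight: float = 0.1) -> Dict[str, float]:
    """Build a weighted bag-of-words query from a topic's title and description.

    Title words get weight 1.0 and description words get `description_weight`.
    Lowercasing and whitespace splitting are simplifying assumptions.
    """
    weights: Dict[str, float] = defaultdict(float)
    for term in title.lower().split():
        weights[term] += 1.0
    for term in description.lower().split():
        weights[term] += description_weight
    return dict(weights)


# Hypothetical topic, not taken from the ImageCLEF data.
query = build_query("church tower", "photos of a church tower at night")
for term, weight in sorted(query.items()):
    print(f"{term}: {weight:.1f}")
```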
- Péter Schönhofen, István Bíró, András A. Benczúr, and Károly Csalogány.
Cross-language retrieval with wikipedia.
Lecture Notes in Computer Science, 5152, 2008.
We demonstrate a twofold use of Wikipedia for cross-lingual information retrieval. As our main contribution, we exploit Wikipedia hyperlinkage for query term disambiguation. We also use bilingual Wikipedia articles for dictionary extension. Our method is based on translation disambiguation; we combine the Wikipedia based technique with a method based on bigram statistics of pairs formed by translations of different source language terms.
- Dávid Siklósi, András A. Benczúr, Zsolt Fekete, Miklós Kurucz, István Bíró,
Attila Pereszlényi, Simon Rácz, Adrienn Szabó, and Jácint Szabó.
Web spam hunting @ budapest.
In Proceedings of the 4th international workshop on Adversarial
information retrieval on the web, 2008.
We use a combination, in the expected order of their strength, of the following classifiers: SVM over tf.idf, an augmented set of the public statistical spam features, graph stacking and text classification by latent Dirichlet allocation and compression, the latter two only used in our second submission.
- A. Benczúr, D. Siklósi, J. Szabó, I. Bíró, Z. Fekete, M. Kurucz, A. Pereszlényi, S. Rácz, and
A. Szabó.
Web Spam: A Survey with Vision for The Archivist.
In 8th International Web archiving workshop, 2008.
While Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, so far Web spam filtering technologies are rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the state of the art in Web spam filtering illustrated by the recent Web spam challenge data sets and techniques and describe the filtering solution for archives envisioned in the LiWA—Living Web Archives project.
- András A. Benczúr, István Bíró, Károly Csalogány, and Tamás Sarlós.
Web spam detection via commercial intent analysis.
In AIRWeb '07: Proceedings of the 3rd international workshop on
Adversarial information retrieval on the web, pages 89-92. ACM Press,
New York, NY, USA, 2007.
We propose a number of features for Web spam filtering based on the occurrence of keywords that are either of high advertisement value or highly spammed. Our features include popular words from search engine query logs as well as high cost or volume words according to Google AdWords. We also demonstrate the spam filtering power of the Online Commercial Intention (OCI) value assigned to a URL in a Microsoft adCenter Labs Demonstration and the Yahoo! Mindset classification of Web pages as either commercial or non-commercial, as well as metrics based on the occurrence of Google ads on the page. We run our tests on the WEBSPAM-UK2006 dataset recently compiled by Castillo et al. as a standard means of measuring the performance of Web spam detection algorithms. Our features improve the classification accuracy of the publicly available WEBSPAM-UK2006 features by 3%.
- András A. Benczúr, István Bíró, Mátyás Brendel, Károly Csalogány, Bálint
Daróczy, and Dávid Siklósi.
Cross-modal retrieval by text and image feature biclustering.
In Working Notes for the CLEF 2007 Workshop, 2007.
We describe our approach to the ImageCLEF-Photo 2007 task. The novelty of our method consists of biclustering image segments and annotation words. Given the query words, we may select the image segment clusters that have strongest cooccurrence with the corresponding word clusters. These image segment clusters act as the selected segments relevant to a query. We rank text hits by our own tf.idf based information retrieval system and image similarities by using a 20-dimensional vector describing the visual content of image segments. Here relevant image segments were selected by the biclustering procedure. Images were segmented by a home developed segmenter. We used neither query expansion nor relevance feedback; queries were generated automatically from the title and the 0.1 weighted description words.
- István Bíró, Csaba Szepesvári, and Zoltán Szamonek.
Sequence prediction exploiting similarity information.
In Manuela M. Veloso, editor, IJCAI 2007: Proceedings of the
20th International Joint Conference on Artificial Intelligence, Hyderabad,
India, pages 1576-1581, 2007.
When data is scarce or the alphabet is large, smoothing the probability estimates becomes inescapable when estimating n-gram models. In this paper we propose a method that implements a form of smoothing by exploiting similarity information of the alphabet elements. The idea is to view the log-conditional probability function as a smooth function defined over the similarity graph. The algorithm that we propose uses the eigenvectors of the similarity graph as the basis of the expansion of the log-conditional probability function, whose coefficients are found by solving a regularized logistic regression problem. The experimental results demonstrate the superiority of the method when the similarity graph contains relevant information, whilst the method still remains competitive with state-of-the-art smoothing methods even in the absence of such information.
- Péter Schönhofen, István Bíró, András A. Benczúr, and Károly Csalogány.
Performing cross language retrieval with wikipedia.
In Working Notes for the CLEF 2007 Workshop, 2007.
We describe a method which is able to translate queries extended by narrative information from one language to another, with the help of an appropriate machine readable dictionary and the Wikipedia on-line encyclopedia. Processing occurs in three steps: first, we look up possible translations phrase by phrase using both the dictionary and the cross-lingual links provided by Wikipedia; second, improbable translations, detected by a simple language model computed over a large corpus of documents written in the target language, are eliminated; and finally, further clustering is applied by matching Wikipedia concepts against the query narrative and removing translations not related to the overall query topic. Experiments performed on the Los Angeles Times 2002 corpus, translating from Hungarian to English, showed that while queries generated at the end of the second step were roughly only half as effective as original queries, primarily due to the limitations of our tools, after the third step precision improved significantly, reaching 60% of the native English level.
- Zoltán Szamonek and István Bíró.
Similarity based smoothing in language modeling.
Acta Cybernetica, 18(2):303-314, 2007.
In this paper, we improve our previously proposed Similarity Based Smoothing (SBS) algorithm. The idea of SBS is to map words or parts of sentences to a Euclidean space and approximate the language model in that space. The bottleneck of the original algorithm was training a regularized logistic regression model, which was incapable of dealing with real world data. We replace the logistic regression by regularized maximum entropy estimation and a Gaussian mixture approach to model the language in the Euclidean space, and show other possibilities for using the main idea of SBS. We show that the regularized maximum entropy model is flexible enough to handle conditional probability density estimation and thus enables parallel computation tasks with significantly fewer iteration steps. The experimental results demonstrate the success of our method, and we achieve a 14% improvement on a real world corpus.
- András A. Benczúr, István Bíró, Károly Csalogány, Balázs Rácz, Tamás Sarlós,
and Máté Uher.
Pagerank és azon túl: Hiperhivatkozások szerepe a keresésben
(PageRank and beyond: The role of hyperlinks in search, in Hungarian).
Magyar Tudomány, (11):1325-1331, November 2006.
Hyperlinks are key to the usability of the data found on the World Wide Web and to the spread of the "Web". Links are what turn this document collection, extraordinary in size and covering almost every existing topic, into a network. The subject of our study is the analysis of link structure: we show how links can be used to rank the quality of web pages, even personalized to the needs of individual users; how to filter manipulative pages created to deceive search engines; and how to find pages similar to a given page. We also address how hard these problems are and what special algorithmic techniques their solution requires on a World Wide Web that already consists of several billion pages.
- András A. Benczúr, István Bíró, Károly Csalogány, and Máté Uher.
Detecting nepotistic links by language model disagreement.
In Les Carr, David De Roure, Arun Iyengar, Carole A. Goble, and
Michael Dahlin, editors, WWW '06: Proceedings of the 15th international
conference on World Wide Web, pages 939-940. ACM Press, New York, NY,
USA, 2006.
In this short note we demonstrate the applicability of hyperlink downweighting by means of language model disagreement. The method filters out hyperlinks with no relevance to the target page without the need of white and blacklists or human interaction. We fight various forms of nepotism such as common maintainers, ads and link exchanges or misused affiliate programs. Our method is tested on a 31 M page crawl of the .de domain with a manually classified 1000-page random sample.
- István Bíró, Zoltán Szamonek, and Csaba Szepesvári.
Simítás hasonlósági információ felhasználásával
(Smoothing using similarity information, in Hungarian).
In Proceedings of Association for Hungarian Computational
Linguistics, 2006.
In this paper we examine how information about the relationships between words can be exploited to improve the quality of language models. In principle it is clear that, by exploiting the distributional similarity of words, better models can be built from the same amount of data. However, since the information on distributional similarity is imperfect, it is questionable whether a method building on word similarities can work despite the resulting error. The main result of the paper is the SBS (similarity based smoothing) algorithm, which is able to exploit similarity information about words when that information is sufficiently accurate, while if the information is inaccurate, the additional loss of the algorithm is negligible.
- Eszter P. Windhager, Libertad Tansini, István Bíró, and Devdatt Dubhashi.
Iterative algorithms for collaborative filtering with mixture models.
In Proceedings of International Workshop on Intelligent
Information Access (IIIA), 2006.
We present a suite of four algorithms for collaborative filtering in the context of probabilistic mixture models proposed by Hofmann and Puzicha. All four algorithms are based on an initial soft clustering followed by different iterative schemes inspired by Kleinberg's HITS algorithm. The suite is tested on data generated according to various sub-classes of mixture models ranging from disjoint to fully mixed. The results are shown to be of quality comparable to or better than recent benchmark studies. We also tested the algorithms on real data collected from the Hungarian web, and the results show performance comparable to that shown on the test data.