The Enron Corpus: A New Dataset for Email Classification Research
- Bryan Klimt,
- Yiming Yang
Abstract
Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
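The per-section setup described in the abstract can be illustrated with a minimal sketch. It assumes a message is split into the four sections named above (From, To, Subject, body) using Python's standard `email` module, and that each section has already been scored by some classifier. The helper names `split_sections` and `combine_scores`, the sample message, and all score and weight values are hypothetical; the sketch does not train an SVM or fit regression weights and is not the authors' implementation.

```python
from email import message_from_string

# The four sections named in the abstract.
FIELDS = ("from", "to", "subject", "body")


def split_sections(raw):
    """Split a raw RFC 822 style message into the four sections."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    # For multipart messages get_payload() returns a list of parts;
    # this sketch only handles the plain single-part case.
    body = payload if isinstance(payload, str) else ""
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "body": body.strip(),
    }


def combine_scores(section_scores, weights):
    """Weighted linear combination of per-section classifier scores.

    Both arguments map each name in FIELDS to a float. In practice the
    weights would be fitted by regression; here they are supplied by hand.
    """
    return sum(weights[f] * section_scores[f] for f in FIELDS)


# Hypothetical sample message, not taken from the Enron corpus.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 forecast\n"
    "\n"
    "Please review the attached numbers.\n"
)
sections = split_sections(raw)

# Made-up scores for one candidate folder and made-up weights.
scores = {"from": 1.0, "to": 0.0, "subject": 2.0, "body": -1.0}
weights = {"from": 0.5, "to": 0.25, "subject": 1.0, "body": 2.0}
print(sections["subject"], combine_scores(scores, weights))
```

Using a single section alone corresponds to setting that section's weight to 1 and the others to 0, which makes the single-section and combined conditions directly comparable.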



- Title: The Enron Corpus: A New Dataset for Email Classification Research
- Book Title: Machine Learning: ECML 2004
- Book Subtitle: 15th European Conference on Machine Learning, Pisa, Italy, September 20–24, 2004. Proceedings
- Pages: pp. 217–226
- Copyright: 2004
- DOI: 10.1007/978-3-540-30115-8_22
- Print ISBN: 978-3-540-23105-9
- Online ISBN: 978-3-540-30115-8
- Series Title: Lecture Notes in Computer Science
- Series Volume: 3201
- Series ISSN: 0302-9743
- Publisher: Springer Berlin Heidelberg
- Copyright Holder: Springer-Verlag Berlin Heidelberg
- Editors: Jean-François Boulicaut (18), Floriana Esposito (19), Fosca Giannotti (20), Dino Pedreschi (21)
- Editor Affiliations
- 18. INSA-Lyon, LIRIS CNRS UMR5205
- 19. Dipartimento di Informatica, Università degli Studi di Bari
- 20. Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa
- 21. Dipartimento di Informatica
- Authors: Bryan Klimt (22), Yiming Yang (22)
- Author Affiliations
- 22. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, 15213-8213, USA