Text classification is the task of assigning a class or category to a document or block of text. The canonical example is using a Naive Bayes classifier to separate spam from non-spam email. Classifiers can also be used for language identification, for categorizing news articles or blog posts, and for detecting trackback spam, comment spam, wiki spam, and more. In my talk, I will cover the basics of document classification, focusing on the tools available in Ruby for each aspect of classification.
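As a rough illustration of the spam example above, here is a minimal Naive Bayes sketch in plain Ruby. The class name NaiveBayesSketch, the naive tokenizer, the add-one smoothing, and the tiny training set are all assumptions made for illustration; none of them comes from a particular Ruby library.

```ruby
# Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
# Hypothetical illustration: NaiveBayesSketch is not a real library class.
class NaiveBayesSketch
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                            # category => number of training documents
    @vocabulary  = {}                                     # every word seen during training
  end

  # Lowercase the text and pull out runs of letters (a deliberately naive tokenizer).
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end

  # Record one labeled training document.
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # Log-probability score of each trained category for the given text.
  def scores(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    result = {}
    @doc_counts.each do |category, docs|
      total_words = @word_counts[category].values.sum
      score = Math.log(docs.to_f / total_docs) # log prior
      words.each do |word|
        count = @word_counts[category][word]
        # Add-one smoothing keeps unseen words from zeroing out the score.
        score += Math.log((count + 1.0) / (total_words + @vocabulary.size))
      end
      result[category] = score
    end
    result
  end

  # Pick the category with the highest score.
  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end
end

# Made-up training data, for illustration only.
classifier = NaiveBayesSketch.new
classifier.train(:spam, "cheap pills buy now")
classifier.train(:spam, "win money now")
classifier.train(:ham,  "meeting notes for the ruby talk")
classifier.train(:ham,  "lunch tomorrow with the team")

puts classifier.classify("buy cheap money now")        # prints "spam"
puts classifier.classify("notes for the team meeting") # prints "ham"
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together, which matters once documents are longer than a few words.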
Paul Dix is a computer science student at Columbia University in New York City. Before going back to school in 2005, Paul worked at McAfee as a developer. He has been attending the nyc.rb meetings since October 2005. Text classification is one part of Paul’s broader interest in natural language processing, machine learning, and information retrieval. Last summer, he worked as a consultant with EastMedia, developing web applications in Ruby on Rails. Paul also attended RailsConf last June and codes in Ruby every chance he gets.