Saturday, October 16, 2010

Programming Collective Intelligence

Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran

My rating: 4 of 5 stars


Programming Collective Intelligence (Segaran, 2007) uses a multitude of examples to show how data can be combined and analyzed to produce results that are “more human.” The book intersperses text with Python snippets, so a reader can work through every example it discusses. Some of the more advanced examples require additional library downloads, but everything in the book is accessible to the reader.

The book covers a wide range of topics related to data analysis. It begins with a simple algorithm that recommends movies based on your previous movie reviews and the movie reviews of others. Although this was the easiest task in the book, I felt that it was one of the most powerful examples. What makes this chapter powerful is that the mathematics behind the programming is very simple. I think this illustrates the power of the Internet and Web 2.0 systems: sometimes the analysis of the data is very easy.
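To give a sense of how little mathematics such a recommender needs, here is a minimal sketch of the general idea: score how similar two reviewers are from the movies they have both rated, then use those scores to weight other people's ratings of movies you have not seen. This is not the book's code. The ratings, the reviewer names, and the function names `similarity` and `recommend` are all invented for illustration.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale, invented for illustration.
ratings = {
    "Ann": {"Alien": 5.0, "Brazil": 3.0, "Casablanca": 4.0},
    "Ben": {"Alien": 4.0, "Brazil": 2.0, "Casablanca": 5.0, "Dune": 4.0},
    "Cal": {"Alien": 1.0, "Brazil": 5.0, "Dune": 2.0},
}


def similarity(prefs, a, b):
    """Score two reviewers from 0 to 1 using the Euclidean distance
    between their ratings of the movies they have both reviewed."""
    shared = [movie for movie in prefs[a] if movie in prefs[b]]
    if not shared:
        return 0.0
    distance = sqrt(sum((prefs[a][m] - prefs[b][m]) ** 2 for m in shared))
    return 1.0 / (1.0 + distance)


def recommend(prefs, person):
    """Predict ratings for movies `person` has not reviewed, as an
    average of other reviewers' ratings weighted by similarity."""
    totals, weights = {}, {}
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue
        for movie, score in prefs[other].items():
            if movie in prefs[person]:
                continue  # only suggest unseen movies
            totals[movie] = totals.get(movie, 0.0) + sim * score
            weights[movie] = weights.get(movie, 0.0) + sim
    ranked = [(totals[m] / weights[m], m) for m in totals]
    return sorted(ranked, reverse=True)


# Predicted (score, title) pairs for Ann, best first.
print(recommend(ratings, "Ann"))
```

In this made-up data, Ben's tastes sit closer to Ann's than Cal's do, so Ben's rating of "Dune" counts for more in Ann's predicted score. The whole calculation is a distance formula and a weighted average.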

I think the movie recommendation chapter also points to some of the frailties of data mining. The results are only as good as the data that have been collected and analyzed. Thinking of my own experiences with movie and music websites that make recommendations, I know that we still have a long way to go to improve the accuracy of these systems. I think the algorithms behind the programming are sound, but we are missing some critical components in the collection of the data. There is something very personal about certain datasets that I believe we are not yet capturing. I don’t doubt that we will eventually become more accurate, but I think we still need to find more indicators to include with the datasets.

I also think the author made a powerful statement in Chapter 9 when he wrote, “An important thing to take away from this chapter is that it’s rarely possible to throw a complex dataset at an algorithm and expect it to learn how to classify things accurately. Choosing the right algorithm and preprocessing the data appropriately is often required to get good results” (p. 197). This is a precursor to the chapter related to “Matchmaking” using advanced classification strategies. Throughout the chapter, Segaran talks about the raw data and discusses ways to restructure and normalize it. I think this is important. For example, he converts street address data into the actual mileage between two points and groups related interests into categories (e.g., snowboarding and skiing). Without this type of preprocessing, comparisons are limited.
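The two preprocessing steps mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the book's code: it assumes the street addresses have already been turned into latitude and longitude by some geocoding service, it uses the haversine great-circle formula for distance, and the category table and the names `miles_between` and `shared_categories` are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in
    decimal degrees (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    h = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


# Hypothetical grouping of raw interests into broader categories.
INTEREST_CATEGORY = {
    "skiing": "snow sports",
    "snowboarding": "snow sports",
    "knitting": "crafts",
    "quilting": "crafts",
}


def shared_categories(interests_a, interests_b):
    """Count the interest categories two people have in common.
    Interests missing from the table are kept as their own category."""
    cats_a = {INTEREST_CATEGORY.get(i, i) for i in interests_a}
    cats_b = {INTEREST_CATEGORY.get(i, i) for i in interests_b}
    return len(cats_a & cats_b)


# Raw strings do not match, but the categories do.
print(shared_categories(["skiing"], ["snowboarding"]))
# Approximate distance between New York and Los Angeles.
print(round(miles_between(40.7128, -74.0060, 34.0522, -118.2437)))
```

A plain comparison of the raw interest lists would find nothing in common between a skier and a snowboarder. After grouping, the two share a category, and the distance becomes a single number an algorithm can compare.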

Most reviews of this book focus on the fact that it is a balance between programming and mathematical computations. There is a great deal of code on almost every page, but there is little mathematical explanation in the form of formulas. For advanced mathematicians, most of the mathematics used in the book is probably already known, so the formulas may not be needed. In general, I would have liked to see more theoretical discussion of the topics and more detailed information about the mathematical formulas. I believe that this would make it a little easier to apply the book's examples to other datasets.

Given my minimal programming and mathematics experience, I found that the Python code made the book confusing at times. I was not interested in running the programs as I read, so I found myself trying too hard to decipher the code. If I focused on the text, I was fine. I did find the explanations, tables, and diagrams to be extremely interesting. I had never thought about the process behind search engine rankings, spam filters, or the optimization used to recommend the best travel itinerary; however, the book did an excellent job explaining these concepts.

I think that the prospect of connecting datasets to mine data and produce “intelligent” results is particularly powerful. In my profession (K-12 education), I could see these concepts being used to analyze assessment data and make instructional decisions for individual students. I have already seen certain products that attempt this, but I have not seen anything that does a thorough job. Many schools currently assign students to remedial classes or activities to try to increase student performance. If a web application could model assessment data using the decision tree logic discussed in Chapter 7, schools could identify the student groups that need particular help in certain areas. I think this type of prescriptive teaching tool would be very beneficial. Of course, all of this depends on the accuracy, specificity, and validity of the assessment tools. Education has wrestled with this problem for a long time.
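As a rough sketch of what that decision tree logic might look like on assessment data, the example below finds the single question ("is this score below some threshold?") that best separates students who needed support from those who did not. Everything here is an assumption made for illustration: the student records, the column names, and the choice of Gini impurity as the measure of how mixed a group is. It shows only the first split of a tree, not a full implementation and not the book's code.

```python
from collections import Counter

# Hypothetical records, invented for illustration:
# (reading_score, math_score, attendance_pct, needed_support)
students = [
    (55, 80, 95, "yes"),
    (60, 85, 90, "yes"),
    (58, 60, 70, "yes"),
    (85, 75, 92, "no"),
    (90, 65, 88, "no"),
    (78, 90, 97, "no"),
]
FEATURES = ["reading_score", "math_score", "attendance_pct"]


def gini(rows):
    """Gini impurity of the outcome column: 0.0 means the group is
    all one outcome, higher values mean a more mixed group."""
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows):
    """Try every feature and threshold; return the (feature, threshold,
    gain) that reduces impurity the most."""
    base = gini(rows)
    best = (None, None, 0.0)
    for col, name in enumerate(FEATURES):
        for threshold in sorted({row[col] for row in rows}):
            low = [r for r in rows if r[col] < threshold]
            high = [r for r in rows if r[col] >= threshold]
            if not low or not high:
                continue  # a split must put students on both sides
            weighted = (len(low) * gini(low) + len(high) * gini(high)) / len(rows)
            gain = base - weighted
            if gain > best[2]:
                best = (name, threshold, gain)
    return best


feature, threshold, gain = best_split(students)
print(f"Split on {feature} < {threshold} (impurity reduced by {gain:.2f})")
```

A full tree would repeat the same search inside each resulting group. As noted above, the result is only as trustworthy as the assessment data fed into it.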

My overall rating of this book is a 4.5 out of 5. Even though some of the concepts and programming were over my head, the book caused me to rethink my pedestrian VBA projects and how I could use its concepts in future projects. I will need to do much more research to learn and implement these concepts, but I do not think I was the intended audience for this book. I think the best audience would be a programmer who has minimal experience working with live Web 2.0 data. For a person with preexisting knowledge of programming and a solid background in some advanced mathematics, I believe that this book would really open the doors to creating interactive websites or applications that use scraped data to enhance an end-user’s experience.



