It is no secret that the media equation of the 1950s, large-scale production by a few for mass consumption, is being flipped on its head. Blogs, podcasts, Facebook, and Digg all showcase the impact of user-generated content. But as information of a fundamentally social nature keeps pace with traditionally fact-oriented information, the demand for tools that help us navigate and understand our virtual world also increases. Google and Yahoo enabled the first wave of Internet popularization by empowering everyday people to search for what they wanted, so long as it was fact-oriented. Asking Google "what types of people are members of XYZ forum" yields results only if someone has taken the time to post an answer on the Internet. Yet Google already has much of the information it needs to answer such questions: the behavior patterns and contributions of user communities are typically publicly available. Greg1024 usually talks about guns unless someone questions health care. xxCuteLovRxx is dedicated to pictures of cats doing funny things, often followed by "OmG! s00 kUt3!"
Such patterns in language use, posting behavior, friendship circles, and geographical relevance are ripe for integration into these communities through not-yet-developed interactive tools and data models. My research tackles this challenge using a combination of stochastic linguistic modeling, sociological theory, and information visualization. My first work in this direction, "Is Britney Spears Spam?", was well received at the Conference on Email and Anti-Spam. It sought to classify users by the humanness of their communication behavior and social structure. I am now developing far more sophisticated models built on topic modeling algorithms. You can get a better sense of where I'm going by looking at the proposal for my general exams, which I'm preparing for now.
Here is a sneak peek at some early raw results from a large social networking site. Each "topic" represents a set of words that probabilistically belong together, and these sets also happen to function usefully as "cultural markers."
Topic "A": hey whats time long talk havent hows good talked haven goin forever ttyl summer heard hope bye school awhile buddy
Topic "B": ha cute yeah today song cool thing thought didn fun made mom guess found picture wow sister totally huh sweet
Topic "C": friends real fat back life fake shit friend homies drink send parents call ass homie food cry stuff bang
Topic "D" (best n-grams): eminem_presents candy_couture louis_vuitton_denim dior_saddle chanel_cambron chloe_paddington newest_styles candy_couture_carries balenciaga_le_dix_motorcycle fendi_spy gucci_hobo prada_messenger hermes_birkin candy_couture_fashionistas ysl_muse louis_vuitton_perforation wholesale_discount_prices wholesale_luxury
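For readers curious how word groupings like the ones above can fall out of raw text, here is a minimal sketch of one common topic modeling approach: latent Dirichlet allocation fit with collapsed Gibbs sampling. This is an assumption for illustration only; it is not necessarily the algorithm behind the results shown here. The function names (`lda_gibbs`, `top_words`), the hyperparameters, and the six toy messages are all hypothetical, and the toy messages are not real data from the site.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (topic_word, doc_topic) counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # tokens per topic, per doc
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                         # tokens per topic overall
    assign = []                                            # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=5):
    """Return the n most frequent words assigned to each topic."""
    return [
        [w for w, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])) if c > 0][:n]
        for counts in topic_word
    ]


# Hypothetical toy messages, invented for this sketch.
docs = [
    "hey whats up long time no talk".split(),
    "hey hows school long time ttyl".split(),
    "talk soon hope summer good ttyl".split(),
    "cute picture wow totally sweet song".split(),
    "cool song today fun picture cute".split(),
    "wow sweet fun today cool thing".split(),
]

topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for label, words in zip("AB", top_words(topic_word)):
    print(f'Topic "{label}":', " ".join(words))
```

With a corpus this small the groupings are noisy and depend on the random seed. The point is only the mechanism: words that tend to co-occur in the same messages drift toward the same topic.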