Visualizing Communities Formed Using Combinations of Clustering Algorithms, Twitter Features, and Similarity Measures

This research was an undergraduate thesis done by Austin Fernandez, Marc San Pedro, Johansson Tan, and myself, Clarisse Poblete, with our adviser, Ms. Charibeth Cheng. The goal of the thesis was to develop a tool that detects communities based on user-selected combinations of clustering algorithms, Twitter features, and similarity measures. The tool was a web application built with HTML, CSS, JavaScript, and Python (with Django).

Software Features

Representation of communities from a high or low level view

Filtering of user nodes by community or by user

Display of characteristics of individual communities

Data Visualization

All members of the thesis group contributed to planning the implementation of every module, including the collection module (which crawls and cleans the data) and the processing module (which applies the chosen algorithm-feature-measure combination to the dataset to generate communities). I was primarily in charge of the visualization module.
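To make the idea of an algorithm-feature-measure combination concrete, here is a minimal Python sketch. It is not the thesis code. The names `jaccard` and `detect_communities`, the hashtag feature, the sample data, and the threshold are all assumptions for illustration. A simple connected-components grouping stands in for whichever clustering algorithm the user would select.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity measure (illustrative): Jaccard index of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_communities(features, similarity, threshold):
    """Stand-in clustering step: link users whose similarity meets the
    threshold, then return the connected components as communities."""
    parent = {user: user for user in features}

    def find(user):
        # Follow parent links to the representative, compressing the path.
        while parent[user] != user:
            parent[user] = parent[parent[user]]
            user = parent[user]
        return user

    for u, v in combinations(features, 2):
        if similarity(features[u], features[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for user in features:
        groups.setdefault(find(user), set()).add(user)
    # Sort for a deterministic order of communities.
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical feature: the set of hashtags each user has tweeted.
hashtags = {
    "alice": {"#python", "#d3"},
    "bob": {"#python", "#d3", "#django"},
    "carol": {"#basketball"},
    "dave": {"#basketball", "#nba"},
}
communities = detect_communities(hashtags, jaccard, threshold=0.5)
```

Swapping the feature dictionary or the `similarity` function changes the combination without touching the grouping step, which is the kind of mix-and-match the tool was meant to support.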

The visualization module is responsible for displaying a graphical representation of the generated communities. The user interface of the web application was coded using HTML, CSS, and JavaScript, and the graphs were generated using the D3 JavaScript library. Here are some examples of the visualization module's outputs:

High-level views of generated communities. Each community is represented as a node. The larger the node, the larger the community; the greater the distance between two nodes, the less similar the two communities are.
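One way the data behind such a high-level view could be prepared is sketched below in Python. This is an assumption-laden illustration, not the thesis implementation. The function name `high_level_view`, the `size` and `distance` keys, and the choice of "one minus the mean pairwise user similarity" as the distance between communities are all hypothetical.

```python
from itertools import combinations, product


def high_level_view(communities, user_similarity):
    """Build one node per community and one link per pair of communities.

    communities: dict mapping a community name to a non-empty set of users.
    user_similarity: function (u, v) -> similarity in [0, 1].
    """
    # Node size is the number of members in the community.
    nodes = [
        {"id": name, "size": len(members)}
        for name, members in communities.items()
    ]
    links = []
    for a, b in combinations(communities, 2):
        pairs = list(product(communities[a], communities[b]))
        mean_sim = sum(user_similarity(u, v) for u, v in pairs) / len(pairs)
        # Less similar communities get a larger target distance.
        links.append({"source": a, "target": b, "distance": 1.0 - mean_sim})
    return {"nodes": nodes, "links": links}


# Hypothetical similarity scores between users of different communities.
scores = {
    frozenset(("alice", "carol")): 0.2,
    frozenset(("bob", "carol")): 0.4,
}
view = high_level_view(
    {"A": {"alice", "bob"}, "B": {"carol"}},
    lambda u, v: scores[frozenset((u, v))],
)
```

The resulting dictionary can be serialized to JSON and handed to the front end, where a graph layout would map `size` to node radius and `distance` to the spacing between nodes.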

Low-level views of generated communities. Each user is represented as a node. Nodes from the same community are distinguished by color. Users may be filtered by community (with or without outside connections) or by selected users (to view only certain nodes' connections).
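The two filters described above can be sketched as plain edge filters. The Python below is illustrative only. The function names, the `membership` mapping, and the sample edges are assumptions, and in the actual tool this logic may well have lived in the JavaScript front end instead.

```python
def filter_by_community(edges, membership, community, include_outside=False):
    """Keep edges inside one community, optionally with outside connections.

    edges: list of (user, user) pairs.
    membership: dict mapping each user to a community label.
    """
    kept = []
    for u, v in edges:
        u_in = membership[u] == community
        v_in = membership[v] == community
        # Internal edges always stay; edges that cross the community
        # boundary stay only when outside connections are requested.
        if (u_in and v_in) or (include_outside and (u_in or v_in)):
            kept.append((u, v))
    return kept


def filter_by_users(edges, selected):
    """Keep only edges that touch at least one of the selected users."""
    return [(u, v) for u, v in edges if u in selected or v in selected]


# Hypothetical sample data.
membership = {"alice": 0, "bob": 0, "carol": 1, "dave": 1}
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]

inside_only = filter_by_community(edges, membership, 0)
with_outside = filter_by_community(edges, membership, 0, include_outside=True)
dave_only = filter_by_users(edges, {"dave"})
```

Here `inside_only` holds just the alice-bob edge, `with_outside` also includes the bob-carol edge that leaves community 0, and `dave_only` holds the single edge touching dave.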

Conclusion

We were able to identify differences in the community structures produced by different clustering algorithms, as well as differences in the characteristics of users in the communities formed when different Twitter features were used as the basis for clustering. The visualization gave us more insight into the resulting community structures, and into new features that could be added in future extensions of this research to further aid the analysis of the communities generated by the software.