Visualizing the most important features in Wikipedia's Article for Deletion (AfD)

Danial Javanmardi

1. Abstract

Computational methods such as natural language processing and visualization are receiving growing attention in the study of online deliberation. We present a database tool that facilitates computational content analysis and visualizes the relationships in Wikipedia's Article for Deletion (AfD) discussions. The tool filters human errors and noise from the original textual data and then parses each AfD based on Wikipedia's namespaces. The namespaces investigated in the AfDs are the main namespace (articles), the user namespace, the category namespace, and the Wikipedia namespace (policies and guidelines). The parsed information is stored in a relational database. We have organized two years of AfDs from the English Wikipedia, collecting about 40,000 AfDs in the database. We also provide visualizations and analyses to investigate the patterns in the dataset, which is one of the main contributions of the study. In particular, Wikipedia's policies are voluminous and have a complex structure, which makes them challenging for newcomer editors to understand. Our visualization helps newcomer editors contribute to the community more smoothly. Most Wikipedia editors never read or refer to Wikipedia's policies and guidelines [6, 5], yet articles are deleted essentially because they do not comply with those policies and guidelines.

2. Visualization

Humans still have far more creativity, flexibility, and experience than machines for solving real-world problems. Human vision makes it possible to detect visual patterns quickly. "Information visualization takes advantage of the human eye's broad bandwidth pathway into the mind to allow users to see, explore, and understand large amounts of information at once" [14]. On the other hand, human memory has its limitations. The main challenge for visualization is how to assist humans in overcoming this limitation. Despite advances in computer hardware, such as the size and speed of processors, the human capacity to process information has remained the same [12].

First Visualization

The first visualization shows the outcome of each AfD in relation to its number of "keep" and "delete" votes. The number of "keep" votes is shown in columns 1-23, and the number of "delete" votes in rows 1-23. The colour coding follows the familiar traffic-light convention: green represents AfDs with more "keep" votes, red represents those with more "delete" votes, and yellow represents those with an even, or nearly even, number of "keep" and "delete" votes. Each cell's colour represents the average outcome of the AfDs in that cell. For example, the cell in the first row and fifth column covers the AfDs with one "delete" vote and five "keep" votes; it is filled with green, indicating that the average outcome for this cell is "keep". To view the exact percentages of "keep", "delete", and "other" votes in a specific cell, hover over that cell.
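The cell colouring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual code: the record layout (`keep`, `delete`, `outcome`), the function names, the sample data, and the 0.4/0.6 thresholds that separate red, yellow, and green are all hypothetical.

```python
from collections import defaultdict

def cell_keep_share(afds):
    """Group AfDs by (delete votes, keep votes) and return, for each cell,
    the share of AfDs in that cell whose outcome was "keep"."""
    totals = defaultdict(int)
    kept = defaultdict(int)
    for afd in afds:
        cell = (afd["delete"], afd["keep"])  # (row, column) of the grid
        totals[cell] += 1
        if afd["outcome"] == "keep":
            kept[cell] += 1
    return {cell: kept[cell] / totals[cell] for cell in totals}

def traffic_light(keep_share, low=0.4, high=0.6):
    """Map a keep share to a traffic-light colour (thresholds are assumed)."""
    if keep_share >= high:
        return "green"
    if keep_share <= low:
        return "red"
    return "yellow"

# Hypothetical records, for illustration only.
sample = [
    {"delete": 1, "keep": 5, "outcome": "keep"},
    {"delete": 1, "keep": 5, "outcome": "keep"},
    {"delete": 4, "keep": 1, "outcome": "delete"},
]
shares = cell_keep_share(sample)
colours = {cell: traffic_light(share) for cell, share in shares.items()}
```

With these sample records, the cell for one "delete" vote and five "keep" votes is coloured green, matching the example in the text.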

The stacked bar chart on the right side of the first visualization shows the length of the AfDs on the x-axis and the number of AfDs of each length on the y-axis. The length of an AfD is calculated as the total number of "delete," "other," and "keep" votes. The red bars represent the AfDs with "delete" outcomes, and the green bars represent the AfDs with "keep" outcomes. According to the figure, the chart peaks at a length of 5; around 80% of the AfDs of this length reached a "delete" outcome, and the rest reached a "keep" outcome.
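The length calculation and the per-length split by outcome might be computed as follows. This is a minimal sketch: the field names, function names, and sample records are assumptions, and the sample is constructed only to mirror the 80% figure mentioned above, not taken from the dataset.

```python
from collections import Counter

def length_histogram(afds):
    """Count AfDs per (length, outcome).
    Length = number of "delete" + "other" + "keep" votes."""
    counts = Counter()
    for afd in afds:
        length = afd["delete"] + afd["other"] + afd["keep"]
        counts[(length, afd["outcome"])] += 1
    return counts

def delete_share(counts, length):
    """Fraction of AfDs of the given length that ended in "delete",
    or None when no AfD of that length was kept or deleted."""
    deleted = counts[(length, "delete")]
    kept = counts[(length, "keep")]
    total = deleted + kept
    return deleted / total if total else None

# Hypothetical records: five AfDs of length 5, four of them deleted.
sample = [{"delete": 4, "other": 0, "keep": 1, "outcome": "delete"}] * 4
sample.append({"delete": 2, "other": 1, "keep": 2, "outcome": "keep"})

hist = length_histogram(sample)
share = delete_share(hist, 5)
```

Each `(length, outcome)` count corresponds to one coloured segment of a stacked bar.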

Second Visualization

The second visualization displays the number of AfDs in each cell. Its colour coding is adopted from the heatmap, a type of graph that displays population density. Gray represents no AfDs, white to light blue represents 1-15 AfDs, light blue to blue represents 15-300 AfDs, blue to purple represents 300-700 AfDs, and purple to dark purple represents 700-7019 AfDs. For example, the cell in the second row and eleventh column contains twenty AfDs and is coloured light blue. To view the exact number of AfDs in a specific cell, hover over that cell.
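The mapping from a cell's population to its colour band can be illustrated with a small lookup. The band limits come from the text; because the stated ranges share their end points (15, 300, 700), this sketch assumes that a boundary value belongs to the lower band. The function and constant names are hypothetical.

```python
from bisect import bisect_left

# Upper limits of the first three non-empty bands, as listed in the text.
UPPER_BOUNDS = [15, 300, 700]
BANDS = [
    "white to light blue",    # 1-15 AfDs
    "light blue to blue",     # 15-300 AfDs
    "blue to purple",         # 300-700 AfDs
    "purple to dark purple",  # 700-7019 AfDs
]

def population_band(count):
    """Return the colour band for a cell containing `count` AfDs.
    Assumption: a count equal to a shared limit falls in the lower band."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "gray"
    return BANDS[bisect_left(UPPER_BOUNDS, count)]

band_for_twenty = population_band(20)
```

A cell with twenty AfDs falls in the "light blue to blue" band, consistent with the light blue cell in the example above.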

Third Visualization

The third visualization depicts a sunburst diagram representing the percentage of mentioned policies in the votes for the AfDs. A sunburst diagram is similar to a tree diagram, the only difference being that it uses a radial layout. The "root" of the tree is in the center of the sunburst diagram, with the "leaves" on the circumference around it. The sunburst diagram consists of three tiers: the first tier, which is the inner circle of the diagram, represents the top of the policy hierarchy; the second tier, which is the middle circle of the diagram, represents the middle of the hierarchy; and the third tier, which is the outermost circle of the diagram, represents the bottom of the hierarchy. The length of the arc of each policy corresponds to the percentage of this policy in the AfD comments. The colour coding of this diagram is the default colouring; a random colour has been assigned to each policy. Despite this random assignment, an effort has been made to maintain a similarity between the colours of the parent and child policies.
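To make the relationship between mention counts and arc lengths concrete, the following sketch computes the angular span of every arc in a small hierarchy. It is an illustration only: the nested-dictionary layout, the function names, and the policy names and counts are assumptions, and it assumes that inner nodes carry no mentions of their own beyond those of their children.

```python
import math

def subtree_count(node):
    """Total mentions under a node: a leaf is an int, an inner node a dict."""
    if isinstance(node, dict):
        return sum(subtree_count(child) for child in node.values())
    return node

def arc_spans(tree, total=None, start=0.0, tier=1, parent=()):
    """Return (path, tier, start_angle, end_angle) tuples, angles in radians.
    An arc's width is proportional to its share of all mentions, and
    children are laid out inside the angular span of their parent."""
    if total is None:
        total = subtree_count(tree)
    arcs = []
    for name, node in tree.items():
        width = 2 * math.pi * subtree_count(node) / total
        path = parent + (name,)
        arcs.append((path, tier, start, start + width))
        if isinstance(node, dict):
            arcs.extend(arc_spans(node, total, start, tier + 1, path))
        start += width
    return arcs

# Hypothetical mention counts, for illustration only.
mentions = {
    "Notability": {"People": 30, "Music": 20},
    "Verifiability": 50,
}
arcs = arc_spans(mentions)
```

Here "Notability" accounts for half of the mentions, so its arc covers half of the circle, and its two children split that half in proportion to their own counts.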

The interaction and concept of this visualization were inspired by the visualization mantra: "Overview first, zoom and filter, then details on demand" [20]. The details of each targeted policy are filtered and displayed to visitors based on their interest. To view the percentage of a policy in the AfD comments, one can hover over the desired policy. When a policy is hovered over, it and its parent policies are highlighted, while the remainder of the diagram fades. The percentage displayed in the center of the sunburst diagram is the percentage of the policies mentioned in the comments. Information related to the highlighted policy and its parent policies is presented on the right-hand side of the diagram, with the corresponding colour of each policy shown beneath the policy title. This information includes the policy title (which also acts as a link to the Wikipedia page describing the policy), the percentages of the policies mentioned in each separate tier, and the percentages of "delete," "other," and "keep" votes in the AfDs for each policy. This interaction with the diagram allows for easy navigation and gives an overview of how frequently each policy was used in the dataset. Further knowledge can be acquired through the Wikipedia pages linked from the policy titles.
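The hover behaviour described above needs two pieces of information: the chain of parent policies to highlight, and the vote percentages to display. A minimal sketch of both is given below. The parent map, the policy names, the vote counts, and the function names are hypothetical, and the parent map is assumed to contain no cycles.

```python
def highlighted_path(policy, parent_of):
    """Return the hovered policy followed by its ancestors, nearest first.
    `parent_of` maps each non-root policy to its parent policy."""
    path = [policy]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path

def vote_percentages(votes):
    """Convert raw vote counts into percentages of the total."""
    total = sum(votes.values())
    if total == 0:
        return {kind: 0.0 for kind in votes}
    return {kind: 100 * count / total for kind, count in votes.items()}

# Hypothetical three-tier hierarchy and vote counts, for illustration only.
parents = {
    "Notability (musicians)": "Notability (music)",
    "Notability (music)": "Notability",
}
path = highlighted_path("Notability (musicians)", parents)
percentages = vote_percentages({"delete": 60, "other": 10, "keep": 30})
```

Every policy in `path` would be highlighted while the rest of the diagram fades, and `percentages` would supply the figures shown in the side panel.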

Fourth Visualization

The fourth visualization depicts a sunburst diagram representing the percentage of mentioned categories in the AfDs. For simplicity and usability, we developed the same sunburst visualization as in the third visualization, but for the categories in the AfDs. This sunburst diagram has four tiers, with the innermost circle at the top of the tree hierarchy. The larger the arc of a category, the more often that category appears in the AfDs. The default colour coding of this diagram is also a random colour assignment for each category. Despite this random assignment, an effort has been made to maintain a similarity between the colours of the parent and child categories. The functionality of the fourth visualization is the same as that of the third visualization.

Fifth Visualization

The fifth visualization illustrates the relationship between the fiction category and the most mentioned policies in this category. The visualization is interactive: whenever a user hovers the mouse over a category, its respective policies are highlighted on the right side, and vice versa. In this way, an editor who wants to write a new article in a category can interact with this visualization and get a concise view of the most important policies in that category.
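The two-way highlighting between categories and policies can be backed by a pair of lookup tables, one per direction. The sketch below assumes the data is available as (category, policy) pairs, one per mention; the pair format, the function names, and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(mentions):
    """Build both lookup directions from (category, policy) pairs."""
    policies_by_category = defaultdict(Counter)
    categories_by_policy = defaultdict(Counter)
    for category, policy in mentions:
        policies_by_category[category][policy] += 1
        categories_by_policy[policy][category] += 1
    return policies_by_category, categories_by_policy

def top_linked(index, key, n=3):
    """Names to highlight when `key` is hovered, most mentioned first."""
    return [name for name, _ in index.get(key, Counter()).most_common(n)]

# Hypothetical mentions, for illustration only.
mentions = [
    ("Fiction", "Notability"),
    ("Fiction", "Notability"),
    ("Fiction", "Verifiability"),
    ("Music", "Notability"),
]
by_category, by_policy = build_index(mentions)
fiction_policies = top_linked(by_category, "Fiction")
notability_categories = top_linked(by_policy, "Notability")
```

Hovering over a category would use the first table and hovering over a policy the second, so the same data serves both directions of the interaction.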

4. Conclusion

We fed two years of AfDs to the parser we developed in order to create our dataset, and we explored visualization techniques to analyze it. Our visualizations help editors understand the most important features of the AfDs, such as policies. Understanding these policies helps editors argue more effectively and prevent their articles from being deleted in an AfD. It also helps them improve the quality of their articles in line with Wikipedia's policies and guidelines. Nearly half of the policies mentioned in the AfDs are subsets of the notability policy. These include general notability (Wikipedia:GNG), notability for people (Wikipedia:BIO), notability in sport (Wikipedia:ATH), notability in music (Wikipedia:NMG), notability for organizations and companies (Wikipedia:ORG), notability in academics (Wikipedia:PROF), and notability in events (Wikipedia:N(E)). We also provided a visualization for the categories in the AfDs, which have not been investigated before. Finally, the last visualization illustrates the relationship between policies and categories, which is the most effective and concise way to support the editor. It is based on the fact that an editor writes an article in one of the categories, and they can quickly go through the

All Rights Reserved. © 2016