Visualization of ALBERTA_government_information_all_urls.csv

This is a visualization of the ALBERTA_government_information_all_urls.csv file. Each bar shows the distribution of pages in that crawl by top-level domain. The legend on the right ranks the domains by their number of pages across the entire collection, reading from the bottom up: the bottom domain has the most pages, the second from the bottom has the second most, and so on. The legend breaks out the top 20 domains; everything else is grouped into "other" (at the very top). The colors in each column are stacked in the same order as the legend. Mousing over a segment of a bar, or over any entry in the legend, outlines that domain in red across all crawls.
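The ranking and grouping described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the visualization: the function name `rank_domains` and the row layout `(crawl, domain, pages)` are assumptions made for the example, and the actual CSV columns may differ.

```python
from collections import Counter, defaultdict


def rank_domains(rows, top_n=20):
    """Rank domains by total pages and fold the remainder into "other".

    `rows` is an iterable of (crawl, domain, pages) tuples. This layout is
    hypothetical; it is not taken from the CSV file itself.
    Returns (stack_order, per_crawl), where stack_order lists domains from
    the bottom of the stack to the top.
    """
    rows = list(rows)

    # Total pages per domain across the entire collection.
    totals = Counter()
    for _crawl, domain, pages in rows:
        totals[domain] += pages

    # The top_n domains, largest first (largest sits at the bottom).
    top = [domain for domain, _ in totals.most_common(top_n)]
    keep = set(top)

    # Per-crawl counts, with every non-top domain grouped into "other".
    per_crawl = defaultdict(Counter)
    for crawl, domain, pages in rows:
        key = domain if domain in keep else "other"
        per_crawl[crawl][key] += pages

    # "other" goes at the very top of the stack, if anything was grouped.
    stack_order = top + (["other"] if len(totals) > top_n else [])
    return stack_order, {crawl: dict(c) for crawl, c in per_crawl.items()}
```

Using the same `stack_order` for both the legend and every column is what keeps the color stacking consistent between them.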

This visualization was created with d3, based on this example and this example. The raw data for this visualization comes from the output of a Spark script.

The original code was created by Jimmy Lin and can be found in the warcbase repo here.