Supervised machine learning for automated coding of websites: an exploratory pilot study of government hyperlink networks
Timothy Graham, Robert Ackland, Paul Henman
Building: Holme Building
Room: Holme Room
Date: 2014-12-10 01:30 PM – 03:00 PM
Last modified: 2014-12-05
Abstract
To analyse the structure of hyperlink networks, researchers need to understand the type and nature of the webpages or websites (i.e. nodes) that comprise such networks. Information such as the generic top-level domain (e.g. com, org, gov) provides only ‘coarse-grained’ data about what these nodes are, whereas social scientists often require more detailed information about the websites under analysis. Usually this involves manually labelling, or ‘coding’, nodes into categories, using techniques similar to textual or documentary analysis. However, the size and nature of hyperlink networks often make this task time-consuming and costly. In this exploratory pilot study we investigate the use of supervised machine learning to automatically code websites in government hyperlink networks into discrete policy domains (e.g. health, education, environment). This involves extracting, or ‘scraping’, text from the HTML of the sample websites to provide data for the algorithm to ‘learn’ from. Specifically, we apply Support Vector Machines (SVMs) to a sample of websites that a human coder has already categorised into policy domains. The sample is divided into a ‘training sample’ and a ‘test sample’; the SVMs learn from websites in the training sample and are then tasked with predicting the policy domains of websites in the test sample. Preliminary results indicate a surprisingly high level of accuracy, with some policy domains correctly classified more than 90% of the time. This suggests that supervised machine learning may offer a powerful and useful tool for social science research involving large-scale analysis of websites, particularly where the categories of websites are discrete and fairly well-defined (e.g. policy domains in government hyperlink networks). Future work will investigate the validity and robustness of this methodology using a larger sample of data and other machine learning algorithms and techniques.
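The workflow described in the abstract (scrape text from HTML, train on hand-coded pages, predict policy domains for held-out pages) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the abstract does not state which features or SVM implementation were used, the tiny HTML pages are invented, the function names are hypothetical, and a simple Pegasos-style sub-gradient trainer for a linear SVM stands in for a full SVM library so that the example needs only the Python standard library.

```python
import random
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def scrape_tokens(html):
    """'Scrape' an HTML string into a list of lowercase word tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.chunks).lower())


def dot(w, x):
    """Sparse dot product between a weight dict and a term-count mapping."""
    return sum(w.get(term, 0.0) * count for term, count in x.items())


def train_one_vs_rest(docs, labels, epochs=20, lam=0.01, seed=0):
    """Train one linear SVM per class with Pegasos-style hinge-loss updates."""
    models = {}
    for cls in sorted(set(labels)):
        rng = random.Random(seed)
        w, t = {}, 0
        order = list(range(len(docs)))
        for _ in range(epochs):
            rng.shuffle(order)
            for i in order:
                t += 1
                x = docs[i]
                y = 1.0 if labels[i] == cls else -1.0
                margin = y * dot(w, x)
                scale = 1.0 - 1.0 / t  # shrinkage from the L2 regulariser
                for term in w:
                    w[term] *= scale
                if margin < 1.0:  # hinge loss is active: step towards y * x
                    eta = 1.0 / (lam * t)
                    for term, count in x.items():
                        w[term] = w.get(term, 0.0) + eta * y * count
        models[cls] = w
    return models


def predict(models, x):
    """Return the class whose linear SVM gives the highest score."""
    return max(sorted(models), key=lambda cls: dot(models[cls], x))


# Invented toy pages, already 'coded' by hand into policy domains.
train_pages = [
    ("<title>health</title><p>hospital patients</p>", "health"),
    ("<title>health</title><p>clinic nurses</p>", "health"),
    ("<title>school</title><p>students teachers</p>", "education"),
    ("<title>school</title><p>curriculum exams</p>", "education"),
    ("<title>climate</title><p>emissions forests</p>", "environment"),
    ("<title>climate</title><p>rivers wildlife</p>", "environment"),
]
# Held-out test sample: the labels are used only to score the predictions.
test_pages = [
    ("<title>health</title><p>clinic surgery</p>", "health"),
    ("<title>school</title><p>exams library</p>", "education"),
    ("<title>climate</title><p>forests recycling</p>", "environment"),
]

train_x = [Counter(scrape_tokens(html)) for html, _ in train_pages]
train_y = [label for _, label in train_pages]
models = train_one_vs_rest(train_x, train_y)

predictions = [predict(models, Counter(scrape_tokens(html))) for html, _ in test_pages]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test_pages)) / len(test_pages)
print(predictions, accuracy)
```

The toy vocabularies are deliberately disjoint across domains, which makes the task far easier than coding real government websites. On real pages, where domains share much of their vocabulary, one would more plausibly weight terms (for example with TF-IDF) and use an established SVM implementation such as scikit-learn's `LinearSVC`, and evaluate on a much larger hand-coded test sample.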