Research Spotlight: Untapped Potential: Leveraging Unstructured Data

In any discipline, answering key questions requires retrieving and analyzing data. As technology advances and the quantity of data being produced and collected grows exponentially, we need an ongoing dialogue about the potential of unstructured data. Discussions of ‘data’ or ‘data collection’ are generally geared toward structured data: surveys, data warehouses, databases, or the ol’ rows and columns in a data frame. But what about the vast amount of available data with little to no structure? These data make up as much as 90% of the data that researchers and analysts can leverage, yet only 0.5% of this data is analyzed.

I have spent much of my early career analyzing unstructured data (i.e., textual, verbal, audio, and visual materials). These ‘qualitative’ data can be quantified at scale in ways that yield actionable insights. Leveraging unstructured data with standard retrieval tools offers an advantageous alternative for addressing problem spaces without expensive surveys, reliance on maintained data warehouses, or hard-to-obtain data access.
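To make the quantification idea concrete, here is a minimal sketch (the sample text and function name are my own invention for illustration) of one common first step: turning free-form text into term counts using only Python's standard library.

```python
from collections import Counter
import re

def term_frequencies(text):
    """Tokenize free text and count term occurrences -- one simple way
    to turn unstructured text into a structured, countable form."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical interview note, standing in for real qualitative material.
notes = "The respondent praised the program; the program helped."
freqs = term_frequencies(notes)
# Most common terms can now feed into tables, plots, or models.
top_terms = freqs.most_common(2)
```

From here, the counts behave like any structured variable: they can be compared across respondents, tracked over time, or used as model features.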

Unstructured data sources are playgrounds of untapped feature potential: any researcher or analyst can focus on feature creation, which often leads to additional questions to consider and analyses to perform. The internet itself is an enormous, underexplored repository for unstructured data retrieval. From the web, an individual can extract content from webpages, sites, documents (docx, pdf, pptx, txt, xml, etc.), RSS feeds, video/image files (mp4, mov, wmv, jpeg, etc.), links, and audio files (wav, mp3, mat, etc.), just to name a few.
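As one example of these sources, here is a sketch of parsing an RSS feed with Python's standard xml.etree module. The feed snippet and URLs are fabricated for illustration; a real workflow would first download the feed over the network (e.g., with urllib) before parsing it.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 snippet standing in for a downloaded feed.
rss = """<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

# Parse the XML and pull the title/link pair out of each <item>.
root = ET.fromstring(rss)
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
```

Each `(title, link)` pair is now a structured record, ready for a data frame or further crawling.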

So, what tools would I recommend to get an individual started on their unstructured data journey? Thanks to open-source code and APIs (application programming interfaces), there are many, but I will stick to those I have found most useful in my own line of work.

  • Web crawlers/scrapers are tools that crawl websites and scrape their content in order to index and extract data. Great options are Scrapy for Python and Rcrawler for R.
  • Parsers are great tools for analyzing documents. They take input data (often text) and build a data structure from it. Different parsers cater to different file types.
    • Text embedded in images – the Tesseract OCR engine (e.g., via the pytesseract library) in Python and the magick package in R.
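To show the kind of extraction step these tools perform, here is a minimal sketch using only Python's built-in html.parser module, not Scrapy itself (which layers crawling, scheduling, and selectors on top of this step). The page snippet, class name, and link targets are invented for illustration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags -- the core parsing step
    a crawler performs on each fetched page before following links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

# A made-up page, standing in for HTML fetched by a crawler.
page = '<html><body><a href="/about">About</a> <a href="/data.pdf">Report</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
```

A crawler would queue the collected links for its next fetch, while a scraper would instead target the page elements holding the data of interest.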

A final note: while the tools I have mentioned here require coding on the part of the individual, numerous platforms offer the same services for a fee, letting anyone quickly and efficiently extract and analyze unstructured data without writing code. With that, I urge you to go forth, try any number of the tools out there, and analyze!

Scarlett Marklin

Scarlett Marklin is a PhD Candidate in the Department of Sociology at FSU. Her research interests include project-based research, computer science, data analysis, and research design.
