
Pre-processing coverage data for Data Visualizations #6

Opened by @flyingzumwalt

@mhucka has been exploring ways to facilitate visually drilling down into the coverage data (a.k.a. the public record of all the data held by participating orgs). Discussion of dataviz options is here: https://github.com/datatogether/research/tree/master/data_visualization

This will inevitably require pre-processing of the data, partly because a single layer of the navigation tree can easily contain tens of thousands of items (i.e., URLs). In addition to pre-processing based on simple analysis of the content, such as running files through FITS to extract content types, there is clearly a need for deeper machine analysis. At the very least, entity extraction could be used to identify patterns/topics within a corpus.
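As a rough illustration of what that entity-extraction step could look like, here is a minimal sketch using spaCy. The library choice and the `texts` input are assumptions for illustration, not part of any existing pipeline:

```python
# Minimal entity-extraction sketch using spaCy (library choice is an
# assumption; any NER tool could fill the same role in the pipeline).
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, installed separately

def top_entities(texts, n=20):
    """Count named entities across a corpus and return the most common."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.text.strip() for ent in doc.ents)
    return counts.most_common(n)

# `texts` would come from the harvested items in the coverage data
# (hypothetical input for illustration).
if __name__ == "__main__":
    sample = ["NASA published new climate datasets in 2016."]
    print(top_entities(sample))
```

Aggregating the most common entities per subtree would give the visualization something more navigable than a flat list of tens of thousands of URLs.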

@mhucka has already been working on some of this. Let's rope in a few more people. @chrpr and @mejackreed come to mind.

The ETL (extract, transform, load) pattern seems applicable here, and it opens opportunities for experimenting with incorporating distributed data and distributed tools into machine-analysis pipelines (see the sketch after this list):

  1. aggregate the essential info into a workable dataset (tracking info currently lives in a SQL database; eventually it will be distributed)
  2. analyze that dataset
  3. write the analyzed/reformatted result somewhere durable (e.g., to IPFS)
  4. pass around a reference to the updated/processed/extended dataset (e.g., an IPFS hash)
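To make the four steps concrete, here is a minimal sketch, assuming a local SQLite database of tracked URLs and a running IPFS daemon reachable via the `ipfshttpclient` library. The table/column names and the analysis step are hypothetical placeholders:

```python
# Minimal ETL sketch: extract from SQL, analyze, write to IPFS, and
# return the content hash that gets passed around. Table/column names
# and the analysis step are hypothetical placeholders.
import sqlite3

import ipfshttpclient  # assumes a local IPFS daemon is running

def extract(db_path):
    """Step 1: pull the essential info into a workable dataset."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT url, content_type FROM coverage").fetchall()
    conn.close()
    return [{"url": u, "content_type": ct} for u, ct in rows]

def analyze(records):
    """Step 2: a stand-in analysis -- count records by content type."""
    counts = {}
    for r in records:
        counts[r["content_type"]] = counts.get(r["content_type"], 0) + 1
    return {"records": records, "counts_by_type": counts}

def load(result):
    """Steps 3-4: write the result to IPFS and return its hash (CID)."""
    with ipfshttpclient.connect() as client:
        return client.add_json(result)

if __name__ == "__main__":
    dataset = extract("coverage.db")
    print(load(analyze(dataset)))  # the hash is the shareable reference
```

The nice property of ending on a content hash is that downstream tools (and other orgs) can fetch exactly the analyzed dataset without coordinating access to the original SQL database.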
