Data Science Portfolio - Project Strategy

Introduction

Background

Some time ago, not long after the inception of this blog, I was reflecting strategically on the sort of content that I thought would be worth generating. One of the most important elements of this strategy, in my view, was the concept of projects. The main aim of this content, as I stated previously, was:

‘the application of what I know to the solution of an interesting and complex “problem” scenario’

Project content consists essentially of Data Science projects focused on the analysis of one or more datasets that I find interesting… for the love of data! Along the way, project challenges help to cement data science skills that I learn from both “formal” and “informal” sources in a flexible and useful way.

Constant learning

I have learned a range of data science skills “formally” at Coursera, some of which are listed on my bio. However, as I explore these skills further and add new ones, I do so “as I go forth” and analyse data. This need to learn things “informally” as I go means that I need to constantly master various techniques through practice. Often, I encounter really useful links to things that I want to learn, but that are impractical to learn formally, either because I don’t have the time to invest or because they cannot be acquired in that manner.

Career development

My background is in the life sciences, which means that I don’t possess the “traditional” formal data science training with either a computer science or statistics major at its base. This is also why I need to build a solid project portfolio. In addition to strengthening my ongoing skills development by bringing order and connection to individual concepts, projects are a useful platform around which to frame a data-science-oriented CV based on independent work, leveraging open data and open source methods.

I did some brief research and found a couple of links about creating data-science-focused CVs here and here. There is also some decent background here about why and how to create a data science project portfolio.

Basically, I am not really interested in trying to convince people of what I know; it is far easier and more mutually beneficial to simply demonstrate it. For beauty (or value), as it has been said, is truly “in the eye of the beholder”! This conviction was reinforced recently when I read this interesting article on the topic of genuine knowledge and how to discern its existence in others. The introductory anecdote about Max Planck, though probably an urban legend, sets up the critical take-home message of the piece, which is (emphasis mine):

‘There are no shortcuts. Real knowledge comes when people do the work. Real knowledge comes from doing, from “experiencing”.’

Basic Strategy

This Dataquest blog post contained some very useful advice on how to build a data science portfolio. It included descriptions of the different types of project that one could include, each with a different focus (skills to showcase), in addition to some interesting datasets and potential storyline ideas.

Most, if not all, of the projects that I construct will be “independent” projects focused on asking interesting, complex questions of open datasets using open source tools. The two main categories that I will focus on, as far as I can see, are:

  • Data projects: typically GitHub-based projects consisting of elements such as:
    • source code
    • project code and strategy documentation
    • data products such as reports, visualisations (e.g. ggplot2 and D3) and web apps on platforms like Shiny.
  • Competition submissions: rankings and resources related to competitions such as kaggle.com and drivendata.org.

There are possibly other categories or venues for development, but these represent a good start.

Organisation

Ok… how do I organise this? These projects will, naturally, be a combination of “internal” (hosted on this site) and “external” (hosted elsewhere) content. After thinking about this for a while, I decided to connect project content on this site via a single main project homepage and multiple overview pages describing individual projects:

  1. Project glossary page: Serves as a glossary or directory index page listing all of the data science projects that are either ongoing or completed
    • Containing an introductory paragraph describing the purpose of the page and the content that it organises.
    • A small section with links to blog posts hosted on this site that relate generally to data science projects and strategy, featuring content such as new tools or interesting application outlets such as competitions or cooperatives (e.g. hackathons).
    • Links out to the individual project pages
  2. Project overview pages: A brief overview description page consisting generally of the following components:
    • Title: What is the project called?
    • Brief project description (Abstract): Why does this project exist and what does it aim to achieve?
    • Project-specific blog posts that I may write
    • Links to Project-specific resources:
      • GitHub source code and docs
      • derivative data products
      • related external references, e.g.:
        • competition standings or Kaggle kernels
        • hackathon community boards
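
For what it's worth, the page layout described above could be sketched as a small data model. This is only an illustration: the class names, field names and the placeholder project are all hypothetical, and the site itself need not be built this way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectOverview:
    """One project overview page (hypothetical structure)."""
    title: str                      # What is the project called?
    abstract: str                   # Why does it exist, what does it aim to achieve?
    status: str = "ongoing"         # "ongoing" or "completed"
    blog_posts: List[str] = field(default_factory=list)  # project-specific posts
    resources: List[str] = field(default_factory=list)   # source code, data products, external references


@dataclass
class ProjectGlossary:
    """The single glossary (directory index) page (hypothetical structure)."""
    intro: str                      # introductory paragraph
    general_posts: List[str] = field(default_factory=list)  # general project/strategy posts
    projects: List[ProjectOverview] = field(default_factory=list)

    def index(self) -> List[str]:
        # One directory entry per project, showing its status.
        return [f"{p.title} ({p.status})" for p in self.projects]


# Placeholder content, purely for illustration.
glossary = ProjectGlossary(intro="Directory of my data science projects.")
glossary.projects.append(
    ProjectOverview(title="Example project", abstract="A placeholder abstract.")
)
print(glossary.index())
```

The point of the sketch is simply that the glossary page owns the list of projects, while each overview page carries its own posts and resource links.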

Things to do next:

1. Get started!

My first project, healthyAir, is a GitHub-based end-to-end project focused on investigating the connection between respiratory health trends and potential causative factors, using some interesting open data.

2. Get organised!

Construct the project glossary and overview pages described above in the organisation section.

3. Get competitive!

Start understanding how to participate in data science competitions. Kaggle is a good start, and there is some useful tutorial material about competing using R and Python. Kaggle kernels are an example of the concepts that I need to get my head around in order to participate successfully in Kaggle competitions. However… this will come in time :smile:.

Conclusion

Well… let’s begin to:

“Work the problem”

:wink:

Written on March 2, 2017