Data Science Fundamentals — Technical Communication & Documentation

Published in K2 Data Science & Engineering · 8 min read · Jan 2, 2019

Often, first‐rate technical work is tragically undervalued because it is not communicated effectively. I’ve also seen that just a few basic, easy‐to‐learn principles can make the difference between incomprehensibility and a stunning presentation. Internalizing them will make you a more valuable asset on any team.

Data scientists are in a uniquely communication‐intensive niche. Software
engineers mostly talk with other software engineers, business analysts
with business analysts, and so on. It is the job of a data scientist to bridge the
gaps between the worlds of business, analytics, and software. So, it’s a crying
shame that, frankly, most of us aren’t that good at it. Ultimately, everything
in this lesson is about one goal: conveying ideas to your audience in a
way that (1) they will actually understand and (2) they have an easy time
understanding.

Let’s start with a few general principles that I think underlie most good technical communication.

Source: The Data Science Handbook by Field Cady

Know Your Audience

This is one of the most basic principles of technical communication, but it’s
also one of the hardest to master. As a data scientist, you will talk to the following people:

Domain experts who understand what you’re studying better than you do
but don’t know much about software and analytics.
Analytical people, who are very interested in the nitty‐gritty of what you’ve done. You can expect to spend a lot of time here discussing (and possibly justifying) your statistical methodology and modeling choices.
Software engineers, who will often want to treat your code as a black box that magically spits out answers but will care a lot about its performance in real‐world situations.
Business people, a diverse group, ranging from former engineers who will grill you about the details to nontechnical managers who want everything translated into business speak.

Another important part of knowing your audience is knowing how much detail to include and how much of the story of your work to tell. A high‐level executive might want just a few key take‐home points. Your peers might want more details about your methodology, particularly if your findings are especially important or surprising, or if you changed direction mid‐project because of preliminary findings.

Show Why it Matters

Always make sure to frame your analytics in the context of something people
already care about, usually a business problem, in order to make it compelling.

Depending on your audience, you may also need to clearly explain how the
analytics relates back to the problem and can impact the bottom line. You
typically don’t need to belabor the point, but you should give people a reason
to care about what you’re saying.

Make it Concrete

The human brain doesn’t do very well with abstract concepts. I don’t just mean nontechnical people; even if somebody has the background to follow a purely abstract discussion, their understanding will be immeasurably helped if you give their brain a few concrete mental hooks.

Often, the business case at hand provides all the concrete examples you need.
Other times, though, the business case is too convoluted to illustrate things
clearly, and you will want a simple toy problem instead.

A Picture is Worth a Thousand Words

One of the best pieces of advice I ever got for writing papers or giving technical talks was this: the heart of your presentation is one or a few key figures. The rest of the paper is just an extended caption describing how you generated those figures and how to interpret them.

In my opinion, the lack of pictures in some papers and presentations is often
a sign of laziness. It takes some planning to decide what figures would work best. Then there’s a lot of legwork in generating those figures, whether it’s manipulating a diagram in PowerPoint or making sure that the axes are set correctly on a plot. It’s a lot easier to just sit at the keyboard and churn out slides and pages of text, but that’s the wrong way to go about it.

Don’t Be Arrogant about Your Tech Knowledge

This should go without saying, but I feel compelled to bring it up because I
have seen it way too often: data scientists being jerks toward people who don’t
know as much math as they do. Obviously, this puts up a massive barrier to clear communication.

Remember that math is not synonymous with clear thinking: it’s just
a way of reducing that clear thinking to calculations so that you can get a
number out.

Make It Look Decent

I used to think that aesthetics was peripheral to clear communication. I felt like people should judge my work based on its technical merits, rather than how much I agonized over which shade of peach to use. So, it came as quite a shock when I first read about graphic design. I discovered that it isn’t an attempt to shoehorn artistic sentiments into technical work; it is a pragmatic way to make sure that communication is clear and compelling. You should use good design principles on a slide for the same reason you should use logarithmic axes when graphing some data: it helps to get the point across.

Presenting & Speaking

There is a plethora of resources on how to create a good PowerPoint presentation, as well as public speaking tips. You should also have some experience with this from prior work and academic settings, so we won’t dive into it here.

Written Reports

During our course, we will use Markdown. Depending on the team, Word, Google Docs, or LaTeX might be the preferred word processing tool.

The structure of a written report will vary depending on your intended audience, who you are in relation to them (team member, outside consultant, member of another team, etc.), and the problem you are addressing. However, most technical reports will have some subset of the following sections:

An executive summary. This is up to one page that summarizes what problem you were addressing and why, what you did, and what can be done with it. The emphasis should be on the takeaway points from a business perspective and how your work fits into the larger context of a company.
Background and motivation. Clearly frame how this work fits into a larger context for your likely audience. Depending on who you’re writing for, it might be a description of how this fits into the company’s business, the role it plays in software, or existing knowledge that it builds on.
Datasets used. Describe in brief which datasets are being used, where you got them from, and what they describe. You might add a little about which features you extract from them and any limitations of the data that should be pointed out. This section should be short and sweet; if there are a lot of gory details, put them in an appendix.
Analytical overview. Describe at a high level the analysis you performed or the algorithm you are studying. Focus on the mathematical model in the abstract, rather than how it is implemented in software (unless some key aspects of it were driven by software requirements, such as wanting it to be massively parallel). This section should probably have a diagram or two that illustrate what you’re talking about.
Results. Describe any results you got from your analysis, and present them
in graphical form. This is often the most important part of your report, so
keep it crisp and compelling, and make sure to tie it back to the context of
how these results are relevant. If you have a lot of results to report that
contain similar information (such as results for each feature), then include
only the most interesting ones in this section. Put the rest in an
appendix.
Software overview. This section often doesn’t need to be there and should be short if you do include it. It’s mostly relevant if your code is being plugged into somebody else’s code or some production system, or if your code might be regularly rerun in the future as datasets are updated. Describe how to run the code if it’s a stand‐alone analysis, or how it plugs into other software if that’s how it works. Running it should take at most a handful of lines; if it doesn’t, refactor the code and maybe combine it into a master script. Include a high‐level architecture of the code as a diagram, and describe which languages it is written in and what tools it uses.
Future work. Discuss natural next steps. This section often reads as boilerplate in practice, and sometimes, it’s ok if it’s extremely brief or even omitted entirely. However, it can also be an opportunity to point the way to significant new projects and to suggest others that should not be pursued. Data science is often used to “test the waters” and see whether something is worth pursuing as a larger project or clarify the scope that such a project should have.
Conclusions.
Appendices with technical details. For me, personally, up to half my report is liable to be appendices.
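
Since the course uses Markdown, one way to avoid starting from a blank page is to generate an outline of the sections above and delete whatever you don’t need. Here is a minimal sketch in Python. The function name `report_skeleton`, the `skip` parameter, and the "TODO" placeholders are illustrative assumptions, not part of any standard.

```python
# Section names follow the list above; the order and the "TODO" filler are
# illustrative choices, not a fixed standard.
REPORT_SECTIONS = [
    "Executive summary",
    "Background and motivation",
    "Datasets used",
    "Analytical overview",
    "Results",
    "Software overview",
    "Future work",
    "Conclusions",
    "Appendices",
]


def report_skeleton(title, skip=()):
    """Return a Markdown outline with one heading per report section."""
    lines = [f"# {title}", ""]
    for section in REPORT_SECTIONS:
        if section in skip:
            continue  # e.g. drop "Software overview" for a one-off analysis
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


# Hypothetical report title, used only to show the call.
print(report_skeleton("Customer churn analysis", skip={"Software overview"}))
```

The `skip` argument reflects the point that most reports use only a subset of these sections.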

Code Documentation

Whenever you provide a significant piece of code as a deliverable, it’s important to provide some kind of documentation of what it does and how to use it. Depending on the context, that can take a variety of forms, including the following:

A long comment at the top of a file.
A separate runbook or user manual. This is more common with extremely
large pieces of software or if you’re giving it to a client or another team.
Pages on a company Wiki.
Unit tests that can be run against the code.
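
To make two of these forms concrete, here is a small sketch that combines a long comment at the top of a file with unit tests that double as usage examples. The module name `scale.py` and the function `min_max_scale` are hypothetical, chosen only for illustration.

```python
"""scale.py (hypothetical module): rescale numeric features to the range [0, 1].

How to use:
    from scale import min_max_scale
    min_max_scale([10, 15, 20])   # [0.0, 0.5, 1.0]

How to check it works:
    python -m unittest scale

Known limitation: a constant column has no spread, so it is mapped to all zeros.
"""
import unittest


def min_max_scale(values):
    """Linearly map values so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


class MinMaxScaleTest(unittest.TestCase):
    """Each test doubles as a small, executable usage example."""

    def test_endpoints_map_to_zero_and_one(self):
        self.assertEqual(min_max_scale([10, 15, 20]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zeros(self):
        self.assertEqual(min_max_scale([7, 7, 7]), [0.0, 0.0, 0.0])
```

The header tells a reader how to call the code and how to verify it, and the tests show the expected behavior on inputs small enough to check by hand.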

No matter what form the documentation takes, though, the most important
thing is to explain how somebody can run the code and reproduce its
functionality. This lets them use the code themselves and
verify that it works the way that it’s supposed to.
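
One way to keep the "how to run it" instructions down to a single line is to give the code a small command‐line entry point. The sketch below uses Python’s standard `argparse` and `csv` modules. The analysis itself (the mean of one column) and the names `summarize`, `csv_path`, and `--column` are assumptions made up for this example.

```python
import argparse
import csv
import statistics


def summarize(csv_path, column):
    """Return the count and mean of one numeric column in a CSV file."""
    with open(csv_path, newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return {"count": len(values), "mean": statistics.mean(values)}


def main(argv=None):
    """Parse command-line arguments, run the analysis, and print the result."""
    parser = argparse.ArgumentParser(
        description="Summarize one column of a CSV file."
    )
    parser.add_argument("csv_path", help="path to the input CSV file")
    parser.add_argument("--column", required=True,
                        help="name of the numeric column")
    args = parser.parse_args(argv)  # argv=None means "read sys.argv"
    result = summarize(args.csv_path, args.column)
    print(f"{args.column}: n={result['count']}, mean={result['mean']:.2f}")
    return result
```

Saved in a file such as `run_analysis.py` (a hypothetical name) and finished with the usual `if __name__ == "__main__": main()` guard, the whole analysis runs as `python run_analysis.py sales.csv --column revenue`, and `--help` prints the usage text for free.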

Explaining how it operates under the hood is secondary, and going into too
much detail can be counterproductive. If you are delivering your source code
to somebody, it is generally reasonable to expect them to be able to read and
understand your code, and restating it all in English is superfluous. Telling
them how to run the software tells them where to start looking in the source
code, and they can follow the thread from there. It is good to give a brief architectural overview, pointing out which modules do what, but I wouldn’t go beyond that.

The one thing that is very nice to include, though, is a troubleshooting section.
Most pieces of software have some weird ways they can break down, or
parts that are known to be especially fragile, and these are highly specific to your
software. If this is the case, save your users potentially hours of debugging by telling them what they probably did wrong.
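
A troubleshooting section can also live partly in the code, so that the hint appears right next to the error. The following is a rough sketch of that idea. The failure modes and hints in `TROUBLESHOOTING` are invented for an imaginary analysis script; a real table would list the problems your own users actually hit.

```python
# Hypothetical failure modes for an imaginary analysis script.
TROUBLESHOOTING = {
    FileNotFoundError: "Input file not found. Check the path and run from the project root.",
    KeyError: "A required column is missing. Compare the CSV header with the documented column names.",
    ValueError: "A value could not be parsed as a number. Look for blanks or text in numeric columns.",
}


def explain(error):
    """Return the original error message plus a hint on the likely cause."""
    for error_type, hint in TROUBLESHOOTING.items():
        if isinstance(error, error_type):
            return f"{type(error).__name__}: {error}\nLikely cause: {hint}"
    return f"{type(error).__name__}: {error}\nNo known cause; please report this."


try:
    float("n/a")  # stand-in for a parsing step that fails on dirty data
except ValueError as error:
    print(explain(error))
```

Keeping the hints in one table makes it easy to copy the same text into the written troubleshooting section.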