Tag Archives: Data Science

Are Commuter Benefits Effective?

An interesting perspective on our commuting choices is offered by the Boston Globe Spotlight Team. The data shows that most employers’ commuter benefits are ineffective in swaying workers’ decisions to drive or take mass transit in and around Boston. The reason offered is that driving to work is a deep cultural habit that is difficult to break.

While I agree that public transportation cannot match the comfort of one’s car, there are definitely larger issues with the system as a whole. Its outdated equipment, chronic delays, and ineffective radial design may stand in the way of our desire to seek alternatives to sitting in traffic.

Take a Walk with these Statistical Podcasts

Whenever the weather permits, I choose to walk the 45 minutes to work along the streets of Boston. In addition to enjoying the sunshine, I have found a number of statistics-themed podcasts to listen to along the way that are insightful and enjoyable for listeners of any background. Here is a shortlist of some of my favorites:

Stats + Stories (NPR)

Stats + Stories, hosted by John Bailer and Richard Campbell, is “about the statistics behind the stories and the stories behind the statistics.” Weekly 30-minute episodes explore quantitative ideas in our daily lives and feature distinguished professionals and scholars presenting their research, work products, or just bubbling new ideas. Some recent topics I really enjoyed include the reproducibility crisis in biomedical research, the statistics behind aging, and ways to make forensic science more scientific.

More or Less: Behind the Statistics (BBC Radio 4)

More or Less is instantly addictive. Hosted by Tim Harford, a brilliant economist and award-winning journalist, this podcast is devoted to explaining the economic ideas behind everyday experiences. Episodes are issued several times a week and last between 10 and 25 minutes. The focus is on explaining, or debunking, the statistics reported in the news, political debates, or just daily life. I especially enjoyed the episode that raised awareness of the historical lack of women in clinical trials and the one that questioned the Chernobyl disaster death count (this topic hit home with me as I grew up in Ukraine).

Planet Money (NPR)

Planet Money is an “economics-for-dummies” podcast and one of my long-time favorites! Episodes are brilliant, engaging, and just the right length (20-25 min) for an easy listen. They come out twice a week and explore complex topics behind the latest events in the global economy in a fun and accessible way. Originally launched during the financial crisis of 2008 as a series of episodes trying to make sense of the Great Recession, today Planet Money is still going strong and continues to provide insights into everyday events with an economics angle.

In addition to my three favorites above, here are some runners-up:

Microsoft Research Podcast is a rather technical full-length podcast that comes out every week and covers the company’s latest cutting-edge research.

The R-Podcast: Authored by Eric Nantz, a Research Scientist at Eli Lilly, this podcast helps me keep up-to-date with the latest news in the R/RStudio world.

ASA Biopharm’s Podcast is a monthly podcast that features leaders from the pharmaceutical industry and regulatory agencies talking about upcoming statistical conferences and events and discussing current issues in biopharmaceutical statistics.

Anti-vaccination: what in the data are we talking about?

Anti-vaccination propaganda is testing our immunity against harmful misinformation. The scientific method for evaluating vaccine safety is losing credibility to more socialized minority opinion. And this anti-vaccination dilemma is a petri dish for exploring how we respond to the different ways data is communicated to the public.

Vaccine safety is a hot topic in The New York Times, and two recent articles dig into the underlying data. The first, “By the Numbers: Vaccines Are Safe,” summarizes key findings (and data) in a bulleted list such as the following:

Billions of doses of vaccines have been given to Americans in the 30 years of the injury program’s existence. During that time, about 21,000 people filed claims. Of 18,000 claims that have been evaluated so far, roughly two-thirds have been dismissed because the program determined that the evidence showed vaccines were not at fault.

The second, “Vaccine Injury Claims Are Few and Far Between,” is a much longer piece that includes the following chart:

The chart incorporates bullet-style annotations excerpted from the article, such as “After an exhaustive review, federal courts ruled in 2010 that vaccines do not cause autism.” And the chart adheres to good principles of design, such as consistent increments on both axes, that are often overlooked or manipulated by less attentive publications.

The chart does an excellent job of narration…there is a timeline with story points and measurements. Charts are great for reading and research, but difficult or awkward to produce if you’re in a heated debate at the local pub with someone spouting “facts” from questionable sources.

Bullet points tell a different story and the narrative walks us down a numeric path of size and scope. We start with a reminder that “billions” of doses of vaccines have been administered over 30 years and arrive at 18,000 evaluated claims, the majority of which (two-thirds) have been dismissed by the judicial system.

Bullet points work well at the local pub. You can accent each one with a stern poke in the general direction of your less informed drinking pal. The inevitable problem with the bullet point approach is that nuance is often lost. The judicial process involved in dismissing claims that vaccines cause autism is just one example. Your new friend is likely to counter that institutions are all simply puppets of Big Pharma before making reference to a lesser-known doctor who has “proven” the case exhaustively…on YouTube.

I do not know whether the overuse of charts or bullet points has diminished our capacity for dialog. I do suspect that the surplus of quick “facts” and scarcity of attention has eroded patience for the long tale…rational arguments built by weighing different and opposing facts in an inviting narrative best shared with friends over beer or coffee and in the spirit of good company.

Women in Data Science 2019 Cambridge Conference

On March 4th I had the pleasure of attending the third annual Women in Data Science conference in Cambridge, MA. After missing it last year (ironically, because my daughter decided to arrive a week before the conference!) and hearing so many great things about it from my colleagues, I was determined to attend this year and excited by the impressive list of distinguished women invited to present their latest research. The one-hour delay to the start, caused by a mild New England snowstorm, only amplified my (and everyone else’s) anticipation.

The conference began with opening remarks by Cathy Chute, the Executive Director of the Institute for Applied Computational Science at Harvard. She reminded us that WiDS started at Stanford in 2015 and is now officially a global movement with events happening all around the globe. The one in Cambridge was made possible by a fruitful partnership between Harvard, MIT, and Microsoft Research New England.

Liz Langdon-Gray followed with updates about the Harvard Data Science Initiative (HDSI), which was about to celebrate its two-year anniversary. She also informed us that the highly anticipated Harvard Data Science Review, a brainchild of my former statistics professor and advisor Xiao-Li Meng, is going to launch later this spring. This inaugural publication of HDSI will feature “foundational thinking, research milestones, educational innovations, and major applications” in the field of data science. One of its aims is to innovate in content and presentation style and, knowing Xiao-Li’s unparalleled talent for cleverly combining deep rigor with endless entertainment, I simply cannot wait to check out the first volume of the Review when it comes out!

The first invited speaker of the conference was Cynthia Rudin, an Associate Professor of Computer Science and Electrical and Computer Engineering at Duke. Prof. Rudin started with a discussion of the concept of variable “importance” and how most methods that test for it are usually model-specific. However, a variable can be important for one model but not for another. Therefore, a more interesting question to answer is whether a variable is important for any good model, or for a so-called “Rashomon set” of models.
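To make the idea concrete, here is a minimal sketch of my own (not Prof. Rudin’s method): permutation importance computed for the same features under two different, comparably reasonable models can produce quite different rankings, which is exactly why one might ask whether a variable matters for any model in the set rather than for one particular model.

# Toy illustration (not from the talk): variable importance is model-specific.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X, y)
    imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    # The ranking of the five features need not agree across the two models.
    print(type(model).__name__, np.round(imp.importances_mean, 3))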

Prof. Rudin then switched to an example that motivated her inquiry – ProPublica’s “Machine Bias” article, which claimed that COMPAS, a proprietary “black-box” algorithm that predicts recidivism and is used in sentencing in a number of states, is racially biased. After digging deeper into the details of the ProPublica analysis and trying to fit various models to the data herself, Prof. Rudin came to the conclusion that age and criminal history, not race, were by far the most important variables in the COMPAS algorithm! Even though it is still possible to find model classes that mimic COMPAS and utilize race, this variable’s importance is probably much smaller than what ProPublica claimed. Nevertheless, Prof. Rudin concluded that a “black-box” machine learning (ML) algorithm that decides a person’s fate is not an ideal solution, as it cannot be independently validated and might be sensitive to data errors. Instead, she advocated for the development of interpretable modeling alternatives.

We then heard from Stefanie Jegelka, an Associate Professor at MIT, who talked about tradeoffs between neural networks (NNs) that are wide vs. deep. Even though theory states that an infinitely wide NN with 1-2 layers can represent any reasonable function, deep networks have shown higher accuracy in recent classification competitions (e.g., ILSVRC). Therefore, she concluded, it was important to understand which relationships NNs can actually represent. Then Prof. Esther Duflo, a prominent economist from MIT, discussed a Double Machine Learning approach that uses the power of the ML apparatus to answer questions of a causal nature, akin to those that usually require a randomized clinical trial.

Anne Jackson, a Director of Data Science and Machine Learning at Optum, was the only industry speaker at the conference. She talked about building large-scale applications in industry settings: from data cleaning and understanding the context to incorporating the developed model into the business process. “What we really need,” she jokingly said, “is a ‘unicorn’ – a PhD in Math, with an MS in Computer Science, and a VP-level understanding of business – to get it right!” She also cautioned against blindly relying on algorithms, urging us instead to always translate models into the real world: for example, by comparing the stakes of false positives vs. false negatives, considering model drift, etc. Finally, Anne touched upon the futility of efforts to build and support custom software. Moving away from this approach, more and more businesses are starting to utilize “middleware”, a “layer of software that connects client and back-end systems and ‘glues’ programs together”.
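Her point about weighing false positives against false negatives can be sketched in a few lines. Everything below, including the costs and the scores, is a made-up illustration rather than anything from her talk: once the two error types carry different costs, the cheapest decision threshold is no longer obvious.

# Toy example: the best classification threshold depends on hypothetical error costs.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)                                 # ground truth
scores = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=y_true.size), 0, 1)

COST_FP, COST_FN = 1.0, 20.0   # made-up business costs per error

def expected_cost(threshold):
    pred = scores >= threshold
    fp = np.sum(pred & (y_true == 0))    # false positives
    fn = np.sum(~pred & (y_true == 1))   # false negatives
    return (COST_FP * fp + COST_FN * fn) / y_true.size

for t in (0.3, 0.5, 0.7):
    print(f"threshold={t:.1f}  expected cost per case={expected_cost(t):.3f}")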

Finally, the last, but most certainly not least, invited speaker was Prof. Yael Grushka-Cockayne, a Visiting Professor at HBS, whose research interests revolve around behavioral decision making (among many other things). In her fun and engaging talk, she emphasized the importance of going beyond a simple point estimate when it comes to prediction. She also reminded us of the effectiveness of crowdsourcing when it comes to forecasting, with such notable examples as the Good Judgment Project, where anyone can offer an opinion on the outcome of a given world event and be rewarded for getting it right, and the Survey of Professional Forecasters, which obtains macroeconomic predictions from a group of private-sector economists and produces quarterly reports with aggregated results. The last part of the talk was devoted to the results of Prof. Grushka-Cockayne’s successful collaboration with Heathrow Airport in applying a Big Data/ML approach to improve the passenger transfer experience, which did not sound like an easy feat! Ironically, the data that proved to be most reliable and was ultimately used in the model came from baggage transition records.
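The “wisdom of the crowd” effect she described is easy to demonstrate with simulated forecasters. This is my own toy simulation, not data from the talk: averaging many noisy probability forecasts tends to beat the typical individual forecaster on the Brier score.

# Toy simulation: crowd-averaged forecasts vs. individual forecasters.
import numpy as np

rng = np.random.default_rng(1)
n_events, n_forecasters = 200, 50
true_prob = rng.uniform(0.1, 0.9, size=n_events)
outcomes = (rng.random(n_events) < true_prob).astype(float)

# Each forecaster sees the true probability through personal noise.
forecasts = np.clip(true_prob + rng.normal(0, 0.2, size=(n_forecasters, n_events)), 0, 1)

def brier(p):
    """Mean squared error of probability forecasts against 0/1 outcomes."""
    return np.mean((p - outcomes) ** 2)

individual = np.mean([brier(f) for f in forecasts])
crowd = brier(forecasts.mean(axis=0))
print(f"average individual Brier score: {individual:.3f}")
print(f"crowd-average Brier score:      {crowd:.3f}")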

In addition to a strong lineup of featured speakers, the conference offered an excellent poster session, where students and postdocs demonstrated their ML applications in a diverse range of fields, including drug development, earthquake prediction, corruption detection, and many others. All in all, this long-awaited Cambridge WiDS conference most certainly exceeded my expectations, and I am eagerly looking forward to next year’s event.

The Analytics of Sustainability: Jaclyn Olsen and Caroleen Verly

On a quiet street just outside of the Square, Harvard’s Office for Sustainability occupies a decidedly green space.  The walls are literally a shade of green that hovers comfortably between lime and under-ripe avocado.  And if that becomes too perplexing, alternating blue walls (somewhere between ocean and indigo) provide visual relief.  My hosts Caroleen Verly and Jaclyn Olsen quickly explain that the colors were deliberately chosen as part of a broader mission to understand how the built environment affects health.  As I would soon learn, the Office for Sustainability views environmentalism with a wide-angle lens.

Jaclyn and Caroleen share an awe-inspiring picture of coordinated sustainability that extends well beyond the Harvard campus.  Back in 2008, the University set a campus-wide goal of reducing greenhouse gases 30% by 2016, from a 2006 baseline.  It was also the first sustainability goal that unified Harvard’s sprawling, decentralized operations towards a common objective with a clear deliverable and set of priorities. The only problem was that no one had yet agreed on what constituted a greenhouse gas or common standards of measurement.

One important role the Office for Sustainability plays is collecting and analyzing University-wide data for transparency and accountability, both internally and externally. This includes facilitating the collection and management of (large) volumes of data for participants to consume. When it came to implementing the 2006-2016 greenhouse gas reduction goal, OFS’ first step was to work with partners across campus to create a common measurement vocabulary that aligned participants in and outside of Harvard. Let us not forget that we are talking about aggregating data from disjoint “Emissions Accounting” systems that might include building data, scope 3 emissions data (e.g., air travel, food), and procurement data. We discuss the definition of “chicken” at length…does it only include the roasted variety? What about chicken parmigiana? The environmental difference between sourcing fresh vs. packaged meat is significant, and the challenge of creating a single definition of poultry is nothing to cluck about.
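Here is a tiny, entirely hypothetical sketch of what such a shared vocabulary looks like once it exists: different source systems label the same item differently, and a crosswalk lets the numbers roll up consistently. All item names, system names, and emission factors below are made up for illustration.

# Hypothetical "measurement vocabulary": map disjoint source labels to one category.
import pandas as pd

procurement = pd.DataFrame({
    "source_system": ["dining", "dining", "catering"],
    "item": ["chicken, roasted", "chicken parmigiana", "poultry, fresh"],
    "kg": [120.0, 45.0, 300.0],
})

crosswalk = {                        # agreed-upon common vocabulary (made up)
    "chicken, roasted": "poultry",
    "chicken parmigiana": "poultry",
    "poultry, fresh": "poultry",
}
emission_factors = {"poultry": 6.9}  # illustrative kg CO2e per kg, not a real factor

procurement["category"] = procurement["item"].map(crosswalk)
procurement["kg_co2e"] = procurement["kg"] * procurement["category"].map(emission_factors)
print(procurement.groupby("category")[["kg", "kg_co2e"]].sum())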

Jaclyn and Caroleen work with the Harvard and commercial communities to create new, credible measures for concepts such as healthy food or greenhouse gases. It’s an exercise in collaborating on a vision of what the ideal measurement should be, reaching consensus, and then using this vision to assess or fit the available data into an emerging jigsaw puzzle.

So how are things going?

The ten-year goal of reducing greenhouse gases by 30% was successfully achieved by 2016. Harvard is now tackling a new set of goals, striving to become fossil fuel-free by 2050, with an interim goal of becoming fossil fuel-neutral by 2026. Never mind that “Fossil Fuel Neutral” is a new term requiring the same level of definition that “Greenhouse Gas” needed in 2008. And this is only one component of Harvard’s overarching Sustainability Plan. The Office’s work extends well beyond Harvard, providing leadership to Boston’s Green Ribbon Commission and a consortium of higher-education institutions in the New England area.

So how do they do it?

One of the key ingredients of successful analytics initiatives is clear direction from the executive team. The goal to reduce greenhouse gases by 30% came from the top and echoed across campus. The same holds for Harvard’s new goals around fossil fuel usage.

A second ingredient that is often overlooked is passion.  In their work on- and off-campus, Jaclyn and Caroleen refer to a shared sense of environmental purpose among participants.

A third factor is alignment between organizational and data strategy.  The data group (and hub) is designed to satisfy the strategic goals established in the Sustainability Plan.

The fourth factor is raw talent and sustained curiosity. The Office employs expert analysts like Caroleen who are capable of the forensic work necessary to make sense of ambiguous data sources. With a clear sense of direction, she is able to model an ideal data set and work backwards with the data at hand to see what pieces fit and ultimately hang together with credibility.

Jaclyn Olsen is the Associate Director of the Harvard Office for Sustainability where she leads the development of new strategic initiatives and facilitates partnerships with faculty and other key University partners.

Caroleen Verly is an Analyst at the Harvard Office for Sustainability. Before joining OFS in 2013, Caroleen worked for the City of Cambridge to evaluate the feasibility of implementing a citywide curbside composting program.