Women in Data Science 2019 Cambridge Conference

On March 4th I had the pleasure of attending the third annual Women in Data Science conference in Cambridge, MA. After missing it last year (ironically, because my daughter decided to arrive a week before the conference!) and hearing so many great things about it from my colleagues, I was determined to attend this year and excited by the impressive list of distinguished women invited to present their latest research. The one-hour delay of the start due to a mild New England snowstorm only amplified my (and everyone else’s) anticipation.

The conference began with opening remarks by Cathy Chute, the Executive Director of the Institute for Applied Computational Science at Harvard. She reminded us that WiDS started at Stanford in 2015 and is now officially a global movement with events happening all around the world. The one in Cambridge was made possible by a fruitful partnership between Harvard, MIT, and Microsoft Research New England.

Liz Langdon-Gray followed with updates about the Harvard Data Science Initiative (HDSI), which was about to celebrate its two-year anniversary. She also informed us that the highly anticipated Harvard Data Science Review, a brainchild of my former statistics professor and advisor Xiao-Li Meng, is going to launch later this spring. This inaugural publication of HDSI will feature “foundational thinking, research milestones, educational innovations, and major applications” in the field of data science. One of its aims is to innovate in content and presentation style and, knowing Xiao-Li’s unparalleled talent for cleverly combining deep rigor with endless entertainment, I simply cannot wait to check out the first volume of the Review when it comes out!

The first invited speaker of the conference was Cynthia Rudin, an Associate Professor of Computer Science and Electrical and Computer Engineering at Duke. Prof. Rudin started with a discussion of the concept of variable “importance” and how most methods that test for it are model-specific. However, a variable can be important for one model but not for another. Therefore, a more interesting question is whether a variable is important for any good model, that is, across the so-called “Rashomon set” of models that all fit the data about equally well.

Prof. Rudin then switched to an example that motivated her inquiry: an article on Machine Bias in ProPublica, which claimed that COMPAS, a proprietary “black-box” algorithm that predicts recidivism and is used for sentencing convicts in a number of states, is racially biased. After digging deeper into the details of the ProPublica analysis and trying to fit various models to the data herself, Prof. Rudin came to the conclusion that age and criminal history, not race, were by far the most important variables in the COMPAS algorithm! Even though it is still possible to find model classes that mimic COMPAS and utilize race, this variable’s importance is probably much smaller than what was claimed by ProPublica. Nevertheless, Prof. Rudin concluded that a “black-box” machine learning (ML) algorithm that decides a person’s fate is not an ideal solution, as it cannot be independently validated and might be sensitive to data errors. Instead, she advocated for the development of interpretable modeling alternatives.
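The idea of judging a variable across a whole set of good models, rather than within a single one, can be illustrated with a small sketch. This is not Prof. Rudin’s method or the COMPAS analysis; it is a toy under stated assumptions: synthetic data in which `x2` nearly duplicates `x1`, a hypothetical grid of linear models, a tolerance `epsilon` that defines which models count as “good”, and plain permutation importance as the importance measure. All function names are made up for the illustration.

```python
import random

def mse(model, rows):
    """Mean squared error of a linear model (a, b) on rows of (x1, x2, y)."""
    a, b = model
    return sum((y - (a * x1 + b * x2)) ** 2 for x1, x2, y in rows) / len(rows)

def permutation_importance_x1(model, rows, rng):
    """Increase in MSE after shuffling x1 across rows, which breaks its link to y."""
    shuffled = [r[0] for r in rows]
    rng.shuffle(shuffled)
    permuted = [(s, x2, y) for s, (_, x2, y) in zip(shuffled, rows)]
    return mse(model, permuted) - mse(model, rows)

def rashomon_importance_range(rows, epsilon=0.05, seed=0):
    """Range of x1's importance over all grid models within epsilon of the best loss."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(21)]          # coefficients 0.0, 0.1, ..., 2.0
    candidates = [(a, b) for a in grid for b in grid]
    losses = {m: mse(m, rows) for m in candidates}
    best = min(losses.values())
    good = [m for m in candidates if losses[m] <= best + epsilon]  # toy "Rashomon set"
    scores = [permutation_importance_x1(m, rows, rng) for m in good]
    return len(good), min(scores), max(scores)

def make_data(n=500, seed=1):
    """Synthetic data (an assumption for the demo): x2 is almost a copy of x1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = x1 + rng.gauss(0, 0.1)
        y = x1 + x2 + rng.gauss(0, 0.5)
        rows.append((x1, x2, y))
    return rows

if __name__ == "__main__":
    n_good, lowest, highest = rashomon_importance_range(make_data())
    print(f"{n_good} near-optimal models; importance of x1 ranges "
          f"from {lowest:.2f} to {highest:.2f}")
```

Because `x2` can stand in for `x1`, some near-optimal models lean heavily on `x1` while others barely use it, so a single model’s importance score says little about the variable itself.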

We then heard from Stefanie Jegelka, an Associate Professor at MIT, who talked about the tradeoffs between neural networks (NNs) that are wide vs. deep. Even though theory states that an infinitely wide NN with one or two layers may represent any reasonable function, deep networks have achieved higher accuracy in recent classification competitions (e.g., ILSVRC). Therefore, she concluded, it is important to understand what relationships NNs can actually represent. Then Prof. Esther Duflo, a prominent economist from MIT, discussed a Double Machine Learning approach that uses the power of the ML apparatus to answer questions of a causal nature, akin to those that usually require a randomized clinical trial.

Anne Jackson, a Director of Data Science and Machine Learning at Optum, was the only industry speaker at the conference. She talked about building large-scale applications in industry settings: from cleaning the data and understanding the context to incorporating the developed model into the business process. “What we really need”, she jokingly said, “is a ‘unicorn’, a PhD in Math, with MS in Computer Science, and a VP-level understanding of business, to get it right!” She also cautioned against blindly relying on algorithms and urged always translating models into the real world instead, for example by comparing the stakes of false positives vs. false negatives, considering model drift, etc. Finally, Anne touched upon the futility of building and supporting custom software. Moving away from this approach, more and more businesses are starting to utilize “middleware”, which is a “layer of software that connects client and back-end systems and ‘glues’ programs together”.
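To make the point about false positives vs. false negatives concrete, here is a minimal sketch of how the relative stakes can change the decision threshold applied to a model’s scores. The scores, labels, and cost values are invented for the illustration, and `total_cost` and `best_threshold` are hypothetical helper names, not anything presented in the talk.

```python
def total_cost(threshold, scored, cost_fp, cost_fn):
    """Cost of flagging every case whose score is >= threshold."""
    cost = 0.0
    for score, is_positive in scored:
        flagged = score >= threshold
        if flagged and not is_positive:
            cost += cost_fp          # false positive: flagged but actually negative
        elif not flagged and is_positive:
            cost += cost_fn          # false negative: missed a real positive
    return cost

def best_threshold(scored, cost_fp, cost_fn):
    """Try each observed score (plus 'flag nothing') and keep the cheapest cutoff."""
    candidates = sorted({s for s, _ in scored}) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(t, scored, cost_fp, cost_fn))

# Made-up (score, true label) pairs standing in for a model's output.
scored = [(0.1, False), (0.3, False), (0.4, True),
          (0.6, False), (0.7, True), (0.9, True)]

if __name__ == "__main__":
    print("misses are expensive:", best_threshold(scored, cost_fp=1, cost_fn=5))
    print("false alarms are expensive:", best_threshold(scored, cost_fp=5, cost_fn=1))
```

With the same scores, the cutoff moves down when a missed case is the costlier error and up when a false alarm is, which is one simple way of translating a model into the business context.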

Finally, the last, but most certainly not least, invited speaker was Prof. Yael Grushka-Cockayne, a Visiting Professor at HBS, whose research interests revolve around behavioral decision making (among many other things). In her fun and engaging talk, she emphasized the importance of going beyond a simple point estimate when it comes to prediction. She also reminded us of the effectiveness of crowdsourcing in forecasting, with such notable examples as the Good Judgment Project, where anyone can provide their opinion on the outcome of a certain world event and get rewarded for getting it right, and the Survey of Professional Forecasters, which obtains macroeconomic predictions from a group of private-sector economists and produces quarterly reports with aggregated results. The last part of the talk was devoted to the results of Prof. Grushka-Cockayne’s successful collaboration with Heathrow Airport in applying a Big Data/ML approach to improve the passenger transfer experience, which did not sound like an easy feat! Ironically, the data that proved to be most reliable and was ultimately used in the model came from baggage transition records.
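The two ideas in this talk, pooling many forecasters and reporting more than a point estimate, can be combined in a few lines. The sketch below is only an illustration: the forecast values are invented, `summarize_crowd` is a hypothetical helper, and the aggregation shown is not the actual methodology of the Good Judgment Project or the Survey of Professional Forecasters.

```python
import statistics

def summarize_crowd(forecasts):
    """Aggregate individual forecasts into a point estimate plus a spread."""
    # quantiles(..., n=10) returns the nine cut points between deciles;
    # the first is the 10th percentile and the last is the 90th.
    deciles = statistics.quantiles(forecasts, n=10, method="inclusive")
    return {
        "mean": statistics.mean(forecasts),
        "median": statistics.median(forecasts),
        "p10": deciles[0],
        "p90": deciles[-1],
    }

# Made-up forecasts (say, next quarter's growth in percent) from eight people.
forecasts = [1.8, 2.0, 2.1, 2.3, 2.4, 2.6, 2.9, 3.1]

if __name__ == "__main__":
    summary = summarize_crowd(forecasts)
    print(f"point estimate (median): {summary['median']:.2f}")
    print(f"80% of forecasts fall between {summary['p10']:.2f} and {summary['p90']:.2f}")
```

Reporting the 10th to 90th percentile range alongside the median conveys how much the forecasters disagree, which a single number hides.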

In addition to a strong lineup of featured speakers, the conference offered an excellent poster session, where students and postdocs demonstrated their ML applications in a wide range of fields, including drug development, earthquake prediction, corruption detection, and many others. All in all, this long-awaited Cambridge WiDS conference most certainly exceeded my expectations, and I am eagerly looking forward to next year’s event.

The Cambridge Women in Data Science Conference

I was recently browsing the website of Harvard’s Institute for Applied Computational Science and saw there was a Women in Data Science conference. I was so excited to attend that I set a reminder on my phone. As soon as the conference went live, I forwarded a link to all of my colleagues. Not much later, I started receiving feedback that the conference was sold out! It was really thrilling to see that there was so much interest. The conference was a great opportunity to hear how some women in data science are leveraging machine learning to transform healthcare and advocating for open science to foster public debate about the big data algorithms that are influencing society. Here are some highlights:

When Regina Barzilay, MIT Professor of Electrical Engineering and Computer Science, was a breast cancer patient at MGH, she could see how machine learning could help uncover insights in the vast collection of patient information, including mammogram scans, pathology reports, and family history. Today, she’s in remission and collaborates with MGH to train models that detect high-risk lesions sooner than ever imagined and estimate their likelihood of being cancerous, reducing the number of unnecessary surgeries.

Heather Bell, who leads a digital and analytics department in biopharma, gave a big-picture talk on how various companies are using artificial intelligence to streamline the otherwise long and expensive R&D pipeline. One challenge is that it can take several months to recruit participants for clinical trials. In one example she shared, Clinithink developed an NLP platform that converts written doctor notes into structured data, which can then be used to rapidly identify participants based on trial criteria. The platform was shown to recruit 2.5 times more participants in 5% of the time. In another example Heather provided, wearables and web applications are now proving effective at monitoring health between doctor visits. In one study, lung cancer patients responded to a brief questionnaire once a week about various health metrics like appetite and weight. The device algorithm, developed by SIVAN Innovation, generated an alert to the patients’ doctors in the case of a concerning change. Of the intervention cohort, 50% more were alive 7 months longer than the regular follow-up cohort. The trial was stopped early because the effect was so large.

Francesca Dominici, HSPH Professor of Biostatistics and Co-Director of Harvard’s Data Science Initiative, shared her powerful longitudinal study demonstrating an association between exposure to air pollution and mortality risk among all Medicare beneficiaries (~67 million per year). Because the study sparked media headlines and supports more stringent environmental policy at a time when that policy is hotly debated, Francesca espouses principled data science and an open science framework in which data are publicly available and results are reproducible. While privacy is an inevitable concern in an open science framework, it’s worth considering Cynthia Dwork’s invention, differential privacy, an effective tool that goes beyond anonymization to protect individuals’ identities in research databases. Coincidentally, Cynthia was also a speaker at WiDS, where she discussed her latest endeavor: developing a metric for an algorithm that classifies people as fairly as possible.

Cynthia discussed how subjective fairness is; in that sense the metric must be culturally aware, which is another rationale for open science.
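For readers curious about how differential privacy, mentioned above, protects individuals in a research database, here is a minimal sketch of one standard building block, the Laplace mechanism applied to a counting query. It is an illustration only, not something shown at the conference: the records, the `epsilon` value, and the helper names `laplace_noise` and `private_count` are all assumptions, and a real deployment would also need careful privacy-budget accounting.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records matching the predicate.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is used here.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Made-up records: (age, lives_in_high_pollution_area).
records = [(67, True), (72, False), (81, True), (69, True), (75, False)]

if __name__ == "__main__":
    rng = random.Random(42)
    noisy = private_count(records, lambda r: r[1], epsilon=1.0, rng=rng)
    print(f"noisy count of people in high-pollution areas: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy; the released number stays useful in aggregate while masking whether any one individual is in the data.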

Rounding out an exciting day of data science, Tamara Broderick, MIT Assistant Professor of Computer Science, discussed achieving accurate Bayesian inference with optimization; I encourage you to watch her talk, as well as some of the others I’ve highlighted. It was inspirational to hear these accomplished women in data science present some of their impactful research. I am really looking forward to next year’s conference and I hope you are too.

To stay up to date on Women in Data Science (WiDS), go to https://www.widscambridge.org.