Data Science Experience
Deloitte Services LP
Senior Data Scientist (Full-time), Jan 2015 - Sep 2020
Responsibilities | Projects | Proofs-of-concept | Competitions | Consultancies | Conference Presentations
Deloitte Services LP is one of five Deloitte U.S. firms. Deloitte Services comprises the corporate support areas that serve the four client-facing firms: Deloitte Consulting, Tax, Audit, and Advisory. My team sits in IT, one of several corporate support areas, and about two-thirds of my time is hands-on, project-related work.
Responsibilities
- Lead coordination of global teams of junior-level data scientists, data architects, and visualization specialists in building machine learning and predictive analytics solutions and prototypes for complex business problems in IT, Finance, Talent, Compliance, and other support and client-facing areas.
- Own end-to-end development of the data science components of large-scale software engineering projects and ad hoc prototype projects, from ideation and scoping through implementation and deployment, under an agile approach.
- Collaborate with business and client teams from various functions to define business objectives, identify value drivers and develop metrics to estimate their impact, identify project and business risks, and discuss strategic opportunities that extend from the data.
- Develop solution architectures (blueprints) that break a business problem into parts, prescribe a suitable model for each part, and recombine the parts into a holistic solution delivered through visualization apps such as Tableau or Qlik.
- Translate project objectives into a plan to coordinate data preparation and exploration, transformations, feature engineering, model performance and accuracy testing, cross-validation, bagging, boosting, etc.
- Carry out, individually and as a team member, the entire data science process, from data preparation and exploration through transformation, feature engineering, modeling, and interpretation.
- Participate in, contribute to, and sometimes lead daily and weekly routines such as scrum calls, storyboard and status updates, task (re)prioritizations, and slippage notifications.
- Calibrate model hyperparameters to fine-tune model performance and optimize pipeline run-time efficiency for deployment to production environments.
- Present actionable outcomes to business teams, client teams, and executives from different functions, with emphasis on business impacts, risk mitigation strategies, and new use case opportunities, all in a non-technical manner.
- Mentor and develop junior data scientists in the US, Ukraine, and India in machine learning and predictive analytics best practices, identify and prioritize their developmental needs, and provide input to their performance evaluations.
- Collaborate with Deloitte's client-facing practitioners from different functions to discuss feasibility of data science solution proposals, suitability of selected data models, and unrecognized strategic opportunities presented by the data.
- Pursue continuous learning and innovative thinking across assorted machine learning and predictive analytics best practices, and present these ideas to Deloitte research teams, data science and analytics CoPs, and publicly at national and regional conferences.
Projects (Team effort, duration 6-24 months each)
A Risk Management Solution for Anomaly Detection in Compliance (Unsupervised, 1 - 1.5 Billion Records)
- Developed scalable anomaly detection capabilities, using cluster analysis to identify employees whose expenses exceed the norms of their peer group and association rules to surface reciprocal expense patterns within employee cohorts. Benefits included a three-fold increase in non-compliance detection, streamlined processes, and greater job satisfaction for staff.
- Created a Naive Bayesian model to prototype a solution approach for a more proactive detection capability.
- Skills Used: HANA SQL, HANA PAL, HANA Studio, Tableau; Methods Used: Cluster Analysis, Association Rules, Naive Bayesian
Identifying Risk-taking Sub-cultures for Mitigating Risk to Firm Brand and Reputation (Supervised, < 1 Million Records)
- Used logistic regression to identify employee demographic factors as predictors of non-compliant behavior, biserial correlation analyses to detect relationships between violation types, and harmonic (Fourier) regression to identify frequency patterns for each violation type independent of the others.
- Carried out trust-building efforts to gain project sponsors' investment, and used the Delphi method to move vested business teams toward consensus on a contentious issue: creation of a unified metric for overall employee compliance behavior.
- Skills Used: HANA SQL, HANA Studio, SQL Server, R / R Studio, Tableau; Methods Used: Logistic Regression, Harmonic (Fourier) Regression, Correlation, A/B Testing, Delphi
Project Management Cost Control (Supervised, 10 - 15K Records)
- Used non-linear regression to model an observed S-shape relationship between time and cost measures of software development projects, leading to earlier detection of off-track projects and better cost control over firm IT investments.
- The non-linear model detected 85-95% of off-budget projects and outperformed the existing baseline linear model, as demonstrated by a 5-20% increase in R-squared.
- Skills Used: HANA SQL, HANA Studio, SQL Server, R / R Studio, Excel; Methods Used: Non-linear Regression
MS Azure Platform and Cloud Applications Stand-up (Unsupervised & Supervised, Terabyte Range)
- Conducted performance testing on MS Azure platform using cluster analysis and neural network models at assorted record levels across terabyte range.
- Played key role in evaluating software repository (Visual Studio), data catalog (Informatica) and automated machine learning (Dataiku) applications for licensing and acquisition objectives.
- Participated in high-level decisions related to data governance, data staging and operating procedures.
- Skills Used: Databricks, SparkR, PySpark, Visual Studio, GitHub, Informatica, Dataiku, DataRobot, Alteryx; Methods Used: Cluster Analysis, Neural Networks
Sales & Pursuits Excellence Program (Unsupervised & Supervised, Assorted Data Sizes)
- Winning Behaviors Program: Streamlined Python code and the data collection, ingestion, and massaging procedures for building a database of winning and losing client bids. Used hierarchical regression to tease out relationships under small-dataset conditions.
- Skills Used: Databricks, Spark SQL, Python, PySpark; Methods Used: NLP / Text Mining, Hierarchical Regression
Travel Insights for Cost Reduction (Supervised, 1 - 1.5 Billion Records)
- Applied an ensemble of supervised models to AMEX expense data to generate insights on travel-related costs, including airfare, hotel, ground transportation, and meals, with the objective of reducing firm travel expenses.
- Estimated 2-3% and 4-5% airfare cost savings by booking one and two days further in advance, respectively, using non-linear regression.
- Skills Used: Databricks, Spark SQL, PySpark, SparkR, R / R Studio, PowerBI; Methods Used: Random Forest, Multivariate Linear Regression, Non-linear Regression
Predicting Liquidity of Readily Marketable Securities (Supervised, Terabyte Range)
- Reviewed data staging of Bloomberg and Refinitiv financial data for optimization improvements, and assessed integrity of cluster analysis as means to streamline labeling procedures.
- Recommended supplemental modeling to add multiplicative (interaction) terms into a multivariate regression model in order to capture investors' joint consideration of price, volume and spread during investment decisions.
- Skills Used: SQL Server, MS Azure Machine Learning, Python, Qlik; Methods Used: Multivariate Regression, Cluster Analysis
Initiatives (Mostly individual effort, duration 1-3 months each, concurrent with project work)
Proofs-of-Concept
Productivity Implications of COVID-19 — A Forecast Modeling Approach
- Used a forecasting model on weekly time data to estimate productivity losses due to the pandemic.
- Skills Used: Databricks, Spark SQL, PySpark, SparkR; Methods Used: Forecasting
ROI Estimation of 'Auto-Time and Expense Entry' through Monte Carlo Simulation
- Used Monte Carlo simulation to estimate the firm-wide ROI from semi-automating a weekly task performed by the U.S. workforce.
- Skills Used: Excel; Methods Used: (Monte Carlo) Simulation
Competitions
- Achieved reasonable placement among 38 teams in a competition to predict Airbnb property prices using regression.
- Used kNN to fill in missing property-size values (in square feet), and a neural network to engineer a discretized (binned) feature from the "free-text" property description.
- Skills Used: R / R Studio; Methods Used: Regression, Neural Networks, k-Nearest Neighbors
Consultancies
- Application Insights Analytics: Worked with a Director to devise a scalable methodology for establishing a single score (metric) gauging application use and satisfaction. The Application Value Indicator (AVI) consisted of four performance metrics, standardized and weighted to reflect differences in importance across mandatory and optional use contexts.
- Unity in Adobe Analytics: Guided Unity team's effort to streamline and standardize Adobe metrics across Deloitte Services and to align those metrics with business strategy. Examined Adobe Analytics reporting capabilities for additional metrics.
- MS Teams for Collaboration Assessment: Worked with a Deloitte Consulting team to assess the feasibility of using data generated through MS Teams, such as email, attachments, text, and calendar entries, as a basis for measuring collaboration levels among employees. Created a graph model to prototype the solution approach.
Training Program
- Developed training programs using Udemy online courses, partially tailored to the Azure Spark platform. Subject areas included the data science process, select modeling methods, and Python and R programming.
- Organized learning tracks around the commonly stated learning objectives of employee personas.
Conference Presentations
- Privacy Preserving Data Mining. Accepted and presented at SAP ASUG New York City Chapter Meeting, December 6, 2016.
- The Robust, High-Performance Analytics Platform of SAP HANA. Accepted at SAP TechEd 2016, September 20, 2016.
- Decomposing Business Problems for Developing Analytics Solutions on HANA. Accepted and presented at ASUG New York City Chapter Meeting, March 8, 2016; presented in SAP ASUG Webcast, June 23, 2016.