Brian Brewer | InfoLibrarian Data & Analytics Blog

Impedance Mismatch | Data Science & Traditional IT Methodologies

Because this is a very broad topic, I’m writing this in the context of typical data-driven businesses.

A paradigm shift occurs when people in a given field discard the ideas and rules that had been the basis for their entire way of thinking. Because people are naturally averse to that kind of change, paradigm shifts are generally difficult.

I think there is a general misconception about what a data scientist is. This causes confusion and counter-productive assumptions, which lead to bad decisions.

I recently read a post in which a graduate engineer asked, “isn’t all science data science?” Wow! Yes, all science is data science, but mainstream IT thinking and paradigms are generally not science-based. That is an entirely separate topic unto itself.

It comes down to this: technology is starting to move towards science-based approaches to getting value from data. Introducing the “Data Scientist”.

What is Data Science?

Data science is a science-based analytical approach applied to the computer data generated by various systems, databases and software applications.

Data Science is about making observations, formulating a hypothesis and then building experiments to support or refute that hypothesis (by testing it against a null hypothesis).

Data Science is about applying strict scientific methods to discover unknown facts about data without introducing preconceived notions into those experiments.

Data science differs completely from existing decision support, BI and BA activities: with data science, we don’t design the answers into the questions, or bias them, as we do with traditional BI (this is not a negative, it just is). Data Science asks a lot of questions, formulates hypotheses, creates experimental models and then supports or refutes those hypotheses by setting up various hypothesis testing scenarios. As a rule, no assumptions should ever be made during the process!
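As a minimal sketch of one such hypothesis testing scenario, the Python below runs a permutation test using only the standard library. The function name, the two groups and their numbers are all made up for illustration; they are assumptions, not data from a real project. The null hypothesis is that both groups come from the same distribution, so shuffling the group labels should produce differences as large as the observed one fairly often.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same distribution,
    so the group labels are interchangeable.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements (made-up numbers for illustration only).
old_process = [20.1, 22.4, 19.8, 21.0, 20.7, 22.9, 19.5, 21.3]
new_process = [23.5, 24.1, 22.8, 25.0, 23.9, 24.6, 22.7, 25.3]

alpha = 0.05  # significance level, chosen before looking at the data
p_value = permutation_test(old_process, new_process)
reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}, reject null hypothesis: {reject_null}")
```

A permutation test is only one of many possible tests. It is used here because it needs few distributional assumptions and no third-party libraries.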

This approach bumps up against most traditional IT practices. It is not hard to see why when looking at this diagram.

Figure: from Wikipedia

Where other approaches fail, data scientists use their skills and disciplined scientific approaches to discover otherwise unknown insights in these data. This is not a replacement for BI or traditional reporting; it is a value-add proposition.

Data Mining

Some data mining methodologies, such as KDD, SEMMA and CRISP-DM, are also being used by data scientists today:

https://www.researchgate.net/publication/220969845_KDD_semma_and_CRISP-DM_A_parallel_overview

Helping your Data Scientist

It’s important to give your data scientists 3 key things:

Lots of raw data!
Descriptions of your pain points.
Answers to their many questions. It’s their job to ask a lot of questions about these data.

A day in the Life of a Data Scientist (labor intensive & no shortcuts)

Data scientists use tools such as data science notebooks, Python and/or R to start the exploratory data analysis: wrangling the data and documenting observations. Making observations on the data is a highly detailed, challenging and time-consuming effort.

Typically, many libraries are involved. Some are used for wrangling the various datasets and formats, de-duplication and de-nulling. Others are used for math, statistics and plots, for documenting density and correlation observations, and for feature engineering, feature scaling, data normalization etc. If there is no statistical significance or confidence indicator showing that the data is sufficient to run an experiment, the process starts over.
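To give a flavor of these steps, here is a minimal sketch in plain Python of de-duplication, de-nulling, min-max feature scaling and a correlation observation. In practice a data scientist would more likely reach for dedicated libraries. The records, field names and helper functions below are hypothetical and exist only for illustration.

```python
from math import sqrt

# Hypothetical raw records (made-up values); None marks a missing field.
raw_rows = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": 45, "spend": 210.0},
    {"id": 2, "age": 45, "spend": 210.0},   # exact duplicate
    {"id": 3, "age": None, "spend": 95.0},  # missing age
    {"id": 4, "age": 29, "spend": 80.0},
    {"id": 5, "age": 52, "spend": 260.0},
]


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_nulls(rows):
    """Keep only rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


clean = drop_nulls(deduplicate(raw_rows))
scaled_age = min_max_scale([row["age"] for row in clean])
scaled_spend = min_max_scale([row["spend"] for row in clean])
r = pearson(scaled_age, scaled_spend)
print(f"{len(clean)} clean rows, age/spend correlation: {r:.3f}")
```

Dropping rows with missing values is the simplest choice, not necessarily the right one. Whether to drop, impute or flag them is itself an observation to document.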

This all happens before building any prediction experiments, not to mention building production data science and model training/re-training pipelines.

The process involves many iterations of many iterations of many iterations.

Here is a little data science humor: A data scientist spends 90% of their time dealing with problems with the data and the other 10% complaining about the time they spend dealing with data problems.

But because of the scientific discipline, steps cannot be skipped. The exploratory data analysis (EDA) is a critical step in the process.

Be nice to your Data Scientist

Data scientists can be a tremendous value-add, helping to find a wealth of information in data that is unobtainable using other methods.
Data scientists should not be constrained by traditional IT methodologies or approaches.
Data scientists do not necessarily need previous “Business Domain knowledge” to do their job.
Data Scientists are usually not a lazy lot (they work best with minimal supervision!)
Data Scientists tend to be highly disciplined professionals.

What Skills Make up a Data Scientist?

Data scientists are trained and educated in mathematics and science disciplines, and have strong programming skills.

The 8 Key IT Data Science Skills

Scientific discipline (Relevant Education).
Programming Skills.
Statistics.
Machine Learning.
Mathematics, Multivariable Calculus & Linear Algebra.
Data Wrangling.
Data Visualization, Plotting & Communication.
Software Engineering.

Conclusion

Data science is about bringing science to traditional IT data processing, à la Advanced Analytics. As a valuable addition to our data practices, data scientists working together with BI and data teams will deliver new opportunities and better insights into their data.

However, it is clear that data science is not a quick and dirty process that can be managed the way we traditionally approach the development of software and BI apps. Data science is therefore not for everyone; businesses need to evaluate the true costs and benefits of investing in data science initiatives internally.

3 Keys to Managing a Data Catalog

Today I will answer three questions that will help you get traction and succeed with your metadata initiatives.

1. What are the key elements needed?
2. Where does metadata fit in?
3. How to get started?

Knowing the answers to these three questions can pay big dividends for your organization. With a modest budget, a good starting point, and a clear idea of the problem you are trying to solve, you can be successful in establishing a metadata management program for your company.

#1 You need buy-in, budget, and timeframe.

This does not necessarily mean you need to boil the Atlantic Ocean. What about a smaller initiative to get things started? A novel idea. The alternative is a high risk of failure. Or even worse, a project that suffers from the inertia of never getting off the ground in the first place.

#2 Where does metadata management fit in?

A client of mine stated “It scares me what might be involved to do a better job managing all this metadata, but it scares me more what will happen if we don’t!”
Perception that our systems are special or too complex.
Who is going to maintain all of this?
Negativity.
Industry has a way of complicating the issue… (metadata metadata…theorists)

Taxonomy + Ontology = Zoology

Again, where do we get started? Do we need to comply with the most sophisticated theories out there? Is having basic metadata impact analysis a good value for low cost and quick turn-around? Is this not a good place to start?

Identify a pain point and solve that pain only. This is the best place to start. Identifying where the metadata fits in requires only one thing: identifying a problem you are trying to solve.
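To show how small “basic metadata impact analysis” can be, here is a minimal sketch in Python. It assumes the lineage metadata is available as a simple mapping from each data asset to the assets it feeds. The asset names and the `downstream_impact` helper are hypothetical, made up for illustration.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
lineage = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["dw.dim_customer"],
    "dw.dim_customer": ["report.churn", "report.revenue"],
    "erp.orders": ["dw.fact_orders"],
    "dw.fact_orders": ["report.revenue"],
}


def downstream_impact(lineage, changed_asset):
    """Return every asset reachable from changed_asset, sorted by name."""
    impacted = set()
    queue = deque([changed_asset])
    while queue:  # breadth-first walk over the lineage graph
        current = queue.popleft()
        for dependent in lineage.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return sorted(impacted)


# "If we change crm.customers, what else do we need to look at?"
impacted = downstream_impact(lineage, "crm.customers")
print(impacted)
```

Even a mapping this simple answers the question behind one concrete pain point, without first committing to an elaborate metadata model.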

#3 How to get started?

Don’t over-complicate: avoid trying to define an elaborate set of roles, which introduces a whole new strain of resource requirements and terminologies.
Keep it simple (to start).
Phased approach.
The whole idea is to know what you have and where it is!
Interviewing individuals, having them describe their job including the systems and tools they use should make it possible to make deductions about the role metadata plays in their job.
It should be driven by a business need. What is the problem you are trying to solve?
Pilot projects tend to be much more successful, and provide a good foundation for future expansion into other business areas in a phased manner.
Find a project like data quality (DQ), master data management (MDM) or data governance (DG) and roll the metadata initiative up as part of the greater project. This is a win/win.
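Keeping it simple can start with very little. The sketch below, in Python, records only what you have, where it is and who owns it. The `CatalogEntry` and `DataCatalog` names, their fields and the sample entries are all hypothetical, not a reference to any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One data asset: what it is, where it lives, who owns it."""
    name: str
    location: str
    owner: str
    description: str = ""


class DataCatalog:
    """A deliberately small inventory keyed by asset name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def where_is(self, name):
        """Return the location of an asset, or None if it is not catalogued."""
        entry = self._entries.get(name)
        return entry.location if entry else None

    def owned_by(self, owner):
        """Return the names of all assets with the given owner, sorted."""
        return sorted(e.name for e in self._entries.values() if e.owner == owner)


# Hypothetical entries for a pilot focused on customer data.
catalog = DataCatalog()
catalog.register(CatalogEntry("customer_master", "dw.dim_customer", "sales-ops",
                              "Cleansed customer records"))
catalog.register(CatalogEntry("customer_raw", "crm.customers", "sales-ops"))
catalog.register(CatalogEntry("orders", "dw.fact_orders", "finance"))

print(catalog.where_is("customer_master"))
print(catalog.owned_by("sales-ops"))
```

More fields and roles can be added in later phases, once the pilot has shown which ones are needed.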

Finding the Bigger Project

Find a project that needs, or even better can be driven by, metadata management.
Scope and problem definition should be easier to define.
The budget is already there.
Example: an ETL tool will transform our data into the data warehouse, and several consultants will be needed to do the work. It is all part of the data warehouse project.
Example: a data governance and compliance project. Metadata is a part of that and should be underwritten as such.
Carve out part of the bigger project to manage the metadata for the project.
Consider funding small pilot projects.

Today I identified the three main aspects that will help you to better approach your metadata management initiatives.

What are the key elements needed?
Where does the metadata fit in?
How to get started by staying focused for the win?

Conclusion

Remember that a clear understanding of the problem you are trying to solve holds the key, and that you are more likely to integrate metadata management in your company by finding a project of which metadata is an integral part. Example: Master Data, with a Customer Data focus.
