Blog: Python is not always the best tool for the job

Eddie Elizondo | Last updated September 2024 | No Hype Analytics

In the last few years, I’ve noticed a trend among newer data scientists:

using Python for basic data analysis.

For all analytics work, the default should be the simplest and safest tool that does the job. Excel is often the best tool for initial exploration (the first time a file is opened) and basic data analysis; SQL is often best for building data pipelines. Writing more complex code for basic tasks only increases the chance of errors and bugs that stay hidden in code.

Python is increasingly becoming the only tool in some data scientists’ toolbox. For example:

  • Get a new dataset? Load it into Python, start with pandas describe(), and explore from there (sketched below).
  • Building a data pipeline? Do a bare SELECT statement in SQL, then do all manipulations in Python.
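
To make the first bullet concrete, here is a minimal sketch of that pandas-only first look (the file name is hypothetical). It produces summary numbers, but not the feel for the data that the first step below argues for:

```python
import pandas as pd

# Hypothetical file name: the pandas-only "first look" described above.
df = pd.read_csv("new_dataset.csv")

print(df.head())        # first few rows
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing-value count per column
```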

This is not good. For those of us who apprentice, coach, or teach data scientists: we need to do better. How can we do better?

  1. The first step any data scientist or data engineer (or anyone, really) should take when getting a new dataset is to inspect it visually (Excel is great for this) and get a feel for the data.

What is in it? What do those columns mean? What are the summary statistics? Can you quickly filter or cut the data on one or more attributes? Are there potential issues with data quality (zeros, impossible values, etc.)?

These are all things Excel is a great tool for. The intuitive understanding gained by doing this in Excel is unmatched in Python, R, or any programming language, where exploration is restricted to commands, viewing the head of the data, and so on. After getting a good understanding of the data and its quality, proceed to developing in a programming language.

For an analogy from software engineering: software engineers similarly do not jump straight into writing code. They spend at least a few minutes thinking and sketching out their approach and design on a notepad. Initial data exploration is the analogous step for a data person.

If the data is initially too big for Excel, take a representative subset (one month only, one region only, etc.) and open it in Excel.
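
For instance, here is a minimal sketch of carving out a one-month subset with pandas chunked reading; the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical file and column names. Stream the large CSV in chunks and
# keep only January 2024, so the result is small enough to open in Excel.
chunks = pd.read_csv("transactions.csv", chunksize=100_000, parse_dates=["order_date"])
subset = pd.concat(
    chunk[chunk["order_date"].dt.strftime("%Y-%m") == "2024-01"] for chunk in chunks
)
subset.to_csv("transactions_2024_01.csv", index=False)
```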

Obviously, Excel should not be used for processing data - anyone who has tried to import dates, or watched a string with leading zeros get converted to a number, knows this all too well - but this visual review step is not data processing.

  2. SQL should be used to its full potential. During the very initial stages of exploration or development, it makes sense to work on raw data directly in Python. However, I also see established data pipelines where data is SELECTed out of a SQL database and then extensively processed in Python - joins, aggregations, new columns, etc. This can be handled more easily and quickly with basic SQL (JOIN, GROUP BY, CASE) plus intermediate features like window functions, as in the sketch below.
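
As a sketch of what this looks like - the database, tables, and columns are hypothetical, and sqlite3 stands in for whatever engine the pipeline actually runs on - the join, aggregation, conditional column, and window function all happen in SQL, and Python only receives the shaped result:

```python
import sqlite3

import pandas as pd

# Hypothetical schema: orders(customer_id, amount) and
# customers(customer_id, region). All the heavy lifting is in the query.
query = """
SELECT
    o.customer_id,
    c.region,
    SUM(o.amount) AS total_spend,
    CASE WHEN SUM(o.amount) > 1000 THEN 'high' ELSE 'low' END AS tier,
    RANK() OVER (PARTITION BY c.region ORDER BY SUM(o.amount) DESC) AS region_rank
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY o.customer_id, c.region
"""

conn = sqlite3.connect("warehouse.db")  # hypothetical database
df = pd.read_sql(query, conn)           # Python receives the already-shaped result
conn.close()
```

Compared with a bare SELECT * followed by pandas merges and groupbys, this keeps the data movement small and leaves the set logic to the engine that is built for it.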

Data scientists should be apprenticed to develop a diverse toolbox over time (Python, SQL, Excel, Julia, a rigid object-oriented language, the Unix shell, Git, basic cloud deployment knowledge) and the judgment to pick the right tool. It’s like having a power drill in your toolbox: sometimes you need a screwdriver bit and sometimes a drill bit, and while you can get by with a single drill bit for a while, eventually you need to drill holes of different sizes. A diverse set of bits is what gets the job done.

Do you have questions, thoughts, feedback, comments? Please get in touch - I would love to hear from you: eddie@betteroptima.com