pandas: The Python Library That Made Data Analysis Accessible

Published July 7, 2026 · 10 min read · Python, pandas, data science, data analysis, machine learning

pandas is the default Python library for data analysis, used by hedge funds, NASA, and data teams across industry and academia. Created in 2008 at a hedge fund, it brought the DataFrame to Python and transformed how the language handles structured data.

If you work with data in Python, you work with pandas. It is not one option among many; it is the default tool for loading, cleaning, transforming, and analysing structured data in virtually every professional data environment. Understanding pandas is not optional for data analysts, data engineers, or anyone building data pipelines in Python.

## Where pandas Came From

pandas was created by Wes McKinney in 2008 while he was working as a quantitative analyst at AQR Capital Management, a hedge fund in Greenwich, Connecticut. McKinney was frustrated with the available tools for working with financial time-series data in Python. R had good data manipulation tools; Python did not. He built pandas to fill that gap.

The name comes from "panel data", a term from econometrics for multi-dimensional data sets that track the same subjects over time.

McKinney released pandas as open source in 2009. He later left AQR, wrote the book *Python for Data Analysis*, and co-founded the company Ursa Computing. pandas is now maintained by a large community of contributors, with fiscal sponsorship from NumFOCUS, a non-profit that supports open-source scientific computing projects.

pandas 2.0 was released in April 2023, introducing optional Apache Arrow-backed data types that can improve memory efficiency and performance. NumPy remains the default backend. pandas 2.2 followed in January 2024.

## The Core Data Structures

pandas has two primary data structures: the **DataFrame** and the **Series**.

A **Series** is a one-dimensional labelled array. It is similar to a column in a spreadsheet or a dictionary with an ordered index. Each element has a label (the index) and a value.

A **DataFrame** is a two-dimensional labelled data structure with rows and columns. Think of it as a spreadsheet, a SQL table, or a dictionary of Series objects that all share the same index. This is the central object in almost all pandas workflows.
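
As a minimal sketch of the two structures described above, the snippet below builds two Series and combines them into a DataFrame. The names (`prices`, `volumes`, `trades`) and the values are made up for illustration; only the `pd.Series` and `pd.DataFrame` constructors come from pandas itself.

```python
import pandas as pd

# Illustrative data only: the labels and numbers below are invented.
days = ["mon", "tue", "wed"]

# A Series: one-dimensional values, each with a label (the index).
prices = pd.Series([101.5, 99.2, 103.8], index=days, name="price")
volumes = pd.Series([1200, 950, 1430], index=days, name="volume")

# Label-based lookup works much like a dictionary.
tuesday_price = prices["tue"]

# A DataFrame: a dictionary of Series that all share the same index.
trades = pd.DataFrame({"price": prices, "volume": volumes})

# Selecting one column gives back a Series with that shared index.
volume_column = trades["volume"]

print(trades)
print(trades.dtypes)  # each column keeps its own data type
```

Building the DataFrame from Series rather than from plain lists is a choice made here to show the shared index; passing a dictionary of lists to `pd.DataFrame` is just as common.
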
A DataFrame can hold columns of different types simultaneously: integers, floats, strings, booleans, and datetime objects can all coexist in the same table. This mirrors how real-world data actually looks.

## What pandas Does Well

**Loading data.** pandas can read CSV, Excel, JSON, Parquet, SQL query results, HTML tables, and more with a single function call. `pd.read_csv()` alone has replaced countless manual parsing scripts across the industry.

**Filtering and selecting.** You can filter rows based on column values, select subsets of columns, and combine conditions using boolean indexing. These operations are concise, readable, and fast.

**Handling missing data.** Real data is almost never complete. pandas provides built-in tools for detecting, filling, and dropping missing values, operations that would otherwise require repetitive custom code.

**Grouping and aggregation.** The `groupby()` operation splits a DataFrame by one or more columns, applies an aggregation function (sum, mean, count, max, or a custom function), and combines the results. This is the pandas equivalent of SQL's GROUP BY, and it expresses in a single line grouped operations that would otherwise take dozens of lines of manual looping.

**Merging and joining.** pandas supports SQL-style inner joins, outer joins, left joins, and right joins between DataFrames. It also supports concatenation: stacking DataFrames vertically or horizontally.

**Pivot tables.** The `pivot_table()` function creates Excel-style pivot summaries from raw data. It is used extensively in business reporting and exploratory analysis.

**Time series.** pandas has first-class support for datetime data, including date parsing, resampling, rolling windows, timezone handling, and date arithmetic. This reflects McKinney's original use case in financial data.

## Who Uses pandas

JPMorgan Chase, Goldman Sachs, and Two Sigma use pandas extensively in their quantitative research workflows.
NASA's Jet Propulsion Laboratory uses pandas for mission data analysis. The New York Times data journalism team has used pandas for investigative reporting. Virtually every data team at technology companies such as Google, Meta, Amazon, Netflix, and Spotify has pandas in its analytics stack.

In academia, pandas is the standard tool for empirical research involving structured data in economics, social science, biology, and public health. It appears in thousands of published research workflows.

## pandas in the Job Market

According to analysis of job postings on LinkedIn and Indeed, pandas is the most frequently mentioned Python library in data analyst and data scientist job descriptions. It appears more often than NumPy, scikit-learn, or any web framework. For data engineering roles, pandas proficiency is often listed as a requirement alongside SQL and Spark.

The Stack Overflow Developer Survey consistently ranks pandas among the most-used Python libraries by professional developers. Its PyPI download count regularly exceeds 200 million per month, making it one of the most downloaded Python packages in existence.

## Learning pandas vs Learning SQL

A common question for people entering data roles is whether to learn SQL or pandas first. The practical answer is both, because they solve overlapping but distinct problems. SQL runs inside databases and is essential for extracting data. pandas runs in Python and is essential for transforming and analysing data once it has been extracted. In most professional workflows, SQL is used to get the data and pandas is used to work with it.

## What to Know Next

pandas is built on top of NumPy, the numerical computing library we cover in the next article in this series. Understanding how NumPy arrays work explains why many pandas operations behave the way they do, and why pandas can handle millions of rows efficiently on a laptop.
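
As a rough preview of that relationship, the sketch below assumes the default NumPy-backed data types (not the optional Arrow-backed ones) and uses a made-up `orders` table with randomly generated values. It shows column arithmetic running over a million rows without an explicit Python loop, and the NumPy array that sits underneath a numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one million rows of invented quantities and unit prices.
row_count = 1_000_000
rng = np.random.default_rng(seed=0)
orders = pd.DataFrame({
    "quantity": rng.integers(1, 10, size=row_count),
    "unit_price": rng.uniform(1.0, 100.0, size=row_count),
})

# Column arithmetic is vectorised: it operates on whole arrays at once,
# with no Python-level loop over individual rows.
orders["total"] = orders["quantity"] * orders["unit_price"]

# With the default backend, a numeric column converts to a NumPy array.
totals = orders["total"].to_numpy()
print(type(totals).__name__, totals.dtype, totals.shape)
```
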