Efficient CSV File Processing in Python: csv, pandas, and polars Compared

Compare Python's csv module, pandas, and polars for reading and writing CSV files. Covers large file processing, encoding issues, and performance optimization.

What is a CSV File?

CSV (Comma-Separated Values) is a text-based file format that stores data separated by commas. It is widely used for data exchange and as input/output for data analysis. Python offers the built-in csv module as well as third-party libraries like pandas and polars, and choosing the right tool for your use case and data scale is important.

1. The Standard Library csv Module

1.1 csv.reader: Basic Reading

csv.reader reads a CSV file row by row as lists.

import csv

with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)  # Get the header row
    print(f"Columns: {header}")
    for row in reader:
        print(row)  # ['value1', 'value2', 'value3']

1.2 csv.writer: Basic Writing

import csv

data = [
    ["name", "age", "city"],
    ["Alice", 30, "New York"],
    ["Bob", 25, "London"],
    ["Charlie", 35, "Tokyo"],
]

with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)

Specifying newline="" is important; without it, extra blank lines will appear on Windows.

1.3 DictReader / DictWriter: Dictionary-Based Access

Working with dictionaries keyed by column names improves code readability.

import csv

# Reading as dictionaries
with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"], row["age"])  # Access by column name

# Writing from dictionaries
fieldnames = ["name", "age", "city"]
rows = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "London"},
]

with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

1.4 Custom Delimiters and Quoting

The module supports TSV (tab-separated), semicolon-separated, and other formats.

import csv

# Reading a TSV file
with open("data.tsv", "r", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        print(row)

# Custom quoting behavior
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # Quote all fields
    writer.writerow(["name", "address, New York", "notes"])

2. CSV Operations with pandas

2.1 read_csv: Flexible Reading

pandas.read_csv comes with a wide range of options and is the most commonly used approach in practice.

import pandas as pd

# Basic reading
df = pd.read_csv("data.csv")

# Reading with commonly used options
df = pd.read_csv(
    "data.csv",
    encoding="utf-8",       # Encoding
    header=0,               # Header row index (None for no header)
    index_col=0,            # Column to use as index
    dtype={"age": int, "sales": float},   # Explicit type specification
    usecols=["name", "age", "sales"],     # Read only specific columns
    na_values=["N/A", "-", ""],           # Strings to treat as missing
    parse_dates=["date"],   # Columns to parse as dates
    nrows=1000,             # Read only first N rows
)

2.2 to_csv: Writing

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "city": ["New York", "London", "Tokyo"],
})

# Basic writing
df.to_csv("output.csv", index=False, encoding="utf-8")

# Commonly used options
df.to_csv(
    "output.csv",
    index=False,            # Exclude index column
    encoding="utf-8-sig",   # BOM-prefixed UTF-8 for Excel compatibility
    columns=["name", "city"],  # Specify columns to output
    sep="\t",               # Tab-separated output
    na_rep="N/A",           # Representation for missing values
)

2.3 Filtering Rows and Aggregation

import pandas as pd

df = pd.read_csv("sales.csv")

# Filtering by condition
ny_sales = df[df["city"] == "New York"]
high_sales = df[df["sales"] > 100000]

# Compound conditions
result = df[(df["city"] == "New York") & (df["age"] >= 30)]

# Group-by aggregation
summary = df.groupby("city").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    count=("sales", "count"),
).reset_index()

print(summary)

2.4 Merging Multiple CSVs

import pandas as pd
import glob

# Concatenate CSVs with the same schema
files = glob.glob("data/sales_*.csv")
dfs = [pd.read_csv(f) for f in files]
combined = pd.concat(dfs, ignore_index=True)

# Join two CSVs on a key
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
merged = pd.merge(orders, customers, on="customer_id", how="left")

3. CSV Operations with polars

polars is a high-performance DataFrame library written in Rust. It provides an API similar to pandas while being significantly faster on large datasets.

import polars as pl

# Reading
df = pl.read_csv("data.csv")

# Reading with type specifications
df = pl.read_csv(
    "data.csv",
    schema_overrides={"age": pl.Int32, "sales": pl.Float64},  # dtypes= in older polars versions
    encoding="utf8",
)

# Filtering and aggregation
result = (
    df.filter(pl.col("city") == "New York")
    .group_by("category")
    .agg(
        pl.col("sales").sum().alias("total_sales"),
        pl.col("sales").mean().alias("avg_sales"),
        pl.col("sales").count().alias("count"),
    )
)

# Writing
result.write_csv("output.csv")

Lazy Evaluation (Lazy API)

A key feature of polars is its Lazy API, which automatically optimizes queries.

import polars as pl

# Build query with lazy evaluation
result = (
    pl.scan_csv("large_data.csv")  # Read as LazyFrame
    .filter(pl.col("sales") > 10000)
    .group_by("city")
    .agg(pl.col("sales").sum())
    .sort("sales", descending=True)
    .collect()  # Execution happens here
)

scan_csv defers execution and lets the query optimizer push filters and column selections down into the scan, so only the data that is actually needed gets read. This makes it memory-efficient for large files.

4. Efficient Processing of Large CSV Files

When working with CSV files of several gigabytes, loading all data into memory may not be feasible.

4.1 pandas chunksize

import pandas as pd

# Read and process in chunks
chunk_size = 10000
results = []

for chunk in pd.read_csv("large_data.csv", chunksize=chunk_size):
    # Process each chunk independently
    filtered = chunk[chunk["sales"] > 10000]
    results.append(filtered)

# Combine results
final = pd.concat(results, ignore_index=True)

4.2 Memory-Efficient Processing with itertools and csv

For minimal memory usage, combine the csv module with generators.

import csv
from itertools import islice

def read_csv_chunks(filepath, chunk_size=10000):
    """Generator that reads a CSV file in chunks."""
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield header, chunk

# Example: compute total sales with minimal memory
total_sales = 0
for header, chunk in read_csv_chunks("large_data.csv"):
    sales_idx = header.index("sales")
    total_sales += sum(float(row[sales_idx]) for row in chunk)

print(f"Total sales: {total_sales:,.0f}")

4.3 Streaming with polars

import polars as pl

# Process large files with streaming mode
result = (
    pl.scan_csv("very_large_data.csv")
    .filter(pl.col("status") == "completed")
    .group_by("city")
    .agg(pl.col("sales").sum())
    .collect(engine="streaming")  # Execute in streaming mode (streaming=True in older polars)
)

5. Handling Encoding Issues

Encoding problems are common when working with CSV files, especially in multilingual environments.

5.1 Common Encodings

| Encoding | Use Case | Python Specification |
|---|---|---|
| UTF-8 | Standard encoding | encoding="utf-8" |
| UTF-8 BOM | Excel output | encoding="utf-8-sig" |
| Shift_JIS | Japanese Windows environments | encoding="shift_jis" |
| CP932 | Extended Shift_JIS | encoding="cp932" |
| Latin-1 | Western European | encoding="latin-1" |
| CP1252 | Windows Western European | encoding="cp1252" |

5.2 Automatic Encoding Detection

import chardet

def detect_encoding(filepath):
    """Automatically detect file encoding."""
    with open(filepath, "rb") as f:
        raw_data = f.read(10000)  # Read first 10KB
        result = chardet.detect(raw_data)
    return result["encoding"], result["confidence"]

# Usage
encoding, confidence = detect_encoding("unknown.csv")
print(f"Encoding: {encoding} (confidence: {confidence:.2%})")

import pandas as pd
df = pd.read_csv("unknown.csv", encoding=encoding)

5.3 Handling Encoding Errors

import pandas as pd

# Ignore errors (may lose data)
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="ignore")

# Replace errors with replacement character
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="replace")

# Try multiple encodings in sequence
def read_csv_auto(filepath):
    """Read CSV by trying multiple encodings."""
    encodings = ["utf-8", "utf-8-sig", "cp932", "shift_jis", "latin-1", "cp1252"]
    for enc in encodings:
        try:
            return pd.read_csv(filepath, encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"Failed to read file: {filepath}")

5.4 BOM-Prefixed UTF-8 for Excel Compatibility

Without a BOM (Byte Order Mark), Microsoft Excel often misinterprets UTF-8 CSV files and displays garbled characters; writing with utf-8-sig avoids this.

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "sales": [100, 200]})

# Write with BOM-prefixed UTF-8 (no garbled text in Excel)
df.to_csv("for_excel.csv", index=False, encoding="utf-8-sig")
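The same effect is available from the standard csv module, since the encoding is handled by open(). A small sketch that also verifies the BOM bytes at the start of the file:

```python
import csv

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) to the file
with open("for_excel.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "sales"])
    writer.writerow(["Alice", 100])

# Reading the raw bytes confirms the BOM is present
with open("for_excel.csv", "rb") as f:
    first_bytes = f.read(3)

print(first_bytes)  # b'\xef\xbb\xbf'
```

When reading such a file back, use encoding="utf-8-sig" as well so the BOM is stripped rather than appearing in the first column name.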

6. Practical Example: Data Preprocessing Pipeline

Here is a practical example combining multiple processing steps. For string pattern matching, regular expressions are invaluable.

import pandas as pd
import re

def process_sales_data(input_path, output_path):
    """Sales data preprocessing pipeline."""
    # 1. Read (with encoding handling)
    df = pd.read_csv(input_path, encoding="utf-8", parse_dates=["date"])

    # 2. Drop unnecessary columns
    df = df.drop(columns=["notes", "updated_at"], errors="ignore")

    # 3. Handle missing values
    df["sales"] = df["sales"].fillna(0)
    df["city"] = df["city"].fillna("Unknown")

    # 4. Type conversion
    df["sales"] = df["sales"].astype(int)

    # 5. Cleanse phone numbers with regex
    df["phone"] = df["phone"].apply(
        lambda x: re.sub(r"[^\d]", "", str(x)) if pd.notna(x) else ""
    )

    # 6. Filter (keep only rows with positive sales)
    df = df[df["sales"] > 0]

    # 7. Add aggregation columns
    df["month"] = df["date"].dt.to_period("M")

    # 8. Write output
    df.to_csv(output_path, index=False, encoding="utf-8")
    print(f"Processing complete: {len(df)} rows written")

process_sales_data("raw_sales.csv", "cleaned_sales.csv")

For repeated tasks, decorators can add logging or retry functionality to your processing functions.
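As a minimal sketch of that idea, a combined logging-and-retry decorator might look like the following (with_retry, max_attempts, and flaky_step are illustrative names, not part of any library):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def with_retry(max_attempts=3, delay=1.0):
    """Log each call and retry on failure (illustrative sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    logger.info("%s: attempt %d", func.__name__, attempt)
                    return func(*args, **kwargs)
                except Exception:
                    logger.exception("%s failed on attempt %d", func.__name__, attempt)
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

@with_retry(max_attempts=2, delay=0.1)
def flaky_step(path):
    # Stand-in for a CSV processing step that might fail transiently
    return f"processed {path}"

result = flaky_step("raw_sales.csv")
```

Applying the decorator leaves the processing function's signature and body untouched, which keeps the pipeline steps easy to test in isolation.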

7. Comparison: csv vs pandas vs polars

| Feature | csv (built-in) | pandas | polars |
|---|---|---|---|
| Installation | None (built-in) | pip install pandas | pip install polars |
| Speed | Slow | Medium | Fast |
| Memory Efficiency | Good (row-based) | Average | Good (columnar) |
| Feature Richness | Minimal | Very rich | Rich |
| Type Safety | None | Weak | Strong |
| Parallelism | None | None | Auto-parallelized |
| Lazy Evaluation | None | None | Yes (scan_csv) |
| Learning Curve | Low | Medium | Medium |
| Best For | Small/simple | General analysis | Large-scale data |

Performance Benchmark

Approximate read times for a CSV file with 1 million rows and 10 columns:

import time
import csv
import pandas as pd
import polars as pl

filepath = "benchmark_data.csv"  # 1M rows

# csv.reader
start = time.perf_counter()
with open(filepath, "r") as f:
    reader = csv.reader(f)
    data = list(reader)
csv_time = time.perf_counter() - start

# pandas
start = time.perf_counter()
df_pd = pd.read_csv(filepath)
pandas_time = time.perf_counter() - start

# polars
start = time.perf_counter()
df_pl = pl.read_csv(filepath)
polars_time = time.perf_counter() - start

print(f"csv.reader: {csv_time:.2f}s")
print(f"pandas:     {pandas_time:.2f}s")
print(f"polars:     {polars_time:.2f}s")
# Typical results:
# csv.reader: 3.50s
# pandas:     1.20s
# polars:     0.35s

8. Performance Tips

Key techniques for improving performance when working with large CSV files.

8.1 Read Only the Columns You Need

import pandas as pd

# Reading all columns (slow)
df = pd.read_csv("large.csv")

# Reading only needed columns (fast, less memory)
df = pd.read_csv("large.csv", usecols=["name", "sales"])

8.2 Specify Types Explicitly

pandas spends time on type inference, so specifying types explicitly speeds up reading large files.

import pandas as pd

dtypes = {
    "id": "int32",          # int64 -> int32 halves memory
    "name": "string",       # More efficient than object
    "sales": "float32",     # float64 -> float32 halves memory
    "category": "category", # Category type drastically reduces memory
}

df = pd.read_csv("large.csv", dtype=dtypes)
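To see why the category type helps, a quick sketch comparing the memory footprint of the same low-cardinality column stored as object versus category (the sample data is made up):

```python
import pandas as pd

# A column with few distinct values repeated many times:
# "category" stores each unique string once plus a small
# integer code per row, instead of one Python string per row
cities = ["New York", "London", "Tokyo"] * 100_000

as_object = pd.Series(cities, dtype="object")
as_category = pd.Series(cities, dtype="category")

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
```

The savings grow with the number of rows and shrink as the number of distinct values approaches the row count, so category is only worth it for genuinely repetitive columns.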

8.3 Convert to Parquet Format

If you repeatedly read the same CSV, converting it to Parquet format provides significantly faster reads.

import pandas as pd

# Convert CSV to Parquet (one-time)
df = pd.read_csv("large.csv")
df.to_parquet("large.parquet", engine="pyarrow")

# Subsequent reads from Parquet (several times to orders of magnitude faster)
df = pd.read_parquet("large.parquet")

8.4 Reading Multiple CSVs in Parallel

When processing many CSV files, parsing is largely CPU-bound, so multiprocessing (rather than async I/O) is usually the effective way to parallelize the work.

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import glob

def process_file(filepath):
    """Process an individual CSV file."""
    df = pd.read_csv(filepath)
    return df[df["sales"] > 10000]

# Parallel processing with multiprocessing
files = glob.glob("data/sales_*.csv")
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_file, files))

combined = pd.concat(results, ignore_index=True)

Summary

Choosing the right tool for CSV processing depends on your use case:

  • Small-scale, simple tasks: csv module (no dependencies)
  • Data analysis and transformation: pandas (most widely used)
  • Large-scale data, high performance: polars (auto-parallelization, lazy evaluation)

Encoding issues are unavoidable in multilingual environments, so adopting practices like automatic detection with chardet and using utf-8-sig for Excel compatibility will save you time. Additionally, converting frequently-read files to Parquet format can dramatically improve read speeds.
