What is a CSV File?
CSV (Comma-Separated Values) is a text-based file format that stores data separated by commas. It is widely used for data exchange and as input/output for data analysis. Python offers the built-in csv module as well as third-party libraries like pandas and polars, and choosing the right tool for your use case and data scale is important.
1. The Standard Library csv Module
1.1 csv.reader: Basic Reading
csv.reader reads a CSV file row by row as lists.
```python
import csv

with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)  # Get the header row
    print(f"Columns: {header}")
    for row in reader:
        print(row)  # ['value1', 'value2', 'value3']
```
1.2 csv.writer: Basic Writing
```python
import csv

data = [
    ["name", "age", "city"],
    ["Alice", 30, "New York"],
    ["Bob", 25, "London"],
    ["Charlie", 35, "Tokyo"],
]

with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)
```
Specifying newline="" is important; without it, extra blank lines will appear on Windows.
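The reason: the csv module writes its own \r\n line terminator, and a text-mode file on Windows translates each \n to \r\n again, producing \r\r\n at the end of every row. A quick check with io.StringIO (which performs no newline translation) makes the terminator visible:

```python
import csv
import io

buf = io.StringIO()  # StringIO performs no newline translation
writer = csv.writer(buf)
writer.writerow(["name", "age"])

# The csv module terminates rows with \r\n itself
print(repr(buf.getvalue()))  # 'name,age\r\n'
```

Passing newline="" to open() disables the extra translation, so the \r\n written by the csv module reaches the file unchanged.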
1.3 DictReader / DictWriter: Dictionary-Based Access
Working with dictionaries keyed by column names improves code readability.
```python
import csv

# Reading as dictionaries
with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"], row["age"])  # Access by column name

# Writing from dictionaries
fieldnames = ["name", "age", "city"]
rows = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "London"},
]

with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```
1.4 Custom Delimiters and Quoting
The module supports TSV (tab-separated), semicolon-separated, and other formats.
```python
import csv

# Reading a TSV file
with open("data.tsv", "r", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        print(row)

# Custom quoting behavior
with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # Quote all fields
    writer.writerow(["name", "address, New York", "notes"])
```
2. CSV Operations with pandas
2.1 read_csv: Flexible Reading
pandas.read_csv comes with a wide range of options and is the most commonly used approach in practice.
```python
import pandas as pd

# Basic reading
df = pd.read_csv("data.csv")

# Reading with commonly used options
df = pd.read_csv(
    "data.csv",
    encoding="utf-8",                    # Encoding
    header=0,                            # Header row index (None for no header)
    index_col=0,                         # Column to use as index
    dtype={"age": int, "sales": float},  # Explicit type specification
    usecols=["name", "age", "sales"],    # Read only specific columns
    na_values=["N/A", "-", ""],          # Strings to treat as missing
    parse_dates=["date"],                # Columns to parse as dates
    nrows=1000,                          # Read only first N rows
)
```
2.2 to_csv: Writing
```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "city": ["New York", "London", "Tokyo"],
})

# Basic writing
df.to_csv("output.csv", index=False, encoding="utf-8")

# Commonly used options
df.to_csv(
    "output.csv",
    index=False,               # Exclude index column
    encoding="utf-8-sig",      # BOM-prefixed UTF-8 for Excel compatibility
    columns=["name", "city"],  # Specify columns to output
    sep="\t",                  # Tab-separated output
    na_rep="N/A",              # Representation for missing values
)
```
2.3 Filtering Rows and Aggregation
```python
import pandas as pd

df = pd.read_csv("sales.csv")

# Filtering by condition
ny_sales = df[df["city"] == "New York"]
high_sales = df[df["sales"] > 100000]

# Compound conditions
result = df[(df["city"] == "New York") & (df["age"] >= 30)]

# Group-by aggregation
summary = df.groupby("city").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    count=("sales", "count"),
).reset_index()
print(summary)
```
2.4 Merging Multiple CSVs
```python
import glob

import pandas as pd

# Concatenate CSVs with the same schema
files = glob.glob("data/sales_*.csv")
dfs = [pd.read_csv(f) for f in files]
combined = pd.concat(dfs, ignore_index=True)

# Join two CSVs on a key
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
merged = pd.merge(orders, customers, on="customer_id", how="left")
```
3. CSV Operations with polars
polars is a high-performance DataFrame library written in Rust. It provides an API similar to pandas while being significantly faster on large datasets.
```python
import polars as pl

# Reading
df = pl.read_csv("data.csv")

# Reading with type specifications
df = pl.read_csv(
    "data.csv",
    dtypes={"age": pl.Int32, "sales": pl.Float64},  # schema_overrides= in polars >= 1.0
    encoding="utf8",
)

# Filtering and aggregation
result = (
    df.filter(pl.col("city") == "New York")
    .group_by("category")
    .agg(
        pl.col("sales").sum().alias("total_sales"),
        pl.col("sales").mean().alias("avg_sales"),
        pl.col("sales").count().alias("count"),
    )
)

# Writing
result.write_csv("output.csv")
```
Lazy Evaluation (Lazy API)
A key feature of polars is its Lazy API, which automatically optimizes queries.
```python
import polars as pl

# Build query with lazy evaluation
result = (
    pl.scan_csv("large_data.csv")       # Read as LazyFrame
    .filter(pl.col("sales") > 10000)
    .group_by("city")
    .agg(pl.col("sales").sum())
    .sort("sales", descending=True)
    .collect()                          # Execution happens here
)
```
Because the full query is known before execution, scan_csv can push filters and column selections down into the read itself, so only the needed data is loaded. This keeps memory usage low even for large files.
4. Efficient Processing of Large CSV Files
When working with CSV files of several gigabytes, loading all data into memory may not be feasible.
4.1 pandas chunksize
```python
import pandas as pd

# Read and process in chunks
chunk_size = 10000
results = []
for chunk in pd.read_csv("large_data.csv", chunksize=chunk_size):
    # Process each chunk independently
    filtered = chunk[chunk["sales"] > 10000]
    results.append(filtered)

# Combine results
final = pd.concat(results, ignore_index=True)
```
4.2 Memory-Efficient Processing with itertools and csv
For minimal memory usage, combine the csv module with generators.
```python
import csv
from itertools import islice

def read_csv_chunks(filepath, chunk_size=10000):
    """Generator that reads a CSV file in chunks."""
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield header, chunk

# Example: compute total sales with minimal memory
total_sales = 0
for header, chunk in read_csv_chunks("large_data.csv"):
    sales_idx = header.index("sales")
    total_sales += sum(float(row[sales_idx]) for row in chunk)
print(f"Total sales: {total_sales:,.0f}")
```
4.3 Streaming with polars
```python
import polars as pl

# Process large files with streaming mode
result = (
    pl.scan_csv("very_large_data.csv")
    .filter(pl.col("status") == "completed")
    .group_by("city")
    .agg(pl.col("sales").sum())
    .collect(streaming=True)  # collect(engine="streaming") in newer polars versions
)
```
5. Handling Encoding Issues
Encoding problems are common when working with CSV files, especially in multilingual environments.
5.1 Common Encodings
| Encoding | Use Case | Python Specification |
|---|---|---|
| UTF-8 | Standard encoding | encoding="utf-8" |
| UTF-8 BOM | Excel output | encoding="utf-8-sig" |
| Shift_JIS | Japanese Windows environments | encoding="shift_jis" |
| CP932 | Extended Shift_JIS | encoding="cp932" |
| Latin-1 | Western European | encoding="latin-1" |
| CP1252 | Windows Western European | encoding="cp1252" |
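Note that decoding bytes with the wrong encoding often produces mojibake rather than an error, which is why specifying the encoding explicitly matters. A minimal illustration using only the standard library:

```python
# UTF-8 encodes "é" as two bytes; Latin-1 decodes each byte as a separate character
raw = "café".encode("utf-8")
print(raw)                    # b'caf\xc3\xa9'
print(raw.decode("latin-1"))  # 'cafÃ©' -- mojibake, but no exception raised
print(raw.decode("utf-8"))    # 'café'
```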
5.2 Automatic Encoding Detection
```python
import chardet
import pandas as pd

def detect_encoding(filepath):
    """Automatically detect file encoding."""
    with open(filepath, "rb") as f:
        raw_data = f.read(10000)  # Read first 10KB
    result = chardet.detect(raw_data)
    return result["encoding"], result["confidence"]

# Usage
encoding, confidence = detect_encoding("unknown.csv")
print(f"Encoding: {encoding} (confidence: {confidence:.2%})")

df = pd.read_csv("unknown.csv", encoding=encoding)
```
5.3 Handling Encoding Errors
```python
import pandas as pd

# Ignore errors (may lose data)
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="ignore")

# Replace errors with the replacement character
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="replace")

# Try multiple encodings in sequence
def read_csv_auto(filepath):
    """Read a CSV by trying multiple encodings."""
    # latin-1 decodes any byte sequence, so it acts as a last-resort fallback
    encodings = ["utf-8", "utf-8-sig", "cp932", "shift_jis", "latin-1", "cp1252"]
    for enc in encodings:
        try:
            return pd.read_csv(filepath, encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"Failed to read file: {filepath}")
```
5.4 BOM-Prefixed UTF-8 for Excel Compatibility
Microsoft Excel uses a BOM (Byte Order Mark) to recognize UTF-8; without it, Excel assumes a legacy code page and non-ASCII characters in a UTF-8 CSV appear garbled.
```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "sales": [100, 200]})

# Write with BOM-prefixed UTF-8 (no garbled text in Excel)
df.to_csv("for_excel.csv", index=False, encoding="utf-8-sig")
```
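The same effect is available with the standard library alone: utf-8-sig simply prepends the three BOM bytes on encode and strips them on decode.

```python
# "utf-8-sig" adds the BOM when encoding and strips it when decoding
data = "name,sales\n".encode("utf-8-sig")
print(data[:3])  # b'\xef\xbb\xbf' -- the UTF-8 BOM

assert data.decode("utf-8-sig") == "name,sales\n"
assert data.decode("utf-8").startswith("\ufeff")  # plain utf-8 keeps the BOM character
```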
6. Practical Example: Data Preprocessing Pipeline
Here is a practical example combining multiple processing steps. For string pattern matching, regular expressions are invaluable.
```python
import re

import pandas as pd

def process_sales_data(input_path, output_path):
    """Sales data preprocessing pipeline."""
    # 1. Read (with encoding handling)
    df = pd.read_csv(input_path, encoding="utf-8", parse_dates=["date"])

    # 2. Drop unnecessary columns
    df = df.drop(columns=["notes", "updated_at"], errors="ignore")

    # 3. Handle missing values
    df["sales"] = df["sales"].fillna(0)
    df["city"] = df["city"].fillna("Unknown")

    # 4. Type conversion
    df["sales"] = df["sales"].astype(int)

    # 5. Cleanse phone numbers with regex
    df["phone"] = df["phone"].apply(
        lambda x: re.sub(r"[^\d]", "", str(x)) if pd.notna(x) else ""
    )

    # 6. Filter (keep only rows with positive sales)
    df = df[df["sales"] > 0]

    # 7. Add aggregation column
    df["month"] = df["date"].dt.to_period("M")

    # 8. Write output
    df.to_csv(output_path, index=False, encoding="utf-8")
    print(f"Processing complete: {len(df)} rows written")

process_sales_data("raw_sales.csv", "cleaned_sales.csv")
```
For repeated tasks, decorators can add logging or retry functionality to your processing functions.
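As a sketch of that idea, here is a minimal retry decorator; flaky_read is a hypothetical stand-in for a CSV-reading step that can fail transiently (e.g. a file on a network share).

```python
import functools
import time

def retry(max_attempts=3, delay=0.0):
    """Decorator: retry the wrapped function when it raises an exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # Give up after the final attempt
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"count": 0}

@retry(max_attempts=3)
def flaky_read():
    """Hypothetical read step that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary failure")
    return "ok"

print(flaky_read())  # 'ok' after two silent retries
```

A logging decorator follows the same shape: wrap the call, record start/end and row counts around it, and return the result unchanged.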
7. Comparison: csv vs pandas vs polars
| Feature | csv (built-in) | pandas | polars |
|---|---|---|---|
| Installation | None (built-in) | pip install pandas | pip install polars |
| Speed | Slow | Medium | Fast |
| Memory Efficiency | Good (row-based) | Average | Good (columnar) |
| Feature Richness | Minimal | Very rich | Rich |
| Type Safety | None | Weak | Strong |
| Parallelism | None | None | Auto-parallelized |
| Lazy Evaluation | None | None | Yes (scan_csv) |
| Learning Curve | Low | Medium | Medium |
| Best For | Small/simple | General analysis | Large-scale data |
Performance Benchmark
Approximate read times for a CSV file with 1 million rows and 10 columns:
```python
import csv
import time

import pandas as pd
import polars as pl

filepath = "benchmark_data.csv"  # 1M rows

# csv.reader
start = time.perf_counter()
with open(filepath, "r") as f:
    reader = csv.reader(f)
    data = list(reader)
csv_time = time.perf_counter() - start

# pandas
start = time.perf_counter()
df_pd = pd.read_csv(filepath)
pandas_time = time.perf_counter() - start

# polars
start = time.perf_counter()
df_pl = pl.read_csv(filepath)
polars_time = time.perf_counter() - start

print(f"csv.reader: {csv_time:.2f}s")
print(f"pandas:     {pandas_time:.2f}s")
print(f"polars:     {polars_time:.2f}s")

# Typical results:
# csv.reader: 3.50s
# pandas:     1.20s
# polars:     0.35s
```
8. Performance Tips
Key techniques for improving performance when working with large CSV files.
8.1 Read Only the Columns You Need
```python
import pandas as pd

# Reading all columns (slow)
df = pd.read_csv("large.csv")

# Reading only needed columns (fast, less memory)
df = pd.read_csv("large.csv", usecols=["name", "sales"])
```
8.2 Specify Types Explicitly
pandas spends time on type inference, so specifying types explicitly speeds up reading large files.
```python
import pandas as pd

dtypes = {
    "id": "int32",           # int64 -> int32 halves memory
    "name": "string",        # More efficient than object
    "sales": "float32",      # float64 -> float32 halves memory
    "category": "category",  # Category type drastically reduces memory
}
df = pd.read_csv("large.csv", dtype=dtypes)
```
8.3 Convert to Parquet Format
If you repeatedly read the same CSV, converting it to Parquet format provides significantly faster reads.
```python
import pandas as pd

# Convert CSV to Parquet (one-time)
df = pd.read_csv("large.csv")
df.to_parquet("large.parquet", engine="pyarrow")

# Subsequent reads from Parquet (several times to orders of magnitude faster)
df = pd.read_parquet("large.parquet")
```
8.4 Reading Multiple CSVs in Parallel
When processing many CSV files simultaneously, asynchronous processing or multiprocessing can help.
```python
import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def process_file(filepath):
    """Process an individual CSV file."""
    df = pd.read_csv(filepath)
    return df[df["sales"] > 10000]

# Parallel processing with multiprocessing
files = glob.glob("data/sales_*.csv")
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_file, files))
combined = pd.concat(results, ignore_index=True)
```
Summary
Choosing the right tool for CSV processing depends on your use case:
- Small-scale, simple tasks – the `csv` module (no dependencies)
- Data analysis and transformation – `pandas` (most widely used)
- Large-scale data, high performance – `polars` (auto-parallelization, lazy evaluation)
Encoding issues are unavoidable in multilingual environments, so adopting practices like automatic detection with chardet and using utf-8-sig for Excel compatibility will save you time. Additionally, converting frequently-read files to Parquet format can dramatically improve read speeds.
Related Articles
- Python Async Processing: A Practical asyncio Guide - Learn the fundamentals of asynchronous processing for efficiently handling large numbers of CSV files.
- Practical Python Decorators - Add logging and retry functionality to CSV processing functions using decorators.
- Complete Guide to Python Regular Expressions - Master the regex fundamentals essential for CSV data cleansing.
References
- csv — CSV File Reading and Writing (Python Official Docs)
- pandas.read_csv (pandas Official Docs)
- Polars User Guide
- chardet — Character Encoding Detection