dataset Module

Core dataset class for epidemiological data management.

This module provides the Dataset class, a pandas DataFrame wrapper optimized for epidemiological analysis with additional functionality for cleaning, transforming, and analyzing public health data.

Class

class episia.data.dataset.Dataset(data, low_memory=True, **kwargs)

Bases: object

Dataset class for epidemiological data.

A pandas DataFrame wrapper with epidemiological-specific methods and optimizations for memory and performance.

Attributes:
df

Underlying pandas DataFrame

metadata

Dictionary with dataset metadata

history

List of transformations applied

optimized

Whether data types have been optimized

__getitem__(key)

Allow dictionary-like access to columns.

__init__(data, low_memory=True, **kwargs)

Initialize Dataset from various data sources.

Parameters:
  • data (DataFrame | Dict | str | Path) – DataFrame, dictionary, or file path

  • low_memory (bool) – Optimize memory usage if True

  • **kwargs – Additional arguments for pd.read_csv if data is path

Raises:

DataError – If data cannot be loaded

__len__()

Return type:

int

__repr__()

Return repr(self).

Return type:

str

__setitem__(key, value)

Allow dictionary-like assignment to columns.
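
The dictionary-like access described above can be pictured with a small stand-in class. This is a minimal sketch, not the episia implementation: the class name MiniDataset, the shape of the history entries, and the choice to record column assignments in history are all assumptions made for illustration.

```python
import pandas as pd


class MiniDataset:
    """Hypothetical stand-in for Dataset; not the episia implementation."""

    def __init__(self, data):
        self.df = pd.DataFrame(data)
        self.history = []  # one dict per transformation (assumed layout)

    def __getitem__(self, key):
        # Delegate column lookup to the wrapped DataFrame.
        return self.df[key]

    def __setitem__(self, key, value):
        # Delegate column assignment, then log it (logging is an assumption).
        self.df[key] = value
        self.history.append({"operation": "setitem", "column": key})

    def __len__(self):
        # Number of rows in the wrapped DataFrame.
        return len(self.df)


ds = MiniDataset({"cases": [10, 15, 20], "deaths": [0, 1, 2]})
ds["cfr"] = ds["deaths"] / ds["cases"]  # read two columns, write a third

# A list of dicts converts directly to a DataFrame, one row per entry.
history_df = pd.DataFrame(ds.history)
```

Delegating to the wrapped DataFrame keeps the wrapper thin: any key pandas accepts for column selection works unchanged.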

aggregate_by_date(date_column='date', freq='W', agg_func='sum', inplace=False)

Aggregate data by date frequency.

Parameters:
  • date_column (str) – Name of date column

  • freq (str) – Frequency string ('D', 'W', 'M', 'Y')

  • agg_func (str | Dict) – Aggregation function or dict of {column: function}

  • inplace (bool) – If True, modify this Dataset in place; otherwise return a new Dataset

Returns:

Aggregated Dataset

Return type:

Dataset
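
For orientation, the kind of aggregation this method describes can be reproduced with plain pandas. The sketch below assumes the method behaves like a resample on the date column; the helper name aggregate_by_date_sketch and the sample counts are hypothetical, and the library's own implementation may differ.

```python
import pandas as pd

# Hypothetical daily surveillance counts (illustrative values only).
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-02", periods=14, freq="D"),
    "cases": [10, 15, 20, 5, 0, 3, 7, 12, 8, 4, 6, 9, 11, 2],
    "deaths": [0, 1, 2, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
})


def aggregate_by_date_sketch(df, date_column="date", freq="W", agg_func="sum"):
    """Resample df to the given frequency; a stand-in, not the library code."""
    indexed = df.assign(**{date_column: pd.to_datetime(df[date_column])})
    indexed = indexed.set_index(date_column)
    # agg accepts either one function name or a {column: function} dict.
    return indexed.resample(freq).agg(agg_func).reset_index()


weekly = aggregate_by_date_sketch(daily)
mixed = aggregate_by_date_sketch(
    daily, agg_func={"cases": "sum", "deaths": "max"}
)
print(weekly)
```

With pandas, the 'W' frequency produces weeks ending on Sunday, so the 14 days starting Monday 2023-01-02 fall into exactly two bins.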

calculate_incidence(cases_col, population_col=None, population_value=None, time_period=1.0)

Calculate incidence rates.

Parameters:
  • cases_col (str) – Column with case counts

  • population_col (str | None) – Column with population at risk

  • population_value (float | None) – Constant population value if no column

  • time_period (float) – Time period for rate

Returns:

Series with incidence rates

Return type:

Series
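
The exact formula is not stated here. A common definition of an incidence rate is cases divided by population at risk times the observation period, and the sketch below assumes that definition. The function name incidence_rate, the sample counts, and the per-100,000 scaling are illustrative choices, not library behavior.

```python
def incidence_rate(cases, population, time_period=1.0):
    """Cases per person per unit of time (assumed formula)."""
    if population <= 0 or time_period <= 0:
        raise ValueError("population and time_period must be positive")
    return cases / (population * time_period)


# Hypothetical weekly counts with a constant population at risk,
# mirroring the population_value option.
weekly_cases = [10, 15, 20]
population_value = 50_000
rates = [incidence_rate(c, population_value) for c in weekly_cases]

# Rates are often reported per 100,000 population; that scaling is our
# choice here, not something the method is documented to apply.
per_100k = [round(r * 100_000, 1) for r in rates]
print(per_100k)
```

When the population differs by row, the same arithmetic applies element-wise with a population column in place of the constant.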

clean(drop_na='any', drop_duplicates=True, inplace=False)

Clean the dataset by removing missing values and duplicates.

Parameters:
  • drop_na (bool | str | List[str]) – How to handle missing values:
      - True or 'any': drop rows with any NaN
      - 'all': drop rows where every value is NaN
      - List: drop rows with NaN in the listed columns

  • drop_duplicates (bool) – Remove duplicate rows

  • inplace (bool) – If True, modify this Dataset in place; otherwise return a new Dataset

Returns:

Cleaned Dataset

Return type:

Dataset
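
The three drop_na modes map naturally onto pandas' own dropna options. The following sketch shows that mapping on a hypothetical five-row frame. It assumes missing rows are removed before duplicates, which may not match the library's order of operations.

```python
import pandas as pd

nan = float("nan")

# Hypothetical raw data with missing values and one duplicated row.
raw = pd.DataFrame({
    "cases": [10, 10, nan, nan, 7],
    "deaths": [0, 0, 1, nan, nan],
})

any_missing = raw.dropna(how="any")        # like drop_na=True or 'any'
all_missing = raw.dropna(how="all")        # like drop_na='all'
by_column = raw.dropna(subset=["cases"])   # like drop_na=['cases']

# Drop rows with any NaN first, then exact duplicate rows (assumed order).
cleaned = raw.dropna(how="any").drop_duplicates()
print(cleaned)
```

The strict 'any' mode keeps only fully observed rows, so it can discard far more data than the column-list form.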

copy()

Create a copy of the Dataset.

Return type:

Dataset

create_2x2_table(exposure_col, outcome_col, strata_col=None)

Create 2x2 contingency table from dataset columns.

Parameters:
  • exposure_col (str) – Exposure variable column

  • outcome_col (str) – Outcome variable column

  • strata_col (str | None) – Stratification variable (optional)

Returns:

Dictionary with table(s) and statistics

Return type:

Dict
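
Which statistics the returned dictionary contains is not listed here. As a worked illustration of what a 2x2 table typically supports, the sketch below counts the four cells from hypothetical (exposed, case) records and derives the risk ratio and odds ratio. The cell labels a, b, c, d follow the usual textbook convention, not a documented episia layout.

```python
from collections import Counter

# Hypothetical line list: one (exposed, case) pair per person.
records = [(1, 1)] * 30 + [(1, 0)] * 70 + [(0, 1)] * 10 + [(0, 0)] * 90

counts = Counter(records)
a = counts[(1, 1)]  # exposed, case
b = counts[(1, 0)]  # exposed, non-case
c = counts[(0, 1)]  # unexposed, case
d = counts[(0, 0)]  # unexposed, non-case

risk_exposed = a / (a + b)       # 30 / 100
risk_unexposed = c / (c + d)     # 10 / 100
risk_ratio = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)   # cross-product ratio

print(f"RR = {risk_ratio:.2f}, OR = {odds_ratio:.2f}")
```

A stratified analysis would repeat the same counts within each level of the stratification variable.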

describe_epidemiological()

Generate epidemiological description of dataset.

Returns:

DataFrame with epidemiological summary

Return type:

DataFrame

filter(condition, inplace=False)

Filter dataset based on condition.

Parameters:
  • condition (str | Dict | Series) – Filter condition, given as one of:
      - Query string
      - Dictionary of {column: value}
      - Boolean Series

  • inplace (bool) – If True, modify this Dataset in place; otherwise return a new Dataset

Returns:

Filtered Dataset

Return type:

Dataset
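
Each of the three condition forms has a direct plain-pandas equivalent, sketched below on a hypothetical frame. The dictionary form is read here as an equality match on every listed column. That reading is an assumption, since the matching semantics are not spelled out.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "cases": [10, 15, 20, 5],
})

# 1. Query string, evaluated against column names.
by_query = df.query("cases > 10")

# 2. Dictionary of {column: value}: keep rows equal on every listed
#    column (assumed semantics).
criteria = {"region": "north"}
mask = pd.Series(True, index=df.index)
for column, value in criteria.items():
    mask &= df[column] == value
by_dict = df[mask]

# 3. Boolean Series aligned with the rows.
by_mask = df[df["cases"] >= 10]
```

The query-string form is the most compact. A boolean Series is the most flexible, because it can be built from any computation over the rows.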

get_history()

Get transformation history as DataFrame.

Return type:

DataFrame

optimize_types()

Optimize DataFrame column types to reduce memory usage.

Returns:

self for method chaining

Return type:

Dataset
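
The specific conversions applied are not documented here. A typical approach to shrinking a DataFrame is to downcast numeric columns and convert repetitive strings to categoricals. The sketch below assumes that approach and uses only standard pandas calls; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "cases": [10, 15, 20, 5],
    "rate": [0.5, 1.5, 2.0, 0.25],
    "region": ["north", "south", "north", "south"],
})

before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# Smallest integer type that holds every value in the column.
optimized["cases"] = pd.to_numeric(optimized["cases"], downcast="integer")
# Smallest float type pandas will downcast to (float32).
optimized["rate"] = pd.to_numeric(optimized["rate"], downcast="float")
# Store each distinct string once and keep small integer codes per row.
optimized["region"] = optimized["region"].astype("category")

after = optimized.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(optimized.dtypes)
```

Categoricals pay off when a column has few distinct values relative to its row count. Downcasting floats trades precision for space, so it suits rates and proportions better than identifiers.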

to_csv(path, **kwargs)

Save dataset to CSV.

Parameters:

path (str | Path)

Return type:

None

to_parquet(path, **kwargs)

Save dataset to Parquet.

Parameters:

path (str | Path)

Return type:

None

Examples

Creating a Dataset:

from episia.data.dataset import Dataset
import pandas as pd

# From DataFrame
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'cases': [10, 15, 20],
    'deaths': [0, 1, 2]
})
ds = Dataset(df)

# From file
ds = Dataset("surveillance_data.csv")

Data cleaning and optimization:

# Optimize memory usage
ds.optimize_types()

# Clean missing values (returns a new Dataset unless inplace=True)
ds = ds.clean(drop_na=True, drop_duplicates=True)

# Filter data
ds_filtered = ds.filter("cases > 10")

Epidemiological analysis:

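# Create 2x2 contingency table
# (assumes the dataset has 'exposed' and 'case' columns)
table = ds.create_2x2_table(exposure_col='exposed', outcome_col='case')

# Calculate incidence
# (assumes the dataset has a 'population' column)
incidence = ds.calculate_incidence(cases_col='cases', population_col='population')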

# Get epidemiological description
summary = ds.describe_epidemiological()

History tracking:

# View transformation history
history_df = ds.get_history()
print(history_df)

Export:

ds.to_csv("output.csv")
ds.to_parquet("output.parquet")