dataset Module#
Core dataset class for epidemiological data management.
This module provides the Dataset class, a pandas DataFrame wrapper
optimized for epidemiological analysis with additional functionality
for cleaning, transforming, and analyzing public health data.
Class#
- class episia.data.dataset.Dataset(data, low_memory=True, **kwargs)#
Bases: object
Dataset class for epidemiological data.
A pandas DataFrame wrapper with epidemiology-specific methods and optimizations for memory use and performance.
- df#
Underlying pandas DataFrame
- metadata#
Dictionary with dataset metadata
- history#
List of transformations applied
- optimized#
Whether data types have been optimized
- aggregate_by_date(date_column='date', freq='W', agg_func='sum', inplace=False)#
Aggregate data by date frequency.
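To make the idea of aggregating by date frequency concrete, here is a minimal plain-Python sketch of weekly summation. It is not episia's implementation: the helper name `aggregate_weekly` and its arguments are hypothetical, and it labels each bucket by the Monday that starts the week. In pandas the `'W'` alias anchors weeks on Sunday; how episia labels its buckets is not stated here.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_weekly(rows, date_key="date", value_key="cases"):
    """Sum value_key per Monday-to-Sunday week (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:
        day = date.fromisoformat(row[date_key])
        # Use the Monday of the week containing `day` as the bucket label.
        week_start = day - timedelta(days=day.weekday())
        totals[week_start] += row[value_key]
    # Return buckets in chronological order.
    return dict(sorted(totals.items()))


rows = [
    {"date": "2023-01-01", "cases": 10},  # a Sunday, so it closes the previous week
    {"date": "2023-01-02", "cases": 15},
    {"date": "2023-01-03", "cases": 20},
]
weekly = aggregate_weekly(rows)
for week_start, total in weekly.items():
    print(week_start.isoformat(), total)
```

With the three sample rows from the Examples section, the first day falls in one week and the next two in the following week, so the result has two buckets.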
- calculate_incidence(cases_col, population_col=None, population_value=None, time_period=1.0)#
Calculate incidence rates.
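An incidence rate is conventionally the number of new cases divided by the population at risk over the observation period. The sketch below shows that arithmetic in plain Python. It assumes, without confirmation from the source, that `time_period` acts as a divisor on the population denominator; the function name `incidence_rate` and the `per` scaling argument are hypothetical and not part of the `Dataset` signature.

```python
def incidence_rate(cases, population, time_period=1.0, per=1.0):
    """New cases per unit of population-time, optionally scaled (hypothetical helper)."""
    if population <= 0 or time_period <= 0:
        raise ValueError("population and time_period must be positive")
    # cases / (population * time) gives the rate; `per` rescales it,
    # for example to cases per 100,000 population.
    return cases / (population * time_period) * per


# 45 new cases in a population of 150,000 over one time unit,
# expressed per 100,000 population (roughly 30).
rate = incidence_rate(cases=45, population=150_000, time_period=1.0, per=100_000)
print(round(rate, 2))
```

Passing a fixed population here corresponds loosely to `population_value`, while a per-row population would correspond to `population_col`.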
- clean(drop_na='any', drop_duplicates=True, inplace=False)#
Clean the dataset by removing missing values and duplicates.
- Parameters:
- Returns:
Cleaned Dataset
- Return type:
Dataset
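The cleaning step can be pictured with a small plain-Python sketch that drops rows with missing values and exact duplicates while leaving the input untouched, mirroring `inplace=False`. It assumes `drop_na` follows the `'any'`/`'all'` semantics of pandas `dropna`, which the source does not confirm; `clean_rows` is a hypothetical helper, not an episia function.

```python
def clean_rows(rows, drop_na="any", drop_duplicates=True):
    """Return a new list without incomplete or duplicated rows (hypothetical helper)."""
    cleaned = []
    seen = set()
    for row in rows:
        missing = [value is None for value in row.values()]
        # 'any': drop the row if at least one value is missing.
        if drop_na == "any" and any(missing):
            continue
        # 'all': drop the row only if every value is missing.
        if drop_na == "all" and all(missing):
            continue
        if drop_duplicates:
            # Column names are unique, so sorting the items orders by name only.
            key = tuple(sorted(row.items()))
            if key in seen:
                continue
            seen.add(key)
        cleaned.append(row)
    return cleaned


rows = [
    {"date": "2023-01-01", "cases": 10},
    {"date": "2023-01-01", "cases": 10},    # exact duplicate
    {"date": "2023-01-02", "cases": None},  # missing value
    {"date": "2023-01-03", "cases": 20},
]
cleaned = clean_rows(rows)
print(cleaned)
```

Because a new list is returned, the caller has to keep the result, just as a `Dataset` returned by `clean` with `inplace=False` must be assigned to a variable.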
- create_2x2_table(exposure_col, outcome_col, strata_col=None)#
Create 2x2 contingency table from dataset columns.
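A 2x2 table cross-classifies records by exposure and outcome. The sketch below counts the four cells in plain Python and derives an odds ratio from them. The cell labels `a` to `d`, the helper `two_by_two`, and the sample records are illustrative assumptions; the object that episia's method actually returns is not described in this section.

```python
from collections import Counter


def two_by_two(records, exposure_key, outcome_key):
    """Count the four exposure/outcome cells (hypothetical helper)."""
    counts = Counter(
        (bool(r[exposure_key]), bool(r[outcome_key])) for r in records
    )
    return {
        "a": counts[(True, True)],    # exposed, case
        "b": counts[(True, False)],   # exposed, non-case
        "c": counts[(False, True)],   # unexposed, case
        "d": counts[(False, False)],  # unexposed, non-case
    }


# Made-up records for illustration only.
records = [
    {"exposed": 1, "case": 1},
    {"exposed": 1, "case": 1},
    {"exposed": 1, "case": 0},
    {"exposed": 0, "case": 1},
    {"exposed": 0, "case": 0},
    {"exposed": 0, "case": 0},
]
table = two_by_two(records, "exposed", "case")

# Odds ratio = (a * d) / (b * c); undefined when b or c is zero.
odds_ratio = (table["a"] * table["d"]) / (table["b"] * table["c"])
print(table, odds_ratio)
```

A stratified analysis, as suggested by `strata_col`, would repeat the same count once per stratum value.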
- describe_epidemiological()#
Generate epidemiological description of dataset.
- Returns:
DataFrame with epidemiological summary
- Return type:
DataFrame
- optimize_types()#
Optimize DataFrame column types to reduce memory usage.
- Returns:
self for method chaining
- Return type:
Dataset
Examples#
Creating a Dataset:
from episia.data.dataset import Dataset
import pandas as pd
# From DataFrame
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'cases': [10, 15, 20],
'deaths': [0, 1, 2]
})
ds = Dataset(df)
# From file
ds = Dataset("surveillance_data.csv")
Data cleaning and optimization:
# Optimize memory usage
ds.optimize_types()
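# Clean missing values and duplicates; with the default inplace=False,
# clean() returns a new Dataset, so keep the result
ds = ds.clean(drop_na=True, drop_duplicates=True)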
# Filter data
ds_filtered = ds.filter("cases > 10")
Epidemiological analysis:
# Create 2x2 contingency table
table = ds.create_2x2_table(exposure_col='exposed', outcome_col='case')
# Calculate incidence
incidence = ds.calculate_incidence(cases_col='cases', population_col='population')
# Get epidemiological description
summary = ds.describe_epidemiological()
History tracking:
# View transformation history
history_df = ds.get_history()
print(history_df)
Export:
ds.to_csv("output.csv")
ds.to_parquet("output.parquet")