types Module

Data type optimization for epidemiological analysis.

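This module provides functions that optimize pandas data types to reduce memory usage while preserving data integrity. It also includes helpers that convert columns to epidemiological types (binary, categorical, continuous, date) and detect those types automatically.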

Functions

episia.data.types.optimize_dataframe_types(df, categorical_threshold=0.5, downcast_integers=True, downcast_floats=True)

Optimize DataFrame column types to reduce memory usage.

Parameters:
  • df (DataFrame) – Input DataFrame

  • categorical_threshold (float) – Maximum ratio of unique values to total values at which a column is converted to categorical

  • downcast_integers (bool) – Downcast integer columns

  • downcast_floats (bool) – Downcast float columns

Returns:

Optimized DataFrame

Return type:

DataFrame
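To see what type optimization saves on a given dataset, memory can be measured before and after. The sketch below uses only plain pandas calls (`astype("category")` and `pd.to_numeric` with `downcast`) as stand-ins for the optimization steps. It illustrates the general technique and is not episia's implementation; the sample data is made up.

```python
import pandas as pd

# Illustrative data (made up): a repetitive string column and small integers.
df = pd.DataFrame({
    "district": ["A", "B", "A", "C"] * 250,
    "cases": [100, 150, 200, 250] * 250,
})

before = df.memory_usage(deep=True).sum()

# Plain-pandas stand-ins for the optimization steps (assumption, not episia code).
optimized = df.copy()
optimized["district"] = optimized["district"].astype("category")
optimized["cases"] = pd.to_numeric(optimized["cases"], downcast="integer")

after = optimized.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
print(optimized.dtypes)
```

Passing `deep=True` matters for string columns, because the default shallow measurement can understate the memory held by Python string objects.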

episia.data.types.optimize_column_type(series, categorical_threshold=0.5, downcast_integers=True, downcast_floats=True)

Optimize a single column’s data type.

Parameters:
  • series (Series) – Input Series

  • categorical_threshold (float) – Maximum ratio of unique values to total values at which the column is converted to categorical

  • downcast_integers (bool) – Downcast integer columns

  • downcast_floats (bool) – Downcast float columns

Returns:

Optimized Series

Return type:

Series
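The "unique ratio" behind `categorical_threshold` can be read as the number of distinct values divided by the number of rows. The helper below is a hypothetical illustration of that reading in plain pandas; the exact formula episia uses, and whether the comparison with the threshold is inclusive, are assumptions.

```python
import pandas as pd

def unique_ratio(series: pd.Series) -> float:
    """Distinct non-null values divided by row count (hypothetical helper)."""
    return series.nunique() / len(series) if len(series) else 0.0

# Made-up sample columns: one repetitive, one with all-distinct identifiers.
districts = pd.Series(["A", "B", "A", "C", "A", "B", "A", "A"])
patient_ids = pd.Series([f"P{i:03d}" for i in range(8)])

threshold = 0.5  # mirrors the documented default of categorical_threshold
for name, col in [("districts", districts), ("patient_ids", patient_ids)]:
    ratio = unique_ratio(col)
    # Assumption: a column qualifies when its ratio does not exceed the threshold.
    print(name, ratio, "categorical candidate" if ratio <= threshold else "left as-is")
```

Under this reading, a column of mostly repeated labels is a good categorical candidate, while an identifier column with one distinct value per row is not.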

episia.data.types.optimize_numeric_type(series, downcast_integers=True, downcast_floats=True)

Optimize numeric column type.

Parameters:
  • series (Series) – Numeric Series

  • downcast_integers (bool) – Downcast integer columns

  • downcast_floats (bool) – Downcast float columns

Returns:

Optimized numeric Series

Return type:

Series
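For orientation, this is what numeric downcasting looks like with pandas' own `pd.to_numeric`. It is a minimal sketch of the general technique; whether episia calls `pd.to_numeric` internally is an assumption, and the values are made up.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])          # typically stored as int64
rates = pd.Series([0.12, 0.5, 1.75, 3.0])   # typically stored as float64

# Pick the smallest signed integer / float type that can hold the values.
small_ages = pd.to_numeric(ages, downcast="integer")
small_rates = pd.to_numeric(rates, downcast="float")

print(ages.dtype, "->", small_ages.dtype)
print(rates.dtype, "->", small_rates.dtype)
```

Float downcasting trades precision for space: float32 keeps roughly seven significant decimal digits, so it may be worth disabling for columns where small differences matter.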

episia.data.types.optimize_datetime_type(series)

Optimize datetime column type.

Parameters:

series (Series) – Datetime Series

Returns:

Optimized datetime Series

Return type:

Series

episia.data.types.optimize_object_type(series)

Optimize object (string) column type.

Parameters:

series (Series) – Object Series

Returns:

Optimized Series

Return type:

Series

episia.data.types.get_type_recommendations(df)

Get type optimization recommendations for DataFrame.

Parameters:

df (DataFrame) – Input DataFrame

Returns:

DataFrame with recommendations

Return type:

DataFrame

episia.data.types.convert_to_epidemiological_types(df, column_types)

Convert columns to specific epidemiological types.

Parameters:
  • df (DataFrame) – Input DataFrame

  • column_types (Dict[str, str]) – Dictionary of {column_name: type}. Supported types: 'binary', 'categorical', 'continuous', 'date'

Returns:

Converted DataFrame

Return type:

DataFrame

episia.data.types.convert_to_binary(series)

Convert series to binary (0/1).

Parameters:

series (Series)

Return type:

Series
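The idea of mapping text labels to 0/1 can be sketched in plain pandas as follows. The lookup table and the helper name are hypothetical; the set of labels episia recognizes, and how it treats unrecognized values, may differ.

```python
import pandas as pd

# Hypothetical lookup table; the labels episia accepts may differ.
BINARY_MAP = {"yes": 1, "no": 0, "true": 1, "false": 0, "1": 1, "0": 0}

def to_binary_sketch(series: pd.Series) -> pd.Series:
    """Map common yes/no spellings to 1/0; unrecognized labels become <NA>."""
    normalized = series.astype(str).str.strip().str.lower()
    # Nullable Int8 keeps the result integer-typed even with missing values.
    return normalized.map(BINARY_MAP).astype("Int8")

raw = pd.Series(["Yes", "no", " TRUE ", "0", "maybe"])
result = to_binary_sketch(raw)
print(result)
```

Turning unrecognized labels into missing values, instead of silently coding them as 0, keeps data-entry problems visible during cleaning.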

episia.data.types.convert_to_categorical(series, max_categories=50)

Convert series to categorical type.

Parameters:
  • series (Series)

  • max_categories (int)

Return type:

Series

episia.data.types.convert_to_continuous(series)

Convert series to continuous numeric type.

Parameters:

series (Series)

Return type:

Series
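A common way to coerce messy text into a continuous numeric column in plain pandas is `pd.to_numeric` with `errors="coerce"`. The sketch below shows that technique on made-up values; it is an assumption that episia behaves similarly for unparseable entries.

```python
import pandas as pd

# Made-up raw values as they might arrive from a CSV export.
raw_ages = pd.Series(["25", "30.5", "unknown", None])

# Unparseable entries become NaN instead of raising an error.
ages = pd.to_numeric(raw_ages, errors="coerce")

print(ages)
print("missing after conversion:", int(ages.isna().sum()))
```

Counting the missing values after conversion gives a quick check on how many entries could not be parsed.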

episia.data.types.convert_to_date(series)

Convert series to datetime type.

Parameters:

series (Series)

Return type:

Series
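For comparison, date parsing in plain pandas is usually done with `pd.to_datetime`. The sketch below uses an explicit format and `errors="coerce"` on made-up values; the formats episia accepts and its handling of invalid dates are not documented here, so treat this as an illustration of the technique only.

```python
import pandas as pd

# Made-up report dates, including an impossible date and free text.
raw_dates = pd.Series(["2024-01-15", "2024-02-30", "not a date"])

# With errors="coerce", values that do not match the format become NaT.
dates = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(dates)
```

Stating the format explicitly avoids ambiguous day/month guesses, which matters when report dates come from several sources.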

episia.data.types.detect_column_types(df)

Detect epidemiological column types automatically.

Parameters:

df (DataFrame) – Input DataFrame

Returns:

Dictionary of {column_name: detected_type}

Return type:

dict

Examples

Optimize entire DataFrame:

from episia.data.types import optimize_dataframe_types
import pandas as pd

# Create DataFrame with inefficient types
df = pd.DataFrame({
    'age': [25, 30, 35, 40],
    'cases': [100, 150, 200, 250],
    'district': ['A', 'B', 'A', 'C']
})

# Optimize types
df_opt = optimize_dataframe_types(df)
# Compare df.memory_usage(deep=True).sum() with df_opt.memory_usage(deep=True).sum() to verify the reduction

Get type recommendations:

from episia.data.types import get_type_recommendations

recommendations = get_type_recommendations(df)
print(recommendations)
# Shows current vs recommended types and memory savings

Convert to specific epidemiological types:

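from episia.data.types import convert_to_epidemiological_types
import pandas as pd

# Illustrative line list whose columns match the keys in type_spec
linelist = pd.DataFrame({
    'district': ['A', 'B', 'A', 'C'],
    'case': ['Yes', 'No', 'Yes', 'No'],
    'age': [25, 30, 35, 40],
    'report_date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']
})

# Map each column name to its epidemiological type
type_spec = {
    'district': 'categorical',
    'case': 'binary',
    'age': 'continuous',
    'report_date': 'date'
}

df_converted = convert_to_epidemiological_types(linelist, type_spec)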

Binary conversion:

from episia.data.types import convert_to_binary
import pandas as pd

# Convert various representations to 0/1
series = pd.Series(['Yes', 'No', 'Yes', 'No'])
binary = convert_to_binary(series)  # Returns [1, 0, 1, 0]

Automatic column type detection:

from episia.data.types import detect_column_types

types = detect_column_types(df)
print(types)
# {'age': 'continuous', 'cases': 'continuous', 'district': 'categorical'}