Analyzing Brazilian Hospital Admissions and Causes of Death with Python (SIH-DATASUS)
In this post, we’ll delve into the Hospital Information System (SIH) of DATASUS, a rich dataset that provides detailed insights into hospital admissions in Brazil. Specifically, we’ll focus on the `CID_MORTE` column, which indicates the cause of death, and analyze the data for the state of Acre (AC) in June 2008. The full code can be found here: https://github.com/marianeneiva/downloadSIH
Setting Up
Before we begin, we need to install the `pysus` library, which facilitates the downloading of SIH data:
!pip install pysus
Fetching the Data
With pysus installed, we can easily fetch the SIH data for our desired state, year, and month:
from pysus.online_data.SIH import download
from pysus.online_data import parquets_to_dataframe as to_df
sih_ = to_df(download('AC',2008,6))
Exploring the Data
Let’s take a quick look at our data:
- First five rows:
sih_.head()
- Last five rows:
sih_.tail()
- A random sample of 10 rows:
sih_.sample(n=10)
To get a sense of the columns we have:
print(list(sih_.columns.values))
output: ['UF_ZI', 'ANO_CMPT', 'MES_CMPT', 'ESPEC', 'CGC_HOSP', 'N_AIH', 'IDENT', 'CEP', 'MUNIC_RES', 'NASC', 'SEXO', 'UTI_MES_IN', 'UTI_MES_AN', 'UTI_MES_AL', 'UTI_MES_TO', 'MARCA_UTI', 'UTI_INT_IN', 'UTI_INT_AN', 'UTI_INT_AL', 'UTI_INT_TO', 'DIAR_ACOM', 'QT_DIARIAS', 'PROC_SOLIC', 'PROC_REA', 'VAL_SH', 'VAL_SP', 'VAL_SADT', 'VAL_RN', 'VAL_ACOMP', 'VAL_ORTP', 'VAL_SANGUE', 'VAL_SADTSR', 'VAL_TRANSP', 'VAL_OBSANG', 'VAL_PED1AC', 'VAL_TOT', 'VAL_UTI', 'US_TOT', 'DT_INTER', 'DT_SAIDA', 'DIAG_PRINC', 'DIAG_SECUN', 'COBRANCA', 'NATUREZA', 'GESTAO', 'RUBRICA', 'IND_VDRL', 'MUNIC_MOV', 'COD_IDADE', 'IDADE', 'DIAS_PERM', 'MORTE', 'NACIONAL', 'NUM_PROC', 'CAR_INT', 'TOT_PT_SP', 'CPF_AUT', 'HOMONIMO', 'NUM_FILHOS', 'INSTRU', 'CID_NOTIF', 'CONTRACEP1', 'CONTRACEP2', 'GESTRISCO', 'INSC_PN', 'SEQ_AIH5', 'CBOR', 'CNAER', 'VINCPREV', 'GESTOR_COD', 'GESTOR_TP', 'GESTOR_CPF', 'GESTOR_DT', 'CNES', 'CNPJ_MANT', 'INFEHOSP', 'CID_ASSO', 'CID_MORTE', 'COMPLEX', 'FINANC', 'FAEC_TP', 'REGCT', 'RACA_COR', 'ETNIA', 'SEQUENCIA', 'REMESSA']
Data Cleaning
Real-world data is often messy. Let’s replace empty strings with NaN values for better data handling:
import numpy as np
sih_ = sih_.replace('',np.nan)
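To see what this replacement does before applying it to the full dataset, here is a minimal sketch on a toy DataFrame. The rows and values are invented for illustration; only the column names come from the SIH listing above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SIH frame; the values are invented for illustration.
toy = pd.DataFrame({
    "N_AIH": ["1", "2", "3", "4"],
    "CID_MORTE": ["", "J189", "", "I219"],
    "DIAG_SECUN": ["", "", "", ""],
})

# Exact-match replacement: only cells equal to '' become NaN.
cleaned = toy.replace("", np.nan)

# Missing values per column, and the share of rows missing each one.
missing_counts = cleaned.isna().sum()
missing_share = cleaned.isna().mean()

print(missing_counts)
print(missing_share)
```

Running the same two `isna()` summaries on `sih_` after the replacement shows which columns are mostly empty. If some cells contain only whitespace rather than an empty string, a regex replacement such as `replace(r"^\s*$", np.nan, regex=True)` may be needed instead; whether that applies to this file is something to check, not something this post verified.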
Diving into the Cause of Death (‘CID_MORTE’)
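The `CID_MORTE` column is of particular interest because it records the cause of death as an ICD-10 code (CID is the Portuguese abbreviation for the International Classification of Diseases). Let’s analyze it: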
- Unique values:
print(sih_["CID_MORTE"].unique())
- Number of unique values:
print(sih_['CID_MORTE'].nunique()) #output: 44
- Frequency of each unique value:
print(sih_['CID_MORTE'].value_counts())
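Most admissions do not end in death, so after the cleaning step much of `CID_MORTE` is likely NaN. By default, `nunique()` and `value_counts()` skip those NaN entries. The sketch below, on an invented Series, shows how the counts change when missing values are kept and how to get proportions instead of raw counts.

```python
import numpy as np
import pandas as pd

# Invented codes for illustration; NaN marks rows with no recorded cause of death.
causes = pd.Series(
    ["J189", np.nan, "I219", "J189", np.nan, np.nan, "A419", "J189"],
    name="CID_MORTE",
)

# Default behaviour: NaN is excluded from the tally.
recorded = causes.value_counts()

# dropna=False keeps NaN as its own row, so the total matches the row count.
with_missing = causes.value_counts(dropna=False)

# normalize=True returns proportions among the recorded (non-missing) codes.
shares = causes.value_counts(normalize=True)

print(recorded)
print(with_missing)
print(shares)
```

On the real data, `sih_['CID_MORTE'].value_counts(dropna=False)` makes it easy to see how many rows carry a cause of death at all before interpreting the ranking.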
Visualizing the Data
Because `CID_MORTE` is categorical, a histogram of it simply counts the rows for each code. Sorting the categories by total puts the most frequent causes of death first:
import plotly.express as px
fig = px.histogram(sih_, x="CID_MORTE").update_xaxes(categoryorder="total descending")
fig.show()
This chart ranks the causes of death recorded for hospital admissions in the state of Acre (AC) in June 2008, from most to least frequent.
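With dozens of distinct codes, the chart can get crowded. One option, sketched here on invented data, is to group codes into their three-character ICD-10 category and keep only the most frequent ones before plotting. This assumes the codes are stored without a dot (for example `J189` rather than `J18.9`), which should be checked against the actual column. The names `category` and `deaths` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Invented codes for illustration only.
codes = pd.Series(
    ["J189", "J180", "I219", np.nan, "A419", "J189", "I210", np.nan],
    name="CID_MORTE",
)

# Assumption: no dot in the code, so the first three characters
# identify the ICD-10 category (e.g. 'J189' -> 'J18').
categories = codes.dropna().str[:3]

# Count rows per category and keep the two most frequent.
top = (
    categories.value_counts()
    .head(2)
    .rename_axis("category")
    .reset_index(name="deaths")
)

print(top)
```

The resulting table can be passed to `px.bar(top, x="category", y="deaths")` to draw the condensed ranking; on the real data, `codes` would be `sih_['CID_MORTE']` and `head(2)` a larger cutoff such as `head(10)`.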
Conclusion
Analyzing the SIH dataset provides valuable insights into hospital admissions and causes of death. With Python and the pysus library, we can easily fetch, clean, and visualize this data, making it a powerful tool for healthcare professionals, researchers, and policymakers.