Analyzing Brazilian Hospital Admissions and Causes of Death with Python (SIH-DATASUS)

Mariane Neiva — @maribneiva
Oct 20, 2023

In this post, we’ll delve into the Hospital Information System (SIH) of DATASUS, a rich dataset that provides detailed insights into hospital admissions in Brazil. Specifically, we’ll focus on the `CID_MORTE` column, which indicates the cause of death, and analyze the data for the state of Acre (AC) in June 2008. The full code can be found here: https://github.com/marianeneiva/downloadSIH

Setting Up

Before we begin, we need to install the `pysus` library, which facilitates the downloading of SIH data:

!pip install pysus

Fetching the Data

With pysus installed, we can easily fetch the SIH data for our desired state, year, and month:


from pysus.online_data.SIH import download
from pysus.online_data import parquets_to_dataframe as to_df

sih_ = to_df(download('AC',2008,6))
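Downloading the same month repeatedly can be slow, so it may help to cache the result locally. The following is a minimal sketch, not part of the original code: `load_or_fetch` and `fake_fetch` are hypothetical names, and the stand-in fetch function returns made-up values so the example runs offline.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_or_fetch(cache_path, fetch):
    """Return the cached DataFrame if the file exists; otherwise call fetch() and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = fetch()
    df.to_pickle(cache_path)
    return df


# Stand-in for the real download (hypothetical, made-up values).
calls = []


def fake_fetch():
    calls.append(1)
    return pd.DataFrame({"CID_MORTE": ["I219", ""]})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sih_AC_2008_06.pkl"
    first = load_or_fetch(path, fake_fetch)   # calls fake_fetch and writes the cache
    second = load_or_fetch(path, fake_fetch)  # reads the cache, no second fetch
```

In this post's setting, the fetch argument could be `lambda: to_df(download('AC', 2008, 6))`, reusing the two imports shown above.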

Exploring the Data

Let’s take a quick look at our data:

  • First five rows:
sih_.head()
output of sih_.head()
  • Last five rows:
sih_.tail()
output of sih_.tail()
  • A random sample of 10 rows:
sih_.sample(n=10)
output of: sih_.sample(n=10)

To get a sense of the columns we have:

print(list(sih_.columns.values))
output: ['UF_ZI', 'ANO_CMPT', 'MES_CMPT', 'ESPEC', 'CGC_HOSP', 'N_AIH', 'IDENT', 'CEP', 'MUNIC_RES', 'NASC', 'SEXO', 'UTI_MES_IN', 'UTI_MES_AN', 'UTI_MES_AL', 'UTI_MES_TO', 'MARCA_UTI', 'UTI_INT_IN', 'UTI_INT_AN', 'UTI_INT_AL', 'UTI_INT_TO', 'DIAR_ACOM', 'QT_DIARIAS', 'PROC_SOLIC', 'PROC_REA', 'VAL_SH', 'VAL_SP', 'VAL_SADT', 'VAL_RN', 'VAL_ACOMP', 'VAL_ORTP', 'VAL_SANGUE', 'VAL_SADTSR', 'VAL_TRANSP', 'VAL_OBSANG', 'VAL_PED1AC', 'VAL_TOT', 'VAL_UTI', 'US_TOT', 'DT_INTER', 'DT_SAIDA', 'DIAG_PRINC', 'DIAG_SECUN', 'COBRANCA', 'NATUREZA', 'GESTAO', 'RUBRICA', 'IND_VDRL', 'MUNIC_MOV', 'COD_IDADE', 'IDADE', 'DIAS_PERM', 'MORTE', 'NACIONAL', 'NUM_PROC', 'CAR_INT', 'TOT_PT_SP', 'CPF_AUT', 'HOMONIMO', 'NUM_FILHOS', 'INSTRU', 'CID_NOTIF', 'CONTRACEP1', 'CONTRACEP2', 'GESTRISCO', 'INSC_PN', 'SEQ_AIH5', 'CBOR', 'CNAER', 'VINCPREV', 'GESTOR_COD', 'GESTOR_TP', 'GESTOR_CPF', 'GESTOR_DT', 'CNES', 'CNPJ_MANT', 'INFEHOSP', 'CID_ASSO', 'CID_MORTE', 'COMPLEX', 'FINANC', 'FAEC_TP', 'REGCT', 'RACA_COR', 'ETNIA', 'SEQUENCIA', 'REMESSA']
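With this many columns, it can be easier to work with a small subset. Here is a minimal sketch under stated assumptions: `select_columns` is a hypothetical helper, the wanted names are taken from the column list above, and the toy frame holds made-up values in place of the real `sih_` data.

```python
import pandas as pd

# Columns of interest; every name here appears in the column list above.
WANTED = ["N_AIH", "SEXO", "IDADE", "DIAG_PRINC", "MORTE", "CID_MORTE"]


def select_columns(df, wanted):
    """Return the wanted columns that exist in df, plus the names that were not found."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present], missing


# Toy stand-in for sih_ (made-up values).
toy = pd.DataFrame(
    {"N_AIH": ["001"], "SEXO": ["1"], "CID_MORTE": [""], "VAL_TOT": [100.0]}
)
subset, missing = select_columns(toy, WANTED)
print(list(subset.columns), missing)
```

Reporting the missing names instead of raising an error is a design choice for exploration; the set of columns may differ between files.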

Data Cleaning

Real-world data is often messy. Let’s replace empty strings with NaN values for better data handling:

import numpy as np
sih_ = sih_.replace('',np.nan)
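Note that `replace('', np.nan)` only matches strings that are exactly empty. The sketch below uses a toy frame with made-up values and compares it with a regex variant that also catches whitespace-only strings. Whether such values occur in the SIH files is an assumption, not something verified here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SIH frame (made-up values).
toy = pd.DataFrame(
    {
        "N_AIH": ["001", "002", "003", "004"],
        "CID_MORTE": ["I219", "", "   ", "J189"],
    }
)

# Exact-match replacement, as in the post: only '' becomes NaN.
exact = toy.replace("", np.nan)

# Regex variant: empty or whitespace-only strings become NaN.
blank = toy.replace(r"^\s*$", np.nan, regex=True)

missing_exact = int(exact["CID_MORTE"].isna().sum())
missing_blank = int(blank["CID_MORTE"].isna().sum())
print(missing_exact, missing_blank)
```

Comparing the two counts on the real data is a quick way to check whether the stricter pattern makes any difference.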

Diving into the Cause of Death (‘CID_MORTE’)

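The ‘CID_MORTE’ column is of particular interest because it records the cause of death as an ICD-10 code (CID is the Portuguese acronym for ICD, the International Classification of Diseases). Let’s analyze it: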

  • Unique values:
print(sih_["CID_MORTE"].unique())
  • Number of unique values:
print(sih_['CID_MORTE'].nunique()) #output: 44
  • Frequency of each unique value:
print(sih_['CID_MORTE'].value_counts())
part of the output of sih_['CID_MORTE'].value_counts()
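Once empty strings have been replaced with NaN, `value_counts()` excludes those rows by default. The sketch below shows, on a toy series with made-up codes, how to include missing values in the tally and how to get shares instead of raw counts.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sih_['CID_MORTE'] after cleaning (made-up codes).
causes = pd.Series(
    ["I219", "J189", "I219", np.nan, np.nan, np.nan, "A419"], name="CID_MORTE"
)

counts = causes.value_counts()                       # NaN rows are excluded by default
counts_with_nan = causes.value_counts(dropna=False)  # NaN gets its own entry
shares = causes.value_counts(normalize=True)         # fractions of the non-missing rows

n_recorded = int(causes.notna().sum())  # rows with a recorded code
print(counts_with_nan)
print(shares)
```

Looking at `counts_with_nan` first shows how many rows have no recorded code, which puts the frequencies of the remaining codes in context.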

Visualizing the Data

A histogram can provide a clear picture of the distribution of causes of death:

import plotly.express as px
fig = px.histogram(sih_, x="CID_MORTE").update_xaxes(categoryorder="total descending")
fig.show()
Histogram of ‘CID_MORTE’

Because the bars are sorted by total count, the chart ranks the causes of death recorded for hospital admissions in the state of Acre (‘AC’) in June 2008, from most to least frequent.
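With 44 distinct codes, the x-axis can get crowded. One option is to plot only the most frequent codes. This is a minimal sketch, assuming a bar chart of precomputed counts is acceptable in place of the histogram: `top_causes` and `plot_top_causes` are hypothetical helpers, and the toy frame holds made-up values.

```python
import pandas as pd


def top_causes(df, column="CID_MORTE", n=10):
    """Return a two-column table with the n most frequent values and their counts."""
    counts = df[column].value_counts().head(n)  # NaN rows are excluded by default
    return counts.rename_axis(column).reset_index(name="n_records")


def plot_top_causes(df, column="CID_MORTE", n=10):
    """Build a bar chart of the n most frequent values (requires plotly)."""
    import plotly.express as px  # imported here so top_causes works without plotly

    table = top_causes(df, column=column, n=n)
    return px.bar(table, x=column, y="n_records")


# Toy stand-in for sih_ (made-up values).
toy = pd.DataFrame({"CID_MORTE": ["A", "A", "B", None, "C", "B", "A"]})
table = top_causes(toy, n=2)
print(table)
```

On the real data, `plot_top_causes(sih_, n=10).show()` would display the chart; the cutoff of 10 is arbitrary.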

Conclusion

Analyzing the SIH dataset provides valuable insights into hospital admissions and causes of death. With Python and the pysus library, we can easily fetch, clean, and visualize this data, which makes the dataset a practical resource for healthcare professionals, researchers, and policymakers.

Mariane Neiva — @maribneiva

Woman in tech, researcher @University of Sao Paulo. Passionate about artificial intelligence, innovation, scientific communication and programming.