Unraveling Zika: A Comprehensive Data Analysis with Python — Brazilian DATASUS

Mariane Neiva — @maribneiva
2 min read · Oct 29, 2023

Introduction

Zika, a disease spread primarily by mosquitoes, has drawn global concern because of its potentially severe effects on pregnant women and their fetuses. In this article, we use data from the Notifiable Diseases Information System (SINAN), distributed through DATASUS, to analyze the Zika virus cases reported in Brazil in 2016. Using Python and its data libraries, we aim to uncover patterns and trends in the spread of this disease.

Setting Up the Environment

Before we start our analysis, we need to set up our Python environment. Make sure Python is installed, then install the required libraries with the following command:

pip install pysus pandas numpy plotly
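To confirm the installation worked, a quick check like the one below can report which of the four packages are present. This is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration and is not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found


if __name__ == "__main__":
    report = installed_versions(["pysus", "pandas", "numpy", "plotly"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```

Printing the versions is also useful when sharing results, since the `pysus` interface has changed between releases.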

Fetching the Data

With our environment ready, we can now fetch the Zika virus data. We will use the `pysus` library, which simplifies the process of downloading DATASUS data, including the SINAN records we need. Here’s how you can fetch the SINAN data for Zika virus cases reported in 2016:


from pysus.online_data.SINAN import download
from pysus.online_data import parquets_to_dataframe as to_df
import pandas as pd

# Download the 2016 SINAN Zika notifications and load them into a DataFrame
zika = to_df(download('Zika', 2016))
print(type(zika))
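Because the download can take a while, it may be worth caching the resulting DataFrame on disk. The sketch below is one possible approach, not part of `pysus`: `load_or_fetch` is a hypothetical helper, and the pickle format is an assumption chosen because it preserves column types without extra dependencies.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(cache_path, fetch):
    """Return the cached DataFrame if it exists; otherwise call fetch() and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the local copy instead of downloading again
        return pd.read_pickle(cache_path)
    df = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df
```

With the imports from the fetch step, this could be called as `zika = load_or_fetch("data/zika_2016.pkl", lambda: to_df(download('Zika', 2016)))`, where the file path is an arbitrary example. Only load pickle files you created yourself, since unpickling untrusted files is unsafe.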

Exploring the Data

Once we have our data, it’s crucial to understand its structure, the type of data we are dealing with, and get a glimpse of the first few rows:


print("Number of instances:", len(zika))
print("Columns:", zika.columns)
print("Data types:\n", zika.dtypes)
print("First 5 rows:\n", zika.head())
# info() prints its summary directly and returns None, so call it on its own
print("Column information:")
zika.info()

Data Cleaning

Real-world data is rarely perfect. In our dataset, missing values are stored as empty strings (''). To make them easier to handle, we will replace these empty strings with NaN values:


import numpy as np
zika = zika.replace('', np.nan)
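Once empty strings are NaN, pandas can measure how incomplete each column is. The following sketch uses a small made-up frame in place of the real data (the values are invented for illustration; only the column names come from the dataset) to show the replacement followed by a per-column missing-value share.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the SINAN data; the values are made up
sample = pd.DataFrame({
    "DT_NOTIFIC": ["2016-01-05", "2016-01-12", "", "2016-02-01"],
    "SG_UF_NOT": ["35", "", "", "33"],
})

# Same cleaning step as above: empty strings become NaN
cleaned = sample.replace("", np.nan)

# Fraction of missing values per column, most incomplete first
missing_share = cleaned.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `zika` gives a quick view of which columns are too sparse to analyze.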

Analyzing Zika Cases

We are particularly interested in the ‘DT_NOTIFIC’ column, which indicates the date of notification. Let’s analyze it:

zika['DT_NOTIFIC'] = pd.to_datetime(zika['DT_NOTIFIC'])
print("Unique values:\n", zika['DT_NOTIFIC'].unique())
print("Number of unique values:", zika['DT_NOTIFIC'].nunique())
print("Frequency of each unique value:\n", zika['DT_NOTIFIC'].value_counts())
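Daily frequencies can be noisy, and a single malformed date makes `pd.to_datetime` raise an error. As a hedged variation on the step above, the sketch below parses a made-up series with `errors="coerce"`, which turns unparseable entries into NaT, and then counts notifications per month. The sample values are invented for illustration.

```python
import pandas as pd

# Made-up notification dates, including one malformed and one missing entry
raw = pd.Series(["2016-01-05", "2016-01-12", "not a date", None, "2016-02-01"])

# errors="coerce" converts anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")
print("Unparseable or missing dates:", parsed.isna().sum())

# Count notifications per calendar month, in chronological order
monthly = parsed.dropna().dt.to_period("M").value_counts().sort_index()
print(monthly)
```

Checking the number of NaT values before dropping them shows how many records would be lost from any time-based analysis.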

Visualizing the Data

To better understand the distribution of Zika cases over time, we will create a histogram:


import plotly.express as px
fig = px.histogram(zika, x="DT_NOTIFIC", color='SG_UF_NOT', text_auto=True)
fig.show()
Figure: Final result
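The histogram lets Plotly choose the bins. For a table of the same information, the counts can be aggregated explicitly per week and notifying state. The sketch below uses made-up rows and assumes Sunday-to-Saturday epidemiological weeks, hence `freq="W-SAT"`; adjust the frequency if a different week definition is needed.

```python
import pandas as pd

# Made-up notifications; the state codes are illustrative only
sample = pd.DataFrame({
    "DT_NOTIFIC": pd.to_datetime(
        ["2016-01-04", "2016-01-06", "2016-01-07", "2016-01-12", "2016-01-13"]
    ),
    "SG_UF_NOT": ["35", "35", "33", "35", "33"],
})

# Count cases per week (labeled by the Saturday that ends it) and per state
weekly = (
    sample
    .groupby([pd.Grouper(key="DT_NOTIFIC", freq="W-SAT"), "SG_UF_NOT"])
    .size()
    .unstack(fill_value=0)
)
print(weekly)
```

The resulting table has one row per week and one column per state, which makes it easy to compare states or export the numbers behind the chart.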

Conclusion

Through this analysis, we have taken a significant step in understanding the spread of the Zika virus in Brazil. The use of Python and its libraries has enabled us to clean, analyze, and visualize the data in a meaningful way. As we continue to explore this dataset, we can uncover more insights and contribute to the global effort in combating the Zika virus.

This article provides a comprehensive guide to analyzing Zika virus data using Python. By following the steps outlined, readers can replicate the analysis and gain valuable insights into the spread of this disease. Happy analyzing!

Mariane Neiva — @maribneiva

Woman in tech, researcher @University of Sao Paulo. Passionate about artificial intelligence, innovation, scientific communication and programming.