31 Mar 2020

COVID-19: Analysis of number of deaths in Italy, Spain and Netherlands with Power BI

#70 COVID-19: Analysis of number of deaths in Italy, Spain and Netherlands with Power BI 

Update 2/4/2020: Today I improved the report, which you can download now (see link at bottom of page), and I finished this blog-post, and I included the (embedded) report (at the bottom of this page).

In the report subject of this blog-post, I wanted to compare the COVID-19 state in 3 countries:
Italy, Spain (where I live) and the Netherlands (where I'm from). For the comparison I chose to use the number of deaths of Corona-patients, as this indicator shows how well a country can manage the care for the patients, so the infected people who are hospitalized and might need to go to the Intensive Care. And it is known that first in Italy, and now also in Spain, and especially Madrid (where I live), the IC-capacity has reached it's limit, or passed that point..

From this video:
US death rates v UK, Italy and South Korea - BBC News
I learned that to compare the 'curves' (in this case of Corona-death#), you have to have an x-axis with releative time, so not the calender date, but the day-number of when a country passed a certain number of deaths, i.c. 25. So e.g. day #1 is the 1st day with death# >=  25,  which Italy reached first (on 1/3), and then Spain (on 10/3, so 9 days after Italy) and then Netherlands (on 17/3, so 16 days after Italy).
So this I also did for my PowerBI-report, see fig.1.
I also made a chart with the calender-date on the x-axis, see fig.2. As you can see, here it is not as clear as in fig.1 that in Spain, the death# has been rising faster than in Italy (e.g. death# 5000, Spain passed on day 19, Italy on day 23).

For this report, I collected the data (source-files) on 30/3/2020.  This report is not daily updated (it's an ad-hoc analysis, with some manual work to 'shape' the data for my report).
The sources for my report were 3 open data sets (CSV-files), on Github:

-1: Italy
https://github.com/pcm-dpc/COVID-19
file: dpc-covid19-ita-andamento-nazionale.csv

offical source:
 http://opendatadpc.maps.arcgis.com/apps/opsdashboard/index.html#/b0c68bce2cce478eaac82fe38d4138b1

-2: Spain
https://github.com/datadista/datasets/tree/master/COVID%2019
file: ccaa_covid19_fallecidos_long.csv

offical source: https://covid19.isciii.es/

-3: Netherlands
URL: https://github.com/J535D165/CoronaWatchNL
file: rivm_corona_in_nl_fatalities.csv

offical source:: https://www.rivm.nl/nieuws/actuele-informatie-over-coronavirus

And see also fig.4 for the dashboards created with this data (by the IT/ES/NL governments).
And for a nice dashboard for the Italian data, in English, see:

https://covidashit.herokuapp.com/

which was made by Fabrizio Miano .

For the 'meta-data' of these datasets, see:
https://dataverse.harvard.edu/dataverse/covid-19-eu
by Vyacheslav Tykhonov #4tikhonov , who told me he can need some help to create an international standard for COVID-19 open data, so if you think you can be of help, please contact him.

NB: I also considered using the open data of Johns Hopkins University, see:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
which COVID-19 dashboard is the most well-known I think, see:
https://coronavirus.jhu.edu/map.html
because it has the numbers of all countries in1 file. But when I compared the data of the 3 above official sources with that of JHU, I saw that JHU had for Spain, for 28/3 another death# than the official sources (5982 vs 5690), see also fig.5, which shows some sample lines of each source-file (IT, ES, NL and JHU).
For more details about the different COVID-19 sources and a comparison, see:
https://ourworldindata.org/covid-sources-comparison

Besides, the JHU CSV-file has the dates in the column header, so every day the file has one column more, which makes it more difficult to process (better is to have for each day a new record (line in the file) and column# unchanged, as the other sources have.

Because each dataset had a different format, step 1 in my Power BI report was to edit the M-query for importing the CSV-files in such a way that all 3 datasets had the same columns (see fig.3).

For the datamodel, I duplicated the tables with the source data, so I could create 2 parallel schema's, one where the tables were connected via the Calendar-Date and the other one with the Relative-Date (Day-nr) as the 'pivot', see fig.6.

I used 2 DAX-formulas for #new deaths (delta), depending on if this metric is for the relative date or for the absolute date (= calendar date):

*1: relative date:

New Deaths NL = 
VAR RowAbove=
CALCULATE (
SUM (COVID19_Deaths_NL2[Total Deaths NL]);
FILTER (COVID19_Deaths_NL2; COVID19_Deaths_NL2[Day] = EARLIER (COVID19_Deaths_NL2[Day]) - 1))
RETURN
COVID19_Deaths_NL2[Total Deaths NL] - RowAbove

NB: DAX I found here:
https://www.edureka.co/community/47018/how-calculate-difference-between-consecutive-rows-columns
And later I learned that this is a 'DAX-pattern', see this page of the Italian DAX/PowerBI-maestros Alberto Ferrari and Marco Russo: https://www.sqlbi.com/articles/comparing-with-previous-selected-time-period-in-dax/

*2: calendar date:

New Deaths NL = SUM(COVID19_Deaths_NL[Total Deaths NL]) - CALCULATE(SUM(COVID19_Deaths_NL[Total Deaths NL]);DATEADD(COVID19_Deaths_NL[Date];-1;DAY))

I am a 'basic-user' of PowerBI, with little DAX-knowledge, but thanks to the things I learned in a recent (free) webinar by #Salvador Ramos, author of https://www.elfuturodelosdatos.com/ , I could solve some issues I run into when creating my report. (e.g. duplicate values because of an error in entity-relations). Salvador gives these free webinars as long as the lock-down in Spain takes, and I can really recommend it.

I posted my PowerBI report also here:

https://community.powerbi.com/t5/Data-Stories-Gallery/Comparison-of-Coronavirus-COVID-19-deaths-in-Italy-Spain/m-p/1007632#M3666

In this Power BI community there is also a 'data-stories-gallery' for COVID-19 reports, see:

https://community.powerbi.com/t5/Data-Stories-Gallery/bd-p/DataStoriesGallery?sortby=postdate&filter=covid-19

and in fig.7 you can see some of the latest post (incl.mine).

And to conclude, here some interesting things I found when preparing this report/blog-post:

- a video of someone who made a really good PowerBI COVID-19 report:
https://youtu.be/qWHIJikCw-8

- a (Spanish) video which explains the 'virus-curve' using Italy and Spain COVID-19 data:
https://youtu.be/9LWrr5aeSn4
https://youtu.be/hbrNF6thC-A

-  alternative way to visualize the 'virus-curve' :  https://youtu.be/54XLXg4fYsc


FIGURES:


fig.1 report with #Deaths in IT, ES, NL with on x-axis: day(-nr) after #25 deaths



fig.2 report with #Deaths in IT, ES, NL with on x-axis: calendar-date




fig.3: M-Queries to transform CSV file to standard format



fig.4: #Deaths reported on 30/3/2020 in the official publications of Governments of IT, ES, NL



fig.5: COVID-19 comparison sources (CSV files on Github)



fig.6: datamodel


fig.7: Power BI community 'data-stories-gallery' for COVID-19 reports



Downloads

Mirror #1

http://tiny.cc/m10dmz


No comments: