Power BI - Art.2 : Collect and Transform the data

Oct 20, 2015 - by Vivian Lejeune

Collect and Transform the data

In this article, we will see how to use “Power BI Desktop” to access, import and transform data. Once this is done, the data will be available for the modeling step of this set of articles.

 

Introduction

As explained briefly during the article “Power BI – Introduction”, we want to create a dashboard to analyze some information about a Microsoft User Group and more specifically, the events that were organized by this group.

The data available can be grouped in two different kind of files:

  • An “Event” file

This file contains the list of events, the tracks and their categories that were presented during each event and the speaker who presented the tracks.

  • A set of “Attendee” files

These files, one by event, contain the list of attendee of each event.

As we can see on the first screenshot above, the “Event” file contains a lot of useful information but these are not really organized to be used easily:

  • The same column contains different kind of information like the “Value” one which contains information about the tracks and their speakers.
  • The “Category” column sometime contains information but not for all lines.

As someone trained to work with data, we could quickly see that it is not complicated to rework this data in order to create some analytical entities like “Event”, “Speaker” and “Track”. This work could be done in Excel in a few minutes and can even be automated with some macro knowledge and more times.

However, “Power BI Desktop” contains an easy to use, powerful and reusable component to help us in this task. This tool is “Power Query”.

 

Power Query

“Power Query” is available from the toolbar of “Power BI Desktop”.

As we can see, we can create a new query directly from a new data source or an already used data source and we can edit existing queries.

A query is the unit of work of “Power Query”. This is a reusable “piece of code” that can be “replayed” if the underlying data changes and need to be refreshed. It can also be extended by other queries in order to create a specific data processing. In the end, in “Power BI Desktop”, a query generates an “entity” that can then be used in the data modeling tool and the data visualization tool.

A query can be connected to nearly every existing data source from flat files, famous RDBMS, web data sources and specific applications (like Dynamic CRM, Salesforce, Zendesk, etc).

Once the connection to the data has been established, a set of transformation can be applied like

  • Filtering
  • Data conversion
  • Pivoting
  • Cleansing
  • Text transformation

As soon as we are happy with the result of a query, we can execute it against the data source and load an entity of the data model.

Now that we know what “Power Query” is, we can start designing our Microsoft User Group analytical solution.

 

Import and transform our sample data

Source_Data Query

As we saw before, the “Event” file contains a lot of useful information but most importantly, these ones have to be derived from the data. By consequence, we are going to import it in a first query “as is” and then reference this query into others queries.

To import the “Event” file data, we will select the “Excel” source from the “Get Data” button of the “Power BI Desktop” toolbar and then browse to the “Event” file.

Once the file has been selected, the “Navigator” pops up and we have to select which data (worksheet) we want to import in the query. In this case, we are going to select “SourceData_Final” and see a preview of the data that will be imported.

Clicking on the “Load” button will load and create an entity named “SourceData_Final” in the data model. We can see it in “Power BI Desktop” in the “Fields” part of the data visualization tool.

We can easily edit this query by clicking on the “Edit Queries” button of the “Power BI Desktop” toolbar. This action launch the “Power Query” windows.

This windows is divided in several interesting parts:

  1. The transformation toolbar provides access to all transformations available in the tool. However, these transformations are also available through the contextual menu that can be used in the preview pane.
  2. The “Query” pane gives access to all the queries defined in the “Power BI Desktop” file.
  3. The “Query Settings” pane lists and gives access to all the transformation steps of selected query. Double clicking on a step opens up the step’s setting if available. Each step is in reality an “M” instruction.
  4. The “Formula” pane gives access to the “M” instruction’s code of the selected step.
  5. The “Preview” pane previews the result of all the steps executed until the currently selected one. As said earlier, it also gives access to transformations through the contextual menu of the tool.

The only thing we have to do in this windows for the moment is renaming the “SourceData_Final” query in “Source_Data” by changing the name of the query in the “Query Settings” pane.

 

Event query

Now that the “Source_Data” query is loaded in the tool, we can reference it in order to create the “Event” entity by doing:

  1. Right-click on the “Source_Data” query in the “Query” pane and select “Reference” to create a new query based on the “Source_Data” query.

    Avoid accessing the same data source multiple times is a best practice and this is exactly what we are doing here because when the queries will be refreshed, only the “Source_Data” query will access the “Event” file and the others will work from the data already loaded into “Power BI Desktop”.

  2. Change its name to “Event” in the “Query Settings” pane.
  3. Select the “Attribute”, “Value” and “Category” columns in the “Preview” pane.
  4. In the toolbar, click on “Remove Columns” 
  5. Select the “Date” column, right-click on it, select “Change Type” and then “Date” to convert the column to the “Date” data type instead of “Date/Time”.
  6. Select the “Event Name” column, right-click on it and select “Remove Duplicates” in order to get a distinct list of value in the “Event” query. This query is now finished and should look like this: 

 

Speaker Query

We can now create the “Speaker” query by following these steps:

  1. Right-click on the “Source_Data” query in the “Query” pane and select “Reference” to create a new query based on the “Source_Data” query.
  2. Change its name to “Speaker” in the “Query Settings” pane.
  3. Select the “Attribute” column, click on the arrow beside the column’s name to open the “filter” dialog box and then select “Text Filters” followed by “Begins With…” 
  4. In the “Filter Rows” dialog bow, type “Speaker” in the input box and click “Ok” to filter the rows. 
  5. Select the “Value” columns, right-click on it and select “Remove Other Columns” to only keep this column.
  6. You can now see that this column sometimes contains multiple, comma separated, value. We will split these values by selecting the “Value” column, right-click on it and select “Split Column” followed by “By Delimiters…” Simply validate the dialog box to split the comma separated values in different columns. 
  7. Select the “Value.1”, “Value.2” and “Value.3”, right-click on them and select “Unpivot Columns” to transform these 3 columns into one. 
  8. Delete the “Attribute” column.
  9. Select the “Value” column, right-click on it and select “Replace Values…”. In the dialog box, specify “Mr” in the “Value To Find” and nothing in the “Replace With”. 
  10. Repeat the previous operations but specify “.” in the “Value To Find” input field.
  11. Select the “Value” column, right-click on it and select “Transform” followed by “Trim” to remove spaces in the beginning and the end of the different column values.
  12. Select the “Value” column, right-click on it and select “Remove Duplicates”.
  13. Rename the “Value” column in “Speaker” and sort it in ascending order. The final query should look like this: 

 

Track Query

The “Track” Query is nearly the same as the “Speaker” query. The following steps enables you to create it:

  1. Right-click on the “Source_Data” query in the “Query” pane and select “Reference” to create a new query based on the “Source_Data” query.
  2. Change its name to “Track” in the “Query Settings” pane.
  3. Select the “Attribute” column, click on the arrow beside the column’s name to open the “filter” dialog box and then select “Text Filters” and “Begins With…”
  4. In the “Filter Rows” dialog bow, type “Track” in the input box and click “Ok” to filter the rows.
  5. Select the “Id” column, right-click on it and select “Duplicate Column”.
  6. Convert the newly created column to “Text” by selecting it, right-clicking on it and select “Change Type” followed by “Text”.
  7. Select the “Attribute” column, right-click on it and select “Replace Values…”. In the dialog box, type “Track “ in the “Value To Find” input field and nothing in the “Replace With” input field.
  8. Add a new column by clicking on the “Add Custom Column” from the toolbar. Introduce the following information in the “Add Custom Column” dialog box:
    • New column name: TrackId
    • Custom column formula: [#"Id - Copy"] & "-" & [Attribute] 
  9. Remove the “Attribute”, “Value” and “Id - Copy” columns.
  10. Rename the “Id” column in “EventId” and the “Value” column in “Track”
  11. The query is now finished and should look like this: 

 

Attendee Query

As said at the beginning of this article, the attendee data are stored in a collection of files sharing the same structure. We could consolidate manually all these into a single one and then process it with the tool. However, there is a simpler and more flexible way to achieve this task:

  1. Click on the “New Source” button from the toolbar and select “More…”.
  2. In the “Get Data” dialog box, select “Folder” and click on “Connect”.
  3. Select the folder containing the attendee files and click on “Ok”. Once this is done, the tool displays some metadata about the files contained in the folder. 
  4. Rename the query from “Query1” to “Attendee”
  5. As you can see on the screenshot, all the attendee files share the same name convention. To ensure that only those file will be processed by the query, we will filter them to include only the attendee files by selecting the “Name” column, click on the arrow beside the column’s name to open the “filter” dialog box followed by select “Text Filters” and “Begins With…”. In the dialog box, type “Attendee” in the input field and validate.
  6. In order to get the data from the different files, click on the “Add Custom Column” and type the formula “Excel.Workbook([Content])” to get access to the table contained in each file. 
  7. Click on the button beside the “Custom” column and select the “Data” item. A new “Custom.Data” column is added. 
  8. Click on the button beside the “Custom.Data” column and click on “OK” to import the data from all the attendee files in the tool in new columns. 
  9. Select the “Custom.Data.Column1”, “Custom.Data.Column2” and “Custom.Data.Column3”. Right click on the columns and select “Remove Other Columns”.
  10. In the toolbar, click on the “Use First Row As Header” to promote the first line of data as header of the columns.
  11. Now if we scroll down the data in the “Preview” pane, we can see that the header lines of each attendee file is present in the data. We then need to filter these lines by selecting the small arrow beside the “Attendee” column, “Text Filters” followed by “does not equal”. In the dialog box, set the value “Attendee” and then validate.
  12. Change the type of the “EventDate” column to “Date”. The final query should look like this: 

 

Conclusion

In this article, we used “Power Query” to access, transform and load information from multiple data sources in entities that will then be available in the data modeling tool. In the next article, we will see how we can create an analytical data model from the entities we have just created.

0 commentaires

Laisser un commentaire