Build a Relational Database For a Data Analyst Portfolio Project
Introduction
So you’re applying to Data analyst positions and you have everything. You may have a degree, certificates, passion projects, or even Internship experience. With each skill you add to your resume, you still come up short for some reason. You research Data Analyst job posts for the most valuable skill sets. You discover that one of the most sought-after skills is in SQL. Last time you checked you weren’t trying to become a Database Administrator. Why is this skill so sought after? Furthermore, what does it even mean to say you know “SQL”.
In this article, I am going to explain why a Data Analyst needs to know SQL. I will walk through a portfolio project I created to explain these topics. First, let’s start by asking ourselves what SQL is? SQL stands for Structured Query Language and is used to communicate with a database. It is a standard language for relational database management. SQL statements are used to perform tasks such as update or retrieve data from a database. There are also Database management systems (DMBS) that use software to manipulate databases.
Out of the various DBMS programs, we will use Microsoft SQL. Feel free to use whatever DBMS that you feel comfortable with. I will not cover the standard SQL commands such as “Insert’’, “Update”, “Delete”, “Create”, and “Drop”. In a DBMS, these commands are controlled through a Graphical User Interface (GUI ). The statements covered in this article will be the SELECT command (Query). The SELECT command extracts rows from a table, based on customizable selection rules.
The SELECT command allows you to gather metrics on your business or organization. These metrics create KPIs (Key Performance Indicators) the basis of making evidence-based decisions. This is why the analysis you will be tasked with starts with SQL. The Query that will help you answer a specific question may not have been developed yet. Once you have set up a query that meets your demands you can extract (ETL) it into a Data visualization program. You can then start the Data analytics processes you are more comfortable with.
To begin in the tutorial we are going to go over various steps of skills that reinforce everything. We will start by designing an Entity-relationship model with excel. Next, we will be generating Mock Entities With Mockaroo. Once we have our datasets we will import them into Microsoft SQL. After we import the data we will constrain the entity and relational integrity. Finally, we will join 2 tables together from our created relational database. After that, you should have all the basics that most employers are looking for. You can finally say you know “a little bit of SQL…”
- Design a Database with Excel
- Generate Mock Entities With Mockaroo
- Create and import Database and Datasets
- Entity and Relational Integrity
- Create An Inner Join
Designing a Database with Excel
Your first goal is to develop an idea for a database. Ideas can range from a Library Management, University Database, or even an inventory control. For the purposes of this tutorial, we will create a mock eCommerce store. We want a visualization that tells us a couple of things about the table itself. This is known as a Schema. The table must represent the Entity, Relational, and domain constraints.
An Entity is an object such as a person, place or thing. The table will show the attributes of that entity. attributes are the various properties of the entity. A matching of a foreign and primary key between tables is a relationship. Primary Keys are unique identifiers that can not be null or nonexistent. Make sure there are no recursive relationships, meaning a table related to itself. Finally, the domain constraints tell us the data type the attribute will be.
(Step 1) In excel go across Cells A1 to G1 for 7 tables. Customers, Orders, Shipping, Product, Category, Address, Payments.
(Step 2)In excel go down Cells A1 to AN to develop the Attributes of your tables.
(Step 3)In Excel select and copy the columnar representation of each table and paste it below.
(Step 4) Space out the tables to give yourself enough room.
(Step 5) Select three columns to the right of the selected column and choose a unique color for each table.
(Step 6) Select the colored bars including the title of the table and then select “merge and center” for each table.
(Step 7) Highlight 3 cell columns by n cell rows (amount of attributes) by the number of attributes and select “all borders”.
(Step 8) In the second column insert the data types for each attribute.
(Step 9) In the third column insert “PK” for the only primary key in each table, and “FK” for all the foreign keys in each table.
(Step 10) Leave a 1:M at the end of each foreign key that is one-to-many.
(Step 11) It should look like this:
Generating Mock Entities With Mockaroo
We are now going to create the entities of the database. The entities will be the row entries of each attribute or column. We will use Mockaroo to generate 1000 rows of data. Mockaroo is a tool that lets you generate data for testing applications. After we generate the data sets of each of the tables we will import the CSV files into Microsoft SQL.
(Step 1) Reference back to the conceptual model back in excel and begin constructing the tables off of it.
(Step 2)In Mockaroo you will see three columns; Field Name, Type, and Options.
(Step 3) In the Field Name column enter the attributes from your conceptual model.
(Step 4) In the “Type” column enter the data type that resembles the one in your conceptual model.
(Step 5) In the options column, you can set features based on data type and set a percentage of null values.
(Step 6) Select the “Preview” next to the “Download Data” to check for problems with the data.
(Step 7) Once you’re satisfied with the table you have created, select “Download Data”.
(Step 8) Store the files in a folder you are familiar with or create a folder to store the project.
(Step 9) Skim the CSV in excel to make sure you avoid any problematic delimiters. If commas are present when you export the data into Microsoft SQL it will merge columns together.
(Step 10) Go through this process until all table entities have been generated.
Create and import Database and Datasets using Microsoft SQL
Now we are going to import the CSV sheets we created into Microsoft SQL. The Flat files will then show up as tables in our database. We are also going to create a database to store the tables. We will use an encoding called UTF-8. It will act as a variable-width character encoding for the data we import. We will also use the data types for the importing of the data to Unicode string[DT_STR]. This method converts the data types to nvarchar(). We will correct the data types once every table is imported successfully.
(Step 1) When you open up MS SQL it will ask to connect to a server; Server Type, Servername, and Authentication. Server Type: “Database Engine”, Server, Name: “localhost”, Authentication: “Windows Authentication”.
(Step 2) In the Object Explorer Right-click on the folder called “Databases”, select the new database.
(Step 3) After you have named the database right-click and select tasks, Import Data.
(Step 4) Select next, then select the data source as a flat-file, select next again.
(Step 5) Select browse and in the file explorer select all files and select one of the tables.
(Step 6) Next, set the code page to UTF-8.
(Step 7) Make sure to select advanced on the left-hand side of the wizard and select all the columns.
(Step 8) Select the data type to be a Unicode string[DT_STR].
(Step 9) Set the output column width to be 450 and then select next.
(Step 10) Then we will select the destination to be SQL Server Native Client 11.0 and select next, then select finish.
(Step 11) In the object explorer select refresh and you should see the imported file as a table.
(Step 12) Continue these steps until you’ve added all tables.
The entity, Relational, and Domain Integrity
In this section, we will apply the entity, referential, and domain constraints. Entity integrity constrains the primary key to a value that can’t be null. Referential integrity constrains the foreign key to refer to its corresponding primary key. This means that every foreign key in one table must be either null or be available in the table that is matching it. Domain Integrity constrains the data type of each attribute.
(Step 1) Go to the upper right-hand corner and select tools, options, “Designers”.
(Step 2) Deselect the table option “prevent saving changes that require table re-creation”, select OK.
(Step 3) Select one of the tables you want to edit, select right-click, and then “Design”.
(Step 4) Go back to the excel document with your conceptual data model and match the data types with the data types in MS SQL.
(Step 5) You can now recreate tables allowing saved changes, go through all tables.
(Step 6) Right-click on the first attribute in each table and select it as a primary Key and uncheck allow nulls. Right-click on the left-most column, A right arrow should appear next to the column name.
(Step 7) Leave each foreign key as an integer data type and uncheck allow nulls.
(Step 8) In the column, properties set the identity specification in each table to yes. This establishes unique columns for numeric sequences based on Identity Seed and Increment.
(Step 9) Go to a table with foreign keys and right-click on the left-most column. A right arrow should appear next to the column name.
(Step 10) Select the option that says “Relationships” and a window should appear.
(Step 11) Select add and under general, then click on ellipses to the right on the row titled “Tables And Column Spec”.
(Step 12) Under the primary key table on the left side, select the table you are relating to in a drop-down option.
(Step 13) Below the table selection, select that table’s primary key.
(Step 14) Under the foreign key table on the right side, select the table you are designing in a drop-down option.
(Step 15) Below the table selection, select the related tables foreign key.
(Step 16) Select ok, save the relationship. If a window pops up asking about the changes select yes.
(Step 17) Go through all tables until all relationships are set.
Creating an ER diagram
An Entity-Relationship (ER) Diagram is a chart that shows how schemas are connected. It gives you the table title, the attributes, and the relationship cardinality. Relationship cardinality means whether a relationship is one-to-one, one-to-many, or many-to-many. The main takeaway is that we want to make sure that there are no recursive relationships. Make sure most tables are one-to-many.
(Step 1) In the object explorer, make sure your database is maximized and you can see its folders.
(Step 2) Right-click the second folder title “Database Diagrams” and select “New Database Diagram”.
(Step 3) Select yes to “This Database does not have one or more of the support objects required to use database diagramming. Do you wish to create them?”
(Step 4) If you get an error that says “Index was outside the bounds of the array.” you may have to restart Microsoft SQL or adjust the data types you are using.
(Step 5) Once you restart the program go through and select the new database diagram option.
(Step 6) When you select it it will show up and make sure you add all the tables you created.
(Step 7) When the ER diagram appears you can now see the relationships visually represented.
Creating An Inner Join
Most of the time as an analyst you are going to need to Query data out of a database using Joins. The JOIN clause in SQL is used to combine rows from two or more tables, based on a related column between them. For the purposes of this tutorial, we are going to focus on an INNER JOIN. an INNER JOIN selects records that have matching values in both tables. So if the other table has any null it will not include them.
(Step 1) In the object explorer, make sure your database is maximized and you can see its folders.
(Step 2) Select your database and right-click, then select create a new query.
(Step 3) Select all the data from the customer (or any table you created if you made something else) table.
(Step 4) Highlight the code and press the execute button, you should see your query return the data.
(Step 5) Now do the same for the address table or another table that’s related to the first table.
(Step 6) Erase the address query and type under the existing customer query.
(Step 7) Type the command for an inner join and the table you are connecting to.
(Step 8) Make sure you specify where you’re connecting the tables.
(Step 9) Highlight the query and press execute.
(Step 10) Remove the asterisk and specify the columns you want to see from the two tables.
(Step 11) Highlight the query and press execute.
Contact me
Email: Sancaa94@gmail.com
Portfolio: http://asketez.com/
GitHub: https://github.com/Sanca94
Kaggle: https://www.kaggle.com/sancaa94