Introduction

This report outlines the process followed during an Exploratory Data Analysis (EDA) of the automobile dataset, as supplied in automobile.txt. The dataset is also available online at https://archive.ics.uci.edu/ml/datasets/Automobile.

The dataset contains data relating to 26 different attributes of 205 automobiles. The attributes are:

| Name | Description | Type | Units |
| --- | --- | --- | --- |
| symboling | insurance risk rating | Categorical | n/a |
| make | name of manufacturer | Categorical | n/a |
| fuel-type | type of fuel used | Categorical | n/a |
| aspiration | aspiration type | Categorical | n/a |
| num-of-doors | number of doors in words e.g. "four" | Categorical | n/a |
| body-style | style of car | Categorical | n/a |
| drive-wheels | drive wheels | Categorical | n/a |
| engine-location | position of engine | Categorical | n/a |
| fuel-system | fuel delivery system e.g. multi-point fuel injection | Categorical | n/a |
| engine-type | type of engine e.g. dual overhead camshaft ("dohc") | Categorical | n/a |
| num-of-cylinders | number of cylinders in engine in words e.g. "six" | Categorical | n/a |
| normalized-losses | average loss payment per insured vehicle per year | Numerical | USD |
| wheel-base | distance between centres of front and rear wheels | Numerical | inches |
| length | total length of vehicle | Numerical | inches |
| width | width of vehicle | Numerical | inches |
| height | height of vehicle | Numerical | inches |
| curb-weight | kerb/curb weight i.e. weight including fluids | Numerical | pounds |
| engine-size | engine capacity (total swept volume) | Numerical | cubic inches |
| bore | internal diameter of engine cylinders | Numerical | inches |
| stroke | swept length of cylinder (stroke) | Numerical | inches |
| compression-ratio | ratio of max cylinder volume to min cylinder volume | Numerical | n/a |
| horsepower | maximum engine power output | Numerical | horsepower |
| peak-rpm | maximum crankshaft rotational speed | Numerical | rpm |
| city-mpg | average fuel efficiency for city driving | Numerical | mpg |
| highway-mpg | average fuel efficiency for highway driving | Numerical | mpg |
| price | price of vehicle | Numerical | USD |

Missing Data

The original dataset contains a number of missing values (represented by the character ?). The locations of the missing values in the dataset can be seen in the visualisation in Figure 1.

Figure 1: Missing data visualisation

The accompanying table shows the number of null values in each column as an absolute number and as a percentage of the number of rows in the dataset.

| Column | Null values | Percentage null (%) | Approach |
| --- | --- | --- | --- |
| normalized-losses | 41 | 20.0 | drop (column) |
| num-of-doors | 2 | 0.98 | impute or drop (rows) |
| bore | 4 | 1.95 | drop (rows) |
| stroke | 4 | 1.95 | drop (rows) |
| horsepower | 2 | 0.98 | drop (rows) |
| peak-rpm | 2 | 0.98 | drop (rows) |
| price | 4 | 1.95 | drop (rows) |
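
A summary like the table above could be produced along the following lines. This is only a sketch: the inline sample stands in for automobile.txt and its rows are made up, and the names `null_summary`, `null_values` and `percentage_null` are illustrative rather than taken from the original analysis.

```python
import io

import pandas as pd

# Illustrative sample only; these are not rows from the real automobile.txt.
sample = io.StringIO(
    "make,num-of-doors,horsepower,price\n"
    "audi,four,102,13950\n"
    "mazda,?,64,6795\n"
    "renault,two,?,9295\n"
    "isuzu,four,70,?\n"
)

# Treat the '?' placeholder as a missing value while reading.
df = pd.read_csv(sample, na_values="?")

# Count nulls per column and express them as a percentage of the row count.
null_counts = df.isna().sum()
null_summary = pd.DataFrame({
    "null_values": null_counts,
    "percentage_null": (100 * null_counts / len(df)).round(2),
})

# Keep only the columns that actually contain missing values.
null_summary = null_summary[null_summary["null_values"] > 0]
print(null_summary)
```

Passing `na_values="?"` at read time means the rest of the analysis can rely on the usual pandas null-handling methods.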

As so much of the data in the normalized-losses column is missing, imputation would be difficult and is probably inappropriate. The best course of action is to drop the column (rather than to lose 20% of the rows in the dataset).

It seems likely that the number of doors for the two vehicles missing this value could be imputed from other characteristics in the dataset, notably the body-style. All the convertibles and hardtops in the dataset have two doors, whilst all the wagons have four. Hatchbacks are more likely to have two doors and sedans are more likely to have four. However, imputing data in this way can skew the outcomes of an EDA, so the rows with missing values have been dropped from the dataset instead.
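
The relationship between body-style and num-of-doors described above can be checked with a cross-tabulation. The sketch below uses a small made-up frame in place of the real data, and `door_counts` is an illustrative name.

```python
import pandas as pd

# Illustrative sample only; the real dataset has many more rows.
df = pd.DataFrame({
    "body-style": ["convertible", "sedan", "sedan", "wagon",
                   "hatchback", "hatchback", "sedan"],
    "num-of-doors": ["two", "four", "four", "four",
                     "two", "four", "two"],
})

# Rows are body styles, columns are door counts, cells are vehicle counts.
door_counts = pd.crosstab(df["body-style"], df["num-of-doors"])
print(door_counts)
```

A body style whose row has a zero in one column (as with convertibles and wagons in the report) would be a strong candidate for imputation; mixed rows such as sedans would not.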

The missing values in the remaining columns would be much more difficult to impute accurately, so the rows with null values in these columns have also been dropped from the dataset.

After dropping this missing data, we are left with a dataset containing information on 25 different attributes of 193 automobiles, with no missing values.
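
The two-step drop described in this section might look like the following. It is a minimal sketch on made-up values, assuming the missing entries have already been read as nulls; `cleaned` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; values are made up for the sketch.
df = pd.DataFrame({
    "normalized-losses": [np.nan, 164.0, np.nan, 158.0],
    "num-of-doors": ["two", "four", None, "four"],
    "bore": [3.47, 3.19, 3.03, np.nan],
    "price": [13495.0, 13950.0, 6795.0, 17710.0],
})

# Drop the sparsely populated column first, then any row that still has a null.
cleaned = (
    df.drop(columns=["normalized-losses"])
      .dropna(axis=0, how="any")
      .reset_index(drop=True)
)
print(cleaned.shape)
```

The order matters: dropping rows before the column would also discard every row whose only missing value is in normalized-losses.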

Data Cleaning

The first step in cleaning the data, after handling the missing data as described above, was to remove any duplicated rows, to avoid these skewing the results of the EDA. No duplicate rows were identified in the automobiles dataset.

Upon inspecting the data types of the remaining data, some of the types seemed like an unusual choice for representing the underlying data.

The num-of-doors and num-of-cylinders columns clearly represent integer values (a car may have 2 or 4 doors etc.). These columns were changed from object (string) values to integer values to better represent these numerical attributes.
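
One way to perform this conversion is with an explicit lookup from number words to integers. The mapping below is an assumption about which words may occur, not a list taken from the report, and the sample frame is made up.

```python
import pandas as pd

# Illustrative sample only.
df = pd.DataFrame({
    "num-of-doors": ["two", "four", "four"],
    "num-of-cylinders": ["four", "six", "twelve"],
})

# Assumed lookup; extend it if other number words appear in the data.
WORD_TO_INT = {
    "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "eight": 8, "twelve": 12,
}

for column in ["num-of-doors", "num-of-cylinders"]:
    converted = df[column].map(WORD_TO_INT)
    # An unmapped word would become NaN, so fail loudly instead of hiding it.
    if converted.isna().any():
        raise ValueError(f"unmapped value in column {column!r}")
    df[column] = converted.astype("int64")

print(df.dtypes)
```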

The horsepower, peak-rpm and price columns hold integer values in the raw data, but were converted to floats when the data was read with pandas. This is because these columns contained missing values, and NaN is a float in numpy and pandas. Once the rows with missing values had been dropped, the columns were cast back to integers to better represent the underlying raw data.
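
The float-to-integer behaviour can be reproduced on a small made-up frame. This is a sketch only; `raw`, `complete` and `int_columns` are illustrative names.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; NaN marks a value that was missing in the file.
raw = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0],
    "peak-rpm": [5000.0, 5500.0, 5000.0],
    "price": [13495.0, 16500.0, np.nan],
})
print(raw.dtypes)  # all three columns are float64 in this sample

# The cast to int64 only works once no NaN values remain.
int_columns = ["horsepower", "peak-rpm", "price"]
complete = raw.dropna()
complete = complete.astype({column: "int64" for column in int_columns})
print(complete.dtypes)
```

Calling `astype("int64")` on a column that still contains NaN raises an error, which is why the cast comes after the rows with missing values are dropped.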

The name of the car manufacturer “Alfa Romeo” was misspelled as alfa-romero in the original dataset. This spelling error was corrected for clarity.
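
A correction of this kind can be made with an exact-value replacement. The sketch below assumes the corrected label keeps the dataset's lowercase, hyphenated style (`alfa-romeo`); the report does not state the exact replacement string.

```python
import pandas as pd

# Illustrative sample only.
df = pd.DataFrame({"make": ["alfa-romero", "audi", "alfa-romero"]})

# Replace whole values only; other makes are left untouched.
df["make"] = df["make"].replace({"alfa-romero": "alfa-romeo"})
print(df["make"].unique())
```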

Data Exploration

Symboling data – relative risk to insure each vehicle

The distribution of the symboling data in the dataset, as seen in Figure 2, is skewed towards positive symboling numbers. As this symboling data represents the relative risk associated with insuring each vehicle, this result indicates that insurance companies are more likely to rate vehicles listed in this dataset as high risk, rather than low risk. It is possible that risky vehicles are overrepresented in the dataset or that insurance companies are simply cautious.

Figure 2: Distribution of symboling (relative insurance risk) data

The box plot in Figure 3 indicates that there is a clear relationship between the body-style of a vehicle and its symboling value. Convertibles and hardtops are rated as being very risky to insure by insurance companies, likely because they are often driven faster and more recklessly.

Hatchbacks are also rated as relatively risky. Only wagons have a median symboling value of less than 0 (where 0 represents average insurance risk), likely because they are usually family vehicles, used to transport children, and are therefore driven more carefully.

Figure 3: Relationship between symboling and body-style

Make

The bar plot in Figure 4 shows the number of vehicles of each make (manufacturer) in the dataset. From this, we can see that Japanese export vehicles are the most common in the dataset: the five most common makes (Toyota, Nissan, Mitsubishi, Subaru and Mazda) are all from Japan. European export vehicles are the next most common.

Figure 4: Vehicle make sorted by frequency in the dataset

From the visualisation in Figure 5, we can see that whilst Japanese vehicles are the most common in the dataset, European exports are the most expensive. These brands are viewed as luxury and charge a premium for their vehicles.

Interestingly, American-made vehicles are at the bottom of the list in terms of price in the dataset. This is likely because the dataset is from an American source and the cost of import inflates the prices of the non-American vehicles relative to the American ones.

Figure 5: Mean vehicle price by make sorted by price

In order to investigate this relationship further, we could do some (very simple) feature engineering to include the country of origin of each vehicle in the dataset.
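
A minimal sketch of such a feature is shown below. The `MAKE_TO_COUNTRY` lookup is hypothetical and deliberately partial, the `country` column name is an assumption, and the prices are made up for illustration.

```python
import pandas as pd

# Hypothetical, partial lookup; a real one would need to cover every make.
MAKE_TO_COUNTRY = {
    "toyota": "Japan", "nissan": "Japan", "mazda": "Japan",
    "bmw": "Germany", "volvo": "Sweden", "jaguar": "UK",
    "dodge": "USA", "chevrolet": "USA",
}

# Illustrative sample only; prices are not taken from the dataset.
df = pd.DataFrame({
    "make": ["toyota", "bmw", "dodge", "jaguar", "toyota"],
    "price": [8000, 30000, 7000, 35000, 10000],
})

# Derive the new feature, then compare mean price across countries.
df["country"] = df["make"].map(MAKE_TO_COUNTRY)
mean_price_by_country = (
    df.groupby("country")["price"].mean().sort_values(ascending=False)
)
print(mean_price_by_country)
```

Any make missing from the lookup would map to NaN and be excluded from the grouped means, so it is worth checking `df["country"].isna().sum()` before drawing conclusions.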

Fuel type

From the pie chart in Figure 6, we can see that the vast majority of vehicles in the dataset use petrol (gasoline), rather than diesel, as their fuel.

Figure 6: Proportion of vehicles in the dataset by fuel-type

Interestingly, even though there are relatively few diesel vehicles in the dataset, we can see from the visualisation in Figure 7 that there is a marked difference between the typical compression-ratio for a petrol engine and a diesel engine.

The mean compression-ratio for petrol engine vehicles in the dataset is around 8.85, but for diesel vehicles, the mean compression-ratio is approximately 21.97. This is nearly 2.5 times the mean value for the petrol vehicles in the dataset.
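
Group means like these can be computed with a groupby. The sketch below uses made-up values, so its output will not match the figures quoted above, and it assumes the fuel-type column uses the labels `gas` and `diesel`.

```python
import pandas as pd

# Illustrative sample only; values are not taken from the dataset.
df = pd.DataFrame({
    "fuel-type": ["gas", "gas", "gas", "diesel", "diesel"],
    "compression-ratio": [9.0, 8.5, 9.5, 21.0, 23.0],
})

# Mean compression-ratio per fuel type, and the diesel-to-petrol ratio.
mean_cr = df.groupby("fuel-type")["compression-ratio"].mean()
ratio = mean_cr["diesel"] / mean_cr["gas"]
print(mean_cr)
print(round(ratio, 2))
```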

This difference reflects the way the two engine types ignite their fuel. A diesel engine has no spark plug and relies on compression alone to heat the air in the cylinder enough to ignite the fuel, so it needs a high compression-ratio. A petrol engine ignites the fuel-air mixture with a spark and runs at a much lower compression-ratio. In fact, an excessive compression-ratio can cause ‘knocking’ (autoignition of the fuel-air mixture at the wrong time in the engine cycle) in petrol engines.

Figure 7: Distribution of compression-ratio by fuel-type

Looking at Figure 8, we can draw two conclusions. Firstly, there is a strong positive correlation between engine-size and horsepower, which is intuitive. Secondly, and much less obviously, diesel-powered vehicles tend to need a larger engine than petrol vehicles to achieve the same power output.

Figure 8: Pair plot showing the relationships between engine-size and horsepower by fuel-type

Price

It has already been shown that the make of a vehicle and its country of origin have an impact on its price. Some other features which seem to have a relationship with price are identified below.

Figure 9 shows the mean price for vehicles of each body-style category. A vehicle’s body-style seems to have some impact on its price. Note that this chart does not reflect the hierarchy of symboling (insurance risk) by body-style as described above, which suggests (as previously noted) that a vehicle’s price is not the only deciding factor in determining insurance risk.

Figure 9: Bar plot showing mean price by body-style category

Figure 10 shows the mean price of vehicles in the dataset grouped by their drive-wheels category. We can see that front-wheel-drive vehicles are the least expensive, probably because they have the least complex drivetrain. Almost all the vehicles (190 of 193) have their engine at the front, so in a front-wheel-drive vehicle the power can be transferred from the engine to the driven wheels with a simpler drivetrain.

Figure 10: Bar plot showing mean price by drive-wheels category

Figure 11 shows that the different engine-type categories have different average prices. This is likely due to the relative complexity (and therefore cost) of building each type of engine.

Figure 11: Bar plot showing mean price by engine-type category
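
The grouped means behind bar plots such as those in Figures 9 to 11 could be computed with a small helper. This is a sketch under stated assumptions: `mean_price_by` is a hypothetical function name, the category labels and prices are made up, and plotting is left out.

```python
import pandas as pd


def mean_price_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean price per category of `column`, sorted ascending."""
    return frame.groupby(column)["price"].mean().sort_values()


# Illustrative sample only; prices are not taken from the dataset.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "4wd"],
    "body-style": ["hatchback", "sedan", "sedan", "wagon"],
    "price": [6000, 8000, 20000, 9000],
})

by_drive = mean_price_by(df, "drive-wheels")
by_body = mean_price_by(df, "body-style")
print(by_drive)
print(by_body)
```

Each resulting Series could then be passed to a plotting call such as `by_drive.plot(kind="bar")` to draw the chart.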

The pair plot in Figure 12, showing the relationships between a vehicle’s price, horsepower and highway-mpg, shows clearly that there is a strong positive correlation between a vehicle’s horsepower and its price. There is a strong negative correlation between highway-mpg (the vehicle’s fuel-efficiency on the highway) and its price. More powerful vehicles tend to have larger, more complex engines, and are sometimes turbocharged, all of which contributes to their increased price (and to decreased fuel-efficiency on the highway).

Figure 12: Pair plot showing relationships between price, horsepower, and highway-mpg values
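
The strength of relationships like those read from the pair plot can be quantified with a correlation matrix. The sketch below uses made-up values, so the coefficients it prints say nothing about the real dataset; `corr` is an illustrative name.

```python
import pandas as pd

# Illustrative sample only; values are not taken from the dataset.
df = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "highway-mpg": [40, 34, 30, 25, 20],
    "price": [6000, 8000, 12000, 18000, 30000],
})

# DataFrame.corr computes pairwise Pearson correlations by default.
corr = df[["price", "horsepower", "highway-mpg"]].corr()
print(corr.round(2))
```

Pearson correlation only measures linear association, so a visual check with the pair plot remains useful alongside the coefficients.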