Data introduction
This classic dataset contains the prices and other attributes of almost 54,000 diamonds. There are 10 attributes:
- price: price in US dollars (\$326--\$18,823)
- carat: weight of the diamond (0.2--5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)
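
As a minimal sketch of how to obtain the data, assuming the copy of the classic diamonds dataset bundled with seaborn (a local `diamonds.csv` would work the same way):

```python
import seaborn as sns

# Load the classic diamonds dataset; seaborn ships a copy of it.
# Equivalently: pd.read_csv("diamonds.csv") for a local file.
diamonds = sns.load_dataset("diamonds")

print(diamonds.shape)  # roughly (53940, 10)
print(diamonds.head())
```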


Data description



We can get some summary statistics for an overall view of the dataset. The dataset contains no null values, so there is no missing data to handle.
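
A minimal sketch of these checks, assuming the DataFrame loaded above is named `diamonds`:

```python
# Summary statistics for an overall view of the numeric columns
print(diamonds.describe())

# Verify that no column contains null values
print(diamonds.isnull().sum())
```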

We can also check how the data are distributed across each categorical feature.
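
For example, tabulating the categorical columns (a sketch; the column list is taken from the attribute descriptions above):

```python
# Tabulate how the observations are distributed across each category
for col in ["cut", "color", "clarity"]:
    print(diamonds[col].value_counts(), "\n")
```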
Data visualization




From the total price distribution for each categorical feature, we can see that the Premium cut, J color, and SI2 clarity categories are associated with higher total diamond prices.
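
A sketch of how such a figure can be produced (the exact plotting style of the original is unknown; grouping and summing price per category is the assumption here):

```python
import matplotlib.pyplot as plt

# Total price summed within each level of the three categorical features
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["cut", "color", "clarity"]):
    diamonds.groupby(col)["price"].sum().plot(kind="bar", ax=ax)
    ax.set_ylabel("total price (USD)")
plt.tight_layout()
plt.show()
```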




From the count distribution for each categorical feature, we can see that Ideal cut, G color, and SI1 clarity are the most common categories in the dataset.
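
A corresponding count plot can be sketched like this (again assuming the `diamonds` DataFrame):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of diamonds in each level of the three categorical features
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["cut", "color", "clarity"]):
    sns.countplot(data=diamonds, x=col, ax=ax)
plt.tight_layout()
plt.show()
```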


The correlation heatmap shows that carat has the highest correlation with price (0.92), while table and depth are almost uncorrelated with price.
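
A minimal sketch for this heatmap, restricted to the numeric columns so the correlation matrix is well defined:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only
corr = diamonds.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```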


The pairplot shows that the carat and x features have the most apparent relationships with price.
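
A sketch for the pairplot; the random subsample is an illustrative choice to keep rendering fast, not part of the original analysis:

```python
import seaborn as sns

# Pairwise scatter plots of the numeric features
# (sampling 1,000 rows keeps the plot quick to draw)
sns.pairplot(diamonds.sample(1000, random_state=0))
```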
Data processing

I converted categorical data into dummy variables.
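
A minimal sketch using pandas, assuming the three categorical columns named above:

```python
import pandas as pd

# One-hot encode the three categorical features into dummy columns
df = pd.get_dummies(diamonds, columns=["cut", "color", "clarity"])
print(df.columns.tolist())
```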



I used a random forest to find the 10 most important features. From the figure above, carat; y (width); the SI2, SI1, VVS2, VVS1, and IF clarity levels; and the J, I, and H color levels are the 10 features most related to price. I used these 10 features as X and price as Y.
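
A sketch of this step, assuming the encoded DataFrame `df` from the previous block; the hyperparameters (`n_estimators=100`, `random_state=42`) are illustrative assumptions, so the resulting ranking may differ slightly from the original run:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on all encoded features and rank them by importance
X_all = df.drop(columns=["price"])
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_all, df["price"])

top10 = (
    pd.Series(rf.feature_importances_, index=X_all.columns)
    .sort_values(ascending=False)
    .head(10)
)
print(top10)

# Keep only the 10 most important features as X, and price as Y
X = X_all[top10.index]
Y = df["price"]
```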
Data splitting

I first split off 20% of the data as the test set, keeping the remaining 80% for training. I then split 25% of that training data off as a validation set, leaving the rest to train the models. The validation data is used to evaluate the models during development, and the test data serves as unseen data to check whether the models generalize.
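
A sketch of this two-stage split with scikit-learn, assuming the `X` and `Y` from the previous step (`random_state=42` is an illustrative assumption):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the unseen test set
X_train_full, X_test, Y_train_full, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Split 25% of the remaining 80% off as a validation set,
# leaving 60% of the original data for model training
X_train, X_val, Y_train, Y_val = train_test_split(
    X_train_full, Y_train_full, test_size=0.25, random_state=42
)
```

Overall this yields a 60/20/20 train/validation/test split.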