Image Classification and Recommender System

Introduction:

This project performs image classification on Williams Sonoma's products, then builds a recommender system based on either a product image or a product title.

The title-based recommender system resembles most e-commerce recommender systems, which are largely text-based and typically rely on a knowledge base and keyword matching. However, that approach requires online shoppers to describe products in words, and descriptions can vary greatly between sellers and buyers. The image-based recommender system aims to shift the traditional search paradigm from text description to visual discovery. The application of image matching using artificial intelligence to online shopping remains largely unexplored, and it would be particularly beneficial for a company like Williams Sonoma, where around 70% of total revenue comes from e-commerce.

Project Flowchart

In [1]:
# General imports
import os
import glob
import urllib.request
import warnings
warnings.simplefilter('ignore')
from io import BytesIO

import requests
import numpy as np
import pandas as pd
import cv2
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

# Keras (TensorFlow backend)
from keras.applications import vgg16
from keras.applications.vgg16 import VGG16
from keras.applications.imagenet_utils import preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
Using TensorFlow backend.

Data Cleaning

In [2]:
#set working directory
path = 'C:/Users/linli/Desktop/In progress project/WSI/product_file/'
os.chdir(path)

#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)

#combine all files in the list
combined = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined.to_csv( "combined.csv", index=False, encoding='utf-8-sig')
In [3]:
data=pd.read_csv("combined.csv")
In [4]:
data
Out[4]:
image title category
0 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Williams Sonoma Pantry Cereal/Soup Bowl, Set of 6 bowl
1 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Melamine Mixing Bowls with Spout, Set of 3 bowl
2 https://assets.wsimgs.com/wsimgs/rk/images/dp/... 10-Piece Glass Mixing Bowl Set bowl
3 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Open Kitchen by Williams Sonoma All Purpose Bowls bowl
4 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Flour Shop Melamine Mixing Bowls with Lids, Se... bowl
... ... ... ...
604 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Williams Sonoma Signature Nonstick Burger Spatula spatula
605 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Open Kitchen by Williams Sonoma Beechwood Angl... spatula
606 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Open Kitchen by Williams Sonoma All Nylon Turn... spatula
607 https://assets.wsimgs.com/wsimgs/rk/images/dp/... HARRY POTTER™ RAVENCLAW™ Spatula spatula
608 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Williams Sonoma Flex Core Mini Spatulas, Set o... spatula

609 rows × 3 columns

In [5]:
data=data.reset_index()
data=data.rename(columns={"index":"PID"})
data.head(5)
Out[5]:
PID image title category
0 0 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Williams Sonoma Pantry Cereal/Soup Bowl, Set of 6 bowl
1 1 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Melamine Mixing Bowls with Spout, Set of 3 bowl
2 2 https://assets.wsimgs.com/wsimgs/rk/images/dp/... 10-Piece Glass Mixing Bowl Set bowl
3 3 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Open Kitchen by Williams Sonoma All Purpose Bowls bowl
4 4 https://assets.wsimgs.com/wsimgs/rk/images/dp/... Flour Shop Melamine Mixing Bowls with Lids, Se... bowl

Exploratory Data Analysis

In [6]:
# Checking the unique observations, datatype & null values for every feature
d = {"Feature":[i for i in data.columns],"Number of unique entry" :data.nunique().values ,'Type' : data.dtypes.values, "Missing values" : data.isnull().sum() }
description = pd.DataFrame(data = d)
description
Out[6]:
Feature Number of unique entry Type Missing values
PID PID 609 int64 0
image image 604 object 0
title title 601 object 0
category category 10 object 0
In [7]:
pd.DataFrame(data['category'].value_counts())
Out[7]:
category
coffeemaker 80
frypan 80
lighting 80
spatula 80
rug 75
glass 48
sofa 46
knife 44
chair 44
bowl 32
In [8]:
# Creating a plot to check class distribution
plt.figure(figsize=(8,6)) # creating an empty figure
count_classes = pd.value_counts(data['category'], sort=True)
ax = count_classes.plot(kind='bar', rot=0)
plt.title("Williams Sonoma product categories")
plt.xlabel("Class")
plt.ylabel("Frequency")
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.2, p.get_height()+6)) # adding the count above each bar
plt.show()
In [9]:
def display_img(url):
    """
    This function takes an image URL, downloads the image, and displays it in the notebook.
    """
    # download the image at the given url
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    # display it in the notebook
    return plt.imshow(img)
In [10]:
display_img(data['image'][1])
print(data['category'][1])
bowl
In [11]:
display_img(data['image'][100])
print(data['category'][100])
coffeemaker
In [12]:
display_img(data['image'][200])
print(data['category'][200])
frypan
In [13]:
display_img(data['image'][300])
print(data['category'][300])
knife
In [14]:
plt.rcParams['figure.figsize'] = (10,10)
plt.style.use('fast')

wc = WordCloud(background_color = 'green', width = 1500, height = 1500).generate(' '.join(data['title'].astype(str))) # join all titles into one string; str(Series) would only keep the truncated repr
plt.title('Description of the product titles', fontsize = 20)

plt.imshow(wc)
plt.axis('off')
plt.show()
In [15]:
# One-time download of every product image to disk (commented out after the first run)
#for idx, row in data.iterrows():
#    url = row['image']
#    response = requests.get(url)
#    img = Image.open(BytesIO(response.content))
#    img.save('C:/Users/linli/Desktop/In progress project/WSI/product_file/'+ str(row['PID'])+'.jpg')

Part I: Image Classification

Idea: given a product photo uploaded by the customer, find the category that the product most likely belongs to. There are 10 categories in my dataset in total.

Method explained:
For the image classification task, I use the pre-trained convolutional neural network VGG16. Since my dataset is small, it is better to reuse the lower layers of a pre-trained model: doing so requires significantly less training data and speeds up training considerably.

VGG16 has a classical architecture: 2 or 3 convolutional layers followed by a pooling layer, then again 2 or 3 convolutional layers and a pooling layer, and so on, for a total of 16 weight layers (13 convolutional layers plus a final dense network with 2 hidden layers and the output layer). It uses only 3 by 3 filters, but many of them.

VGG16 Architecture
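As a quick check of that layer count, one can instantiate the architecture and tally its layers by type; a minimal sketch (weights=None skips the pre-trained weight download, since only the graph is needed):

from keras.applications.vgg16 import VGG16
from keras.layers import Conv2D, Dense

m = VGG16(weights=None)  # architecture only, random weights
n_conv = sum(isinstance(l, Conv2D) for l in m.layers)
n_dense = sum(isinstance(l, Dense) for l in m.layers)
print(n_conv, n_dense)   # 13 convolutional + 3 fully connected = 16 weight layers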

In [16]:
# function to download an image from its URL and convert it to a fixed-size array
def img_array(img):   
    """
    This function downloads the image at the given URL and returns it as a resized array.
    """
    response = urllib.request.urlopen(img)
    image = np.asarray(bytearray(response.read()), dtype="uint8") 
    image_bgr = cv2.imdecode(image, cv2.IMREAD_COLOR) # note: OpenCV decodes in BGR channel order
    image_bgr = cv2.resize(image_bgr, (224,224)) # resizing all images to the VGG16 input size
    return image_bgr
In [17]:
# Using the above function here to store all the images in the dataset into arrays
image_array=[]
for i in data['image']:
    image_array.append(img_array(i))
    
img_arr=np.array(image_array)
In [18]:
# Converting the response variable into numbers
category_map = {'coffeemaker': 0, 'spatula': 1, 'frypan': 2, 'lighting': 3, 'rug': 4,
                'glass': 5, 'sofa': 6, 'knife': 7, 'chair': 8, 'bowl': 9}
data['category'] = data['category'].map(category_map)
y=data['category'].astype(int)
In [19]:
X_train, X_test, y_train, y_test = train_test_split(img_arr, data['category'], test_size=0.2, random_state=748)
X_train= X_train/255 ## scale the raw pixel intensities to the range [0, 1]
X_test= X_test/255
In [20]:
# One-hot encode the class labels
from keras.utils import np_utils
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(487, 224, 224, 3)
(122, 224, 224, 3)
(487, 10)
(122, 10)
In [21]:
# define the transfer-learning CNN model
def define_model():
    # load the VGG16 convolutional base without its top classifier
    vgg_model = VGG16(include_top=False, input_shape=(224, 224, 3))
    # mark the loaded layers as not trainable (freeze them)
    for layer in vgg_model.layers:
        layer.trainable = False
    # add new classifier layers
    flat1 = Flatten()(vgg_model.layers[-1].output) # flattens the 7x7x512 feature maps into a 1-d vector of 25088 values
    class1 = Dense(128, activation="relu", kernel_initializer="he_uniform")(flat1)
    output = Dense(10, activation="softmax")(class1)
    # define our new image classification model
    model = Model(inputs=vgg_model.inputs, outputs=output)
    # compile the model
    opt = Adam(lr=0.001)
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
    return model
In [22]:
image_classification_model=define_model()
In [23]:
image_classification_model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 25088)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               3211392   
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
=================================================================
Total params: 17,927,370
Trainable params: 3,212,682
Non-trainable params: 14,714,688
_________________________________________________________________
In [24]:
history =image_classification_model.fit(X_train, y_train, batch_size=32, epochs=5, verbose=1, validation_split = 0.2)
Train on 389 samples, validate on 98 samples
Epoch 1/5
389/389 [==============================] - 237s 608ms/step - loss: 2.2405 - accuracy: 0.4422 - val_loss: 0.7042 - val_accuracy: 0.8163
Epoch 2/5
389/389 [==============================] - 231s 593ms/step - loss: 0.3277 - accuracy: 0.9229 - val_loss: 0.3830 - val_accuracy: 0.8980
Epoch 3/5
389/389 [==============================] - 230s 591ms/step - loss: 0.0851 - accuracy: 0.9846 - val_loss: 0.2953 - val_accuracy: 0.9388
Epoch 4/5
389/389 [==============================] - 240s 618ms/step - loss: 0.0310 - accuracy: 1.0000 - val_loss: 0.2311 - val_accuracy: 0.9490
Epoch 5/5
389/389 [==============================] - 257s 661ms/step - loss: 0.0194 - accuracy: 1.0000 - val_loss: 0.2323 - val_accuracy: 0.9184
In [25]:
def plot_accuracy_loss(history):
    """
        Plot the accuracy and the loss during the training of the neural network.
    """
    fig = plt.figure(figsize=(10,8))

    # Plot accuracy
    plt.subplot(221)
    plt.plot(history.history['accuracy'],'bo--', label = "accuracy")
    plt.plot(history.history['val_accuracy'], 'ro--', label = "val_accuracy")
    plt.title("Training and Validation accurarcy")
    plt.ylabel("accuracy")
    plt.xlabel("epochs")
    plt.legend()

    # Plot loss function
    plt.subplot(222)
    plt.plot(history.history['loss'],'bo--', label = "loss")
    plt.plot(history.history['val_loss'], 'ro--', label = "val_loss")
    plt.title("Training and Validation loss")
    plt.ylabel("loss")
    plt.xlabel("epochs")

    plt.legend()
    plt.show()
In [26]:
plot_accuracy_loss(history)
In [27]:
# Predict class probabilities for the test dataset
y_pred = image_classification_model.predict(X_test)
# Convert predicted probabilities to class labels
y_pred = np.argmax(y_pred,axis = 1) 
# Convert one-hot test labels back to class labels
y_test = np.argmax(y_test,axis = 1) 


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy : %.2f%%" % (accuracy*100.0))
Accuracy : 89.34%
In [28]:
# Confusion matrix for results
cm = confusion_matrix(y_test, y_pred) 

fig, ax= plt.subplots(figsize=(12,12))
sns.heatmap(cm, annot=True, cmap="Greens",linecolor="gray", ax = ax, fmt='g'); # annot=True to annotate cells. 'fmt' prevents the numbers from going to scientific notation

# labels, title and ticks
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['coffeemaker','spatula','frypan','lighting','rug','glass','sofa','knife','chair','bowl']); 
ax.yaxis.set_ticklabels(['coffeemaker','spatula','frypan','lighting','rug','glass','sofa','knife','chair','bowl']);
plt.show()
In [29]:
categories=['coffeemaker','spatula','frypan','lighting','rug','glass','sofa','knife','chair','bowl']
print(classification_report(y_test, y_pred, target_names=categories))
              precision    recall  f1-score   support

 coffeemaker       0.95      1.00      0.97        18
     spatula       0.76      1.00      0.86        16
      frypan       0.80      1.00      0.89        16
    lighting       0.94      0.88      0.91        17
         rug       1.00      0.89      0.94         9
       glass       1.00      0.91      0.95        11
        sofa       1.00      1.00      1.00         5
       knife       1.00      0.50      0.67        10
       chair       0.82      0.90      0.86        10
        bowl       1.00      0.70      0.82        10

    accuracy                           0.89       122
   macro avg       0.93      0.88      0.89       122
weighted avg       0.91      0.89      0.89       122

In [30]:
test_labels = y_test.tolist() # converting y_test into a list 

# Function that picks random test images and prints the class each one belongs to
def get_image_and_class(size):
  class_names = ['coffeemaker', 'spatula', 'frypan', 'lighting', 'rug',
                 'glass', 'sofa', 'knife', 'chair', 'bowl'] # index i = numeric label i
  idx = np.random.randint(len(X_test), size=size) # pick random indices from the test data
  for i in range(len(idx)):
    fig = plt.figure()
    fig.set_size_inches(5,5)
    plt.imshow(X_test[idx[i]]) 
    plt.show()
    # print the class of the random image picked above
    print('This is a {}!'.format(class_names[test_labels[idx[i]]]))
In [31]:
get_image_and_class(5)
This is a chair!
This is a lighting!
This is a knife!
This is a frypan!
This is a chair!

Part II: Recommender System

A: Image-based recommender system

Idea: given a photo's features and the category that the product belongs to, calculate similarity scores and find the most similar products in our database.

Method explained:

For the recommendation step, I use the last fully connected layer of the classification model as the source of image feature vectors. Every image in the dataset gets one corresponding feature vector, and this feature vector is the input to the recommendation model. The workflow of this step is shown in the following bullets, with a short sketch after them.

• Feature extraction: the classification model identifies which category the target image belongs to; the activations at its last fully connected layer are then taken as the image's features.

• Input of the model: the feature vector of the target image extracted above.

• Similarity calculation: cosine similarity between the feature vector of the target image and the feature vectors of all images in the target category measures how similar each image pair is. The larger the cosine similarity score, the more similar the two images.

• Output: the top k images (products) most similar to the target image.
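A minimal sketch of this workflow, assuming the Part I classifier is reused as the feature extractor by cutting it at its 128-unit hidden layer ("dense_1" in the model summary above); the implementation below takes the fc2 layer of an ImageNet-pretrained VGG16 instead:

feature_extractor = Model(inputs=image_classification_model.input,
                          outputs=image_classification_model.get_layer("dense_1").output)

target_vec = feature_extractor.predict(X_test[:1])  # feature vector of one target image
catalog_vecs = feature_extractor.predict(X_train)   # feature vectors of the catalogue

# larger cosine similarity = more similar images
scores = cosine_similarity(target_vec, catalog_vecs)[0]
top_k = scores.argsort()[::-1][:5]                  # indices of the 5 closest products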

The recommendation engine I build in this project is content-based: it looks at product features and recommends other similar products based on product attributes.

Content-based filtering

In [32]:
# load the model
vgg_model = vgg16.VGG16(weights='imagenet')

# remove the last layers in order to get features instead of predictions
feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

# print the layers of the CNN
feat_extractor.summary()
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
=================================================================
Total params: 134,260,544
Trainable params: 134,260,544
Non-trainable params: 0
_________________________________________________________________
In [33]:
# parameters setup
imgs_path  = "C:/Users/linli/Desktop/In progress project/WSI/product_file/"
imgs_model_width, imgs_model_height = 224, 224
nb_closest_images = 5 # number of most similar images to retrieve
In [34]:
files = [imgs_path + x for x in os.listdir(imgs_path) if "jpg" in x]
print("number of images:",len(files))
number of images: 609
In [35]:
# load all the images and prepare them for feeding into the CNN
importedImages = []

for f in files:
    filename = f
    original = load_img(filename, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    
    importedImages.append(image_batch)
    
images = np.vstack(importedImages)

processed_imgs = preprocess_input(images.copy())
In [36]:
# extract the images features

imgs_features = feat_extractor.predict(processed_imgs)

print("features successfully extracted!")
imgs_features.shape
features successfully extracted!
Out[36]:
(609, 4096)
In [37]:
# compute cosine similarities between images
cosSimilarities = cosine_similarity(imgs_features)
# store the results into a pandas dataframe
cos_similarities_df = pd.DataFrame(cosSimilarities, columns=files, index=files)
cos_similarities_df.head()
Out[37]:
C:/Users/linli/Desktop/In progress project/WSI/product_file/0.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/1.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/10.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/100.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/101.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/102.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/103.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/104.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/105.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/106.jpg ... C:/Users/linli/Desktop/In progress project/WSI/product_file/90.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/91.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/92.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/93.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/94.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/95.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/96.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/97.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/98.jpg C:/Users/linli/Desktop/In progress project/WSI/product_file/99.jpg
C:/Users/linli/Desktop/In progress project/WSI/product_file/0.jpg 1.000000 0.566342 0.589725 0.304929 0.283740 0.272410 0.363936 0.302487 0.307303 0.312110 ... 0.279317 0.251743 0.238592 0.336092 0.294794 0.481797 0.266188 0.232489 0.333702 0.278810
C:/Users/linli/Desktop/In progress project/WSI/product_file/1.jpg 0.566342 1.000000 0.486930 0.296348 0.279526 0.276505 0.397123 0.260545 0.249207 0.270370 ... 0.324476 0.362456 0.236665 0.324902 0.259335 0.356800 0.283941 0.253952 0.292422 0.282447
C:/Users/linli/Desktop/In progress project/WSI/product_file/10.jpg 0.589725 0.486930 1.000000 0.319486 0.336696 0.345366 0.464929 0.342896 0.252080 0.282110 ... 0.305154 0.328901 0.308729 0.342598 0.312655 0.398736 0.304422 0.284514 0.369625 0.280276
C:/Users/linli/Desktop/In progress project/WSI/product_file/100.jpg 0.304929 0.296348 0.319486 1.000000 0.776352 0.684804 0.600392 0.671952 0.429179 0.501297 ... 0.573602 0.657322 0.552224 0.338581 0.550519 0.319672 0.499267 0.597011 0.577676 0.503183
C:/Users/linli/Desktop/In progress project/WSI/product_file/101.jpg 0.283740 0.279526 0.336696 0.776352 1.000000 0.861654 0.622486 0.717644 0.490436 0.491441 ... 0.585652 0.705248 0.546135 0.381792 0.623394 0.355076 0.499025 0.640361 0.621840 0.495688

5 rows × 609 columns

In [38]:
# function to retrieve the most similar products for a given one

def image_based_recommendations(given_img):

    print("-----------------------------------------------------------------------")
    print("original product:")
    original = load_img(given_img, target_size=(imgs_model_width, imgs_model_height))
    fig = plt.figure()
    fig.set_size_inches(5,5)
    plt.imshow(original)

    plt.show()

    print("-----------------------------------------------------------------------")
    print("most similar products:")

    # sort once, skip the image itself (similarity 1.0), keep the next nb_closest_images
    closest_imgs_scores = cos_similarities_df[given_img].sort_values(ascending=False)[1:nb_closest_images+1]
    closest_imgs = closest_imgs_scores.index

    for i in range(0,len(closest_imgs)):
        original = load_img(closest_imgs[i], target_size=(imgs_model_width, imgs_model_height))
        fig = plt.figure()
        fig.set_size_inches(5,5)
        plt.imshow(original)
        plt.show()
        print("similarity score : ",closest_imgs_scores[i])
In [39]:
image_based_recommendations(files[1])
-----------------------------------------------------------------------
original product:
-----------------------------------------------------------------------
most similar products:
similarity score :  0.8836961
similarity score :  0.6794099
similarity score :  0.5835591
similarity score :  0.5663423
similarity score :  0.5550671
In [40]:
image_based_recommendations(files[100])
-----------------------------------------------------------------------
original product:
-----------------------------------------------------------------------
most similar products:
similarity score :  0.96511716
similarity score :  0.83996004
similarity score :  0.8389548
similarity score :  0.8268065
similarity score :  0.81590164

B: Text-based recommender system

Idea: if we are instead given a product title, calculate similarity scores between titles and find the most similar product titles in my database.

Method explained:

• Data preprocessing: when dealing with text data, we need to convert the text into numbers that a computer can understand. I convert each title into a numeric vector by computing its Term Frequency-Inverse Document Frequency (TF-IDF) representation.

What is Term Frequency-Inverse Document Frequency (TF-IDF)?

Term Frequency: a measure of the frequency of the word in the current document.

Inverse Document Frequency: a measure of how rare the word is across documents, which tells us how significant the term is among the documents.

The higher a term's TF-IDF value, the more frequent it is in the current title and the rarer it is across all titles.

• Calculation of cosine similarity between the product titles: since the TF-IDF vectorizer L2-normalizes its rows, the dot product of two rows directly gives the cosine similarity score. Therefore, we use sklearn's linear_kernel() instead of cosine_similarity(), since it is faster (see the small check after these bullets).

• Build the title-based recommender system: a function is created that takes in the product title and returns the top 10 recommendations. The similarity score is arranged in descending order, and results are given based on the score.
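A small self-contained check of that claim (the toy titles are made up for illustration): TfidfVectorizer L2-normalizes its output rows by default (norm='l2', with a smoothed IDF of ln((1+n)/(1+df))+1), so linear_kernel and cosine_similarity produce the same matrix.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_titles = ["glass mixing bowl set", "melamine mixing bowls", "nonstick fry pan"]
tm = TfidfVectorizer(stop_words='english').fit_transform(toy_titles)

# rows are unit-length, so the plain dot product already is the cosine similarity
print(linear_kernel(tm, tm))
print(cosine_similarity(tm, tm))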

In [41]:
data['title'].sample(5)
Out[41]:
570    Williams Sonoma Silicone Spoonula with Stainle...
489                                         Windsor Sofa
144    Nespresso Lattissima Touch by De'Longhi Espres...
96               Cuisinart Coffee On Demand Coffee Maker
411                                Braided Flatweave Rug
Name: title, dtype: object
In [42]:
#Import TfIdfVectorizer from scikit-learn
#from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(data['title'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape
Out[42]:
(609, 732)

There are 609 products and 732 unique words across the product titles.

In [43]:


# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
In [44]:
cosine_sim_table=pd.DataFrame(cosine_sim)
cosine_sim_table.columns=data['title']
cosine_sim_table.index=data['title']
cosine_sim_table
Out[44]:
title Williams Sonoma Pantry Cereal/Soup Bowl, Set of 6 Melamine Mixing Bowls with Spout, Set of 3 10-Piece Glass Mixing Bowl Set Open Kitchen by Williams Sonoma All Purpose Bowls Flour Shop Melamine Mixing Bowls with Lids, Set of 6 Reactive Glaze Cereal Bowls Japanese Garden Large Bowls, Mixed Palace Garden Mixed Noodle Bowls Melamine Pour Spout Bowls, Set of 3, Red Melamine Pour Spout Bowls, Set of 3, Grey ... Williams Sonoma Scoop & Spread Tool Williams Sonoma Open Kitchen Grey Silicone Utensils, Turner Williams Sonoma Stainless-Steel Slotted Turner/Spatula HARRY POTTER™ SLYTHERIN™ Ultimate Silicone Spatula Set Williams Sonoma Signature Stainless Steel Slotted Spatula Williams Sonoma Signature Nonstick Burger Spatula Open Kitchen by Williams Sonoma Beechwood Angled Spatula Open Kitchen by Williams Sonoma All Nylon Turner/Spatula HARRY POTTER™ RAVENCLAW™ Spatula Williams Sonoma Flex Core Mini Spatulas, Set of 2, Navy
title
Williams Sonoma Pantry Cereal/Soup Bowl, Set of 6 1.000000 0.050803 0.241058 0.152100 0.040861 0.207613 0.000000 0.000000 0.046583 0.049020 ... 0.138531 0.131101 0.163751 0.043031 0.161593 0.164337 0.133007 0.145692 0.000000 0.180857
Melamine Mixing Bowls with Spout, Set of 3 0.050803 1.000000 0.350303 0.144708 0.576495 0.139976 0.126896 0.122795 0.657234 0.691613 ... 0.000000 0.000000 0.000000 0.046421 0.000000 0.000000 0.000000 0.000000 0.000000 0.048609
10-Piece Glass Mixing Bowl Set 0.241058 0.350303 1.000000 0.000000 0.281746 0.000000 0.000000 0.000000 0.052077 0.054801 ... 0.000000 0.000000 0.000000 0.048106 0.000000 0.000000 0.000000 0.000000 0.000000 0.050373
Open Kitchen by Williams Sonoma All Purpose Bowls 0.152100 0.144708 0.000000 1.000000 0.116387 0.139054 0.126060 0.121987 0.132688 0.139628 ... 0.148459 0.438514 0.175487 0.000000 0.173174 0.176115 0.444888 0.487317 0.000000 0.145530
Flour Shop Melamine Mixing Bowls with Lids, Set of 6 0.040861 0.576495 0.281746 0.116387 1.000000 0.112581 0.102061 0.098763 0.319733 0.336458 ... 0.000000 0.000000 0.000000 0.037336 0.000000 0.000000 0.000000 0.000000 0.000000 0.039096
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Williams Sonoma Signature Nonstick Burger Spatula 0.164337 0.000000 0.000000 0.176115 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.160404 0.151801 0.310536 0.095772 0.528626 1.000000 0.252233 0.276289 0.115991 0.157239
Open Kitchen by Williams Sonoma Beechwood Angled Spatula 0.133007 0.000000 0.000000 0.444888 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.129823 0.383468 0.251333 0.077513 0.248021 0.252233 1.000000 0.513226 0.093878 0.127262
Open Kitchen by Williams Sonoma All Nylon Turner/Spatula 0.145692 0.000000 0.000000 0.487317 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.142205 0.560693 0.450985 0.084906 0.271675 0.276289 0.513226 1.000000 0.102831 0.139399
HARRY POTTER™ RAVENCLAW™ Spatula 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.115578 0.519728 0.114054 0.115991 0.093878 0.102831 1.000000 0.000000
Williams Sonoma Flex Core Mini Spatulas, Set of 2, Navy 0.180857 0.048609 0.050373 0.145530 0.039096 0.000000 0.000000 0.000000 0.044571 0.046903 ... 0.132547 0.125438 0.156678 0.041172 0.154613 0.157239 0.127262 0.139399 0.000000 1.000000

609 rows × 609 columns

The more similar the title, the higher the cosine similarity score.

In [45]:
#Construct a reverse map of indices and product titles
indices = pd.Series(data.index, index=data['title']).drop_duplicates()
In [46]:
# Function that takes in a title as input and outputs top 10 most similar product titles
def title_based_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the product that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all product titles with that title
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the titles based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar titles
    sim_scores = sim_scores[1:11]

    # Get the titles indices
    product_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar titles
    return data['title'].iloc[product_indices]
In [47]:
title_based_recommendations('Staub Perfect Pan')
Out[47]:
174                     Staub Enameled Cast Iron Fry Pan
148    Brim 0.8 Liter Precision Temperature & Perfect...
162    Staub Enameled Cast Iron Traditional Deep Skil...
175    Calphalon Elite Nonstick 3-Piece Fry Pan & Sau...
192                  All-Clad d5 Stainless-Steel Fry Pan
201                     Calphalon Elite Nonstick Fry Pan
547                                Littledeer Pan Paddle
205                 SCANPAN Classic Nonstick Fry Pan Set
171                SCANPAN Professional Nonstick Fry Pan
187         All-Clad d5 Stainless-Steel Nonstick Fry Pan
Name: title, dtype: object
In [48]:
title_based_recommendations('Robinson Clear Glass Pendant')
Out[48]:
400         Robinson Seeded Glass Pendant
379    Katie Conical Pendant, Clear Glass
354                 Kira Clear Table Lamp
361    Katie Conical Pendant, White Glass
387                         Davis Pendant
344                        Emmett Pendant
363                         Agnes Pendant
381                Garrison Brass Pendant
398        Garrison Antique White Pendant
343               Montego Pendant, Rattan
Name: title, dtype: object