Part 1 - is all about getting to know the Dataset using Exploratory analysis, cleaning data, choosing the metrics, and doing the first model prediction experiments.
Problem statement:
My friend is in the job market. However, they keep wasting time applying for fraudulent job postings. They have asked me to use my data skills to filter out fake postings and save them effort. They have mentioned that job postings are abundant, so they would prefer my solution to risk filtering out real posts if it decreases the number of fraudulent posts they apply to. I have access to a dataset consisting of approximately 18,000 job postings, containing both real and fake jobs.
Story published with Jupyter2Hashnode
Have you ever struggled to convert a Jupyter Notebook into a compelling Hashnode story? If so, you're not alone. It can be a daunting task, but fortunately, there's a tool that can simplify the process: Jupyter2Hashnode.
With Jupyter2Hashnode, you can convert Jupyter Notebooks into Hashnode stories with just a single command. The tool compresses images, uploads them to the Hashnode server, updates image URLs in the markdown file, and finally, publishes the story article. It's an effortless way to transform your data analysis or code tutorials into a polished and engaging format.
If you're interested in learning more about Jupyter2Hashnode, there's a detailed guide available on Hashnode (tiagopatriciosantos.hashnode.dev/jupyter2ha..). It's a game-changing tool that can save you time and energy while helping you create high-quality content for your audience. Give it a try and see the difference for yourself!
Part 1
This end-to-end ML (Machine Learning) project is divided into a 3-part series.
Part 1 - is all about getting to know the Dataset using Exploratory analysis, cleaning data, choosing the metrics, and doing the first model prediction experiments.
Part 2 - is about the setup of DagsHub, DVC, and MLFlow to create a version-controlled data science project, as well as tracking experiment parameters and metrics, and comparing experiments.
Part 3 - is all about deployment, where using MLFlow and FastApi we will deploy the model into a WebAPI and serve it with Mogenius, a Virtual DevOps platform.
You can check the detailed Datacamp workspace here.
The Dataset
About Dataset
[Real or Fake] : Fake Job Description Prediction
This dataset contains 18K job descriptions, out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to build classification models that learn to identify fraudulent job descriptions.
The original source of the data can be found here.
Data Dictionary
| Column | Description |
| --- | --- |
| job_id | Unique job ID |
| title | The title of the job ad entry |
| location | Geographical location of the job ad |
| department | Corporate department (e.g. Sales) |
| salary_range | Indicative salary range (e.g. $50,000-$60,000) |
| company_profile | A brief company description |
| description | The detailed description of the job ad |
| requirements | Enlisted requirements for the job opening |
| benefits | Enlisted benefits offered by the employer |
| telecommuting | True for telecommuting positions |
| has_company_logo | True if the company logo is present |
| has_questions | True if screening questions are present |
| employment_type | Full-time, Part-time, Contract, etc. |
| required_experience | Executive, Entry level, Intern, etc. |
| required_education | Doctorate, Master's Degree, Bachelor, etc. |
| industry | Automotive, IT, Health care, Real estate, etc. |
| function | Consulting, Engineering, Research, Sales, etc. |
| fraudulent | Target - classification attribute |
The imports
#code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns
import unicodedata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.dummy import DummyClassifier
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
sns.set_theme()
sns.set(font_scale=1.2)
RANDOM_SEED = 42
Read the csv
#code
df = pd.read_csv("https://dagshub.com/tiagopatriciosantos/Datasets/raw/f3ddde257b100018bcb22a7231f899462b34c58f/data/fake_job_postings.csv")
print("(Rows, columns) :", df.shape)
df.sample(n=5, random_state = RANDOM_SEED).head()
[output:]
(Rows, columns) : (17880, 18)
job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent | |
4708 | 4709 | Python Engineer | GB, , London | NaN | NaN | NaN | Stylect is a dynamic startup that helps helps ... | We don’t care where you studied or what your G... | We are negotiable on salary and there is the p... | 0 | 1 | 0 | Full-time | Entry level | Unspecified | Apparel & Fashion | Information Technology | 0 |
11079 | 11080 | Entry Level Sales | US, OH, Cincinnati | NaN | 55000-75000 | NaN | General Summary: Achieves maximum sales profit... | NaN | Great Health and DentalFast Advancement Opport... | 1 | 0 | 0 | Full-time | Entry level | High School or equivalent | Financial Services | Sales | 0 |
12357 | 12358 | Agile Project Manager | US, NY, New York | NaN | NaN | ustwo offers you the opportunity to be yoursel... | At ustwo™ you get to be yourself, whilst deliv... | Skills• Experience interfacing directly with c... | NaN | 0 | 1 | 0 | NaN | NaN | NaN | NaN | NaN | 0 |
14511 | 14512 | Marketing Coordinator | GB, GBN, London | Business:Marketing | NaN | We build software for fashion retailers, to he... | About EDITDEDITD runs the world's biggest appa... | Required Skills / Experience:Ability to analys... | NaN | 0 | 1 | 0 | Full-time | NaN | NaN | NaN | Marketing | 0 |
16691 | 16692 | Full-stack Web Engineer | US, CA, San Francisco | NaN | NaN | Runscope is building tools for developers work... | As a Web Engineer at Runscope you'll be respon... | Extensive front-end web experience (HTML/CSS/J... | Be a part of an experienced team who have work... | 0 | 1 | 1 | Full-time | Mid-Senior level | NaN | NaN | Engineering | 0 |
EDA
Exploratory data analysis.
Choosing columns
We will ignore the job_id column.
#code
TEXT_COLS = ["title","company_profile","description", "requirements", "benefits"]
NUM_COLS = ["telecommuting", "has_company_logo", "has_questions"]
CAT_COLS = ["location","salary_range", "employment_type" , "required_experience","required_education", "function","industry"]
TARGET_COL = ["fraudulent"]
COLUMNS= TEXT_COLS + NUM_COLS + CAT_COLS + TARGET_COL
df = df[COLUMNS]
df.columns
[output:]
Index(['title', 'company_profile', 'description', 'requirements', 'benefits',
'telecommuting', 'has_company_logo', 'has_questions', 'location',
'salary_range', 'employment_type', 'required_experience',
'required_education', 'function', 'industry', 'fraudulent'],
dtype='object')
Target: fraudulent
#code
print("Total:",df.shape[0])
print("Non Fraudulent:", (-df["fraudulent"]+1).sum())
print("Fraudulent:", df["fraudulent"].sum())
print("Fraudulent percent:", np.round(df["fraudulent"].mean()*100,2), "%")
sns.countplot(data=df, x="fraudulent");
[output:]
Total: 17880
Non Fraudulent: 17014
Fraudulent: 866
Fraudulent percent: 4.84 %
Viewing Missing Data
#code
df_aux = df.drop(columns=TARGET_COL)
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(15,5))
ax= sns.barplot(x=df_aux.columns, y=df_aux.isnull().sum()/df_aux.shape[0]*100, ax=ax0 )
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_title("Missing values percent by column")
ax= sns.barplot(x=df_aux.columns, y=df_aux[df["fraudulent"]==1].isnull().sum()/df_aux[df["fraudulent"]==1].shape[0]*100, ax=ax1 )
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_title("Missing values percent by column of fraudulent jobs");
plt.tight_layout()
Observations
These observations come from a deeper analysis, which can be seen here.
TEXT COLUMNS
- title - no missing data and looks good
- company_profile - 18.5% of missing data in the whole dataset; filtering only the fraudulent jobs, this goes to nearly 70%. We can make the hypothesis that a job with no company profile has a higher chance of being fraudulent
- description - observations with the value "#NAME?" will be replaced with an empty string
- requirements - observations with no meaningful text, ['#NAME?', '.', '---', 'TBD', 'Test', '16-21', 'x', 'None'], will be replaced by an empty string
- Strings will be stripped and any missing data will be replaced by an empty string
- Text columns contain text like #URL_86fd830a95a64e2b30ceed829e63fd384c289e4f01e3c93608b42a84f6e662dd#, with the possible prefixes being [URL_, PHONE_, EMAIL_] followed by encrypted text; these tokens will be replaced by an empty string, as they do not seem useful
- Text contains unicode chars; convert the text to ASCII
- Text contains words that are not separated, e.g. pressureExcellent; when a lowercase letter is followed by an uppercase letter, one whitespace will be inserted between them
NUMERIC COLUMNS
- telecommuting - no missing data, value is either 0 or 1
- has_company_logo - no missing data, value is either 0 or 1
- has_questions - no missing data, value is either 0 or 1
CATEGORY COLUMNS
- location - 2% missing data; comma-separated values, where the first value is the country code, the second is the state/city code, and values after the second comma are detailed descriptions of the state/city/location. We will keep the country code and state/city code in separate columns and drop this column
- employment_type - 19.4% missing data, 5 types ['Full-time', 'Contract', 'Part-time', 'Temporary', 'Other']
- required_experience - almost 40% missing data, 7 types ['Mid-Senior level', 'Entry level', 'Associate', 'Not Applicable', 'Director', 'Internship', 'Executive']
- required_education - 45.3% missing data, 13 types ["Bachelor's Degree", 'High School or equivalent', 'Unspecified', "Master's Degree", 'Associate Degree', 'Certification', 'Some College Coursework Completed', 'Professional', 'Vocational', 'Some High School Coursework', 'Doctorate', 'Vocational - HS Diploma', 'Vocational - Degree']
- function - 36.1% missing data, 37 different types; "Information Technology" is the most represented and "Science" the least
- industry - 27.4% missing data, 131 types; "Information Technology and Services", "Computer Software" and "Internet" are the top 3 most common
- salary_range - more than 80% missing data; from the available data we can see that some values do not correspond to a correct range, for example 60-90 and 8-Sep. With so much missing and incorrect data we will drop this column. A review of the way data is stored in this column is suggested
- Any missing data will be replaced by the word "MISSING"
Cleaning data
TEXT COLUMNS
#code
import re
def split_word(matchobj):
"""
This function takes a match object and returns a string with a space
inserted between the first and second characters of the matched string.
:param matchobj: A match object.
:return: A string with a space inserted between the first and second
characters of the matched string
"""
return matchobj.group(0)[0] + " " +matchobj.group(0)[1]
df[TEXT_COLS] = df[TEXT_COLS].fillna("")
for col in TEXT_COLS:
# normalize unicode to NFKD and encode/decode to ASCII, replacing characters that cannot be converted
df[col] = df[col].apply(lambda val: unicodedata.normalize('NFKD', val).encode('ascii', 'replace').decode())
# strip text
df[col] = df[col].str.strip()
# when a lowercase letter is followed by an uppercase letter, insert one whitespace between them
df[col] = df[col].apply(lambda val: re.sub(r'([a-z][A-Z])', split_word, val))
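As a quick check of what split_word does, here is a small illustrative snippet (not part of the original notebook cells):
#code
# illustrative example: "pressureExcellent" becomes "pressure Excellent", because a space is
# inserted between a lowercase letter immediately followed by an uppercase letter
print(re.sub(r'([a-z][A-Z])', split_word, "pressureExcellent"))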
#code
# cleaning description
filter = df["description"].isin(['#NAME?'])
df.loc[filter, "description"] = ""
# cleaning requirements
filter = df["requirements"].str.strip().isin(['#NAME?', '.', '---', 'TBD', 'Test', '16-21', 'x', 'None'])
df.loc[filter, "requirements"] = ""
# replace encoded strings
df[TEXT_COLS] = df[TEXT_COLS].replace(r"(#(EMAIL_|PHONE_|URL_)([^#])*#)|((EMAIL_|PHONE_|URL_)([^#])*##)", " " , regex=True)
#replace underscore with whitespace
for col in TEXT_COLS:
df[col] = df[col].str.replace("_"," ")
CATEGORY COLUMNS
#code
df[CAT_COLS] = df[CAT_COLS].fillna("MISSING")
Feature engineering
location
Split the text by comma and keep only the first 2 values, which represent the country code and the state/city code.
#code
df[["country", "state"]]=df["location"].str.split(",", expand=True, n=2)[[0,1]]
df[["country", "state"]]
df["country"] = df["country"].str.strip().replace("", "MISSING")
df["state"] = df["state"].str.strip().fillna("MISSING").replace("", "MISSING")
Text columns to length columns
#code
LEN_COLS = []
for col in TEXT_COLS:
new_col = col+"_len"
df[new_col] = df[col].str.len()
LEN_COLS.append(new_col)
NUM_COLS.append(new_col)
Choosing columns
#code
CAT_COLS.remove("location")
CAT_COLS = CAT_COLS + ["country","state"]
COLUMNS= TEXT_COLS + NUM_COLS + CAT_COLS + TARGET_COL
df = df[COLUMNS]
df.columns
[output:]
Index(['title', 'company_profile', 'description', 'requirements', 'benefits',
'telecommuting', 'has_company_logo', 'has_questions', 'title_len',
'company_profile_len', 'description_len', 'requirements_len',
'benefits_len', 'salary_range', 'employment_type',
'required_experience', 'required_education', 'function', 'industry',
'country', 'state', 'fraudulent'],
dtype='object')
Country and State
Most of the job posts are from the United States (US). Malaysia (MY) and Bahrain (BH) have the highest rate of fraudulent job posts, with more than a 50% chance of a job post being fraudulent.
#code
from pandas.api.types import CategoricalDtype
fig, (ax0,ax1,ax2) = plt.subplots(ncols=3, figsize=(20,5) )
df_ana = df[["country","state", "fraudulent"]].copy()
countries_top10 = df_ana["country"].value_counts().head(10)
countries_top10_fraudulent = df_ana[df_ana["fraudulent"]==1]["country"].value_counts().head(10)
#to set order by when calculating percentage for each country
cat_country_order = CategoricalDtype( df_ana[df_ana["fraudulent"]==1]["country"].value_counts().index, ordered=True)
df_ana["country"] = df_ana["country"].astype(cat_country_order)
countries_top10_fraudulent_perc = df_ana.sort_values("country").groupby("country")["fraudulent"].mean().head(10)*100
print(countries_top10_fraudulent_perc.round(2))
sns.barplot(x=countries_top10.index, y=countries_top10.values , ax = ax0).set(title="Top 10 - job posts in country code")
sns.barplot(x=countries_top10_fraudulent.index, y=countries_top10_fraudulent.values , ax = ax1).set(title="Top 10 - fraudulent job posts in country code")
sns.barplot(x=countries_top10_fraudulent_perc.index.to_list(), y=countries_top10_fraudulent_perc.values , ax = ax2).set(title="Top 10 - percentage of fraudulent job posts by country code", ylim=(0,100))
plt.tight_layout()
[output:]
country
US 6.85
AU 18.69
GB 0.96
MISSING 5.49
MY 57.14
CA 2.63
QA 28.57
BH 55.56
IN 1.45
PL 3.95
Name: fraudulent, dtype: float64
Looking only at US job posts, California (CA) has the largest number of job posts, while Maryland (MD) has the highest rate of fake job posts.
#code
fig, (ax0,ax1,ax2) = plt.subplots(ncols=3, figsize=(20,5))
df_country_us = df[df["country"]=="US"].copy() # copy to avoid modifying a view when assigning below
us_states_top10 = df_country_us["state"].value_counts().head(10)
us_states_top10_fraudulent = df_country_us[df_country_us["fraudulent"]==1]["state"].value_counts().head(10)
#to set order by when calculating percentage for each country
cat_state_order = CategoricalDtype( df_country_us[df_country_us["fraudulent"]==1]["state"].value_counts().index, ordered=True)
df_country_us["state"] = df_country_us["state"].astype(cat_state_order)
us_states_top10_fraudulent_perc = df_country_us.sort_values("state").groupby("state")["fraudulent"].mean().head(10)*100
print(us_states_top10_fraudulent_perc.round(2))
sns.barplot(x=us_states_top10.index, y=us_states_top10.values , ax = ax0).set(title="Top 10 - job posts in state code of US country")
sns.barplot(x=us_states_top10_fraudulent.index, y=us_states_top10_fraudulent.values , ax = ax1).set(title="Top 10 - fraudulent job posts in state code of US country")
sns.barplot(x=us_states_top10_fraudulent_perc.index.to_list(), y=us_states_top10_fraudulent_perc.values , ax = ax2).set(title="Top 10 - percentage fraudulent job posts by state code of US country", ylim=(0,100))
plt.tight_layout()
[output:]
state
TX 15.59
CA 6.97
NY 5.40
MISSING 7.53
MD 32.71
FL 7.23
GA 8.40
IL 4.25
OH 4.84
NC 7.56
Name: fraudulent, dtype: float64
Numeric columns
Starting with a simple pandas describe for the numeric columns:
#code
df.describe()
telecommuting | has_company_logo | has_questions | title_len | company_profile_len | description_len | requirements_len | benefits_len | fraudulent | |
count | 17880.000000 | 17880.000000 | 17880.000000 | 17880.000000 | 17880.000000 | 17880.000000 | 17880.000000 | 17880.000000 | 17880.000000 |
mean | 0.042897 | 0.795302 | 0.491723 | 28.429754 | 607.254027 | 1195.626902 | 587.400895 | 204.940045 | 0.048434 |
std | 0.202631 | 0.403492 | 0.499945 | 13.869966 | 558.460959 | 886.097778 | 609.431979 | 329.356012 | 0.214688 |
min | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 1.000000 | 0.000000 | 19.000000 | 137.000000 | 592.000000 | 146.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 1.000000 | 0.000000 | 25.000000 | 552.000000 | 1003.000000 | 465.000000 | 44.000000 | 0.000000 |
75% | 0.000000 | 1.000000 | 1.000000 | 35.000000 | 870.000000 | 1570.000000 | 817.000000 | 290.250000 | 0.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 142.000000 | 6184.000000 | 14881.000000 | 10795.000000 | 4460.000000 | 1.000000 |
Let's check the numeric columns to see how they impact a job post being fraudulent.
Using these symbols:
P(T) - the probability of telecommuting
P(L) - the probability of has_company_logo
P(Q) - the probability of has_questions
P(F) - the probability of fraudulent
We get these probabilities:
$P(F \mid T)$ = 0.083442 $P(F \mid \overline{T})$ = 0.046865
$P(F \mid L)$ = 0.019902 $P(F \mid \overline{L})$ = 0.15929
$P(F \mid Q)$ = 0.028435 $P(F \mid \overline{Q})$ = 0.067782
Observations:
Posts with telecommuting (remote jobs) have almost 2x the probability of being fraudulent compared to posts without telecommuting.
Posts without the company logo have about 8x the probability of being fraudulent compared to posts with a logo.
If screening questions are present (has_questions is true), the job post has less than half the probability of being fraudulent compared to posts without screening questions.
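These conditional probabilities can also be read directly from a groupby on each binary flag (a quick sketch; the column-normalized crosstabs printed in the next cell report the same numbers):
#code
# P(F | flag value) = mean of the target within each group of the binary flag
for col in ["telecommuting", "has_company_logo", "has_questions"]:
    print(df.groupby(col)["fraudulent"].mean(), "\n")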
#code
cols_inspect = ['telecommuting', 'has_company_logo', 'has_questions']
fig, axs = plt.subplots(nrows=len(cols_inspect),figsize=(10,10))
fig.suptitle('Job posts', fontsize=16)
def annotate_percent(ax, n):
for a in ax.patches:
val = np.round(a.get_height()/n*100,decimals=2)
ax.annotate('{:.2f}%'.format(val) ,
(a.get_x()+a.get_width()/2, a.get_height()),
ha='center',
va='center',
xytext=(0, 6),
textcoords='offset points',
fontsize = 8,
)
for i, col in enumerate(cols_inspect):
ax = axs[i]
# print("col:",df.groupby(col)["fraudulent"].mean())
sns.countplot(data=df, x = col, hue="fraudulent", ax=ax).set(title=str.title(col))
annotate_percent(ax, df.shape[0])
ax.legend(['non fraud', 'fraudulent'])
ax.set_xticks(ax.get_xticks(),["no", "yes"])
ax.set_xlabel("")
# creates a crosstab with the conditional probabilities
print(pd.crosstab( df["fraudulent"],df[col], normalize='columns', margins=True))
plt.tight_layout()
[output:]
telecommuting 0 1 All
fraudulent
0 0.953135 0.916558 0.951566
1 0.046865 0.083442 0.048434
has_company_logo 0 1 All
fraudulent
0 0.84071 0.980098 0.951566
1 0.15929 0.019902 0.048434
has_questions 0 1 All
fraudulent
0 0.932218 0.971565 0.951566
1 0.067782 0.028435 0.048434
Inspecting the kernel density of titles, descriptions, and requirements, separated by fraudulent and non-fraudulent, we can observe that job posts with longer titles and with shorter descriptions and requirements tend to be fraudulent.
#code
cols_inspect = ['title_len', 'description_len', 'requirements_len']
fig , axs = plt.subplots(nrows=2, ncols=len(cols_inspect), figsize=(15,5))
for i, col in enumerate(cols_inspect):
ax = axs[0][i]
sns.kdeplot(x=df[df["fraudulent"]==0][col], ax =ax)
ticks = ax.get_xticks()
xlim = ax.get_xlim()
ax.set(title=str.title(col)+" non fraudulent" )
ax.set_xlabel("")
ax = axs[1][i]
sns.kdeplot( x=df[df["fraudulent"]==1][col], ax =ax)
ax.set_xticks(ticks)
ax.set_xlim(xlim)
ax.set(title=str.title(col)+" fraudulent" )
ax.set_xlabel("")
plt.tight_layout()
Processing the data
We will split data into training and test sets
#code
train_df, test_df = train_test_split(df, stratify=df["fraudulent"], random_state = RANDOM_SEED) #using stratify because the imbalance of the target feature
#code
# checking fraudulent percent in train_df
print("Fraudulent percent:", np.round(train_df["fraudulent"].mean()*100,2), "%")
[output:]
Fraudulent percent: 4.85 %
Here, we try to scale and transform the len features to make them smoother, less skewed, and more appropriate for modeling. PowerTransformer attempts to make the data more like a normal distribution, which should soften the impact of extreme outliers and make the distributions smoother. MinMaxScaler is used as a first step to make sure the numbers are in a reasonable range, as PowerTransformer can fail on very large numbers.
#code
pipeline = make_pipeline(MinMaxScaler(), PowerTransformer())
train_df_norm = pd.DataFrame(pipeline.fit_transform(train_df[LEN_COLS]), columns=LEN_COLS)
train_df_norm = train_df_norm.combine_first(train_df.reset_index()) # Add the other columns back
test_df_norm = pd.DataFrame(pipeline.transform(test_df[LEN_COLS]), columns=LEN_COLS) # transform only: the pipeline is fitted on the training set
test_df_norm = test_df_norm.combine_first(test_df.reset_index()) # Add the other columns back
#code
sns.pairplot(train_df_norm[LEN_COLS+TARGET_COL], hue="fraudulent")
[output:]
<seaborn.axisgrid.PairGrid at 0x7f9a50966940>
Model training
For the time being, we'll only use the numeric columns for this model. The textual columns require specialized preprocessing before they can be used as input to a model.
#code
X_train = train_df_norm[NUM_COLS]
y_train = train_df_norm[TARGET_COL]
X_test = test_df_norm[NUM_COLS]
y_test = test_df_norm[TARGET_COL]
Functions
#code
from time import time
from mpl_toolkits import axes_grid1
def f1_recall_precision_threshold(estimator, X, y_true, threshold=0.5):
"""
This function takes a model, X and y_true as input and returns the predictions and the f1, precision, recall scores
Parameters
----------
estimator : sklearn model
The model to be used for prediction.
X : numpy array
The input data.
y_true : numpy array
The true labels.
threshold : float, default=0.5
Value of the threshold between 0 and 1
Returns
-------
y_pred : numpy array
The predicted value for positive class.
y_pred_proba : numpy array
The predicted probabilities for positive class.
f1score : float
The f1score.
recall : float
The recall score.
precision : float
The precision score.
"""
y_pred_proba = estimator.predict_proba(X)[:,1] # predictions probability for positive class
y_pred = y_pred_proba >= threshold
f1score = metrics.f1_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
return y_pred, y_pred_proba, f1score, recall, precision
def add_colorbar(ax, im, aspect=20, pad_fraction=0.5, **kwargs):
"""Add a vertical color bar to an image plot."""
divider = axes_grid1.make_axes_locatable(im.axes)
width = axes_grid1.axes_size.AxesY(im.axes, aspect=1./aspect)
pad = axes_grid1.axes_size.Fraction(pad_fraction, width)
current_ax = plt.gca()
cax = divider.append_axes("right", size=width, pad=pad)
plt.sca(ax)
return im.axes.figure.colorbar(im, cax=cax, **kwargs)
def optimal_values(estimator, X_train, y_train, X_test, y_test, plot_charts=False, threshold_index=None, threshold_value=None):
"""
This function takes a model, X and y_true as input and returns the optimal threshold, f1score, precision and recall.
If threshold_index is passed it will use the index to return predictions and metrics based on that index.
Parameters
----------
estimator : sklearn model
The model to be used for prediction.
X_train : numpy array
The train data.
y_train : numpy array
The train labels.
X_test : numpy array
The test data.
y_test : numpy array
The test labels.
plot_charts : bool, optional
Whether to plot the optimal confusion matrix and precision vs recall chart. The default is False.
threshold_index : int, default=None
Index of the threshold list, if set then threshold_value will be ignored
threshold_value : float, default=None
Value of the threshold to consider
Returns
-------
values : dict
{"params": {"threshold": threshold_opt},
"train": {"f1": f1score_opt,
"precision":precision_opt,
"recall": recall_opt
},
"test": {"f1": test_f1score,
"precision": test_precision,
"recall": test_recall
}
}
"""
# predictions probability for positive class for train set
y_pred_proba = estimator.predict_proba(X_train)[:,1]
# Compute precision-recall pairs for different probability thresholds.
precision, recall, thresholds = metrics.precision_recall_curve(y_train, y_pred_proba)
# # precision, recall manual calculations
# y = np.array(y_true).reshape((-1,1))
# pred = (y_pred_proba>= thresholds.reshape((-1,1))).T
# TP = np.sum(np.logical_and(y , pred ), axis=0)
# FN = np.sum(np.logical_and(y , np.logical_not(pred)), axis=0)
# FP = np.sum( np.logical_and(np.logical_not(y) , pred), axis=0)
# recall = np.append((TP /(TP+FN)), 0)
# precision = np.append((TP /(TP+FP)), 1)
# print("TP",TP, "FN",FN, "FP", FP,"recall", recall,"precision",precision)
# calculates f1 scores
# if threshold_index defined then will use the index to identify the threshold, f1, precision and recall
# else uses the index of the max f1 score to identify the index to get opt threshold, f1, precision and recall
f1score = 2*(precision*recall)/(precision+recall)
if threshold_index:
index=threshold_index
threshold_opt = thresholds[index]
elif threshold_value:
thresholds_aux = np.where(thresholds <= threshold_value, thresholds, 0)
index = np.nanargmax(thresholds_aux)
threshold_opt = threshold_value
else:
index = np.nanargmax(f1score)
threshold_opt = thresholds[index]
f1score_opt = f1score[index] if not np.isnan(f1score[index]) else 0.0
precision_opt = precision[index]
recall_opt = recall[index]
# using the opt threshold based on the f1 score, calculate the predictions, f1, precision and recall for the test set
y_test_pred, y_test_pred_proba, test_f1score, test_recall, test_precision = f1_recall_precision_threshold(estimator, X_test, y_test, threshold_opt)
if plot_charts:
gs_kw = dict(width_ratios=[1, 3, 1, 1])
fig, axd = plt.subplot_mosaic([['text', 'precision_recall', 'cm_train','cm_test']],
gridspec_kw=gs_kw, figsize=(16, 5),
layout="tight")
ax0= axd["text"]
ax1= axd["precision_recall"]
ax2= axd["cm_train"]
ax3= axd["cm_test"]
# Set x-axis limits to [0, 10] and y-axis limits to [0, y_top] instead of the default [0, 1]
y_top = 14
ax0.axis([0, 10, 0, y_top])
model_name = estimator.__class__.__name__
# ax0.set_title("Model : " + model_name, loc='left')
ax0.text(0,y_top -1 , "Model : " + model_name)
ax0.text(0,y_top -3 , f"threshold ≈ {np.round(threshold_opt,4)}" )
ax0.text(0,y_top -5 , f"Train:" )
ax0.text(1,y_top -6 , f"precision ≈ {np.round(precision_opt,4)}" )
ax0.text(1,y_top -7 , f"recall ≈ {np.round(recall_opt,4)}" )
ax0.text(1,y_top -8 , f"f1 ≈ {np.round(f1score_opt,4)}" )
ax0.text(0,y_top -10 , f"Test:" )
ax0.text(1,y_top -11 , f"precision ≈ {np.round(test_precision,4)}" )
ax0.text(1,y_top -12 , f"recall ≈ {np.round(test_recall,4)}" )
ax0.text(1,y_top -13 , f"f1 ≈ {np.round(test_f1score,4)}" )
ax0.set_xticks([])
ax0.set_yticks([])
ax0.grid(False)
ax0.patch.set_alpha(0)
y_pred = y_pred_proba >= threshold_opt
cm = metrics.ConfusionMatrixDisplay.from_predictions(y_train, y_pred, ax=ax2, colorbar=False)
add_colorbar(ax2, cm.im_)
ax2.set_title("Train confusion matrix")
ax2.grid(False)
y_pred = y_test_pred_proba >= threshold_opt
cm=metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax3, colorbar=False)
add_colorbar(ax3, cm.im_)
ax3.set_title("Test confusion matrix")
ax3.grid(False)
# plot line plots with train f1 scores, precision, recall and optimal values,
# annotates also some thresholds to easy identify other points of possible interest
ax1.set_title("Train thresholds metrics evaluation (TH=threshold)")
sns.lineplot(x=range(len(precision)), y=precision, label="Precision",color='b', ax=ax1)
sns.lineplot(x=range(len(recall)), y=recall, label="Recall",color='orange', ax=ax1)
sns.lineplot(x=range(len(f1score)), y=f1score, label="F1 score",color='g', ax=ax1)
text_annot = f"TH {np.round(threshold_opt,3)}"
ax1.annotate(text_annot, (index, 0), size=10)
# text_annot = f"F1 {np.round(f1score_opt,2)}"
# ax1.annotate(text_annot, (index, f1score_opt), size=10)
xlim0, xlim1 = ax1.get_xlim()
ylim0, ylim1= ax1.get_ylim()
x_point = index/xlim1
# y_point = f1score_opt- ylim0/2
ax1.axhline(y=f1score_opt, xmax=x_point,linewidth=1, color='r', alpha=0.7,ls="--")
ax1.axvline(x=index, linewidth=1, color='r', alpha=0.7,ls="--")
blue_line = mlines.Line2D([], [], color='blue', marker='_', label='Precision')
orange_line = mlines.Line2D([], [], color='orange', marker='_', label='Recall')
green_line = mlines.Line2D([], [], color='green', marker='_', label='F1 score')
red_line = mlines.Line2D([], [],linewidth=1, color='red', linestyle='--', alpha=0.7, label='F1 opt')
ax1.legend(handles=[blue_line,orange_line, green_line,red_line])
ax1.set_xlabel("Threshold index")
ax1.set_ylabel("Score")
minor_tick = np.arange(0, len(f1score), 100)
ax1.set_xticks(minor_tick, minor=True)
ax1.grid(visible=True, which='major', axis="x")
ax1.grid(visible=True, which='minor',axis="x", alpha=0.4)
if not threshold_index:
for i in np.linspace(0, index, 6, dtype=int)[1:-1]:
recall_value = recall[i]
threshold_value = thresholds[i]
y_point = recall_value/ylim1
ax1.axvline(x=i, ymax=y_point,linewidth=1, color='y', alpha=0.8,ls="--" )
text_annot = f"TH {np.round(threshold_value,3)}"
ax1.annotate(text_annot, (i, recall_value-0.05), size=10)
# dict to return values
values = {"params": {"threshold": threshold_opt},
"train": {"f1": f1score_opt,
"precision":precision_opt,
"recall": recall_opt
},
"test": {"f1": test_f1score,
"precision": test_precision,
"recall": test_recall
}
}
return values
Baseline - DummyClassifier
Now, we'll fit a DummyClassifier, which makes predictions that ignore the input features.
This classifier serves as a simple baseline to compare against other, more complex classifiers.
#code
clf_baseline = DummyClassifier(strategy='most_frequent', random_state=RANDOM_SEED)
clf_baseline.fit(X_train, y_train)
DummyClassifier(random_state=42, strategy='most_frequent')
Evaluation and metric choice
Since we know that the classes are very imbalanced (only about 4.84% of the job posts are labeled as fraudulent), the dummy model that always predicts the most frequent class already reaches ~95% accuracy, so we'll avoid using the accuracy metric. Instead, we'll take a look at the precision-recall curve, which can tell us more useful information.
#code
y_pred = clf_baseline.predict(X_test)
print("Accuracy:",np.round( metrics.accuracy_score(y_test, y_pred),4), " = portion of non fraudulent job posts:", np.round(1-np.mean(y_test.values),4))
[output:]
Accuracy: 0.9517 = portion of non fraudulent job posts: 0.9517
Having the following confusion matrix:
|  | Positive Prediction | Negative Prediction |
| --- | --- | --- |
| Positive Class | True Positive (TP) | False Negative (FN) |
| Negative Class | False Positive (FP) | True Negative (TN) |
One thing we want to avoid is our friend getting in touch with people who posted a fake job. As stated at the beginning of this notebook, "they would prefer that your solution risks filtering out real posts if it decreases the number of fraudulent posts they apply to", so we need to maximize how well the model identifies fake job posts by reducing False Negative predictions; to measure that we will use the Recall metric. On the other hand, we don't want the model to classify too many non-fraudulent job posts as fake; we will use the Precision metric to evaluate this.
Recall = Sensitivity = TPR = $\frac{TP}{TP + FN}$
Precision = $\frac{TP}{TP + FP}$
How to choose Precision and Recall? Almost always, in practice, we have to choose between high precision or high recall. It’s usually impossible to have both. We can achieve either of the two by various means:
Assigning a higher weighting to the examples of a specific class (the SVM algorithm accepts weightings of classes as input)
Tuning hyperparameters to maximize precision or recall on the validation set.
Varying the decision threshold for algorithms that return probabilities of classes; for instance, if we use logistic regression or decision tree, to increase recall (at the cost of a lower precision), we can decide that the prediction will be positive only if the probability returned by the model is higher than 0.2. We will use this approach.
Another metric we will use to evaluate how well the model predicts is the F1-score. The F1-score is the harmonic mean of precision and recall. It ranges between 0 and 1, with 1 being the best value.
By default, machine learning models use a threshold of 0.5. This means that if a prediction probability is equal to or higher than 0.5, it is classified as "1"; otherwise, it is classified as "0". In our case, we will consider a threshold of 4.85%, the percentage of fraudulent job posts in the training dataset. We will compare this threshold with the optimal threshold obtained using the Precision-Recall curve, which will help us identify the threshold that produces the maximum F1-score. Tuning or shifting the decision threshold to accommodate the broader requirements of the classification problem is generally referred to as "threshold-moving", "threshold-tuning", or simply "thresholding" (see the short sketch after the formula below).
- F1-score = $2 \cdot \frac{precision \cdot recall}{precision + recall}$
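For illustration, this is what threshold-moving boils down to; clf here is a placeholder for any fitted classifier that exposes predict_proba (the optimal_values helper defined above applies the same idea systematically):
#code
# `clf` is a hypothetical fitted classifier exposing predict_proba
proba = clf.predict_proba(X_test)[:, 1]          # probability of the positive class
y_pred_default = (proba >= 0.5).astype(int)      # default threshold
y_pred_moved = (proba >= 0.0485).astype(int)     # threshold moved to the fraudulent rate of the train set
print("recall @ 0.5   :", metrics.recall_score(y_test, y_pred_default))
print("recall @ 0.0485:", metrics.recall_score(y_test, y_pred_moved))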
Great articles about classification metrics and threshold calculations:
Predicting with the default threshold using the dummy model, every prediction is the most frequent class, in this case class "0", so it always predicts class "0". Changing the threshold to zero makes all predictions become class "1". Well, this is the dummy model: nothing more than dummy predictions.
The metrics we chose are easier to interpret with models that actually learn something from the data. Next, we will look at the LogisticRegression model.
#code
values = optimal_values(clf_baseline, X_train, y_train, X_test, y_test, plot_charts=True, threshold_value=0.5)
#storing this values to compare
compare_dict = {"Baseline dummy": {"default": values}}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.09246088193456614, 'precision': 0.048471290082028336, 'recall': 1.0}, 'test': {'f1': 0.0, 'precision': 0.0, 'recall': 0.0}}
#code
values = optimal_values(clf_baseline, X_train, y_train, X_test, y_test, plot_charts=True)
#storing this values to compare
compare_dict["Baseline dummy"]["f1_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0}, 'train': {'f1': 0.09246088193456614, 'precision': 0.048471290082028336, 'recall': 1.0}, 'test': {'f1': 0.09218950064020486, 'precision': 0.04832214765100671, 'recall': 1.0}}
#code
values = optimal_values(clf_baseline, X_train, y_train, X_test, y_test, plot_charts=True, threshold_value=0.0485)
#storing this values to compare
compare_dict["Baseline dummy"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0485}, 'train': {'f1': 0.09246088193456614, 'precision': 0.048471290082028336, 'recall': 1.0}, 'test': {'f1': 0.0, 'precision': 0.0, 'recall': 0.0}}
#code
# Precision vs recall for each threshold
metrics.PrecisionRecallDisplay.from_estimator(clf_baseline, X_test, y_test)
[output:]
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x7f9a55f014f0>
LogisticRegression
Now, we'll fit a basic Logistic Regression model to our data to see whether it manages to learn anything. We're not trying to optimize it yet, just checking that it learns something useful and that our preparation of the data helped.
#code
clf_logreg = LogisticRegression(random_state = RANDOM_SEED)
clf_logreg.fit(X_train, np.ravel(y_train))
LogisticRegression(random_state=42)
Evaluation numeric features model
#code
values = optimal_values(clf_logreg, X_train, y_train, X_test, y_test, plot_charts=True, threshold_value=0.5)
#storing this values to compare
compare_dict["Logistic numeric"]= {"default": values}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.0030627871362940277, 'precision': 0.3333333333333333, 'recall': 0.0015384615384615385}, 'test': {'f1': 0.0, 'precision': 0.0, 'recall': 0.0}}
It looks like the model learned almost nothing. Let's calculate the threshold based on the max F1 score and inspect the confusion matrix again.
The recall is still far too low: only $\approx$33% of the true fraudulent job posts are predicted as fraudulent on our test set. Precision is a little better but still too low, and as a consequence the F1-score is also too low.
#code
values = optimal_values(clf_logreg, X_train, y_train, X_test, y_test, plot_charts=True)
#storing this values to compare
compare_dict["Logistic numeric"]["f1_opt"]= values
But if we use the 0.0485 threshold, we see a drastic improvement in recall; we almost double this metric on the train and test datasets. However, the precision score is low, which keeps the F1 score too low as well.
#code
values = optimal_values(clf_logreg, X_train, y_train, X_test, y_test, plot_charts=True, threshold_value=0.0485)
#storing this values to compare
compare_dict["Logistic numeric"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0485}, 'train': {'f1': 0.24142515288487107, 'precision': 0.14593378334940535, 'recall': 0.6984615384615385}, 'test': {'f1': 0.22222222222222224, 'precision': 0.1349206349206349, 'recall': 0.6296296296296297}}
#code
metrics.PrecisionRecallDisplay.from_estimator(clf_logreg , X_test, y_test) # AP = average precision
threshold_opt = values["params"]["threshold"]
recall = values["test"]["recall"]
f1score = values["test"]["f1"]
precision = values["test"]["precision"]
plt.scatter(recall, precision, color="r")
text_annot = f"Threshold ≈ {np.round(threshold_opt,4)}\nf1-score ≈ {np.round(f1score,2)}"
plt.annotate(text_annot, (recall+0.02, precision), size=10)
[output:]
Text(0.6496296296296297, 0.1349206349206349, 'Threshold ≈ 0.0485\nf1-score ≈ 0.22')
Using only the numeric features, the Logistic Regression model achieves an average precision of 0.25, and we can see that it did learn useful information from the data, but only when the threshold was moved.
We can also look at the learned feature importances to understand what our model is looking for:
#code
sns.barplot(y=NUM_COLS, x=clf_logreg.coef_[0])
[output:]
<AxesSubplot:>
Recall that this is a classification problem with classes 0 and 1. Notice that the coefficients are both positive and negative. The positive scores indicate a feature that predicts class 1, whereas the negative scores indicate a feature that predicts class 0.
has_company_logo has the most importance in the model when predicting class 0, while has_questions has some importance. telecommuting has the most importance when predicting class 1. title_len has positive importance; description_len and requirements_len have negative but almost negligible importance when predicting the class.
This is aligned with our EDA on the numeric columns, where we checked the conditional probabilities and the kernel density of the text lengths.
NLP - Textual Features
So, at this point, we have a sense of our data, have found a good way to scale and normalize our numeric features, and have trained a very basic classifier on it.
A brief introduction to NLP: Natural Language Processing is a branch of Artificial Intelligence that analyzes, processes, and efficiently retrieves information from text data.
The next phase is to see what happens when we take advantage of NLP techniques on our textual features - title, company_profile, description, requirements. We'll use only the textual features to simplify things and isolate their effect.
To start simply, we'll unify only title and description into one big textual column. Later we will try another approach, unifying all textual columns.
#code
train_text_col = train_df_norm["title"] +" "+ train_df_norm["description"]
test_text_col = test_df_norm["title"] +" "+ test_df_norm["description"]
To turn this text into numerical features that can be used as input to an ML model, we'll use TfidfVectorizer with a sensible limit on vocabulary size (max_features=25000). It will split the text into tokens and give each text in the data a numerical representation of the frequency of tokens in that text.
Note that special care is needed when handling sparse matrices in Pandas, and the result returned by TfidfVectorizer is a sparse matrix. If you ignore this special handling, expect to run into out-of-memory errors, unresponsiveness, and crashes as the sparse matrix gets inflated.
#code
tfidf = TfidfVectorizer(max_features=25000)
tfidf.fit(train_text_col)
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_col), columns=tfidf.get_feature_names_out())
Interesting notes:
The longest English word in a major dictionary is Pneumonoultramicroscopicsilicovolcanoconiosis (45 letters), a pneumoconiosis caused by inhalation of very fine silicate or quartz dust, and the second is Hippopotomonstrosesquippedaliophobia (36 letters). We will build a regex expression that only considers words no longer than 36 characters. Source: irisreading.com/10-longest-words-in-the-eng..
We will also ignore words that start with a digit.
#code
print("words with more than 36 letters:", [w for w in tfidf.get_feature_names_out() if len(w) >36][:10])
[output:]
words with more than 36 letters: ['yijdnw4jujjilj17glxmbmfmqvbeix022dcqup8rsgcn4zyfax1c1nuxkpu1q66j']
#code
PATTERN = r"\b[a-zA-Z]\w{2,35}\b" # words that do not start with a digit and have length > 2 and <= 36
tfidf = TfidfVectorizer(max_features=25000, token_pattern=PATTERN)
tfidf.fit(train_text_col)
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_col), columns=tfidf.get_feature_names_out())
test_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(test_text_col), columns=tfidf.get_feature_names_out())
X_tfidf_train = train_tfidf_df.sparse.to_coo()
X_tfidf_test = test_tfidf_df.sparse.to_coo()
#code
print(train_tfidf_df.shape)
train_tfidf_df.sample(15, axis=1).head()
[output:]
(13410, 25000)
njhmfa | rome | adm201 | volunerable | intrathecal | miniature | liasising | upwards | waqualification | tallahassee | courses | fwa | gene | prudently | wakes | |
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
#code
clf_tfidf = LogisticRegression(random_state = RANDOM_SEED)
clf_tfidf.fit(X_tfidf_train, np.ravel(y_train))
LogisticRegression(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression
LogisticRegression(random_state=42)
Evaluating the textual model
Below, we can see that the new text-based model performs way better!
This is not surprising, as most information will be contained in the text content of each post.
Without adjusting the threshold, we get great precision but bad recall; the F1-score is near 0.41.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
#storing this values to compare
compare_dict["Logistic tfidf"]= {"default": values}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.5, 'precision': 0.9954128440366973, 'recall': 0.33384615384615385}, 'test': {'f1': 0.41025641025641024, 'precision': 0.9824561403508771, 'recall': 0.25925925925925924}}
Let's calculate the threshold based on the max F1 score and inspect the confusion matrix again.
With a threshold of ≈0.156 we have a good recall: $\approx$63% of the true fraudulent job posts are predicted as fraudulent on our test set. Precision is high and, as a consequence, we have a good F1-score.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True)
#storing this values to compare
compare_dict["Logistic tfidf"]["f1_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.15577321267865177}, 'train': {'f1': 0.7908082408874801, 'precision': 0.815359477124183, 'recall': 0.7676923076923077}, 'test': {'f1': 0.6901763224181361, 'precision': 0.7569060773480663, 'recall': 0.6342592592592593}}
Using the 0.0485 threshold we have a great recall metric: ≈91% of the true fraudulent job posts are predicted as fraudulent on our test set. Precision and the F1 score are not so great.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.0485)
#storing this values to compare
compare_dict["Logistic tfidf"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0485}, 'train': {'f1': 0.3705238236164546, 'precision': 0.22938805423231953, 'recall': 0.963076923076923}, 'test': {'f1': 0.35304659498207885, 'precision': 0.21888888888888888, 'recall': 0.9120370370370371}}
#code
metrics.PrecisionRecallDisplay.from_estimator(clf_tfidf , X_tfidf_test, y_test) # AP = average precision
threshold_opt = values["params"]["threshold"]
recall = values["test"]["recall"]
f1score = values["test"]["f1"]
precision = values["test"]["precision"]
plt.scatter(recall, precision, color="r")
text_annot = f"Threshold ≈ {np.round(threshold_opt,4)}\nf1-score ≈ {np.round(f1score,4)}"
plt.annotate(text_annot, (recall+0.02, precision), size=10)
[output:]
Text(0.9320370370370371, 0.21888888888888888, 'Threshold ≈ 0.0485\nf1-score ≈ 0.353')
As a sanity check, it's a good idea to check what terms the textual model cares about when making its decision.
As we see below, it looks like our model learned some things. For example, aker and subsea are terms that only show up in fraudulent job posts, so it makes sense that they are positive predictors with high coefficients. Another curious term is 'our', a word that conveys engagement and implies there is something behind the post, which tends not to be the case in a fraudulent job post.
#code
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(10,5))
print("Description example with term 'aker':", train_df_norm.loc[120,"description"][:100])
tfidf_coef_df = pd.DataFrame({'coef':clf_tfidf.coef_[0]})
tfidf_coef_df['term'] = tfidf.get_feature_names_out()
tfidf_coef_df['coef_abs'] = tfidf_coef_df['coef'].abs()
tfidf_coef_df = tfidf_coef_df.sort_values('coef_abs', ascending=False)
sns.barplot(y='term', x='coef', data=tfidf_coef_df[:20], ax=ax0).set(title="Top 20 coefficients of the terms in the decision function")
dict_fraudulent_terms = {col: y_train[train_tfidf_df[col]!=0].mul(100).values.mean().round(2) for col in tfidf_coef_df["term"][:20]}
df_fraudulent_terms = pd.DataFrame(dict_fraudulent_terms.values(), index=dict_fraudulent_terms.keys(), columns=["percent"])
sns.barplot( y=df_fraudulent_terms.index, x=df_fraudulent_terms["percent"], ax=ax1).set(title="Train set fraudulent job post with the term ", xlabel="percent", ylabel="term");
plt.tight_layout()
[output:]
Description example with term 'aker': Fining dining restaurant and lounge seeks EXPERIENCED and SERIOUS servers and bartenders to join the
Ngrams
An N-gram is a sequence of N words. For example, "Medium blog" is a 2-gram (a bigram), "A Medium blog post" is a 4-gram, and "Write on Medium" is a 3-gram (a trigram).
Including unigrams, bigrams, or trigrams in the model is more likely to capture important information that appears as multiple tokens in the text - for example, "account management".
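To make the idea concrete, here is a tiny sketch (not part of the original notebook pipeline) showing the tokens a vectorizer with ngram_range=(1,2) extracts from a short made-up text:
#code
# unigrams plus bigrams extracted from a single toy document
demo_vect = TfidfVectorizer(ngram_range=(1, 2))
demo_vect.fit(["strong account management experience"])
print(demo_vect.get_feature_names_out())
# ['account' 'account management' 'experience' 'management' 'management experience' 'strong' 'strong account']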
Ngrams (1,2)
Bigrams
#code
from time import time
t0 = time()
PATTERN = r"\b[a-zA-Z]\w{2,35}\b" # words that do not start with a digit and have length > 2 and <= 36
tfidf = TfidfVectorizer(max_features=25000, token_pattern=PATTERN, ngram_range=(1,2))
tfidf.fit(train_text_col)
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_col), columns=tfidf.get_feature_names_out())
test_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(test_text_col), columns=tfidf.get_feature_names_out())
X_tfidf_train = train_tfidf_df.sparse.to_coo()
X_tfidf_test = test_tfidf_df.sparse.to_coo()
clf_tfidf = LogisticRegression(random_state = RANDOM_SEED)
clf_tfidf.fit(X_tfidf_train, np.ravel(y_train))
duration_test = time() - t0
print(f"vectorize testing done in {duration_test:.3f}s ")
[output:]
vectorize testing done in 23.022s
As we can see, we achieve better results by combining unigrams and bigrams, with an ≈4% increase in recall and F1 score over the previous model.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
#storing this values to compare
compare_dict["Logistic bigrams"]= {"default": values}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.5102974828375286, 'precision': 0.9955357142857143, 'recall': 0.34307692307692306}, 'test': {'f1': 0.4609929078014184, 'precision': 0.9848484848484849, 'recall': 0.30092592592592593}}
Calculating the threshold based on the max F1 score and inspecting the confusion matrix: with a threshold of ≈0.1619 we have a good recall, with ≈62% of the true fraudulent job posts predicted as fraudulent on our test set. Precision is high and, as a consequence, we have a good F1-score, but these metrics are slightly worse than in the experiment using only unigrams.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True)
#storing this values to compare
compare_dict["Logistic bigrams"]["f1_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.16193888550303642}, 'train': {'f1': 0.8114285714285714, 'precision': 0.8643478260869565, 'recall': 0.7646153846153846}, 'test': {'f1': 0.690537084398977, 'precision': 0.7714285714285715, 'recall': 0.625}}
Using the 0.0485 threshold we have a great recall metric: ≈92% of the true fraudulent job posts are predicted as fraudulent on our test set. Precision and the F1 score are not so great, but in this case better than using only unigrams, with a 2% increase in the F1 score.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.0485)
#storing this values to compare
compare_dict["Logistic bigrams"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0485}, 'train': {'f1': 0.40252365930599365, 'precision': 0.25317460317460316, 'recall': 0.9815384615384616}, 'test': {'f1': 0.3747645951035781, 'precision': 0.23522458628841608, 'recall': 0.9212962962962963}}
Let's check some unigram and bigram words:
#code
print(train_tfidf_df.shape)
train_tfidf_df.sample(15, axis=1, random_state=42).head()
[output:]
(13410, 25000)
developing deep | wire transfer | frameworks | media management | money transfer | easier | corporate headquarters | looking quickly | backed company | for web | the administrative | basis | design delivery | service nvq | and would | |
0 | 0.0 | 0.051741 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Ngrams (1,3)
Trigrams
#code
from time import time
t0 = time()
PATTERN = r"\b[a-zA-Z]\w{2,35}\b" # words that do not start with a digit and have length > 2 and <= 36
tfidf = TfidfVectorizer(max_features=25000, token_pattern=PATTERN, ngram_range=(1,3))
tfidf.fit(train_text_col)
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_col), columns=tfidf.get_feature_names_out())
test_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(test_text_col), columns=tfidf.get_feature_names_out())
X_tfidf_train = train_tfidf_df.sparse.to_coo()
X_tfidf_test = test_tfidf_df.sparse.to_coo()
clf_tfidf = LogisticRegression(random_state = RANDOM_SEED)
clf_tfidf.fit(X_tfidf_train, np.ravel(y_train))
duration_test = time() - t0
print(f"vectorize testing done in {duration_test:.3f}s ")
[output:]
vectorize testing done in 27.225s
Using the 0.5 threshold, the test scores are the same as when combining unigrams and bigrams.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
#storing this values to compare
compare_dict["Logistic trigrams"]= {"default": values}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.512, 'precision': 0.9955555555555555, 'recall': 0.3446153846153846}, 'test': {'f1': 0.4609929078014184, 'precision': 0.9848484848484849, 'recall': 0.30092592592592593}}
Calculating the threshold based on the max F1 score and inspecting the confusion matrix: with a threshold of ≈0.1542 we have a better recall than using only unigrams and bigrams. With an increase of 3%, ≈65% of the true fraudulent job posts are predicted as fraudulent on our test set. Precision is high and, as a consequence, we have a good F1-score, the best so far, at ≈0.71.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True)
#storing this values to compare
compare_dict["Logistic trigrams"]["f1_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.15417957934023463}, 'train': {'f1': 0.8057784911717495, 'precision': 0.8422818791946308, 'recall': 0.7723076923076924}, 'test': {'f1': 0.7103274559193955, 'precision': 0.7790055248618785, 'recall': 0.6527777777777778}}
Using the 0.0485 threshold we have a great recall metric, with ≈91.7% of the true fraudulent job posts predicted as fraudulent on our test set, although in this case the metrics are slightly worse than using bigrams. So we will stick with bigrams.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.0485)
#storing this values to compare
compare_dict["Logistic trigrams"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0485}, 'train': {'f1': 0.40586173940745457, 'precision': 0.25592607472880674, 'recall': 0.98}, 'test': {'f1': 0.3728813559322034, 'precision': 0.23404255319148937, 'recall': 0.9166666666666666}}
Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
Examples of lemmatization:
rocks: rock
corpora: corpus
better: good
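These examples can be reproduced with NLTK's WordNetLemmatizer used in the cells below; note that mapping "better" to "good" only works when the part of speech is passed (a small illustrative snippet, assuming the WordNet data downloaded in the next cell is available):
#code
from nltk.stem import WordNetLemmatizer

wnl_demo = WordNetLemmatizer()
print(wnl_demo.lemmatize("rocks"))            # rock
print(wnl_demo.lemmatize("corpora"))          # corpus
print(wnl_demo.lemmatize("better", pos="a"))  # good (needs the adjective part of speech)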
This did not improve our metrics. This idea will be abandoned.
#code
# import these modules
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
[output:]
[nltk_data] Downloading package omw-1.4 to /home/repl/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /home/repl/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
#code
from time import time
t0 = time()
PATTERN = r"\b[a-zA-Z]\w{2,35}\b"  # words that start with a letter and are 3 to 36 characters long
wnl = WordNetLemmatizer()
def lemmatize_words(text):
words = text.split()
words = [wnl.lemmatize(word) for word in words]
return ' '.join(words)
train_text_lem_col = train_text_col.apply(lemmatize_words)
test_text_lem_col = test_text_col.apply(lemmatize_words)
tfidf = TfidfVectorizer(max_features=25000, token_pattern=PATTERN, ngram_range=(1,2))
tfidf.fit(train_text_lem_col)  # fit the vectorizer on the lemmatized training text
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_lem_col), columns=tfidf.get_feature_names_out())
test_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(test_text_lem_col), columns=tfidf.get_feature_names_out())
X_tfidf_train = train_tfidf_df.sparse.to_coo()
X_tfidf_test = test_tfidf_df.sparse.to_coo()
clf_tfidf = LogisticRegression(random_state = RANDOM_SEED)
clf_tfidf.fit(X_tfidf_train, np.ravel(y_train))
duration_test = time() - t0
print(f"vectorize testing done in {duration_test:.3f}s ")
[output:]
vectorize testing done in 30.663s
At the default 0.5 threshold we have, at first sight, worse scores than with the plain n-grams: the F1 score is ≈2% lower.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
# storing these values to compare
compare_dict["Logistic lem"]= {"default": values}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.5102974828375286, 'precision': 0.9955357142857143, 'recall': 0.34307692307692306}, 'test': {'f1': 0.4444444444444444, 'precision': 0.9841269841269841, 'recall': 0.28703703703703703}}
Next, we calculate the threshold based on the max F1 score and inspect the confusion matrix. Compared with the bigrams we get a slightly better recall and F1 score: with a threshold of ≈0.1614 the recall increases by ≈1% and the F1 score by ≈3% on the test set.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True)
# storing these values to compare
compare_dict["Logistic lem"]["f1_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.16137392525357253}, 'train': {'f1': 0.8016260162601626, 'precision': 0.85, 'recall': 0.7584615384615384}, 'test': {'f1': 0.7168831168831168, 'precision': 0.8165680473372781, 'recall': 0.6388888888888888}}
Using the 0.0485 threshold there is no significant improvement over the bigrams experiment. With no clear advantage, we will drop this idea.
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.0485)
# storing these values to compare
compare_dict["Logistic lem"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.0485}, 'train': {'f1': 0.401760452687834, 'precision': 0.2524693796918214, 'recall': 0.9830769230769231}, 'test': {'f1': 0.3765373699148533, 'precision': 0.23662306777645659, 'recall': 0.9212962962962963}}
Oversampling
We perform over-sampling with RandomOverSampler from the Imbalanced-learn library, which over-samples the minority class(es) by picking samples at random with replacement. The bootstrap can also be generated in a smoothed manner; see the Imbalanced-learn docs.
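The "smoothed" bootstrap mentioned above adds small random noise to the duplicated samples; recent imbalanced-learn releases expose this through a shrinkage parameter (availability depends on your installed version). A minimal sketch on synthetic data, purely for illustration:
#code
# Illustrative sketch of the smoothed bootstrap: with shrinkage set,
# RandomOverSampler perturbs the resampled points instead of duplicating
# them exactly (requires a recent imbalanced-learn version)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
ros = RandomOverSampler(shrinkage=0.5, random_state=0)
X_res, y_res = ros.fit_resample(X_toy, y_toy)
print(Counter(y_toy), "->", Counter(y_res))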
With a balanced training set we can now assume a threshold of 0.5. However, oversampling the training data makes the model prone to overfitting, which occurs when the model fits the training set too closely and cannot generalize. Indeed, the training metrics reach near-perfect scores while the test scores are considerably worse.
We will leave this idea aside and not use oversampling.
#code
from imblearn.over_sampling import RandomOverSampler
X_train = train_df_norm.drop(columns=TARGET_COL)
y_train = train_df_norm[TARGET_COL]
X_test = test_df_norm.drop(columns=TARGET_COL)
y_test = test_df_norm[TARGET_COL]
# Apply the random over-sampling
randover = RandomOverSampler(random_state=RANDOM_SEED)
X_train_resampled, y_train_resampled = randover.fit_resample(X_train, y_train)
print("Before oversampling X train shape:",X_train.shape, np.mean(y_train) )
print("After oversampling X train shape:",X_train_resampled.shape, np.mean(y_train_resampled) )
[output:]
Before oversampling X train shape: (13410, 22) fraudulent 0.048471
dtype: float64
After oversampling X train shape: (25520, 22) fraudulent 0.5
dtype: float64
#code
train_text_col = X_train_resampled["title"] +" "+ X_train_resampled["description"]
test_text_col = X_test["title"] +" "+ X_test["description"]
from time import time
t0 = time()
PATTERN = r"\b[a-zA-Z]\w{2,35}\b"  # words that start with a letter and are 3 to 36 characters long
tfidf = TfidfVectorizer(max_features=25000, token_pattern=PATTERN, ngram_range=(1,3))
tfidf.fit(train_text_col)
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_col), columns=tfidf.get_feature_names_out())
test_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(test_text_col), columns=tfidf.get_feature_names_out())
X_tfidf_train = train_tfidf_df.sparse.to_coo()
X_tfidf_test = test_tfidf_df.sparse.to_coo()
clf_tfidf = LogisticRegression(random_state = RANDOM_SEED)
clf_tfidf.fit(X_tfidf_train, np.ravel(y_train_resampled))
duration_test = time() - t0
print(f"vectorize testing done in {duration_test:.3f}s ")
[output:]
vectorize testing done in 41.962s
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train_resampled, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
# storing these values to compare
compare_dict["Logistic oversamp"]= {"default": values}
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.9926871012914268, 'precision': 0.9854803830707445, 'recall': 1.0}, 'test': {'f1': 0.702928870292887, 'precision': 0.6412213740458015, 'recall': 0.7777777777777778}}
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train_resampled, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
# with the balanced training set we keep the 0.5 threshold, so the same values
# also fill the "f1_opt" slot to keep the comparison structure consistent
compare_dict["Logistic oversamp"]["f1_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.9926871012914268, 'precision': 0.9854803830707445, 'recall': 1.0}, 'test': {'f1': 0.702928870292887, 'precision': 0.6412213740458015, 'recall': 0.7777777777777778}}
#code
values = optimal_values(clf_tfidf, X_tfidf_train, y_train_resampled, X_tfidf_test, y_test, plot_charts=True, threshold_value=0.5)
# ... and the "th_opt" slot as well, again with the 0.5 threshold
compare_dict["Logistic oversamp"]["th_opt"]= values
print(values)
[output:]
{'params': {'threshold': 0.5}, 'train': {'f1': 0.9926871012914268, 'precision': 0.9854803830707445, 'recall': 1.0}, 'test': {'f1': 0.702928870292887, 'precision': 0.6412213740458015, 'recall': 0.7777777777777778}}
Comparing metrics
#code
# flatten the nested compare_dict into a long-format DataFrame for plotting
df_compare = pd.DataFrame.from_records(
[
(level1, level2, level3, level4, leaf)
for level1, level2_dict in compare_dict.items()
for level2, level3_dict in level2_dict.items()
for level3, level4_dict in level3_dict.items()
for level4, leaf in level4_dict.items()
],
columns=['Name', 'Option', 'Attribute', 'Metric' , 'value']
)
# df_compare = df_compare.set_index(['Name', 'Option', 'Attribute']).head(20)
g = sns.catplot(data=df_compare, x="Name", y="value", hue="Metric", col="Attribute", row="Option", kind="point")
g.fig.set_figwidth(15)
g.fig.set_figheight(8)
[plt.setp(ax.get_xticklabels(), rotation=90) for ax in g.axes.flat];
for (row_key, col_key),ax in g.axes_dict.items():
title=""
if col_key == "params":
title1 = "Threshold"
elif col_key == "train":
title1 = "Train metrics"
else:
title1 = "Test metrics"
if row_key == "default" and col_key == "params":
title = title1 + " 0.5"
elif row_key == "default":
title = title1 + " based on threshold=0.5"
elif row_key == "f1_opt":
title = title1 + " based on optimal f1"
else:
title = title1 + " based on threshold=0.0485"
ax.set_title(title)
# pd.DataFrame.from_dict(compare_dict, orient='index')
Various Models
We will now test different models, applying TfidfVectorizer with the following hyperparameters:
max_features=25000
token_pattern = r"\b[a-zA-Z]\w{2,35}\b"
ngram_range=(1,1)
#code
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
classifiers = {
'Logistic Regression': {
'model': LogisticRegression(random_state = RANDOM_SEED),
'metrics': {}
},
'Decision Tree': {
'model': DecisionTreeClassifier(random_state = RANDOM_SEED),
'metrics': {}
},
'Random Forest': {
'model': RandomForestClassifier(random_state = RANDOM_SEED),
'metrics': {}
},
'Neural Network': {
'model': MLPClassifier(random_state = RANDOM_SEED, early_stopping=True),
'metrics': {}
},
'AdaBoost': {
'model': AdaBoostClassifier(random_state = RANDOM_SEED),
'metrics': {}
},
'Gradient Boosting': {
'model': GradientBoostingClassifier(random_state = RANDOM_SEED),
'metrics': {}
}
}
#code
# using the transformations tested in the examples above, we construct the final train and test sets
X_train = train_df_norm.drop(columns=TARGET_COL)
y_train = train_df_norm[TARGET_COL]
X_test = test_df_norm.drop(columns=TARGET_COL)
y_test = test_df_norm[TARGET_COL]
train_text_col = X_train["title"] +" "+ X_train["description"]
test_text_col = X_test["title"] +" "+ X_test["description"]
from time import time
t0 = time()
PATTERN = r"\b[a-zA-Z]\w{2,35}\b"  # words that start with a letter and are 3 to 36 characters long
tfidf = TfidfVectorizer(max_features=25000, token_pattern=PATTERN, ngram_range=(1,1))
tfidf.fit(train_text_col)
train_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(train_text_col), columns=tfidf.get_feature_names_out())
test_tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf.transform(test_text_col), columns=tfidf.get_feature_names_out())
X_tfidf_train = train_tfidf_df.sparse.to_coo()
X_tfidf_test = test_tfidf_df.sparse.to_coo()
#code
# we will train using the different models and store the metrics
from time import time
threshold = 0.0485
for k, v in classifiers.items():
t0 = time()
model = v["model"]
print("Working the model", model.__class__.__name__,"...")
model.fit(X_tfidf_train, np.ravel(y_train))
_, _, test_f1score, test_recall, test_precision = f1_recall_precision_threshold(model, X_tfidf_test, y_test)
values = optimal_values(model, X_tfidf_train, y_train, X_tfidf_test, y_test, threshold_value= threshold )
threshold_opt = values["params"]["threshold"]
opt_recall = values["test"]["recall"]
opt_f1score = values["test"]["f1"]
opt_precision = values["test"]["precision"]
duration_test = time() - t0
v["metrics"] = {"test precision":test_precision,
"test recall":test_recall,
"test f1":test_f1score,
"opt threshold":threshold_opt,
"test opt precision":opt_precision,
"test opt recall":opt_recall,
"test opt f1":opt_f1score,
"exec_time":duration_test}
print(f"Model done in {duration_test:.3f}s\n")
df_metrics = pd.DataFrame([{"model":k, **{ mk:mv for mk, mv in v["metrics"].items() }} for k, v in classifiers.items() ])
[output:]
Working the model LogisticRegression ...
Model done in 10.938s
Working the model DecisionTreeClassifier ...
Model done in 26.486s
Working the model RandomForestClassifier ...
Model done in 12.490s
Working the model MLPClassifier ...
Model done in 295.177s
Working the model AdaBoostClassifier ...
Model done in 10.077s
Working the model GradientBoostingClassifier ...
Model done in 39.303s
#code
df_metrics = df_metrics.sort_values("test opt recall", ascending=False)
df_metrics
 | model | test precision | test recall | test f1 | opt threshold | test opt precision | test opt recall | test opt f1 | exec_time |
4 | AdaBoost | 0.687500 | 0.407407 | 0.511628 | 0.0485 | 0.048322 | 1.000000 | 0.092190 | 10.077329 |
0 | Logistic Regression | 0.982456 | 0.259259 | 0.410256 | 0.0485 | 0.218889 | 0.912037 | 0.353047 | 10.937702 |
2 | Random Forest | 0.991935 | 0.569444 | 0.723529 | 0.0485 | 0.256342 | 0.888889 | 0.397927 | 12.490194 |
3 | Neural Network | 0.953333 | 0.662037 | 0.781421 | 0.0485 | 0.525074 | 0.824074 | 0.641441 | 295.177136 |
5 | Gradient Boosting | 0.944444 | 0.472222 | 0.629630 | 0.0485 | 0.576208 | 0.717593 | 0.639175 | 39.302656 |
1 | Decision Tree | 0.676056 | 0.666667 | 0.671329 | 0.0485 | 0.676056 | 0.666667 | 0.671329 | 26.486268 |
#code
fig,(ax0, ax1) = plt.subplots(ncols=2, figsize=(15,5))
df_metric_melt=pd.melt(df_metrics, id_vars=["model"],value_vars=["test precision","test recall","test f1"], var_name="metric",value_name="value" )
df_metric_melt["metric"] = df_metric_melt["metric"].replace({"test precision":"precision","test recall":"recall","test f1":"f1"})
ax= sns.pointplot(data=df_metric_melt, x="model", y="value", hue="metric", ax= ax0)
ax.set(title="Models test set evaluation comparison, threshold = 0.5")
major_tick = np.arange(0, 1.01, 0.1)
minor_tick = np.arange(0, 1.01, 0.05)
ax.set_yticks(major_tick)
ax.set_yticks(minor_tick, minor=True)
ax.grid(visible=True, which='major', axis="y")
ax.grid(visible=True, which='minor',axis="y", alpha=0.4)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
df_metric_melt=pd.melt(df_metrics, id_vars=["model"],value_vars=["test opt precision","test opt recall","test opt f1","opt threshold"], var_name="metric",value_name="value" )
df_metric_melt["metric"] = df_metric_melt["metric"].replace({"test opt precision":"precision","test opt recall":"recall","test opt f1":"f1","opt threshold":"threshold"})
ax= sns.pointplot(data=df_metric_melt, x="model", y="value", hue="metric", ax=ax1)
ax.set(title="Models test set evaluation comparison, threshold = 0.0485")
ax.set_yticks(major_tick)
ax.set_yticks(minor_tick, minor=True)
ax.grid(visible=True, which='major', axis="y")
ax.grid(visible=True, which='minor',axis="y", alpha=0.4)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
Part 1 Conclusion
When we apply the "threshold-moving" technique, we can obtain better metric results.
Neural Network, Random Forest, and Logistic Regression look promising, especially the Neural Network, which gives both a great recall and a great F1 score with default hyperparameters and the 0.0485 threshold.
By utilizing n-grams (unigrams and bigrams) and adjusting the threshold, we achieved satisfactory scores in our evaluation metrics with a basic Logistic Regression model. Setting the threshold at ≈0.0485 gave a recall of ≈0.92, meaning ≈92% of the fraudulent jobs in the test set were correctly identified as fraudulent, which is a positive step towards solving the problem at hand.
Next...
In part 2 of this series we will set up version control and an experiment-tracking system, preparing the project with tools such as DagsHub, DVC, and MLFlow. These tools help maintain a clear history of the project, track experiments, compare models and results, and collaborate with other team members.