Identify fake job postings! - part 3

Part 3 is all about deployment: using MLFlow and FastAPI, we will deploy the model as a WebAPI and serve it with Mogenius, a Virtual DevOps platform.

Problem statement:

My friend is on the job market. However, they keep wasting time applying for fraudulent job postings. They have asked me to use my data skills to filter out fake postings and save them effort. They have mentioned that job postings are abundant, so they would prefer my solution to risk filtering out real posts if it decreases the number of fraudulent posts they apply to. I have access to a dataset consisting of approximately 18,000 job postings, containing both real and fake jobs.
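Since my friend tolerates false alarms on real posts, a recall-weighted metric such as F-beta with beta > 1 matches this preference. A minimal pure-Python sketch (the confusion-matrix counts below are invented purely for illustration):

```python
# F-beta score: beta > 1 weights recall higher than precision, matching a
# preference for catching fraudulent posts even at the cost of flagging
# some real ones. The counts below are hypothetical.
def fbeta(precision: float, recall: float, beta: float = 2.0) -> float:
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

tp, fp, fn = 80, 40, 10      # hypothetical counts for the "fraudulent" class
precision = tp / (tp + fp)   # 0.666...
recall = tp / (tp + fn)      # 0.888...
print(round(fbeta(precision, recall), 3))  # 0.833
```

With beta = 2, recall counts four times as much as precision, so a model that misses few fraudulent posts scores well even if it flags some real ones.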

Story published with Jupyter2Hashnode

Have you ever struggled to convert a Jupyter Notebook into a compelling Hashnode story? If so, you're not alone. It can be a daunting task, but fortunately, there's a tool that can simplify the process: Jupyter2Hashnode.

With Jupyter2Hashnode, you can convert Jupyter Notebooks into Hashnode stories with just a single command. The tool compresses images, uploads them to the Hashnode server, updates image URLs in the markdown file, and finally, publishes the story article. It's an effortless way to transform your data analysis or code tutorials into a polished and engaging format.

If you're interested in learning more about Jupyter2Hashnode, there's a detailed guide available on Hashnode (tiagopatriciosantos.hashnode.dev/jupyter2ha..). It's a game-changing tool that can save you time and energy while helping you create high-quality content for your audience. Give it a try and see the difference for yourself!

Part 3

This end-2-end ML (Machine Learning) project is divided into a 3-part series.

  • Part 1 - is all about getting to know the Dataset using Exploratory analysis, cleaning data, choosing the metrics, and doing the first model prediction experiments.

  • Part 2 - is about the setup of DagsHub, DVC, and MLFlow to create a version-controlled data science project, as well as tracking experiment parameters and metrics, and comparing experiments.

  • Part 3 - is all about deployment, where using MLFlow and FastAPI we will deploy the model as a WebAPI and serve it with Mogenius, a Virtual DevOps platform.

You can check this GitHub project here.

❗⚠ [UPDATE] At the time of writing this article, Mogenius had suspended new registrations; since then, this announcement has been published:

Dear mogenius Community,

It has been almost a year since we released the first version of mogenius to the public, and what a year it has been! We started with just a few users in the first few weeks, but we have steadily grown our user base with great applications running on mogenius. We have received overwhelming feedback from users and customers, and of course, some failures that have led to huge improvements in our product. Then, growth hit us, with thousands of users joining the platform until we reached more than 50,000 developers on mogenius by the end of 2022. To all of you who accompanied us in our first year, we want to say a big thank you!

On this journey, we faced multiple challenges, ranging from technical hurdles to abuse handling, and keeping the mogenius platform secure at all times. Apart from the free tier and individual usage, we have also seen what the mogenius platform can do for teams and organizations. The product is a quick and easy way to deploy a container in the cloud. But with multi-cloud capabilities, Kubernetes automation, and security, it can do much more and show its strengths in a professional context.

In a young company, focus is incredibly important, and lack thereof can lead to failure. That is why we have decided to focus on the environment where our product can serve in the best way, which is the professional context. As a consequence, we will have to end our free tier and the Community Plan. It was a hard decision to make, as we know the value of a free tier for educational purposes, for developers who cannot afford to pay, and also just for fun. However, it was a necessary decision to ensure the sustainability of our platform and our mission.

What does that mean for you? The free tier/Community Plan will officially shut down on March 31, 2023, with all services running on free subscriptions being stopped. We hope this timeline leaves you enough room to migrate your services to other platforms out there that offer a free tier.

For all of you who want to stay on mogenius, we have created the Personal plan with an extended set of features and resources. The plan can be purchased until March 15 and is available through this link. If you want to discuss using mogenius in your team, feel free to reach out via mogenius.com/contact. Again, we want to thank all of you for your incredible support, feedback, and engagement! Hopefully, we will be seeing you on the platform again with your company in the future.

Your mogenius Team

Tools

For this part, I will use Git and VS Code as the editor.

Follow the instructions to install both.

I assume a working Python 3 installation on the local system.

I also assume that we have already logged a model into the DagsHub MLFlow tracking server.

What is Mogenius?

mogenius.com

Mogenius is the single layer between your application and the cloud. You can deploy and run any application with Mogenius and get it up and running in no time on a hyper-scalable and automated cloud infrastructure. Most application types and services are supported, like web applications, databases, background workers, and of course static websites.

Read more about supported services here.

For free (see the [UPDATE] above)

With the Community plan we can:

  • Run our personal projects and prototypes on Mogenius.

  • Auto-deployment on Kubernetes

  • Hyperscaling cloud resources on AWS or Azure

  • CI/CD pipeline

  • CDN, cybersecurity protection, SSL management

  • Access to the Mogenius developer community with monthly free cloud resources and more benefits

We can compare plans in detail on the pricing page.

I will show the steps that I've used to set up the project, although feel free to follow Mogenius's tutorials to have a broader understanding:

  1. docs.mogenius.com/getting-started/quickstart

  2. docs.mogenius.com/tutorials/how-to-deploy-p..

  3. docs.mogenius.com/tutorials/how-to-deploy-f..

Joining Mogenius...

First, sign up by entering your email address and choosing a password. Next, verify your email address and phone number to secure your Mogenius account. Once completed, we are ready to create our first cloudspace.

Create a Cloudspace

Start our first project on Mogenius by creating a cloudspace. Give it a name of at most 24 characters, with no spaces or special characters. Click "Create now" and our cloudspace will be created using the Mogenius Community Plan.

🥳 Congratulations on creating our first cloudspace on Mogenius!

Add your first service, FastAPI

One of the initial tasks is to add services to our cloudspace (e.g. an application or a database). When we first start, we'll see the pop-up window below. Alternatively, we can add services from our cloudspace dashboard, where we'll also see the available resources in our cloudspace. There are three ways to add services to a cloudspace; we will use a pre-configured service template to create our FastAPI service:

With this option, Mogenius will automatically create and add a boilerplate FastAPI template to your Git repository, allowing you to start coding in the newly created repo or to use existing code. Browse the service library or use the search function to find the FastAPI service, then click "Add service."

Next, if this is the first time you are deploying a service, you need to connect your cloudspace to your repository. Click on “Continue with GitHub,” which will prompt you to grant permission to access your GitHub repositories. You will only need to do this once, as your Mogenius cloudspace will now be connected to your GitHub account and can access your repositories.

Next, we create a new repository by clicking “+ Add repository.” Select a name for the new repository and create it. By default, this will also be the name of our service, but we can also change it to a different name.

We can leave all settings at default for now, as we can change them at any point later when the service is up and running.

Now, simply click "Create Service." Our FastAPI boilerplate template will be built, added to the specified Git repository, and deployed to our cloudspace simultaneously, allowing us to start using it almost immediately. Once the setup routines, build, and deployment process are complete (usually only a few minutes), we can start coding in our repository and access our FastAPI at the specified hostname. Every time we commit any changes to our repository, it will trigger a new build-deploy process automatically (CI/CD).

We can find all the details on our service's overview page, view metrics, access service logs, add resources, and add additional instances for our service (Kubernetes pods).

That's it! We have created the FastAPI service, and other services can reach it via the internal hostname assigned to it, e.g. fastapi-template-8b4tp5:3000. Since we choose to expose this service, it also gets an external hostname that can be accessed from outside our cloudspace, which looks like this: fastapi-template-prod-myaccount-afooyl.mo2...

If we go to the GitHub repository we can see the result of this creation:

MLFlow changing stage to production

Now let's move our model to the "Production" stage; it is the Production-stage model that we will deploy in our WebAPI.

Access the MLFlow UI:

Open the model:

Set it to Production:

Say "OK" and voilà.

Cloning the FastAPI Project

Let's now clone the repository to our local machine by copying the clone command from the GitHub repository.

Execute these commands in the command line:

cd path/to/folder
git clone https://github.com/tiagopatriciosantos/FastApiFakeJobPost.git
cd FastApiFakeJobPost

With VS Code already installed, we can now run:

code .

That will open the VS Code editor.

Creating a virtual python environment

To create and activate our virtual python environment using venv, type the following commands into your terminal (still in the project folder):

Linux/Mac

python3 -m venv .venv
echo .venv/ >> .gitignore
source .venv/bin/activate

Windows

python3 -m venv .venv
echo .venv/ >> .gitignore
.venv\Scripts\activate.bat

The first command creates the virtual environment - a directory named .venv, located inside your project directory, where all the Python packages used by the project will be installed without affecting the rest of your computer.

The second command activates the virtual python environment, which ensures that any python packages we use don't contaminate our global python installation.

The rest of this tutorial should be executed in the same shell session. If we exit the shell session or want to open another one, we need to make sure to activate the virtual environment in that shell session first.
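To quickly confirm that the virtual environment is active in a session, Python itself can tell us (a small generic sketch, nothing project-specific):

```python
import sys

# Inside an activated venv, sys.prefix points into the .venv directory,
# while sys.base_prefix still points at the system-wide installation.
in_venv = sys.prefix != sys.base_prefix
print("virtual environment active:", in_venv)
```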

Installing requirements

To install the requirements, open requirements.txt and replace its contents with these direct dependencies:

pydantic>=1.8.0,<2.0.0
uvicorn==0.20.0
fastapi==0.89.1
pandas==1.5.3
scikit-learn==1.2.0
rich==13.3.0
mlflow==2.1.1
python-multipart==0.0.5
python-dotenv==0.21.1

Now, to install them, type:

pip install -r requirements.txt

Load and serve the model

app/main.py

Open the app/main.py file and put the following code into it.

This Python code defines a FastAPI application that loads a pre-trained ML model and uses it to make predictions on input data provided by a user through a CSV file.

The code imports the following modules:

  • FastAPI: A web framework for building APIs quickly and easily.

  • File and UploadFile from FastAPI: These are used for handling file uploads in the application.

  • HTTPException from FastAPI: This is used to raise HTTP exceptions when there are errors in the application.

  • mlflow: A machine learning platform for managing the ML lifecycle, including experiment tracking, packaging code into reproducible runs, and sharing and deploying models.

  • pandas: A library for data manipulation and analysis.

  • print from rich: Used for pretty-printing information to the console.

The code defines a Model class to store the pre-trained model and use it for prediction. The __init__ method of this class loads the deployed model using mlflow.pyfunc.load_model(), and the predict method uses the loaded model to make predictions on new data. The get_schema and get_columns methods return information about the input schema of the model.

The code defines a POST endpoint with the path /predict that accepts a CSV file and returns a JSON object containing the model's predictions on the input data. If the file is not a CSV file, the application raises an HTTP 400 exception indicating that only CSV files are accepted.

The code also defines two GET endpoints with the paths /schema and /info that return information about the input schema and model information, respectively.

Finally, the code creates an instance of the Model class using the main model name and a tracking URI, sets up the FastAPI application with the initialized Model instance, and prints a message indicating that the setup is complete.

from fastapi import FastAPI, File, UploadFile, HTTPException
import mlflow
import pandas as pd
from rich import print

## loads environment variables from .env file
from dotenv import load_dotenv
load_dotenv() 

# Initialize the FastAPI application
app = FastAPI(docs_url="/")

# Create a class to store the deployed model & use it for prediction
class Model:
    def __init__(self, model_name: str, tracking_uri):
        """
        To initialize the model
        model_name: Name of the registered model
        tracking_uri: URI of the MLFlow tracking server
        """
        # Load the deployed model
        self.model_name = model_name
        mlflow.set_tracking_uri(tracking_uri)
        uri = f"models:/{self.model_name}/Production"

        self.model = mlflow.pyfunc.load_model(uri)

    def predict(self, data):
        """
        To use the loaded model to make predictions on the data
        data: Pandas DataFrame to perform predictions
        """
        predictions = self.model.predict(data)
        return {  str(k): str(v) for k, v in enumerate(predictions) }

    def get_schema(self, to_dtypes=False):
        schema = self.model.metadata.signature.inputs.to_dict()
        if to_dtypes:
            schema = {r["name"]:  ( r["type"] if r["type"] !="string" else "object" )  for r in schema  }
        return schema

    def get_columns(self):
        schema = self.model.metadata.signature.inputs.to_dict()
        return [ r["name"]  for r in schema  ]

    def get_info(self):
        client = mlflow.MlflowClient()
        mv = [mv for mv in client.search_model_versions(self.model_name) if mv.current_stage == 'Production' ]
        return dict(mv[0])

model = Model("main","https://dagshub.com/tiagopatriciosantos/FakeJobPostsProject.mlflow")
print("All setup!")

# Create the POST endpoint with path '/predict'
@app.post("/predict", tags=["Fake Job"])
async def create_upload_file(file: UploadFile = File(...)):
    # Handle the file only if it is a CSV
    if file.filename.endswith(".csv"):
        # CSV file to load the data into a pandas Dataframe
        data = pd.read_csv(file.file, dtype=model.get_schema(True), usecols=model.get_columns())

        # Return a JSON object containing the model predictions
        labels = {
            "Labels": model.predict(data)
        }
        return  labels
    else:
        # Raise a HTTP 400 Exception, indicating Bad Request 
        raise HTTPException(status_code=400, detail="Invalid file format. Only CSV Files accepted.")


@app.get("/schema", tags=["Fake Job"])
async def get_schema():
    return model.get_schema()

@app.get("/info", tags=["Fake Job"])
async def get_info():
    return model.get_info()
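
To make the response shape concrete, here is the label serialization that the predict method performs, reproduced in isolation (the sample predictions below are invented; the real ones come from the MLFlow model):

```python
# The /predict endpoint wraps model output in a JSON-friendly dict by
# enumerating the predictions and stringifying both keys and values.
predictions = [0, 1, 0]  # hypothetical binary labels: 1 = fraudulent
labels = {"Labels": {str(k): str(v) for k, v in enumerate(predictions)}}
print(labels)  # {'Labels': {'0': '0', '1': '1', '2': '0'}}
```

Keys are the row positions of the uploaded CSV, so a client can join the labels back to its input rows.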

To test the code we need to connect to the DagsHub MLFlow server. We can set the required environment variables in a .env file, since we have load_dotenv set up, or we can set them directly in our command line.

.env

This file stores the necessary environment variables and is read when load_dotenv() is called:

MLFLOW_TRACKING_USERNAME=tiagopatriciosantos
MLFLOW_TRACKING_PASSWORD=<secret>

🚩🚨 Don't forget to include this file in the .gitignore file; you don't want to push your secrets to a public repository.

echo .env >> .gitignore

We can get the necessary MLFlow values from the DagsHub repository:

Test the API

We can now test the API using the following command:

uvicorn app.main:app

All setup!
INFO:     Started server process [28372]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000

We can now access the address http://127.0.0.1:8000 and test our API.
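Besides the interactive docs, the endpoint can also be called programmatically. Here is a stdlib-only sketch that builds the multipart upload by hand; the CSV columns and file name are placeholders, and since sending it requires the server above to be running, the final urlopen call is left commented out:

```python
import io
import urllib.request

# Build a multipart/form-data POST for the /predict endpoint by hand.
boundary = "----fakejobpostboundary"
csv_bytes = b"title,location\nData Scientist,Remote\n"  # placeholder columns

body = io.BytesIO()
body.write(f"--{boundary}\r\n".encode())
body.write(b'Content-Disposition: form-data; name="file"; filename="jobs.csv"\r\n')
body.write(b"Content-Type: text/csv\r\n\r\n")
body.write(csv_bytes)
body.write(f"\r\n--{boundary}--\r\n".encode())

req = urllib.request.Request(
    "http://127.0.0.1:8000/predict",
    data=body.getvalue(),
    headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    method="POST",
)
# urllib.request.urlopen(req) would return the JSON predictions once the
# server is running; here we only construct the request.
print(req.get_method(), req.full_url)
```

Note that the file name must end in .csv, otherwise the endpoint responds with HTTP 400.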

The docker file

We don't need to change anything in this file, as it has already been set up by Mogenius. The Dockerfile is used to build a Docker image that runs a Python web application with the Uvicorn web server.

Here is a breakdown of the file:

  • This line specifies the base image for the Docker image, which is the official Python 3.9 image from Docker Hub.

FROM python:3.9

  • This line sets the working directory for the container to /code.

WORKDIR /code

  • These lines copy the requirements.txt file from the local file system to the container's /code directory and then install the Python dependencies specified in requirements.txt using pip.

COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

  • This line copies the app directory from the local file system to the container's /code/app directory.

COPY ./app /code/app

  • This line specifies that the container will listen on port 8080. However, this does not actually publish the port; it just documents that the container will use it.

EXPOSE 8080

  • This line sets the user that runs the container to 1000. This is useful for security purposes, as it helps ensure that the container runs with minimal privileges.

USER 1000

  • This line specifies the command that runs when the container starts. It launches the Uvicorn web server with the app.main:app module as the application and binds to the container's port 8080.

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

The Dockerfile file commands:

FROM python:3.9

WORKDIR /code

COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
COPY ./app /code/app

EXPOSE 8080

USER 1000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

Committing progress to Git

Create the file .gitignore and paste this text inside:

.venv
__pycache__
.env

Let's check the Git status of our project:

$ git status -s
 M app/main.py
 M requirements.txt
?? .gitignore

Now let's commit this to Git using the command line:

git add .
git commit -m "Added MLFlow, serve the model logic and endpoint"
git push -u origin main

Mogenius Environment variables & secrets

You can define environment variables and secrets in the Mogenius UI. Each secret is encrypted and stored in the key vault. To use a particular secret, reference it by name to retrieve the encrypted value; this way, a secret is never written in code but is fetched from the key vault in a secure way.

We need to create these environment variables so our code can run:

MLFLOW_TRACKING_USERNAME=tiagopatriciosantos
MLFLOW_TRACKING_PASSWORD=<secret>

Go to Mogenius studio and add this into the service environment variables:
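Both MLFlow and our app read these values from the process environment at startup. A small, hypothetical pre-flight check (the variable names match those above; the default values here are dummies for local experimentation only):

```python
import os

# Dummy defaults for local experimentation only; on Mogenius the real values
# are injected from the key vault as service environment variables.
os.environ.setdefault("MLFLOW_TRACKING_USERNAME", "tiagopatriciosantos")
os.environ.setdefault("MLFLOW_TRACKING_PASSWORD", "dummy-secret")

required = ["MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD"]
missing = [name for name in required if not os.getenv(name)]
print("missing variables:", missing)  # expect an empty list
```

Running a check like this before loading the model gives a clearer error than a failed MLFlow authentication deep inside startup.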

Checking the final result

We can now go to the external address of our service and use our API...

And test it with the example file available here.

Conclusion

This wraps up the three-part series into which this end-to-end ML project was divided. Part 1 covered exploratory analysis, data cleaning, metric selection, and initial model prediction experiments. Part 2 focused on setting up DagsHub, DVC, and MLFlow for version control, tracking experiment parameters and metrics, and comparing experiments. Finally, Part 3 focused on deployment, where MLFlow and FastAPI were used to deploy the model as a WebAPI and serve it with Mogenius, a Virtual DevOps platform. Together, the three parts provide a comprehensive overview of end-to-end ML development, from data exploration to deployment.