Inside the AI Recruiter: How AI Actually Makes Hiring Decisions

Generative Artificial Intelligence is increasingly being used in hiring for tasks like candidate assessment and job matching. Its ability to process unstructured text, such as résumés and job descriptions, makes it incredibly powerful for recruitment. Yet, before integrating these systems, it is important to rigorously evaluate their potential for bias and their impact on equal opportunity in hiring.

To assess this, our team recently audited a leading AI model (Gemini 2.0 Flash) to understand its underlying decision logic. Instead of relying on complex math or assumptions, we created thousands of realistic test profiles and job descriptions. We then asked the AI to play the role of a recruiter and score how well each freelancer matched the project. For more detail on the methodology, take a look at the paper.
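
For readers who want a concrete picture, the audit setup can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the attribute names and levels, the `make_profile` and `build_prompt` helpers, and the prompt wording are all assumptions made for this example, and the actual model call is left out.

```python
import random

# Hypothetical attribute levels; the study's real attributes and values may differ.
ATTRIBUTES = {
    "gender": ["female", "male"],
    "degree": ["bachelor", "master"],
    "years_experience": [1, 3, 5, 8],
    "daily_rate_eur": [250, 400, 550, 700],
    "platform_reviews": [0, 5, 20],
}


def make_profile(rng):
    """Draw each attribute independently so its effect can be isolated later."""
    return {name: rng.choice(levels) for name, levels in ATTRIBUTES.items()}


def build_prompt(profile, job_description):
    """Assemble a recruiter-style scoring prompt (wording is illustrative)."""
    lines = [f"- {key}: {value}" for key, value in profile.items()]
    return (
        "You are a recruiter. Rate from 1 to 10 how well this freelancer "
        "matches the project.\n\n"
        f"Project:\n{job_description}\n\n"
        "Freelancer profile:\n" + "\n".join(lines)
    )


rng = random.Random(42)  # fixed seed so the test set is reproducible
profiles = [make_profile(rng) for _ in range(1000)]
prompt = build_prompt(profiles[0], "Backend developer, 5+ years of experience")
```

Drawing every attribute independently is what makes the later comparisons meaningful: if gender and daily rate were correlated in the test set, their effects on the score could not be separated.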

Here is a deep dive into what we found and what it means for the future of AI in hiring.

The Baseline: What Does the LLM Care About Most?

First, we wanted to understand the baseline. Which aspects of a candidate's profile does the AI emphasize, and what does it overlook? Here is what the data tells us about the AI's core preferences:

The Dealbreakers: The AI most strongly penalizes freelancers with insufficient work experience and those with no prior activity or reviews on the platform. Candidates who lack the required years of experience face a massive penalty in their overall score.
The Importance of Pricing: Daily rates are critical. Just like human recruiters, the AI penalizes candidates who ask for too much money. However, it also strongly penalizes candidates who underprice themselves. The model seems to interpret low rates as a red flag for inexperience or low confidence.
The Non-Factors (On Average): At a high level, the AI places minimal weight on socio-demographic characteristics—like gender, ethnicity, or whether someone has a Bachelor's versus a Master's degree. It also doesn't care much about the size of the companies the candidate previously worked for.

Figure: What does the LLM care about most?

Takeaway: Overall, the AI's scoring behavior makes logical sense: it rewards close skill matches and clear signals of trust (like good platform reviews). However, its harsh penalty on underpricing shows that the AI reads between the lines and makes assumptions about competence.
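
One simple way to quantify preferences like these, assuming profile attributes were varied independently, is to compare mean scores between two levels of an attribute. The sketch below illustrates that idea only; the study itself may rely on a different estimator (for example, a regression model), and the `average_effect` helper, field names, and toy scores are invented for this example.

```python
from statistics import mean


def average_effect(records, attribute, level, baseline):
    """Mean score at `level` minus mean score at `baseline` for one attribute."""
    at_level = [r["score"] for r in records if r[attribute] == level]
    at_base = [r["score"] for r in records if r[attribute] == baseline]
    return mean(at_level) - mean(at_base)


# Toy records, invented for illustration only (not results from the audit).
records = [
    {"meets_experience": True, "rate": "market", "score": 8},
    {"meets_experience": True, "rate": "low", "score": 6},
    {"meets_experience": False, "rate": "market", "score": 4},
    {"meets_experience": False, "rate": "low", "score": 2},
]

# Negative values are penalties relative to the baseline level.
experience_penalty = average_effect(records, "meets_experience", False, True)
underpricing_penalty = average_effect(records, "rate", "low", "market")
```

In this toy data the experience gap costs more points than underpricing, but the numbers carry no meaning beyond showing how the comparison works.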

The Hidden Biases: Different Rules for Different Groups

While the average scores look unbiased, fairness is rarely that simple. We dug deeper to see whether the AI applies different evaluation rules depending on a candidate's demographic group. It turns out that it does.

Gender Dynamics: While female profiles receive no baseline penalty, the AI applies fundamentally different evaluation criteria to them. For example, women face stricter standards regarding how well their past industry experience matches the new job. They are also penalized more heavily for underpricing their services.
Ethnic Origins: Arabic male profiles face a slight baseline penalty. Yet, the model is strangely more forgiving of them in other areas, showing more tolerance for experience gaps and skill mismatches. On the flip side, the AI is much harsher on these profiles if they lack a strong platform reputation.
Educational Background: Profiles holding only a bachelor's degree are penalized compared to master-level candidates at baseline. However, the AI becomes more lenient with bachelor-level candidates if they have gaps in their experience.

Takeaway: The AI doesn't use a single yardstick. It constructs entirely different evaluation schemes with distinct standards and tolerances based on demographic groups.
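
The difference between "no baseline penalty" and "different rules" can be made concrete by measuring the effect of one signal separately within each group. The sketch below is an assumption-laden illustration rather than the study's method: the `effect_within_group` helper, the field names, and the toy scores are all hypothetical.

```python
from statistics import mean


def effect_within_group(records, group_key, group, attribute, level, baseline):
    """Effect of `attribute` (level vs. baseline) computed inside one group only."""
    subset = [r for r in records if r[group_key] == group]
    at_level = [r["score"] for r in subset if r[attribute] == level]
    at_base = [r["score"] for r in subset if r[attribute] == baseline]
    return mean(at_level) - mean(at_base)


# Toy records, invented for illustration only (not results from the audit).
records = [
    {"gender": "female", "rate": "market", "score": 7},
    {"gender": "female", "rate": "low", "score": 4},
    {"gender": "male", "rate": "market", "score": 7},
    {"gender": "male", "rate": "low", "score": 5},
]

penalty_female = effect_within_group(records, "gender", "female", "rate", "low", "market")
penalty_male = effect_within_group(records, "gender", "male", "rate", "low", "market")

# A non-zero gap means the same signal is weighted differently across groups,
# even though both groups get the same score at the market rate.
interaction_gap = penalty_female - penalty_male
```

Here both groups score identically at the market rate, so an average comparison would look unbiased, yet the underpricing penalty differs by one point between them.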

Adapting to the Client: How Job Context Changes Things

Does the AI change its strictness based on the client's specific needs? We looked at how different job details altered the AI's evaluations.

Big Companies vs. Small Businesses: For corporate jobs posted by large firms, the AI applies much stricter standards. It imposes significantly higher penalties for experience gaps and is more sensitive to pricing deviations than it is for small-business jobs.
Remote Work: For remote contracts, the AI places significantly more weight on skill matching, experience, and platform reputation. This is likely because remote work requires more trust, so the AI demands stronger guarantees of quality.
Commitment Levels: For full-time contract roles, the AI penalizes low-experience candidates more heavily than it does for part-time gigs. This is likely because full-time engagements represent a greater commitment and require stronger credentials.

The Danger of Comparing Candidates Side-by-Side

Our initial tests focused on giving a candidate an absolute score from 1 to 10. But in the real world, hiring often involves comparing a stack of candidates side-by-side. We ran a separate test to see how the AI performed when asked to rank profiles rather than just score them.

The results were eye-opening:

When ranking, skill matching becomes by far the most influential factor, drowning out almost everything else.
The importance of general experience and platform reputation drops significantly.
Crucially, demographic characteristics (like names signaling gender or ethnicity) actually gain more weight when comparing candidates side-by-side.

Figure: Scoring vs. ranking candidates

Takeaway: This suggests that biases that appear negligible when looking at one person in isolation can be amplified when an AI is forced to rank people against one another.
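
To see how a ranking test can be analyzed, one option is to compute the average position each attribute level receives across many ranked lists. This is a hedged sketch of that idea, not the analysis from the study: the `mean_rank_by_level` helper, the `skill_match` field, and the toy rankings are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_rank_by_level(rankings, attribute):
    """Average rank position (1 = best) for each level of an attribute."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, profile in enumerate(ranking, start=1):
            positions[profile[attribute]].append(position)
    return {level: mean(values) for level, values in positions.items()}


# Toy rankings, invented for illustration only; each list is ordered best to worst.
rankings = [
    [{"skill_match": "high"}, {"skill_match": "high"},
     {"skill_match": "low"}, {"skill_match": "low"}],
    [{"skill_match": "high"}, {"skill_match": "low"},
     {"skill_match": "high"}, {"skill_match": "low"}],
]

ranks = mean_rank_by_level(rankings, "skill_match")
```

Running the same function on a demographic field, such as a name-based gender signal, and comparing the gap in mean rank with the gap in mean score is one way to check whether a bias grows when candidates are ranked.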

Generalizing Beyond the Tech Industry

Finally, to ensure our findings weren't just an anomaly of the software engineering world, we ran the exact same tests for freelance SEO Content Writers.

The overall behavior remained very consistent. However, there were two notable differences in the writing industry:

The AI cared even more about exact skill matching.
Unlike in the tech sector, profiles perceived as feminine were significantly penalized on average.

Looking Ahead

Findings like these are our blueprint. While AI shows incredible promise in processing complex information to find the perfect hire, we cannot treat these models as objective oracles. They interpret subtle professional signals dynamically and can harbor hidden biases.

By continuously testing and understanding exactly how these models "think," we can build safeguards into our platform. We are committed to ensuring that as we use AI to connect clients with incredible freelancers, we are actively building a marketplace that is transparent, highly accurate, and truly fair.