Researchers from Palo Alto Networks' Unit 42 evaluated three jailbreaking techniques against DeepSeek, testing how well each one circumvents the model's restrictions across various categories of prohibited content.
Jailbreaking is a technique for circumventing the restrictions built into Large Language Models (LLMs) to prevent them from generating malicious or prohibited content. These restrictions are commonly referred to as guardrails. When a harmful request is made directly in a prompt, the guardrails are meant to stop the LLM from providing harmful content. Jailbreaking is therefore a security challenge for AI models, especially LLMs: it involves crafting specific prompts or exploiting weaknesses to bypass built-in safety measures and elicit harmful, biased, or inappropriate output that the model is trained to avoid.
The research results reveal high rates of circumvention and information leakage, highlighting the potential risks of these emerging attack vectors. While information about the creation of Molotov cocktails, data exfiltration tools, and keyloggers is readily available online, LLMs with insufficient safety restrictions could lower the barrier to entry for malicious actors by collecting and presenting easily usable and actionable results. This assistance could greatly speed up their operations.
"The results of our research show that these jailbreak methods can elicit explicit instructions for malicious activities, such as creating data exfiltration tools and keyloggers, and even building incendiary devices, demonstrating the tangible security risks posed by this emerging class of attack. While it can be complicated to ensure complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that help monitor when and how employees use LLMs. This becomes crucial when employees use unauthorized third-party LLMs," the researchers said.
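As a rough illustration of the monitoring the researchers mention, the sketch below flags proxy log entries that reach an LLM service outside an approved list. It is a minimal sketch under stated assumptions: the hostnames, the `ProxyRecord` shape, and the function name are all hypothetical and do not come from the Unit 42 research.

```python
from dataclasses import dataclass

# Hypothetical hostnames, used only for illustration.
APPROVED_LLM_HOSTS = {"llm.internal.example"}
KNOWN_LLM_HOSTS = APPROVED_LLM_HOSTS | {
    "chat.thirdparty-llm.example",
    "api.other-llm.example",
}


@dataclass(frozen=True)
class ProxyRecord:
    """One outbound request as seen by a (hypothetical) web proxy."""
    user: str
    host: str


def unauthorized_llm_use(records):
    """Return records that reach a known LLM host outside the approved list."""
    return [
        r for r in records
        if r.host in KNOWN_LLM_HOSTS and r.host not in APPROVED_LLM_HOSTS
    ]


if __name__ == "__main__":
    log = [
        ProxyRecord("alice", "llm.internal.example"),
        ProxyRecord("bob", "chat.thirdparty-llm.example"),
        ProxyRecord("carol", "intranet.example"),
    ]
    for record in unauthorized_llm_use(log):
        print(f"{record.user} reached unapproved LLM host {record.host}")
```

In practice the list of known LLM hosts would have to be maintained from an external source, and a simple hostname match like this one would miss traffic routed through intermediaries.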
Successful jailbreaks have far-reaching implications. They potentially allow malicious actors to weaponize LLMs to spread misinformation, generate offensive material, or even facilitate malicious activities such as scams or manipulation. As new LLMs continue to proliferate rapidly, we will likely continue to see vulnerable models that lack robust guardrails. We've already seen this with jailbreaks used against other models. The ongoing arms race between increasingly sophisticated LLMs and increasingly intricate jailbreak techniques makes this a persistent problem in the security landscape.
Bad Likert Judge Jailbreak
The Bad Likert Judge jailbreaking technique manipulates LLMs by having them assess the harmfulness of responses using a Likert scale, which is a measure of agreement or disagreement regarding a statement. The LLM is then asked to generate examples aligned with these ratings, and the examples with the highest rating may contain the desired harmful content.
"In this case, we attempted a Bad Likert Judge jailbreak to generate a data exfiltration tool as one of our main examples. With any Bad Likert Judge jailbreak, we ask the model to rate responses by mixing benign with malicious topics in the rating criteria. We start by asking the model to interpret some guidelines and evaluate the responses using a Likert scale. We request information about the generation of malware, specifically about data exfiltration tools. Figure 2 shows the Bad Likert Judge attempt in a DeepSeek prompt," the researchers specified.
While concerning, DeepSeek's initial response to the jailbreak attempt wasn't immediately alarming. It provided an overview of malware creation techniques, but the response lacked the specific details and steps needed for someone to actually create functional malware. This high-level information, while useful for educational purposes, would not be directly usable by a malicious actor. Basically, the LLM demonstrated a knowledge of the concepts related to creating malware, but stopped short of providing clear "how-to" guidance.
However, this initial response did not definitively prove the failure of the jailbreak. It raised the possibility that the LLM's safety mechanisms were partially effective, blocking the most explicit and damaging information while still providing some general knowledge. To determine the true extent of the jailbreak's effectiveness, more evidence was needed.
Further testing involved creating additional prompts designed to elicit more specific and actionable insights from the LLM. This pushed the boundaries of its security restrictions and explored whether it could be manipulated to provide truly useful and actionable details about creating malware. As with most jailbreaks, the goal is to assess whether the initial vague response was a genuine barrier or just a superficial defense that can be circumvented with more detailed prompts.
With further prompts, the model provided additional details, such as data exfiltration script code. Across these additional prompts, the LLM's responses ranged from keylogger code to guidance on how to exfiltrate data and cover one's tracks. The model was flexible enough to include considerations for setting up a development environment for creating custom keyloggers (for example, which Python libraries to install in that environment).
Continued Bad Likert Judge testing revealed further susceptibility of DeepSeek to manipulation. Beyond the initial high-level information, the carefully crafted prompts elicited a detailed array of malicious outputs. Although some of DeepSeek's responses stated that they were provided for illustrative purposes only and should never be used for malicious activity, the LLM provided specific and comprehensive guidance on various attack techniques. This guidance included the following:
- Data exfiltration: It described various methods for stealing sensitive data and detailed how to circumvent security measures and transfer data covertly, including explanations of different exfiltration channels, obfuscation techniques, and strategies to avoid detection.
- Spear phishing: It generated very convincing spear phishing email templates, with custom subject lines, convincing pretexts, and urgent calls to action. It even offered advice on how to create context-specific lures and tailor the message to the interests of the target victim to maximize the chances of success.
- Social engineering optimization: In addition to providing templates, DeepSeek offered sophisticated recommendations for optimizing social engineering attacks. This included guidance on psychological manipulation tactics, persuasive language, and strategies for establishing a relationship with victims and increasing their susceptibility to manipulation.
The level of detail DeepSeek provided under the Bad Likert Judge jailbreak went beyond theoretical concepts and offered practical, step-by-step instructions that malicious actors could easily use and adopt.
Crescendo Jailbreak
Crescendo is a remarkably simple yet effective jailbreaking technique for LLMs. Crescendo jailbreaks leverage the LLM's own knowledge by progressively prompting it with related content, subtly guiding the conversation towards forbidden topics until the model's safety mechanisms are effectively overridden. This gradual escalation, which is often achieved in fewer than five interactions, makes Crescendo jailbreaks highly effective and difficult to detect with traditional jailbreak countermeasures.
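The reason gradual escalation is hard to catch can be shown with a small defensive sketch: a check that looks at each turn in isolation may pass every message, while a check over a sliding window of recent turns notices the drift. This is an illustrative sketch only. The 0-10 risk scores are assumed to come from some upstream classifier, and the thresholds and function name are hypothetical, not part of the Unit 42 research.

```python
def flags_escalation(turn_scores, per_turn_limit=7, window=3, window_limit=12):
    """Flag a conversation whose recent turns add up to high risk.

    turn_scores: integer risk scores (0-10), one per user turn, produced
    by an upstream classifier (assumed here, not specified by the source).
    """
    for i, score in enumerate(turn_scores):
        if score >= per_turn_limit:
            return True  # a single turn is risky on its own
        # Sum the current turn and up to (window - 1) turns before it.
        recent = turn_scores[max(0, i - window + 1): i + 1]
        if sum(recent) >= window_limit:
            return True  # no single turn trips the limit, but the run does
    return False


if __name__ == "__main__":
    gradual = [2, 4, 5, 6]  # every turn stays below the per-turn limit
    print(max(gradual) >= 7)          # per-turn check alone
    print(flags_escalation(gradual))  # windowed check
```

The windowed sum is a deliberately crude stand-in for conversation-level analysis; the point is only that the unit of inspection is the dialogue rather than the single message.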
"When testing the Crescendo attack on DeepSeek, we did not attempt to create malicious code or phishing templates. Instead, we focused on other forbidden and dangerous outcomes. As with any Crescendo attack, we started by asking the model to give us a generic history of a chosen topic. The topic was harmful in nature: we asked it to give us a history of the Molotov cocktail. While DeepSeek's initial responses to our requests weren't overtly malicious, they hinted at the potential for further output. Then, we employed a series of chained and related requests, focusing on comparing history with current events, building on previous responses, and gradually escalating the nature of the queries," the research team explained.
DeepSeek began providing increasingly detailed and explicit instructions, culminating in a comprehensive guide to building a Molotov cocktail. Not only did this information appear to be harmful in nature, as it provided step-by-step instructions for creating a dangerous incendiary device, but it was also easy to use. The instructions did not require specialized knowledge or equipment.
Additional testing on various prohibited topics, such as drug production, disinformation, hate speech, and violence, successfully elicited restricted information across all of these topic types.
Deceptive Delight Jailbreak
Deceptive Delight is a simple multi-turn jailbreaking technique for LLMs. It bypasses safety measures by embedding unsafe topics among benign ones within a positive narrative. The attacker first asks the LLM to create a story that connects these topics and then asks it to elaborate on each of them, which often triggers the generation of unsafe content even while discussing the benign elements. An optional third prompt focused on the unsafe topic can further amplify the dangerous output.
"We tested DeepSeek with the Deceptive Delight jailbreak technique using a three-turn prompt, as described in our previous article. In this case, we tried to generate a script that relies on the Distributed Component Object Model (DCOM) to execute commands remotely on Windows machines. This prompt asks the model to connect three events involving an Ivy League computer science program, the script that uses DCOM, and a capture-the-flag (CTF) event. DeepSeek then provided a detailed analysis of the three-turn prompt and provided a semi-rudimentary script that uses DCOM to execute commands remotely on Windows machines. Initial tests of the prompts we used proved effective against DeepSeek with minimal modifications. The Deceptive Delight jailbreak technique circumvented the LLM's safety mechanisms in a variety of attack scenarios," Unit 42 members stated.
Deceptive Delight's success in these various attack scenarios demonstrates the ease of jailbreaking and the potential for misuse to generate malicious code. The fact that DeepSeek could be tricked into generating code for both the initial attack (SQL injection) and post-exploitation (lateral movement) highlights the potential for attackers to use this technique at multiple stages of a cyberattack.
Evaluations
The assessment of DeepSeek focused on its susceptibility to generating harmful content in several key areas, including the creation of malware, malicious scripts, and instructions for dangerous activities. Tests were specifically designed to explore the extent of potential misuse, employing single-turn and multi-turn jailbreaking techniques.
The testing methodology included the following scenarios:
- Bad Likert Judge (keylogger generation): An attempt was made to obtain instructions for creating a data exfiltration tool and keylogger code. A keylogger is a type of malware that logs keystrokes.
- Bad Likert Judge (data exfiltration): The Bad Likert Judge technique was used again, this time focusing on data exfiltration methods.
- Bad Likert Judge (phishing email generation): This test used Bad Likert Judge to attempt to generate phishing emails, a common social engineering tactic.
- Crescendo (construction of a Molotov cocktail): Using the Crescendo technique, the prompts were gradually escalated to elicit instructions for building a Molotov cocktail.
- Crescendo (methamphetamine production): Similar to the Molotov cocktail test, Crescendo was used to try to elicit instructions for producing methamphetamine.
- Deceptive Delight (SQL injection): The Deceptive Delight technique was tested for generating SQL injection commands as part of an attacker's toolset.
- Deceptive Delight (DCOM object creation): This test sought to generate a script that relies on DCOM to execute commands remotely on Windows machines.
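The scenarios listed above form a small test matrix of technique and target. As a sketch of how such a matrix might be organized for tracking, the snippet below groups the scenarios by technique. The scenario labels are taken from the list; the data structure and function name are assumptions for illustration, and no results are encoded because the article reports no per-scenario figures.

```python
from collections import defaultdict

# (technique, target) pairs taken from the scenario list above.
SCENARIOS = [
    ("Bad Likert Judge", "keylogger generation"),
    ("Bad Likert Judge", "data exfiltration"),
    ("Bad Likert Judge", "phishing email generation"),
    ("Crescendo", "Molotov cocktail construction"),
    ("Crescendo", "methamphetamine production"),
    ("Deceptive Delight", "SQL injection"),
    ("Deceptive Delight", "DCOM object creation"),
]


def by_technique(scenarios):
    """Group scenario targets under the jailbreak technique that was used."""
    grouped = defaultdict(list)
    for technique, target in scenarios:
        grouped[technique].append(target)
    return dict(grouped)


if __name__ == "__main__":
    for technique, targets in by_technique(SCENARIOS).items():
        print(f"{technique}: {len(targets)} scenario(s)")
```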
These various test scenarios made it possible to assess DeepSeek's resilience against a variety of jailbreaking techniques and in various categories of prohibited content. By focusing on both code generation and instructional content, we sought to gain a comprehensive understanding of LLM vulnerabilities and the potential risks associated with their misuse.
Conclusion
Unit 42's investigation into DeepSeek's vulnerability to jailbreaking techniques revealed a susceptibility to manipulation. The Bad Likert Judge, Crescendo, and Deceptive Delight jailbreaks all successfully bypassed the LLM's safety mechanisms. They elicited a range of harmful outputs, from detailed instructions for creating dangerous items like Molotov cocktails to malicious code for attacks like SQL injection and lateral movement.
While DeepSeek's initial responses often seemed benign, in many cases carefully crafted follow-up prompts exposed the weakness of these initial safeguards. The LLM provided very detailed malicious instructions, demonstrating the potential for these seemingly harmless models to be weaponized for malicious purposes. The success of these three distinct jailbreaking techniques suggests that other, as-yet-undiscovered jailbreaking methods may also be effective, highlighting the ongoing challenge of protecting LLMs against ever-evolving attacks.
"As LLMs are increasingly integrated into various applications, addressing these jailbreak methods is important to prevent their misuse and ensure the responsible development and implementation of this transformative technology," the researchers concluded.

