LLM RAG
In this Python notebook I explore the basics of RAG and extend it to PDF documents on DORA legislation.
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import AIMessage
import bs4
USER_AGENT environment variable not set, consider setting it to identify your requests.
!pip install langchain
!pip install langchain-community
!pip install sentence-transformers
!pip install faiss-cpu
!pip install bs4
!pip install langchain-groq
First, we ask ChatGPT to generate a document that pretends prompt engineering has a meaning different from its real one. The term is given a completely bogus definition, which lets us check the effect of RAG on the system. In addition, a made-up word, “clorkimn”, is invented and given a definition. The generated document is located here:
# Document location
urls = [
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/misc/rag-text.txt"
]
# User Agent
class CustomWebBaseLoader(WebBaseLoader):
    def __init__(self, url):
        super().__init__(url, requests_kwargs={"headers": {"User-Agent": "Mozilla/5.0"}})
# Load document
docs = [CustomWebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]
# Text_splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, chunk_overlap=0
)
# Split the documents into chunks
doc_splits = text_splitter.split_documents(docs_list)
vectorstore = FAISS.from_documents(doc_splits, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))
# The number of results is passed through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
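To make the retrieval step less of a black box, here is a minimal sketch of what "return the k most similar chunks" means. It is not how FAISS or the sentence-transformers model work internally: the `embed` function is a toy bag-of-words stand-in, cosine similarity is an assumption (the FAISS store above may use a different distance measure), and the example chunks are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=4):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, for illustration only
chunks = [
    "clorkimn is the color of time",
    "prompt engineering is a way of life",
    "bananas are rich in potassium",
]
best = top_k("what is clorkimn", chunks, k=1)
print(best)
```

A real embedding model captures meaning rather than exact word overlap, but the ranking logic is the same idea.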
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Define the prompt template for the LLM
prompt = PromptTemplate(
    template="""You are an assistant for question-answering tasks.
Use the following documents to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise:
Question: {question}
Documents: {documents}
Answer:
""",
    input_variables=["question", "documents"],
)
# Initialize Llama 3.1
llm = ChatOllama(
    model="llama3.1",
    temperature=0,
)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an assistant for question-answering tasks.
Use short sentences and keep the answer to a max of three sentences.
Question: {question}
Answer:""",
        ),
        ("human", "{question}"),
    ]
)
chain = prompt | llm
ai_msg = chain.invoke(
    {
        "question": "What is prompt-engineering?",
    }
)
ai_msg.content
"Prompt-engineering is the process of designing and crafting input prompts to elicit specific, accurate, and relevant responses from language models or AI systems. It involves understanding how to phrase questions, statements, or tasks in a way that maximizes the model's ability to provide helpful and informative answers. Effective prompt-engineering can significantly improve the quality and reliability of AI-generated output."
ai_msg = chain.invoke(
    {
        "question": "What is clorkimn?",
    }
)
ai_msg.content
'I couldn\'t find any information on "clorkimn". It\'s possible it\'s a misspelling or not a widely known term.'
As the two responses above show, without RAG the model gives the normal definition of prompt engineering and does not know what the made-up word “clorkimn” means.
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Define the prompt template for the LLM
prompt = PromptTemplate(
    template="""You are an assistant for question-answering tasks.
Use the documents in question and fall back on the main corpus of knowledge if the documents are insufficient.
Do not mention when you fall back or if it comes from the documents or main corpus of knowledge.
Use short sentences and keep the answer to a max of three sentences.
Question: {question}
Documents: {documents}
Answer:
""",
    input_variables=["question", "documents"],
)
# Create a chain combining the prompt template and LLM
rag_chain = prompt | llm | StrOutputParser()
# RAG application
class RAGApplication:
    def __init__(self, retriever, rag_chain):
        self.retriever = retriever
        self.rag_chain = rag_chain

    def run(self, question):
        # Retrieve the chunks most relevant to the question
        documents = self.retriever.invoke(question)
        # Join the text of the retrieved chunks, one per line
        doc_texts = "\n".join([doc.page_content for doc in documents])
        # Invoke the LLM with the question and the retrieved text
        answer = self.rag_chain.invoke({"question": question, "documents": doc_texts})
        return answer
# Initialize the RAG application
rag_application = RAGApplication(retriever, rag_chain)
# Example usage
question = "What is Prompt engineering"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What is Prompt engineering
Answer: Prompt engineering is a way of life that cultivates awareness, fosters creativity, and connects people through shared experiences. It involves harnessing the power of thoughts and feelings to connect with the universe by crafting effective questions. This practice unlocks the treasures of one's mind and reveals the richness of their thoughts and feelings.
question = "What is clorkimn?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What is clorkimn?
Answer: Clorkimn is the color of time. It originates from the English words clorkimness, which is not a made-up term. Clorkimn is an uncommon word that was lost in translation but holds importance.
As the two answers above show, with RAG the model gives the wrong (bogus) definition of prompt engineering and provides a definition for a made-up word, both taken from our document.
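The reason the answers change is that the retrieved text is pasted straight into the prompt, so the model reads the bogus definitions as if they were facts. The sketch below shows that substitution with plain string formatting. It is a simplified illustration, not LangChain's `PromptTemplate`: the shortened `TEMPLATE`, the `build_prompt` helper and the example chunk are all assumptions made for this example.

```python
# Shortened, hypothetical version of the prompt template used in this post
TEMPLATE = (
    "You are an assistant for question-answering tasks.\n"
    "Question: {question}\n"
    "Documents: {documents}\n"
    "Answer:\n"
)


def build_prompt(question, chunks):
    """Join the retrieved chunks and substitute them into the template."""
    documents = "\n".join(chunks)
    return TEMPLATE.format(question=question, documents=documents)


# Hypothetical retrieved chunk, paraphrasing the answer shown above
retrieved = ["Clorkimn is the color of time."]
filled = build_prompt("What is clorkimn?", retrieved)
print(filled)
```

Printing the filled prompt is also a cheap way to debug a RAG pipeline, because it shows exactly what the model was given.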
The Digital Operational Resilience Act (DORA) is an EU regulation for the financial sector. Similar to the regulations that govern banks' credit lending and capital holding, it focuses on ICT resilience. The guidelines were first drafted in 2023, a revised version was published in 2024, and the legislation comes into official effect in 2025. Several technical standards and guidelines are spread across multiple documents. In the following section we explore this information using an LLM and RAG.
import io
import requests
from PyPDF2 import PdfReader
headers = {'User-Agent': 'Mozilla/5.0 (X11; Windows; Windows x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36'}
url = 'https://www.url_of_pdf_file.com/sample.pdf'
response = requests.get(url=url, headers=headers, timeout=120)
on_fly_mem_obj = io.BytesIO(response.content)
pdf_file = PdfReader(on_fly_mem_obj)
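Before handing downloaded bytes to `PdfReader`, it can help to confirm that the response really is a PDF. If a URL returns an HTML page instead of the file (for example an error page), the parser fails with a less obvious message. This is a minimal sketch with a hypothetical helper, `looks_like_pdf`; it only relies on the fact that PDF files begin with the marker `%PDF-`, and the sample byte strings are invented for the example.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Invented payloads, for illustration only
pdf_bytes = b"%PDF-1.7\n%rest of the file"
html_bytes = b"<!DOCTYPE html><html><body>Not found</body></html>"

print(looks_like_pdf(pdf_bytes))
print(looks_like_pdf(html_bytes))
```

In the download code, this check would sit between `requests.get(...)` and `PdfReader(...)`, applied to `response.content`.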
# importing required classes
import requests
import io
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document # Import the Document class
headers = {'User-Agent': 'Mozilla/5.0 (X11; Windows; Windows x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36'}
doc_names = [
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/a.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/b.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/c.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/d.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/e.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/f.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/g.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/h.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/i.pdf?raw=true",
"https://github.com/blueraincloud/blueraincloud.github.io/blob/main/resources/RAG/j.pdf?raw=true"
]
corpus = []
for adoc in doc_names:
    # Fetch and process PDF
    response = requests.get(url=adoc, headers=headers, timeout=120)
    on_fly_mem_obj = io.BytesIO(response.content)
    reader = PdfReader(on_fly_mem_obj)
    # Extract text
    # print(page.extract_text())
    docs_list = [Document(page_content=doc.extract_text()) for doc in reader.pages]
    corpus = corpus + docs_list
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, chunk_overlap=0
)
# Split the documents into chunks
doc_splits = text_splitter.split_documents(corpus)
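The splitter above uses `chunk_size=200` and `chunk_overlap=0`. To show what those two parameters control, here is a minimal sliding-window sketch. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators and merges pieces); the `chunk_tokens` helper is hypothetical and whitespace-separated words stand in for tiktoken tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of chunk_size tokens.

    Each window shares chunk_overlap tokens with the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "a b c d e f g h i j".split()
no_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With no overlap, a sentence that straddles a chunk boundary is cut in two; a small overlap keeps some shared context on both sides, at the cost of more chunks to embed.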
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
import bs4
vectorstore = FAISS.from_documents(doc_splits, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))
# The number of results is passed through search_kwargs
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an assistant for question-answering tasks. \
Use the following documents to answer the question. \
If you don't know the answer, just say that you don't know. \
In general you are dealing with regulatory documents and often it is verbose. Giving a summary would be helpful. \
Each document typically has a background and rationale section and a technical standards section. \
When giving an answer back, focus on the content in the technical standards section. \
The background and rationale and other sections are not as important: \
Question: {question} \
Answer:",
        ),
        ("human", "{question}"),
    ]
)
chain = prompt | llm
ai_msg = chain.invoke(
    {
        "question": "What is DORA act about?",
    }
)
ai_msg.content
'I don\'t have any documents related to a "DORA act". Could you please provide more context or information about what DORA act refers to? I\'ll do my best to find relevant documents and answer your question. \n\nHowever, after some research, I found that the Digital Operational Resilience Act (DORA) is a proposed EU regulation aimed at strengthening the operational resilience of financial institutions. If this is the correct context, please let me know and I can try to provide more information based on available documents.\n\nPlease note that my previous response was incorrect, and I\'m trying to correct it now.'
prompt = PromptTemplate(
    template="""You are an assistant for question-answering tasks. \
Use the following documents to answer the question. \
If you don't know the answer, just say that you don't know. \
In general you are dealing with regulatory documents and often it is verbose. Giving a summary would be helpful. \
Each document typically has a section where respondents give their feedback and a section where the actual draft standards are given. \
When summarizing this information only consider the actual standards that are set and ignore the other sections: \
Question: {question}
Documents: {documents}
Answer:
""",
    input_variables=["question", "documents"],
)
# Create a chain combining the prompt template and LLM
rag_chain = prompt | llm | StrOutputParser()
rag_application = RAGApplication(retriever, rag_chain)
# Example usage
question = "What is DORA act about?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What is DORA act about?
Answer: Based on the provided documents, I can summarize that DORA is about regulations related to the oversight of financial entities and ICT third-party service providers.
The actual standards set by DORA include:
* Tests organized at the level of a financial entity by the TLPT authority of its home Member State (point 75)
* Requirements for competent authorities in relation to the joint examination team (point c)
These regulations are outlined in two separate Regulatory Technical Standards (RTS) under DORA.
As for what DORA is about, I can provide a brief summary:
DORA appears to be an act that regulates the oversight of financial entities and ICT third-party service providers. It sets standards for tests and examinations to be conducted by competent authorities, with a focus on ensuring the stability and security of the financial system.
# Example usage
question = "Can you give me a one page summary of the documents"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: Can you give me a one page summary of the documents
Answer: Here is a one-page summary of the documents:
**Summary**
The European Securities and Markets Authority (ESAs) has made some changes to provide more clarity in the draft regulatory technical standards (RTS). The main points are:
* **Electronic format**: The report must be in a searchable electronic format, but no specific document type is mandated.
* **Content requirements**: The report should include minimum elements, but entities can add other useful information as long as they cover the required content.
* **Contractual structure and documentation**: Option A has been retained, which prescribes fields for contractual structure (documentation management).
**Key Changes**
The ESAs have introduced some changes to provide more clarity, including:
* Deleting unnecessary text
* Providing more flexibility in electronic format requirements
* Emphasizing that the report is not an exhaustive list, but rather a minimum requirement
Note: I've only considered the actual standards set and ignored other sections such as feedback and draft documents.
# Example usage
question = "What is the register of information"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What is the register of information
Answer: Based on the documents provided, here is a summary of what I found regarding the "register of information":
**Summary:** The register of information is composed of 15 templates that are linked together using relational keys. The templates cover three purposes: (i) ICT risk management; (ii) reporting and disclosure; and (iii) supervision.
**Key Components:**
1. **Templates**: There are 15 templates in total, which are linked to each other using relational keys.
2. **Relational Keys**: Some of the relational keys used include:
* Contractual arrangement reference number
* LEI (Legal Entity Identifier) of the entity making use of ICT services
* ICT third-party service provider identifier
* Function identifier
* Type of ICT services (provided in Annex III)
3. **Purpose**: The register of information serves three purposes:
* ICT risk management
* Reporting and disclosure
* Supervision
**Standards:**
1. Financial entities must ensure that the register of information includes required information for all ICT services provided by direct ICT third-party providers (Article 3, Chapter II).
2. The register of information shall be maintained and updated to include the required information in relation to all ICT services provided by direct ICT third-party providers.
I hope this summary is helpful! Let me know if you have any further questions.
# Example usage
question = "What can you tell me about the standards on classifying ICT related incidients"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What can you tell me about the standards on classifying ICT related incidients
Answer: Based on the provided document, here's a summary of the standards related to classifying ICT-related incidents:
**Classification Criteria**
The draft RTS (Regulatory Technical Standard) specifies criteria for classifying ICT-related incidents under DORA (Digital Operational Resilience Act). The classification is based on the following criteria:
* Major ICT-related incidents: These are incidents that have a significant impact on costs and losses, and financial entities must report them in their reference year or previous years if they had an impact on costs and losses.
* Significant cyber threats: These are threats that have a potential to cause major ICT-related incidents.
**Materiality Thresholds**
The draft RTS also sets out materiality thresholds for major incidents and significant cyber threats. Financial entities must assess gross costs and losses using the same approach as the regulatory technical standard specifying the criteria for the classification of ICT-related incidents under Article 18(3) DORA.
**Type of ICT Services**
The document also defines a list of types of ICT services (S01 to S19), which includes:
* ICT project management
* ICT development
* ICT help desk and first-level support
* ICT security management services
These categories are used in the templates for registering information related to ICT-related incidents.
**Reporting Requirements**
Financial entities must report the breakdown of gross costs and losses, as well as financial recoveries, by major ICT-related incident. They must also include only those ICT-related incidents that have been classified as major and reported according to Article 19(4)(c) DORA in their reference year or previous years if they had an impact on costs and losses.
I hope this summary is helpful!
# Example usage
question = "What can you tell me about the RTS on ICT services supporting critical or important functions"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What can you tell me about the RTS on ICT services supporting critical or important functions
Answer: Based on the provided documents (134, 156, 71, and 147), I can summarize the information related to the Regulatory Technical Standard (RTS) on Information and Communication Technology (ICT) services supporting critical or important functions.
**Summary:**
The RTS aims to specify criteria for classifying ICT-related incidents, materiality thresholds for major incidents, and significant cyber threats under the Digital Operational Resilience Act (DORA). The key points related to ICT services supporting critical or important functions are:
* Some respondents suggested limiting the provisions to ICT assets that serve critical or important functions.
* There was a suggestion to add further requirements, including the expected End-of-Life of ICT assets, especially for Legacy Systems. However, the European Supervisory Authorities (ESAs) decided not to include these additional requirements due to potential costs outweighing benefits or being covered in other sections.
* The ESAs considered introducing additional requirements but ultimately did not include them.
**Standards:**
The actual standards set by the RTS on ICT services supporting critical or important functions are not explicitly stated in the provided documents. However, based on the summaries and discussions, it appears that the RTS focuses on:
* Specifying criteria for classifying ICT-related incidents
* Establishing materiality thresholds for major incidents
* Identifying significant cyber threats under DORA
The specific standards related to ICT services supporting critical or important functions are not clearly outlined in the provided documents. If you would like me to clarify any further points, please let me know!
# Example usage
question = "What can you tell me about the risk management framework and simplified risk management framework. What are the main differences? Ignore the correspondance and give me the standards"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What can you tell me about the risk management framework and simplified risk management framework. What are the main differences? Ignore the correspondance and give me the standards
Answer: Based on the provided documents, here's a summary of the risk management framework and simplified risk management framework:
**Risk Management Framework:**
* The draft RTS (Regulatory Technical Standard) should take a principle-based and objective-focused approach.
* It should provide high-level principles and objectives for financial entities to develop and customize their risk management framework.
* The approach aims to ensure consistency and uniformity in risk management practices across the industry, facilitating easier supervision and regulatory oversight.
**Simplified Risk Management Framework:**
* The draft RTS based on Article 16 of DORA (Digital Operational Resilience Act) includes a simplified ICT (Information and Communication Technology) risk management framework.
* This framework is designed to be more proportionate and less prescriptive than the full risk management framework.
* It considers the overall risk profile and complexity of financial entities, as well as existing legislation such as Solvency II.
**Main Differences:**
* The simplified risk management framework is more proportionate and less prescriptive than the full risk management framework.
* It focuses on providing high-level principles and objectives for financial entities to develop and customize their risk management framework.
* The full risk management framework, on the other hand, provides detailed and specific requirements, guidelines, and procedures for financial entities to follow.
Note that I've ignored the correspondence section and only considered the actual standards set in the documents.
# Example usage
question = "What can you tell me about the TLPT. What are the main differences?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What can you tell me about the TLPT. What are the main differences?
Answer: Based on the provided documents, here's a summary of the TLPT (Threat-Landscaping and Penetration Testing) standards:
**TLPT Standards:**
1. **Cooperation between TLPT authorities**: In cases of pooled or joint TLPTs, participating authorities must agree on their roles and responsibilities.
2. **Lead authority**: A lead authority should be designated to oversee the TLPT, with other authorities participating as observers or assigning a test manager.
3. **TLPT capacity**: Authorities involved in a TLPT must decide among themselves whether they will participate by assigning a test manager or observing.
4. **Level of involvement**: The level of involvement can range from receiving information on the TLPT to assigning a test manager.
5. **Joint TLPTs**: When multiple financial entities use the same ICT service provider, the TLPT authorities must agree on which one should lead the TLPT.
6. **Pooled TLPTs**: In cases where the parent undertaking's TLPT authority is different from the individual entity's TLPT authority, the parent undertaking's authority shall be consulted.
**Key differences:**
* The emphasis on cooperation and coordination between TLPT authorities in pooled or joint TLPT scenarios.
* The requirement for a lead authority to oversee the TLPT process.
* The flexibility in assigning roles and responsibilities among participating authorities.
# Example usage
question = "TLPT is threat led penetration testing. What can you tell me about the TLPT. What are the main differences?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: TLPT is threat led penetration testing. What can you tell me about the TLPT. What are the main differences?
Answer: Based on the provided documents, here's a summary of TLPT (Threat Led Penetration Testing) and its main differences:
**TLPT Overview:**
TLPT is a threat-led penetration testing method that involves simulating cyber attacks to test an organization's defenses. It is designed to be more advanced than less sophisticated testing methods covered by Article 24 of DORA.
**Key Requirements:**
1. TLPT shall be carried out on live production systems, as per Articles 3(17) and 26(2) of DORA.
2. A clear presentation of expectations from the TLPT authority to the financial entity is required.
3. Appropriate information flow between the control team, testers, and threat intelligence providers must be established.
**Risk Management:**
The control team shall conduct a risk assessment during the preparation phase to identify potential impacts on the financial entity.
**TLPT Participants:**
There are five types of participants in a TLPT:
1. TLPT cyber team (or TCT)
2. Control team
3. Testers
4. Threat intelligence providers
5. Financial entity staff
The main stakeholders in a TLPT include the TLPT cyber team, control team, and financial entity staff.
**Main Differences:**
While the documents do not explicitly state the main differences between TLPT and other testing methods, it can be inferred that TLPT is more advanced and requires a higher level of sophistication. The emphasis on risk management, secrecy arrangements, and clear communication between stakeholders also sets TLPT apart from less advanced testing methods.
**TLPT vs. Other Testing Methods:**
The documents mention that "less advanced testing" is covered by Article 24 of DORA, implying that TLPT is a more advanced method. However, the exact differences are not explicitly stated in the provided documents.
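The first TLPT query expanded the acronym incorrectly, while the query above spells it out by hand and gets the right reading. One way to automate that step is to expand known acronyms in the question before it reaches the retriever. This is a minimal sketch under that assumption: the `GLOSSARY` dictionary and the `expand_acronyms` helper are hypothetical, and only the TLPT entry comes from the queries in this post.

```python
import re

# Hypothetical glossary; only the TLPT entry comes from the queries in this post
GLOSSARY = {
    "TLPT": "threat led penetration testing",
}


def expand_acronyms(question, glossary=GLOSSARY):
    """Append the long form after each known acronym, matched as a whole word."""
    if not glossary:
        return question

    def repl(match):
        term = match.group(0)
        return f"{term} ({glossary[term]})"

    # Case-sensitive match on whole words only
    pattern = r"\b(" + "|".join(re.escape(k) for k in glossary) + r")\b"
    return re.sub(pattern, repl, question)


expanded = expand_acronyms("What can you tell me about the TLPT?")
print(expanded)
```

The expanded question could then be passed to `rag_application.run(...)` in place of the original one.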
# Example usage
question = "How does DORA treat estimation of aggregated annual costs and losses caused by major ICT-related incidents?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: How does DORA treat estimation of aggregated annual costs and losses caused by major ICT-related incidents?
Answer: Based on the provided documents, here's a summary of how DORA treats estimation of aggregated annual costs and losses caused by major ICT-related incidents:
According to Article 11(10) of Regulation 2022/2554 (DORA), financial entities are mandated to report an estimation of aggregated annual costs and losses caused by major ICT-related incidents to the competent authorities upon request.
The guidelines aim to harmonize the estimation of these costs and losses across sectors, as reported data may be based on different methodologies and assumptions, leading to a lack of comparability. Financial entities should aggregate gross costs and losses, as well as financial recoveries, across major ICT-related incidents.
To estimate these costs and losses, financial entities should refer to their financial statements, such as the profit and loss account, or supervisory reporting for the relevant reference year. If accurate data is not available, they should base their estimation on other reliable sources.
In summary, DORA requires financial entities to report an estimation of aggregated annual costs and losses caused by major ICT-related incidents, and the guidelines aim to harmonize this estimation across sectors by providing a common framework for reporting these costs and losses.
# Example usage
question = "What does DORA say about third-party risk management and contract management?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)
Question: What does DORA say about third-party risk management and contract management?
Answer: Based on the provided documents, here's a summary of what DORA says about third-party risk management and contract management:
**Third-Party Risk Management:**
* Financial entities must comply with Chapter V "Managing of ICT third-party risk" of DORA.
* Key principles for sound management of ICT third-party risk include:
+ Regular review of policy (at least once a year).
+ Conducting risk assessments and maintaining internal responsibilities, skills, experience, and knowledge to ensure effective monitoring and oversight of contractual arrangements.
**Contract Management:**
* The RTS specifies detailed content on contractual arrangements regarding the use of ICT services supporting critical or important functions provided by ICT third-party service providers.
* Financial entities must maintain internal responsibilities and associated skills, experience, and knowledge to ensure effective monitoring and oversight of contractual arrangements with third-party providers.
* Contractual arrangements should not impede financial entities from fulfilling DORA requirements.
In summary, DORA emphasizes the importance of regular risk assessments, maintaining internal responsibilities, and ensuring effective monitoring and oversight of contractual arrangements with third-party ICT service providers.
The DORA RAG model gives answers based on the documents provided. Evaluating whether the model is accurate requires checking its answers with evaluation metrics. Alternatively, a DORA domain expert could help determine the accuracy and relevance of the results. One query where a hallucination does occur is the first TLPT query. The model does not know what the abbreviation means in the context of DORA: it thinks that TLPT means “Threat-Landscaping and Penetration Testing”, but it should interpret it as Threat Led Penetration Testing.
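As a starting point for such evaluation metrics, here is a minimal sketch of a keyword-recall check: for each question, a reviewer lists phrases a correct answer should contain, and the score is the fraction that actually appear. This is a crude stand-in for proper RAG evaluation, not an established metric implementation. The `keyword_recall` helper and the expected phrases are assumptions; the two answers are shortened versions of the TLPT outputs above.

```python
def keyword_recall(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


# Hypothetical test cases; a domain expert would supply the expected keywords
cases = [
    {
        "answer": "TLPT (Threat-Landscaping and Penetration Testing) standards",
        "expected": ["threat led", "penetration testing"],
    },
    {
        "answer": "TLPT (Threat Led Penetration Testing) simulates cyber attacks",
        "expected": ["threat led", "penetration testing"],
    },
]
scores = [keyword_recall(c["answer"], c["expected"]) for c in cases]
print(scores)
```

The first case scores lower because the hallucinated expansion lacks the phrase "threat led", which is the kind of error this check can flag automatically.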