CodeQL is a declarative static analyzer owned by GitHub, whose purpose is to discover security vulnerabilities. Declarative means that, to use CodeQL, you write rules describing the vulnerabilities you want to catch, and you let an engine check your rules against your code. If there is a match, an alert is raised. Static means that it checks your source code, as opposed to checking specific runs. Owned by GitHub means that CodeQL’s engine is not open-source: it’s free to use only on research and open-source code. If you want to use CodeQL on proprietary code, you need a GitHub Advanced Security license. However, the CodeQL rules, which model specific programming languages and libraries, are open-source.
CodeQL is designed to do two things:
- Perform all kinds of quality and compliance checks. CodeQL’s query language is expressive enough to describe a variety of patterns (e.g., “find any loop, enclosed in a function named foo, when the loop’s body contains a call to function bar”). As such, it enables complex, semantic queries over codebases, which can uncover a wide range of issues and patterns.
- Track the flow of tainted data. Tainted data is data provided by a potentially malicious user. If tainted data is sent to critical operations (database requests, custom processes) without being sanitized, it can have catastrophic consequences, such as data loss, a data breach, arbitrary code execution, etc. Statements of your source code from where tainted data originates are called sources, while statements of your source code where tainted data is consumed are called sinks.
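To make the vocabulary of sources, sinks and sanitization concrete, here is a toy sketch in Python. It is not how CodeQL works internally: the Step class and the four operation names are made up for this illustration, and the "program" is a plain list of assignments.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One statement of a straight-line toy program: target = op(argument)."""
    target: str
    op: str  # hypothetical operations: "source", "copy", "sanitize", "sink"
    argument: str = ""


def find_tainted_sinks(program: list[Step]) -> list[Step]:
    """Return the sink steps that consume tainted data."""
    tainted: set[str] = set()
    alerts: list[Step] = []
    for step in program:
        if step.op == "source":
            tainted.add(step.target)  # user-provided data enters here
        elif step.op == "copy":
            if step.argument in tainted:
                tainted.add(step.target)  # taint propagates through the copy
            else:
                tainted.discard(step.target)
        elif step.op == "sanitize":
            tainted.discard(step.target)  # the sanitized result is trusted
        elif step.op == "sink" and step.argument in tainted:
            alerts.append(step)  # tainted data reaches a critical operation
    return alerts


vulnerable = [
    Step("received", "source"),
    Step("command", "copy", "received"),
    Step("", "sink", "command"),
]
sanitized = [
    Step("received", "source"),
    Step("command", "sanitize", "received"),
    Step("", "sink", "command"),
]

print(len(find_tainted_sinks(vulnerable)))  # prints 1
print(len(find_tainted_sinks(sanitized)))   # prints 0
```

A real analyzer has to deal with branches, function calls and library code, which is why the rest of this tutorial relies on CodeQL’s engine rather than on a hand-rolled loop like this one.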
This tutorial is targeted at software and security engineers who want to try out CodeQL, focusing on the second use case above. I explain how to set up CodeQL, how to write your first taint tracking query, and give a methodology for doing so.
Writing the vulnerable code
First, I need to write some code to execute my query against. As the attack surface, I’m choosing calls to the sarge Python library, for three reasons:
- It is available on PyPI, so it is easy to install.
- It is niche enough that it is not already modeled in CodeQL’s Python standard library, so out-of-the-box queries from CodeQL won’t catch vulnerabilities that use sarge. We need to write our own rules.
- It performs calls to subprocess.Popen, which is a data sink. As a consequence, code calling sarge is prone to having command injection vulnerabilities.
For my data source, I use flask. That’s because HTTP requests contain user-provided data, and as such, they are modeled as data sources in CodeQL’s standard library.
With both sarge and flask in place, we can write the following vulnerable code:
from flask import Flask, request
import sarge

app = Flask(__name__)

@app.route("/", methods=["POST"])
def user_to_sarge_run():
    """This function shows a vulnerability: it forwards user input (through a POST request) to sarge.run."""
    print("/ handler")
    if request.method != "POST":
        return "Method not allowed"
    default_value = "default"
    received: str = request.form.get("key", "default")
    print(f"Received: {received}")
    sarge.run(received)  # Unsafe, don't do that!
    return "Called sarge"
To run the application locally, execute in one terminal:
> flask --debug run
In another terminal, trigger the vulnerability as follows:
> curl -X POST http://localhost:5000/ -d "key=ls"
Now observe that in the terminal running the app, the ls command (provided by the user! 💣) was executed:
/ handler
Received: ls
app.py __pycache__ README.md requirements.txt
Wow, pretty scary, right? What if I had passed the string rm -Rf ~/*? Now let’s see how to catch this vulnerability with CodeQL.
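Before moving on, here is a minimal sketch of what sanitizing the received string could look like, assuming the service only ever needs to run a couple of known programs. The allowlist and the function name are hypothetical and not part of the original application; the sketch only uses Python’s standard shlex module.

```python
import shlex

# Hypothetical allowlist: the only programs this toy service agrees to run.
ALLOWED_PROGRAMS = {"ls", "date"}


def parse_allowed_command(received: str) -> list:
    """Split user input into arguments, rejecting programs not on the allowlist."""
    args = shlex.split(received)  # raises ValueError on unbalanced quotes
    if not args or args[0] not in ALLOWED_PROGRAMS:
        raise ValueError(f"Command not allowed: {received!r}")
    return args


print(parse_allowed_command("ls -l"))
try:
    parse_allowed_command("rm -Rf ~/*")
except ValueError as error:
    print(error)
```

An allowlist on the program name alone is far from a complete defense, since arguments can be dangerous too. The resulting argument list is meant for an API that does not go through a shell, such as subprocess.run(args).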
Running CodeQL on the CLI
To run CodeQL on the CLI, I need to download the CodeQL binaries from the github/codeql-cli-binaries repository.
At the time of writing, there are CodeQL binaries for the three major platforms. Where I clone this repository doesn’t matter, as long as the codeql binary ends up in PATH.
Then, because I am going to write my own queries (as opposed to solely using the queries shipped with CodeQL),
I need to clone CodeQL’s standard library: github/codeql.
I recommend putting this repository in a folder that is a sibling of the repository being analyzed.
In this manner, the codeql binary will find it automatically.
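The two requirements above (the binary in PATH, the standard library in a sibling folder) can be checked with a few lines of Python. This is only a sketch: it assumes the clone of github/codeql kept its default folder name codeql, and the function name is made up for this example.

```python
import shutil
from pathlib import Path
from typing import Optional


def check_codeql_setup(project_root: Path, search_path: Optional[str] = None) -> list:
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    # shutil.which searches `search_path` if given, else the PATH environment variable.
    if shutil.which("codeql", path=search_path) is None:
        problems.append("codeql binary not found in PATH")
    # Assumed layout: the standard library clone sits next to the analyzed project.
    stdlib_clone = project_root.resolve().parent / "codeql"
    if not stdlib_clone.is_dir():
        problems.append(f"expected a clone of github/codeql at {stdlib_clone}")
    return problems


for problem in check_codeql_setup(Path(".")):
    print(problem)
```

Running it from the root of the repository being analyzed prints nothing when both conditions hold.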
Before I write my own query, let’s run standard CodeQL queries for Python. First, I need to create a database. Instead of analyzing code at each run, CodeQL’s way of operating is to:
- Store the code in a database,
- Then run one or many queries on the database.
While I develop a query, and so iterate on the second step above, keeping the two steps distinct saves computing time. As long as the code being analyzed doesn’t change, there is no need to rebuild the database. Let’s build the database as follows:
> codeql database create --language=python codeql-db --source-root=.
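The idea of rebuilding the database only when the code changes can be automated. The sketch below is an assumption-laden illustration, not something the CodeQL CLI provides: it fingerprints the Python files of the project, stores the fingerprint in a stamp file of my own invention, and only re-runs the create command when the fingerprint differs. The --overwrite flag is an addition to the command above, so that an existing database directory can be replaced.

```python
import hashlib
import subprocess
from pathlib import Path


def create_command(source_root: Path, db_dir: Path) -> list:
    """Command from the article, plus --overwrite to replace an existing database."""
    return [
        "codeql", "database", "create", "--language=python",
        str(db_dir), f"--source-root={source_root}", "--overwrite",
    ]


def source_fingerprint(source_root: Path, db_dir: Path) -> str:
    """Hash the relative paths and contents of the Python files under source_root."""
    digest = hashlib.sha256()
    for path in sorted(source_root.rglob("*.py")):
        if db_dir in path.parents:
            continue  # ignore anything stored inside the database itself
        digest.update(str(path.relative_to(source_root)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def database_is_stale(source_root: Path, db_dir: Path, stamp: Path) -> bool:
    """True when the database is missing or the sources changed since the last build."""
    if not db_dir.is_dir() or not stamp.is_file():
        return True
    return stamp.read_text() != source_fingerprint(source_root, db_dir)


def refresh_database(source_root: Path, db_dir: Path, stamp: Path) -> bool:
    """Rebuild the database only when needed; return whether a rebuild happened."""
    if not database_is_stale(source_root, db_dir, stamp):
        return False
    subprocess.run(create_command(source_root, db_dir), check=True)
    stamp.write_text(source_fingerprint(source_root, db_dir))
    return True
```

A call such as refresh_database(Path("."), Path("codeql-db"), Path("codeql-db.stamp")) would then replace the manual command while iterating on queries.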
Now that the database is created, let’s run the python-security-and-quality queries (a set of default queries for Python, provided by CodeQL’s standard library):
> codeql database analyze codeql-db python-security-and-quality --format=sarif-latest --output=codeql.sarif
# Now, transform the SARIF output into CSV, for better human readability, using https://pypi.org/project/sarif-tools/
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/unused-local-variable,Variable default_value is not used.,app.py,12
Indeed, in the snippet above, it looks like the developer intended to use a variable to store the value "default"
but forgot to use it in the end.
This is not a security vulnerability, but it exemplifies the kind of programming mistakes that CodeQL’s default rules find.
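If you prefer not to install an extra tool, the same table can be approximated with Python’s json module, since SARIF is a JSON format. The sketch below reads the rule identifier, message, file and line of each result. The sample document is hand-written to mirror the CSV row above; it is not actual CodeQL output, which contains many more fields.

```python
import json

# Hand-written sample mirroring the CSV row shown above (an assumption, not real output).
SAMPLE_SARIF = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "results": [{
      "ruleId": "py/unused-local-variable",
      "message": {"text": "Variable default_value is not used."},
      "locations": [{
        "physicalLocation": {
          "artifactLocation": {"uri": "app.py"},
          "region": {"startLine": 12}
        }
      }]
    }]
  }]
}
""")


def sarif_to_rows(sarif: dict) -> list:
    """Flatten a SARIF document into (rule, message, file, line) tuples."""
    rows = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or []
            if not locations:
                continue  # results without a location are skipped in this sketch
            physical = locations[0]["physicalLocation"]
            rows.append((
                result["ruleId"],
                result["message"]["text"],
                physical["artifactLocation"]["uri"],
                physical["region"]["startLine"],
            ))
    return rows


for row in sarif_to_rows(SAMPLE_SARIF):
    print(",".join(str(field) for field in row))
```

To use it on a real run, load the file with json.load(open("codeql.sarif")) instead of the embedded sample.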
Note that the vulnerability of passing data from the POST request to the sarge.run call is not yet caught. That is because sarge is not in CodeQL’s list of supported Python libraries.
Writing a query to model sarge.run: modeling the source
The sarge.run function executes a command, like subprocess does. As such it is a sink for tainted data: one should make sure that data passed to sarge.run is controlled.
CodeQL performs a modular analysis: it doesn’t inspect the source code of your dependencies. As a consequence,
you need to model your dependencies’ behavior for them to be treated correctly by CodeQL’s analysis.
Modeling tainted sources and sinks is done by implementing the DataFlow::ConfigSig interface:
/** An input configuration for data flow. */
signature module ConfigSig {
  /** Holds if `source` is a relevant data flow source. */
  predicate isSource(Node source);

  /** Holds if `sink` is a relevant data flow sink. */
  predicate isSink(Node sink);
}
In this snippet, a predicate is a function returning a Boolean, while Node is a class modeling statements in the source code. So to implement isSource, I need to capture the Nodes that we deem relevant sources of tainted data w.r.t. sarge.run. Since any source of tainted data is dangerous if you send its content to sarge.run, I implement isSource as follows:
predicate isSource(DataFlow::Node source) { source instanceof ActiveThreatModelSource }
Threat models control which sources of data are considered dangerous. Usually, only remote sources (data in an HTTP request, packets from the network) are considered dangerous. That’s because, if local sources (content of local files, content passed by the user in the terminal) are tainted, it means an attacker already has such a level of control over your software that you are doomed. That is why CodeQL’s default threat model only considers remote sources.
In isSource, by using ActiveThreatModelSource, we declare that the sources of interest are the sources of the current active threat model.
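As an analogy only, the filtering performed by a threat model can be pictured as follows in Python. The Source class, the two labels "remote" and "local", and the example sources are invented for this sketch; CodeQL’s actual threat model configuration is richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    description: str
    threat_model: str  # "remote" or "local" in this sketch


# Assumption mirroring the text: only remote sources are active by default.
DEFAULT_ACTIVE_THREAT_MODELS = frozenset({"remote"})


def active_sources(sources, active=DEFAULT_ACTIVE_THREAT_MODELS):
    """Keep only the sources whose threat model is currently active."""
    return [source for source in sources if source.threat_model in active]


ALL_SOURCES = [
    Source("HTTP request form field", "remote"),
    Source("content of a local file", "local"),
    Source("command-line argument", "local"),
]

for source in active_sources(ALL_SOURCES):
    print(source.description)  # prints: HTTP request form field
```

Widening the active set, for example to {"remote", "local"}, makes the two local sources relevant as well, which is the kind of switch a threat model gives you.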
To make sure that ActiveThreatModelSource works correctly on my codebase, I write the following test query in file Scratch.ql:
import python
import semmle.python.Concepts
from ActiveThreatModelSource src
select src, "Tainted data source"
Because this file depends on the python APIs of CodeQL, I need to put a qlpack.yml file close to Scratch.ql, as follows:
name: smelc/sarge-queries
version: 0.0.1
extractor: python
library: false
dependencies:
  codeql/python-queries: "*"
I can now execute Scratch.ql as follows:
> codeql database analyze codeql-db queries/Scratch.ql --format=sarif-latest --output=codeql.sarif
> sarif csv codeql.sarif
> cat codeql.csv
Tool,Severity,Code,Description,Location,Line
CodeQL,note,py/get-remote-flow-source,Tainted data source,app.py,1
This seems correct: something is flagged. Let’s make it more visual by running the query in VSCode.
For that I need to install the CodeQL extension.
To run queries within VSCode, I first need to specify the database to use. It is the codeql-db folder which we created with codeql database create above:
Now I run the query by right-clicking in its opened file:
Doing so opens the CodeQL results view:
I see that the import of request is flagged as a potential data source. This is correct: in my program, tainted data can come through usages of this package.
Writing a query to model sarge.run: modeling the sink
This is where things get more interesting. As per the ConfigSig interface above, I need to implement isSink(Node sink), so that it captures calls to sarge.run. Because CodeQL is a declarative object-oriented language, this means isSink must return true for instances of the subclasses of Node that represent calls to sarge.run. Let me describe a methodology to discover how to do that.
First, modify the Scratch.ql query to find all instances of Node in my application:
import python
import semmle.python.dataflow.new.DataFlow
from DataFlow::Node src
select src, "DataFlow::Node"
Executing this query in VSCode yields the following results:
Wow, that’s a lot of results! In a real codebase with multiple files, this would be unmanageable.
Fortunately, code completion works in CodeQL, so I can filter the results using the where clause, discovering the methods to call by looking at completions on the . symbol. Since the call to sarge.run I am looking for is at line 17, I can refine the query as follows:
from DataFlow::Node src, Location loc
where src.getLocation() = loc
  and loc.getFile().getBaseName() = "app.py"
  and loc.getStartLine() = 17
select src, "DataFlow::Node"
With these constraints, the query returns only a handful of results:
Still, there are 4 hits on line 17. Let’s see how I can disambiguate those. For this, CodeQL provides the getAQlClass predicate that returns the most specific type a variable has (as explained in CodeQL zero to hero part 3):
from DataFlow::Node src, Location loc
where src.getLocation() = loc
  and loc.getFile().getBaseName() = "app.py"
  and loc.getStartLine() = 17
select src, src.getAQlClass(), "DataFlow::Node"
See how the select clause now includes src.getAQlClass() as second element. This makes the CodeQL Query Results show it in the central column:
There are many more results, and that is because entries that were indistinguishable before are now disambiguated by the class.
If in doubt, one can consult the list of classes of CodeQL’s standard Python library to understand what each class is about. In our case, I had read the official documentation on using CodeQL for Python, and I recognize the CallNode class from this list.
As the documentation explains, there is actually an API to retrieve CallNode instances corresponding to functions imported from an external module, using the moduleImport function. Let’s use it to restrict our Nodes to be instances of CallNode (using a cast) that are calls to sarge.run:
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.ApiGraphs
from DataFlow::Node src
where src.(API::CallNode) = API::moduleImport("sarge").getMember("run").getACall()
select src, "CallNode calling sarge.run"
Executing this query yields the only result we want:
Putting this all together, I can finalize the implementation of ConfigSig as shown below.
The getArg(0) suffix models that the tainted data flows into sarge.run’s first argument:
private module SargeConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source instanceof ActiveThreatModelSource
  }

  predicate isSink(DataFlow::Node sink) {
    sink = API::moduleImport("sarge").getMember("run").getACall().getArg(0)
  }
}
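To get a feel for what the isSink predicate selects, here is a rough, purely syntactic analogue written with Python’s standard ast module. It is only an approximation for intuition: the function name is made up, and unlike the API graph used above, this sketch only recognizes the literal spelling sarge.run(...) and would miss, for instance, a call made after from sarge import run.

```python
import ast


def find_sarge_run_arguments(source: str) -> list:
    """Return (line, source text) of the first argument of each `sarge.run(...)` call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "run"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "sarge"
            and node.args
        ):
            first = node.args[0]
            hits.append((first.lineno, ast.unparse(first)))
    return hits


# The example is only parsed, never executed.
EXAMPLE = '''import sarge
import subprocess

received = input()
sarge.run(received)
subprocess.run(["ls"])
'''

print(find_sarge_run_arguments(EXAMPLE))  # prints [(5, 'received')]
```

The call to subprocess.run is ignored, just as the CodeQL predicate only matches members of the sarge module.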
Following the official template for queries tracking tainted data, I write the query as follows:
module SargeFlow = TaintTracking::Global<SargeConfig>;
from SargeFlow::PathNode source, SargeFlow::PathNode sink
where SargeFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Tainted data passed to sarge"
Executing this query in VSCode returns the paths (list of steps) along which the vulnerability takes place:
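As an intuition for what such a path is, the sketch below searches a tiny hand-written flow graph for a route from a source to a sink, with a configuration object playing the role of ConfigSig. Everything here is an illustrative assumption: the node names, the graph and the breadth-first search are mine, and CodeQL’s engine computes flow very differently.

```python
from collections import deque
from typing import Optional, Protocol


class FlowConfig(Protocol):
    """Python stand-in for the two predicates of ConfigSig."""
    def is_source(self, node: str) -> bool: ...
    def is_sink(self, node: str) -> bool: ...


def flow_path(edges: dict, config: FlowConfig) -> Optional[list]:
    """Breadth-first search for a shortest path from any source to any sink."""
    queue = deque([node] for node in edges if config.is_source(node))
    seen = {path[0] for path in queue}
    while queue:
        path = queue.popleft()
        if config.is_sink(path[-1]):
            return path
        for successor in edges.get(path[-1], []):
            if successor not in seen:
                seen.add(successor)
                queue.append(path + [successor])
    return None


class SargeLikeConfig:
    """Hypothetical configuration naming one source and one sink of the toy graph."""
    def is_source(self, node: str) -> bool:
        return node == "request.form"

    def is_sink(self, node: str) -> bool:
        return node == "sarge.run(arg0)"


# Hand-written flow graph loosely following the vulnerable handler.
EDGES = {
    "request.form": ["request.form.get(...)"],
    "request.form.get(...)": ["received"],
    "received": ["f-string in print", "sarge.run(arg0)"],
}

print(flow_path(EDGES, SargeLikeConfig()))
```

The printed list goes from request.form to sarge.run(arg0) through received, which is the same shape as the steps shown in the results view.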
Conclusion
I have demonstrated how to use CodeQL to model a Python library, covering the setup and the steps a developer must follow to write their first CodeQL query. I gave a methodology for writing instances of CodeQL interfaces, even when one lacks intimate knowledge of CodeQL APIs. I believe this is important, as the CodeQL ecosystem is small and the number of resources is limited: users of CodeQL often have to find out what to write on their own, with limited support from both the tooling and generative AI tools (probably because the number of resources on CodeQL is small, so the results of generative AI systems are poor too).
To dive deeper, I recommend reading the official CodeQL for Python resource and joining the GitHub Security Lab Slack to get support from CodeQL users and developers. And remember that this tutorial’s material is available at tweag/sarge-codeql-minimal if you want to experiment with this tutorial yourself!
Behind the scenes
Clément is a Director of Engineering, leading the Build Systems department. He studied Computer Science at Telecom Nancy and received his PhD from Université Nice Sophia Antipolis, where he proved multithreaded programs using linear logic. His technical background includes functional programming, compilers, provers, distributed systems, and build systems.