Andrei is back to discuss recent academic research into malware within the Python/PyPI ecosystem and whether it is possible to effectively combat it with open source tooling, plus we cover security updates for Unbound, libuv, node.js, the Linux kernel, libgit2 and more.
56 unique CVEs addressed
getaddrinfo()
- but
would then fail to NUL-terminate the string - as such, getaddrinfo()
would
read past the end of the buffer and the address that got resolved may not be
the intended one - so then a remote attacker who could influence this could
end up causing the application to contact a different address than expected
and so perhaps access internal services etc<<<<<<....
then would cause exponential performance
degredationsend()
rather than public_send()
which allowed access to
private methods to directly execute system calls@
git_index_add
/erc/resolv.conf
/etc/hosts
, /etc/nsswitch.conf
or anything specifed via the
HOSTALIASES
environment variable - if has an embedded NUL as the first
character in a new line, would then attempt to read memory prior to the start
of the buffer and hence an OOB read -> crashHey, Alex!
We will continue our journey today beyond the scope of the previous episodes. We’ve delved into the realms of network security, federated infrastructures, and vulnerability detection and assessment.
Last year, the Ubuntu Security Team participated in the Linux Security Summit in Bilbao. At that time, I managed to have a discussion with Zach, who hosted a presentation at the Supply Chain Security Con entitled “Will Large-Scale Automated Scanning Stop Malware on OSS Repositories?”. I later discovered that his talk was backed by a paper that he and his colleagues from Chainguard had published.
With this in mind, today we will be examining “Bad Snakes: Understanding and Improving Python Package Index Malware Scanning”, which was published last year in ACM’s International Conference on Software Engineering.
The aim of the paper is to highlight the current state of the Python and PyPi ecosystems from a malware detection standpoint, identify the requirements for a mature malware scanner that can be integrated into PyPi, and ascertain whether the existing open-source tools meet these objectives.
With this in mind, let’s start by understanding the context.
Applications can be distributed through repositories. This means that the applications are packaged into a generic format and published in either managed or unmanaged repositories. Users can then install the application by querying the repositories, downloading the application in a format that they can unpack through a client, and subsequently run on their hosts.
There are numerous repositories out there. Some target specific operating systems, as is the case with Debian repositories, the Snap Store, Google Play, or the Microsoft Store. Others are designed to store packages for a specific programming language, such as PyPi, npm, and RubyGems. Firefox Add-ons and the Chrome extension store target a specific platform, namely the browser.
Another relevant characteristic when discussing repositories is the level of curation. The Ubuntu Archive is considered a curated repository of software packages because there are several trustworthy contributors able to publish software within the repository. Conversely, npm is unmanaged because any member of the open-source community can publish anything in it.
We will discuss the Python Package Index extensively, which is the de facto unmanaged repository for the Python programming language. As of the 7th of March 2024, there were 5.4 million releases for 520 thousand projects and nearly 800 thousand users. It is governed by a non-profit organisation and run by volunteers worldwide.
Software repositories foster the dependencies of software on other pieces of software, controlled by different parties. As seen in campaigns such as the SolarWinds SUNBURST attack, this can go awry. Attackers can gain control over software in a company’s supply chain, gain initial access to their infrastructure, and exploit this advantage.
Multiple attack vectors are possible. Accounts can be hijacked. Attackers may publish packages with similar names (in a tactic known as typosquatting). They can also leverage shrink-wrapped clones, which are duplicates of existing packages, where malicious code is injected after gaining users’ trust. While covering all attack vectors is beyond the scope of this podcast episode, you can find a comprehensive taxonomy in a paper called “Taxonomy of Attacks on Open-Source Software Supply Chains”, which lists over 100 unique attack vectors.
From 2017 to 2022, the number of unique projects removed from PyPi increased rapidly: 38 in the first year, followed by 130, 60, 500, 27 thousands, and finally 12 thousands in the last year. Despite the fact that most of these were reported as malware, it’s worth noting that the impact of some of them is limited due to the lack of organic usage.
These attacks can be mitigated by implementing techniques such as multi-factor authentication, software signing, update frameworks, or reproducible builds, but the most widespread method is malware analysis.
Some engines check for anomalies via static and dynamic heuristics, while others rely on signatures due to their simplicity. Once a piece of software is detected as malicious, its hash is added to a deny list that is embedded in the anti-malware engine. Each file is then hashed and the result is checked against the deny list. If the heuristics or the hash comparison identifies the file as malicious, it is either reported, blocked, or deleted depending on the strategy implemented by the anti-malware engine.
These solutions are already implemented in software repositories. In the case of PyPi, malware scanning was introduced in February 2022 with the assistance of a malware check feature in Warehouse, the application serving PyPi. However, it was disabled by the administrators two years later and ultimately removed in May 2023 due to an overload of alerts.
In addition to this technical solution, PyPi also capitalises on a form of social symbiosis. Software security companies and individuals conduct security research, reporting any discovered malware to the PyPi administrators via email. The administrators typically allocate 20 minutes per week to review these malware reports and remove any packages that can be verified as true positives. Ultimately, the reporting companies and individuals gain reputation or attention for their brands, products, and services.
In addition to information about software repositories, supply chain attacks, malware analysis, and PyPi, the researchers also interviewed administrators from PyPi to understand their requirements for a malware analysis tool that could assist them. The three interviews, each lasting one hour, were conducted in July and August 2022 and involved only three individuals. This limited number of interviews is due to the focus on the PyPi ecosystem, where only ten people are directly involved in malware scanning activities.
When discussing requirements, the administrators desired tools with a binary outcome, which could be determined by checking if a numerical score exceeds a threshold or not. The decision should also be supported by arguments. While administrators can tolerate false negatives, they aim to reduce the rate of false positives to zero. The tool should also operate on limited resources and be easy to adopt, use and maintain.
But do the current solutions tick these boxes?
The researchers selected tools based on a set of criteria: analysing the code of the packages, having public detection techniques, and detection rules. Upon examining the available solutions, they found that only three could be used for evaluation in the context of their research: PyPi’s malware checks, Bandit4Mal, and OSSGadget’s OSS Detect Backdoor.
Regarding the former, it should be noted that the researchers did not match the YARA rules only against the setup files, but also against all files in the Python package. The second, Bandit4Mal, is an open-source version of Bandit that has been adapted to include multiple rules for detecting malicious patterns in the AST generated from a program’s codebase. The last, OSSGadget’s OSS Detect Backdoor, is a tool developed by Microsoft in June 2020 to perform rule-based malware detection on each file in a package.
These tools were tested against both malicious and benign Python packages. The researchers used two datasets containing 168 manually-selected malicious packages. For the benign packages, they selected 1,400 popular packages and one thousand randomly-selected benign Python packages.
For the evaluation process, they considered an alert in a malicious package to be a true positive and an alert in a benign package to be a false positive.
The true positive rate was 85% for the PyPi checks, the same for OSS Detect Backdoor and 90% for Bandit4Mal. The false positive rates ranged from 15% for the PyPi checks over the random packages, to 80% for Bandit4Mal on popular packages.
The tools ran in a time-effective manner, with a median time of around two seconds per package across all datasets. The maximum runtime was recorded for Ansible’s package, which was scanned in 26 minutes.
Despite their efficient run times, we can infer from these results that the tools are not accurate enough to meet the demands of PyPi’s administrators. The analysts may be overwhelmed by alerts for benign packages, which could interfere with their other operations.
And with this, we can conclude the episode of the Ubuntu Security Podcast, which details the paper “Bad Snakes: Understanding and Improving Python Package Index Malware Scanning”. We have discussed software repositories, malware analysis, and malware-related operations within PyPi. We’ve also explored the requirements that would make a new open-source Python malware scanner suitable for the PyPi administrators and evaluated how the current solutions perform.
If you come across any interesting topics that you believe should be discussed, please email us at [email protected].
Over to you, Alex!