Efficient way to read string files in Rust

Today, we will explore a few different methods for reading files in Rust. As a frontend engineer with a huge curiosity for backend and low-level systems, experimenting with Rust has been quite an enjoyable journey. Last year, this system language caught my attention, and since then, I’ve found myself returning to it time and again.

 

Recently, I’ve started a small Rust CLI project to deepen my understanding of the language. The task is seemingly simple: create a tool that searches for a string within a file and outputs all lines containing that string to the terminal (basic form of wellknown grep). However, what began as a straightforward exercise soon evolved into an exploration of efficient file reading techniques, particularly for handling large datasets.

 

In this blog post, I’ll share my discovery, presenting several approaches for reading files. Along the way, I’ll explain the inner workings of each solution and provide insights from my performance analysis. While not a comprehensive guide, I hope this exploration inspires you to experiment with Rust and deepen your understanding of its capabilities. You can find my project at github/simple-rust-grep-tool.

 

This is not a step by step guide, you should be able to create new Rust project, know how to run it, and explore and learn on your own about the crates I’m using here. Focus is on finding performant way to read files from disk.

Brief explanation of the simple grep tool

Let’s say our text file name is random_text.txt and we want to print only the lines that contain "hello world". To invoke the tool we’ll use the following command:

$ cargo run "hello world" random_text.txt

Upon executing this command we should see in the terminal the appropriate lines. If the file does not exist or is somehow corrupted, the tool will provide an appropriate error message.

Of course, if there are no lines with the phrase you can update the tool to print out the appropriate message too (I'm not doing it here).

Text file I’m using here has 10mb, each line is 80 characters long. I’m generating it with Rust too (out of scope for this post, but you can find the script I’m using in the Github repo).

Let’s start

I’ll focus on two files here: main.rs and util.rs. Main function parses command line arguments and invokes different functions for reading a file and searching for particular phrase in it.

 

util.rs is a module that contains my solution functions: read_file_as_a_string, read_file_line_by_line and read_file_line_by_line_better.

 

Here is the content of the main file:

use std::time::Instant;

use clap::Parser;

use util::read_file_as_a_string;

mod util;

#[derive(Parser)]
struct Cli {
    pattern: String,
    path: std::path::PathBuf,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Record the start time
    let start_time = Instant::now();

    // Parse command line arguments
    let args = Cli::parse();

    // Read the file and print lines or return an err
    read_file_as_string(&args)?;

    let end_time = Instant::now();

    // Calculate the elapsed time
    let elapsed_time = end_time.duration_since(start_time);

    // Print the elapsed time in milliseconds
    println!("Program execution time: {} ms", elapsed_time.as_millis());

    Ok(())
}

The Cli struct defines arguments I want to parse from the invocation command. For parsing arguments I use clap crate which is an excelent command line argument parser for Rust. Additionally, I employ the instant struct to measure the execution time of the tool. I’m here invoking read_file_as_string as a first solution, later when I intend to test another 2 approaches I’ll just replace this function with another ones.

A more elegant approach would involve adding support for an additional flag that users can provide to the grep command. This argument would enable users to specify their preferred method for reading the file. Based on the provided flag, the tool would then invoke the corresponding function.

The easiest solution: read_to_string

The most basic approach to read a file is to use standard library std::fs::read_to_string function. This will read the entire contents of a file into a string which mean it will read the whole file into memory (RAM), however large the file might be.

use std::fs::File;

use anyhow::{Context, Result};

use crate::Cli;

pub fn read_file_as_string(args: &Cli) -> Result<()> {
    let content = std::fs::read_to_string(&args.path)
        .with_context(|| format!("Could not read the file {}", args.path.display()))?;

    for line in content.lines() {
        if line.contains(&args.pattern) {
            println!("{}", line);
        }
    }

    Ok(())
}

I really like using with_context from anyhow crate to show nicer errors and more context alongside original errors.

 

This works, but is not good for large files, since it can lead to memory exhaustion, causing the program to run out of memory and potentially crash.

Performance: read_to_string

This solution took 128ms, 114ms, and 114ms, respectively, to complete the task in three consecutive calls.

A better solution: BufReader

The better approach is to use std::io::BufReader struct from the standard library. This struct adds buffering to any reader. A buffered reader is usually more efficient way to read from a file in Rust (true for other languages too), due to its ability to minimize the number of system calls and improve performance by reading data in larger chunks and storing it in memory for faster access (check the section below about system calls). Its buffer size is fixed (default is 8KiB, it is configurable) so for large files it will not store the whole file in memory.

 

Here is a solution that read line by line using buffered reader:

use std::fs::File;
use std::io::{BufRead, BufReader};

use anyhow::{Context, Result};

use crate::Cli;

pub fn read_file_line_by_line(args: &Cli) -> Result<(), Box<dyn std::error::Error>> {
    let f = File::open(&args.path)
        .with_context(|| format!("Could not read the file {}", args.path.display()))?;

    let reader = BufReader::new(f);

    for line in reader.lines() {
        let line = line?;

        if line.contains(&args.pattern) {
            println!("{}", line);
        }
    }

    Ok(())
}

Performance: BufReader using .lines()

I thought I found the approach I want to go on with, but then I tested it and the results were not what I was expecting. This solution took 239ms, 180ms, 181ms. What?! Slower than the basic read-to-string solution? So I researched.

 

The lines() method automatically attached to anything implementing std::io::BufRead trait is really convenient but it has a big performance problem - the iterator returns an owned String so we need to allocate a new string on the heap for every line in the file. While memory allocation itself is generally fast, the overhead of allocating and deallocating memory for each line can become significant, especially for large files with many short lines (which is my text file).

Ok, now really better approach

So, the obvious “fix” to the previous approach is to create manually a String buffer and read line by line into this buffer (I named this buffer line). This means we only ever have one string buffer and there is no allocation and dealocation of heap memory for every line (if new line is longer than previous one there will be new allocation).

 

Before filling the buffer with new line we clear the buffer (if the new line is shorter than the previous one/ones and we don’t clear the buffer it will end up with data from both/multiple lines).

use std::fs::File;
use std::io::{BufRead, BufReader};

use anyhow::{Context, Result};

use crate::Cli;

pub fn read_file_line_by_line_better(args: &Cli) -> Result<(), Box<dyn std::error::Error>> {
    let f = File::open(&args.path)
        .with_context(|| format!("Could not read the file {}", args.path.display()))?;

    let mut reader = BufReader::new(f);
    let mut line = String::new();

    loop {
        let len = reader.read_line(&mut line)?;

        if len == 0 {
            break;
        }

        if line.contains(&args.pattern) {
            println!("{}", line.trim());
        }

        line.clear();
    }

    Ok(())
}

Performance: BufReader using one buffer

After running the tool few times it has been confirmed that this approach is indeed superior to the two previous methods. The tool took 110ms, 107ms and 110ms (for 3 consecutive calls) to read the file and print the lines.

Adjust the size of the buffer

The default buffer size is 8 KiB which is maybe too much for the lines my text file has. I was playing a bit with this and found that the size of around 3 KiB gives the same performanse. But keep in mind that the larger the size of the buffer, the fewer system calls will occur. Try multiple values and test performance. You can adjust the size of the buffer when you create it using:

let mut reader = BufReader::with_capacity(3000, f);

You should definitelly check the official docs for BufReader to find out other ways you can use and configure it.

A note on system calls

A system call is a mechanism for an application program on a machine to interact with machine’s operating system (OS). Application programs use system calls to request services from the OS’s kernel.

 

The kernel provides a set of interfaces by which processes running in user-space can interact with the system. These interfaces give applications controlled access to hardware, a mechanism with which to create new processes and communicate with existing ones, and the capability to request other OS resources.

 

When a program invokes a system call, the execution context switches from user to kernel mode, allowing the system to access hardware and perform the required operations safely. After the operation is completed, the control returns to the user mode, and the program continues its execution.

 

File reading in Rust, as in many programming languages, involves interactions with the underlying operating system. These interactions typically include system calls. System calls involve doing a bunch of work to prepare to hand over control to the operating system so that it can do the read operation. That work is the same whether you’re reading one byte at a time or plenty.

 

BufReader makes it so that the system calls always read a lot at a time even when you only ask for a little bit of data, so that when you ask for the next little bit of data it doesn’t need to go through the work all over again (it already has the requested data in the buffer). It’s the same with lines in my example above, it will read as many data it can in the buffer and will read lines from it into our line String buffer as long as there are full lines. Next time there is no full line, `BufReader“ will create another system call and fill its buffer again.

 

Using read_to_string makes it even more efficient in terms of the number of system calls, but if you’re only using part of the file at a time it causes a lot of unnecessary memory use. BufReader is intended to be a middle ground without excessive memory usage or excessive system calls.

Further performance improvements

The contains method from Rust’s standard library that I’m using in the examples above to find a substring in a string is generally efficient for most use cases, but its performance might degrade for very large strings or if you’re doing many substring checks in a tight loop. The primary reason for its potential slowness in those scenarios is its implementation. In those cases it’s advisable to explore alternative approaches. Really good one is memchr crate.

 

In the example provided earlier, I demonstrated reading files line by line. However, depending on your specific requirements, you might find it necessary to read the file in byte chunks rather than line by line. Additionally, employing tokenization can offer further flexibility and customization for various use cases.

 

Until next time, take care! 🦀

by diksipav - © 2024