curl: downloading a file only if the source is newer than the destination.


Recently a friend of mine had a problem: he wanted to download a file only if the copy on the web server was newer than the one he had already downloaded. The goal was to save bandwidth, since the file was quite big, and also to avoid unnecessary checks after the download to decide whether the file needed processing. There are many ways to implement this, but the simplest is to use curl, which has this capability built in. Let’s see how!

HTTP Headers

Let’s first talk about HTTP headers and how they can help us verify whether a file on the web server is newer than an already downloaded copy. An HTTP header is a field of an HTTP request or response that passes additional context and metadata about that request or response. One essential header is Last-Modified, which contains the date and time at which the origin server believes the resource was last modified. It is used as a validator to determine whether the resource is the same as the previously stored one.

HTTP HEAD

But how can we get the headers without downloading the file? Apart from methods such as GET, POST, PUT and DELETE, the HTTP protocol also has the HEAD method, which returns the same headers as GET but without the response body.

We can use curl with the -I option to make a HEAD request.

$ curl -I http://127.0.0.1:8000/1.txt 
HTTP/1.0 200 OK 
Server: SimpleHTTP/0.6 Python/3.8.10 
Date: Sun, 22 Jan 2023 10:40:41 GMT 
Content-type: text/plain 
Content-Length: 34 
Last-Modified: Sat, 21 Jan 2023 18:36:14 GMT

We see that the Last-Modified header has the value Sat, 21 Jan 2023 18:36:14 GMT (note that the timestamp is expressed in GMT). Let’s verify this against the file in the web server’s file system, whose time zone is GMT+2.

$ stat --printf="%y\n" 1.txt 
2023-01-21 20:36:14.507641678 +0200

We can see that, indeed, the Last-Modified header returns the time that a file was modified in the web server file system.
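
To make the comparison concrete, here is a minimal sketch of doing it by hand: pull the Last-Modified value out of a set of response headers and compare it with a local file's modification time. It assumes GNU date and stat (the same stat used above), and the function names last_modified_of and remote_is_newer are hypothetical helpers, not part of curl.

```shell
#!/bin/bash
# Sketch only: assumes GNU coreutils (date -d, stat -c).

# Read HTTP response headers on stdin and print the Last-Modified value.
# Header names are case-insensitive and lines end in CRLF, so strip the CR.
last_modified_of() {
    tr -d '\r' | awk 'tolower($0) ~ /^last-modified:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Succeed (exit 0) if the HTTP date in $1 is later than the mtime of file $2.
# A missing local file counts as "remote is newer".
remote_is_newer() {
    local http_date=$1 file=$2
    [ -e "$file" ] || return 0
    local remote_epoch local_epoch
    remote_epoch=$(date -u -d "$http_date" +%s) || return 2
    local_epoch=$(stat -c %Y "$file") || return 2
    [ "$remote_epoch" -gt "$local_epoch" ]
}

# Example use against the server from this article:
#   lm=$(curl -sI http://127.0.0.1:8000/1.txt | last_modified_of)
#   remote_is_newer "$lm" 1.txt && echo "server copy is newer"
```

This is only meant to illustrate what the header makes possible; it costs an extra HEAD request and some date parsing in the script.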

How can I use this header?

curl has a built-in option for this job! Given the name of a local file, -z (--time-cond) downloads the remote file only if the server's copy has been modified more recently than the local file's modification time. Let’s see how it works.

$ curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt 
Warning: Illegal date format for -z, --time-cond (and not a file name). 
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax. 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current 
                                 Dload  Upload   Total   Spent    Left  Speed 
100    34  100    34    0     0  17000      0 --:--:-- --:--:-- --:--:-- 17000

Running this the first time, without ever having downloaded the file, shows a warning. This is normal: 1.txt does not exist yet, so -z has no file whose timestamp it can compare against. curl disables the time condition and, as we can see, downloads the file.
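
One way to avoid that first-run warning is to pass -z only when the local file already exists. The sketch below assumes bash; build_curl_args is a hypothetical helper that just assembles the argument list, using the long option names --time-cond (the same as -z) and --output (the same as -o).

```shell
#!/bin/bash
# Sketch only: build_curl_args is a hypothetical helper, not a curl feature.
# Prints one curl argument per line; --time-cond (-z) is added only when
# the destination file already exists, so the first run stays warning-free.
build_curl_args() {
    local url=$1 dest=$2
    local args=(--silent --show-error --output "$dest")
    if [ -f "$dest" ]; then
        args+=(--time-cond "$dest")
    fi
    args+=("$url")
    printf '%s\n' "${args[@]}"
}

# Example use (file names containing newlines are not supported):
#   mapfile -t args < <(build_curl_args http://127.0.0.1:8000/1.txt 1.txt)
#   curl "${args[@]}"
```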

Running the same command a second time, we see no warnings and no indication that curl has downloaded any content; every value is 0.

$ curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current 
                                 Dload  Upload   Total   Spent    Left  Speed 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

From the web server logs, we can see that the first request got an HTTP response of 200 (the file was downloaded successfully) and the second got 304 (Not Modified), which indicates that there is no need to retransmit the requested resource.

127.0.0.1 - - [22/Jan/2023 12:55:17] "GET /1.txt HTTP/1.1" 200 - 
127.0.0.1 - - [22/Jan/2023 12:57:17] "GET /1.txt HTTP/1.1" 304 -
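
Behind the scenes, -z turns the local file's modification time into an If-Modified-Since request header, and the server answers 304 when the resource has not changed since then. As a rough sketch, assuming GNU date, the same header can be built by hand; http_date_of_file is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: assumes GNU date, where -r FILE means "use FILE's mtime".

# Print the modification time of file $1 in HTTP date format (always GMT).
# LC_ALL=C keeps the day and month names in English, as HTTP requires.
http_date_of_file() {
    LC_ALL=C date -u -r "$1" '+%a, %d %b %Y %H:%M:%S GMT'
}

# Roughly equivalent to: curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt
#   curl -H "If-Modified-Since: $(http_date_of_file 1.txt)" \
#        http://127.0.0.1:8000/1.txt -o 1.txt
```

In practice -z is the simpler choice; the manual header is shown only to explain where the 304 comes from.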

You may now be wondering how to use this in a script. Well, if you plan to rely on the curl exit code, you will be disappointed: whether the HTTP status code is 200 or 304, curl returns an exit code of 0, so the exit code alone cannot tell you whether a new file was downloaded.

$ curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current 
                                 Dload  Upload   Total   Spent    Left  Speed 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 
$ echo $?
0

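But we can make curl print the HTTP status code of the operation, like this (-s silences the progress meter, and -w writes the status code to standard output after the transfer, while the file itself still goes to 1.txt):

$ curl -s -w "%{http_code}\n" -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt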
304

In a bash script, this could look like the following, so that you take action only when a new file has actually been downloaded.

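#!/bin/bash
# Download 1.txt only if the server copy is newer than the local one.
# -s hides the progress meter; -w prints the HTTP status code, which we capture.
isnewfile=$(curl -s -w "%{http_code}\n" -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt)

# 200 means the file was downloaded; 304 means the local copy is up to date.
if [ "$isnewfile" = "200" ]; then
    echo "File is new."
else
    echo "File is not new."
fi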
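
A slightly more defensive variant might also handle the cases where the server is unreachable or returns an error. This is a sketch under assumptions: classify_status and fetch_if_newer are hypothetical names, and --fail is added so that an error page from the server is not saved over the existing file.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.

# Map an HTTP status code to a short word a script can branch on.
classify_status() {
    case "$1" in
        200) echo "new" ;;
        304) echo "unchanged" ;;
        000) echo "unreachable" ;;  # curl reports 000 when no HTTP response arrived
        *)   echo "error" ;;
    esac
}

# Download URL $1 to file $2 only if the server copy is newer,
# then print new / unchanged / unreachable / error.
fetch_if_newer() {
    local url=$1 dest=$2 status
    # --fail: on HTTP errors (4xx/5xx) do not save the response body.
    status=$(curl -s --fail -w "%{http_code}" -z "$dest" -o "$dest" "$url")
    classify_status "$status"
}

# Example use:
#   case "$(fetch_if_newer http://127.0.0.1:8000/1.txt 1.txt)" in
#       new) echo "processing 1.txt" ;;
#       unchanged) echo "nothing to do" ;;
#       *) echo "download failed" >&2 ;;
#   esac
```

Branching on a word rather than on the raw code keeps the calling script readable and makes it easy to treat every unexpected status as a failure.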

Conclusion

Linux tools like curl are powerful. Keep in mind that almost every problem you run into has probably been someone else's problem first, and that there is often already a streamlined solution, like the -z option, which I think is a very cool feature :)
