curl: downloading a file only if the source is newer than the destination.
Recently a friend of mine had a problem: he wanted to download a file only if the copy on the web server was newer than the one he had already downloaded. This mattered not only to save bandwidth, since the file was quite big, but also to avoid unnecessary checks after the download to decide whether or not to process the file. There are many ways to implement this, but the simplest is to use curl, which has this capability built in. Let’s see how!
HTTP Headers
Let’s first talk about HTTP headers and how they can help us verify whether a file on the web server is newer than an already downloaded one. An HTTP header is a field of an HTTP request or response that passes additional context and metadata about that request or response. One essential header is Last-Modified, which contains the date and time at which the origin server believes the resource was last modified. It is used as a validator to determine whether the resource is the same as the previously stored one.
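To make the validator idea concrete: a client that already holds a copy can send that copy's timestamp in an If-Modified-Since request header, and the server compares it with Last-Modified. The sketch below only formats a local file's modification time as an HTTP date. It assumes GNU date, and the function name http_date_of_file is made up for illustration.

```shell
#!/bin/bash
# Print the modification time of a file as an HTTP date,
# e.g. "Sat, 21 Jan 2023 18:36:14 GMT".
# Assumes GNU date: -r FILE reads the file's mtime, -u forces UTC.
# LC_ALL=C keeps the day and month names in English, as HTTP requires.
http_date_of_file() {
    LC_ALL=C date -u -r "$1" '+%a, %d %b %Y %H:%M:%S GMT'
}
```

Sending the header by hand would then look something like curl -H "If-Modified-Since: $(http_date_of_file 1.txt)" http://127.0.0.1:8000/1.txt, which is roughly what curl's time-condition option automates.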
HTTP HEAD
But how can we get the headers without downloading the file? Apart from the GET, POST, PUT, and DELETE methods, HTTP also has the HEAD method, which returns the same headers as GET but without the response body.
We can use curl with the -I option to make a HEAD request.
$ curl -I http://127.0.0.1:8000/1.txt
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.8.10
Date: Sun, 22 Jan 2023 10:40:41 GMT
Content-type: text/plain
Content-Length: 34
Last-Modified: Sat, 21 Jan 2023 18:36:14 GMT
We see that the Last-Modified header has the value Sat, 21 Jan 2023 18:36:14 GMT (note that the timestamp is in GMT). Let’s verify this against the file in the web server’s file system, where the local time zone is GMT+2.
$ stat --printf="%y\n" 1.txt
2023-01-21 20:36:14.507641678 +0200
We can see that the Last-Modified header does indeed return the time at which the file was last modified in the web server’s file system.
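In a script, the same comparison can be done without reading the output by eye. Here is a rough sketch, assuming GNU date and an awk on the PATH. The function names last_modified and http_date_to_epoch are hypothetical helpers, not part of curl.

```shell
#!/bin/bash
# Read HTTP response headers on stdin and print the Last-Modified value.
# Header names are case-insensitive, and HTTP lines end in CRLF,
# so the carriage returns are stripped first.
last_modified() {
    tr -d '\r' | awk 'tolower($0) ~ /^last-modified:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }'
}

# Convert an HTTP date to seconds since the epoch (GNU date only).
http_date_to_epoch() {
    date -u -d "$1" +%s
}
```

Piping curl -sI http://127.0.0.1:8000/1.txt into last_modified prints the header value. Passing that value to http_date_to_epoch and comparing the result with stat --printf="%Y\n" 1.txt gives a manual version of the check.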
How can I use this header?
curl has a built-in option for this job: -z (--time-cond). Given a file name, it downloads the resource only if the server’s copy has been modified more recently than that local file. Let’s see how it works.
$ curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt
Warning: Illegal date format for -z, --time-cond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 34 100 34 0 0 17000 0 --:--:-- --:--:-- --:--:-- 17000
Running this the first time, without ever having downloaded the file, shows a warning message. This is normal: 1.txt does not exist yet, so -z has no file whose timestamp it can use, and curl disables the time condition. We can also see that curl has downloaded the file.
Running this a second time, we see no warnings and no indication that curl has downloaded any content; every value is 0.
$ curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
From the web server logs, we can see that the first request got an HTTP response of 200 (the file was downloaded successfully), while the second got 304 (Not Modified), which indicates that there is no need to retransmit the requested resource.
127.0.0.1 - - [22/Jan/2023 12:55:17] "GET /1.txt HTTP/1.1" 200 -
127.0.0.1 - - [22/Jan/2023 12:57:17] "GET /1.txt HTTP/1.1" 304 -
You may be wondering: how can I use this in my script? Well, if you plan to rely on curl’s exit code, you will be disappointed. Whether the HTTP status code is 200 or 304, curl returns an exit code of 0, so the exit code cannot tell the two cases apart.
$ curl -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
$ echo $?
0
But we can make curl print only the HTTP status code of the operation, like this:
$ curl -s -w "%{http_code}\n" -z 1.txt http://127.0.0.1:8000/1.txt -o 1.txt
304
Here -s silences the progress meter, -w prints the status code after the transfer, and -o 1.txt still saves the body when there is one. In a bash script, it could look like this, so that you can decide which action to take depending on whether a new file was actually downloaded.
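#!/bin/bash
# -w prints the HTTP status code; the body, if any, is saved to 1.txt.
isnewfile=$(curl -s -w "%{http_code}\n" -z 1.txt -o 1.txt http://127.0.0.1:8000/1.txt)
if [ "$isnewfile" = "200" ]; then
    echo "File is new."
elif [ "$isnewfile" = "304" ]; then
    echo "File is not new."
else
    # Anything else (404, 500, 000 for a failed connection...) is an error.
    echo "Unexpected HTTP status: $isnewfile" >&2
fi
Conclusion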
Linux tools like curl are powerful, and it is worth keeping in mind that almost every problem you have was probably someone else’s problem first, so there may already be a streamlined solution like the -z option, which I think is a very cool feature :)