I am trying to crawl a few sites using HTTPoison. What's the best way to check whether the crawler is getting the latest content if the Last-Modified and ETag headers are missing?
Maybe try computing a hash of the last crawled page on this endpoint.
defmodule Scraper do
  defmodule Page do
    defstruct [:hash, :content, :path]
  end

  def scrape!(%Page{path: path} = old_page) do
    %HTTPoison.Response{body: body} = HTTPoison.get!(path)
    new_page = %Page{hash: :crypto.hash(:md5, body), content: body, path: path}

    case stale?(new_page, old_page) do
      # same hash, nothing changed since the last crawl: do nothing?
      true -> :noop
      # hash changed: side effect (update db or something)
      false -> :update
    end

    new_page
  end

  # Returns true when the hashes match, i.e. the content has not changed.
  def stale?(new_page, old_page)
  def stale?(%Page{hash: hash}, %Page{hash: hash}), do: true
  def stale?(_, _), do: false
  ...
end
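For reference, using it might look like this (a sketch; the first crawl just seeds the struct with a path and no hash):

first = Scraper.scrape!(%Scraper.Page{path: "http://www.engadget.com/rss.xml"})
later = Scraper.scrape!(first)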
Thanks @idiot. But if I crawl, say, http://www.engadget.com/rss.xml, then how would the hash determine whether the site has updated or not without checking the headers?
The hash would be different, I think. Or am I misunderstanding what you mean?
@idiot… let me explain… I am parsing the above link to get new entries. But I don't want to fetch the content if it hasn't been changed or modified since the last crawl, because that would waste bandwidth on both ends. So the only way left to crawl it efficiently is by checking the ETag or Last-Modified header. Now, if both of them are missing, what other choice do I have to check for new content? Does that make it clear?
Does that make it clear?
Yes, I understand now.
Now, if both of them are missing, what other choice do I have to check for new content?
Don't know; it depends on the website. To avoid wasting bandwidth, you can request only part of the page (like the first 100 bytes with the Range header) and compute the hash of that.
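A rough sketch of that idea (prefix_hash/2 is a made-up helper, and it assumes the server honours Range; one that ignores it will just send the whole body):

# Sketch only: hash the first `bytes` bytes of the resource.
defmodule PrefixCheck do
  def prefix_hash(url, bytes \\ 100) do
    # "bytes=0-99" requests the first 100 bytes (range ends are inclusive).
    %HTTPoison.Response{body: body} =
      HTTPoison.get!(url, [{"Range", "bytes=0-#{bytes - 1}"}])

    :crypto.hash(:md5, body)
  end
end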
But that would only detect changes in the first 100 bytes of the page, right?
Which is fine if they prepend stuff. If they append, then grab the last bit?
But the first 100 bytes might always be this (just an example):
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" xmlns:itunes="https://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
<channel>
<title>Engadget RSS Feed</title>
<link>https://www.engadget.com/rss.xml</link>
<image>
<url>
https://www.blogsmithmedia.com/www.engadget.com/media/feedlogo.gif?cachebust=true
</url>
<title>Engadget RSS Feed</title>
<link>https://www.engadget.com/rss.xml</link>
</image>
<language>en-us</language>
Then take more. ^.^
As long as your packet is below 2k or so (depending on MTU, and including the heavy HTTP headers) it's very cheap to request up to that; plus, if you are doing crawling anyway, you can do it in parallel.
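For instance, with Task.async_stream (a sketch reusing Scraper.scrape!/1 from above; the concurrency and timeout values are arbitrary):

# Sketch: crawl a list of %Scraper.Page{} structs concurrently.
pages
|> Task.async_stream(&Scraper.scrape!/1, max_concurrency: 10, timeout: 15_000)
|> Enum.map(fn {:ok, page} -> page end)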
You could indeed do an HTTP request for (the last) part of the response, but that seems complex (how do you calculate the byte offset + range?), and inefficient, given you'll want to do a request for the full page when the content has changed anyway, I assume.
If your question is whether you can know in advance that the page has changed (given previous responses didn't include an Expires header), then the answer is no. So the question becomes: how do I decide on an optimal crawling frequency?
how do you calculate the byte offset + range?
I thought it was as easy as Range: bytes=-10.
Apparently yes, according to the spec. Whether all sites support it is another question (the Range header in general).
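One way to probe that per site: a server that honours Range answers with status 206 Partial Content, while one that ignores it typically answers 200 with the full body. A sketch (supports_range?/1 is a made-up name):

# Sketch: detect Range support by requesting a tiny slice and checking
# for a 206 Partial Content response.
def supports_range?(url) do
  %HTTPoison.Response{status_code: status} =
    HTTPoison.get!(url, [{"Range", "bytes=0-99"}])

  status == 206
end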
You can't. You can't know it even if those headers are present: these headers only indicate that something may be different from the copy you already have, not how current that content may be.
But even to check for their original purpose, it may fail: I’ve often seen sites that just use the current time for the header, totally ignoring the last change of the content, or even worse, they have those headers, but put some constant in there. No changes for years.
So, the only way to be sure is to fetch everything and diff to the last known version.
To answer your actual question, about how to determine changes when those headers are missing while we assume they are done properly when they exist:
- If a HEAD request misses those headers, just GET the full resource and treat it as if it were new (see the sketch after this list); also send a message to the resource's maintainer explaining that those headers are missing and why they should put them in.
- Hope that the maintainer will actually fix the problem.
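Roughly like this (a sketch; fetch_fresh/1 is a made-up name, not part of HTTPoison):

# Sketch: probe with HEAD; when the validator headers are missing,
# fall back to a full GET and treat the content as new.
def fetch_fresh(url) do
  %HTTPoison.Response{headers: headers} = HTTPoison.head!(url)
  names = Enum.map(headers, fn {name, _value} -> String.downcase(name) end)

  if "etag" in names or "last-modified" in names do
    # Validators exist, so a conditional request can be used next time.
    :has_validators
  else
    %HTTPoison.Response{body: body} = HTTPoison.get!(url)
    {:new, body}
  end
end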
@NobbZ, can we not get the headers like this?
response = HTTPoison.get(url)

case response do
  {:ok, %HTTPoison.Response{headers: headers}} -> IO.inspect(headers)
end
HTTPoison.get/3 does a full GET request, body included. This way you save nothing by checking the headers first; you have already downloaded the full body…
If you really want to check only the headers without fetching the body, you need to issue a HEAD request, roughly like this:
case HTTPoison.request(:head, url) do
  {:ok, %HTTPoison.Response{headers: headers}} -> IO.inspect(headers)
  _ -> IO.puts "meeeeh…"
end
Maybe you could use the “If-Modified-Since” header, supplying the time you last scraped the page?
See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since
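A sketch of that (assumes last_crawl is a %DateTime{} in UTC, and that the server actually honours the header):

# Sketch: send the last crawl time and branch on 304 Not Modified.
since = Calendar.strftime(last_crawl, "%a, %d %b %Y %H:%M:%S GMT")

case HTTPoison.get!(url, [{"If-Modified-Since", since}]) do
  %HTTPoison.Response{status_code: 304} -> :not_modified
  %HTTPoison.Response{status_code: 200, body: body} -> {:modified, body}
end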
Servers that do not provide ETag or Last-Modified headers usually do not care for If-Modified-Since and deliver anyway…
For something like an RSS feed, can you check the Content-Length header value?
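E.g. via a HEAD request (a sketch; content_length/1 is a made-up name, and note that an unchanged length does not prove the content is unchanged):

# Sketch: read Content-Length without downloading the body. Header names
# are compared case-insensitively since servers vary.
def content_length(url) do
  %HTTPoison.Response{headers: headers} = HTTPoison.head!(url)

  Enum.find_value(headers, fn {name, value} ->
    String.downcase(name) == "content-length" && String.to_integer(value)
  end)
end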