I am trying to crawl a few sites using HTTPoison. What's the best way to check whether the crawler is getting the latest content if the Last-Modified and ETag headers are missing?
Maybe try computing a hash of the last crawled page on this endpoint.
defmodule Scraper do
  defmodule Page do
    defstruct [:hash, :content, :path]
  end

  def scrape!(%Page{path: path} = old_page) do
    %HTTPoison.Response{body: body} = HTTPoison.get!(path)
    new_page = %Page{hash: :crypto.hash(:md5, body), content: body, path: path}

    case stale?(new_page, old_page) do
      # same hash, nothing changed since the last crawl: do nothing?
      true -> :noop
      # hash changed: side effect (update db or something)
      false -> :update
    end

    new_page
  end

  # Returns true when the hashes match, i.e. the content has not changed.
  def stale?(new_page, old_page)
  def stale?(%Page{hash: hash}, %Page{hash: hash}), do: true
  def stale?(_, _), do: false
  ...
end
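For reference, using it might look like this (a sketch; the first crawl just seeds the struct with a path and no hash):

first = Scraper.scrape!(%Scraper.Page{path: "http://www.engadget.com/rss.xml"})
later = Scraper.scrape!(first)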
Thanks @idiot. But if I crawl, say, http://www.engadget.com/rss.xml, then how would the hash determine whether the site has updated or not without checking the headers?
The hash would be different, I think. Or am I misunderstanding what you mean?
@idiot… let me explain… I am parsing the above link to get new entries. But I don't want to fetch the content if it hasn't been changed or modified since the last crawl, because that would waste bandwidth on both ends. So the only way left to crawl it efficiently is by checking the ETag or Last-Modified header. Now, if both of them are missing, what other choice do I have to check for new content? Does that make it clear?
Does that make it clear?
Yes, I understand now.
Now, if both of them are missing, what other choice do I have to check for new content?
Don't know; it depends on the website. To avoid wasting bandwidth, you can request only part of the page (like the first 100 bytes with the Range header) and compute the hash of that.
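A rough sketch of that idea (prefix_hash/2 is a made-up helper, and it assumes the server honours Range; one that ignores it will just send the whole body):

# Sketch only: hash the first `bytes` bytes of the resource.
defmodule PrefixCheck do
  def prefix_hash(url, bytes \\ 100) do
    # "bytes=0-99" requests the first 100 bytes (range ends are inclusive).
    %HTTPoison.Response{body: body} =
      HTTPoison.get!(url, [{"Range", "bytes=0-#{bytes - 1}"}])

    :crypto.hash(:md5, body)
  end
end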
But that would only detect changes in the first 100 bytes of the page, right?
Which is fine if they prepend stuff. If they append, then grab the last bit?
But the first 100 bytes might always be this (just an example):
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" xmlns:itunes="https://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
<channel>
<title>Engadget RSS Feed</title>
<link>https://www.engadget.com/rss.xml</link>
<image>
<url>
https://www.blogsmithmedia.com/www.engadget.com/media/feedlogo.gif?cachebust=true
</url>
<title>Engadget RSS Feed</title>
<link>https://www.engadget.com/rss.xml</link>
</image>
<language>en-us</language>
Then take more. ^.^
As long as your packet is below 2k or so (depending on MTU, and including the heavy HTTP headers) it's very cheap to request up to that; plus, if you are doing crawling anyway, you can do it in parallel.
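For instance, with Task.async_stream (a sketch reusing Scraper.scrape!/1 from above; the concurrency and timeout values are arbitrary):

# Sketch: crawl a list of %Scraper.Page{} structs concurrently.
pages
|> Task.async_stream(&Scraper.scrape!/1, max_concurrency: 10, timeout: 15_000)
|> Enum.map(fn {:ok, page} -> page end)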
You could indeed do an HTTP request for (the last) part of the response, but that seems complex (how do you calculate the byte offset + range?), and inefficient, given you'll want to do a request for the full page when the content has changed anyway, I assume.
If your question is whether you can know in advance that the page has changed (given previous responses didn't include an Expires header), then the answer is no. So the question becomes: how do I decide on an optimal crawling frequency?
how do you calculate the byte offset + range?
I thought it was as easy as Range: bytes=-10.
Apparently yes, according to the spec. Whether all sites support it is another question (the Range header in general).
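One way to probe that per site: a server that honours Range answers with status 206 Partial Content, while one that ignores it typically answers 200 with the full body. A sketch (supports_range?/1 is a made-up name):

# Sketch: detect Range support by requesting a tiny slice and checking
# for a 206 Partial Content response.
def supports_range?(url) do
  %HTTPoison.Response{status_code: status} =
    HTTPoison.get!(url, [{"Range", "bytes=0-99"}])

  status == 206
end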
You can't. You can't know it even if those headers are present: these headers only indicate that something may be different from the copy you already have, not how current that content may be.
But even to check for their original purpose, it may fail: I’ve often seen sites that just use the current time for the header, totally ignoring the last change of the content, or even worse, they have those headers, but put some constant in there. No changes for years.
So, the only way to be sure is to fetch everything and diff to the last known version.
To answer your actual question, about how to determine changes when those headers are missing while we assume they are done properly when they exist:
- If a HEAD request misses those headers, just GET the full resource and treat it as if it were new (see the sketch after this list); also send a message to the resource's maintainer explaining that those headers are missing and why they should put them in.
- Hope that the maintainer will actually fix the problem.
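Roughly like this (a sketch; fetch_fresh/1 is a made-up name, not part of HTTPoison):

# Sketch: probe with HEAD; when the validator headers are missing,
# fall back to a full GET and treat the content as new.
def fetch_fresh(url) do
  %HTTPoison.Response{headers: headers} = HTTPoison.head!(url)
  names = Enum.map(headers, fn {name, _value} -> String.downcase(name) end)

  if "etag" in names or "last-modified" in names do
    # Validators exist, so a conditional request can be used next time.
    :has_validators
  else
    %HTTPoison.Response{body: body} = HTTPoison.get!(url)
    {:new, body}
  end
end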
@NobbZ, can we not get the headers like this?
response = HTTPoison.get(url)

case response do
  {:ok, %HTTPoison.Response{headers: headers}} -> IO.inspect(headers)
end
HTTPoison.get/3 does a full GET request, body included. This way you save nothing by checking the headers first; you have already downloaded the full body…
If you really want to check only the headers without fetching the body, you need to issue a HEAD request, roughly like this:
case HTTPoison.request(:head, url) do
  {:ok, %HTTPoison.Response{headers: headers}} -> IO.inspect(headers)
  _ -> IO.puts "meeeeh…"
end
Maybe you could use the “If-Modified-Since” header, supplying the time you last scraped the page?
See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since
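A sketch of that (assumes last_crawl is a %DateTime{} in UTC, and that the server actually honours the header):

# Sketch: send the last crawl time and branch on 304 Not Modified.
since = Calendar.strftime(last_crawl, "%a, %d %b %Y %H:%M:%S GMT")

case HTTPoison.get!(url, [{"If-Modified-Since", since}]) do
  %HTTPoison.Response{status_code: 304} -> :not_modified
  %HTTPoison.Response{status_code: 200, body: body} -> {:modified, body}
end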
Servers that do not provide ETag or Last-Modified headers usually do not care for If-Modified-Since and deliver anyway…
For something like an RSS feed, can you check the Content-Length header value?
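E.g. via a HEAD request (a sketch; content_length/1 is a made-up name, and note that an unchanged length does not prove the content is unchanged):

# Sketch: read Content-Length without downloading the body. Header names
# are compared case-insensitively since servers vary.
def content_length(url) do
  %HTTPoison.Response{headers: headers} = HTTPoison.head!(url)

  Enum.find_value(headers, fn {name, value} ->
    String.downcase(name) == "content-length" && String.to_integer(value)
  end)
end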