I’m trying to automatically retrieve metadata for published journal articles (author, title, publication date, etc.) from their webpages. I want to avoid scraping, since it is against most publishers’ ToS. The best solution I have so far is:

  1. Get the <title> tag from the page at the URL.
  2. Search CrossRef for a matching title. Get the DOI of the top result.
  3. Get the rest of the information from CrossRef.
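
For reference, the steps above can be sketched roughly as follows. This is a minimal sketch, not my exact code: it assumes the public CrossRef REST API at `api.crossref.org/works` with the `query.bibliographic` parameter, and the helper names (`extract_title`, `crossref_search_url`, `top_result`, `lookup`) are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Step 1: return the page title with whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join(parser.title.split())


def crossref_search_url(title, rows=1):
    """Step 2: build a CrossRef search URL for the given title."""
    query = urllib.parse.urlencode({"query.bibliographic": title, "rows": rows})
    return "https://api.crossref.org/works?" + query


def top_result(response):
    """Steps 2 and 3: return the first matching record (it carries the DOI), or None."""
    items = response.get("message", {}).get("items", [])
    return items[0] if items else None


def lookup(page_url):
    """Run all three steps for one article page (performs network requests)."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    title = extract_title(html)
    with urllib.request.urlopen(crossref_search_url(title)) as resp:
        return top_result(json.load(resp))
```

The record returned by `top_result` is a dictionary whose `"DOI"` key holds the DOI of the best match, which is exactly the part that sometimes turns out to be the wrong article.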

Unfortunately, the CrossRef search sometimes returns the wrong journal article. This happens especially with newer publications, which have a DOI but don’t seem to show up in CrossRef search at first.

I’ve also tried using a regex to find a DOI on the page, but it also matches every DOI listed in the references, so I can’t tell which one belongs to the article itself.
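
To illustrate the regex problem, here is a rough sketch of that approach. The pattern is only a common heuristic for DOI-like strings, not an official grammar, and the DOIs in the sample text are invented placeholders.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix that
# runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return every distinct DOI-like string in text, in order of appearance."""
    seen = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in seen:
            seen.append(doi)
    return seen


# Hypothetical page text: one DOI for the article, two from its reference list.
page_text = (
    "This article: https://doi.org/10.1000/abc123. "
    "References: doi:10.2000/ref.one, 10.3000/ref-two"
)
print(find_dois(page_text))
```

All three DOIs come back in one flat list, and nothing in the match itself says which one identifies the page’s own article.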

Since the DOI points to the ultimate URL, is there any way to reverse the process to get the DOI from the URL?
