@donohoe

NYT / Nose to Tail

 

http://donohoe.io

@donohoe

Who am I?

  1. Former-Times employee

    About 7 years working on the site: Customer Service, CMS, Frontend, Mixed Backend, Product Dev

  2. Internally 'hacked' for Prototypes

    Times People, Times Wire, Travel, T Mag, & other stuff that never made it

  3. Externally do it for fun

    Ochs Chrome extension, Send to Instapaper by Section, iPad web apps, just fun

    Twitter accounts (@nytalerts, @nytlite, @timespeople)

What is covered in this session?

  1. JSON, JS, RSS

    The Times is littered with rich undocumented data sources

  2. Computing image URLs

    Hidden gems, but there is a catch...

  3. Times Machine API

    A little known hidden API that covers a lot of public domain material

and more...

  1. Pre-generated Markup

    Breaking news alerts, content, and less exciting things

  2. Slideshows

    Lets liberate some great images (again, a catch)

  3. Random stuff & your Questions

    Ask me anything, I'll answer honestly if I can

Relax

  1. DON'T MAKE NOTES

    I'll post all notes and links as I go along

Developer Network

  1. NYTimes has a bucket full of API's'

    In case you didn't already know

  2. Sign-up

    http://developer.nytimes.com

  3. Actually well documented and reliable

    Article Search, News Wire, Times People, Times Tags, Real Estate, NY State, Movies, Most Popular and so on...

  4. That is all I have to say about that

Lets get started

  1. JS, JSONP

    RSS is boring, but still helpful in places. That will follow later.

All Sectionfronts exists in JSON and JSONP

  1. Not just RSS

  2. JSONP Rules

    You don't need a proxy

    Directly hook into the most recent content with externally hosted Javascript

  3. We don't list this anywhere

Switch URL between .json and .jsonp for Callback function

jsonFeedCallback({
   "status" : "ok",
   "title" : "NYT > Politics",
   "link" : "http://www.nytimes.com/pages/politics/index.html",
   "lastBuildDate" : "Tue, 22 Nov 2011 03:33:56 GMT\n",
   "items" : [
      {
         "title" : "Panel Fails to Reach Deal on...",
         "link" : "http://www.nytimes.com/2011/11/...",
         "guid" : "http://www.nytimes.com/2011/11/...",
         "description" : "President Obama promised...",
         "byline" : "By JENNIFER STEINHAUER, HELENE...",
         "pubdate" : "Tue, 22 Nov 2011 03:26:42 GMT",
         "media" : [

Examples

  1. Homepage (JSON)

    http://json8.nytimes.com/pages/index.json
  2. World (JSONP)

    http://json8.nytimes.com/pages/world/index.jsonp
  3. Most Emailed

    Not a standard Section, but its there...

    http://json8.nytimes.com/pages/mostemailed/index.jsonp
  4. You get the idea...

    arts, books, arts/music, theater, dining, health, science, sports, technology, fashion, etc.

JS & Slideshows

  1. There is no Slideshows feed

  2. There is a Navigation element

JS & Slideshows

  1. Courtesy of View Source

    http://graphics8.nytimes.com/js/multimedia/slideshow/navData.js
var nav ={
	"imgHost":"http:\/\/graphics8.nytimes.com\/",
	"sections":
		{"section_name":"world",
		"display_name":"World",
		"dropdown_flag":true,
		"slides":[
			{"imgSrc":"images\/2011\/10\/07...",
			"popup":false,
			"isSelect":false,
			"url":"http:\/\/www.nytimes.com\/...",
				"headline":"Future of Euro Coul	d..."
			},...

JS & Slideshows

  1. Handy for simple web apps

  2. Scraping

    If storing as JSON in Blobs you don't need to parse RSS/XML

  3. Not just Sections

    Also covers Most Emailed and Homepage

Times Machine API

  1. Created by Derek Gottfrid back in August 2007 (@derekg)

  2. Archival page data from 1856 to 1922

  3. Not behind auth. Completely accessible to public

Times Machine API

  1. The API comes down to two simple things really; JSON and computable URLs

  2. Thats it. Done.

Times Machine API

Times Machine API

Times Machine API

Times Machine API

Times Machine API

Year: 1907

  1. http://s3.amazonaws.com/page.archive.nytimes.com/1907/index.js
  2. The content of this is a JSON array with the available months:

    ["01","02","03","04","05","06","07","08","09","10","11","12"]

You see where this is going?

  1. Lets skip to to December 3rd

    http://s3.amazonaws.com/page.archive.nytimes.com/1907/12/03/index.js
  2. The content of this is a JSON array with the available Pages:

    [ "P1" , "P2" , "P3" , "P4" , "P5" , "P6" , "P7" , "P8" , "P9" , "P10" , "P11" , "P12" , "P13" , "P14" , "P15" , "P16" ]

Times Machine API

  1. YYYY/MM/DD/{Page ID from index.js}/data.js

  2. Retrieve the entire page data by constructing this URL:

    http://s3.amazonaws.com/page.archive.nytimes.com/1907/12/03/P1/data.js
  3. Its big... and not necessarily clear what some values explain

  4. Note to Self

    Open the link to show them what you mean

What this means is...

  1. Javascript array of article zone information thats appears in the newspaper page

  2. An article may take up multiple zones - since the data is de-normalized

  3. Its big... and not necessarily clear what some values explain

  4. You will get duplicate data

Times Machine API: data.js

[ {"articleid": "106157325",
 "bottom": 896,
 "col": "1",
 "dat": "September 18, 1884",
 "hdl": "GOV. CLEVELAND"S VISITORS.",
 "left": 33,
 "lp": "ALBANY, N.Y., Sept. 17.--There were many...",
 "pdate": "18840918",
 "pdd": "18",
 "pdm": "09",
 "pdy": "1884",
 "pg": "5",
 "right": 144,
 "secpg": "5",
 "sortpg": "0005",
 "tom": "Article",
 "top": 865,
 "url": "1884\\/09\\/18\\/106157325... "} ... ]

What this means is...

  1. Large Image width is 890px wide

    YYYY/MM/DD/{Page ID from index.js}/05.jpg

    http://s3.amazonaws.com/page.archive.nytimes.com/1910/09/19/P1/05.js

  2. Small Image width is 290px

    YYYY/MM/DD/{Page ID from index.js}/10.jpg

    http://s3.amazonaws.com/page.archive.nytimes.com/1910/09/19/P1/10.js

  3. Note to Self: Open the images

But....

  1. Co-ordinates

    Not all fields will always be present - left,top,right,bottom - align with the large image 05.jpg

  2. Article ID is always present

  3. URL can be used to calculate where the PDF lives but you will need the private/public secrets to generate a valid url :(

  4. You will get duplicate data

Don't make assumptions!

  1. The New York Times wasn't always published 7 days a week

    Union strikes and historical reasons

  2. See it in action:

    http://timesmachine.nytimes.com/browser

Video

  1. Dig

    Dedicated feeds for blogs and other Streams, just dig

  2. Sorry

    It is RSS

  3. Essentially, just append: ?rss=1

    http://video.nytimes.com/video/playlist/ny-region/city-room/1194811622245/index.html?rss=1
    http://video.nytimes.com/video/playlist/style/on-the-street/1247463985977/index.html?rss=1

Video: URL-tastic

  1. View Source

    Find a Video in the section you want, View Source, and grab the RSS link

  2. Collect

    Build a list of Playlist IDs, and then construct the URL

  3. Example

    http://video.nytimes.com/video/playlist/WHATEVER/1247463985977/WHATEVER.html?rss=1

Video

  1. Brightcove

    The Times is an investor in, and uses Brightcove for their video hosting, player skin, etc

  2. Read the Brightcove API docs. Potential for fun.

  3. A Boxee channel for NYT based on this would be easy... just say'in'

Video: Other Sources

  1. YouTube

    http://www.youtube.com/user/TheNewYorkTimes
  2. Vimeo

    http://vimeo.com/nytimes

Scraping

There are many times when you just need data and content

  1. PHP

    Simple PHP DOM Parser

    Simple. Does the job.

    $html = file_get_html('http://www.nytimes.com/');
  2. Ruby

    Never tried it, don't know. Good luck.

Scraping

There are many times when you just need data and content

  1. Homepage

    Interstitial ads can affect scraping

    Append ?adxnnl=1

    http://nytimes.com/?adxnnl=1

    Disclaimer: Not certain this still works

  2. Articles

    Best bet is the get full text

    Append ?pagewanted=all to any Article URL

Frontpage Snapshots

From the "Todays Paper" page:

http://www.nytimes.com/pages/todayspaper/index.html

Frontpage Snapshots

NYT - NYC Edition

  1. NYT - NYC Edition

    http://graphics8.nytimes.com/images/2011/11/21/nytfrontpage/scan_paper.jpg http://www.nytimes.com/images/2011/11/21/nytfrontpage/scan.jpg

  2. View Source

    Variations for various editions and the IHT (IHT links to a PDF)

Frontpage Snapshots

  1. Note the date-based URL

    http://graphics8.nytimes.com/images/2011/11/21/nytfrontpage/scan_paper.jpg http://www.nytimes.com/images/2011/11/21/nytfrontpage/scan.jpg
  2. Sizes

    scan_paper.jpg = 175x341px

    scan.jpg = 348x640px

  3. Whitespace

    Newspaper size changed a year or so ago

Frontpage Snapshots

Frontpage Snapshots: Limitation

  1. 2002

    Only goes back to January 24th 2002

    Jan 23rd, Jan 24th

Photographs & Illustrations

There is a treasure throve of beautiful images

  1. The Good News

    Amazing large images (1024x640) images are published on the web site

  2. The Bad News

    Rights. Ownership.

    By contract the NYT can only have photos and images on NYT domains (nytiimes.com, iht.com)

    Legally obligated to enforce PHOTOGRAPHERS rights

Photographs & Illustrations

So with that said...

  1. Respect

    People work hard (a.k.a Risk Their Lives) for these images

  2. Prototype

    I have only used the following ifnormation for prototyping purposes

    To "show whats possible" in legally unencumbered world

    Seriously.

Photographs & Illustrations

Numerous image sizes (Width * Height - Width is Fixed but Height may vary)

  1. Thumbnails (75x75)
  2. Thumbnail Wide (190x126)
  3. Inline (190x123, 190x297 etc)
  4. Sectionfront Span (395x286)
  5. Homepage Medium (337x250)
  6. Large or Span (600x350px)
  7. Popup (650x421)
  8. Jumbo (1024x663) - the Holy Grail

Photographs & Illustrations

Photographs & Illustrations

Photographs & Illustrations

  1. Given an Image Thumbnail (75x75)

    http://graphics8.nytimes.com/images/2011/11/13/books/review/Sutton3/Sutton3-thumbStandard.jpg

Photographs & Illustrations

Change image name, Sutton3-thumbStandard.jpg, to:

  1. Sutton-thumbWide.jpg (190x126: Sometimes present)

  2. Sutton-articleInline.jpg (190x123, 190x297 etc: Sometimes present)

  3. Sutton-hpMedium.jpg (337x250: Rare, only if featured on Homepage)

  4. Sutton-sfSpan.jpg (395x286: Occasional)

  5. Sutton-articleLarge.jpg (600x350px: Most of the time)

  6. Sutton-popup.jpg (650x421: Common)

  7. Sutton-jumbo.jpg (1024x663: Most of the time!)

Photographs & Illustrations

Photographs & Illustrations; PHP

Given a feed of images, check for presence of Jumbo images without downloading images in full

Link to sample code
    $ch = curl_init ($url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
    curl_setopt($ch, CURLOPT_RANGE, "0-10240");

LENS Blog

LENS Blog

  1. Flash

    Embed params usually contain data sources

  2. Data

    Data sources sometimes link out to even more data sources

LENS Blog

  1. http://lens.blogs.nytimes.com

    Yet again, View Source is your friend

LENS Blog

var so = new SWFObject('http://www.nytimes.com/packages/flash/multimedia/swfs/AS3MultiloaderDark.swf','nytSWF','970','750',9,'#1A1A1A');
so.addParam('allowScriptAccess','always');
so.addParam('allowFullScreen','true');
so.addParam('wmode','opaque');
so.addParam('BASE','');
so.addVariable('contentPath','http://graphics8.nytimes.com/.../lens_hp.swf');
so.addVariable('allowCaching',true);
so.addVariable('dataURL','http://lens.blogs.nytimes.com/asset-data/');
so.write('swfcontainer');

LENS Blog

http://lens.blogs.nytimes.com/asset-data/

<posts>
 <post>
  <title>Pictures of the Day: Egypt and Elsewhere</title> 
  <byline>By THE NEW YORK TIMES</byline> 
  <date>November 21, 2011, 04:27 pm</date>
  <keywords></keywords>
  <tags>Afghanistan,Cairo,Damir Sagolj,Egypt,Kevin Frayer,Lib...</tags> 
  <excerpt>Photographs from Egypt, Sri Lanka, Afghanistan and...</excerpt> 
  <url>http://lens.blogs.nytimes.com/2011/11/21/pictures-of-the-day-egypt-and-elsewhere-15/</url>
  <photo>
   <url>http://graphics8.nytimes.com/images/2011/11/21/blogs/20111112POD-slide-NMCI/20111112POD-slide-NMCI-custom3.jpg</url>
  </photo>
  <asset>http://graphics8.nytimes.com/packages/flash/multimedia/TEMPLATES/Lens/data/20111121POD.xml</asset>
</post>

LENS Blog

/packages/flash/multimedia/TEMPLATES/Lens/data/20111121POD.xml

<photos>
  <slide>
    <photo>
      <credit>Moises Saman for The New York Times</credit>
      <caption>A wounded protester was treated by Red Crescent medics at...</caption>
      <url>http://graphics8.nytimes.com/images/2011/11/21/blogs/20111112POD-slide-8BGR/20111112POD-slide-8BGR-jumbo.jpg</url>
      <width>1024</width>
      <height>707</height>
    </photo>
    <related>
      <link>
        <type>article</type>
        <label/>
        <url/>
      </link>
    </related>
    </slide>
  ...


Fragments & Includes

  1. Publicly accessible but little known

    The stuff of iframes, 3rd party shells, modules

  2. Unstyled

    Dependent on larger page CSS and JS

  3. Cron

    Some are no longer in use but Cron job keeps on churning

Fragments & Includes

<div class="abColumn breakingNewsAlert">
    Breaking News
    <span class="timestamp">9:14 AM ET:</span>
    <h2>Something major shapes the world</h2>
</div>
<!--#breakingNewsAlert -->

Fragments & Includes

Fragments & Includes

Fragments & Includes

Fragments & Includes

Questions...

http://donohoe.io/projects/nytimes/2011-hack-day/deck/

http://bit.ly/nyt-internals

/

#