
About 7 years working on the site: Customer Service, CMS, Frontend, Mixed Backend, Product Dev
Times People, Times Wire, Travel, T Mag, & other stuff that never made it
Ochs Chrome extension, Send to Instapaper by Section, iPad web apps, just fun
Twitter accounts (@nytalerts, @nytlite, @timespeople)
The Times is littered with rich undocumented data sources
Hidden gems, but there is a catch...
A little known hidden API that covers a lot of public domain material
Breaking news alerts, content, and less exciting things
Lets liberate some great images (again, a catch)
Ask me anything, I'll answer honestly if I can
I'll post all notes and links as I go along
In case you didn't already know
http://developer.nytimes.com
Article Search, News Wire, Times People, Times Tags, Real Estate, NY State, Movies, Most Popular and so on...
RSS is boring, but still helpful in places. That will follow later.
♦
You don't need a proxy
Directly hook into the most recent content with externally hosted Javascript
We don't list this anywhere
Look at the META tags
<meta name="sectionfront_jsonp" content="http://json8.nytimes.com/pages/politics/index.jsonp">
jsonFeedCallback({
"status" : "ok",
"title" : "NYT > Politics",
"link" : "http://www.nytimes.com/pages/politics/index.html",
"lastBuildDate" : "Tue, 22 Nov 2011 03:33:56 GMT\n",
"items" : [
{
"title" : "Panel Fails to Reach Deal on...",
"link" : "http://www.nytimes.com/2011/11/...",
"guid" : "http://www.nytimes.com/2011/11/...",
"description" : "President Obama promised...",
"byline" : "By JENNIFER STEINHAUER, HELENE...",
"pubdate" : "Tue, 22 Nov 2011 03:26:42 GMT",
"media" : [
Not a standard Section, but its there...
http://json8.nytimes.com/pages/mostemailed/index.jsonparts, books, arts/music, theater, dining, health, science, sports, technology, fashion, etc.
var nav ={
"imgHost":"http:\/\/graphics8.nytimes.com\/",
"sections":
{"section_name":"world",
"display_name":"World",
"dropdown_flag":true,
"slides":[
{"imgSrc":"images\/2011\/10\/07...",
"popup":false,
"isSelect":false,
"url":"http:\/\/www.nytimes.com\/...",
"headline":"Future of Euro Coul d..."
},...
If storing as JSON in Blobs you don't need to parse RSS/XML
Also covers Most Emailed and Homepage
Created by Derek Gottfrid back in August 2007 (@derekg)
Archival page data from 1856 to 1922
Not behind auth. Completely accessible to public
The API comes down to two simple things really; JSON and computable URLs
Thats it. Done.
The content of this is a JSON array with the available months:
["01","02","03","04","05","06","07","08","09","10","11","12"]
Lets skip to to December 3rd
http://s3.amazonaws.com/page.archive.nytimes.com/1907/12/03/index.jsThe content of this is a JSON array with the available Pages:
[ "P1" , "P2" , "P3" , "P4" , "P5" , "P6" , "P7" , "P8" , "P9" , "P10" , "P11" , "P12" , "P13" , "P14" , "P15" , "P16" ]
YYYY/MM/DD/{Page ID from index.js}/data.js
Retrieve the entire page data by constructing this URL:
http://s3.amazonaws.com/page.archive.nytimes.com/1907/12/03/P1/data.jsIts big... and not necessarily clear what some values explain
Open the link to show them what you mean
Javascript array of article zone information thats appears in the newspaper page
An article may take up multiple zones - since the data is de-normalized
Its big... and not necessarily clear what some values explain
You will get duplicate data
[ {"articleid": "106157325",
"bottom": 896,
"col": "1",
"dat": "September 18, 1884",
"hdl": "GOV. CLEVELAND"S VISITORS.",
"left": 33,
"lp": "ALBANY, N.Y., Sept. 17.--There were many...",
"pdate": "18840918",
"pdd": "18",
"pdm": "09",
"pdy": "1884",
"pg": "5",
"right": 144,
"secpg": "5",
"sortpg": "0005",
"tom": "Article",
"top": 865,
"url": "1884\\/09\\/18\\/106157325... "} ... ]
YYYY/MM/DD/{Page ID from index.js}/05.jpg
http://s3.amazonaws.com/page.archive.nytimes.com/1910/09/19/P1/05.jsYYYY/MM/DD/{Page ID from index.js}/10.jpg
http://s3.amazonaws.com/page.archive.nytimes.com/1910/09/19/P1/10.jsNote to Self: Open the images
Not all fields will always be present - left,top,right,bottom - align with the large image 05.jpg
Article ID is always present
URL can be used to calculate where the PDF lives but you will need the private/public secrets to generate a valid url :(
You will get duplicate data
The New York Times wasn't always published 7 days a week
Union strikes and historical reasons
See it in action:
http://timesmachine.nytimes.com/browser♦
Dedicated feeds for blogs and other Streams, just dig
It is RSS
Essentially, just append: ?rss=1
http://video.nytimes.com/video/playlist/ny-region/city-room/1194811622245/index.html?rss=1Find a Video in the section you want, View Source, and grab the RSS link
Build a list of Playlist IDs, and then construct the URL
http://video.nytimes.com/video/playlist/WHATEVER/1247463985977/WHATEVER.html?rss=1
The Times is an investor in, and uses Brightcove for their video hosting, player skin, etc
Read the Brightcove API docs. Potential for fun.
A Boxee channel for NYT based on this would be easy... just say'in'
There are many times when you just need data and content
Simple. Does the job.
$html = file_get_html('http://www.nytimes.com/');
Never tried it, don't know. Good luck.
There are many times when you just need data and content
Interstitial ads can affect scraping
Append ?adxnnl=1
http://nytimes.com/?adxnnl=1Disclaimer: Not certain this still works
Best bet is the get full text
Append ?pagewanted=all to any Article URL
♦
From the "Todays Paper" page:
http://www.nytimes.com/pages/todayspaper/index.html
NYT - NYC Edition
NYT - NYC Edition
http://graphics8.nytimes.com/images/2011/11/21/nytfrontpage/scan_paper.jpg http://www.nytimes.com/images/2011/11/21/nytfrontpage/scan.jpgVariations for various editions and the IHT (IHT links to a PDF)
scan_paper.jpg = 175x341px
scan.jpg = 348x640px
Newspaper size changed a year or so ago
There is a treasure throve of beautiful images
Amazing large images (1024x640) images are published on the web site
Rights. Ownership.
By contract the NYT can only have photos and images on NYT domains (nytiimes.com, iht.com)
Legally obligated to enforce PHOTOGRAPHERS rights
So with that said...
People work hard (a.k.a Risk Their Lives) for these images
I have only used the following ifnormation for prototyping purposes
To "show whats possible" in legally unencumbered world
Seriously.
Numerous image sizes (Width * Height - Width is Fixed but Height may vary)
Given any image URL, all variations are computable!
Not all image types are guaranteed to be present for each image
Change image name, Sutton3-thumbStandard.jpg, to:
Given a feed of images, check for presence of Jumbo images without downloading images in full
Link to sample code $ch = curl_init ($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_RANGE, "0-10240");
♦
Embed params usually contain data sources
Data sources sometimes link out to even more data sources
Yet again, View Source is your friend
var so = new SWFObject('http://www.nytimes.com/packages/flash/multimedia/swfs/AS3MultiloaderDark.swf','nytSWF','970','750',9,'#1A1A1A');
so.addParam('allowScriptAccess','always');
so.addParam('allowFullScreen','true');
so.addParam('wmode','opaque');
so.addParam('BASE','');
so.addVariable('contentPath','http://graphics8.nytimes.com/.../lens_hp.swf');
so.addVariable('allowCaching',true);
so.addVariable('dataURL','http://lens.blogs.nytimes.com/asset-data/');
so.write('swfcontainer');
http://lens.blogs.nytimes.com/asset-data/
<posts>
<post>
<title>Pictures of the Day: Egypt and Elsewhere</title>
<byline>By THE NEW YORK TIMES</byline>
<date>November 21, 2011, 04:27 pm</date>
<keywords></keywords>
<tags>Afghanistan,Cairo,Damir Sagolj,Egypt,Kevin Frayer,Lib...</tags>
<excerpt>Photographs from Egypt, Sri Lanka, Afghanistan and...</excerpt>
<url>http://lens.blogs.nytimes.com/2011/11/21/pictures-of-the-day-egypt-and-elsewhere-15/</url>
<photo>
<url>http://graphics8.nytimes.com/images/2011/11/21/blogs/20111112POD-slide-NMCI/20111112POD-slide-NMCI-custom3.jpg</url>
</photo>
<asset>http://graphics8.nytimes.com/packages/flash/multimedia/TEMPLATES/Lens/data/20111121POD.xml</asset>
</post>
/packages/flash/multimedia/TEMPLATES/Lens/data/20111121POD.xml
<photos>
<slide>
<photo>
<credit>Moises Saman for The New York Times</credit>
<caption>A wounded protester was treated by Red Crescent medics at...</caption>
<url>http://graphics8.nytimes.com/images/2011/11/21/blogs/20111112POD-slide-8BGR/20111112POD-slide-8BGR-jumbo.jpg</url>
<width>1024</width>
<height>707</height>
</photo>
<related>
<link>
<type>article</type>
<label/>
<url/>
</link>
</related>
</slide>
...
The stuff of iframes, 3rd party shells, modules
Dependent on larger page CSS and JS
Some are no longer in use but Cron job keeps on churning
<div class="abColumn breakingNewsAlert">
Breaking News
<span class="timestamp">9:14 AM ET:</span>
<h2>Something major shapes the world</h2>
</div>
<!--#breakingNewsAlert -->
http://graphics8.nytimes.com/bcvideo/1.0/iframe/bcHomeIframe.html?&playlistId=1194811622188
Lots of iframes out there...
Ideas for Hacks?
API stuff?
/
#