ted-transcript-crawler

A crawler that automatically downloads the transcripts of all TED talks. It was built with Scrapy, based on this tutorial https://blakeboswell.github.io/2016/scrapy-tedtalk/, and has been modified to work with the latest version of the TED website.

To run:

  1. Install Scrapy.
  2. Download or clone the repo.
  3. Run cd ted-transcript-crawler/ted
  4. Run scrapy crawl ted_crawl

Output:

The output is stripped of all HTML elements and contains only plain text and whitespace. It is saved in JSON Lines format (one JSON object per line).
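
Since the output is in JSON Lines format, it can be read back one record per line. The following is a minimal sketch, not part of this repo: the helper name `read_json_lines` and the file name used in the usage note are hypothetical, and no particular field names are assumed because the README does not list them.

```python
import json
from pathlib import Path
from typing import Iterator


def read_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file.

    Hypothetical helper for illustration; the crawler itself does not ship it.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `for record in read_json_lines("transcripts.jl"): print(record.keys())` would show which fields each record carries, assuming the crawl output was saved under that (hypothetical) file name.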