PhpUrlScanner

A set of classes forming a command-line utility that scans an entire site, a subfolder of a site, or a set of given URLs. The library is designed to be simple and fast: analyzing the Silex site, for example, takes 4 seconds for 150 URLs (68 crawled resources and 82 curled links).

Use

The system requires Goutte in the same folder (https://github.com/fabpot/Goutte.git).

The class can be used directly from the command line:

php scanner.php url [url2] [url3] ... [urlN]

Simple tests

wget https://github.com/fabpot/Goutte/blob/b966bcbd7220bc5cbfe0d323e22499aa022a6c75/goutte.phar?raw=1 -O goutte.phar && wget https://raw.github.com/nicodmf/PhpUrlScanner/master/scanner.php && php scanner.php http://getcomposer.org

In this example, the 404 responses are normal, as those pages require authentication and refuse a simple connection that does not follow a redirect to another URL.

You can also test another URL:

php scanner.php http://silex.sensiolabs.org/

Use with PHP code

Three static methods provide the scan process:

<?php
 // Scan starting from $url and return the collected Resources object.
 Scanner::collect_and_return($url, $test_externals, $with_subpath, $with_sub_domain, $max_depth)
 // Scan starting from $url and save the serialized result to $file.
 Scanner::collect_and_save($url, $file, $test_externals, $with_subpath, $with_sub_domain, $max_depth)
 // Return the HTTP status of a single URL.
 Scanner::get_status($url)

As the scan takes time, the simplest way is to collect the results and save them in a serialized file. This file can later be unserialized to permit functional analysis.

The serialized file contains the "Resources" object created in the scan process.
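
For instance, a minimal sketch of this save-then-analyze workflow (the URL, file name, and parameter values here are illustrative assumptions, not defaults of the library):

<?php
 // Assumes scanner.php from this repository is in the current folder.
 require_once 'scanner.php';

 // Scan and serialize the result (parameter values are assumptions).
 Scanner::collect_and_save('http://silex.sensiolabs.org/', 'silex.scan', true, true, false, 3);

 // Later: restore the Resources object for functional analysis.
 $resources = unserialize(file_get_contents('silex.scan'));
 var_dump($resources);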

Internal

The scan is a loop which crawls HTML files, identifies links (for now, only links in anchor tags), then analyzes and sorts them: external links are checked with a simple cURL request, while internal resources are crawled with Goutte.
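
As a rough illustration, here is a simplified sketch of such a loop. This is not the library's actual code; the base URL is an assumption, and goutte.phar is expected in the current folder as in the "Simple tests" section above.

<?php
 require_once 'goutte.phar';

 $base   = 'http://example.com/';   // assumed site to scan
 $queue  = array($base);            // internal URLs still to crawl
 $seen   = array();
 $client = new Goutte\Client();

 while ($url = array_shift($queue)) {
     if (isset($seen[$url])) { continue; }
     $seen[$url] = true;

     // Internal resources are crawled with Goutte.
     $crawler = $client->request('GET', $url);

     // For now, only links in anchor tags are collected.
     foreach ($crawler->filter('a')->links() as $link) {
         $target = $link->getUri();
         if (strpos($target, $base) === 0) {
             $queue[] = $target;    // internal: crawl later
         } else {
             // External links are only checked with a simple cURL request.
             $ch = curl_init($target);
             curl_setopt($ch, CURLOPT_NOBODY, true);
             curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
             curl_exec($ch);
             $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
             curl_close($ch);
             // $status would be recorded for the report.
         }
     }
 }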

Future

If the library interests some people:

  • Integration of the Symfony Console component
  • Separation into one file per class
  • Better Composer/Packagist integration
  • Compilation as a phar
  • Use of child classes for the storage classes (resources/resource/url)
  • Scanning of links in script tags, headers, and CSS files
  • Other checks (W3C compliance for CSS/HTML)