phpscraper-keyword-length-distribution-example

Keyword Length Distribution Example using PHPScraper

PHPScraper is a PHP scraping library that aims to make web scraping easier. It reduces the amount of code needed for common scraping tasks.

This is an example of the library scraping keywords from the Wikipedia article "Online Advertising". After the keyword extraction, the data is processed to analyze the distribution of keyword length. The expected output can be found below.
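The processing step described above can be sketched roughly as follows. This is a minimal illustration, not the script from this repository: the sample keywords are hypothetical stand-ins for what PHPScraper would return, the function name `lengthDistribution` is made up for this sketch, and it assumes that "length" means the number of characters as reported by `strlen`.

```php
<?php
// Hypothetical sample input; in the real script the keywords would come from PHPScraper.
$keywords = ['online advertising', 'form', 'marketing', 'internet', 'web', 'ads'];

// Count how many keywords share each length.
// Assumption: length is measured in bytes via strlen(); use mb_strlen() for multibyte text.
function lengthDistribution(array $keywords): array
{
    $distribution = array_count_values(array_map('strlen', $keywords));
    ksort($distribution); // order by length, shortest first

    return $distribution;
}

$distribution = lengthDistribution($keywords);

echo 'This sample contains ' . count($keywords) . ' keywords/phrases.' . PHP_EOL;
print_r($distribution);
```

The result is an array keyed by length, with the number of keywords of that length as the value. This is the same shape as the `Array ( ... )` output in the Result section.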

PHPScraper uses the RAKE PHP Plus library for keyword extraction. RAKE stands for "Rapid Automatic Keyword Extraction".
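To give a feel for why RAKE returns phrases of very different lengths, here is a toy sketch of the candidate-splitting idea behind the algorithm: text is cut at punctuation and stop words, and the remaining word runs become candidate phrases. It does not use RAKE PHP Plus and omits RAKE's scoring step entirely. The function name `candidatePhrases` and the tiny stop-word list are assumptions made for this illustration.

```php
<?php
// Toy illustration only: split text into candidate phrases at punctuation and stop words.
// The stop-word list passed in below is a tiny hypothetical one, not the list a real library ships.
function candidatePhrases(string $text, array $stopWords): array
{
    $phrases = [];

    // First cut the text into fragments at punctuation.
    foreach (preg_split('/[.,;:!?()]+/', strtolower($text)) as $fragment) {
        $current = [];

        foreach (preg_split('/\s+/', trim($fragment), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if (in_array($word, $stopWords, true)) {
                // A stop word closes the current candidate phrase.
                if ($current !== []) {
                    $phrases[] = implode(' ', $current);
                    $current = [];
                }
            } else {
                $current[] = $word;
            }
        }

        // Flush whatever is left at the end of the fragment.
        if ($current !== []) {
            $phrases[] = implode(' ', $current);
        }
    }

    return $phrases;
}

$stopWords = ['is', 'a', 'of', 'the', 'that', 'to', 'and', 'uses'];
$text = 'Online advertising is a form of marketing that uses the internet.';

$phrases = candidatePhrases($text, $stopWords);
print_r($phrases);
```

Long runs of text without stop words or punctuation stay together as one phrase, which is one plausible reason for the very long entries at the tail of the distribution.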

See also the PHPScraper keyword extraction example and the Keyword Merge package.

Installation

This example was built with PHP 7.2.24 running on an Ubuntu-based Linux distribution.

To run this example, clone the repository and install the dependencies:

git clone git@github.com:spekulatius/phpscraper-keyword-length-distribution-example.git
cd phpscraper-keyword-length-distribution-example
composer install

If you would like to make changes, fork the repository first.

Execution

$ php keyword-length-distribution.php

Result

As a graphic:

[Figure: Keyword Length Distribution for "Online Marketing"]

As text:

This page contains around 1989 keywords/phrases.
Below are some selected keyword extractions.

Length Distribution of Keywords:

Array
(
    [1] => 7
    [2] => 5
    [3] => 46
    [4] => 95
    [5] => 80
    [6] => 84
    [7] => 129
    [8] => 137
    [9] => 117
    [10] => 103
    [11] => 91
    [12] => 71
    [13] => 76
    [14] => 58
    [15] => 82
    [16] => 71
    [17] => 72
    [18] => 76
    [19] => 51
    [20] => 57
    [21] => 40
    [22] => 33
    [23] => 45
    [24] => 34
    [25] => 21
    [26] => 29
    [27] => 29
    [28] => 17
    [29] => 22
    [30] => 17
    [31] => 17
    [32] => 11
    [33] => 11
    [34] => 10
    [35] => 11
    [36] => 8
    [37] => 7
    [38] => 10
    [39] => 10
    [40] => 3
    [41] => 5
    [42] => 4
    [43] => 5
    [44] => 4
    [45] => 5
    [46] => 2
    [47] => 3
    [48] => 2
    [49] => 2
    [51] => 3
    [52] => 3
    [53] => 3
    [54] => 1
    [55] => 2
    [56] => 3
    [57] => 2
    [59] => 1
    [61] => 1
    [66] => 1
    [67] => 1
    [70] => 1
    [71] => 1
    [76] => 2
    [77] => 1
    [81] => 2
    [84] => 1
    [85] => 1
    [93] => 1
    [97] => 1
    [99] => 1
    [106] => 1
    [107] => 1
    [110] => 1
    [120] => 1
    [121] => 1
    [123] => 1
    [142] => 1
    [148] => 1
    [156] => 1
    [159] => 1
    [166] => 1
    [169] => 1
    [174] => 1
    [191] => 1
    [193] => 1
    [195] => 1
    [201] => 1
    [205] => 1
    [229] => 1
    [252] => 1
    [269] => 1
    [280] => 1
    [288] => 1
    [308] => 1
    [332] => 1
    [392] => 1
    [408] => 1
    [473] => 1
    [506] => 1
    [1422] => 1
    [2184] => 1
)

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,51,52,53,54,55,56,57,59,61,66,67,70,71,76,77,81,84,85,93,97,99,106,107,110,120,121,123,142,148,156,159,166,169,174,191,193,195,201,205,229,252,269,280,288,308,332,392,408,473,506,1422,2184

7,5,46,95,80,84,129,137,117,103,91,71,76,58,82,71,72,76,51,57,40,33,45,34,21,29,29,17,22,17,17,11,11,10,11,8,7,10,10,3,5,4,5,4,5,2,3,2,2,3,3,3,1,2,3,2,1,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

Please note: the scraped article changes over time, so current results may differ.
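The two comma-separated lines in the text output appear to be the keys (lengths) and values (counts) of the distribution array, which is a convenient form for pasting into a charting tool. A minimal sketch of how such lines could be produced, using a shortened distribution whose four entries match the first rows of the output above (the variable names are assumptions of this sketch):

```php
<?php
// Shortened distribution (length => count), taken from the first four rows of the output above.
$distribution = [1 => 7, 2 => 5, 3 => 46, 4 => 95];

// One line with the lengths, one line with the matching counts.
$lengths = implode(',', array_keys($distribution));
$counts = implode(',', array_values($distribution));

echo $lengths . PHP_EOL; // 1,2,3,4
echo $counts . PHP_EOL;  // 7,5,46,95

// Sanity check: the counts add up to the number of keywords covered.
echo array_sum($distribution) . PHP_EOL; // 153
```

Applied to the full array, the same sum gives 1989, which matches the keyword total reported at the top of the text output.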