Scrapping Data Using PHP

As part of my PS Now UK Database project I wanted to pull the completion time data from howlongtobeat.com

Scrapping data is something that I had played around with a few years back but I had forgotten how it works.  After an afternoon of research & a lot of trial and error, I managed to create some code that could to that very thing.

The biggest hurdle to get over was pin pointing what the actually search for.  Most websites will give all their tags either a class or id.  In this case, the value I was looking for was inside a plain old <div> tag with no class or ID.  This proved to be a problem.

To understand what data was being pulled from the site I used the print_r() command which showed me I was dealing with a  multi-dimension array.  Something I had not worked with before.  Luckily the print_r() command gives you the locations of each value in the array & using this data I was able to withdraw the data I required.

<pre><?php

function hltb($hltbID) {
    // Function use || to extract the time data from howlongtobeat based upon the game ID stored in the data base
    
    // 1.Get Data based upon hltbID value
    $html = file_get_contents("http://howlongtobeat.com/game.php?id=" . $hltbID);
    
    // howlongtobeat does not name the div that contains the actual time so I will go one div above and then sort through the data
    preg_match_all(
        '/<div>(.*?)<\/div>/s',
        $html, //scrapped file
        $posts,
        PREG_SET_ORDER // formats data into an array of posts
    );
    // 2. Sort  through array to find values needed & print
    echo "<br>Main Story " . $posts[0][1];
    echo "<br>Main + Extras " . $posts[1][1];
    echo "<br>Completionist " . $posts[2][1];
    echo "<br>Combined " . $posts[3][1];
};

include 'conn.php';
// Pull gameID from URL
$newID = $_GET["id"];

    $sql = "SELECT * FROM psnow WHERE gameID=".$newID;
    $result = mysql_query($sql) or die(mysql_error()); 

//Pull the howlongtobeat ID, gameName & gameCount from database
If (mysql_num_rows($result) > 0) {
                while ($row = mysql_fetch_array($result)) {
                    $count = $row['gameCount']+1; // When clicked to view HLTB +1 to count (future use)
                    $gameID = $row['gameID'];
            		echo "<p><b>". $row['gameName'] . "</b>"; //Print Game Name
            		hltb($row['gameHLTB']); //Pass the howlongtobeat game ID into the function
            		
            		//Update the gameCount
            		$update = "UPDATE psnow SET gameCount=" . $count . " WHERE gameID=".$gameID;
            		$result2=mysql_query($update);
		}//end while
}; //end if
?>

Once I was happy with it, I put the code into a function and called it for every single data row in my database (we are talking over 300 entries here).  As you can imagine, the site became slow as it was scrapping data for every entry in my database.  This was not an efficient way of doing things.

So back to the drawing board I went.  To over come this hang time I moved the function into it’s own php file which I then would call into a <div> on the main page by using Jquery.


function loadQueryResults($hltb_ID) {
$('#hltb').load('hltb.php?id=' + $hltb_ID);
$.scrollTo('#right',200);
$('#top10').load('test.php');
return false;
}

It is such a simple script but what it does fixed my problem.  Each entry in my database is a clickable link which has the onClick function added.  When clicked the unique gameID is passed into this function which in turn passes it into the hltb.php file.  Inside that file is the function from before and some extra php to display the game name.  This separate page is then loaded into the the div with the id of htlb.

Problem solved.

desmayer.co.uk/psnow

Leave a Reply

Your email address will not be published. Required fields are marked *