Scrapping Data Using PHP

As part of my PS Now UK Database project I wanted to pull the completion time data from

Scrapping data is something that I had played around with a few years back but I had forgotten how it works.  After an afternoon of research & a lot of trial and error, I managed to create some code that could to that very thing.

The biggest hurdle to get over was pin pointing what the actually search for.  Most websites will give all their tags either a class or id.  In this case, the value I was looking for was inside a plain old <div> tag with no class or ID.  This proved to be a problem.

To understand what data was being pulled from the site I used the print_r() command which showed me I was dealing with a  multi-dimension array.  Something I had not worked with before.  Luckily the print_r() command gives you the locations of each value in the array & using this data I was able to withdraw the data I required.


function hltb($hltbID) {
    // Function use || to extract the time data from howlongtobeat based upon the game ID stored in the data base
    // 1.Get Data based upon hltbID value
    $html = file_get_contents("" . $hltbID);
    // howlongtobeat does not name the div that contains the actual time so I will go one div above and then sort through the data
        $html, //scrapped file
        PREG_SET_ORDER // formats data into an array of posts
    // 2. Sort  through array to find values needed & print
    echo "<br>Main Story " . $posts[0][1];
    echo "<br>Main + Extras " . $posts[1][1];
    echo "<br>Completionist " . $posts[2][1];
    echo "<br>Combined " . $posts[3][1];

include 'conn.php';
// Pull gameID from URL
$newID = $_GET["id"];

    $sql = "SELECT * FROM psnow WHERE gameID=".$newID;
    $result = mysql_query($sql) or die(mysql_error()); 

//Pull the howlongtobeat ID, gameName & gameCount from database
If (mysql_num_rows($result) > 0) {
                while ($row = mysql_fetch_array($result)) {
                    $count = $row['gameCount']+1; // When clicked to view HLTB +1 to count (future use)
                    $gameID = $row['gameID'];
            		echo "<p><b>". $row['gameName'] . "</b>"; //Print Game Name
            		hltb($row['gameHLTB']); //Pass the howlongtobeat game ID into the function
            		//Update the gameCount
            		$update = "UPDATE psnow SET gameCount=" . $count . " WHERE gameID=".$gameID;
		}//end while
}; //end if

Once I was happy with it, I put the code into a function and called it for every single data row in my database (we are talking over 300 entries here).  As you can imagine, the site became slow as it was scrapping data for every entry in my database.  This was not an efficient way of doing things.

So back to the drawing board I went.  To over come this hang time I moved the function into it’s own php file which I then would call into a <div> on the main page by using Jquery.

function loadQueryResults($hltb_ID) {
$('#hltb').load('hltb.php?id=' + $hltb_ID);
return false;

It is such a simple script but what it does fixed my problem.  Each entry in my database is a clickable link which has the onClick function added.  When clicked the unique gameID is passed into this function which in turn passes it into the hltb.php file.  Inside that file is the function from before and some extra php to display the game name.  This separate page is then loaded into the the div with the id of htlb.

Problem solved.


Finding Old Work

Boy what a weekend!  It all started late last night when I found my final University project back up on my Google Drive.  A quick transfer to my live server and with a few tweaks I was looking at my final project that I developed 4 years ago!  The only thing that was 100% broken was the Google geocoding but that was due to the updated APIs.  One Sunday morning later and they were all fixed.

What I did find impressive was how well this thing works.  I am not putting myself down here but before starting this project I knew very little about PHP and MySQL.  It was all self taught and when I look back at all the features of this project I must say I am impressed with what I created.

I have spent today looking through the lines of codes working out what I was trying to achieve.  Luckily I did drop a few comments here and there but past me was a very messy coder.  The project had quite a lot going for it but it looks like time ran out during development but as I am topping up my programming skills this year I am going to use this time to expand on what is currently there.

I already have a few ideas for new features for this old project and even started implementing them in the background.

Miuni –