Scraping your HTML site to Joomla

by Peter Martin / @pe7er

JandBeyond 2017 in Kraków, Poland,
Friday 1 June 2017

About me


Peter Martin (@pe7er)

Lent (Nijmegen-Noord), Netherlands

About me

www.db8.nl
Joomla support & development

Since 2005: installing sites, configuring Options
Content > Articles > [Options] >
Show Category: Hide
Show Author: Hide
Show Publish Date: Hide

Options Manager
(My 1st commercial Component)
- Import/Export the Options of your extensions to reuse them on other websites.
www.db8.eu / JED listing

Joomla volunteer:
* Global Forum Moderator
* Joomla! Operations Department Coordinator
* Mentor GSoC 2016 + 2017
* Joomla Bug Squad
* Pizza Bugs & Fun (in NL)
* Dutch Joomla Developers meetups (in NL)



Organizes:
* Linux Usergroup Nijmegen
* Open Coffee Nijmegen

Overview

  • Scraping
  • Manual scraping
  • Automated scraping
  • Scraping tools
  • DIY Scraping
  • Demo



Presentation: http://slides.db8.nl/
Demo code: https://github.com/pe7er/scrape-demo

1. What is scraping?


Scraping

= extracting data from websites

Reuse (or steal) content
Spam bots
Hack bots

Price monitoring
Search Engines


Missing API
Migrate content

2009 - Book


Webbots, Spiders, and Screen Scrapers:
A Guide to Developing Internet Agents with PHP/cURL
by Michael Schrenk, 2007


2011 - Deal of the day

Website: Every 24 hours a new deal
Email notification at 12:00. Discount valid for only 24 hours
Comment section:
“Nice, I just bought two.” or “Bought it last time, they are good!”


After 24 hours: archived as "missed deal"
Comments NOT visible anymore

Me: coded PHP scraper + crontab trigger (11:30 daily)
added content to MySQL

Result: some "users" turned out to claim orders of 2,000 euro / month
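
The daily trigger was a crontab job. As a sketch (the script path here is made up):

# Run the scraper every day at 11:30, just before the 12:00 deal change
30 11 * * * /usr/bin/php /home/username/scraper/deal-scraper.php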

2013 - Bitcoins

Raspberry Pi + Butterfly Labs Jalapeno 10 GH/s

How to keep track of the exchange rate?


Me: coded PHP scraper
crontab trigger hourly
added content to MySQL + email notification
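
Glued together, that setup is roughly this sketch (the URL and tags are placeholders; curl() and scrape_between() are the helper functions shown later in this talk):

$page = curl('http://exchange.example.com/btc');
$rate = scrape_between($page, '<RATE START TAG>', '<RATE END TAG>');

// Notify myself of the current exchange rate
mail('info@example.com', 'BTC exchange rate', 'Current rate: ' . $rate);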

2015 - Daily deals

Everybody likes discounts

Me: When I buy something I need anyway, I would love a discount

Problem: I don't want to be bothered by daily deal emails

Me: coded PHP scraper
(PHP / crontab / MySQL / compare keywords)
if daily deal = my keyword: email notification
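
A minimal sketch of that keyword check (my example keywords; $dealTitle is assumed to hold the scraped deal title):

$keywords = array('ssd', 'raspberry pi', 'nas');

foreach ($keywords as $keyword) {
	if (stripos($dealTitle, $keyword) !== false) {
		// The deal matches one of my keywords: send myself a notification
		mail('info@example.com', 'Daily deal match: ' . $keyword, $dealTitle);
		break;
	}
}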

2016 - Migrate website content

Customer:

“ Last week we discussed converting an existing site to Joomla.”
“ I have more information: the website has no database and is in HTML.”
“ What are the costs to convert the data of the pages (content) and put it in a joomla database (articles)?”
Me: PHP scraper
content to MySQL in Joomla #__content structure

Before scraping website

Four questions:
1. What data to scrape?

2. How many pages?

3. Content still valid?

4. What structure?

OK, I lied a bit, one more:
5. Is it legal?

2. Scraping manually


Modern Times (1936)

DIY or LSEDI

Let Someone Else Do It?

ETC...




3. Automated Scraping


Automated scraping


Works well if a site has:

* a consistent structure of pages

* no anti-scrape protection

* links to all content


4. Scraping tools


Browser Addons

Google Chrome

Scraper (dvhtn)
Advanced Web Scraper (datascraping.co)
Data Scraper (data-miner.io)
Web Scraper (Martins Balodis)

Firefox

ffscrap (ivan_zderadicka)
OutWit Hub (OutWit Technologies)

SaaS - Software As A Service

Import.io
Webhose.io
Dexi.io (aka cloudscraper.com)
scrapinghub.com
80legs.com
All paid services...

Software

OutWit Hub
HTTrack
visualscraper.com
(free limited: 5,000 Pages, 50,000 records, 100 projects)
parsehub.com
(free limited: 200 pages per run, 40 min, 5 public projects)

Wget

# -r recursive, -p page requisites, -E add .html extensions,
# -e robots=off ignore robots.txt, -U send this User-Agent string
$ wget \
	--limit-rate=200k \
	--no-clobber \
	--convert-links \
	--random-wait \
	-r -p -E \
	-e robots=off \
	-U mozilla \
	http://www.example.com

5. How to DIY

My DIY Scraping

Download
Retrieve all files and store them locally
Spider
Collect all HTML files from the local copy
Scan
Scan each HTML file
Get content part
Get the relevant part of the HTML
Store in Database
Store it in Joomla #__content table format
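
Roughly, those steps chain together like this (a sketch based on the helper functions shown on the following slides; paths and tags are placeholders):

$links = getDirContents('/home/username/imported-site/');   // Spider

foreach ($links as $link) {                                 // Scan
	if (mime_content_type($link) == 'text/html') {
		// strip the local base path (adjust 16 to your prefix length)
		$page    = curl('http://localhost/' . substr($link, 16));
		$content = scrape_between($page, '<UNIQUE START TAG>', '<TAG AFTER CONTENT>');
		// ... build an INSERT query for #__content (see "Store in Database")
	}
}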

Download

$ wget \
	--limit-rate=200k --no-clobber \
	--convert-links --random-wait \
	-r -p -E -e robots=off \
	-U mozilla \
	http://example.com






Spider

function getDirContents($dir = '/home/username/imported-site/')
{
	$results = array();

	$di = new RecursiveDirectoryIterator($dir, RecursiveDirectoryIterator::SKIP_DOTS);
	$it = new RecursiveIteratorIterator($di);

	// Collect the path of every .html file in the downloaded site
	foreach ($it as $file) {
		if (pathinfo($file, PATHINFO_EXTENSION) == "html") {
			$results[] = (string) $file;
		}
	}

	return $results;
}






Scan each page

$links = getDirContents('/home/username/imported-site/');

foreach ($links as $link):

	if (mime_content_type($link) == 'text/html') {
		// Rebuild a local URL: strip the local base path
		// (adjust 16 to the length of your own path prefix)
		$link = "http://localhost/" . substr($link, 16);

		$results_page = curl($link);
	}
[..]






Scan using cURL

function curl($url)
{
	$options = array(
		CURLOPT_RETURNTRANSFER => true,   // return the webpage data instead of printing it
		CURLOPT_FOLLOWLOCATION => true,   // follow 'Location:' HTTP redirect headers
		CURLOPT_AUTOREFERER    => true,   // automatically set the Referer on redirects
		CURLOPT_CONNECTTIMEOUT => 120,    // seconds before the connection attempt times out
		CURLOPT_TIMEOUT        => 120,    // max. time cURL may spend on the request
		CURLOPT_MAXREDIRS      => 10,     // max. number of redirects to follow
		CURLOPT_USERAGENT      => 'Awesome webscraper', // the User-Agent string to send
		CURLOPT_URL            => $url,   // the URL passed into the function
	);

	$ch = curl_init();
	curl_setopt_array($ch, $options);  // apply all options at once
	$data = curl_exec($ch);            // execute the request
	curl_close($ch);

	return $data;
}
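
Because CURLOPT_RETURNTRANSFER makes curl_exec() return false on failure, the foreach loop in the Scan step could guard each fetch. A small addition of mine, not in the original demo:

$results_page = curl($link);

if ($results_page === false) {
	// Network error or timeout: skip this page instead of scraping garbage
	continue;
}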



Get content 1/2

$links = getDirContents('/home/username/imported-site/');

foreach ($links as $link):

	if (mime_content_type($link) == 'text/html') {
		$link = "http://localhost/" . substr($link, 16);

		$results_page = curl($link);

		// Cut out the content block, then the title and intro text.
		// Replace the placeholder tags with markers unique to your site's HTML.
		$content   = scrape_between($results_page, '<UNIQUE START TAG>', '<TAG AFTER CONTENT>');
		$title     = scrape_between($content, '<TITLE START TAG>', '<TITLE END TAG>');
		$introtext = scrape_between($content, '<INTROTEXT START TAG>', '<INTROTEXT END TAG>');
[..]






Get content 2/2

function scrape_between($data, $start, $end)
{
	$data = stristr($data, $start);        // strip everything before $start
	$data = substr($data, strlen($start)); // strip $start itself
	$stop = stripos($data, $end);          // find the position of $end
	$data = substr($data, 0, $stop);       // strip $end and everything after it

	return $data;                          // the text between $start and $end
}
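
A quick example of what it returns, with made-up markup:

echo scrape_between('<div><h1>My Title</h1></div>', '<h1>', '</h1>');
// output: My Title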






Store in Database 2/2

[..]
$query = 'INSERT INTO `' . $dbTable . '` (`id`, `title`, `alias`, `introtext`,
`fulltext`, `state`, `catid`, `created`, `created_by`, `created_by_alias` )
VALUES( null, ' .
	"'" . $mysqli->real_escape_string( $title ) . "', " .
	"'" . $mysqli->real_escape_string( $alias ) . "', " .
	"'" . $mysqli->real_escape_string( $introtext ) . "', " .
	"'" . $mysqli->real_escape_string( $link ) . "', " .  // the source URL goes into `fulltext`
	"0, 2, '2016-03-03 00:00:01', 604, 'Peter');";
	// state=0 (unpublished), catid=2, fixed created date, created_by user id 604

if ($mysqli->query($query))
{
	printf("%d row inserted.\n", $mysqli->affected_rows);
}
else
{
	printf("Error: %s\n", $mysqli->error);
}

}
endforeach;
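
A variant with prepared statements would avoid the manual escaping. A minimal sketch (my own, not part of the demo code; $author is a made-up variable, `id` is omitted because it auto-increments):

$author = 'Peter';  // created_by_alias

$stmt = $mysqli->prepare(
	'INSERT INTO `' . $dbTable . '` (`title`, `alias`, `introtext`, `fulltext`,
	`state`, `catid`, `created`, `created_by`, `created_by_alias`)
	VALUES (?, ?, ?, ?, 0, 2, NOW(), 604, ?)'
);
$stmt->bind_param('sssss', $title, $alias, $introtext, $link, $author);
$stmt->execute();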







Problems...


...or challenges

Some sites don't like getting scraped

They detect:

Unusual traffic / a high download rate -> slow down!

A single User-Agent -> vary your User-Agent (see the sketch below)

A single IP address -> vary your IP: use a proxy or VPN

Repetitive, too-fast requests -> build in a random delay (see the sketch below)

Honeypots: pages hidden with display:none and excluded in robots.txt
-> only bots that ignore the robots.txt guidelines retrieve them
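
As a sketch, varying the delay and the User-Agent per request could look like this (example values of mine):

$userAgents = array(
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
	'Mozilla/5.0 (X11; Linux x86_64)',
);

foreach ($links as $link) {
	sleep(rand(2, 10));  // random pause so requests don't look machine-timed

	$ch = curl_init($link);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]);
	$data = curl_exec($ch);
	curl_close($ch);
}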

Prevent getting Scraped?


Make it more difficult

if you don't like getting scraped:

Check for unusual traffic, repeated IP addresses, etc.
Use a honeypot
-> block IP addresses
and/or use a CAPTCHA

Use commercial anti-bot services
Obfuscate data using CSS sprites or JavaScript
Add small variations to your HTML/CSS tags

6. Demo

Thanks!





Presentation: http://slides.db8.nl/

Code: https://github.com/pe7er/scrape-demo


Peter Martin
e-mail: info at db8.nl
twitter: @pe7er

Photo Credits

garden-gardening-rake-tool
Degskrapor - Micke, 2007
Webbots, Spiders, and Screen Scrapers - Michael Schrenk, 2007
Butterfly Labs Bitcoin miner - arstechnica.com, 2013
typewriter-1138293_1280.png
Modern Times (1936)
CTRL C V
Copy Paste
typing-robot
rake
Command Line - Peter Martin
Plamuurmes - M.Minderhoud, 2004
Eiskratzer - Stefan Flöper, 2007
tv-implosion.mp4