by Peter Martin /
@pe7er
JandBeyond 2017 in Kraków, Poland,
Friday 1 June 2017
Peter Martin (@pe7er)
Lent (Nijmegen-Noord), Netherlands
www.db8.nl
Joomla support & development
Since 2005: installing sites, configuring Options
Content > Articles > [Options] >
Show Category: Hide
Show Author: Hide
Show Publish Date: Hide
Options Manager
(My 1st commercial Component)
- Import / Export the Options of your extensions for reuse on other websites.
www.db8.eu /
JED listing
Joomla volunteer:
* Global Forum Moderator
* Joomla! Operations
Department Coordinator
* Mentor GSoC 2016 + 2017
* Joomla Bug Squad
* Pizza Bugs & Fun (in NL)
* Dutch Joomla Developers meetups (in NL)
Organizes:
* Linux Usergroup Nijmegen
* Open Coffee Nijmegen
Presentation:
http://slides.db8.nl/
Demo code:
https://github.com/pe7er/scrape-demo
Reuse (or steal) content
Spam bots
Hack bots
Price monitoring
Search Engines
Missing API
Migrate content
Website: Every 24 hours a new deal
Email notification at 12:00.
Discount only 24 hours
Comment section:
“Nice, I just bought two.” or
“Bought it last time, they are good!”
After 24 hours: archived as a "missed deal"
Comments are no longer visible
Me: coded PHP scraper + crontab trigger (11:30 daily)
added content to MySQL
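The daily trigger can be a single crontab entry. A minimal sketch, assuming the scraper lives at a hypothetical path like `/home/username/scraper/deal-scraper.php`:

```shell
# crontab -e
# m   h   dom mon dow  command  -- run the scraper daily at 11:30
30    11  *   *   *    /usr/bin/php /home/username/scraper/deal-scraper.php
```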
Result: some "users" claimed to order 2,000 euro / month
Raspberry Pi + Butterfly Labs Jalapeno 10 GH/s
How to keep track of the exchange rate?
Me: coded PHP scraper
crontab trigger hourly
added content to MySQL + email notification
Everybody likes discounts
Me: When I buy something I need, I would love a discount
Problem: don't bother me with daily deal emails
Me: coded PHP scraper
(PHP / crontab / MySQL / compare keywords)
if daily deal = my keyword: email notification
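The keyword comparison can be sketched like this; keyword list, deal title, and the notification step are hypothetical example values, not the actual script:

```php
<?php
// Sketch: compare today's scraped deal title against a personal keyword list.
$keywords  = ['ssd', 'raspberry pi', 'router'];  // my keywords (example values)
$dealTitle = 'Daily Deal: 1 TB SSD for 49 euro'; // deal title scraped earlier (example)

$matched = null;
foreach ($keywords as $keyword) {
    // case-insensitive substring match
    if (stripos($dealTitle, $keyword) !== false) {
        $matched = $keyword;
        break;
    }
}

if ($matched !== null) {
    // in the real script: send the e-mail notification here
    echo "Deal matches keyword: $matched\n";
}
```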
Customer:
“ Last week we discussed converting an existing site to Joomla.”
“ I have more information: the website has no database and is in HTML.”
“ What are the costs to convert the data of the pages (content) and put it in a Joomla database (articles)?”
Me: PHP scraper
Four questions:
1. What data to scrape?
2. How many pages?
3. Content still valid?
4. What structure?
Ok, I lied a bit, one more:
5. Is it legal?
structure of pages
no anti-scrape protection
links to all content
Scraper (dvhtn)
Advanced Web Scraper (datascraping.co)
Data Scraper (data-miner.io)
Web Scraper (Martins Balodis)
ffscrap (ivan_zderadicka)
OutWit Hub (OutWit Technologies)
Import.io
Webhose.io
Dexi.io (aka cloudscraper.com)
scrapinghub.com
80legs.com
All paid services...
OutWit Hub
HTTrack
visualscraper.com
(free limited: 5,000 Pages, 50,000 records, 100 projects)
parsehub.com
(free limited: 200 pages per run, 40 min, 5 public projects)
$ wget
--limit-rate=200k
--no-clobber
--convert-links
--random-wait
-r -p -E -e robots=off
-U mozilla
http://www.example.com
Download
Retrieve all files and store local
Spider
get all links from index.html
Scan
Scan all HTML files
Get content part
Get part of HTML
Store in Database
Store in Joomla #__content table format
$ wget
--limit-rate=200k --no-clobber
--convert-links --random-wait
-r -p -E -e robots=off
-U mozilla
http://example.com
function getDirContents($dir = '/home/username/imported-site/')
{
    // Recursively collect the paths of all .html files below $dir
    $results = [];
    $di = new RecursiveDirectoryIterator($dir, RecursiveDirectoryIterator::SKIP_DOTS);
    $it = new RecursiveIteratorIterator($di);

    foreach ($it as $file) {
        if (pathinfo($file, PATHINFO_EXTENSION) === 'html') {
            $results[] = (string) $file;
        }
    }

    return $results;
}
$links = getDirContents('/home/username/imported-site/');
foreach ($links as $link):
    if (mime_content_type($link) === 'text/html') {
        // strip the local directory prefix and fetch the page via the local web server
        $link = 'http://localhost/' . substr($link, 16);
        $results_page = curl($link);
    }
[..]
function curl($url)
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // return the page data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow 'Location:' HTTP headers
        CURLOPT_AUTOREFERER    => true,  // automatically set the Referer header
        CURLOPT_CONNECTTIMEOUT => 120,   // seconds before the connection attempt times out
        CURLOPT_TIMEOUT        => 120,   // max. time for cURL to execute the request
        CURLOPT_MAXREDIRS      => 10,    // max. number of redirects to follow
        CURLOPT_USERAGENT      => 'Awesome webscraper', // set the user agent
        CURLOPT_URL            => $url,  // the URL passed into the function
    );

    $ch = curl_init();
    curl_setopt_array($ch, $options); // apply the $options array
    $data = curl_exec($ch);           // execute the request; returned data in $data
    curl_close($ch);

    return $data;
}
$links = getDirContents('/home/username/imported-site/');
foreach ($links as $link):
    if (mime_content_type($link) === 'text/html') {
        $link = 'http://localhost/' . substr($link, 16);
        $results_page = curl($link);
        // placeholder tags below: replace with the actual markup of the scraped site
        $content   = scrape_between($results_page, '<UNIQUE START TAG>', '<TAG AFTER CONTENT>');
        $title     = scrape_between($content, '<TITLE START TAG>', '<TITLE END TAG>');
        $introtext = scrape_between($content, '<INTRO START TAG>', '<INTRO END TAG>');
function scrape_between($data, $start, $end)
{
    $data = stristr($data, $start);        // strip all data before $start
    $data = substr($data, strlen($start)); // strip $start itself
    $stop = stripos($data, $end);          // position of $end in the remaining data
    $data = substr($data, 0, $stop);       // strip everything from $end onwards

    return $data; // return the scraped data
}
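A quick worked example of scrape_between() on a sample fragment (function copied from the slide; the HTML string is a made-up illustration):

```php
<?php
// scrape_between() as shown on the slide
function scrape_between($data, $start, $end)
{
    $data = stristr($data, $start);        // strip all data before $start
    $data = substr($data, strlen($start)); // strip $start itself
    $stop = stripos($data, $end);          // position of $end in the remaining data
    return substr($data, 0, $stop);        // strip everything from $end onwards
}

$html = '<div id="content"><h1>My Deal</h1><p>Intro text</p></div>';
echo scrape_between($html, '<h1>', '</h1>'), "\n"; // My Deal
echo scrape_between($html, '<p>', '</p>'), "\n";   // Intro text
```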
[..]
$query = 'INSERT INTO `' . $dbTable . '` (`id`, `title`, `alias`, `introtext`,
    `fulltext`, `state`, `catid`, `created`, `created_by`, `created_by_alias`)
    VALUES (null, ' .
    "'" . $mysqli->real_escape_string($title) . "', " .
    "'" . $mysqli->real_escape_string($alias) . "', " .
    "'" . $mysqli->real_escape_string($introtext) . "', " .
    "'" . $mysqli->real_escape_string($link) . "', " .
    "0, 2, '2016-03-03 00:00:01', 604, 'Peter');";

if ($mysqli->query($query)) {
    printf("%d row inserted.\n", $mysqli->affected_rows);
} else {
    printf("Error: %s\n", $mysqli->error);
}
}
endforeach;
Detect:
Unusual traffic/high download rate
-> Slow down!
Single User Agent
-> Vary User Agents
Single IP address
-> Vary IP: use a proxy or VPN
Repetitive tasks, performed too fast
-> Build in a random delay
Using honeypots:
-> bot not following the robots.txt guidelines?
-> bot retrieving links to honeypot pages hidden with display:none?
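The "slow down" and "random delay" countermeasures above can be sketched as a small helper; the function name and the 2-8 second range are example assumptions:

```php
<?php
// Sketch: randomized pause between requests to look less bot-like
function randomDelay(int $minSeconds = 2, int $maxSeconds = 8): void
{
    // random_int() gives a cryptographically secure integer in [$min, $max]
    sleep(random_int($minSeconds, $maxSeconds));
}

// before each page request:
// randomDelay();
```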
Check for unusual traffic, repeated IP addresses, etc.
Use honeypots
-> Block IP addresses
and/or Use CAPTCHA
Use commercial anti-bot services
Obfuscate data using CSS sprites or JavaScript
Add small variations to HTML/CSS tags
Presentation: http://slides.db8.nl/
Code: https://github.com/pe7er/scrape-demo
Peter Martin
e-mail: info at db8.nl
twitter: @pe7er
garden-gardening-rake-tool
Degskrapor - Micke, 2007
Webbots, Spiders, and Screen Scrapers - Michael Schrenk, 2007
Butterfly Labs Bitcoin miner - arstechnica.com, 2013
typewriter-1138293_1280.png
Modern Times (1936)
CTRL C V
Copy Paste
typing-robot
rake
Command Line - Peter Martin
Plamuurmes - M.Minderhoud, 2004
Eiskratzer - Stefan Flöper, 2007
tv-implosion.mp4