A quick post about something that recently grabbed my attention.
Scraping
Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Source: https://en.wikipedia.org/wiki/Web_scraping
I already did some research on the subject when I was playing around with my Raspberry Pi. There is a lot out there, especially for Python.
But what about my favourite programming language, Haxe?
Again, this was only a quick search, and this is what I found.
A (very?) old project from Jonas Malaco Filho on GitHub. Check out this code: jonas-haxe, and specifically the scraper part of it. It is written for Neko and relies on mostly undocumented classes like neko.vm.Mutex.
Once you have the HTML page, you can start getting the data from it!
You will need an HTML/XML parser; I found one written by Yaroslav Sivakov: the HtmlParser Haxe library. It can also be found on haxelib: http://lib.haxe.org/p/HtmlParser/
I also found a small (old) Haxe/PHP project that I will post as a reference: https://github.com/andor44/old_scraper. But then it stops…
This is not a field that many Haxe developers venture into. Fun!
Update #2
- HtmlParser doesn’t work with the HTML I am scraping, so I need to narrow things down to the parts I want to use. Regular expressions are the way to go, and I suck at them. Luckily I found an online tool that helps with testing a regex: http://www.regexr.com/, from an old Flash hero, gskinner.
- Another thing I ran into was getting data from HTTPS sites. You need something “extra” to download HTML files from there: install hxssl via haxelib
haxelib install hxssl
and add it to your build.hxml:
-lib hxssl
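To make the regular-expression route from the first point a bit more concrete, here is a minimal sketch using Haxe's built-in EReg class. It is not the code I am actually running: the class name LinkScraper, the helper extractLinks and the sample HTML are made up for illustration, and the pattern assumes double-quoted href attributes, so real pages will probably need something more forgiving.

```haxe
class LinkScraper {
	// Hypothetical helper: collects the href value of every <a> tag in the given HTML.
	// Assumes href values are wrapped in double quotes.
	public static function extractLinks(html:String):Array<String> {
		var links:Array<String> = [];
		// Group 1 captures the text between the quotes of href="...".
		// The "i" flag makes the match case-insensitive.
		var anchor = ~/<a\s[^>]*href="([^"]*)"/i;
		var rest = html;
		while (anchor.match(rest)) {
			links.push(anchor.matched(1));
			// Keep searching in whatever follows the current match.
			rest = anchor.matchedRight();
		}
		return links;
	}

	static function main() {
		// Made-up sample markup, standing in for a downloaded page.
		var html = '<ul><li><a href="/page/1">One</a></li><li><a class="x" href="/page/2">Two</a></li></ul>';
		for (link in extractLinks(html))
			trace(link);
	}
}
```

With the sample markup above, extractLinks should return /page/1 and /page/2. The loop simply re-runs the same pattern on the text to the right of each match, which keeps the sketch short; for anything bigger, EReg.map or matchSub might be a better fit.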
Update #1
I am coding this with OpenFL and regular expressions, but perhaps a better way to go is Node.js! And you can use Node.js with Haxe through hxnodejs (perhaps not completely ready, but probably good enough for the examples below).
- https://scotch.io/tutorials/scraping-the-web-with-node-js
- http://nrabinowitz.github.io/pjscrape/
- https://medialab.github.io/artoo/
- https://github.com/ruipgil/scraperjs
- http://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
- https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
- http://code.tutsplus.com/tutorials/screen-scraping-with-nodejs--net-25560
- http://noodlejs.com/
- http://webscraper.io/
I can’t really say how to start with Node.js and Haxe because I have never tried it, but from what I have read about it, it shouldn’t be a big problem. Fun again!
Read this
Some interesting reads… somewhat related to Haxe
- https://blog.hartleybrody.com/web-scraping/
- http://blog.databigbang.com/tag/java/
- https://github.com/databigbang/stream-oriented-knuth-morris-pratt
2 replies on “Scraping with Haxe”
Did you end up with some kind of Haxe scraping library yet? I’m interested for a personal project. I plan to scrape a website with lots of pages (with categories and subcategories) and collect some data in a local database or in files. I don’t much mind what it runs on.
Hi Mark,
Short answer: no, not yet.
Currently I am just hacking my way towards my goal and learning more about regular expressions.
Perhaps the code I mentioned (https://github.com/jonasmalacofilho/jonas-haxe/tree/haxe3migration/src/jonas/scraper) is more your thing (it’s a mystery box to me).
Regards,
Matthijs