I would say you have two separate things there:
Data scraping, screen scraping & web scraping are about actually getting the data from the source.
Data mining & report mining are more about using the data to uncover trends and information.
Basically, if anything can be seen on the screen then 99.9% of the time you can get the data. How you get it varies and is constantly changing as sites update, introduce new technology and add security measures. It is good practice to abide by any rules the site publishes (robots.txt, terms of use) and not to overburden it with too many requests.
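As a minimal sketch of the "don't overburden" point: in Excel VBA you can simply pause between requests. The 2-second gap below is an arbitrary example value, not a figure from any particular site.

    Sub PoliteFetchLoop()
        Dim i As Long
        For i = 1 To 5
            ' ... fetch page i here ...
            ' Pause 2 seconds before the next request so you don't hammer the site
            Application.Wait Now + TimeValue("0:00:02")
        Next i
    End Sub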
doris_day has given you a very good place to start with Excel and web queries. My advice: if you intend to use the extracted information for non-live analysis, then Excel and VBA is by far the easiest method, as Excel files are easily ported to other apps for analysis and storage. VB & C# are better suited to live situations and to building complete apps from scratch.
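If you want to drive a web query from VBA rather than through the menus, something like this sketch works (the URL is just a made-up placeholder):

    Sub PullPageWithWebQuery()
        ' Pull a page into the active sheet via a web query
        With ActiveSheet.QueryTables.Add( _
                Connection:="URL;http://www.example.com/prices.html", _
                Destination:=ActiveSheet.Range("A1"))
            .WebSelectionType = xlEntirePage    ' or xlSpecifiedTables to grab one table
            .WebFormatting = xlWebFormattingNone
            .Refresh BackgroundQuery:=False     ' wait for the data before continuing
        End With
    End Sub

Once the data lands on the sheet you can save it as .xls/.csv and port it to whatever you use for analysis.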
Website technology has moved beyond the days when all the data sat in the page's HTML source; these days technologies like SOAP, Javascript and AJAX often update only parts of the page, fetching the data in the background. Sometimes you may need to build a bot that replicates a real user's session. It all depends on the site and the technology being used.
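The upside is that those background requests are often easier to scrape than the page itself. Use your browser's network tools to find the URL the page's Javascript calls, then hit it directly. A minimal sketch (this endpoint URL is made up for illustration):

    Sub FetchAjaxEndpoint()
        Dim http As Object, body As String
        Set http = CreateObject("MSXML2.XMLHTTP")
        ' Call the background URL the page's Javascript uses
        http.Open "GET", "http://www.example.com/ajax/quotes?symbol=ABC", False
        http.send
        body = http.responseText    ' often JSON or an XML/SOAP fragment
        Debug.Print body            ' parse the values you need out of this
    End Sub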
The best thing is to start a small project with just one site. See if you can extract the data you need using:
1. Excel webquery
2. Source HTML (see the sketch after this list)
3. Parsing Javascript/SOAP/AJAX
If you can do these, you should be able to handle 75%+ of sites.
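For step 2, the usual approach is to pull the raw HTML and cut out the bit you need with string functions. A minimal sketch — the URL and the <span id="price"> marker are made-up examples; you'd look at the real page source to find your own markers:

    Sub ScrapeFromSourceHTML()
        Dim http As Object, html As String
        Dim p1 As Long, p2 As Long
        Set http = CreateObject("MSXML2.XMLHTTP")
        http.Open "GET", "http://www.example.com/quote.html", False
        http.send
        html = http.responseText
        ' Find the text between <span id="price"> and </span>
        p1 = InStr(html, "<span id=""price"">")
        If p1 > 0 Then
            p1 = p1 + Len("<span id=""price"">")
            p2 = InStr(p1, html, "</span>")
            Range("A1").Value = Mid$(html, p1, p2 - p1)  ' drop the value into the sheet
        End If
    End Sub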
This is also a good book:
http://www.heatonresearch.com/book/http ... sharp.html