To acquire data for my project mentioned here, I wanted to scrape the comments section of a norwegian newspaper.
Initially I thought it would be as simple as getting the source and applying whatever regular expressions that would give me what I wanted. Sadly it was not that simple.
Once the I got the data I load it to a java routine which remove unwanted data based on regular expressions (it is faster to access the DOM, I would do that if I had lots of data). Then I generate descriptive parameters (word count, punctuation amount , etc) based on post content. The resulting parameters are stored as JSON. Then the resulting JSON is used as basis for the scatterplots here.