0. Data gathering: run whatever scripts you need to collect raw data and emit CSV files.
1. ETL phase: use Hadoop MapReduce or Pig to process the raw data, then save the
results to Cassandra or MongoDB.
2. Online stream processing: typically Kafka serves as the message queue, with
either Storm or Spark Streaming consuming and processing events in near real time.
3. Offline analysis: use Hadoop MapReduce or Spark for detailed batch analysis.
4. Data persistence: save results to S3/HDFS, or to Cassandra.
5. You may also need a cache layer, so you don't hit the database or reprocess
the same query on every request; the usual candidates are Memcached and Redis
(Redis is generally preferred for its richer data structures).
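The step-0 gathering script can be as simple as a few lines of Python. The schema below (user_id, event, value) is purely illustrative, not something the pipeline prescribes:

```python
import csv
import random

def generate_events_csv(path, n=100):
    """Write a synthetic events CSV (hypothetical schema, for illustration)."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["user_id", "event", "value"])
        for i in range(n):
            # fake rows: a user id, an event type, and a numeric value
            w.writerow([i % 10, random.choice(["click", "view"]), random.randint(1, 5)])
```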
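The step-1 ETL would normally be a Pig script or a MapReduce job; as a rough sketch, here is the same GROUP BY / SUM logic in plain Python, assuming (purely for illustration) a CSV with user_id and value columns:

```python
import csv
from collections import defaultdict

def etl_aggregate(in_path, out_path):
    """GROUP BY user_id, SUM(value) -- the aggregation a Pig script would express."""
    totals = defaultdict(int)
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["user_id"]] += int(row["value"])
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["user_id", "total_value"])
        for uid, total in sorted(totals.items()):
            w.writerow([uid, total])
```

In the real pipeline the output rows would land in Cassandra or MongoDB instead of a local file.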
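For step 2, the real wiring is Kafka topics feeding a Storm topology or a Spark Streaming job. The kind of windowed aggregation those frameworks run can be sketched framework-free in plain Python (tumbling windows over (timestamp, key) events; the event shape is an assumption):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=60):
    """Count events per key per tumbling window.

    events: iterable of (unix_timestamp, key) pairs -- a stand-in for what a
    Storm bolt or a Spark Streaming micro-batch would receive from Kafka.
    Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = int(ts // window_sec) * window_sec  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)
```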
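Step 3's batch analysis follows the map-shuffle-reduce shape regardless of whether Hadoop or Spark executes it. The canonical word-count example, sketched in plain Python to show the three phases:

```python
from itertools import groupby

def mapreduce_wordcount(lines):
    """Word count in MapReduce style: map, shuffle (sort+group), reduce."""
    # map phase: emit (word, 1) for every word
    mapped = [(w, 1) for line in lines for w in line.split()]
    # shuffle phase: bring equal keys together (the framework does this for you)
    mapped.sort(key=lambda kv: kv[0])
    # reduce phase: sum the counts for each key
    return {k: sum(c for _, c in grp) for k, grp in groupby(mapped, key=lambda kv: kv[0])}
```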
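For step 4, whether the sink is S3 or HDFS, batch output is usually laid out under date-partitioned keys so later jobs can scan only the partitions they need. A sketch of one such layout (the Hive-style dt=/hour= convention shown here is a common choice, but an assumption, not something the pipeline mandates):

```python
from datetime import datetime, timezone

def partition_key(dataset, unix_ts):
    """Build a Hive-style partitioned object key for S3/HDFS (illustrative layout)."""
    d = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return f"{dataset}/dt={d:%Y-%m-%d}/hour={d:%H}/part-00000.csv"
```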
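The step-5 cache layer is usually built cache-aside: check the cache, and only on a miss hit the database and write the result back with a TTL. A minimal sketch of that pattern, with an in-process dict standing in for Redis:

```python
import time

class CacheAside:
    """Cache-aside with TTL. A dict stands in for Redis here; with Redis you
    would use GET on the hit path and SETEX (set with expiry) on the miss path."""

    def __init__(self, ttl=60, clock=time.time):
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # key -> (value, expires_at)

    def get(self, key, compute):
        entry = self.store.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                      # cache hit: skip the DB
        value = compute()                        # cache miss: hit the DB / recompute
        self.store[key] = (value, self.clock() + self.ttl)
        return value
```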