amazon s3 - Easiest way to persist Cassandra data to S3 using Spark
I am trying to figure out how best to store and retrieve data between S3 and Cassandra using Spark. I have log data stored in Cassandra, and I run Spark on it using DSE to perform analysis of the data, which works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store older logs somewhere for at least six months, and after some research, S3 with Glacier looks like a promising solution.

I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3. My problem is this: I can't seem to settle on the right format for saving the Cassandra rows to a file, such that I could one day potentially load that file back into Spark and run an analysis on it if I have to. I only want to run that analysis in Spark one day; I do not want to persist the data back into Cassandra. JSON seems like the obvious solution, but is there some other format I am not considering? Should I use Spark SQL? Any advice is appreciated before I commit to one format or another.
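For concreteness, here is a rough sketch of the daily job I have in mind, using the spark-cassandra-connector. The keyspace, table, and column names are placeholders, and the save step in the middle is exactly the part I cannot settle on:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("archive-old-logs")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Day 15; in the real job this would be computed from today's date.
    val cutoff = "2015-06-01"

    // Pull the expiring day's rows out of Cassandra
    // (assumes log_date is the partition key of the table).
    val oldLogs = sc.cassandraTable("logs_ks", "raw_logs")
      .where("log_date = ?", cutoff)

    // ... save oldLogs to S3 in some format (this is the open question) ...

    // Once safely archived, drop that day's partition from Cassandra.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(s"DELETE FROM logs_ks.raw_logs WHERE log_date = '$cutoff'")
    }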
Apache Spark is designed for that kind of use case. For the storage format, consider Parquet: a storage format for columnar databases that provides column compression and indexing.
It is becoming the de facto standard. Many big data platforms are adopting it, or at least providing support for it. You can query it efficiently directly in S3 using Spark SQL, Impala, or Apache Drill, and you can run EMR jobs against it.
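As an illustration, here is a minimal sketch of querying archived Parquet files directly from S3 with Spark SQL (Spark 1.x APIs; the bucket path, table name, and level column are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc is an existing SparkContext

    // Load one archived day straight from S3; Cassandra is not involved.
    val archived = sqlContext.parquetFile("s3n://my-log-archive/logs/2015-06-01")
    archived.registerTempTable("archived_logs")

    // Ad hoc analysis with plain SQL.
    sqlContext.sql(
      "SELECT level, COUNT(*) AS n FROM archived_logs GROUP BY level ORDER BY n DESC"
    ).show()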
To write data to Parquet using Spark, use DataFrame.saveAsParquetFile.
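Putting it together, a sketch of the archival write, assuming the DataFrame support in recent versions of the spark-cassandra-connector (the keyspace, table, column, and bucket names are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Expose the Cassandra table as a DataFrame through the connector's data source.
    val logs = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "logs_ks", "table" -> "raw_logs"))
      .load()

    // Keep only the day being aged out and write it to S3 as Parquet.
    logs.filter(logs("log_date") === "2015-06-01")
      .saveAsParquetFile("s3n://my-log-archive/logs/2015-06-01")

On newer Spark versions the equivalent call is logs.write.parquet(path); saveAsParquetFile is the Spark 1.x name.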
Depending on your specific requirements, you may end up not needing a separate Cassandra instance at all.
You may find this post interesting.