amazon s3 - Easiest way to persist Cassandra data to S3 using Spark


I'm trying to figure out the best way to store and retrieve data between S3 and Cassandra, using Spark. I have log data stored in Cassandra, and I run Spark on DSE to perform analysis of the data, which works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store the older logs somewhere for at least six months, and after some research, S3 with Glacier looks like a promising solution. I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3.

My problem is this: I can't seem to settle on the right format for saving the Cassandra rows to a file, so that one day I could potentially load that file back into Spark and run an analysis on it if I have to. I only want to run that analysis in Spark on that one day, not persist the data back into Cassandra. JSON seems like the obvious solution, but is there another format I'm not considering? Should I use Spark SQL? Any advice is appreciated before I commit to one format or another.
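To make this concrete, here is roughly the shape of the daily job I have in mind (the keyspace, table, column names, cutoff date, and S3 path are all placeholders, and the JSON write at the end is exactly the part I'm unsure about):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    // sc is the SparkContext provided by the DSE Spark shell;
    // assumes Spark 1.4+ and the DataStax spark-cassandra-connector
    val sqlContext = new SQLContext(sc)

    // Pull the log table in through the Cassandra connector
    // ("my_ks" and "logs" are placeholder names)
    val logs = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "logs"))
      .load()

    // Everything older than the two-week window is due for archiving
    // (the literal below stands in for "today minus 14 days")
    val cutoff = java.sql.Timestamp.valueOf("2015-06-01 00:00:00")
    val old = logs.filter(col("log_time") < cutoff)

    // Ship the old rows to S3 as JSON for now (this is the open format question),
    // then delete them from Cassandra in a separate step (e.g. via CQL)
    old.write.json("s3n://my-log-archive/2015-06-01/")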

Apache Spark is designed for exactly this kind of use case, and the storage format to use is Parquet, a columnar format for analytic workloads. It provides column compression and indexing.

Parquet is becoming the de facto standard: many big data platforms are adopting it, or at least providing support for it. You can query it efficiently directly in S3 using Spark SQL, Impala, or Apache Drill, and you can run EMR jobs against it.
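For example, once the archived files are sitting in S3 as Parquet, Spark SQL can query them in place; the bucket path and column names below are just placeholders:

    // Assumes an existing SQLContext (e.g. the one from the DSE Spark shell)
    // and that the archived logs were written as Parquet under this path
    val archived = sqlContext.read.parquet("s3n://my-log-archive/2015-05/")

    // Register the data as a temporary table and run ad-hoc SQL against it,
    // without loading anything back into Cassandra
    archived.registerTempTable("archived_logs")
    val byLevel = sqlContext.sql(
      "SELECT level, count(*) AS cnt FROM archived_logs GROUP BY level")
    byLevel.show()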

To write data out as Parquet using Spark, use DataFrame.saveAsParquetFile.
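For instance, given a DataFrame of the rows to archive (the path below is a placeholder; on Spark 1.4 and later saveAsParquetFile is deprecated in favour of DataFrame.write.parquet):

    // `old` is the DataFrame of rows to archive, e.g. the day-15 logs
    // pulled out of Cassandra as in the sketch in the question
    old.saveAsParquetFile("s3n://my-log-archive/2015-05/")

    // Equivalent on Spark 1.4+:
    // old.write.parquet("s3n://my-log-archive/2015-05/")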

Depending on your specific requirements, you may end up not needing a separate Cassandra instance at all.

You may find this post interesting.

