Google recommended practice for storage/compression in BigQuery?

I am designing storage for very large text files for a pipeline on Google Cloud that must support ANSI SQL. It also needs to support compression and parallel loads from the input locations, following Google recommended practice. I am aware that BigQuery supports only gzip compression for text loads, and Google best practice says Avro is the preferred format for BigQuery; per the documentation, I believe it supports compressed Avro as well. The options I am weighing are:

  • Transform the text files to compressed Avro using Dataflow, then use Cloud Storage with BigQuery permanent linked tables.
  • Compress the text files to gzip using grid computing, then use BigQuery for storage and query.
  • Transform the text files to Avro using Dataflow, and use BigQuery alone for storage and query.

I believe the answer is the first option: transform the text files to compressed Avro using Dataflow. Please help.
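
For reference, here is a minimal sketch of what that Dataflow step could look like with the Apache Beam Python SDK. The bucket paths, the parse_line helper, and the one-field schema are hypothetical placeholders, not anything from the docs; recent Beam SDK versions accept a plain dict as the Avro schema.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical one-field schema; a real pipeline would mirror the text layout.
    SCHEMA = {
        "namespace": "example",
        "type": "record",
        "name": "Line",
        "fields": [{"name": "text", "type": "string"}],
    }

    def parse_line(line):
        # Placeholder parser: wrap each raw line in a record matching SCHEMA.
        return {"text": line}

    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")  # parallel read
            | "Parse" >> beam.Map(parse_line)
            | "Write" >> beam.io.WriteToAvro(
                "gs://my-bucket/avro/part",
                schema=SCHEMA,
                codec="deflate",  # compressed Avro blocks; BigQuery also reads snappy
                file_name_suffix=".avro",
            )
        )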

  • Matthew U
    01-14-2019

    I agree with your assessment that Avro is the preferred choice for compressed files that need to support parallel reads. Without knowing other details, I can't think of a reason why we wouldn't want to use BigQuery for both storage and reading of Avro data since it supports that format natively.
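
    If it helps, loading the compressed Avro straight into BigQuery-managed storage is a short job with the Python client; the bucket, project, dataset, and table names below are made up:

        from google.cloud import bigquery

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

        # Avro is self-describing, so no schema or compression flags are needed;
        # deflate/snappy data blocks are detected automatically.
        load_job = client.load_table_from_uri(
            "gs://my-bucket/avro/*.avro",
            "my-project.my_dataset.events",
            job_config=job_config,
        )
        load_job.result()  # wait for the load to finish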

  • Roshan R
    01-15-2019

    Thanks.

  • Vijay S
    01-20-2019

    I think: transform the text files to compressed Avro using Dataflow, and then use Cloud Storage with BigQuery permanent linked tables.
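
    A rough sketch of the linked-table half with the Python client (project, dataset, and bucket names are placeholders):

        from google.cloud import bigquery

        client = bigquery.Client()

        # Permanent external ("linked") table: BigQuery queries the Avro files
        # in place on Cloud Storage instead of ingesting them.
        external_config = bigquery.ExternalConfig("AVRO")
        external_config.source_uris = ["gs://my-bucket/avro/*.avro"]

        table = bigquery.Table("my-project.my_dataset.events_ext")
        table.external_data_configuration = external_config
        client.create_table(table)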

  • Roshan R
    01-21-2019

    I agree on Dataflow. But why not use BigQuery for both storage and query, since it stores Avro natively?
