What's the best way to convert a text/CSV file into Parquet? # DataSciences (Data Science)
s*h
Post #1
I have text/CSV files that I want to upload into a Cloudera cluster and use
in Spark.
What's the best way to upload and convert a text/CSV file into Parquet format?
To load, should I use the file manager in Hue or SFTP?
To convert, I can think of 3 ways:
A.
In Hive, create an external table over the original file,
then create a new table stored as Parquet?
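For A, I imagine something like this two-step pattern, driven from a Hive-enabled Spark session (table names, columns, and HDFS paths below are just placeholders, not my actual setup):

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquetViaHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet-via-hive")
      .enableHiveSupport() // needed so CREATE TABLE goes through the Hive metastore
      .getOrCreate()

    // 1. External table over the raw CSV files already uploaded to HDFS.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (
        id INT, item STRING, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/me/raw/sales'
    """)

    // 2. Parquet-backed table populated by copying the CSV table's rows.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_parquet
      STORED AS PARQUET
      AS SELECT * FROM sales_csv
    """)

    spark.stop()
  }
}
```

The same two statements could equally be run directly in the Hive shell or Hue's query editor, without Spark in the loop.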
B.
In Spark, use Scala code to convert? Conversion speed might be a concern.
https://developer.ibm.com/hadoop/blog/2015/12/03/parquet-for-sp
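For B, a minimal sketch of what I'd try with the Spark 2.x DataFrame API (on Spark 1.x the spark-csv package would be needed instead); the paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    // Read the raw CSV; header/inferSchema make Spark derive column
    // names and types by sampling the file.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///user/me/raw/sales.csv")

    // Write out as Parquet (Snappy compression is Spark's default codec).
    df.write.mode("overwrite").parquet("hdfs:///user/me/parquet/sales")

    spark.stop()
  }
}
```

Submitted with spark-submit, or pasted line by line into spark-shell.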
C.
Using Apache Drill? Has anyone installed Apache Drill on CDH before?
Conversion speed should be better: https://www.mapr.com/blog/how-convert-csv-file-apache-parquet-using-apache-drill
You need to install Apache Drill first: https://drill.apache.org/docs/installing-drill-on-the-cluster/
With Sqoop it's much easier, since it has the "--as-parquetfile" option.
Thanks!