
How to Choose a Data Format

Tags: Programming
Posted by admin on 2017-04-25 11:05:46


Editor’s Note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here. We’ll also be at DataEngConf on April 28th, talking about data formats; learn more and sign up for our slides here.

It’s easy to become overwhelmed when it comes time to choose a data format. Picture it: you have just built and configured your new Hadoop cluster. Now you must figure out how to load your data. Should you save it as text, or should you try Avro or Parquet? Honestly, the right answer often depends on your data. However, in this post I’ll give you a framework for approaching this choice, and provide some example use cases.




There are many data formats available on HDFS, and your choice can significantly affect your project’s performance and storage requirements. The findings presented here are based on my team’s past experience, and on comparative tests of read and write run times for files in text, Apache Hadoop’s SequenceFile, Apache Avro, Apache Parquet, and ORC formats.

More details about these data formats, including their features and an overview of their structure, can be found here.


Where to start?

There are several considerations that need to be taken into account when trying to determine which data format you should use in your project; here, we discuss the most important ones you will encounter: system specifications, data characteristics, and use case scenarios.


System specifications

Start by looking at the technologies you’ve chosen to use, and their characteristics; this includes tools used for ETL (Extract, Transform, and Load) processes as well as tools used to query and analyze the data. This information will help you figure out which formats you’re able to use.

Not all tools support all of the data formats, and writing additional data parsers and converters will add unwanted complexity to the project. For example, as of the writing of this post, Impala does not offer support for ORC format; therefore, if you are planning on running the majority of your queries in Impala then ORC would not be a good candidate. You can, instead, use the similar RCFile format, or Parquet.

You should also consider the reality of your system. Are you constrained on storage or memory? Some data formats can compress more than others. For example, datasets stored as Parquet and ORC with snappy compression can shrink to a quarter of the size of their uncompressed text counterparts, and Avro with deflate compression can achieve similar results. However, writing into any of these formats is more memory intensive, and you might have to tune your system’s memory settings to allocate more. There are many options that can be tweaked to adapt your system, and it is often desirable to run some tests before fully committing to a format. We will talk about some of the tests you can run in a later section of this post.
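For a concrete sense of these trade-offs, here is a minimal PySpark sketch of how you might write the same dataset in several formats and compare the resulting sizes on disk. The paths and the source CSV are hypothetical, and the Avro write assumes the spark-avro package is on the classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical session; launch with --packages org.apache.spark:spark-avro_2.12:<version>
# if Avro support is not already available.
spark = SparkSession.builder.appName("format-size-check").getOrCreate()

# Any existing dataset will do; a CSV export is used here as a stand-in.
df = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

# Plain-text baseline (uncompressed CSV).
df.write.mode("overwrite").csv("/data/compare/text")

# Columnar formats with snappy compression.
df.write.mode("overwrite").option("compression", "snappy").parquet("/data/compare/parquet")
df.write.mode("overwrite").option("compression", "snappy").orc("/data/compare/orc")

# Row-oriented Avro with deflate compression.
df.write.mode("overwrite").format("avro").option("compression", "deflate").save("/data/compare/avro")
```

Comparing the output directories afterwards (for example with `hdfs dfs -du -s -h /data/compare/*`) gives a quick read on how much space each format needs for your particular data.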



【待悬赏】 赏金: 11元

Characteristics and size of the data

The next consideration is around the data you want to process and store in your system. Let’s look at some of the aspects that can impact performance in a data format.

How is your raw data structured?

Maybe you have plain text or CSV files and you are considering storing them as such. While text files are readable by humans, easy to troubleshoot, and easy to process, they can hurt the performance of your system because they have to be parsed every time. Text files also have an implicit format (each column is a certain value), and if you are not careful about documenting it, this can cause problems down the line.
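If you do stick with text or CSV for a while, one way to keep that implicit format from becoming a problem is to write it down as an explicit schema at read time. A rough PySpark sketch, with made-up column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

spark = SparkSession.builder.appName("csv-with-schema").getOrCreate()

# The implicit format is documented in code: every column has an explicit name and type.
schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=True),
    StructField("event_type", StringType(), nullable=True),
])

# Passing the schema also avoids an extra pass over the file for schema inference on every load.
events = spark.read.csv("/data/raw/events.csv", header=True, schema=schema)
events.printSchema()
```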

If your data is in XML or JSON format, then you might run into issues with file splittability in HDFS. Splittability determines whether parts of a file can be processed independently, which in turn enables parallel processing in Hadoop; if your data is not splittable, you lose the parallelism that allows fast queries. More advanced data formats (Sequence, Avro, Parquet, ORC) offer splittability regardless of the compression codec.
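One rough way to observe this is to check how many input partitions Spark creates for a gzip-compressed JSON file versus the same data in a splittable format. The paths below are hypothetical and the exact counts depend on file sizes and configuration, so treat this only as a probe:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-probe").getOrCreate()

# A single large .json.gz file is not splittable, so it usually comes back as one
# partition, and a single task does all the work.
gz = spark.read.json("/data/raw/events.json.gz")
print("gzip JSON partitions:", gz.rdd.getNumPartitions())

# The same data as Parquet stays splittable even when compressed, so it can be
# read by many tasks in parallel.
pq = spark.read.parquet("/data/compare/parquet")
print("Parquet partitions:", pq.rdd.getNumPartitions())
```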

What does your pipeline look like, and what steps are involved?

Some of the file formats were optimized to work in certain situations. For example, Sequence files were designed to easily share data between MapReduce (MR) jobs, so if your pipeline involves MR jobs then Sequence files are an excellent option. In the same vein, columnar data formats such as Parquet and ORC were designed to optimize query times; if the final stage of your pipeline needs to be optimized, using a columnar file format will increase the speed of querying the data.
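For the job-to-job case, a SequenceFile holds binary key-value pairs that the next job can consume without re-parsing text. A small hypothetical hand-off in PySpark (paths made up) might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sequencefile-handoff").getOrCreate()
sc = spark.sparkContext

# Stage 1: one job emits (key, value) pairs and writes them out as a SequenceFile.
pairs = sc.parallelize([("user1", 3), ("user2", 7), ("user3", 1)])
pairs.saveAsSequenceFile("/data/intermediate/user_counts")

# Stage 2: a later MapReduce or Spark job picks the pairs up directly.
reloaded = sc.sequenceFile("/data/intermediate/user_counts")
print(reloaded.collect())
```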

How many columns are being stored and how many columns are used for the analysis?

Columnar data formats like Parquet and ORC offer an advantage in querying speed when you store many columns but only need a few of them for your analysis, because they only have to read the columns a query actually touches. However, that advantage is forgone if you still need all the columns during a query, in which case you should experiment within your system to find the fastest alternative. Another advantage of columnar files is in the way they compress the data, which saves both space and time.
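To make the difference visible, you can time a query that touches only a couple of columns of a wide Parquet table against an operation that has to materialize every column. A rough sketch, with hypothetical paths and column names:

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

# A wide table: many columns stored, only a couple needed for the analysis.
wide = spark.read.parquet("/data/compare/parquet_wide")

# Only the referenced column chunks are read from disk (projection pushdown).
t0 = time.time()
wide.select("user_id", "amount").agg(F.sum("amount")).show()
print("two-column aggregate: %.1f s" % (time.time() - t0))

# Copying the full table has to read every column chunk, so most of the
# columnar advantage disappears.
t0 = time.time()
wide.write.mode("overwrite").parquet("/tmp/wide_copy")
print("full-width copy: %.1f s" % (time.time() - t0))
```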

Does your data change over time? If it does, how often does it happen and how does it change?

Knowing whether your data changes often is important, because then we have to consider how a data format handles schema evolution. Schema evolution is the term used when the structure of a file changes after data has previously been stored with a different structure; such changes can include changing the data type of a column, adding columns, and removing columns. Text files do not explicitly store the schema, so when a new person joins the project it is up to them to figure out what columns and column values the data has. If your data changes suddenly (columns added, columns deleted, data types changed), then you need to figure out how to reconcile older and newer data with the format.

Certain file formats handle schema evolution more elegantly than others. For example, at the moment Parquet only allows new columns to be added at the end of the existing columns, and it doesn’t handle deletion of columns, whereas Avro allows the addition, deletion, and renaming of multiple columns. If you know your data is bound to change often (maybe developers add new metrics every few months to help track usage of an app), then Avro is a good option. If your data doesn’t change often or won’t change, schema evolution is not needed.
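As a concrete illustration of Avro-style evolution, here is a small sketch using the fastavro library (the record and field names are made up): records written with an original schema are read back with a newer schema that adds a column with a default value.

```python
from fastavro import writer, reader, parse_schema

# Original schema: two columns.
schema_v1 = parse_schema({
    "name": "Event", "type": "record",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "event_type", "type": "string"},
    ],
})

# Evolved schema: a new column has been added, with a default for older records.
schema_v2 = parse_schema({
    "name": "Event", "type": "record",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "event_type", "type": "string"},
        {"name": "app_version", "type": "string", "default": "unknown"},
    ],
})

# Write old-style records with the v1 schema.
with open("events.avro", "wb") as out:
    writer(out, schema_v1, [{"user_id": 1, "event_type": "click"}])

# Read them back with the v2 schema: the default fills in the missing column.
with open("events.avro", "rb") as inp:
    for record in reader(inp, reader_schema=schema_v2):
        print(record)  # {'user_id': 1, 'event_type': 'click', 'app_version': 'unknown'}
```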

An additional thing to keep in mind with schema evolution is the trade-off of keeping track of the newer schemas. If the schema for a data format like Avro or Parquet needs to be specified explicitly (rather than extracted from the data), then more effort goes into creating and storing the schema files.




Use case scenarios

Each of the data formats has its own strengths, weaknesses, and trade-offs, so the decision on which format to use should be based on your specific use cases and systems.

If your main focus is to be able to write data as fast as possible and you have no concerns about space, then it might be acceptable to just store your data in text format with the understanding that query times for large data sets will be longer.

If your main concern is being able to handle evolving data in your system, then you can rely on Avro to save schemas. Keep in mind, though, that when writing files to the system Avro requires a predefined schema, which might involve some additional processing at the beginning.

Finally, if your main use case is analysis of the data and you would like to optimize the performance of the queries, then you might want to take a look at a columnar format such as Parquet or ORC because they offer the best performance in queries, particularly for partial searches where you are only reading specific columns. However, the speed advantage might decrease if you are reading all the columns.

There is a pattern in the use cases mentioned above: if a format takes longer to write, it is usually because it has been optimized to increase speed during reads.




Tests

We have discussed how several factors go into choosing the right data format for your system. To give a more complete picture, we ran empirical comparisons of how the different data formats perform when reading and writing files. On HDFS, we created a set of quantitative tests comparing the following five data formats:

- Text

- Sequence

- Avro

- Parquet

- ORC

We used the following technologies to run different exploratory queries and measure computation times:

- Hive

- Impala

We tested three different data sets:

- A narrow data set: 10,000,000 rows and 10 columns, similar to an Apache log file

- A wide data set: 4,000,000 rows and 1,000 columns; the first few columns contain artificially defined data, and the rest are populated with random numbers and boolean values

- A huge wide data set: 1 TB of data containing 302,924,000 records

Click here for the full results and the corresponding code so you can try these tests on your own system.
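The sketch below is not the benchmark code from the post (that is behind the link above); it is only a minimal PySpark outline of how such write and read timings could be gathered on your own data. Paths and the source dataset are hypothetical, the Avro case assumes the spark-avro package is available, and SequenceFile is omitted because it goes through the RDD API rather than the DataFrame writer.

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

# Use one of your own datasets as the source for the comparison.
df = spark.read.csv("/data/raw/narrow.csv", header=True, inferSchema=True)

writers = {
    "text":    lambda path: df.write.mode("overwrite").csv(path),
    "avro":    lambda path: df.write.mode("overwrite").format("avro").save(path),
    "parquet": lambda path: df.write.mode("overwrite").parquet(path),
    "orc":     lambda path: df.write.mode("overwrite").orc(path),
}

# Time how long each format takes to write.
for name, write in writers.items():
    path = "/data/benchmark/%s" % name
    t0 = time.time()
    write(path)
    print("%-8s write: %6.1f s" % (name, time.time() - t0))

readers = {
    "text":    lambda path: spark.read.csv(path),
    "avro":    lambda path: spark.read.format("avro").load(path),
    "parquet": lambda path: spark.read.parquet(path),
    "orc":     lambda path: spark.read.orc(path),
}

# Force a full scan of each copy and time the read side.
for name, read in readers.items():
    path = "/data/benchmark/%s" % name
    t0 = time.time()
    read(path).count()
    print("%-8s read:  %6.1f s" % (name, time.time() - t0))
```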
