
Large number of files in a directory vs. large number of directories with fewer files

Hi,

I am trying to figure out a partitioning strategy in Hive. I'm considering
either monthly or daily partitions. My usage (querying etc.) points me towards
the daily partition scheme, but I'm not sure what HDFS or NameNode limitations
would apply to it.

With daily partitions I would have one 3-4 GB file in each partition, so over
2 years I would end up with 700-odd directories holding one file each. With
monthly partitions, by contrast, I would have 24 directories, each holding
30 or 31 files of about 4 GB each.
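
To put rough numbers on the two layouts, here is a minimal sketch that counts the objects each one would add to the NameNode. It assumes 730 daily files of 4 GB each and a 128 MB HDFS block size (an assumption; check dfs.blocksize on your cluster). The helper name namespace_objects is made up for illustration.

```python
import math

def namespace_objects(num_dirs, num_files, file_size_mb, block_size_mb=128):
    """Count the directories, files and blocks a layout adds to the NameNode."""
    # Assumed block size of 128 MB; each file is split into ceil(size / block) blocks.
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    num_blocks = num_files * blocks_per_file
    return {
        "dirs": num_dirs,
        "files": num_files,
        "blocks": num_blocks,
        "total": num_dirs + num_files + num_blocks,
    }

DAYS = 730           # assumed: 2 years, one file per day
FILE_MB = 4 * 1024   # assumed: 4 GB per file

# Daily scheme: one directory per day. Monthly scheme: 24 directories.
daily = namespace_objects(DAYS, DAYS, FILE_MB)
monthly = namespace_objects(24, DAYS, FILE_MB)

print("daily  :", daily)
print("monthly:", monthly)
print("extra objects for daily:", daily["total"] - monthly["total"])
```

Under these assumptions both layouts hold the same files and blocks, and the daily scheme only adds 706 directory entries on top of roughly 24,000 objects, so the difference in NameNode load looks small at this scale.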

Most of my queries filter on a date range, and I was thinking daily
partitions would be more effective, since a query would not have to scan all
the files for the month as it would with monthly partitions.
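
As a rough illustration of that reasoning, the sketch below estimates how much data a date-range query would read under each scheme. It assumes one 4 GB file per day and that every file in a matching partition is read in full (no skipping inside a partition); the function names and the example dates are hypothetical.

```python
import calendar
from datetime import date, timedelta

FILE_GB = 4  # assumed size of each day's file

def days_in_range(start, end):
    """Every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

def gb_scanned_daily(start, end):
    """Daily partitions: only the days inside the range are read."""
    return len(days_in_range(start, end)) * FILE_GB

def gb_scanned_monthly(start, end):
    """Monthly partitions: every file of each month the range touches is read."""
    months = {(d.year, d.month) for d in days_in_range(start, end)}
    files = sum(calendar.monthrange(y, m)[1] for y, m in months)
    return files * FILE_GB

start, end = date(2013, 3, 10), date(2013, 3, 16)  # hypothetical one-week query
print(gb_scanned_daily(start, end))    # 28
print(gb_scanned_monthly(start, end))  # 124
```

For this one-week example the daily layout reads 7 files while the monthly layout reads all 31 files for March, which is the gap I am hoping to avoid.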

I would like to know what other considerations I should think about before
making a decision:

1) NameNode/HDFS limitations
2) Archiving files
3) Compression

and maybe more.

I would really appreciate any input on this.

Thanks
Kishore
