Hey All,
I've got a cluster with 5 data nodes (plus 2 master nodes). The cluster has ~100
indices, with doc counts in the 1k-50k range. There is a low-to-medium amount
of indexing load going into the cluster via the bulk API, and a large amount of
search traffic, in the 40K queries per second range.
I'm running these data nodes on EC2 (c3.8xlarge) with a 30GB heap, though at
the time of the following sample I was testing a 20GB heap. The process runs
well for a while, a couple of hours to a day or two depending on traffic, and
then it gets into a bad state where it does continual long GC runs, i.e. every
minute or so a stop-the-world collection of 30-45 seconds that seemingly
accomplishes very little (e.g. heap usage starts at 18.8GB and only drops to
18.3GB).
In the graph below, the red line is a data node exhibiting the behavior: the
"old" generation grows to nearly the full heap size and then stays there for
hours. During this time the application is severely degraded.
<https://lh4.googleusercontent.com/-JXEVIJVBDDY/U8kyUY7hhyI/AAAAAAAACBo/dceW7JJGKiA/s1600/Screen+Shot+2014-07-18+at+10.37.44+AM.png>
Here's an example of one of the GC runs during this period (again, they happen
every minute or so):
[2014-07-18 00:24:24,735][WARN ][monitor.jvm ] [prod-targeting-es2]
[gc][old][10799][27] duration [41.5s], collections [1]/[42.5s], total
[41.5s]/[2.2m], memory [18.8gb]->[18.3gb]/[19.8gb], all_pools {[young]
[733.2mb]->[249.9mb]/[1.4gb]}{[survivor] [86mb]->[0b]/[191.3mb]}{[old]
[18gb]->[18.1gb]/[18.1gb]}
We are running ES 1.2.2. We had been running Oracle JDK 7u25 and have tried
upgrading to 7u65 with no effect. I just did a heap dump analysis using jmap
and Eclipse Memory Analyzer and found that 85% of the heap was taken up by the
filter cache:
<https://lh4.googleusercontent.com/-KZ8SJD-o32M/U8kzdtC0KhI/AAAAAAAACBw/TeWTvmOc1rc/s1600/Screen+Shot+2014-07-18+at+1.34.44+AM.png>
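For reference, here is a rough sketch of how the filter cache growth can be
watched over time via the nodes stats API instead of pulling a full heap dump
each time (this assumes HTTP access on localhost:9200 and the Python requests
library; adjust the host/port for your own cluster):

# Rough sketch: poll per-node filter cache memory via the nodes stats API.
# Assumes ES is reachable over HTTP on localhost:9200 and `requests` is installed.
import time
import requests

STATS_URL = "http://localhost:9200/_nodes/stats/indices"  # adjust host/port as needed

def print_filter_cache_sizes():
    resp = requests.get(STATS_URL, timeout=10)
    resp.raise_for_status()
    for node_id, node in resp.json()["nodes"].items():
        fc = node["indices"]["filter_cache"]
        mb = fc["memory_size_in_bytes"] / (1024.0 * 1024.0)
        print("%-25s filter_cache=%8.1fMB evictions=%d"
              % (node.get("name", node_id), mb, fc.get("evictions", 0)))

if __name__ == "__main__":
    while True:
        print_filter_cache_sizes()
        time.sleep(60)  # sample once a minute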
We use a lot of "bool" conditions in our queries, so that may be a factor in
the hefty filter cache.
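To give a concrete picture, here is roughly the shape of the queries, sketched
as a Python dict (the field names are made up). Since ES 1.x filters accept a
"_cache" flag, one experiment would be to mark the per-request, one-off filters
as non-cached so they don't pile up in the filter cache:

# Hypothetical example of the query shape; field names are illustrative only.
query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "bool": {
                    "must": [
                        # reused across many requests -> caching is fine
                        {"term": {"status": "active"}},
                        # unique per request (e.g. a per-user id list) -> skip the cache
                        {"terms": {"user_id": [101, 202, 303], "_cache": False}},
                    ]
                }
            },
        }
    }
}

That's just a sketch, though; I haven't yet confirmed that these per-request
filters are what is actually filling the cache.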
Any ideas out there? Right now I have to bounce my data nodes every hour or
two to make sure they don't reach this degraded state.