The following topics are covered in this article:
- What are C-states, cstates, or C-modes?
- How can I disable processor sleep states?
- How to read and interpret /dev/cpu_dma_latency?
- What is the maximum C-state allowed for my CPU?
- How do I check the existing latency value for different C-states?
- How to check and monitor the CPU c-state usage in Linux per CPU and core?
- What is the POLL idle state?
- Why might the OS ignore BIOS settings?
- How to check the currently loaded driver?
What are C-states, cstates, or C-modes?
A CPU has several power modes, selected according to how busy it currently is; collectively they are called “C-states” or “C-modes”.
A lower-power mode was first introduced with the 486DX4 processor. Since then, more power modes have been introduced, and each mode has been enhanced so that the CPU consumes less power while in it.
- Each state of the CPU uses a different amount of power and affects application performance differently.
- Whenever a CPU core is idle, the built-in power-saving logic kicks in and tries to move the core from its current C-state to a higher C-state, turning off various processor components to save power.
- Keep in mind, however, that whenever work is scheduled on an idle CPU, that CPU has to return from its deeper sleep state to the running state, and the deeper the state, the longer it takes to wake up and be fully operational again. The wake-up also has to happen in an atomic context, so that nothing tries to use the core while it is being powered up.
- The various modes the CPU transitions between are called C-states.
- They start at C0, which is the normal CPU operating mode, i.e. the CPU is 100% turned on.
- The higher the C number, the deeper the sleep mode: more circuits and signals are turned off, and the CPU needs more time to return to C0, i.e. to wake up.
- Each mode is also known by a name, and several of them have sub-modes with different power-saving levels (and therefore different wake-up times).
The table below explains all the CPU C-states and their meaning.
How can I disable processor sleep states?
Latency-sensitive applications should keep the processor out of the deeper C-states, because of the delay incurred when returning from them to C0. These delays can range from hundreds of microseconds to milliseconds.
There are several ways to achieve this.
Method 1
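Booting with the kernel command-line argument processor.max_cstate=0 prevents the system from entering any C-state other than C0. Note that this parameter belongs to the ACPI processor (acpi_idle) driver; systems running intel_idle are typically controlled through intel_idle.max_cstate instead.
Add the argument to your GRUB2 configuration by appending "processor.max_cstate=0" to GRUB_CMDLINE_LINUX, as shown below: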
GRUB_CMDLINE_LINUX="novga console=ttyS0,115200 panic=1 numa=off elevator=cfq rd.md.uuid=f6015b65:f15bf68d:7abf04cc:e53fa9a2 rd.lvm.lv=os/root rd.md.uuid=a66dd4fd:9bf06835:5c2bc8df:f150487f rd.md.uuid=84bfe346:bb18024a:054d652a:d7678fa4 processor.max_cstate=0"
Regenerate the GRUB2 configuration (for example with grub2-mkconfig) so that the new argument is picked up.
Reboot the node to activate the change.
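After the reboot it is worth confirming that the argument actually reached the running kernel. The following is a minimal sketch, assuming a POSIX shell with awk available. The helper name cmdline_param is made up for this example, and its optional second argument exists only so the function can be tried against a sample file instead of /proc/cmdline.

```shell
# Print the value of a kernel command-line parameter.
# $1 = parameter name, $2 = file to inspect (default: /proc/cmdline)
# Returns non-zero if the parameter is not present.
# NOTE: cmdline_param is a hypothetical helper, not an existing tool.
cmdline_param() {
    awk -v name="$1" '
        {
            for (i = 1; i <= NF; i++) {
                # accept "name=" only at the very start of a word
                if (index($i, name "=") == 1) {
                    print substr($i, length(name) + 2)
                    found = 1
                    exit
                }
            }
        }
        END { exit !found }
    ' "${2:-/proc/cmdline}"
}
```

On a node booted with the GRUB line above, `cmdline_param processor.max_cstate` should print 0; a non-zero exit status means the argument is missing from the running kernel's command line.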
Method 2
- The second method is to use the Power Management Quality of Service (PM QoS) interface.
- The file /dev/cpu_dma_latency is the interface: opening it registers a quality-of-service latency request with the operating system.
- A program should open /dev/cpu_dma_latency, write a 32-bit number to it representing the maximum response time in microseconds, and then keep the file descriptor open for as long as low-latency operation is desired. Writing zero means that you want the fastest response time possible.
- Several tuned profiles, for example network-latency and latency-performance, do this: tuned keeps the file open and writes the value configured in the profile.
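The open, write, keep-open sequence described above can be sketched in shell as follows. This is an illustration under stated assumptions rather than how tuned does it: with_cpu_latency is a made-up helper, QOS_DEV is a hypothetical override used only to point the function at a scratch file, and writing to the real device normally requires root. The kernel expects the four bytes to arrive in one write, which a shell cannot strictly guarantee, so a production tool would more likely do this from a language with direct control over write().

```shell
# Run a command while holding a PM QoS latency request.
# $1 = maximum tolerated latency in microseconds; remaining args = command.
# QOS_DEV is a hypothetical override (not a kernel or tuned setting);
# it defaults to the real device.
# The function body is a subshell, so descriptor 3 is closed (and the
# request dropped) automatically when the command finishes.
with_cpu_latency() (
    usec=$1
    shift
    # The request stays active exactly as long as descriptor 3 is open.
    exec 3> "${QOS_DEV:-/dev/cpu_dma_latency}" || exit 1
    # Emit the value as a raw 32-bit little-endian integer (4 bytes).
    # The inner printf builds octal escapes such as \001\000\000\000,
    # the outer printf turns them into the actual bytes.
    printf "$(printf '\\%03o\\%03o\\%03o\\%03o' \
        $(( usec & 255 )) $(( usec >> 8 & 255 )) \
        $(( usec >> 16 & 255 )) $(( usec >> 24 & 255 )))" >&3
    "$@"
)
```

For example, `with_cpu_latency 0 ./my_benchmark` (my_benchmark being a placeholder) would ask for the fastest possible response time for the duration of that one program only.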
Below is a snippet from the latency-performance tuned profile:
force_latency=1
As the lsof output below shows, tuned keeps this file open for as long as it is running:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tuned 1543 root 8w CHR 10,61 0t0 1192 /dev/cpu_dma_latency
These profiles set force_latency to 1 so that the CPU never enters a C-state deeper than C1.
How to read and interpret /dev/cpu_dma_latency?
If we read this file with a normal text tool, the output looks something like this:
▒5w
Since the value is raw binary (not encoded as text), read it with a tool such as hexdump; the output below is in the format produced by hexdump -C:
00000000 00 94 35 77 |..5w|
00000004
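Read as a little-endian 32-bit integer, the bytes 00 94 35 77 are 0x77359400, which in decimal is:
2000000000
The value is in microseconds, so the current latency limit is 2000000000 microseconds, i.e. 2000 seconds. This is the maximum wake-up latency the system has been asked to tolerate, not the time a CPU actually needs to return to C0. A limit this large is the PM QoS default and effectively means that no constraint is in place, so the CPU may use any C-state.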
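The decoding step can be scripted. A minimal sketch, assuming od is available and the caller is allowed to read the device (usually root); read_cpu_latency is a hypothetical helper name, and its optional argument exists only so the function can be tried on a sample file.

```shell
# Print the latency limit stored in /dev/cpu_dma_latency as a decimal
# number of microseconds.
# $1 = file to read (default: /dev/cpu_dma_latency)
# NOTE: read_cpu_latency is a made-up helper for illustration. It treats
# the value as unsigned, which is fine for the non-negative limits
# discussed here.
read_cpu_latency() {
    # od prints the first four bytes as unsigned decimals, e.g. "0 148 53 119"
    set -- $(od -An -tu1 -N4 "${1:-/dev/cpu_dma_latency}")
    [ "$#" -eq 4 ] || return 1
    # Little-endian: the first byte is the least significant one.
    echo $(( $1 | $2 << 8 | $3 << 16 | $4 << 24 ))
}
```

For the bytes 00 94 35 77 shown above this evaluates 0 + 148*256 + 53*65536 + 119*16777216, which is 2000000000.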
Now let us see what happens when we set a tuned profile with force_latency=1.
For example, here I set the tuned profile to network-latency (tuned-adm profile network-latency).
Check the active profile (tuned-adm active):
Current active profile: network-latency
Now let's check the latency value:
00000000 01 00 00 00 |....|
00000004
As you can see, the latency value has changed to 1 microsecond.
What is the maximum C-state allowed for my CPU?
There are multiple CPU C-states, as you can see in the table above, but the maximum C-state allowed for a given processor depends on the latency values and on any max_cstate value provided in GRUB.
The max_cstate parameter file of the idle driver should give the value for your node:
9
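The file name is not preserved above. A minimal sketch, assuming the value comes from the max_cstate module parameter that the idle drivers expose under /sys/module; whether each file exists and is readable depends on the driver in use and on the kernel version. show_max_cstate is a made-up helper, and its optional argument exists only for trying it against a mock directory tree.

```shell
# Print the max_cstate module parameter of each idle driver that exposes one.
# $1 = module directory to inspect (default: /sys/module)
# NOTE: show_max_cstate is a hypothetical helper for illustration.
show_max_cstate() (
    for mod in intel_idle processor; do
        f="${1:-/sys/module}/$mod/parameters/max_cstate"
        # Skip drivers that are not loaded or do not expose the file.
        if [ -r "$f" ]; then
            printf '%s.max_cstate=%s\n' "$mod" "$(cat "$f")"
        fi
    done
)
```

If nothing is printed, neither file is readable on that system.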
How do I check the existing latency value for different C-states?
The exit latency differs from one C-state to another: the deeper the C-state, the longer the transition back to C0.
The commands below print the latency value (in microseconds) of each C-state of one CPU, here cpu0:
# cd /sys/devices/system/cpu/cpu0/cpuidle
# for state in state{0..4} ; do echo c-$state $(cat $state/name) $(cat $state/latency) ; done
c-state0 POLL 0
c-state1 C1-HSW 2
c-state2 C1E-HSW 10
c-state3 C3-HSW 33
c-state4 C6-HSW 133
The same values can be read for any other CPU by changing the CPU number in the path (/sys/devices/system/cpu/cpuN/cpuidle).
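To avoid repeating the loop once per CPU, the same sysfs files can be walked for every CPU in one pass. A minimal sketch, assuming the standard cpuidle layout under /sys/devices/system/cpu; list_cstates is a made-up helper, and its optional argument exists only for trying it against a mock directory tree.

```shell
# Print "cpu state name latency" for every C-state of every CPU.
# $1 = CPU sysfs directory (default: /sys/devices/system/cpu)
# NOTE: list_cstates is a hypothetical helper for illustration.
list_cstates() (
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        # An unmatched glob stays literal, so skip anything unreadable.
        [ -r "$dir/name" ] || continue
        cpu=${dir#"$root"/}   # strip the leading directory ...
        cpu=${cpu%%/*}        # ... and everything after "cpuN"
        printf '%s %s %s %s\n' "$cpu" "${dir##*/}" \
            "$(cat "$dir/name")" "$(cat "$dir/latency")"
    done
)
```

The entries are printed in glob (lexical) order, so cpu10 appears before cpu2; pipe the output through sort -V if numeric order matters.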
How to check and monitor the CPU c-state usage in Linux per CPU and core?
You can use the turbostat tool for this purpose; it reports the runtime C-state residency of every available CPU and core.
I will use turbostat to monitor the CPU C-states and stress to put some load on the CPUs.
To install these RPMs you can use:
# yum install stress
For example
Case 1: Using throughput-performance tuned profile
To check the currently active profile
Current active profile: throughput-performance
With this profile the latency value is the default, i.e. 2000000000 microseconds (2000 seconds):
00000000 00 94 35 77 |..5w|
00000004
Check the output using turbostat
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_%RAM_%
- - 6 0.34 1754 2597 2963 640 1.24 0.07 98.35 0.00 54 61 29.33 6.65 0.00 0.00
0 0 5 0.30 1817 2597 116 40 0.76 0.06 98.88 0.00 51 61 15.36 2.62 0.00 0.00
1 8 7 0.39 1722 2597 253 40 1.84 0.08 97.69 0.00 52
2 1 5 0.28 1786 2597 97 40 1.04 0.04 98.64 0.00 51
3 9 4 0.22 1811 2597 45 40 0.45 0.00 99.32 0.00 51
4 2 5 0.29 1883 2597 86 40 0.69 0.06 98.96 0.00 53
5 10 4 0.22 1830 2597 39 40 0.46 0.00 99.31 0.00 52
6 3 7 0.39 1682 2597 279 40 1.67 0.07 97.87 0.00 54
7 11 7 0.39 1762 2597 200 40 1.79 0.08 97.75 0.00 51
0 4 8 0.43 1837 2597 268 40 1.59 0.07 97.91 0.00 37 49 13.97 4.03 0.00 0.00
1 12 7 0.39 1734 2597 251 40 1.49 0.10 98.02 0.00 40
2 5 5 0.27 1727 2597 84 40 0.64 0.06 99.03 0.00 39
3 13 5 0.27 1837 2597 70 40 0.58 0.03 99.12 0.00 40
4 6 6 0.32 1775 2597 164 40 1.07 0.04 98.56 0.00 40
5 14 6 0.37 1675 2597 234 40 1.44 0.07 98.13 0.00 40
6 7 7 0.43 1735 2597 299 40 1.75 0.15 97.68 0.00 39
7 15 9 0.56 1634 2597 478 40 2.63 0.16 96.66 0.00 38
As you can see, all the available CPUs and cores spend almost all of their time (about 98%) in C6 because they are idle. When I start putting load on the system with stress, the CPUs begin to transition from C6 to C0, and the C6 residency drops as more CPUs are in the running state:
- - 384 13.84 2782 2594 16172 640 2.14 0.17 83.84 0.00 54 58 42.87 8.42 0.00 0.00
0 0 419 15.09 2790 2590 896 40 1.19 0.08 83.64 0.00 50 58 21.18 3.16 0.00 0.00
1 8 255 9.21 2778 2590 1073 40 4.91 0.55 85.34 0.00 51
2 1 439 15.76 2793 2591 892 40 1.29 0.05 82.90 0.00 54
3 9 441 15.81 2800 2591 997 40 0.64 0.02 83.53 0.00 53
4 2 439 15.74 2797 2592 890 40 0.80 0.06 83.39 0.00 54
5 10 258 9.39 2758 2594 1118 40 5.34 0.41 84.86 0.00 51
6 3 317 11.43 2780 2594 962 40 3.47 0.32 84.78 0.00 52
7 11 327 11.86 2764 2594 1236 40 5.00 0.41 82.73 0.00 50
0 4 39 1.46 2660 2594 485 40 2.31 0.22 96.01 0.00 37 47 21.69 5.26 0.00 0.00
1 12 461 16.68 2767 2594 1314 40 2.69 0.16 80.47 0.00 46
2 5 465 16.68 2791 2595 944 40 0.86 0.08 82.38 0.00 41
3 13 458 16.50 2779 2595 1067 40 1.32 0.14 82.04 0.00 46
4 6 463 16.63 2788 2596 1243 40 0.99 0.07 82.31 0.00 46
5 14 452 16.31 2778 2596 1001 40 1.27 0.11 82.31 0.00 46
6 7 462 16.58 2789 2596 1023 40 0.77 0.05 82.60 0.00 44
7 15 452 16.29 2776 2597 1031 40 1.45 0.07 82.19 0.00 41
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_%RAM_%
- - 2428 86.63 2804 2599 85363 656 6.08 0.96 6.33 0.00 57 60 119.27 17.04 0.00 0.00
0 0 2377 84.85 2802 2600 5756 41 9.47 1.09 4.59 0.00 55 60 55.56 6.59 0.00 0.00
1 8 1835 65.48 2801 2602 5742 41 20.04 2.11 12.37 0.00 54
2 1 2802 99.93 2803 2601 5037 41 0.07 0.00 0.00 0.00 57
3 9 2802 99.93 2803 2601 5035 41 0.07 0.00 0.00 0.00 56
4 2 2802 99.94 2803 2600 5044 41 0.06 0.00 0.00 0.00 57
5 10 1992 71.12 2802 2598 5688 41 16.62 1.77 10.50 0.00 54
6 3 2799 99.94 2803 2599 5049 41 0.06 0.00 0.00 0.00 57
7 11 1914 68.39 2801 2598 5720 41 18.45 2.09 11.07 0.00 51
0 4 2066 73.79 2800 2600 5335 41 9.85 2.19 14.17 0.00 46 53 63.72 10.45 0.00 0.00
1 12 2803 99.86 2807 2600 5088 41 0.14 0.00 0.00 0.00 52
2 5 656 23.46 2800 2597 3312 41 21.81 6.10 48.63 0.00 45
3 13 2799 99.86 2807 2597 5610 41 0.14 0.00 0.00 0.00 53
4 6 2799 99.86 2807 2597 7143 41 0.14 0.00 0.00 0.00 51
5 14 2799 99.86 2807 2597 5044 41 0.14 0.00 0.00 0.00 50
6 7 2799 99.86 2807 2597 5679 41 0.14 0.00 0.00 0.00 50
7 15 2799 99.86 2807 2597 5081 41 0.14 0.00 0.00 0.00 48
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp PkgWatt RAMWatt PKG_%RAM_%
- - 2421 86.42 2807 2595 84373 656 6.28 1.07 6.23 0.00 59 62 120.52 17.00 0.00 0.00
0 0 2798 99.83 2808 2595 5039 41 0.17 0.00 0.00 0.00 57 62 55.92 6.54 0.00 0.00
1 8 1891 67.58 2803 2595 5151 41 16.92 2.72 12.78 0.00 55
2 1 2798 99.83 2808 2595 5032 41 0.17 0.00 0.00 0.00 59
3 9 2798 99.83 2808 2595 6068 41 0.17 0.00 0.00 0.00 58
4 2 2798 99.83 2808 2595 5041 41 0.17 0.00 0.00 0.00 58
5 10 1527 54.56 2804 2595 5540 41 24.02 3.73 17.70 0.00 56
6 3 2793 99.83 2808 2590 5045 41 0.17 0.00 0.00 0.00 58
7 11 1692 60.57 2804 2590 5556 41 20.66 3.24 15.53 0.00 54
0 4 1425 50.99 2800 2595 5251 41 19.20 4.24 25.57 0.00 48 57 64.60 10.46 0.00 0.00
1 12 2799 99.85 2809 2595 5053 41 0.15 0.00 0.00 0.00 54
2 5 2799 99.84 2809 2595 5054 41 0.16 0.00 0.00 0.00 53
3 13 1419 50.79 2800 2595 4642 41 17.88 3.22 28.11 0.00 49
4 6 2799 99.85 2809 2595 5059 41 0.15 0.00 0.00 0.00 55
5 14 2799 99.84 2809 2595 5047 41 0.16 0.00 0.00 0.00 53
6 7 2799 99.84 2809 2595 6206 41 0.16 0.00 0.00 0.00 53
7 15 2801 99.84 2809 2597 5589 41 0.16 0.00 0.00 0.00 50
Towards the end, as you can see, Busy% increases and the C6 residency drops, which means the CPUs are now in the running state.
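When watching a long run it can help to reduce each turbostat sample to its summary row. A minimal sketch, assuming output in the column layout shown above (a header row starting with "Core" and a summary row starting with "-"); summarize_turbostat is a made-up helper, not a turbostat option.

```shell
# Reduce turbostat output on stdin to one line per sample, taken from the
# system-wide summary row (the row whose Core column is "-").
# Columns are located by header name, so their position may vary.
# NOTE: summarize_turbostat is a hypothetical helper for illustration.
summarize_turbostat() {
    awk '
        # Header row: remember the position of every column name.
        $1 == "Core" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Summary row: print the busy percentage and C1/C6 residency.
        $1 == "-" && ("Busy%" in col) {
            printf "busy=%s c1=%s c6=%s\n",
                $(col["Busy%"]), $(col["CPU%c1"]), $(col["CPU%c6"])
        }
    '
}
```

Pipe turbostat's output into the function; depending on the turbostat version the statistics may be written to stderr, in which case add 2>&1 before the pipe. For the idle sample above it prints busy=0.34 c1=1.24 c6=98.35.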
Case 2: Change tuned profile to latency-performance
Current active profile: latency-performance
Next monitor the CPU c-state when the system is idle
- - 61 2.17 2800 2597 2923 656 97.83 0.00 0.00 0.00 68 74 78.78 6.14 0.00 0.00
0 0 363 13.00 2800 2597 56 41 87.00 0.00 0.00 0.00 65 74 39.31 2.22 0.00 0.00
1 8 4 0.14 2800 2597 9 41 99.86 0.00 0.00 0.00 68
2 1 4 0.14 2800 2597 23 41 99.86 0.00 0.00 0.00 66
3 9 61 2.17 2800 2597 211 41 97.83 0.00 0.00 0.00 66
4 2 5 0.18 2800 2597 93 41 99.82 0.00 0.00 0.00 67
5 10 4 0.14 2800 2597 20 41 99.86 0.00 0.00 0.00 66
6 3 4 0.15 2800 2597 25 41 99.85 0.00 0.00 0.00 68
7 11 8 0.28 2800 2597 337 41 99.72 0.00 0.00 0.00 64
0 4 4 0.16 2800 2597 68 41 99.84 0.00 0.00 0.00 57 66 39.46 3.93 0.00 0.00
1 12 4 0.14 2800 2597 34 41 99.86 0.00 0.00 0.00 58
2 5 5 0.18 2800 2597 134 41 99.82 0.00 0.00 0.00 58
3 13 38 1.36 2800 2597 928 41 98.64 0.00 0.00 0.00 59
4 6 433 15.50 2800 2597 35 41 84.50 0.00 0.00 0.00 59
5 14 7 0.24 2800 2597 375 41 99.76 0.00 0.00 0.00 59
6 7 4 0.14 2800 2597 17 41 99.86 0.00 0.00 0.00 58
7 15 21 0.74 2800 2597 558 41 99.26 0.00 0.00 0.00 55
As you can see, even when the CPUs and cores are sitting idle they do not transition to deeper C-states, because the profile forces them to stay at C1.
What is the POLL idle state?
If cpuidle is active, x86 platforms have one special idle state. The POLL idle state is not a real idle state and does not save any power. Instead, a busy loop is executed, doing nothing for a short period of time. This state is used if the kernel knows that work has to be processed very soon and entering any real hardware idle state might result in a slight performance penalty.
There are two different cpuidle drivers on the x86 architecture:
"acpi_idle" cpuidle driver
The acpi_idle cpuidle driver retrieves available sleep states (C-states) from the ACPI BIOS tables (from the _CST ACPI function on recent platforms or from the FADT BIOS table on older ones). The C1 state is not retrieved from ACPI tables. If the C1 state is entered, the kernel will call the hlt instruction (or mwait on Intel).
"intel_idle" cpuidle driver
In kernel 2.6.36 the intel_idle driver was introduced. It only serves recent Intel CPUs (Nehalem, Westmere, Sandybridge, Atoms or newer). On older Intel CPUs the acpi_idle driver is still used (if the BIOS provides C-state ACPI tables). The intel_idle driver knows the sleep state capabilities of the processor and ignores ACPI BIOS exported processor sleep states tables.
Why might the OS ignore BIOS settings?
- Whether the OS honours the BIOS settings depends on the idle driver in use.
- With intel_idle (the default on Intel machines) the OS can ignore the ACPI and BIOS settings, i.e. the driver can re-enable C-states that were disabled in the BIOS.
- If intel_idle is disabled and the older acpi_idle driver is used, the OS should follow the BIOS settings.
One can disable the intel_idle driver by:
passing intel_idle.max_cstate=0 on the kernel command line, or
passing idle=* on the kernel command line (where * can be e.g. poll, i.e. idle=poll)
How to check the currently loaded driver?
- The intel_idle driver is a CPU idle driver that supports modern Intel processors.
- The intel_idle driver presents the kernel with the target residency and exit latency of each idle state for every supported Intel processor.
- The CPU idle menu governor uses this data to predict how long the CPU will be idle.
The driver currently in use can be read from /sys/devices/system/cpu/cpuidle/current_driver:
intel_idle
Alternatively, search the kernel log for idle-related messages (for example with dmesg | grep idle):
[ 1.766866] intel_idle: MWAIT substates: 0x2120
[ 1.766868] intel_idle: v0.4.1 model 0x3F
[ 1.767023] intel_idle: lapic_timer_reliable_states 0xffffffff
[ 1.835938] cpuidle: using governor menu
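Both pieces of information, the driver and the governor, are also exposed through sysfs and can be read in one go. A minimal sketch, assuming the cpuidle directory under /sys/devices/system/cpu; show_cpuidle_driver is a made-up helper, its optional argument exists only for trying it against a mock directory tree, and files that a given kernel does not provide are skipped.

```shell
# Print the cpuidle driver and governor currently in use.
# $1 = CPU sysfs directory (default: /sys/devices/system/cpu)
# NOTE: show_cpuidle_driver is a hypothetical helper for illustration.
show_cpuidle_driver() (
    dir="${1:-/sys/devices/system/cpu}/cpuidle"
    for f in current_driver current_governor_ro; do
        # Not every kernel exposes both files; print what is readable.
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
)
```

On the node used in this article the first line would read current_driver: intel_idle, matching the kernel log above.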
I hope the article was useful.