Proper and safe evaluation of realtime capability#4132
Conversation
|
I do think it to be very problematic that getting the RT status is so involved. There should be a simple test that does not involve instantiating larger parts of infrastructure. |
|
:-D The CI has no realtime: It is somewhat a chicken and egg problem. Not involving the RT infrastructure needs separate code and this can always not be in sync with RT. Involving RT runs a lot of stuff. Let's think about this. Exactly the issue here: #4129 What do you think about a parameter / pin? So you can use getp / gets to ask for realtime? Or I could add a new command path to halcmd / rtapi_app that does not start the hal if it is not yet running. |
|
Running non-RT is not an error per se (see CI). Therefore, you cannot and must not "simply" or "blindly" force one or the other. There are use-cases where you want to know the RT status and that does not always mean that you will or will not be running either. Finding out what the RT status is or will be must be lightweight and may be different from where you ask. Doing it in a component or from the cmd-line may be different, depending how you ask and with what intention you ask. |
f1d22a7 to
97bb377
Compare
|
Both CI failures are small:
Two runtime things I noticed while reading:
Minor: |
b73daaf to
5dfd303
Compare
|
Thanks. Fixed. One quirk: RTAI in userspace is called LXRT. I should rename this consistently. |
7f6e60d to
fbf2b81
Compare
|
Hmm, time for a break, to many force pushes, sorry. I will continue tomorrow. So you are OK with the general concept? Then I will polish it up, update the doc and do some more testing in different combinations. Open:
Deferred:
|
|
The CI fixes and the LXRT / fallback handling look good now. On naming, with @BsAtHome's #4099 in mind: For the two open how-tos:
|
Well, I don't own it. However, I agree with the argument to leave the get/set moniker to the hal pin/param data access. |
|
The last few commits are still figuring out ways to solve all issues, not ready for code style review yet. @grandixximo Thanks a lot for the hints. Helps a lot not to have to search everything.
|
6477a97 to
af73e08
Compare
|
And of course, after debian package install, the realtime script is not any more in path. And there was also no define for the path where it is. af73e08 adds one. For scripts, |
|
Due to error handling is annoying not knowing when rtapi_app is running or not: It now works exactly the same as RTAI. And it generated a new ToDo: Cleanup the realtime script. It is a mess. There are a lot of RTAI only parts executed in uspace mode. Like loading an empty list of modules and so on. So far, I just added the things I needed. |
|
Two notes after the latest push. Naming: you're right, I'll withdraw my concern. CI / auto-start: the two failures ( Separately: is dropping the auto-start actually needed for the RT-status goal, or is it a cleanup that could be its own PR? Keeping them decoupled would let the getrt/ |
|
If these test fails due to a missing realtime start, they most probably fail also with RTAI, i have to test it. It is not needed in this PR. However, as soon you like to use I will move it in a separate PR as initialy planned, this one gets already big. |
|
Please squash 😁 3 commits is clean |
This helps to check if hal functions can be used or if a component needs to be created first.
|
So, squashed and cleanup. New:
ToDo:
I will look into this. Is there a script to do this? |
|
No dedicated rebase script, it's manual. Run |
…istic Query realtime status with 'realtime verify' (from LinuxCNC#4132) rather than probing the setuid bit. latency-histogram asks the realtime layer directly; latency-test relies on the existing "POSIX non-realtime" note.
…istic Query realtime status with 'realtime verify' (from LinuxCNC#4132) rather than probing the setuid bit. latency-histogram asks the realtime layer directly; latency-test relies on the existing "POSIX non-realtime" note.
|
So, I went trough the full testing matrix. Most works as expected. There is just something to consider: This is strictly speaking still realtime, just a bad one. On the terminal, there is a warning. I could change to: and then check type > REALTIME_TYPE_PREEMPT_DYNAMIC. Or alternatively, use |
|
After some consideration, I think auto selecting unknown or PREEMPT_DYNAMIC is probably a bad idea and can generate a lot of confusion for not to experienced users. Rather select the worst possible option, so it is clear something is wrong, instead of selecting a halve working variant. So with the last commit, LINUXCNC_FORCE_REALTIME=1 is needed to select these options and there is a new type REALTIME_TYPE_UNKNOWN (SCHED_FIFO available but not PREEMT_DYNAMIC) Feedback? |
|
The "pick the worst option so it's obvious something is wrong" instinct is right for backend selection,we shouldn't auto-load a half-working RTAI/Xenomai backend. But this lands on the same problem @BsAtHome flagged earlier, that we "must not blindly force one or the other" and the answer may differ by where you ask. The commit gates one knob too many: it conflates what type do I report with do I grant SCHED_FIFO. Unforced Suggestion: keep using Smaller, not blockers:
|
The stepconf / pncconf should be fixed. Not having a way for "proceed anyway" is not nice. This is also nothing new, behaves like this also in 2.9 with vanilla kernel + setuid active.
Just checked: Using SCHED_FIFO without PREEMT_RT was introduced by #3964. Before, SCHED_OTHER was used when the kernel is not PREEMT_RT or otherwise realtime capable. SCHED_FIFO was used with non-PREEMT_RT kernels only with LINUXCNC_FORCE_REALTIME=1. Also, since #3964 is_rt is broken and always returns false. So it tracks back this behavior how it was before essentially. One other option: If SCHED_FIFO is available but no realtime kernel, I can still use SCHED_FIFO. In this case, I will set REALTIME_TYPE_NONE / is_rt=false.
It was essentially void before. The only case when it returned 1 was when rtapi_is_realtime() failed, which was checked later with
Python hands back: There are many functions doing this, all undocumented. Same for the hal returning -EINVAL. I can document this everywhere where missing.
Ah yes, i forgot to mention this. I believe unloading hal_lib was missed at rtapi_app exit. This is needed to set back to REALTIME_TYPE_UNINITIALIZED when rtapi_app is not running. I hope there are no other side effects, however I did not see one until now.
Thanks, I will fix this. |
…_app Python is_rt / is_sim now use "realtime verify" which uses "rtapi_app check_rt" in uspace / returns true in rtai. If realtime not running: rtapi_app performs the checks an returns the state If realtime running: rtapi_app calls master to perform the check and returns the state
The same checks are performed always the same now. If something is not properly checked, makeApp() fill fail instead of just chosing a different RT implementation by itsself. New function: rtapi_realtime_type_t rtapi_get_realtime_type(void)
91a894b to
ce89d89
Compare
|
About the behaivor with a vanilla kernel, options:
always:
Normal: LINUXCNC_FORCE_REALTIME=1
Normal: LINUXCNC_FORCE_REALTIME=1 I am for 2. Consistent and no change in behaivour compared to before nonroot. Shows issues when the wrong kernel is running.
|
|
Agreed, go with 2. I came in pushing 3, but you've convinced me 2 is the right call for a machine controller. The only one who actually benefits from best-effort SCHED_FIFO on a non-RT kernel is someone running real hardware on the wrong kernel, and that's exactly the case we want to make visibly degraded rather than silently "good enough", since masked latency spikes turn into following errors and lost steps on a real machine. Dev and sim don't need SCHED_FIFO anyway, so 2 costs them nothing, and So: 2 it is. Restores pre-rootless behavior, consistent, and surfaces a wrong-kernel boot instead of papering over it. |
|
One correction to my wording, then the concrete reason for 2. My "3 hides the problem without admitting it" was wrong: per your tables 2 and 3 report identically ( But the "Unexpected realtime delay" warning is gated on the scheduler policy, not on // uspace_posix.cc, RtapiTask::wait(), in the overrun branch:
if (policy == SCHED_FIFO)
unexpected_realtime_delay(task);
That makes the option-3 normal row ( So option-3 normal as drawn would behave like option 1 with That's why 2 is clean: scheduler, overrun warning, and |
|
Option 3 would need some additional code changes to separate SCHED_FIFO from the warning and the reported RT type. Surely possible but if you agree option 2 is the best, this is already implemented. @BsAtHome Are you also fine with option 2? Done:
Still open:
|
| //Call realtime verify to gather realtime status | ||
| //Most probably we don't have realtime running yet | ||
| int ret = system(EMC2_REALTIME " verify > /dev/null"); | ||
| int exit_stat = WEXITSTATUS(ret); | ||
| if(exit_stat != 0 && exit_stat != 1){ | ||
| PyErr_Format(PyExc_RuntimeError, "realtime verify failed, system() return value %i / exit %i", ret, exit_stat); |
There was a problem hiding this comment.
Calling system() in the hal module is extremely ugly and a sign that something is amiss.
There was a problem hiding this comment.
The only alternative I see is to remove is_sim / is_rt and call realtime verify in the two places this was used.
There was a problem hiding this comment.
Making the constants properties will allow you to check in a function. The library user will not see the difference.
Question: is is_rt and is_sim called before a component is created?
If there is HAL connection, then you can directly use a pin/param's value exported by the rtapp's inner process.
If there is no connection (hal_init() was not called) then you need to establish a temporary connection (component) so that you can get the values you need.
There was a problem hiding this comment.
To use properties that call get_realtime_type() successfully, two conditions must be fulfilled.
- rtapi_app must be running
- A component must exist
Both are not fulfilled at the two places this is used:
linuxcnc/src/emc/usr_intf/pncconf/pncconf.py
Line 655 in c1f4b6c
To have is_sim / is_rt useful, it should work always, you should not have to check that you fulfill this conditions. Two options:
- Use
realtime verify-> In this PR - Make sure both conditions are met
I already implemented 2. before 1. it but discarded it again, mainly to not overgrow this PR into yet another "fix everything" PR: #4132 (comment)
- rtapi_app must be running
- This needs a start / stop command
- Removing autostart / autostop is not really needed but would make sense to avoid inconsistencies
- Works mostly, some tests fail due to
realtime start/realtime stopis missing. They probably will also fail with RTAI. - Commit: hdiethelm@612183a
- A component must exist
- We can create a temporary one and discard it again
- Works, needs some cleanup
- Commit: hdiethelm@cdcc344
Both commits contain also some changes that I found useful anyway and are still in this PR.
At the moment, rtapi_app is started at first load_rt command and exits at exit or after last component has been unloaded. This makes it cumbersome to use get_realtime_type() due to you can do it only after first load_rt, otherwise you will get REALTIME_TYPE_UNINITIALIZED.
RTAI always needs realtime start to before first command, realtime stop at the end, so this is already there. That's why I think it is an option to change from auto to manual start/stop without to much issues.
If you think ^^ is a better solution, I can add it again. It looks in general cleaner to me but it might also break more things.
There was a problem hiding this comment.
BTW: Due to it is only used in two places, replacing it by realtime verify and remove is_sim / is_rt would be option 3. However, I don't know who uses this in his own private machine configs.
There was a problem hiding this comment.
To use properties that call get_realtime_type() successfully, two conditions must be fulfilled.
- rtapi_app must be running
- A component must exist
Very true. This is always the case for halmodule to be useful at all.
Only 99%. You can use the hal with only user components and have rtapi_app not running. Not sure why but you can. In this case you probably won't bother to check for RT.
Alternatives may be appropriate here.
If you are ok with (somewhat above in the huge thread):
- Python is_sim / is_rt: I would prefer just removing them
I just have to find a way to use the realtime script inside this stepconf / pncconf. The check is anyway annoying in there, you are forced to have realtime, no way to "continue anyway". So I will get rid of is_sim / is_rt, use realtime status and fix the check so it is not stopping you by force. Ok?
There was a problem hiding this comment.
Very true. This is always the case for halmodule to be useful at all.
Only 99%. You can use the hal with only user components and have rtapi_app not running. Not sure why but you can. In this case you probably won't bother to check for RT.
And that is the point. When you only use it for non-RT, then why ask about RT? You aren't using RT in non-RT so it is just like "Hey, look, I can run non-RT on RT systems! Am I cool or what?". Right...
Alternatives may be appropriate here.
If you are ok with (somewhat above in the huge thread):
* Python is_sim / is_rt: I would prefer just removing them
Yes, it is not used anywhere else and it is rather uninformative/complicated/etc..
We should just create a RO parameter in rtapi_app (the RT running core) that sets the appropriate value to indicate how the system is running.
I just have to find a way to use the realtime script inside this stepconf / pncconf. The check is anyway annoying in there, you are forced to have realtime, no way to "continue anyway". So I will get rid of is_sim / is_rt, use
realtime statusand fix the check so it is not stopping you by force. Ok?
Lets try, yes.
Question: is configuration building with stepconf and pncconf different depending on whether it detects running on a RT system? What about cross-configuration (like cross-compiling)?
There was a problem hiding this comment.
We should just create a RO parameter in rtapi_app (the RT running core) that sets the appropriate value to indicate how the system is running.
There is no parameter withouth a component.
I have a commit somewhere creating a static component at hal_lib init, but this fails all tests due to there is a new unexpexted component and parameter. But I could add just a simple rt_info.comp you can manually load if needed.
Lets try, yes.
Question: is configuration building with stepconf and pncconf different depending on whether it detects running on a RT system? What about cross-configuration (like cross-compiling)?
It checks RT only to say you are not allowed to do axis tests withouth RT and locks you out.
There was a problem hiding this comment.
RT components always run in the rtapi_app context. They can use rtapi_*() calls at any time to determine their status.
Userspace components need a HAL connection and could query a HAL parameter. This parameter must be added via a RT component. Other userspace could use halcmd getp.
The question is then do we add this component implicitly or do we need to do so explicitly.
The case for explicit (put it in a .hal file) is there because if you need the info, then you already know that you need the info and add it to the tree of components.
The case for implicit is that you no longer have to think about adding the component and the information is there. If this can be done without that unloading RT becomes a problem (the rmmod problem, I believe), then this may be the easier option for users. Otherwise, explicit must be the way.
For stepconf/pncconf, well, I guess it can spawn a realtime verify and sample the result? Would that suffice? That said, stepconf and pncconf need to build a working .hal file and load/run RT anyway for testing axes. Why not let them load the isrt.comp and sample the information? If it runs in sim-mode, then the programs can disable their controls or tear down rtapi_app again?
There was a problem hiding this comment.
With thia PR, you have hal_get_realtime_type() that works in user / rt comps and python. So no real need for a parameter for most usecases I think.
Only issue: This works only after the first load_rt due to rtapi_app is only running afterward. hdiethelm@612183a would fix this.
For stepconf/pncconf
realtime verify would work, except the install path of this scipt needs to be available, i did not find it exposed in python yet.
hal.get_realtime_type() would also work, but only after the first load_rt. No need for a parameter, it will be there anyway only after the first load_rt, implicit or explicit, same issue.
Or I just add the commit above and fix the tests, then hal_get_realtime_type() will work after halrun.
rtapi_is_realtime() was always unreliable and broken for user components for some time. It is fixed but only available for rt components now. hal_get_realtime_type() returns the true running realtime type through the HAL for user and realtime components. This function is also exposed through python hal. rtapi_app now unloads hal_lib at exit, so everything is cleaned up properly and realtime_type is set back to REALTIME_TYPE_UNINITIALIZED.
REALTIME_TYPE_UNKNOWN / REALTIME_TYPE_PREEMPT_DYNAMIC need LINUXCNC_FORCE_REALTIME=1 to be set. These two options are most likely not desired and should not be selected automatically.
Intended to be used with:
#4107
Will fix is_sim / is_rt which is broken: #4129
Two new functionality's for two use-cases:
realtime statuscan be used to check if realtime is running.The realtime script is extended with the
verifycommand returning 0 if RT capable, 1 if not.It is intended to use when not running and running.
rtapi_app getrtand returns the stateThere is the new function
hal_get_realtime_type()returning the type of the actually running realtime system trough the hal.is_sim / is_rt use
realtime verifyat init.rtapi_is_realtime()is deprecated: It works only in real time context since #3964 and was never 100% reliable, also according to the doc.