From 817df620bedae9c1daa0497f64a901d51e5bd2dd Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@schwinge.name>
Date: Sat, 26 Mar 2011 00:52:08 +0100
Subject: Some more IRC discussions.

---
 open_issues/anatomy_of_a_hurd_system.mdwn          |  73 +++++++++
 open_issues/ext2fs_page_cache_swapping_leak.mdwn   |  23 +++
 open_issues/pfinet_vs_system_time_changes.mdwn     |  42 ++++++
 ...dez-vous_leading_to_duplicate_port_destroy.mdwn | 163 +++++++++++++++++++++
 open_issues/sudo_date_crash.mdwn                   |  16 --
 open_issues/unit_testing.mdwn                      |  20 +++
 6 files changed, 321 insertions(+), 16 deletions(-)
 create mode 100644 open_issues/anatomy_of_a_hurd_system.mdwn
 create mode 100644 open_issues/ext2fs_page_cache_swapping_leak.mdwn
 create mode 100644 open_issues/pfinet_vs_system_time_changes.mdwn
 create mode 100644 open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn
 delete mode 100644 open_issues/sudo_date_crash.mdwn

diff --git a/open_issues/anatomy_of_a_hurd_system.mdwn b/open_issues/anatomy_of_a_hurd_system.mdwn
new file mode 100644
index 00000000..e1d5c9d8
--- /dev/null
+++ b/open_issues/anatomy_of_a_hurd_system.mdwn
@@ -0,0 +1,73 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!taglink open_issue_documentation]]
+
+A bunch of this should also be covered in other (introductory) material,
+like Bushnell's Hurd paper.  All this should be unified and streamlined.
+
+IRC, freenode, #hurd, 2011-03-08
+
+    <foocraft> I've a question on what are the "units" in the hurd project, if
+      you were to divide them into units if they aren't, and what are the
+      dependency relations between those units(roughly, nothing too pedantic
+      for now)
+    <antrik> there is GNU Mach (the microkernel); there are the server
+      libraries in the Hurd package; there are the actual servers in the same;
+      and there is the POSIX implementation layer in glibc
+    <antrik> relations are a bit tricky
+    <antrik> Mach is the base layer which implements IPC and memory management
+    <foocraft> hmm I'll probably allocate time for dependency graph generation,
+      in the worst case
+    <antrik> on top of this, the Hurd servers, using the server libraries,
+      implement various aspects of the system functionality
+    <antrik> client programs use libc calls to use the servers
+    <antrik> (servers also use libc to communicate with other servers and/or
+      Mach though)
+    <foocraft> so every server depends solely on mach, and no other server?
+    <foocraft> s/mach/mach and/or libc/
+    <antrik> I think these things should be pretty clear once you are somewhat
+      familiar with the Hurd architecture... nothing really tricky there
+    <antrik> no
+    <antrik> servers often depend on other servers for certain functionality
+
+---
+
+IRC, freenode, #hurd, 2011-03-12
+
+    <dEhiN> when mach first starts up, does it have some basic i/o or fs
+      functionality built into it to start up the initial hurd translators?
+    <antrik> I/O is presently completely in Mach
+    <antrik> filesystems are in userspace
+    <antrik> the root filesystem and exec server are loaded by grub
+    <dEhiN> o I see
+    <dEhiN> so in order to start hurd, you would have to start mach and
+      simultaneously start the root filesystem and exec server?
+    <antrik> not exactly
+    <antrik> GRUB loads all three, and then starts Mach. Mach in turn starts
+      the servers according to the multiboot information passed from GRUB
+    <dEhiN> ok, so does GRUB load them into ram?
+    <dEhiN> I'm trying to figure out in my mind how hurd is initially started
+      up from a low-level pov
+    <antrik> yes, as I said, GRUB loads them
+    <dEhiN> ok, thanks antrik...I'm new to the idea of microkernels, but a
+      veteran of monolithic kernels
+    <dEhiN> although I just learned that windows nt is a hybrid kernel which I
+      never knew!
+    <rm> note there's a /hurd/ext2fs.static
+    <rm> I believe that's what is used initially... right?
+    <antrik> yes
+    <antrik> loading the shared libraries in addition to the actual server
+      would be unwieldy
+    <antrik> so the root FS server is linked statically instead
+    <dEhiN> what does the root FS server do?
+    <antrik> well, it serves the root FS ;-)
+    <antrik> it also does some bootstrapping work during startup, to bring the
+      rest of the system up
diff --git a/open_issues/ext2fs_page_cache_swapping_leak.mdwn b/open_issues/ext2fs_page_cache_swapping_leak.mdwn
new file mode 100644
index 00000000..0ace5cd3
--- /dev/null
+++ b/open_issues/ext2fs_page_cache_swapping_leak.mdwn
@@ -0,0 +1,23 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+IRC, OFTC, #debian-hurd, 2011-03-24
+
+    <youpi> I still believe we have an ext2fs page cache swapping leak, however
+    <youpi> as the 1.8GiB swap was full, yet the ld process was only 1.5GiB big
+    <pinotree> a leak at swapping time, you mean?
+    <youpi> I mean the ext2fs page cache being swapped out instead of simply
+      dropped
+    <pinotree> ah
+    <pinotree> so the swap tends to accumulate unuseful stuff, i see
+    <youpi> yes
+    <youpi> the disk content, basically :)
diff --git a/open_issues/pfinet_vs_system_time_changes.mdwn b/open_issues/pfinet_vs_system_time_changes.mdwn
new file mode 100644
index 00000000..a9e1e242
--- /dev/null
+++ b/open_issues/pfinet_vs_system_time_changes.mdwn
@@ -0,0 +1,42 @@
+[[!meta copyright="Copyright © 2010, 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+IRC, unknown channel, unknown date.
+
+    <grey_gandalf> I did a sudo date...
+    <grey_gandalf> and the machine hangs
+
+This was very likely a misdiagnosis:
+
+IRC, freenode, #hurd, 2011-03-25
+
+    <tschwinge> antrik: I suspect it's some timing stuff in pfinet that perhaps
+      uses absolute time, and somehow wildly gets confused?
+    <antrik> tschwinge: BTW, pfinet doesn't actually die I think -- it just
+      drops open connections...
+    <antrik> perhaps it thinks they timed out
+    <tschwinge> antrik: Isn't the translator restarted instead?
+    <antrik> don't think so
+    <antrik> when pfinet actually dies, I also lose the NFS mounts, which
+      doesn't happen in this case
+    <antrik> hehe "... and the machine hangs"
+    <antrik> he didn't bother to check that the machine is perfectly fine, only
+      the SSH connection got dropped
+    <tschwinge> Ah, I see.  So it perhaps indeed simply closes TCP
+      connections that have been without data for ``too long''?
+    <antrik> yeah, that's my guess
+    <antrik> my clock is speeding, so ntpdate sets it in the past
+    <antrik> perhaps there is some math that concludes the connections have been
+      inactive for -200 seconds, which (unsigned) is more than any timeout :-)
+    <tschwinge> (The other way round, you might likely get some integer
+      wrap-around, and thus the same result.)
+    <tschwinge> Yes.
diff --git a/open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn b/open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn
new file mode 100644
index 00000000..9db92250
--- /dev/null
+++ b/open_issues/rpc_to_self_with_rendez-vous_leading_to_duplicate_port_destroy.mdwn
@@ -0,0 +1,163 @@
+[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]
+
+[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
+id="license" text="Permission is granted to copy, distribute and/or modify this
+document under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no Invariant
+Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
+is included in the section entitled [[GNU Free Documentation
+License|/fdl]]."]]"""]]
+
+[[!tag open_issue_hurd]]
+
+[RPC to self with rendez-vous leading to duplicate port
+destroy](http://lists.gnu.org/archive/html/bug-hurd/2011-03/msg00045.html)
+
+IRC, freenode, #hurd, 2011-03-14
+
+    <antrik> youpi: I wonder, why does the root FS call diskfs_S_dir_lookup()
+      at all?...
+    <youpi> errr, because a client asked for it?
+    <youpi> (problem with RPCs is you can't easily know where they come from :)
+      )
+    <youpi> (especially when it's the root fs...)
+    <antrik> ah, it's about a client request... didn't see that
+    <youpi> well, I just said "is called", yes
+    <antrik> I do not really understand though why it tries to reauthenticate
+      against itself...
+    <antrik> I fear my memory of the lookup mechanism grew a bit dim
+    <youpi> see the source
+    <youpi> it's about a translated entry
+    <antrik> (and I never fully understood some aspects anyways...)
+    <youpi> it needs to start the translated entry as another user, possibly
+    <antrik> yes, but a translated entry normally would be served by *another*
+      process?...
+    <youpi> sure, but ext2fs has to prepare it
+    <youpi> thus reauthenticate to prepare the correct set of rights
+    <antrik> prepare what?
+    <youpi> rights
+    <youpi> so the process is not root, doesn't have / opened as root, etc.
+    <antrik> rights for what?
+    <youpi> err, about everything
+    <antrik> IIRC the reauthentication is done by the parent FS on the port to
+      the *translated* node
+    <antrik> and the translated node should be a different process?...
+    <youpi> that's not what I read in the source
+    <youpi> fshelp_fetch_root
+    <youpi> ports[INIT_PORT_CRDIR] = reauth (getcrdir ());
+    <youpi> here, getcrdir() returns ext2fs itself
+    <antrik> well, perhaps the issue is that I have no idea what
+      fshelp_fetch_root() does, nor why it is called here...
+    <youpi> it notably starts the translator that dir_lookup is looking at, if
+      needed
+    <youpi> possibly as a different user, thus reauthentication of CRDIR
+    <antrik> so this is about a port that is passed to the translator being
+      started?
+    <youpi> no
+    <youpi> well, depends on what you mean by "port"
+    <youpi> it's about reauthenticating a port to be passed to the translator
+      being started
+    <youpi> and for that a rendez-vous port is needed for the reauthentication
+    <youpi> and that's the one at stake
+    <antrik> yeah, I meant the port that is reauthenticated
+    <antrik> what is CRDIR?
+    <youpi> current root dir ...
+    <antrik> so the parent translator passes its own root dir to the child
+      translator; and the issue is that for the root FS the root dir points to
+      the root FS itself...
+    <youpi> yes
+    <antrik> OK, that makes sense
+    <youpi> (but that's only one example, rgrep mach_port_destroy hurd/ show
+      other potential issues)
+    <antrik> well, that's actually what I wanted to mention next... why is the
+      rendez-vous port destroyed, instead of just deallocating the port right
+      and letting reference counting to it's thing?...
+    <antrik> do its thing
+    <youpi> "just to make sure" I guess
+    <antrik> it's pretty obvious that this will cause trouble for any RPC
+      referencing itself...
+    <youpi> well, follow-up with that on the list
+    <youpi> with roland/tb in CC
+    <youpi> only they would know any real reason for destroy
+    <youpi> btw, if you knew how we could make _hurd_select()'s raw __mach_msg
+      call be interruptible by signals, that'll permit to fix sudo
+    <youpi> (damn, I need sleep, my tenses are all wrong)
+    <antrik> BTW, does this cause any actual trouble?...
+    <antrik> I don't know much about interruption... cfhammer might have a
+      better idea, he look into that stuff quite a bit AIUI
+    <antrik> looked
+    <antrik> (hehe, it's not only your tenses... guess there's something in the
+      ether ;-) )
+    <youpi> it makes sudo, mailq, etc. fail sometimes
+    <antrik> I mean the rendez-vous thing
+    <youpi> that's it, yes
+    <youpi> sudo etc. fail at least due to this
+    <antrik> so these are two different problems that both affect sudo?
+    <antrik> (rendez-vous and interruption I mean)
+    <youpi> yes
+    <youpi> with my patch the buildds have much fewer issues, but still some
+    <youpi> (my interrupt-related patch)
+    <youpi> I'm installing a s/destroy/deallocate/ version of ext2fs on the
+      buildds, we'll see how it behaves
+    <youpi> (it fixes my testcase at least)
+    <antrik> interrupt-related patch?
+    <antrik> only thing interrupt-related I remember was the reauthentication
+      race...
+    <youpi> that's what I mean
+    <antrik> well, cfhammer investigated this in quite some depth, explaining
+      quite well why the race is only mitigated but still exists... problem is
+      that we didn't know how to fix it properly
+    <antrik> because nobody seems to understand the cancellation code, except
+      perhaps for Roland and Thomas
+    <antrik> (and I'm not even entirely sure about them :-) )
+    <antrik> I think his findings and our conclusions are documented on the
+      ML...
+    <youpi> by "much fewer issues", I mean that some of the symptoms have
+      disappeared, others haven't
+    <antrik> BTW, couldn't the rendez-vous thing be worked around by simply
+      ignoring the errors from the failing deallocate?...
+    <youpi> no, failing deallocates are actually dangerous
+    <antrik> why?
+    <youpi> since the name might have been reused for something else in the
+      meanwhile
+    <youpi> that's the whole point of the warning I had added in the kernel
+      itself
+    <antrik> I see
+    <youpi> such things really deserve tracking, since they can have any kind
+      of consequence
+    <antrik> does Mach try to reuse names quickly, rather than only after
+      wrapping around?...
+    <youpi> it seems to
+    <antrik> OK, then this is a serious problem indeed
+    <youpi> (note: I rarely divine issues when there aren't actual frequent
+      symptoms :) )
+    <antrik> well, the problem with the warning is that it only shows in the
+      cases that do *not* cause a problem... so it's hard to associate them
+      with any specific issues
+    <youpi> well, most of the time the port is not reused quickly enough
+    <youpi> so in most cases it shows up more often than it causes a problem
+
+IRC, freenode, #hurd, 2011-03-14
+
+    <youpi> ok, mach_port_deallocate actually can't be used
+    <youpi> since mach_reply_port() returns a receive right, not a send right
+    * youpi guesses he will really have to manage to understand all that port
+        stuff completely
+    <antrik> oh, right
+    <antrik> youpi: hm... now I'm confused though. if one client holds a
+      receive right, the other client (or in this case the same process) should
+      have a send or send-once right -- these should *not* share the same name
+      in my understanding
+    <antrik> destroying the receive right should turn the send right into a
+      dead name
+    <antrik> so unless I'm missing something, the destroy shouldn't be a
+      problem, and there must be something else going wrong
+    <antrik> hm... actually I'm probably wrong
+    <antrik> yeah, definitely wrong. receive rights and "ordinary" send rights
+      share the name. only send-once rights are special
+    <antrik> I wonder whether the problem could be worked around by using a
+      send-once right...
+    <antrik> mach_port_mod_refs(mach_task_self(), name,
+      MACH_PORT_RIGHT_RECEIVE, -1) can be used to deallocate only the receive
+      right
+    <antrik> oh, you already figured that out :-)
diff --git a/open_issues/sudo_date_crash.mdwn b/open_issues/sudo_date_crash.mdwn
deleted file mode 100644
index 53303abc..00000000
--- a/open_issues/sudo_date_crash.mdwn
+++ /dev/null
@@ -1,16 +0,0 @@
-[[!meta copyright="Copyright © 2010 Free Software Foundation, Inc."]]
-
-[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable
-id="license" text="Permission is granted to copy, distribute and/or modify this
-document under the terms of the GNU Free Documentation License, Version 1.2 or
-any later version published by the Free Software Foundation; with no Invariant
-Sections, no Front-Cover Texts, and no Back-Cover Texts.  A copy of the license
-is included in the section entitled [[GNU Free Documentation
-License|/fdl]]."]]"""]]
-
-[[!tag open_issue_gnumach]]
-
-IRC, unknown channel, unknown date.
-
-    <grey_gandalf> I did a sudo date...
-    <grey_gandalf> and the machine hangs
diff --git a/open_issues/unit_testing.mdwn b/open_issues/unit_testing.mdwn
index a5ffe19d..feda3be4 100644
--- a/open_issues/unit_testing.mdwn
+++ b/open_issues/unit_testing.mdwn
@@ -320,3 +320,23 @@ freenode, #hurd channel, 2011-03-07:
       this, and just generally though that some sort of automated testing is
       needed, and thus started collecting ideas.
     <tschwinge> antrik: You're of course invited to fix that.
+
+IRC, freenode, #hurd, 2011-03-08
+
+(After discussing the [[anatomy_of_a_hurd_system]].)
+
+    <antrik> so that's what your question is actually about?
+    <foocraft> so what I would imagine is a set of only-this-server tests for
+      each server, and then we can have fun adding composite tests
+    <foocraft> thus making debugging the composite scenarios a bit less tricky
+    <antrik> indeed
+    <foocraft> and if you were trying to pass a composite test, it would also
+      help knowing that you still didn't break the server-only test
+    <antrik> there are so many different things that can be tested... the
+      summer will only suffice to dip into this really :-)
+    <foocraft> yeah, I'm designing my proposal to focus on 1) make/use a
+      testing framework that fits the Hurd case very well 2) write some tests
+      and docs on how to write good tests
+    <antrik> well, doesn't have to be *one* framework... unit testing and
+      regression testing are quite different things, which can be covered by
+      different frameworks
-- 
cgit v1.2.3