Quantcast

tracking down SERVFAIL

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

tracking down SERVFAIL

Jeremy Kister-9
For the past couple of weeks, my resolvers that are configured as
FORWARDONLY have been having seemingly random lookup failures on
aol.com's TXT RRs (other RRs seem fine, other TXT RRs on other domains
seem fine).  With the pressure on to get it working, i've been
restarting dnscache which, at least temporarily, fixes the problem.

The target resolver gives the expected answers to queries all the time.

# cd /service/dnscache-204.9.96.7
# grep . env/FORWARDONLY root/servers/@
env/FORWARDONLY:1
root/servers/@:64.115.47.7
# dig @204.9.96.7 aol.com txt

; <<>> DiG 8.4 <<>> @204.9.96.7 aol.com txt
; (1 server found)
;; res options: init recurs defnam dnsrch no-nibble2
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 9706
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;      aol.com, type = TXT, class = IN

;; Total query time: 38 msec
;; FROM: minnie to SERVER: 204.9.96.7
;; WHEN: Sun Jan  6 22:16:53 2008
;; MSG SIZE  sent: 25  rcvd: 25

# dig @64.115.47.7 aol.com txt

; <<>> DiG 8.4 <<>> @64.115.47.7 aol.com txt
; (1 server found)
;; res options: init recurs defnam dnsrch no-nibble2
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3307
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;      aol.com, type = TXT, class = IN

;; ANSWER SECTION:
aol.com.                1m36s IN TXT    "spf2.0/pra ip4:152.163.225.0/24
ip4:205.188.139.0/24 ip4:205.188.144.0/24 ip4:205.188.156.0/23
ip4:205.188.159.0/24 ip4:64.12.136.0/23 ip4:64.12.138.0/24
ip4:64.12.143.99/32 ip4:64.12.143.100/32 ip4:64.12.143.101/32
ptr:mx.aol.com ?all"
aol.com.                1m36s IN TXT    "v=spf1 ip4:152.163.225.0/24
ip4:205.188.139.0/24 ip4:205.188.144.0/24 ip4:205.188.156.0/23
ip4:205.188.159.0/24 ip4:64.12.136.0/23 ip4:64.12.138.0/24
ip4:64.12.143.99/32 ip4:64.12.143.100/32 ip4:64.12.143.101/32
ptr:mx.aol.com ?all"

;; Total query time: 29 msec
;; FROM: minnie to SERVER: 64.115.47.7
;; WHEN: Sun Jan  6 22:17:53 2008
;; MSG SIZE  sent: 25  rcvd: 517


converted logs from dnscache-204.9.96.7:
query 101909 204.9.96.41 50990 6 TXT aol.com.
tx 0 16 aol.com. . 40732f07
servfail aol.com. input/output error
sent 101909 25

converted logs from dnscache-64.115.47.7:
query 162746 204.9.96.7 46927 17544 TXT aol.com.
sent 162746 25
tcpopen 204.9.96.7:4631
tcpclose 204.9.96.7:4631 broken pipe



Any suggestions about what I should do ?  I've left my ns2 in the broken
state so that we can get more data, if needed.


--

Jeremy Kister
http://jeremy.kister.net./
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Jeremy Kister-9
On 1/6/2008 10:28 PM, Jeremy Kister wrote:
> For the past couple of weeks, my resolvers that are configured as
> FORWARDONLY have been having seemingly random lookup failures on

Note that the target resolver and the FORWARDONLY resolver both live on
the same machine, running the same flavor of dnscache, but live in
separate /etc/dnscache directories with their own configuration
differences there.


--

Jeremy Kister
http://jeremy.kister.net./
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Matthew Dempsky-3
In reply to this post by Jeremy Kister-9
On 1/6/08, Jeremy Kister <[hidden email]> wrote:
> converted logs from dnscache-64.115.47.7:
> query 162746 204.9.96.7 46927 17544 TXT aol.com.
> sent 162746 25
> tcpopen 204.9.96.7:4631
> tcpclose 204.9.96.7:4631 broken pipe

Is there a "drop 162746" line in your logs?  If not, it looks like
because aol.com's TXT records are so large, dnscache needs to fallback
to using TCP sockets to transmit the DNS records, and that connection
is being dropped somehow.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Jeremy Kister-9
On 1/7/2008 4:00 AM, Matthew Dempsky wrote:
> Is there a "drop 162746" line in your logs?  If not, it looks like
> because aol.com's TXT records are so large, dnscache needs to fallback
> to using TCP sockets to transmit the DNS records, and that connection
> is being dropped somehow.

Nope, no related drop entries in the log.  But note that it seems to me
that the communication problem exists between the two local recursive
name servers --

64.115.47.7 can get the correct data, presumably directly from aol.com's
authoritative name servers.
204.9.96.7 talks to 64.115.47.7, and cannot get the data.

--

Jeremy Kister
http://jeremy.kister.net./

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Matthew Dempsky-3
On 1/7/08, Jeremy Kister <[hidden email]> wrote:
> Nope, no related drop entries in the log.  But note that it seems to me
> that the communication problem exists between the two local recursive
> name servers --

Sorry if I was unclear.  I meant that the two dnscache instances were
needing to fallback to TCP for communication, and that TCP connection
was being dropped, not the TCP connection between dnscache and
aol.com's servers.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Jeremy Kister-9
In reply to this post by Jeremy Kister-9
On 1/7/2008 11:06 AM, Charlie Brady wrote:
> Do you have any packet filtering on your loopback interfaces?

No local filtering at all.

 > Have you sniffed the traffic?

Yes.  Nothing that I see interesting:
204.9.96.42 -> 204.9.96.7 DNS C aol.com. Internet TXT ?

             0: 0800 201a e4f8 00d0 b788 cd91 0800 4500    .. ..ø........E.
            16: 0035 fc12 0000 4011 2661 cc09 602a cc09    .5....@.&a..`*..
            32: 6007 e341 0035 0021 097e c2d3 0100 0001    `..A.5.!.~......
            48: 0000 0000 0000 0361 6f6c 0363 6f6d 0000    .......aol.com..
            64: 1000 01                                 ...

204.9.96.7 -> 204.9.96.42 DNS R  Error: 2(Server Fail)

             0: 00d0 b788 cd91 0800 201a e4f8 0800 4500    ........ ..ø..E.
            16: 0035 f46d 4000 ff11 2f05 cc09 6007 cc09    .5ôm@.../...`...
            32: 602a 0035 e341 0021 88fb c2d3 8182 0001    `*.5.A.!........
            48: 0000 0000 0000 0361 6f6c 0363 6f6d 0000    .......aol.com..
            64: 1000 01

--

Jeremy Kister
http://jeremy.kister.net./



Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Jeremy Kister-9
In reply to this post by Jeremy Kister-9
On 1/7/2008 11:33 AM, Matthew Dempsky wrote:
> Sorry if I was unclear.  I meant that the two dnscache instances were
> needing to fallback to TCP for communication, and that TCP connection
> was being dropped, not the TCP connection between dnscache and
> aol.com's servers.

you were clear.  I was just emphasizing that I was confused about how
two ip addresses on a local host couldn't talk tcp to each other.

--

Jeremy Kister
http://jeremy.kister.net./



Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Charlie Brady
In reply to this post by Jeremy Kister-9

On Mon, 7 Jan 2008, Jeremy Kister wrote:

>  > Have you sniffed the traffic?
>
> Yes.  Nothing that I see interesting:
> 204.9.96.42 -> 204.9.96.7 DNS C aol.com. Internet TXT ?
>
>              0: 0800 201a e4f8 00d0 b788 cd91 0800 4500    .. ..ø........E.
>             16: 0035 fc12 0000 4011 2661 cc09 602a cc09    .5....@.&a..`*..
>             32: 6007 e341 0035 0021 097e c2d3 0100 0001    `..A.5.!.~......
>             48: 0000 0000 0000 0361 6f6c 0363 6f6d 0000    .......aol.com..
>             64: 1000 01                                 ...
>
> 204.9.96.7 -> 204.9.96.42 DNS R  Error: 2(Server Fail)

You earlier concluded that there was a communication problem between the
two dnscache instances:

 Nope, no related drop entries in the log.  But note that it seems to me
 that the communication problem exists between the two local recursive
 name servers --

The sniff trace shows that there is traffic in both directions. Is that
all the traffic you see? Did you sniff both UDP and TCP?

The pipe error reported by the upstream of your two instances (204.9.96.7)  
should correlate with a 0 byte TCP socket read. See t_rw() in dnscache.c.

The content of the return traffic is also interesting. It tells you that
204.9.96.7 is reporting SERVFAIL, so you don't just have lost responses
going back to 204.9.96.42.

--
Charlie

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Jeremy Kister-9
In reply to this post by Jeremy Kister-9
On 1/6/2008 10:28 PM, Jeremy Kister wrote:
> For the past couple of weeks, my resolvers that are configured as
> FORWARDONLY have been having seemingly random lookup failures on
> aol.com's TXT RRs (other RRs seem fine, other TXT RRs on other domains

I have a bit more data; im not sure if it's interesting:

My the forwardonly name server has the RR.  but when the TTL expires, it
shows SERVFAIL before showing the new data again.

minnie> dig @204.9.96.7 aol.com txt

; <<>> DiG 8.4 <<>> @204.9.96.7 aol.com txt
; (1 server found)
;; res options: init recurs defnam dnsrch no-nibble2
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45447
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;      aol.com, type = TXT, class = IN

;; ANSWER SECTION:
aol.com.                0S IN TXT       "v=spf1 ip4:152.163.225.0/24
ip4:205.188.139.0/24 ip4:205.188.144.0/24 ip4:205.188.156.0/23
ip4:205.188.159.0/24 ip4:64.12.136.0/23 ip4:64.12.138.0/24
ip4:64.12.143.99/32 ip4:64.12.143.100/32 ip4:64.12.143.101/32
ptr:mx.aol.com ?all"
aol.com.                0S IN TXT       "spf2.0/pra ip4:152.163.225.0/24
ip4:205.188.139.0/24 ip4:205.188.144.0/24 ip4:205.188.156.0/23
ip4:205.188.159.0/24 ip4:64.12.136.0/23 ip4:64.12.138.0/24
ip4:64.12.143.99/32 ip4:64.12.143.100/32 ip4:64.12.143.101/32
ptr:mx.aol.com ?all"

;; Total query time: 21 msec
;; FROM: minnie to SERVER: 204.9.96.7
;; WHEN: Tue Jan  8 14:47:14 2008
;; MSG SIZE  sent: 25  rcvd: 517

minnie> dig @204.9.96.7 aol.com txt

; <<>> DiG 8.4 <<>> @204.9.96.7 aol.com txt
; (1 server found)
;; res options: init recurs defnam dnsrch no-nibble2
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 65232
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;      aol.com, type = TXT, class = IN

;; Total query time: 273 msec
;; FROM: minnie to SERVER: 204.9.96.7
;; WHEN: Tue Jan  8 14:47:15 2008
;; MSG SIZE  sent: 25  rcvd: 25

minnie> dig @204.9.96.7 aol.com txt

; <<>> DiG 8.4 <<>> @204.9.96.7 aol.com txt
; (1 server found)
;; res options: init recurs defnam dnsrch no-nibble2
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13692
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUERY SECTION:
;;      aol.com, type = TXT, class = IN

;; ANSWER SECTION:
aol.com.                4m57s IN TXT    "spf2.0/pra ip4:152.163.225.0/24
ip4:205.188.139.0/24 ip4:205.188.144.0/24 ip4:205.188.156.0/23
ip4:205.188.159.0/24 ip4:64.12.136.0/23 ip4:64.12.138.0/24
ip4:64.12.143.99/32 ip4:64.12.143.100/32 ip4:64.12.143.101/32
ptr:mx.aol.com ?all"
aol.com.                4m57s IN TXT    "v=spf1 ip4:152.163.225.0/24
ip4:205.188.139.0/24 ip4:205.188.144.0/24 ip4:205.188.156.0/23
ip4:205.188.159.0/24 ip4:64.12.136.0/23 ip4:64.12.138.0/24
ip4:64.12.143.99/32 ip4:64.12.143.100/32 ip4:64.12.143.101/32
ptr:mx.aol.com ?all"

;; Total query time: 41 msec
;; FROM: minnie to SERVER: 204.9.96.7
;; WHEN: Tue Jan  8 14:47:18 2008
;; MSG SIZE  sent: 25  rcvd: 517


Not knowing if this was just a local configuration issue, I set up a new
dnscache instance on a new ip address, set up as FORWARDONLY.  It
exhibits the same behavior.  Does anyone have a similar setup that can
test for aol's txt record ?

--

Jeremy Kister
http://jeremy.kister.net./
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Matthew Dempsky-3
On 1/8/08, Jeremy Kister <[hidden email]> wrote:
> Not knowing if this was just a local configuration issue, I set up a new
> dnscache instance on a new ip address, set up as FORWARDONLY.  It
> exhibits the same behavior.  Does anyone have a similar setup that can
> test for aol's txt record ?

I setup a second dnscache instance on my home network, and I am
experiencing the same problems as you.  I'm looking into this more at
the moment.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Daryl Tester-4
In reply to this post by Jeremy Kister-9
Jeremy Kister wrote:

> Not knowing if this was just a local configuration issue, I set up a new
> dnscache instance on a new ip address, set up as FORWARDONLY.  It
> exhibits the same behavior.  Does anyone have a similar setup that can
> test for aol's txt record ?

I'm seeing similar results here (I have a forwarding cache pointing to
a resolving cache).  I have digs of various vintages which are printing
varying results, one in particular is reporting ";; Got bad packet:
unexpected end of input", yet works when forcing the query using tcp
(the request that generated this was against the resolving cache as
well as forwarding).  Unfortunately I can't restart the caches at
this point in time to see

I'll snarf some packet dumps later today when I have some free time.
I just wanted to report that you're not alone.  :-)


--
Regards,
  Daryl Tester

"Verbogeny is one of the pleasurettes of a creatific thinkerizer."
  --  Peter da Silva
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Matthew Dempsky-3
In reply to this post by Matthew Dempsky-3
On 1/8/08, Matthew Dempsky <[hidden email]> wrote:
> I setup a second dnscache instance on my home network, and I am
> experiencing the same problems as you.  I'm looking into this more at
> the moment.

Running ktrace on the FORWARDONLY dnscache reveals this line when
dnscache is trying to write(2) the DNS query packet to TCP socket
connection to the other dnscache instance:

 11333 dnscache CALL  write(0,0x87c74265,0xfffffe16)
 11333 dnscache RET   write -1 errno 22 Invalid argument

For some reason, the length argument is being changed from 25 to 0xfffffe16.

For comparison, when it sends the packet via UDP, the sendto(2) ktrace log is:

 11333 dnscache CALL  sendto(0,0x87c74062,0x19,0,0,0)
 11333 dnscache GIO   fd 0 wrote 25 bytes
       "\^Z\240\^A\0\0\^A\0\0\0\0\0\0\^Caol\^Ccom\0\0\^P\0\^A"
 11333 dnscache RET   sendto 25/0x19

(This is on OpenBSD 4.0.)
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL

Matthew Dempsky-3
In reply to this post by Jeremy Kister-9
Can you try the patch below?  As far as I can tell, d->pos is not set
to 0 in this case (but I'm still trying to double check).

--- dns_transmit.c.orig 2008-01-08 13:58:33.000000000 -0800
+++ dns_transmit.c 2008-01-08 13:57:56.000000000 -0800
@@ -166,6 +166,7 @@
       taia_uint(&d->deadline,10);
       taia_add(&d->deadline,&d->deadline,&now);
       if (socket_connect4(d->s1 - 1,ip,53) == 0) {
+        d->pos = 0;
         d->tcpstate = 2;
         return 0;
       }
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL [Hey Russ!!]

Jeremy Kister-9
On 1/8/2008 5:00 PM, Matthew Dempsky wrote:
> Can you try the patch below?  As far as I can tell, d->pos is not set
> to 0 in this case (but I'm still trying to double check).

Hi Matthew,

I have applied your patch and the two forwardonly servers (on the same
machine) have been good for over an hour.  I hope it's not premature to
say "YOU GOT IT!!" :)  Thanks for your time.


Russ: after testing this patch for a few more days, i'd like to
recommend that you include it in the recommended section of tinydns.org,
as well as consider it critical for netdjbdns.


--

Jeremy Kister
http://jeremy.kister.net./
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL [Resolved - Hey Russ!!]

Jeremy Kister-9
On 1/8/2008 11:17 PM, Jeremy Kister wrote:
> I have applied your patch and the two forwardonly servers (on the same
> machine) have been good for over an hour.  I hope it's not premature to
> say "YOU GOT IT!!" :)  Thanks for your time.

/service/dnscache-204.9.96.7: up (pid 12103) 87985 seconds

.. and i can still resolve aol.com's txt RR.  i'm calling the issue
resolved.

Can you tell me more about the repercussions of this issue?  How exactly
does not resetting pos effect TCP answers?

Specifically, I'm confused why the forwardonly cache can get answers for
some period of time, and then suddenly stop.  And if i understand the
problem correctly, the pos has to do with the server side, so why does
querying the cache directly work, when the forwardonly agent doesnt ?

> Russ: after testing this patch for a few more days, i'd like to
> recommend that you include it in the recommended section of tinydns.org,
> as well as consider it critical for netdjbdns.

yea, that.


--

Jeremy Kister
http://jeremy.kister.net./
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: tracking down SERVFAIL [Resolved - Hey Russ!!]

Matthew Dempsky-3
On 1/9/08, Jeremy Kister <[hidden email]> wrote:
> Can you tell me more about the repercussions of this issue?  How exactly
> does not resetting pos effect TCP answers?

The d->pos variable stores how much of the DNS query has been
transmitted over TCP.  By not resetting it when connecting to a local
server, the dns_transmit code will not properly send a proper DNS
query.

If left uninitialized to a value less than or equal to the query
packet length, it will send an invalid query (some suffix of the
legitimate query); otherwise, it will try to write invalid memory
addresses to the socket.  (On OpenBSD, this will always fail with
EINVAL; I haven't checked any other operating systems.)

> Specifically, I'm confused why the forwardonly cache can get answers for
> some period of time, and then suddenly stop.  And if i understand the
> problem correctly, the pos has to do with the server side, so why does
> querying the cache directly work, when the forwardonly agent doesnt ?

I'm not sure I understand the question.

The problem is that the dns_transmit code is broken for TCP queries
that reuse the same dns_transmit struct to communicate with a local
DNS cache.  How the problem plays out in your situation:

  1. dnscache receives a query for aol.com's TXT records
  2. it forwards the request to other local cache and gets a response
with the truncation bit set
  3. it sends the query again using the t[0] variable in dnscache
(which contains a query struct, which contains a dns_transmit struct)
  4. the connect(2) on line 168 of dns_transmit.c succeeds
immediately, so the state-machine proceeds to tcpstate == 2 without
setting t[0].q.dt.pos to 0, but that variable is stored in static
memory so it's zero-filled initially anyways
  5. the DNS query is successfully sent, and the other DNS cache
responds; at the end of tcpstate == 5, t[0].q.dt.pos still contains
the length of the query response
  6. after some time, the TXT records TTL runs out, and dnscache needs
to forward its query again
  7. same thing as step 4, except this time t[0].q.dt.pos is non-zero,
which corrupts the transmission of the TCP query, and causes
dns_transmit to misinterpret the write(2) error as a SERVFAIL

Another way to reproduce this is to just run "dnstxt aol.com aol.com"
with an unpatched djbdns-1.05 installation on a machine running
dnscache.  The bug here is that dns_resolve_tx.pos is reused without
reinitialization.
Loading...