1.0 CAN problems

  • Hi,
    I am trying to get the CAN bus working with the new version. We had it working with version 0.2. This is what I have done.

    Code
    # First copy buildroot-2010.08/target/device/f+s/addons/lib/modules/2.6.28.6/modules.dep
    # to /lib/modules/2.6.28 on the picomod, then:
    modprobe spi_s3c
    modprobe can
    modprobe mcp251x
    ip link set can0 type can bitrate 100000
    ip link set can0 up


    ifconfig shows that the CAN interface is up and receiving packets. However, when I try to open a socket using the can_rx example program, I get

    Code
    # ./can_rx can0
    Error during creating socket


    In fact, can_rx.c would not compile because PF_CAN was not defined. I had to copy the definition of PF_CAN from the 0.2 version of can_rx.c (which is the only difference) to make it compile.
    It seems that the new version expects this to be defined in <linux/socket.h> (/usr/local/arm/4.3.1-eabi-armv6/usr/include/linux/socket.h) but this is conditional on

    Code
    #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)

    which evaluates to false with this toolchain.
    I am a little confused. Can you help? Thanks.
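
For readers who hit the same compile error, here is a minimal sketch of the workaround described in this thread: supply the missing constants only when the toolchain headers do not. The value 29 is the one Linux uses for AF_CAN, and 1 is the value of CAN_RAW in <linux/can.h>. The helper name `open_can_raw_socket` is hypothetical and is not part of can_rx.c.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>

/* Fallbacks for toolchains whose headers hide the CAN address family. */
#ifndef AF_CAN
#define AF_CAN 29            /* value Linux assigns to the CAN address family */
#endif
#ifndef PF_CAN
#define PF_CAN AF_CAN        /* protocol family mirrors the address family */
#endif
#ifndef CAN_RAW
#define CAN_RAW 1            /* normally provided by <linux/can.h> */
#endif

/* Hypothetical helper: open a raw SocketCAN socket.
 * Returns the descriptor, or -1 after printing the reason for the failure. */
int open_can_raw_socket(void)
{
    int fd;

    /* Sanity check: both names must refer to the same family number. */
    assert(PF_CAN == AF_CAN);

    fd = socket(PF_CAN, SOCK_RAW, CAN_RAW);
    if (fd < 0)
        perror("socket(PF_CAN, SOCK_RAW, CAN_RAW)");
    return fd;
}
```

Because every definition is guarded by #ifndef, the fallback stays out of the way once a later toolchain exposes the real constants. If the socket() call still fails, the perror() output distinguishes a missing protocol module from other causes.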

  • I think there are some problems with the CAN on 1.0.
    Now I can open a socket with the can_rx test program (hacked to define PF_CAN as 29, otherwise it doesn't compile) and receive data. I changed the filter mask back to 0, as in version 0.2, and made it loop forever reading the socket.
    The problem is that after a while, CAN just locks up. No more packets are received (or can be transmitted).
    ifconfig shows no more packets received, but:

    Code
    ip link set can0 down
    ip link set can0 up txqueuelen 1000 type can bitrate 100001


    gets it working again, at least for a while.
    Sometimes it works for a few hundred packets, sometimes just a couple.
    With version 0.2, I found that setting the baudrate to 100000 didn't work at all - I had to use 100001 (or 99999, 100002...) then it worked fine. It would appear that the MCP2515 is configured incorrectly at 100000bps.
    This issue remains with the new version but is not a problem for me. The interface just stopping, however, certainly is...
    Has anybody had the CAN reliably receiving packets with this version of the driver?
    From the socketcan mailing lists, it looks like the MCP251x driver is a bit of a work in progress and has been updated since 2.6.33.
    Before I go looking at the driver source code, has anyone got any ideas ?

  • Hi


    As an experiment, I have modified the drivers/net/can/mcp251x.c
    file to use bit settings closer to the ones we use in our product
    (i.e. SAM == 1 and a slightly larger phase2 segment).


    Am I right in assuming the clock on the MCP2515 on the picomod6 is 10 MHz?



    My experimental mod behaves in the same way as the one in the build.
    It works for 40 to 200 messages and then seems to hang.
    Other devices on this CAN bus keep working
    without interruption.


    This makes me think there may be some kind of logic or buffer problem in the
    net/can code.


    // MY MODS

    static int mcp251x_do_set_bittiming(struct net_device *net)
    {
        struct mcp251x_priv *priv = netdev_priv(net);
        struct can_bittiming *bt = &priv->can.bittiming;
        struct spi_device *spi = priv->spi;

        /* Original code, deriving the registers from the computed bit timing: */
        //mcp251x_write_reg(spi, CNF1, ((bt->sjw - 1) << CNF1_SJW_SHIFT) |
        //                  (bt->brp - 1));
        //mcp251x_write_reg(spi, CNF2, CNF2_BTLMODE |
        //                  (priv->can.ctrlmode & CAN_CTRLMODE_3_SAMPLES ?
        //                   CNF2_SAM : 0) |
        //                  ((bt->phase_seg1 - 1) << CNF2_PS1_SHIFT) |
        //                  (bt->prop_seg - 1));
        //mcp251x_write_bits(spi, CNF3, CNF3_PHSEG2_MASK,
        //                   (bt->phase_seg2 - 1));

        /* Experimental fixed values replacing the computed ones: */
        mcp251x_write_reg(spi, CNF1, 0x04); // RPC 22OCT2010 FORCE 100 kbps
        mcp251x_write_reg(spi, CNF2, 0xED); // with can bit style as ETC6000
        mcp251x_write_reg(spi, CNF3, 0x06); // unit. Assume 10MHz Clock on picomod

        /* Read the registers back so the log shows what the chip actually holds. */
        dev_info(&spi->dev, "CNF: 0x%02x 0x%02x 0x%02x\n",
                 mcp251x_read_reg(spi, CNF1),
                 mcp251x_read_reg(spi, CNF2),
                 mcp251x_read_reg(spi, CNF3));

        return 0;
    }
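
As a cross-check on hard-coded CNF values like the ones above, the sketch below decodes the three registers back into a nominal bit rate. It uses the MCP2515 datasheet relations as I understand them (TQ = 2 * (BRP + 1) / Fosc, each segment field stores its length minus one, the sync segment is always 1 TQ). The function name `mcp2515_bitrate` is hypothetical, and the oscillator frequency is an input, since the thread does not settle what the picomod actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Decode MCP2515 CNF1..CNF3 into a nominal bit rate in bit/s.
 * fosc_hz is the oscillator frequency and must be supplied by the caller. */
uint32_t mcp2515_bitrate(uint8_t cnf1, uint8_t cnf2, uint8_t cnf3,
                         uint32_t fosc_hz)
{
    uint32_t brp    = cnf1 & 0x3F;         /* CNF1[5:0]: baud rate prescaler */
    uint32_t prseg  = cnf2 & 0x07;         /* CNF2[2:0]: propagation segment */
    uint32_t phseg1 = (cnf2 >> 3) & 0x07;  /* CNF2[5:3]: phase segment 1     */
    uint32_t phseg2 = cnf3 & 0x07;         /* CNF3[2:0]: phase segment 2     */

    /* Sync segment (1 TQ) plus the three programmable segments. */
    uint32_t tq_per_bit = 1 + (prseg + 1) + (phseg1 + 1) + (phseg2 + 1);

    assert(fosc_hz > 0);

    /* One TQ lasts 2 * (BRP + 1) oscillator cycles. */
    return fosc_hz / (2 * (brp + 1) * tq_per_bit);
}
```

With CNF1 = 0x04, CNF2 = 0xED and CNF3 = 0x06 this gives 20 TQ per bit, which works out to 50000 bit/s for a 10 MHz oscillator and 100000 bit/s for a 20 MHz one. It may therefore be worth confirming the oscillator frequency before relying on these values.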

    I find the behavior of the CAN interface strange.
    If I deliberately set the baud rate to an incorrect value, I still get some messages.


    The CAN bus is designed to check a CRC-15 on the messages, and if this does not match
    up there is no valid message available in the MCP2515.


    This driver seems to happily return rubbish values and then simply lock up.
    This is incorrect behaviour.


    Does this driver look at the rxd bits in the 2515 or does it simply read
    whatever happens to be in the buffers regardless ?

    Looking into the CAN lock-up, I have modified mcp251x.ko to print to the console when mcp251x_can_isr() is called.
    This happens for a while and then stops, as expected. Looking at the interrupt output pin on the 2515 itself, the pin gets stuck in the low state, meaning that it is waiting to be serviced.
    From here, I will keep looking into it, but any help is appreciated as we are very close to a deadline and may have to go to production with version 0.2 if I can't make the CAN work reliably soon.
    Darren.

  • The bug in the can driver is related to the MCP2515 interrupt.
    When the can locks up, the INT pin on the 2515 is stuck low meaning that the interrupt has not been serviced. Since the interrupt is falling edge triggered, briefly shorting this pin to Vcc then letting it drop again triggers the picomod interrupt and the CAN comes back to life (at least for a while).
    Looking at mcp251x.c, the ISR seems to just queue the reading of the 2515. When (if...) the queue is processed it then clears the INTF flags which should cause the INT pin to come high again.
    I am not sure reading the MCP2515 data later is such a good idea. Since the 2515 has such a small amount of rx buffering, it should be read immediately in my view. The latest version from kernel.org appears to do this. In any case, the driver as supplied has a problem where occasionally the 2515 does not get read. I am not sure where the problem actually is though.


    Looking at version 0.2 (which worked fine for me), the interrupt is level triggered, not edge triggered so it can't get stuck like this. The IRQ then disables further interrupts until the 2515 has been serviced, at which point they are enabled again. I have modified version 1.0 to do this too and the lock-up problem has gone. If anybody is interested, I can post a patch but I still suspect I am getting some missed frames and I am not sure why at the moment.


    I did look at back porting a later version of the driver but it's not something I have done before so I don't imagine it will be a 5 minute job...

  • Quote from "djlegge"

    When the can locks up, the INT pin on the 2515 is stuck low meaning that the interrupt has not been serviced. Since the interrupt is falling edge triggered, briefly shorting this pin to Vcc then letting it drop again triggers the picomod interrupt and the CAN comes back to life (at least for a while).


    You are absolutely right. I never understood why this should be an edge-triggered interrupt. The interrupt is serviced as long as there are sources for the interrupt. But if the ISR does not handle all sources (for example, because a new source comes up immediately after the ISR has checked all sources but before the IRQ flag of the GPIO pin is cleared), the IRQ pin stays low and the interrupt is not triggered again. This is a race condition that will happen quite often in the real world. If the interrupt is configured as low-level triggered, this is no longer a problem because the interrupt is triggered again immediately. That is why I changed the edge-triggered interrupt to a low-level-triggered interrupt in V0.2, which resulted in much better behavior. However, I forgot to apply this change to the new driver in V1.0. Sorry for this. So thanks, you have found and solved a major problem.
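
The race described above can be illustrated with a toy model. This is not driver code: `simulate_irq`, the `trigger` enum and the "late frame" parameter are all hypothetical names made up for this sketch, and the chip's limited buffering is ignored. One frame is made to arrive after the handler has sampled the pending sources but before it returns.

```c
#include <assert.h>
#include <stdbool.h>

enum trigger { TRIGGER_FALLING_EDGE, TRIGGER_LOW_LEVEL };

/* Toy model of the INT line. total_frames frames arrive one by one; the frame
 * with index late_frame arrives while the handler for the previous frame is
 * still running. Pass -1 for late_frame to disable the race.
 * Returns how many frames the host ends up servicing. */
int simulate_irq(enum trigger mode, int total_frames, int late_frame)
{
    int pending = 0;   /* unserviced sources in the chip; INT is low while > 0 */
    int serviced = 0;

    assert(total_frames >= 0);

    for (int i = 0; i < total_frames; i++) {
        bool pin_was_low = pending > 0;

        pending++;     /* frame i raises an interrupt source */

        /* Edge mode needs a high-to-low transition; level mode only needs low. */
        if (mode == TRIGGER_FALLING_EDGE && pin_was_low)
            continue;  /* no new edge: the handler never runs again */

        int snapshot = pending;            /* handler samples the sources */
        if (i + 1 == late_frame && late_frame < total_frames) {
            pending++;                     /* next frame lands after the sample */
            i++;
        }
        pending -= snapshot;               /* handler clears what it saw */
        serviced += snapshot;

        /* Level mode: the pin is still low, so the handler runs again at once. */
        while (mode == TRIGGER_LOW_LEVEL && pending > 0) {
            serviced += pending;
            pending = 0;
        }
    }
    return serviced;
}
```

In this model, ten frames with the fourth-indexed frame arriving late leave the edge-triggered variant stuck after four frames, while the level-triggered variant services all ten. That matches the symptom reported earlier in the thread: INT stuck low and no further interrupts.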


    I have now applied the modification and it will appear in the next version.


    Quote

    Looking at mcp251x.c, the ISR seems to just queue the reading of the 2515. When (if...) the queue is processed it then clears the INTF flags which should cause the INT pin to come high again.
    I am not sure reading the MCP2515 data later is such a good idea. Since the 2515 has such a small amount of rx buffering, it should be read immediately in my view. The latest version from kernel.org appears to do this. In any case, the driver as supplied has a problem where occasionally the 2515 does not get read. I am not sure where the problem actually is though.


    Going via the standard SPI driver is a major bottleneck for CAN. We always planned to have an alternative driver for CAN that accesses SPI directly without using the generic SPI driver. However we haven't finished it yet as other items were more urgent.


    Quote

    but I still suspect I am getting some missed frames and I am not sure why at the moment.


    Do you mean you still miss frames with the modified V1.0 that were not missing in the original V0.2 version?


    Quote

    I did look at back porting a later version of the driver but it's not something I have done before so I don't imagine it will be a 5 minute job...


    That's exactly the problem. When I ported this CAN version back in May, I looked at all kernels up to the then-current 2.6.34-rc4 and also at the berlios sources, where Socket CAN is developed. 2.6.33 was the first version that had mcp2515 support included. I tried to port 2.6.34-rc4, but there were some other changes in the kernel that made porting rather difficult. The berlios source was designed to be compiled out of tree and because of this was also not well suited. So I decided to port 2.6.33.2, which was the newest stable version back then. This was already rather complicated due to the new way of setting the properties with the ip program instead of the /sys directory.


    Probably I'll have another look at the newer kernel releases in the future. Now that we have a (hopefully) working version from 2.6.33, maybe it's easier to port back from the newest version to "our" 2.6.33 than it was to port back to the old 2.6.28 version.


    Thanks again for locating this IRQ configuration problem.

    F&S Elektronik Systeme GmbH

  • I think for me version 1.0 is working at least as well as 0.2 after this fix. I thought I was missing frames but upon investigation that was a problem in our software, not the driver.
    If it is missing frames, it is very few and not causing a problem. I think we are getting all frames ok.
    We still have to set the baud rate to 100001 for it to work with our other can bus devices but since it does work this is not an issue for us.
    Thanks for your reply...

  • This patch :
    http://www.mail-archive.com/so….berlios.de/msg00530.html
    seems to be required. Without it, frames received with errors are processed as normal frames. The change is only one line changing

    Code
    frame->can_id = can_id;

    to

    Code
    frame->can_id |= can_id;

    when an error frame is detected.
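
To show why the one-character change matters, here is a small stand-alone sketch. It assumes (I have not verified this against the picomod sources) that the error frame's can_id already holds CAN_ERR_FLAG when the line in question runs. The flag values are the ones from the Linux CAN headers; `demo_frame` and the two function names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

#define CAN_ERR_FLAG 0x20000000U  /* marks a frame as an error frame */
#define CAN_ERR_CRTL 0x00000004U  /* error class: controller problems */

struct demo_frame { uint32_t can_id; };  /* stand-in for struct can_frame */

/* Before the patch: plain assignment overwrites the error flag. */
uint32_t error_id_with_assign(uint32_t error_class)
{
    struct demo_frame frame = { .can_id = CAN_ERR_FLAG };

    frame.can_id = error_class;
    return frame.can_id;
}

/* After the patch: OR-ing keeps the error flag and adds the class bits. */
uint32_t error_id_with_or(uint32_t error_class)
{
    struct demo_frame frame = { .can_id = CAN_ERR_FLAG };

    frame.can_id |= error_class;
    return frame.can_id;
}

/* Usage example: only the OR variant is still recognisable as an error frame. */
void error_id_self_check(void)
{
    assert((error_id_with_assign(CAN_ERR_CRTL) & CAN_ERR_FLAG) == 0);
    assert((error_id_with_or(CAN_ERR_CRTL) & CAN_ERR_FLAG) != 0);
}
```

Without the flag, a receiver sees can_id 0x004 and treats the frame as an ordinary data frame with that identifier, which is the behaviour reported here.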
    I have one known problem left: if a CAN message does not transmit, sometimes the CAN interface just goes off, and the only way to restart it is to reload the driver and bring it up again.
    This seems to happen often if the bus is not terminated very well, but it does not happen simply because no other unit acknowledges the message. Any ideas, anyone?

  • Well, in case it is useful to anyone else :

    Code
    ip link set can0 txqueuelen 100 type can bitrate 100000 restart-ms 500


    'restart-ms 500' makes the CAN bus come back up 500 ms after going to 'bus off'. If you do not have restart-ms in there, tx errors cause the CAN driver to be put to sleep permanently, as far as I can tell.

    I believe we have found another bug in mcp251x.c as provided with the picomod.


    In mcp251x_irq_work_handler() the can interrupt flag register is read into a local variable for checking. Then it is immediately cleared :

    Code
    intf = mcp251x_read_reg(spi, CANINTF);
    mcp251x_write_bits(spi, CANINTF, intf, 0x00);


    Next, some checks are made for error conditions etc. and some time later the flag variable is checked to see if there are received can messages to be read back from the 2515.

    Code
    if (intf & CANINTF_RX0IF)
        mcp251x_hw_rx(spi, 0);
    if (intf & CANINTF_RX1IF)
        mcp251x_hw_rx(spi, 1);


    The trouble with this is that clearing CANINTF tells the MCP2515 that the two receive buffers are now free for new messages to be stored in. We are doing this before we have actually read the messages out. What seems to happen is that occasionally (a few times a day but depends on message rate), a newly received message gets moved into one of the receive buffers during the time that we are reading the old message out. This results in message corruption (typically, the ID for one message is returned with the data field of another).


    Changing the code so that CANINTF is cleared AFTER mcp251x_hw_rx() is called appears to fix this problem for us.
    The change is pretty simple, just move the call to mcp251x_write_bits() above to just after the buffer checking code below. We also clear the CANINTF flag before going to sleep in CAN_STATE_BUS_OFF state in the same function (there is a 'return;' there) because that is what the original code did. It is probably not necessary though.
    I hope this might be useful to somebody.
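
The ordering problem described in the last post can be reproduced in a toy model. This is a sketch and not the driver: the `toy_` types and functions are hypothetical, a "frame" is reduced to an ID and one data byte, and the host read is split into two steps so that a new frame can arrive in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_frame { uint32_t id; uint8_t data; };

struct toy_rx_buffer {
    struct toy_frame frame;
    bool rx_flag;            /* plays the role of RXnIF: set means "in use" */
};

/* Chip side: a new frame is stored only if the buffer is marked free. */
bool toy_chip_receive(struct toy_rx_buffer *buf, struct toy_frame incoming)
{
    if (buf->rx_flag)
        return false;        /* buffer busy: the frame goes elsewhere or is lost */
    buf->frame = incoming;
    buf->rx_flag = true;
    return true;
}

/* Host side: read ID and data in two steps while `arriving` shows up between
 * them. clear_flag_first selects the order used by the driver as supplied. */
struct toy_frame toy_host_read(struct toy_rx_buffer *buf,
                               struct toy_frame arriving,
                               bool clear_flag_first)
{
    struct toy_frame out;

    assert(buf->rx_flag);            /* only called when a frame is waiting */

    if (clear_flag_first)
        buf->rx_flag = false;        /* buffer released before it is read */

    out.id = buf->frame.id;          /* first part of the read */
    toy_chip_receive(buf, arriving); /* new frame arrives mid-read */
    out.data = buf->frame.data;      /* second part of the read */

    if (!clear_flag_first)
        buf->rx_flag = false;        /* buffer released only after the read */
    return out;
}
```

Starting from a buffer holding ID 0x100 with data 0xAA, and a frame with ID 0x200 and data 0xBB arriving mid-read, clearing first returns ID 0x100 with data 0xBB, while clearing afterwards returns the original frame intact. That is the same kind of mismatch (ID of one message, data of another) reported above.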