1.0 CAN problems

djlegge · Oct 20th 2010

Hi,
I am trying to get the CAN bus working with the new version. We had it working with version 0.2. This is what I have done.

Code

copied buildroot-2010.08/target/device/f+s/addons/lib/modules/2.6.28.6/modules.dep to /lib/modules/2.6.28 on picomod
modprobe spi_s3c
modprobe can
modprobe mcp251x
ip link set can0 type can bitrate 100000
ip link set can0 up

ifconfig shows that the can interface is up and receiving packets, however when trying to open a socket using the can_rx example program, I get

Code

# ./can_rx can0
Error during creating socket

In fact, can_rx.c would not compile because PF_CAN was not defined. I had to copy the definition of PF_CAN from the 0.2 version of can_rx.c (which is the only difference) to make it compile.
It seems that the new version expects this to be defined in <linux/socket.h> (/usr/local/arm/4.3.1-eabi-armv6/usr/include/linux/socket.h) but this is conditional on

Code

#if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)

which evaluates to false with this toolchain.
I am a little confused - can you help ? Thanks.

djlegge · Oct 21st 2010

It seems the following is missing from the readme.txt :

Code

modprobe can-raw

Then you can open the socket.

djlegge · Oct 21st 2010

I think there are some problems with the CAN on 1.0.
Now I can open a socket with the can_rx test program (hacked to define PF_CAN as 29 otherwise it doesn't compile), and receive data - I changed the filter mask back to 0 like in version 0.2 and made it loop forever reading the socket.
The problem is that after a while, CAN just locks up. No more packets are received (or can be transmitted).
ifconfig shows no more packets received but :

Code

ip link set can0 down
ip link set can0 up txqueuelen 1000 type can bitrate 100001

gets it working again, at least for a while.
Sometimes it works for a few hundred packets, sometimes just a couple.
With version 0.2, I found that setting the baudrate to 100000 didn't work at all - I had to use 100001 (or 99999, 100002...) then it worked fine. It would appear that the MCP2515 is configured incorrectly at 100000bps.
This issue remains with the new version but is not a problem for me. The interface just stopping, however, certainly is...
Has anybody had the CAN reliably receiving packets with this version of the driver ?
From the socketcan mailing lists, it looks like the MCP251x driver is a bit of a work in progress and has been updated since 2.6.33.
Before I go looking at the driver source code, has anyone got any ideas ?

robin48gx · Oct 22nd 2010

Hi

I have modified, as an experiment, the drivers/net/can/mcp251x.c
file to use bit settings closer to the ones we use in our product
(i.e. SAM == 1 and a slightly larger phase2 segment).

Am I right in assuming the clock on the MCP2515 on the picomod6 is 10MHz ??

My experimental mod behaves in the same way as the one in the build.
It works for 40 to 200 messages and then seems to hang.
Other devices on this CAN bus are still working
without interruption.

This makes me think there may be some kind of logic or buffer problem in the
net/can code

// MY MODS

static int mcp251x_do_set_bittiming(struct net_device *net)
{
    struct mcp251x_priv *priv = netdev_priv(net);
    struct can_bittiming *bt = &priv->can.bittiming;
    struct spi_device *spi = priv->spi;

    /* Original code, which derives CNF1..CNF3 from the computed bit timing: */
    //mcp251x_write_reg(spi, CNF1, ((bt->sjw - 1) << CNF1_SJW_SHIFT) |
    //                  (bt->brp - 1));
    //mcp251x_write_reg(spi, CNF2, CNF2_BTLMODE |
    //                  (priv->can.ctrlmode & CAN_CTRLMODE_3_SAMPLES ?
    //                   CNF2_SAM : 0) |
    //                  ((bt->phase_seg1 - 1) << CNF2_PS1_SHIFT) |
    //                  (bt->prop_seg - 1));
    //mcp251x_write_bits(spi, CNF3, CNF3_PHSEG2_MASK,
    //                   (bt->phase_seg2 - 1));
    //
    /* Experimental fixed values; bt is unused while these are in place. */
    mcp251x_write_reg(spi, CNF1, 0x04); // RPC 22OCT2010 FORCE 100 kbps
    mcp251x_write_reg(spi, CNF2, 0xED); // with can bit style as ETC6000
    mcp251x_write_reg(spi, CNF3, 0x06); // unit. Assume 10MHz Clock on picomod

    /* Read the registers back to confirm what was actually written. */
    dev_info(&spi->dev, "CNF: 0x%02x 0x%02x 0x%02x\n",
             mcp251x_read_reg(spi, CNF1),
             mcp251x_read_reg(spi, CNF2),
             mcp251x_read_reg(spi, CNF3));

    return 0;
}

robin48gx · Oct 22nd 2010

http://picasaweb.google.co.uk/…Album#5530815822531984162

OK, the oscillator feeding the MCP2515 is 20 MHz.
Each sine wave period is 50 ns, so 1/50e-9 = 20,000,000 Hz.

Tosc == 50 ns

To get a TQ of 500 ns I need to divide the clock by ten.

So putting 9 in CNF1 will divide by ten and give me an SJW of 1.

Now I just need 20 TQ to build my CAN bit, which would make each bit 10 us long, i.e. 100 kbps.
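As a cross-check on this arithmetic, here is a small sketch that works out the bit rate implied by a set of CNF values. It assumes the relation TQ = 2 x (BRP + 1) / Fosc and the CNF bit layout as I remember them from the MCP2515 datasheet, with BTLMODE set so that PHSEG2 comes from CNF3. `mcp2515_bitrate()` is a hypothetical helper, not part of the driver.

```c
#include <stdint.h>

/* Bit rate implied by the three MCP2515 CNF registers.
 * Assumes BTLMODE = 1, so the PHSEG2 length is taken from CNF3. */
static uint32_t mcp2515_bitrate(uint32_t fosc_hz, uint8_t cnf1,
                                uint8_t cnf2, uint8_t cnf3)
{
    uint32_t brp    = cnf1 & 0x3F;               /* CNF1[5:0] */
    uint32_t prseg  = (cnf2 & 0x07) + 1;         /* CNF2[2:0] */
    uint32_t phseg1 = ((cnf2 >> 3) & 0x07) + 1;  /* CNF2[5:3] */
    uint32_t phseg2 = (cnf3 & 0x07) + 1;         /* CNF3[2:0] */

    /* One bit = sync segment (always 1 TQ) + the three programmed segments. */
    uint32_t tq_per_bit = 1 + prseg + phseg1 + phseg2;

    /* TQ = 2 * (BRP + 1) / Fosc, so bit rate = Fosc / (2 * (BRP + 1) * TQs). */
    return fosc_hz / (2 * (brp + 1) * tq_per_bit);
}
```

Under those assumptions, the values 0x04/0xED/0x06 from the earlier mod give 20 TQ per bit and 100000 bit/s on a 20 MHz oscillator (50000 on 10 MHz). Because of the factor of two in the prescaler, 9 in CNF1 would come out at 50000 bit/s on 20 MHz, so the divide-by-ten may need BRP = 4 rather than 9. Worth checking against the datasheet before relying on it.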

robin48gx · Oct 22nd 2010

I find the behavior of the can interface strange.
If I deliberately set the baud rate to an incorrect value I still get some messages.

The canbus is designed to check a CRC-15 on the messages and if this does not match
up there is no valid message available in the MCP2515.

This driver seems to happily return rubbish values and then simply lock up.
This is incorrect behaviour.

Does this driver look at the rxd bits in the 2515 or does it simply read
whatever happens to be in the buffers regardless ?

djlegge · Oct 25th 2010

Looking into the can lock-up, I have modified mcp251x.ko to print to the console when mcp251x_can_isr() is called.
This happens for a while then stops, as expected. Looking at the interrupt output pin on the 2515 itself, the pin gets stuck in the low state, meaning that it is waiting to be serviced.
From here, I will keep looking into it but any help is appreciated as we are very close to a deadline and may have to go to production with version 0.2 if I can't make the can work reliably soon.
Darren.

djlegge · Oct 26th 2010

The bug in the can driver is related to the MCP2515 interrupt.
When the can locks up, the INT pin on the 2515 is stuck low meaning that the interrupt has not been serviced. Since the interrupt is falling edge triggered, briefly shorting this pin to Vcc then letting it drop again triggers the picomod interrupt and the CAN comes back to life (at least for a while).
Looking at mcp251x.c, the ISR seems to just queue the reading of the 2515. When (if...) the queue is processed it then clears the INTF flags which should cause the INT pin to come high again.
I am not sure reading the MCP2515 data later is such a good idea. Since the 2515 has such a small amount of rx buffering, it should be read immediately in my view. The latest version from kernel.org appears to do this. In any case, the driver as supplied has a problem where occasionally the 2515 does not get read. I am not sure where the problem actually is though.

Looking at version 0.2 (which worked fine for me), the interrupt is level triggered, not edge triggered so it can't get stuck like this. The IRQ then disables further interrupts until the 2515 has been serviced, at which point they are enabled again. I have modified version 1.0 to do this too and the lock-up problem has gone. If anybody is interested, I can post a patch but I still suspect I am getting some missed frames and I am not sure why at the moment.

I did look at back porting a later version of the driver but it's not something I have done before so I don't imagine it will be a 5 minute job...

fs-support_HK · Oct 26th 2010

Quote from "djlegge"

When the can locks up, the INT pin on the 2515 is stuck low meaning that the interrupt has not been serviced. Since the interrupt is falling edge triggered, briefly shorting this pin to Vcc then letting it drop again triggers the picomod interrupt and the CAN comes back to life (at least for a while).

You are fully right. I never understood why this should be an edge-triggered interrupt. The interrupt is serviced as long as there are sources for the interrupt. But if the ISR does not handle all sources (for example because a new source comes up immediately after the ISR has checked all sources but before the IRQ flag of the GPIO pin is cleared), then the IRQ pin stays low and the interrupt is not triggered again. This is a race condition which will happen quite often in the real world. But if configured as a low-level interrupt, this no longer poses a problem, as the interrupt is triggered again immediately. Therefore I changed the edge-triggered interrupt to a low-level-triggered interrupt in V0.2, which resulted in much better behavior. However, I forgot to apply this change to the new driver in V1.0. Sorry for this. So thanks, you have found and solved a major problem.
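The race can be reproduced with a toy model in plain C. This is only a simulation sketch, not driver code: `struct irq_sim`, `update_line()` and `frames_serviced()` are invented for illustration, and the only assumption taken from the discussion is that the INT pin stays low while any source flag is set.

```c
#include <stdbool.h>

#define RX0IF 0x01
#define RX1IF 0x02

struct irq_sim {
    unsigned flags;        /* pending sources inside the CAN controller */
    bool line_low;         /* state of the active-low INT pin */
    bool irq_pending;      /* what the CPU interrupt logic has latched */
    bool level_triggered;  /* false = falling edge, true = low level */
};

/* Recompute the INT pin and latch an interrupt according to the mode. */
static void update_line(struct irq_sim *s)
{
    bool low = (s->flags != 0);

    /* Level mode fires whenever the pin is low, edge mode only on high->low. */
    if (low && (s->level_triggered || !s->line_low))
        s->irq_pending = true;
    s->line_low = low;
}

/* Run the race once and count how many of the two frames get serviced. */
static int frames_serviced(bool level_triggered)
{
    struct irq_sim s = { 0, false, false, level_triggered };
    bool race_done = false;
    int serviced = 0;

    s.flags |= RX0IF;                /* first frame arrives */
    update_line(&s);

    while (s.irq_pending) {
        unsigned snapshot;

        s.irq_pending = false;
        snapshot = s.flags;          /* handler reads the sources */
        if (!race_done) {            /* second frame lands right after the read */
            s.flags |= RX1IF;
            update_line(&s);
            race_done = true;
        }
        if (snapshot & RX0IF)
            serviced++;
        if (snapshot & RX1IF)
            serviced++;
        s.flags &= ~snapshot;        /* only the sources that were seen are cleared */
        update_line(&s);
    }
    return serviced;
}
```

In this model `frames_serviced(false)` returns 1 and leaves the pin stuck low with RX1IF still set, while `frames_serviced(true)` returns 2 because the low level retriggers the handler.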

I have now applied the modification and it will appear in the next version.

Quote

Looking at mcp251x.c, the ISR seems to just queue the reading of the 2515. When (if...) the queue is processed it then clears the INTF flags which should cause the INT pin to come high again.
I am not sure reading the MCP2515 data later is such a good idea. Since the 2515 has such a small amount of rx buffering, it should be read immediately in my view. The latest version from kernel.org appears to do this. In any case, the driver as supplied has a problem where occasionally the 2515 does not get read. I am not sure where the problem actually is though.

Going via the standard SPI driver is a major bottleneck for CAN. We always planned to have an alternative driver for CAN that accesses SPI directly without using the generic SPI driver. However we haven't finished it yet as other items were more urgent.

Quote

but I still suspect I am getting some missed frames and I am not sure why at the moment.

Do you mean you still miss frames with the modified V1.0 that were not missing in the original V0.2 version?

Quote

I did look at back porting a later version of the driver but it's not something I have done before so I don't imagine it will be a 5 minute job...

That's exactly the problem. When I ported this CAN version back in May, I looked at all kernels up to the then current 2.6.34rc4, and also at the berlios sources, where Socket CAN is developed. 2.6.33 was the first version that had mcp2515 support included. I tried to port 2.6.34rc4, but there were some other changes in the kernel that made porting rather difficult. The berlios source was designed to be compiled out of tree and because of this was also not well suited. So I decided to port 2.6.33.2, which was the newest stable version back then. This was already rather complicated due to the new way of setting the properties with the ip program instead of the /sys directory.

Probably I'll have another look at the newer kernel releases in the future. Now that we have a (hopefully) working version from 2.6.33, maybe it's easier to port back from the newest version to "our" 2.6.33 than it was to port back to the old 2.6.28 version.

Thanks again for locating this IRQ configuration problem.

djlegge · Oct 26th 2010

I think for me version 1.0 is working at least as well as 0.2 after this fix. I thought I was missing frames but upon investigation that was a problem in our software, not the driver.
If it is missing frames, it is very few and not causing a problem. I think we are getting all frames ok.
We still have to set the baud rate to 100001 for it to work with our other can bus devices but since it does work this is not an issue for us.
Thanks for your reply...

djlegge · Dec 16th 2010

This patch :
http://www.mail-archive.com/so….berlios.de/msg00530.html
seems to be required. Without it, frames received with errors are processed as normal frames. The change is only one line changing

Code

frame->can_id = can_id;

to

Code

frame->can_id |= can_id;

when an error frame is detected.
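To see why one character matters, here is a reduced sketch of the idea. It is not the driver code: it assumes the frame has already been tagged with CAN_ERR_FLAG before the error class bits collected in a local can_id are stored, and `store_error_class()` is a made-up function for illustration.

```c
#include <linux/can.h>
#include <linux/can/error.h>

/* Hypothetical reduction of the error path: tag the frame as an error
 * frame, then store the error class bits collected in can_id. */
static canid_t store_error_class(canid_t can_id, int patched)
{
    struct can_frame frame = { 0 };

    frame.can_id = CAN_ERR_FLAG;      /* mark as error frame */
    if (patched)
        frame.can_id |= can_id;       /* keeps CAN_ERR_FLAG set */
    else
        frame.can_id = can_id;        /* overwrites CAN_ERR_FLAG */
    return frame.can_id;
}
```

With the plain assignment, a controller error (CAN_ERR_CRTL) ends up as an ordinary frame with ID 0x004, which an application cannot tell apart from real traffic. With `|=` the result still carries CAN_ERR_FLAG, so the socket layer can treat it as an error frame.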
I have one known problem left, that is that if a can message does not transmit, sometimes the can interface just goes off and the only way to restart it is to reload the driver and bring it up again.
This seems to happen often if the bus is not terminated very well but does not happen simply if no other unit acknowledges the message. Any ideas anyone ?

djlegge · Dec 17th 2010

Well, in case it is useful to anyone else :

Code

ip link set can0 txqueuelen 100 type can bitrate 100000 restart-ms 500

'restart-ms 500' makes the CAN bus come back up 500ms after going to 'bus off'. If you do not have restart-ms in there, tx errors cause the CAN driver to be put to sleep permanently as far as I can tell.

djlegge · Feb 4th 2011

I believe we have found another bug in mcp251x.c as provided with the picomod.

In mcp251x_irq_work_handler() the can interrupt flag register is read into a local variable for checking. Then it is immediately cleared :

Code

intf = mcp251x_read_reg(spi, CANINTF);
mcp251x_write_bits(spi, CANINTF, intf, 0x00);

Next, some checks are made for error conditions etc. and some time later the flag variable is checked to see if there are received can messages to be read back from the 2515.

Code

if (intf & CANINTF_RX0IF)
mcp251x_hw_rx(spi, 0);
if (intf & CANINTF_RX1IF)
mcp251x_hw_rx(spi, 1);

The trouble with this is that clearing CANINTF tells the MCP2515 that the two receive buffers are now free for new messages to be stored in. We are doing this before we have actually read the messages out. What seems to happen is that occasionally (a few times a day but depends on message rate), a newly received message gets moved into one of the receive buffers during the time that we are reading the old message out. This results in message corruption (typically, the ID for one message is returned with the data field of another).

Changing the code so that CANINTF is cleared AFTER mcp251x_hw_rx() is called appears to fix this problem for us.
The change is pretty simple, just move the call to mcp251x_write_bits() above to just after the buffer checking code below. We also clear the CANINTF flag before going to sleep in CAN_STATE_BUS_OFF state in the same function (there is a 'return;' there) because that is what the original code did. It is probably not necessary though.
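The effect of the ordering can be shown with a toy model of a single receive buffer. This is a simulation sketch only: `struct mock_2515` and the functions below are invented and do not reflect the real register map. In particular, the mock simply drops a frame that arrives while the buffer is still flagged as full, whereas the real chip would use the other buffer or report an overflow.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX0IF 0x01

/* Toy stand-in for one MCP2515 receive buffer; not the real register map. */
struct mock_2515 {
    uint8_t  canintf;
    uint32_t rx_id;
    uint8_t  rx_data;
};

struct rx_frame {
    uint32_t id;
    uint8_t  data;
};

/* Controller side: a new frame is stored only while RX0IF is clear. */
static bool mock_receive(struct mock_2515 *c, uint32_t id, uint8_t data)
{
    if (c->canintf & RX0IF)
        return false;                 /* buffer still owned by the host */
    c->rx_id = id;
    c->rx_data = data;
    c->canintf |= RX0IF;
    return true;
}

/* Host side: read one frame while a second frame arrives mid-read. */
static struct rx_frame service_rx(struct mock_2515 *c, bool clear_first)
{
    struct rx_frame f;
    uint8_t intf = c->canintf;

    if (clear_first)
        c->canintf &= (uint8_t)~intf; /* original order: buffer freed too early */
    f.id = c->rx_id;
    mock_receive(c, 0x222, 0xBB);     /* second frame shows up between the reads */
    f.data = c->rx_data;
    if (!clear_first)
        c->canintf &= (uint8_t)~intf; /* fixed order: free the buffer after reading */
    return f;
}

/* True if the ID and data read back belong to the same frame. */
static bool frame_is_consistent(bool clear_first)
{
    struct mock_2515 c = { 0, 0, 0 };
    struct rx_frame f;

    mock_receive(&c, 0x111, 0xAA);
    f = service_rx(&c, clear_first);
    return f.id == 0x111 && f.data == 0xAA;
}
```

In this model `frame_is_consistent(true)` is false (ID 0x111 paired with data 0xBB, the same kind of mix-up described above), while `frame_is_consistent(false)` is true.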
I hope this might be useful to somebody.
